
SSD-KD: The latest data-free distillation study from Tianyi Cloud & Tsinghua | CVPR'24

2024-09-19 11:19:27

Data-free knowledge distillation uses the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, thereby avoiding the privacy, security, and proprietary risks of real-world applications. Existing approaches in this area typically follow an inversion-and-distillation paradigm in which a generative adversarial network, trained on the fly under the guidance of the pre-trained teacher network, synthesizes a large-scale sample set for knowledge distillation. The paper revisits this common paradigm and shows, through the lens of "small-scale inverted data for knowledge distillation", that there is considerable room to improve overall training efficiency. Based on three empirical observations demonstrating the importance of balancing the class distribution of synthesized samples, in terms of both diversity and difficulty, during data inversion and distillation, the paper proposes Small-Scale Data-free Knowledge Distillation (SSD-KD). SSD-KD introduces a modulating function to balance the synthesized samples and a priority sampling function to select proper samples, aided by a dynamic replay buffer and a reinforcement-learning strategy. SSD-KD can perform distillation training with synthetic samples at a very small scale (e.g., 10\(\times\) smaller than the original training data), making overall training one to two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance, as demonstrated on popular image classification and semantic segmentation benchmarks.

Paper: Small-Scale Data-Free Knowledge Distillation

  • Paper address: /abs/2406.07876
  • Paper code: /OSVAI/SSD-KD

Introduction


  For computer vision applications on resource-constrained devices, the key question is how to learn portable neural networks while maintaining satisfactory prediction accuracy. Knowledge distillation (KD), which uses information from a pre-trained large teacher network to facilitate the training of a smaller target student network on the same training data, has become a mainstream solution. Traditional KD methods assume that the original training data is always available. In practice, however, the source dataset used to train the teacher network is often inaccessible due to privacy, security, proprietary, or data-size concerns. To address this limitation on training data, data-free knowledge distillation has recently attracted increasing attention.

  The basic idea of data-free knowledge distillation (D-KD) is to construct, from the pre-trained teacher network, synthetic samples for knowledge distillation that match the latent distribution of the original training data. Leading D-KD methods typically adopt an adversarial inversion-and-distillation paradigm:

  1. In the inversion process, the generator is trained by using the pre-trained teacher network as a discriminator.
  2. In the subsequent knowledge distillation process, the generator learned on the fly synthesizes pseudo-samples for training the student network.
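The two steps above can be sketched in a few lines of PyTorch. This is only an illustrative toy under strong assumptions: tiny linear networks stand in for the real pre-trained CNNs, and the auxiliary losses discussed later (BN regularization, task loss, modulation) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical tiny networks standing in for the teacher, student, and generator.
torch.manual_seed(0)
teacher = nn.Linear(8, 4)       # frozen; pre-trained in the real setting
student = nn.Linear(8, 4)
generator = nn.Linear(16, 8)    # maps latent noise z to a synthetic "sample" x

for p in teacher.parameters():  # the teacher acts only as a fixed discriminator
    p.requires_grad_(False)

g_opt = torch.optim.SGD(generator.parameters(), lr=0.1)
s_opt = torch.optim.SGD(student.parameters(), lr=0.1)

def kd_loss(x):
    # KL(teacher || student) between softmax outputs
    return F.kl_div(F.log_softmax(student(x), dim=1),
                    F.softmax(teacher(x), dim=1),
                    reduction="batchmean")

z = torch.randn(32, 16)

# 1) Inversion step: the generator is trained adversarially, i.e. it
#    *maximizes* the teacher-student discrepancy (hence the minus sign).
g_opt.zero_grad()
(-kd_loss(generator(z))).backward()
g_opt.step()

# 2) Distillation step: the student minimizes the same discrepancy on
#    the freshly synthesized pseudo-samples (detached from the generator).
s_opt.zero_grad()
loss = kd_loss(generator(z).detach())
loss.backward()
s_opt.step()
```

The min-max structure (generator maximizing, student minimizing the same KL term) is what makes the paradigm adversarial.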

  However, adversarial D-KD methods typically require generating a huge number of synthetic samples (relative to the size of the original training dataset) to ensure credible knowledge distillation. This places a heavy burden on training resources and inhibits their use in practical applications. In a recent work, the authors proposed an effective meta-learning strategy that searches for common features and uses them as an initialization prior to reduce the number of iterative steps required for the generator to converge. Although this enables faster data synthesis, a sufficiently large number of synthetic samples is still needed for effective knowledge distillation; the efficiency of the subsequent distillation process is ignored and becomes the major bottleneck in overall training efficiency. In short, no prior work has jointly improved the efficiency of the data inversion and knowledge distillation processes from the perspective of the overall training efficiency of D-KD.

  To bridge this critical gap, the paper proposes the first fully efficient D-KD method, Small-Scale Data-free Knowledge Distillation (SSD-KD), which improves the overall training efficiency of the adversarial inversion-and-distillation paradigm from a novel "small data scale" perspective. In this work, "data scale" refers to the total number of inverted samples used for knowledge distillation in one training epoch.

  The design of SSD-KD is largely inspired by three observations.

  First, the paper observes that when the data scales of the synthetic and original samples are both significantly reduced to the same level (e.g., 10% of the source training dataset size) on different teacher-student network pairs, the student network trained on the synthetic samples outperforms the corresponding network trained on the original samples, as shown in Figure 1. Note that the synthetic samples are generated under the guidance of a teacher network pre-trained on the full source dataset, so they naturally reflect the original data distribution from a different perspective. At sufficiently small data scales, this allows the synthetic samples to beat the original samples in fitting the underlying distribution of the whole source dataset. This illuminating observation suggests that constructing a small set of high-quality synthetic samples is a promising path to a fully efficient D-KD method. In principle, the paper argues that a high-quality small-scale synthetic dataset should strike a good balance in the class distribution, in terms of both the diversity and the difficulty of the synthetic samples.

  However, two further observations suggest that existing D-KD methods, both traditional and the most efficient designs, cannot balance these two aspects of the class distribution well at small data scales, as shown in Figure 2. Note that some D-KD methods already enhance the diversity of synthetic samples, but diversity in terms of sample difficulty has not been explored.

  Based on these observations and analyses, the paper presents SSD-KD, which introduces two interdependent modules to significantly improve the overall training efficiency of the mainstream adversarial inversion-and-distillation paradigm. The first module relies on a novel modulating function that defines a diversity-aware term and a difficulty-aware term to explicitly balance the class distribution of synthetic samples during both data synthesis and knowledge distillation. The second module defines a novel priority sampling function, implemented via a reinforcement-learning strategy, which further improves end-to-end training efficiency by selecting a small number of proper synthetic samples, from the candidates stored in a dynamic replay buffer, for knowledge distillation. Thanks to these two modules, SSD-KD has two appealing advantages. On the one hand, SSD-KD can perform distillation training with synthetic samples at a very small scale (10 times smaller than the original training data), making overall training one to two orders of magnitude faster than many mainstream D-KD methods while retaining competitive model performance. On the other hand, SSD-KD attains substantial gains in student accuracy when the scale of the synthetic data is relaxed to a relatively large number (still smaller than in existing D-KD methods) while maintaining overall training efficiency.

Method


Preliminaries: D-KD

  Let \({f_t(\cdot;\theta_t)}\) be a teacher model pre-trained on the dataset of the original task, which is now inaccessible. The goal of D-KD is to first construct a set of synthetic training samples \(x\) by inverting the data distribution information learned by the teacher model, and then train a target student model \({f_s(\cdot;\theta_s)}\) on these samples, forcing the student to mimic the teacher's function. Existing D-KD methods mostly use a generative adversarial network \(g(\cdot;\theta_g)\) to produce the synthetic training samples \(x=g(z;\theta_g)\), where \(z\) is a latent noise input; the generator is trained by using the teacher model as a discriminator.

  The optimization of D-KD contains a common distillation regularization term that minimizes the functional discrepancy between the teacher and student models, \(\mathcal{L}_{\text{KD}}({x})=D_\text{KD}(f_t(x; \theta_t)\| f_s(x; \theta_s))\), where the discrepancy is based on the KL divergence. A task-oriented regularization term \(\mathcal{L}_{\text{Task}}({x})\) is also included, e.g., the cross-entropy loss that uses the teacher model's predictions as ground-truth labels. In addition, since D-KD rests on the assumption that the pre-trained teacher captures the distribution of the source training data, recent D-KD methods introduce an extra loss term to regularize the statistics of the training data distribution, stored in the batch normalization (BN) parameters, during data inversion:

\[\begin{equation} \mathcal{L}_{\text{BN}}({x})= \sum_l \big\|\mu_l(x)-\mathbb{E}(\mu_l)\big\|_2+\big\| \sigma_l^2(x)-\mathbb{E}(\sigma_l^2)\big\|_2, \end{equation} \]

  where \(\mu_l(\cdot)\) and \(\sigma_l(\cdot)\) denote the batch mean and variance estimates of the \(l\)-th layer's feature maps, and \(\mathbb{E}(\cdot)\) is the expectation of a BN statistic, which can be approximated by the corresponding running mean or variance.
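A minimal sketch of how such a BN regularizer can be computed with forward hooks, assuming a PyTorch model containing `nn.BatchNorm2d` layers whose `running_mean`/`running_var` approximate \(\mathbb{E}(\mu_l)\) and \(\mathbb{E}(\sigma_l^2)\); the model and shapes below are illustrative, not the paper's networks.

```python
import torch
import torch.nn as nn

def bn_regularizer(model, x):
    """L2-match batch statistics of x's features against each BN layer's
    running statistics, summed over all BN layers (cf. the equation above)."""
    losses = []

    def hook(module, inputs, _output):
        feat = inputs[0]                          # (N, C, H, W) input to the BN layer
        mu = feat.mean(dim=(0, 2, 3))             # batch mean per channel
        var = feat.var(dim=(0, 2, 3), unbiased=False)  # batch variance per channel
        losses.append(torch.norm(mu - module.running_mean, 2)
                      + torch.norm(var - module.running_var, 2))

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    model(x)                                      # one forward pass fires all hooks
    for h in handles:
        h.remove()
    return sum(losses)

# Illustrative usage on a toy model in eval mode (running stats stay fixed).
model = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1), nn.BatchNorm2d(4))
model.eval()
loss_bn = bn_regularizer(model, torch.randn(8, 3, 8, 8))
```

Hooks keep the regularizer independent of the teacher's architecture, which is why this style of BN matching is common in data-inversion methods.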

  The effectiveness of D-KD heavily depends on the quality of the synthetic samples inverted from the pre-trained teacher model's knowledge. The current adversarial D-KD paradigm contains two processes, data inversion and knowledge distillation. From the standpoint of both efficiency and effectiveness, the data inversion process largely determines the optimization performance of the student model, while the training time cost of knowledge distillation is a significant limiting factor in the overall training efficiency of D-KD.

Our Design: SSD-KD

  SSD-KD focuses on improving the adversarial D-KD paradigm through the lens of "small-scale inverted data for knowledge distillation", using feedback from the pre-trained teacher model and the knowledge distillation process to guide the data inversion process, thereby significantly accelerating overall training. Following the notation above, the optimization objective of SSD-KD is defined as

\[\begin{align} \min\limits_{f_s} \max\limits_{g} \mathbb{E}_{x = \delta \circ g\left(z\right)}\big(\mathcal{L}_{\text{BN}}({x})+\mathcal{L}_{\text{KD}}({x})+\phi(x)\mathcal{L}_{\text{Task}}({x})\big), \label{eq:loss} \end{align} \]

  where:

  1. A diversity-aware modulating function \(\phi(x)\) assigns a different priority to each synthetic sample according to the class predicted by the teacher model.

  2. Under the constraint of the BN statistics, \(\phi(x)\) encourages the generator to explore synthetic samples that are as challenging as possible for the teacher model.

  3. Instead of selecting samples for knowledge distillation with a random sampling strategy, a re-weighting strategy controls the sampling process; the symbol \(\circ\) denotes the application of a priority-based sampling function \(\delta\).

  4. Each synthetic sample is not only prioritized by its modulating function \(\phi(x)\), but is also re-weighted in the sampling phase, which reuses the intermediate values computed for \(\phi(x)\).

  Although the D-KD pipeline allows training samples to be synthesized and used to train the student model on the same task, substantial data redundancy limits the training efficiency of existing D-KD methods. SSD-KD is a fully efficient D-KD method that uses very small-scale synthetic data while achieving performance competitive with existing D-KD methods.

  The procedure of SSD-KD is summarized in Algorithm 1. A comparison of the optimization processes of existing adversarial D-KD methods (both the traditional family and the more efficient family) with SSD-KD is shown in Figure 3.

Data Inversion with Distribution Balancing

  Figure 2 illustrates the data redundancy in D-KD caused by large imbalances in the synthetic data. The two subfigures on the right depict the distribution of classes predicted by the teacher model, showing a significant class imbalance, while the two subfigures on the left show the number of samples at different levels of prediction difficulty (measured by the teacher model's predicted probability). This suggests that generating samples based solely on the teacher-student discrepancy leads to a very uneven distribution of sample difficulty and tends to yield samples that are easy to predict. For a D-KD task trained entirely on synthetic samples, the data generation process must account for both the teacher-student discrepancy and the teacher's own pre-trained knowledge; accordingly, SSD-KD proposes a data synthesis method that considers both diversity and difficulty.

  • Diversity-aware balancing

  The class imbalance problem is addressed first in the data inversion process. Specifically, a replay buffer \(\mathcal{B}\) is maintained that stores a certain number (denoted \(|\mathcal{B}|\)) of synthetic data samples. Each data sample \(x\) in \(\mathcal{B}\) is penalized by the total number of samples sharing its predicted class (as predicted by the teacher model). To this end, a diversity-aware balancing term is used, which encourages the generator to synthesize samples of uncommon classes.

  • Difficulty-aware balancing

  Drawing on the use of focal loss for highly imbalanced samples in the field of object detection, a difficulty-aware balancing term based on the predicted probability \(p_T(x)\) is further introduced for each sample \(x\). Here, difficult synthetic samples are those on which the teacher model has low prediction confidence, and the difficulty-aware balancing term encourages such samples.

  In summary, the paper introduces a modulating function \(\phi(x)\) to adjust the optimization of the generator based on predictive feedback from the pre-trained teacher model. \(\phi(x)\) aims to balance the class distribution and to dynamically distinguish easy from difficult synthetic samples, so that easy samples do not overly dominate the distillation process. Formally, for a synthetic data sample \(x\in\mathcal{B}\), the modulating function \(\phi(x)\) is formulated as

\[\begin{equation} \phi (x) = \underbrace {\Big(1 - \frac{1}{|\mathcal{B}|}\sum_{x'\in\mathcal{B}}\mathbb{I}_{c_T(x')=c_T(x)}\Big)}_{\text{diversity-aware balancing}} \underbrace{{\Big(1 - p_T(x)\Big)}^\gamma }_{\text{difficulty-aware balancing}} \label{eq:phi} \end{equation} \]

  where \(c_T(x)\) and \(p_T(x)\) denote the class index and the probability (or confidence) predicted by the pre-trained teacher model, respectively; \(\mathbb{I}_{c_T(x')=c_T(x)}\) is an indicator function that equals \(1\) if the predicted class of \(x'\) is the same as that of \(x\) and \(0\) otherwise; and \(\gamma\) is a hyperparameter.
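The formula can be evaluated directly from buffer-level teacher predictions. A minimal NumPy sketch, where `classes` and `confidences` stand for \(c_T(x)\) and \(p_T(x)\) over \(\mathcal{B}\) (the values below are hypothetical):

```python
import numpy as np

def modulation(classes, confidences, gamma=2.0):
    """phi(x) for every sample in the buffer:
    (1 - class frequency) * (1 - p_T(x))^gamma."""
    classes = np.asarray(classes)
    confidences = np.asarray(confidences, dtype=float)
    n = len(classes)
    # diversity-aware term: penalize samples whose predicted class is common in B
    counts = np.array([(classes == c).sum() for c in classes])
    diversity = 1.0 - counts / n
    # difficulty-aware term: focal-style weight, large for low-confidence samples
    difficulty = (1.0 - confidences) ** gamma
    return diversity * difficulty

# A buffer dominated by confident class-0 samples plus one hard class-1 sample:
phi = modulation(classes=[0, 0, 0, 1], confidences=[0.9, 0.9, 0.9, 0.5])
# the rare, low-confidence sample receives a much larger weight
```

Both terms lie in \([0,1]\), so \(\phi(x)\) only ever down-weights the task loss, never amplifies it.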

  Two properties of the modulating function \(\phi(x)\) are worth emphasizing:

  1. For data samples \(x\) on which the teacher model has high prediction confidence (i.e., easy samples), \(\phi(x)\) takes a low value, so the task-related loss \(\mathcal{L}_{\text{Task}}({x})\) in the optimization objective has a smaller effect.
  2. When the class distribution of the teacher-predicted synthetic samples in \(\mathcal{B}\) is severely imbalanced, samples \(x\) whose class is shared by many samples in \(\mathcal{B}\) are penalized, so that \(\phi(x)\) weakens the corresponding \(\mathcal{L}_{\text{Task}}({x})\).

  Although the formula indicates that the value of the modulating function \(\phi(x)\) is partly determined by the current replay buffer \(\mathcal{B}\), note that \(\mathcal{B}\) is dynamic and is itself affected by \(\phi(x)\): the term \(\phi(x)\mathcal{L}_{\text{Task}}({x})\) in the objective directly optimizes the generator that synthesizes the data samples composing \(\mathcal{B}\). In this sense, the interaction between \(\mathcal{B}\) and \(\phi(x)\) maintains a balance of class diversity during training. Balancing the classes of data samples is especially important during data inversion.

  As shown in Figure 2, with the help of these two balancing terms, SSD-KD generates moderate sample distributions in terms of both sample class and difficulty.

Distillation with Priority Sampling

  The original prioritized experience replay method replays important transitions more frequently, thereby improving learning efficiency. In contrast, the paper's priority sampling does not derive rewards from an environment; it is designed to fit data-free knowledge distillation and obtains feedback from the framework itself. In the data-free knowledge distillation setting, the priority sampling method focuses training on a small set of high-priority samples instead of sampling uniformly, thus speeding up the training process.

  To sample synthetic data \(x\) from the current replay buffer \(\mathcal{B}\), the paper proposes a sampling strategy called Priority Sampling (PS) that modulates the sampling probability instead of sampling uniformly. The basic function of PS is to measure the importance of each sample \(x\) in \(\mathcal{B}\), for which the priority sampling function \(\delta_{i}(x)\) is introduced:

\[\begin{equation} \delta_{i}(x) = w_{i-1}(x) KL(f_t(x;\theta_t)||f_s(x;\theta_s)), \end{equation} \]

  where \(KL\) denotes the KL divergence between the softmax outputs \(f_t(x;\theta_t)\) and \(f_s(x;\theta_s)\); \(\theta_t\) and \(\theta_s\) depend on the training step \(i\); and \(w_{i}(x)\) is a correction term used to normalize the samples in \(\mathcal{B}\), formalized in the equation below, with \(w_{-1}(x)=1\) when \(i=0\).

  Training knowledge distillation with stochastic updates relies on those updates following the same distribution as their expectation. Prioritized sampling introduces bias because it changes the data distribution, affecting the solution that the estimates converge to. This bias is corrected for each data sample \(x\) by introducing an importance sampling (IS) weight \(w_i(x)\):

\[\begin{equation} \label{eq_wi} w_{i}(x)=(N \cdot P_{i}(x))^{-\beta}, \end{equation} \]

  where \(\beta\) is a hyperparameter and \(P_{i}(x)\) is the sampling probability, defined as

\[\begin{equation} \label{pi} P_{i}(x)=\frac{\big(\left|\delta_{i}(x)\right|+\epsilon\big)^\alpha}{\sum_{x'\in\mathcal{B}}\big(\left|\delta_{i}(x')\right|+\epsilon\big)^\alpha}, \end{equation} \]

  where \(\epsilon\) is a small positive constant that prevents the extreme case of a sample never being selected once its priority is zero.
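The two equations above compose into a small pipeline: priorities \(\delta_i(x)\) become sampling probabilities \(P_i(x)\), which in turn yield IS weights \(w_i(x)\). A NumPy illustration with made-up \(\delta\) values; `alpha`, `beta`, and `eps` correspond to \(\alpha\), \(\beta\), and \(\epsilon\):

```python
import numpy as np

def sampling_probs(deltas, alpha=0.7, eps=1e-6):
    """P_i(x): normalized powered priorities, as in the equation above."""
    pri = (np.abs(np.asarray(deltas, dtype=float)) + eps) ** alpha
    return pri / pri.sum()

def is_weights(probs, beta=0.5):
    """w_i(x) = (N * P_i(x))^(-beta): corrects the bias of non-uniform sampling."""
    n = len(probs)
    return (n * np.asarray(probs, dtype=float)) ** (-beta)

deltas = [0.1, 0.4, 0.4, 0.1]   # hypothetical teacher-student KL gaps
P = sampling_probs(deltas)       # high-gap samples are drawn more often...
w = is_weights(P)                # ...but down-weighted to keep updates unbiased
```

Note the opposing directions: a larger \(\delta\) raises \(P_i(x)\) but lowers \(w_i(x)\), so frequent samples contribute smaller per-update steps.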

  The priority sampling function \(\delta(x)\) has two distinguishing characteristics:

  1. A larger \(\delta(x)\) reflects a greater information discrepancy between the teacher and student models on the corresponding synthetic sample in the current \(\mathcal{B}\). Optimizing the student model from the samples with the greatest information discrepancy therefore helps it approach the teacher model faster.
  2. \(\delta(x)\) changes dynamically with every update iteration of the student model and the generative model. Once the student attains the teacher's capability on some samples, it continues, under the new sample distribution, to learn from the samples on which it still differs most from the teacher. This further improves the student model's performance.
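The dynamics described above, where \(\delta_i(x)\) is recomputed each step from fresh teacher/student outputs scaled by the previous step's IS weight \(w_{i-1}(x)\), can be sketched as one refresh cycle over the buffer. This is a simplified sketch: the probability vectors are made-up placeholders, and normalization details of the real method are omitted.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (softmax outputs)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def refresh_priorities(teacher_probs, student_probs, prev_w,
                       alpha=0.7, beta=0.5, eps=1e-6):
    """One cycle: delta_i(x) = w_{i-1}(x) * KL(t||s), then new P_i and w_i."""
    deltas = np.array([w * kl(t, s)
                       for t, s, w in zip(teacher_probs, student_probs, prev_w)])
    pri = (np.abs(deltas) + eps) ** alpha
    probs = pri / pri.sum()
    new_w = (len(probs) * probs) ** (-beta)
    return deltas, probs, new_w

# Sample 0: student already matches the teacher; sample 1: large gap remains.
t = [[0.9, 0.1], [0.5, 0.5]]
s = [[0.9, 0.1], [0.1, 0.9]]
deltas, probs, w = refresh_priorities(t, s, prev_w=[1.0, 1.0])
```

After the refresh, almost all sampling mass shifts to the still-mismatched sample, which is exactly the "keep learning where the gap remains" behavior described above.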

Experiment




For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].
