Dataset distillation aims to synthesize a small number of images per class (`IPC`) from a large dataset so that training on the synthetic set approximates training on the full dataset with minimal performance loss. Although effective at very small `IPC`, many distillation methods become less effective, or even perform worse than random sample selection, as `IPC` grows. The paper examines state-of-the-art trajectory-matching-based distillation methods across various `IPC` ranges and finds that as `IPC` increases, it becomes difficult to incorporate the complex, rare features of harder samples into the synthetic dataset, leaving a persistent coverage gap between easy and hard test samples. Motivated by these observations, the paper proposes `SelMatch`, a novel distillation method that scales effectively with `IPC`. `SelMatch` manages the synthetic dataset through selection-based initialization and partial updates via trajectory matching, adapting its difficulty level to the target `IPC` range. Evaluated on CIFAR-10/100 and TinyImageNet, `SelMatch` consistently outperforms leading selection-only and distillation-only methods across subset ratios from 5% to 30%.
Paper: SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching
- Paper address: /abs/2406.18561
- Paper code: /Yongalls/SelMatch
Introduction
Dataset reduction is essential for efficient learning from data: it synthesizes or selects a small number of samples from a large dataset such that models trained on the reduced set perform comparably to, or degrade only minimally relative to, models trained on the full dataset. It addresses the high computational and memory costs of training neural networks on large datasets.
An important technique in this field is dataset distillation, also known as dataset condensation, which distills a large dataset into a much smaller synthetic one. Compared with coreset selection, dataset distillation achieves remarkable performance in image classification, especially at very small scales. For example, using only 1% of the CIFAR-10 dataset, the matching training trajectories (`MTT`) algorithm reaches 71.6% accuracy with a simple `ConvNet`, close to the 84.8% accuracy obtained with the full dataset. This remarkable efficiency comes from an optimization process in which synthetic samples are learned in continuous space rather than selected directly from the original dataset.
However, recent studies have shown that as the size of the synthetic dataset, or the number of images per class (`IPC`), increases, many dataset distillation methods lose effectiveness and can even perform worse than random sample selection. This is puzzling, given that distillation has far more optimization freedom than discrete sample selection. `DATM` investigated this phenomenon by analyzing the training trajectories of the state-of-the-art `MTT` method and pointed out that the stage of the training trajectory a method focuses on during synthesis strongly affects the effectiveness of the distilled dataset: the simple patterns learned early in the trajectory and the difficult patterns learned later significantly influence `MTT`'s performance at different `IPC` values.
The paper goes further by comparing, at different `IPC` values, how well the synthetic datasets produced by `MTT` cover easy and hard real samples. It finds that as `IPC` increases, distillation fails to adequately incorporate the rare features of difficult samples into the synthetic dataset, leading to a persistent coverage gap between easy and hard samples. At higher `IPC`, part of the reason distillation becomes less effective is that it keeps focusing on the simpler, more representative features of the dataset, whereas covering harder and rarer features becomes increasingly critical for the generalization of models trained on reduced datasets, as has been validated both empirically and theoretically in data selection studies.
Inspired by these observations, the paper proposes a novel approach called `SelMatch` to scale dataset distillation effectively. As `IPC` increases, the synthetic dataset should cover the more complex and diverse features of the real dataset at an appropriate difficulty level. `SelMatch` manages this difficulty level through selection-based initialization and partial updates via trajectory matching.
- Selection-based initialization: To overcome the tendency of traditional trajectory matching to over-concentrate on simple patterns even as `IPC` grows, the synthetic dataset is initialized with real images whose difficulty level is tuned to each `IPC`. Traditional trajectory matching methods typically initialize the synthetic dataset with randomly selected samples, or with easy, representative samples close to the class centers, to speed up convergence of distillation. The paper instead initializes with a carefully selected subset whose difficulty level matches the size of the synthetic dataset, ensuring that the subsequent distillation starts from samples of a difficulty level appropriate for the target `IPC` range. Experiments show that this selection-based initialization plays an important role in performance.
- Partial update: In traditional dataset distillation, every sample in the synthetic dataset is updated at each distillation iteration. As the number of iterations grows, this steadily reduces the diversity of the synthetic dataset, because the distillation signal is biased towards the simple patterns of the full dataset. To preserve the rare and complex features of difficult samples, which are essential for model generalization at larger `IPC`, the paper introduces partial updates of the synthetic dataset: a fixed portion is kept unchanged while the rest is updated by the distillation signal, with the unchanged proportion adjusted according to `IPC`. Experiments show that partial updates are crucial for scaling dataset distillation effectively.
`SelMatch` is evaluated on CIFAR-10/100 and TinyImageNet and demonstrates superiority over state-of-the-art selection-only and distillation-only methods across subset ratios from 5% to 30%. Notably, on CIFAR-100 with 50 images per class (a 10% ratio), `SelMatch` improves test accuracy by 3.5% over the leading method.
Related Works
There are two main approaches to dataset reduction: sample selection and dataset distillation.
- Sample Selection
There are two main approaches in sample selection: optimization-based and score-based selection.
Optimization-based selection aims to identify a small coreset that effectively represents the diverse features of the full dataset. For example, `Herding` and `K-center` choose a coreset that approximates the distribution of the full dataset, while `Craig` and `GradMatch` seek a coreset that minimizes the average gradient difference from the full dataset during neural network training. Although effective in the small-to-medium `IPC` range, these methods often run into scalability and performance problems compared with score-based selection, especially as `IPC` increases.
Score-based selection assigns each instance a value based on its difficulty or influence during neural network training. For example, the `Forgetting` score measures an instance's learning difficulty by counting how many times it is correctly classified and then misclassified in later epochs, while the `C-score` evaluates difficulty as the probability of misclassifying a sample when it is removed from the training set. By prioritizing difficult samples, these methods capture rare and complex features and outperform optimization-based selection at larger `IPC`. These studies show that as `IPC` increases, incorporating harder or rarer features becomes increasingly important for improving model generalization.
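As a concrete illustration, here is a minimal sketch of how a forgetting count can be computed from per-epoch correctness records (a simplified reading of the `Forgetting` criterion described above; the array layout is an assumption made here for illustration):

```python
import numpy as np

def forgetting_scores(correct):
    """Count forgetting events per training example.

    correct : (num_epochs, num_examples) boolean array; correct[e, i] is True
              if example i was classified correctly at the end of epoch e.
    Returns the number of times each example goes from correctly classified
    to misclassified in a later epoch.
    """
    correct = np.asarray(correct, dtype=bool)
    # A forgetting event at epoch e: correct at epoch e-1 but wrong at epoch e.
    events = correct[:-1] & ~correct[1:]
    return events.sum(axis=0)
```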
- Dataset Distillation
Dataset distillation aims to create a small synthetic set \(\mathcal{S}\) such that a model \(\theta^\mathcal{S}\) trained on \(\mathcal{S}\) generalizes well on the full dataset \(\mathcal{T}\).
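This bi-level objective can be sketched in standard notation as follows (the paper's exact equation is not reproduced in this post; this is the usual formulation implied by the symbols above):

\[
\min_{\mathcal{S}} \; \mathcal{L}^{\mathcal{T}}\!\big(\theta^{\mathcal{S}}\big)
\quad \text{subject to} \quad
\theta^{\mathcal{S}} = \arg\min_{\theta} \; \mathcal{L}^{\mathcal{S}}(\theta).
\]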
Here, \(\mathcal{L}^\mathcal{T}\) and \(\mathcal{L}^\mathcal{S}\) denote the losses on \(\mathcal{T}\) and \(\mathcal{S}\), respectively. To cope with the computational and memory cost of this bi-level optimization, existing work follows two approaches: surrogate-based matching and kernel-based methods. Surrogate-based matching replaces the complex original objective with a simpler proxy task. For example, `DC`, `DSA`, and `MTT` match gradients or trajectories so that the model \(\theta^\mathcal{S}\) trained on \(\mathcal{S}\) follows a trajectory consistent with that of the full dataset \(\mathcal{T}\), while `DM` makes \(\mathcal{S}\) and \(\mathcal{T}\) have similar distributions in feature space. Kernel-based methods instead use kernel approximations of neural network training on \(\theta^\mathcal{S}\) to derive closed-form solutions for the inner optimization. For example, `KIP` performs kernel ridge regression with the neural tangent kernel (`NTK`), and `FrePo` reduces training cost by regressing only over the last learnable layer. However, as `IPC` increases, both surrogate-based matching and kernel-based methods struggle to scale in computation or performance; `DC-BENCH` notes that at high `IPC` these methods perform poorly compared with random sample selection.
Recent research addresses the scalability of the state-of-the-art `MTT` method, either on the computational side by reducing memory requirements, or on the performance side by exploiting training trajectories from later epochs of the full dataset. In particular, `DATM` found that matching early training trajectories improves performance in the low-`IPC` regime, whereas matching later trajectories is more beneficial at high `IPC`. Based on this observation, `DATM` optimizes the trajectory-matching range according to `IPC`, adaptively incorporating easier or harder patterns from the expert trajectories and thus improving `MTT`'s scalability. While `DATM` can effectively determine lower and upper bounds of the trajectory-matching range from trends in the matching loss, explicitly quantifying or searching for the required difficulty level of training trajectories beyond these bounds remains challenging. In contrast, the paper's `SelMatch` uses selection-based initialization and partial updates via trajectory matching to incorporate the complex features of difficult samples appropriate for each `IPC`. In particular, it introduces a novel strategy of initializing the synthetic samples at a difficulty level customized to each `IPC` range, which has not been explored in previous dataset distillation work. Moreover, unlike `DATM`, which is specifically designed to enhance `MTT`, the main components of `SelMatch`, selection-based initialization and partial updates, are more broadly applicable across a variety of distillation methods.
Motivation
Preliminary
- Matching Training Trajectories (MTT)
The state-of-the-art dataset distillation method `MTT` is used as the baseline for analyzing the limitations of traditional dataset distillation at large `IPC`. `MTT` generates a synthetic dataset by matching the training trajectories of the real dataset \(\mathcal{D}_\textrm{real}\) and the synthetic dataset \(\mathcal{D}_\textrm{syn}\). In each distillation iteration, the synthetic dataset is updated to minimize a matching loss defined in terms of the training trajectory \(\{\theta_t^*\}\) on the real dataset \(\mathcal{D}_\textrm{real}\) and the training trajectory \(\{\hat{\theta}_t\}\) on the synthetic dataset \(\mathcal{D}_\textrm{syn}\).
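Written out, the matching loss takes the normalized parameter-distance form of the original MTT work (presumably the Eq. 1 referenced later in this post):

\[
\mathcal{L}(\mathcal{D}_\textrm{syn}, \mathcal{D}_\textrm{real}) =
\frac{\big\lVert \hat{\theta}_{t+N} - \theta^{*}_{t+M} \big\rVert_2^{2}}
     {\big\lVert \theta^{*}_{t} - \theta^{*}_{t+M} \big\rVert_2^{2}}
\]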
Here, \(\theta_t^*\) denotes the parameters of the model trained on \(\mathcal{D}_\textrm{real}\) at step \(t\). Starting from \(\hat{\theta}_{t}=\theta_t^*\), \(\hat{\theta}_{t+N}\) are the model parameters obtained after training for \(N\) steps on \(\mathcal{D}_\textrm{syn}\), while \({\theta}^*_{t+M}\) are the parameters obtained after training for \(M\) steps on \(\mathcal{D}_\textrm{real}\).
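A minimal numerical sketch of evaluating this loss for one expert segment, assuming flattened parameter vectors and a user-supplied gradient function `grad_syn` on the synthetic set (both are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

def mtt_matching_loss(theta_star, grad_syn, t, N, M, lr=0.01):
    """Evaluate the MTT-style matching loss for one expert segment (sketch).

    theta_star : sequence of expert checkpoints; theta_star[t] is the flat
                 parameter vector after t training steps on D_real
    grad_syn   : hypothetical callable grad_syn(theta) returning the gradient
                 of the training loss on the synthetic set D_syn
    """
    # The student starts from the expert checkpoint at step t ...
    theta_hat = np.array(theta_star[t], dtype=float)
    # ... and takes N SGD steps on the synthetic dataset.
    for _ in range(N):
        theta_hat = theta_hat - lr * grad_syn(theta_hat)

    # Normalized squared distance to the expert checkpoint M steps ahead.
    target = np.asarray(theta_star[t + M], dtype=float)
    start = np.asarray(theta_star[t], dtype=float)
    return np.sum((theta_hat - target) ** 2) / np.sum((start - target) ** 2)
```

In the actual method this quantity is further differentiated with respect to the synthetic images themselves by backpropagating through the \(N\) inner steps.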
Limitations of Traditional Methods in Larger IPC
The paper first analyzes how the patterns in the synthetic data generated by `MTT` evolve as the number of images per class (`IPC`) increases. For dataset distillation to remain effective on larger synthetic datasets, the distillation process should keep injecting novel and complex patterns from the real dataset into the synthetic samples as `IPC` grows. Although trajectory matching is state-of-the-art at low `IPC`, it falls short of this goal.
The paper demonstrates this by examining the "coverage" of the real (test) dataset. Coverage is defined as the percentage of real samples that lie within a radius \(r\) of some synthetic sample in feature space, where \(r\) is set to the average nearest-neighbor distance among the real training samples in that feature space. Higher coverage indicates that the synthetic dataset captures the diverse features of the real samples, enabling models trained on it to learn not only the simple but also the complex patterns of the real dataset.
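A minimal sketch of this coverage metric, assuming features have already been extracted with some fixed encoder (the encoder choice and the brute-force numpy distance computation are illustrative assumptions):

```python
import numpy as np

def coverage(test_feats, syn_feats, train_feats):
    """Fraction of real (test) samples within radius r of some synthetic sample.

    All inputs are (num_samples, feat_dim) feature arrays. The radius r is the
    mean nearest-neighbor distance among real training samples, following the
    definition above.
    """
    def pairwise_dists(a, b):
        # Euclidean distance between every row of a and every row of b.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    # Radius r: average nearest-neighbor distance among real training samples.
    d_train = pairwise_dists(train_feats, train_feats)
    np.fill_diagonal(d_train, np.inf)          # exclude self-distances
    r = d_train.min(axis=1).mean()

    # A test sample is "covered" if any synthetic sample lies within r of it.
    d_test = pairwise_dists(test_feats, syn_feats)
    return (d_test.min(axis=1) <= r).mean()
```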
Figure 1a (left) shows how coverage changes as the number of images per class (`IPC`) increases on the CIFAR-10 dataset. Figure 1a (right) further analyzes two sample groups, the "easy" 50% and the "hard" 50% of real samples, where difficulty is measured by forgetting scores.
The observations show that with `MTT`, coverage does not scale effectively as `IPC` increases and is consistently lower than the coverage of random selection. Moreover, the coverage of the hard sample group is much lower than that of the easy group. This suggests that even as `IPC` increases, `MTT` cannot effectively embed difficult and complex data patterns into the synthetic samples, which may explain its poor scaling behavior. The paper's method `SelMatch`, in contrast, shows superior overall coverage, with coverage of the hard group improving notably as `IPC` increases.
Another important finding is that `MTT`'s coverage decreases as the number of distillation iterations grows, as shown in Figure 1b. This further suggests that traditional distillation mainly captures "simple" patterns over many iterations, making the synthetic dataset less diverse as distillation proceeds. In contrast, the coverage of `SelMatch` remains stable even as the number of iterations increases. As shown in Figure 1c, coverage also affects test accuracy: the large coverage difference between the easy and hard test sample groups produces a correspondingly large gap in test accuracy between them. `SelMatch` improves the coverage of both groups and thus the overall test accuracy, with the accuracy on the hard group improving especially as `IPC` increases.
Main Method: SelMatch
Figure 2 illustrates the core idea of `SelMatch`, which combines selection-based initialization with partial updates via trajectory matching. Traditional trajectory matching methods typically initialize the synthetic dataset \(\mathcal{D}_\textrm{syn}\) with a randomly selected subset of the real dataset \(\mathcal{D}_\textrm{real}\), without any specific selection criterion. In each distillation iteration, the entire \(\mathcal{D}_\textrm{syn}\) is updated to minimize the matching loss \(\mathcal{L}(\mathcal{D}_\textrm{syn}, \mathcal{D}_\textrm{real})\) defined in Eq. 1.
In contrast, `SelMatch` first initializes \(\mathcal{D}_\textrm{syn}\) with a carefully selected subset \(\mathcal{D}_\textrm{initial}\) whose samples have a difficulty level tailored to the size of the synthetic dataset. Then, in each distillation iteration, `SelMatch` updates only a fraction \(\alpha\in[0,1]\) of \(\mathcal{D}_\textrm{syn}\) (called \(\mathcal{D}_\textrm{distill}\)), while the remainder (called \(\mathcal{D}_\textrm{select}\)) is kept fixed. The process minimizes the same matching loss \(\mathcal{L}(\mathcal{D}_\textrm{syn}, \mathcal{D}_\textrm{real})\) of Eq. 1, but now \(\mathcal{D}_\textrm{syn}\) is the union of \(\mathcal{D}_\textrm{distill}\) and \(\mathcal{D}_\textrm{select}\).
- Selection-Based Initialization: Sliding Window Algorithm
An important observation from Figure 1 is that traditional trajectory matching tends to focus on the simple, representative patterns of the full dataset rather than its complex data patterns, which leads to poor scalability at larger `IPC`. To overcome this, the paper proposes to initialize the synthetic dataset \(\mathcal{D}_\textrm{syn}\) at a carefully chosen difficulty level, one that includes more complex patterns from the real dataset as `IPC` increases. The challenge is therefore to select a subset of the real dataset \(\mathcal{D}_\textrm{real}\) with an appropriate difficulty level, taking the size of \(\mathcal{D}_\textrm{syn}\) into account.
To solve this, the paper designs a sliding window algorithm. Based on precomputed difficulty scores (the `C-score` for CIFAR-10/100 and the `Forgetting` score for TinyImageNet), the training samples are sorted in descending order of difficulty (hardest to easiest). Window subsets of these samples are then evaluated by training a model on each window subset with a different starting point and comparing test accuracies. For a given threshold \(\beta\in[0,100]\%\), after discarding the hardest \(\beta\)% of samples, the window subset consists of the samples in the \([\beta, \beta+r]\)% range, where \(r=(|\mathcal{D}_\textrm{syn}|/|\mathcal{D}_\textrm{real}|)\times 100\%\) and \(|\mathcal{D}_\textrm{syn}|\) equals `IPC` times the number of classes. Each window subset is constructed to contain the same number of samples from each class.
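A rough sketch of this sliding-window construction and search, assuming per-sample difficulty scores are already available (the `train_and_eval` helper is a hypothetical stand-in for ordinary supervised training plus test evaluation):

```python
import numpy as np

def window_subset(labels, difficulty, beta, ipc):
    """Return indices of one sliding-window subset.

    labels     : (N,) class labels of the real training set
    difficulty : (N,) precomputed difficulty scores (e.g. C-score or
                 forgetting score); higher means harder
    beta       : fraction in [0, 1]; the hardest beta fraction of each class
                 is skipped before taking the window
    ipc        : images per class in the synthetic dataset
    """
    chosen = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # Sort this class from hardest to easiest.
        idx = idx[np.argsort(-difficulty[idx])]
        start = int(round(beta * len(idx)))
        chosen.extend(idx[start:start + ipc])
    return np.array(chosen)

def select_initial_subset(labels, difficulty, ipc, betas, train_and_eval):
    # Train a model on every candidate window and keep the best-performing one.
    accs = [train_and_eval(window_subset(labels, difficulty, b, ipc)) for b in betas]
    best = betas[int(np.argmax(accs))]
    return window_subset(labels, difficulty, best, ipc)
```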
As shown in Figure 3, the starting point of the window, which corresponds to its difficulty level, significantly affects the model's generalization as measured by test accuracy. In particular, for smaller windows (5-10% range), test accuracy can vary by up to 40%. Moreover, the best-performing window subsets, i.e., those achieving the highest test accuracy, tend to include more difficult samples (smaller \(\beta\)) as the subset size increases. This matches the intuition that as `IPC` increases, incorporating complex patterns from the real dataset enhances generalization.
Based on this observation, \(\mathcal{D}_\textrm{syn}\) is initialized to \(\mathcal{D}_\textrm{initial}\), where \(\mathcal{D}_\textrm{initial}\) is the best-performing window subset determined by the sliding window algorithm for the given \(\mathcal{D}_\textrm{syn}\) size. This ensures that the subsequent distillation process starts from images whose difficulty level is optimized for the specific `IPC`.
- Partial Updates
After initializing the synthetic dataset \(\mathcal{D}_\textrm{syn}\) with the optimal window subset \(\mathcal{D}_\textrm{initial}\) selected by the sliding window algorithm, the next goal is to update \(\mathcal{D}_\textrm{syn}\) through dataset distillation so that it effectively incorporates information from the entire real dataset \(\mathcal{D}_\textrm{real}\). Traditionally, the matching training trajectories (`MTT`) algorithm backpropagates through \(N\) model updates to minimize the matching loss of Eq. 1 and thereby updates every sample in \(\mathcal{D}_\textrm{syn}\). However, as shown in Figure 1b, this favors the simpler patterns of the dataset, causing coverage to shrink over successive distillation iterations. To address this and preserve the unique, complex features of real samples, which are crucial for model generalization at larger `IPC`, the paper introduces partial updates of \(\mathcal{D}_\textrm{syn}\).
Based on each sample's difficulty score, the initial synthetic dataset \(\mathcal{D}_\textrm{syn}=\mathcal{D}_\textrm{initial}\) is divided into two subsets, \(\mathcal{D}_\textrm{select}\) and \(\mathcal{D}_\textrm{distill}\). The subset \(\mathcal{D}_\textrm{select}\) contains the \((1-\alpha) \times |\mathcal{D}_\textrm{syn}|\) hardest samples, and the remaining \(\alpha\) fraction is assigned to \(\mathcal{D}_\textrm{distill}\), where \(\alpha\in[0,1]\) is a hyperparameter tuned according to `IPC`.
During the distillation iterations, \(\mathcal{D}_\textrm{select}\) is kept unchanged and only the subset \(\mathcal{D}_\textrm{distill}\) is updated. The goal of the update is to minimize the matching loss between the entire \(\mathcal{D}_\textrm{syn}=\mathcal{D}_\textrm{select}\cup \mathcal{D}_\textrm{distill}\) and \(\mathcal{D}_\textrm{real}\), i.e.,

\[
\min_{\mathcal{D}_\textrm{distill}} \; \mathcal{L}\big(\mathcal{D}_\textrm{select}\cup \mathcal{D}_\textrm{distill}, \; \mathcal{D}_\textrm{real}\big).
\]

Unlike minimizing \(\mathcal{L}(\mathcal{D}_\textrm{distill}, \mathcal{D}_\textrm{real})\) directly, this strategy of partially updating \(\mathcal{D}_\textrm{syn}\) under the full-set loss encourages \(\mathcal{D}_\textrm{distill}\) to focus on knowledge not already present in \(\mathcal{D}_\textrm{select}\), thereby enriching the overall information in \(\mathcal{D}_\textrm{syn}\).
- Combined Augmentation
After the synthetic dataset \(\mathcal{D}_\textrm{syn}\) is created, its effectiveness is evaluated by training a randomly initialized neural network on it. Previous distillation methods typically use Differentiable Siamese Augmentation (`DSA`) when evaluating synthetic datasets. `DSA` applies more sophisticated augmentations than the simple ones commonly used for real datasets (e.g., random cropping and horizontal flipping) and achieves better results on synthetic data, possibly because synthetic datasets mainly capture simpler patterns and therefore benefit from the stronger augmentation of `DSA`.
However, applying `DSA` to the entire synthetic dataset \(\mathcal{D}_\textrm{syn}\) may not be ideal, especially given the presence of the subset \(\mathcal{D}_\textrm{select}\) of harder samples. To address this, the paper proposes a combined augmentation strategy tailored to its synthetic dataset: `DSA` is applied to the distilled part \(\mathcal{D}_\textrm{distill}\), while the selected, more complex subset \(\mathcal{D}_\textrm{select}\) uses simpler, more traditional augmentations. This combination aims to exploit the strengths of both augmentation schemes to improve the overall performance of the synthetic dataset.
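One possible way to realize this combined augmentation when training the evaluation network, assuming an existing DSA implementation is available as a callable `dsa_aug` (the 32×32 CIFAR-style image size, batch layout, and helper names are illustrative assumptions):

```python
import torch
from torchvision import transforms

# Simple augmentation for the selected (harder, real) part of D_syn,
# assuming 32x32 CIFAR-style images.
simple_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
])

def augment_batch(images, is_distilled, dsa_aug):
    """Apply DSA-style augmentation to distilled samples and simple
    augmentation to selected real samples within one batch.

    images       : (B, C, H, W) tensor
    is_distilled : (B,) bool tensor, True for samples from D_distill
    dsa_aug      : callable implementing DSA augmentation (assumed to come
                   from an existing DSA implementation)
    """
    out = images.clone()
    if is_distilled.any():
        out[is_distilled] = dsa_aug(images[is_distilled])
    if (~is_distilled).any():
        out[~is_distilled] = torch.stack([simple_aug(x) for x in images[~is_distilled]])
    return out
```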
Putting it all together, `SelMatch` is summarized in Algorithm 1.
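To make the overall flow concrete, here is a high-level sketch of the procedure described above (not the authors' Algorithm 1; `init_idx_fn` and `matching_loss` are hypothetical stand-ins for the sliding-window selection and the Eq. 1 trajectory-matching loss, which must be differentiable with respect to the synthetic images):

```python
import torch

def selmatch(real_images, difficulty, init_idx_fn, matching_loss,
             alpha, num_iters, lr_img=0.1):
    """Selection-based initialization + partial updates (sketch).

    real_images  : (N, C, H, W) tensor of the real training set
    difficulty   : (N,) tensor of per-sample difficulty scores
    init_idx_fn  : hypothetical sliding-window routine returning indices of
                   the best window subset (D_initial)
    matching_loss: hypothetical callable implementing the Eq. 1 loss on the
                   full synthetic set; labels are omitted here for brevity
    """
    # 1) Selection-based initialization.
    init_idx = init_idx_fn(difficulty)
    syn = real_images[init_idx].clone()

    # 2) Split D_syn by difficulty: the hardest (1 - alpha) fraction is kept
    #    fixed (D_select); the remaining alpha fraction is distilled.
    order = torch.argsort(difficulty[init_idx], descending=True)
    n_select = int((1 - alpha) * len(order))
    select_part = syn[order[:n_select]].detach()
    distill_part = syn[order[n_select:]].clone().requires_grad_(True)
    opt = torch.optim.SGD([distill_part], lr=lr_img)

    # 3) Partial updates: the loss is computed on D_select ∪ D_distill, but
    #    gradients only flow into D_distill.
    for _ in range(num_iters):
        syn_full = torch.cat([select_part, distill_part], dim=0)
        loss = matching_loss(syn_full)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.cat([select_part, distill_part.detach()], dim=0)
```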
Experimental Results