
SelMatch: state-of-the-art dataset distillation with as little as 5% of the training data | ICML'24


Dataset distillation aims to synthesize a small number of images per class (IPC) from a large dataset so that training on the synthetic set approximates training on the full dataset with minimal performance loss. Although effective in the very small IPC regime, many distillation methods become less effective, and even fall behind random sample selection, as IPC grows. The paper examines state-of-the-art trajectory-matching based distillation methods across a range of IPC values and finds that, as IPC increases, they struggle to incorporate the complex, rare features of harder samples into the synthetic dataset, leaving a persistent coverage gap between easy and hard test samples. Motivated by these observations, the paper proposes SelMatch, the first distillation method designed to scale effectively with IPC. SelMatch manages the synthetic dataset through selection-based initialization and partial updates via trajectory matching, tailoring its difficulty level to the target IPC range. On CIFAR-10/100 and TinyImageNet, SelMatch consistently outperforms leading selection-only and distillation-only methods at subset ratios from 5% to 30%.

Paper: SelMatch: Effectively Scaling Up Dataset Distillation via Selection-Based Initialization and Partial Updates by Trajectory Matching

  • Paper address: /abs/2406.18561
  • Paper code: /Yongalls/SelMatch

Introduction


   Dataset reduction is essential for efficient learning from data. It involves synthesizing or selecting a smaller set of samples from a large dataset so that models trained on the reduced dataset perform comparably to, or with minimal degradation relative to, models trained on the full dataset. This addresses the high computational cost and memory requirements of training neural networks on large datasets.

   An important technique in this field is dataset distillation, also known as dataset condensation, which distills a large dataset into a smaller synthetic dataset. Compared with coreset selection methods, dataset distillation shows strong performance in image classification, especially at very small scales. For example, the Matching Training Trajectories (MTT) algorithm, using only 1% of the CIFAR-10 dataset, achieves 71.6% accuracy on a simple ConvNet, close to the 84.8% accuracy obtained with the full dataset. This remarkable efficiency comes from an optimization process in which synthetic samples are learned in continuous space rather than selected directly from the original dataset.

   However, recent studies have shown that as the size of the synthetic dataset, i.e., the number of images per class (IPC), increases, many dataset distillation methods lose effectiveness and can even perform worse than random sample selection. This is puzzling, given that distillation has far more optimization freedom than discrete sample selection. Specifically, DATM investigates this phenomenon by analyzing the training trajectories of the state-of-the-art MTT method, pointing out that the stage of the training trajectory the method focuses on during dataset synthesis strongly affects the effectiveness of the distilled dataset. In particular, the easy patterns learned in early trajectories and the hard patterns learned in later stages significantly influence MTT's performance at different IPC values.

   The paper goes further by comparing how well the synthetic datasets produced by MTT at different IPC values cover easy and hard real samples, and finds that as IPC increases, the distillation method fails to adequately incorporate the rare features of hard samples into the synthetic dataset, leading to a persistent coverage gap between easy and hard samples. In the higher IPC range, part of the reason distillation methods lose effectiveness is that they tend to focus on the simpler, more representative features of the dataset. In contrast, as IPC increases, covering harder and rarer features becomes more critical to the generalization ability of models trained on the reduced dataset, a point that has been validated both empirically and theoretically in data selection studies.

   Inspired by these observations, the paper proposes a novel approach called SelMatch as a solution for effectively scaling up dataset distillation. As IPC increases, the synthetic dataset should cover more complex and diverse features of the real dataset at an appropriate difficulty level. SelMatch manages the desired difficulty level of the synthetic dataset through selection-based initialization and partial updates via trajectory matching.

  1. Selection-based initialization: to overcome the tendency of traditional trajectory-matching methods to over-focus on simple patterns even as IPC grows, the synthetic dataset is initialized with real images whose difficulty level is tailored to each IPC. Traditional trajectory-matching methods typically initialize the synthetic dataset with randomly selected samples, or with simple, representative samples close to the class centers, to speed up distillation convergence. The paper instead initializes the synthetic dataset with a carefully selected subset whose difficulty level is matched to the size of the synthetic dataset. This ensures that the subsequent distillation process starts from samples whose difficulty is optimized for the specific IPC range. Experiments show that selection-based initialization plays an important role in final performance.
  2. Partial updates: in traditional dataset distillation, every sample in the synthetic dataset is updated at each distillation iteration. However, as the number of distillation iterations grows, this process steadily reduces the diversity of the synthetic dataset, because the distillation signal is biased towards the simple patterns of the full dataset. Therefore, to preserve the rare and complex features of hard samples (which are crucial for the generalization ability of models in the larger IPC range), the paper introduces partial updates of the synthetic dataset. The main idea is to keep a fixed portion of the synthetic dataset unchanged while updating the rest with the distillation signal, with the size of the unchanged portion adjusted according to IPC. Experiments show that such partial updates are essential for effectively scaling up dataset distillation.

   SelMatch is evaluated on CIFAR-10/100 and TinyImageNet, and demonstrates clear superiority over state-of-the-art selection-only and distillation-only methods at subset ratios from 5% to 30%. Notably, on CIFAR-100 with 50 images per class (a 10% ratio), SelMatch improves test accuracy by 3.5% over the leading method.

Related Works


   Two main approaches in dataset reduction: sample selection and dataset distillation.

  • Sample Selection

   There are two main approaches in sample selection: optimization-based and score-based selection.

   Optimization-based selection aims to identify a small coreset that effectively represents the diverse features of the full dataset. For example, Herding and K-center choose a coreset that approximates the distribution of the full dataset, while Craig and GradMatch seek a coreset that minimizes the average gradient difference from the full dataset during neural network training. Although effective in the small-to-medium IPC range, these methods often run into scalability and performance problems compared with score-based selection, especially as IPC increases.

   Score-based selection assigns a value to each instance based on its difficulty or influence in neural network training. For example, Forgetting assesses the learning difficulty of an instance by counting how many times it is classified correctly and then misclassified in a later epoch, while C-score evaluates difficulty as the probability of misclassification when the sample is removed from the training set. These methods prioritize hard samples, capturing rare and complex features, and outperform optimization-based selection at larger IPC. These studies show that as IPC increases, incorporating harder or rarer features becomes increasingly important for improving the generalization ability of the model.
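
   As a rough illustration (not the original Forgetting implementation), the sketch below counts forgetting events from a per-epoch correctness matrix, which is assumed to be logged during a normal training run:

```python
import numpy as np

def forgetting_scores(correct: np.ndarray) -> np.ndarray:
    """Count forgetting events per example.

    correct: bool array of shape (num_epochs, num_examples);
             correct[e, i] is True if example i was classified correctly
             at the end of epoch e.
    Returns the number of correct -> incorrect transitions per example;
    higher counts indicate harder examples.
    """
    was_correct = correct[:-1]     # epochs 0 .. E-2
    now_wrong = ~correct[1:]       # epochs 1 .. E-1
    return (was_correct & now_wrong).sum(axis=0)

# Toy usage: 4 epochs, 3 examples.
correct = np.array([
    [True, False, True],
    [True, False, False],   # example 2 is forgotten here
    [True, True,  True],
    [True, False, True],    # example 1 is forgotten here
])
print(forgetting_scores(correct))   # -> [0 1 1]
```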

  • Dataset Distillation

   Dataset distillation aims to create a small synthetic set \(\mathcal{S}\) such that a model \(\theta^\mathcal{S}\) trained on \(\mathcal{S}\) achieves good generalization performance on the full dataset \(\mathcal{T}\):

\[\mathcal{S^*} = \underset{\mathcal{S}}{\text{arg min}} \mathcal{L}^\mathcal{T}(\theta^\mathcal{S}), \text{ with } \theta^\mathcal{S} = \underset{\theta}{\text{arg min}} \mathcal{L}^\mathcal{S}(\theta) \]

   Here, \(\mathcal{L}^\mathcal{T}\) and \(\mathcal{L}^\mathcal{S}\) are the losses on \(\mathcal{T}\) and \(\mathcal{S}\), respectively. To cope with the computational complexity and memory requirements of this bilevel optimization, existing work takes two approaches: surrogate-based matching and kernel-based methods. Surrogate-based matching replaces the complex original objective with simpler proxy tasks. For example, DC, DSA and MTT are designed to match gradients or trajectories, so that the trajectory of a model \(\theta^\mathcal{S}\) trained on \(\mathcal{S}\) is consistent with the trajectory on the full dataset \(\mathcal{T}\), while DM ensures that \(\mathcal{S}\) and \(\mathcal{T}\) have similar distributions in feature space. Kernel-based methods approximate the training of \(\theta^\mathcal{S}\) with a kernel and derive closed-form solutions for the inner optimization. For example, KIP performs kernel ridge regression with the neural tangent kernel (NTK), and FrePo reduces the training cost by regressing only on the last learnable layer. However, as IPC increases, both surrogate-based matching and kernel-based methods struggle to scale in terms of efficiency or performance; DC-BENCH notes that at high IPC these methods perform worse than random sample selection.
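
   To make the bilevel objective concrete, here is a minimal sketch of its outer evaluation: fit a toy linear classifier on a placeholder synthetic set \(\mathcal{S}\) and measure its loss and accuracy on the full set \(\mathcal{T}\). The tensors and model are illustrative stand-ins, not any particular method's pipeline:

```python
import torch
import torch.nn.functional as F

def evaluate_synthetic_set(S_x, S_y, T_x, T_y, num_classes, steps=500, lr=0.01):
    """Inner problem: fit a model on the synthetic set S.
       Outer objective: its loss/accuracy on the full set T."""
    model = torch.nn.Linear(S_x.shape[1], num_classes)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):                      # theta^S = argmin_theta L^S(theta)
        opt.zero_grad()
        F.cross_entropy(model(S_x), S_y).backward()
        opt.step()
    with torch.no_grad():                       # L^T(theta^S)
        logits = model(T_x)
        loss = F.cross_entropy(logits, T_y).item()
        acc = (logits.argmax(1) == T_y).float().mean().item()
    return loss, acc

# Toy usage with random "features": 100 synthetic vs. 5000 real samples.
S_x, S_y = torch.randn(100, 32), torch.randint(0, 10, (100,))
T_x, T_y = torch.randn(5000, 32), torch.randint(0, 10, (5000,))
print(evaluate_synthetic_set(S_x, S_y, T_x, T_y, num_classes=10))
```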

   Recent work on the scalability of the state-of-the-art MTT method focuses either on the computational side, by reducing memory requirements, or on the performance side, by exploiting training trajectories of the full dataset from later epochs. Specifically, DATM found that matching early training trajectories improves performance in the low-IPC regime, while matching later trajectories is more beneficial in the high-IPC regime. Based on this observation, DATM optimizes the trajectory-matching range according to IPC, adaptively incorporating easier or harder patterns from the expert trajectories and thus improving the scalability of MTT. Although DATM can effectively determine lower and upper bounds on the trajectory-matching range, explicitly quantifying or searching for the required difficulty level of training trajectories beyond trends in the matching loss within these ranges remains challenging. In contrast, the paper's SelMatch uses selection-based initialization and partial updates via trajectory matching to incorporate the complex features of hard samples appropriate for each IPC. In particular, it introduces a novel strategy of initializing the synthetic samples at a difficulty level customized to each IPC range, which has not been explored in previous dataset distillation work. Moreover, unlike DATM, which is specifically designed to enhance MTT, the main components of SelMatch, namely selection-based initialization and partial updates, are broadly applicable across a variety of distillation methods.

Motivation


Preliminary

  • Matching Training Trajectories (MTT)

   The state-of-the-art dataset distillation method MTT serves as the baseline for analyzing the limitations of traditional dataset distillation at large IPC. MTT generates the synthetic dataset by matching training trajectories between the real dataset \(\mathcal{D}_\textrm{real}\) and the synthetic dataset \(\mathcal{D}_\textrm{syn}\). In each distillation iteration, the synthetic dataset is updated to minimize the matching loss, defined over the training trajectory \(\{\theta_t^*\}\) on the real dataset \(\mathcal{D}_\textrm{real}\) and the training trajectory \(\{\hat{\theta}_t\}\) on the synthetic dataset \(\mathcal{D}_\textrm{syn}\):

\[\begin{equation} \mathcal{L}(\mathcal{D}_\textrm{syn}, \mathcal{D}_\textrm{real})= \frac{\|\hat{\theta}_{t+N} - \theta^*_{t+M}\|^2_2}{\|\theta^*_{t} - \theta^*_{t+M}\|^2_2}, \end{equation} \]

   Here, \(\theta_t^*\) are the parameters of the model trained on \(\mathcal{D}_\textrm{real}\) at step \(t\). Starting from \(\hat{\theta}_{t}=\theta_t^*\), \(\hat{\theta}_{t+N}\) are the model parameters obtained after training for \(N\) steps on \(\mathcal{D}_\textrm{syn}\), while \({\theta}^*_{t+M}\) are the parameters obtained after training for \(M\) steps on \(\mathcal{D}_\textrm{real}\).
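
   A minimal sketch of this matching loss, assuming the expert checkpoints \(\theta^*_t\), \(\theta^*_{t+M}\) and the student endpoint \(\hat{\theta}_{t+N}\) have already been produced elsewhere and flattened into 1-D tensors:

```python
import torch

def mtt_matching_loss(theta_hat_t_N, theta_star_t_M, theta_star_t):
    """Normalized squared distance between the student endpoint (trained N
    steps on D_syn starting from theta*_t) and the expert endpoint theta*_{t+M}.
    All arguments are flattened 1-D parameter tensors."""
    num = (theta_hat_t_N - theta_star_t_M).pow(2).sum()
    den = (theta_star_t - theta_star_t_M).pow(2).sum()
    return num / den

# Toy usage with a 1000-dim parameter vector.
theta_star_t = torch.randn(1000)
theta_star_t_M = theta_star_t + 0.1 * torch.randn(1000)   # expert after M real steps
theta_hat_t_N = theta_star_t + 0.1 * torch.randn(1000)    # student after N synthetic steps
print(mtt_matching_loss(theta_hat_t_N, theta_star_t_M, theta_star_t))
```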

Limitations of Traditional Methods in Larger IPC

   The paper first analyzes how the patterns in the synthetic data generated by MTT evolve as the number of images per class (IPC) increases. For a dataset distillation method to remain effective for larger synthetic datasets, the distillation process should keep supplying the synthetic samples with novel and complex patterns from the real dataset as IPC grows. Trajectory matching, although state-of-the-art at low IPC, falls short of this goal.

   The paper demonstrates this by examining the "coverage" of the real (test) dataset. Coverage is defined as the fraction of real samples that lie within a certain radius \(r\) of a synthetic sample in feature space, where \(r\) is set to the average nearest-neighbor distance among real training samples in feature space. Higher coverage indicates that the synthetic dataset captures the diverse features of the real samples, enabling models trained on it to learn not only the simple but also the complex patterns of the real dataset.
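
   A small sketch of this coverage metric (the features are assumed to come from some fixed feature extractor; the paper's exact implementation details may differ):

```python
import torch

def coverage(real_feats, syn_feats, r):
    """Fraction of real samples that lie within radius r of at least one
    synthetic sample in feature space."""
    d = torch.cdist(real_feats, syn_feats)           # (num_real, num_syn)
    return (d.min(dim=1).values < r).float().mean().item()

def mean_nn_distance(train_feats):
    """Radius r: average nearest-neighbor distance among real training features."""
    d = torch.cdist(train_feats, train_feats)
    d.fill_diagonal_(float("inf"))                    # ignore self-distance
    return d.min(dim=1).values.mean().item()

# Toy usage with random 64-d features.
train_feats = torch.randn(2000, 64)
test_feats = torch.randn(500, 64)
syn_feats = torch.randn(100, 64)
r = mean_nn_distance(train_feats)
print(coverage(test_feats, syn_feats, r))
```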

   Figure 1a (left) shows how coverage changes as the number of images per class (IPC) increases on the CIFAR-10 dataset. In addition, Figure 1a (right) analyzes two sample groups, the "easy" 50% and the "hard" 50%, where the difficulty of real samples is measured by forgetting scores.

   The observation is that with MTT, coverage does not scale effectively with IPC and is consistently lower than that of random selection. Moreover, the coverage of the hard sample group is much lower than that of the easy group. This suggests that even as IPC increases, MTT fails to embed difficult, complex data patterns into the synthetic samples, which may explain its poor scaling performance. The paper's method, SelMatch, shows superior overall coverage, and in particular the coverage of the hard group improves significantly as IPC increases.

   Another important finding is that MTT's coverage decreases as the number of distillation iterations increases, as shown in Figure 1b. This further suggests that traditional distillation methods mainly capture "simple" patterns over many iterations, making the synthetic dataset less diverse as distillation proceeds. In contrast, the coverage of SelMatch remains stable even as the number of iterations grows. As shown in Figure 1c, coverage also affects test accuracy: the large coverage difference between the easy and hard test sample groups leads to a large gap in their test accuracies. SelMatch improves coverage for both groups and thus improves overall test accuracy, especially the accuracy on the hard group at larger IPC.

Main Method: SelMatch


   Figure 2 shows the core idea of SelMatch, which combines selection-based initialization with partial updates via trajectory matching. Traditional trajectory-matching methods typically initialize the synthetic dataset \(\mathcal{D}_\textrm{syn}\) with a randomly selected subset of the real dataset \(\mathcal{D}_\textrm{real}\), without any specific selection criterion. In each distillation iteration, all of \(\mathcal{D}_\textrm{syn}\) is updated to minimize the matching loss \(\mathcal{L}(\mathcal{D}_\textrm{syn}, \mathcal{D}_\textrm{real})\) defined in Eq. 1.

   In contrast, SelMatch first initializes \(\mathcal{D}_\textrm{syn}\) with a carefully selected subset \(\mathcal{D}_\textrm{initial}\), containing samples whose difficulty level is tailored to the size of the synthetic dataset. Then, in each distillation iteration, SelMatch updates only a specific fraction \(\alpha\in[0,1]\) of \(\mathcal{D}_\textrm{syn}\) (called \(\mathcal{D}_\textrm{distill}\)), while the remaining portion (called \(\mathcal{D}_\textrm{select}\)) stays unchanged. The process still minimizes the same matching loss \(\mathcal{L}(\mathcal{D}_\textrm{syn}, \mathcal{D}_\textrm{real})\) of Eq. 1, but now \(\mathcal{D}_\textrm{syn}\) is the union of \(\mathcal{D}_\textrm{distill}\) and \(\mathcal{D}_\textrm{select}\).

  • Selection-Based Initialization: Sliding Window Algorithm

   An important observation from Figure 1 is that traditional trajectory-matching methods tend to focus on the simple, representative patterns of the full dataset rather than on its complex data patterns, resulting in poor scalability at larger IPC. To overcome this, the paper proposes initializing the synthetic dataset \(\mathcal{D}_\textrm{syn}\) at a carefully chosen difficulty level, one that includes more complex patterns from the real dataset as IPC increases. The challenge is therefore to select a subset of the real dataset \(\mathcal{D}_\textrm{real}\) with an appropriate level of complexity, taking the size of \(\mathcal{D}_\textrm{syn}\) into account.

   To solve this problem, the paper designs a sliding window algorithm. Based on precomputed difficulty scores (the precomputed C-score for CIFAR-10/100 and the Forgetting score for Tiny ImageNet), the training samples are sorted in descending order of difficulty (hardest to easiest). Window subsets with different starting points are then evaluated by training a model on each window subset and comparing test accuracies. For a given threshold \(\beta\in[0,100]\%\), after discarding the hardest \(\beta\%\) of samples, the window subset consists of the samples in the \([\beta, \beta+r]\%\) range, where \(r=(|\mathcal{D}_\textrm{syn}|/|\mathcal{D}_\textrm{real}|)\times 100\%\) and \(|\mathcal{D}_\textrm{syn}|\) equals IPC multiplied by the number of classes. Each window subset is constructed to contain the same number of samples from every class.
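
   A minimal, class-balanced sketch of extracting one such window subset for a given \(\beta\) (the difficulty scores are assumed to be precomputed; the search for the best \(\beta\), done by training a model on each candidate window and comparing test accuracy, is left out):

```python
import numpy as np

def sliding_window_subset(scores, labels, ipc, beta):
    """Select a class-balanced window of samples by difficulty.

    scores: (N,) difficulty scores (higher = harder).
    labels: (N,) integer class labels.
    ipc:    images per class to select.
    beta:   fraction of the hardest samples to skip before the window starts.
    Returns the indices of the selected samples.
    """
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(-scores[idx])]      # hardest -> easiest within class
        start = int(round(beta * len(idx)))
        selected.append(idx[start:start + ipc])  # window of size ipc per class
    return np.concatenate(selected)

# Toy usage: 10 classes, 5000 samples, IPC = 50, skipping the hardest 20%.
labels = np.repeat(np.arange(10), 500)
scores = np.random.rand(5000)
subset = sliding_window_subset(scores, labels, ipc=50, beta=0.2)
print(subset.shape)   # (500,)
```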

   As shown in Figure 3, the starting point of the window corresponds to a difficulty level and significantly affects the model's generalization ability (measured by test accuracy). In particular, for small subset sizes (in the 5-10% range), test accuracy can differ by as much as 40% across window positions. Moreover, the best-performing window subsets, those achieving the highest test accuracy, tend to include harder samples (smaller \(\beta\)) as the subset size grows. This matches the intuition that as IPC increases, incorporating complex patterns from the real dataset enhances the model's generalization ability.

   Based on this observation, \(\mathcal{D}_\textrm{syn}\) is initialized to \(\mathcal{D}_\textrm{initial}\), where \(\mathcal{D}_\textrm{initial}\) is the best-performing window subset determined by the sliding window algorithm for the given \(\mathcal{D}_\textrm{syn}\) size. This ensures that the subsequent distillation process starts from images whose difficulty level is optimized for the specific IPC.

  • Partial Updates

   After initializing the synthetic dataset \(\mathcal{D}_\textrm{syn}\) with the best window subset \(\mathcal{D}_\textrm{initial}\) selected by the sliding window algorithm, the next goal is to update \(\mathcal{D}_\textrm{syn}\) through dataset distillation so as to effectively embed information from the entire real dataset \(\mathcal{D}_\textrm{real}\). The conventional matching training trajectories (MTT) algorithm backpropagates through \(N\) model updates to minimize the matching loss of Eq. 1, thereby updating every sample in \(\mathcal{D}_\textrm{syn}\). However, as shown in Figure 1b, this approach favors the simpler patterns of the dataset, leading to shrinking coverage over successive distillation iterations. To address this and preserve some of the unique, complex features of real samples (which are crucial for model generalization in the larger IPC range), the paper introduces partial updates of \(\mathcal{D}_\textrm{syn}\).

   Based on the difficulty score of each sample, the initial synthetic dataset \(\mathcal{D}_\textrm{syn}=\mathcal{D}_\textrm{initial}\) is divided into two subsets, \(\mathcal{D}_\textrm{select}\) and \(\mathcal{D}_\textrm{distill}\). The subset \(\mathcal{D}_\textrm{select}\) contains the \((1-\alpha) \times |\mathcal{D}_\textrm{syn}|\) hardest samples, while the remaining \(\alpha\) fraction is assigned to \(\mathcal{D}_\textrm{distill}\), where \(\alpha\in[0,1]\) is a hyperparameter tuned according to IPC.

   During the distillation iterations, \(\mathcal{D}_\textrm{select}\) is kept unchanged and only the \(\mathcal{D}_\textrm{distill}\) subset is updated. The update aims to minimize the matching loss between the entire \(\mathcal{D}_\textrm{syn}=\mathcal{D}_\textrm{select}\cup \mathcal{D}_\textrm{distill}\) and \(\mathcal{D}_\textrm{real}\), i.e.:

\[\begin{equation} \label{eq:partial_update} \mathcal{L}( \mathcal{D}_\textrm{select}\cup \mathcal{D}_\textrm{distill}, \mathcal{D}_\textrm{real}), \end{equation} \]

   Unlike minimizing \(\mathcal{L}(\mathcal{D}_\textrm{distill}, \mathcal{D}_\textrm{real})\) alone, this strategy of computing the loss over all of \(\mathcal{D}_\textrm{syn}\) while only partially updating it encourages \(\mathcal{D}_\textrm{distill}\) to concentrate on knowledge that is absent from \(\mathcal{D}_\textrm{select}\), thereby enriching the overall information in \(\mathcal{D}_\textrm{syn}\).
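
   A minimal sketch of the partial-update mechanics, with a toy stand-in for the matching loss (the point is only that \(\mathcal{D}_\textrm{select}\) stays frozen while gradients flow solely into \(\mathcal{D}_\textrm{distill}\), even though the loss is computed over the union):

```python
import torch

ipc, num_classes, dim = 50, 10, 3 * 32 * 32
alpha = 0.5                                   # fraction of D_syn that gets updated

# D_initial: the window subset chosen by the sliding-window algorithm (toy tensors here).
D_initial = torch.randn(ipc * num_classes, dim)
difficulty = torch.rand(ipc * num_classes)    # precomputed per-sample difficulty scores

# The hardest (1 - alpha) fraction stays frozen as D_select; the rest becomes D_distill.
order = torch.argsort(difficulty, descending=True)
n_select = int((1 - alpha) * len(D_initial))
D_select = D_initial[order[:n_select]].clone()                        # frozen
D_distill = D_initial[order[n_select:]].clone().requires_grad_(True)  # learnable

opt = torch.optim.SGD([D_distill], lr=0.1)

def toy_matching_loss(d_syn, d_real):
    # Stand-in for the trajectory-matching loss of Eq. 1: simple mean-feature matching.
    return (d_syn.mean(0) - d_real.mean(0)).pow(2).sum()

for _ in range(100):
    real_batch = torch.randn(256, dim)                     # placeholder real data
    D_syn = torch.cat([D_select, D_distill], dim=0)        # loss over the union
    loss = toy_matching_loss(D_syn, real_batch)
    opt.zero_grad()
    loss.backward()                                        # gradients reach only D_distill
    opt.step()
```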

  • Combined Augmentation

   After creating the synthetic dataset \(\mathcal{D}_\textrm{syn}\), its effectiveness is evaluated by training a randomly initialized neural network on it. Previous distillation methods typically evaluate synthetic datasets using Differentiable Siamese Augmentation (DSA), which involves more sophisticated augmentation than the simple techniques commonly used for real datasets (e.g., random cropping and horizontal flipping) and achieves better results on synthetic data. This improvement is likely because synthetic datasets mainly capture simpler patterns, making them better suited to the stronger DSA augmentations.

   However, applying DSA to the entire synthetic dataset \(\mathcal{D}_\textrm{syn}\) may not be ideal, especially given the presence of the subset \(\mathcal{D}_\textrm{select}\), which contains hard samples. To address this, the paper proposes a combined augmentation strategy tailored to its synthetic dataset: DSA is applied to the distilled portion \(\mathcal{D}_\textrm{distill}\), while simpler, conventional augmentation is used for the selected, harder subset \(\mathcal{D}_\textrm{select}\). This combination aims to exploit the strengths of both augmentation schemes to improve the overall performance of the synthetic dataset.
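
   A rough sketch of routing two different augmentation pipelines to the two portions, using torchvision transforms as a stand-in for DSA (the actual DSA operations differ; this only illustrates the combined strategy and assumes a torchvision version whose transforms accept tensor images):

```python
import torch
from torchvision import transforms

# Stronger pipeline for the distilled (synthetic) portion -- a stand-in for DSA.
strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.6, 1.0)),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomHorizontalFlip(),
])

# Simple, conventional pipeline for the selected real samples.
simple_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
])

def augment_batch(x, is_distill):
    """Apply the strong pipeline to distilled samples and the simple one to
    selected real samples, based on the boolean mask `is_distill`."""
    out = x.clone()
    for i in range(len(x)):
        aug = strong_aug if is_distill[i] else simple_aug
        out[i] = aug(x[i])
    return out

# Toy usage: a batch of 8 CIFAR-sized images, the first half from D_distill.
x = torch.rand(8, 3, 32, 32)
is_distill = torch.tensor([True] * 4 + [False] * 4)
print(augment_batch(x, is_distill).shape)
```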

   Putting it all together, SelMatch is summarized in Algorithm 1.

Experimental Results




If this article helped you, please give it a like or share it ~~
For more content, please follow the WeChat official account [Xiaofei's Algorithm Engineering Notes].

work-life balance.