Fractal geometry is a branch of mathematics most often encountered through graphics. In general, fractals are the result of many recursive iterations. For example, take a line segment and erase its middle third: you obtain two segments, each one third the length of the original, separated by a gap of the same length. Repeating this operation on every remaining segment infinitely many times leaves an infinite set of points separated by gaps in a fixed pattern, known as the Cantor set.
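As a concrete illustration of this kind of recursive construction (not part of the paper; the function name and iteration depth are arbitrary choices), the short sketch below computes the intervals that remain after a few rounds of removing middle thirds:

```python
def cantor_intervals(depth: int):
    """Return the intervals remaining after `depth` middle-third removals on [0, 1]."""
    intervals = [(0.0, 1.0)]
    for _ in range(depth):
        next_intervals = []
        for a, b in intervals:
            third = (b - a) / 3.0
            # Keep the left and right thirds, drop the middle third.
            next_intervals.append((a, a + third))
            next_intervals.append((b - third, b))
        intervals = next_intervals
    return intervals

print(cantor_intervals(3))  # 8 intervals, each of length 1/27
```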
Many studies have explored pre-training models on generated fractal images; such pre-training can match the effect of pre-training on large-scale datasets without using any real images at all, or even any training images related to the downstream task.
The paper searches for a minimal, purely synthetic pre-training dataset that can match the performance of the roughly one million images of ImageNet-1k. It constructs such a dataset, containing only a single fractal image, by generating perturbations of that one fractal.
Paper: Scaling Backwards: Minimal Synthetic Pre-training?
- Paper address: /abs/2408.00677
- Paper code: /SUPER-TADORY/1p-frac
Abstract
Pre-training and transfer learning are important building blocks of current computer vision systems. While pre-training is usually performed on large-scale real-world image datasets, this paper raises an important question: is it really necessary to use such datasets? To this end, the paper presents the following three findings as its main contributions.
(i) Even with a very limited number of synthetic images, pre-training remains effective: performance after full fine-tuning is comparable to that obtained with large-scale pre-training datasets such as ImageNet-1k.
(ii) The paper investigates how the individual parameters used in constructing the dataset define artificial categories, and finds that although the resulting shape differences are almost indistinguishable to humans, these differences are critical for obtaining robust performance.
(iii) Finally, the paper examines the minimum requirements for successful pre-training. Surprisingly, reducing the number of synthetic images from 1k to just 1 can even improve pre-training performance, which motivates further exploration of "scaling backwards".
Finally, extending from synthetic to real images, the paper finds that even a single real image can show a similar pre-training effect through shape augmentation: real images, too, can be effectively "scaled backwards" using grayscale conversion and affine transformations. The source code is available at /SUPER-TADORY/1p-frac.
Introduction
In image recognition, pre-training helps discover underlying visual representations for downstream tasks, improving performance and allowing small task-specific datasets to be used effectively. Recently, pre-training has become a key technique for building foundation models trained on hundreds of millions of images; in some cases, such models can even perform zero-shot recognition without additional data.
Pre-training is often interpreted as the discovery of generic structures in large-scale datasets that subsequently help adaptation to downstream tasks. The paper challenges this interpretation by providing a minimal pre-training dataset generated from a single fractal, which achieves similar downstream performance. The central question of the paper is whether pre-training may simply provide a better weight initialization rather than discovering useful visual concepts. If so, expensive pre-training on hundreds of millions of images may be unnecessary, and pre-training could also be insulated from licensing and ethical issues.
Since the rise of deep neural networks, the ImageNet dataset has been one of the most commonly used pre-training datasets. Initially, pre-training was performed with supervised learning (SL) using manually provided annotations. It is now clear, however, that self-supervised learning (SSL) also enables pre-training without manually supplied labels.
In this context, Asano et al. succeeded in obtaining visual representations while greatly reducing the number of images required. They concluded that with SSL, even a single training image can yield adequate image representations, but only for the earlier layers of the recognition model. However, it is unclear how these findings translate to modern architectures and representation learning methods. Building on this line of work, vision transformers (ViT) have been pre-trained with only \(2{,}040\) real images using instance discrimination as the learning signal.
Recent studies have shown that basic visual representations can be obtained even without real images or manually provided labels, and generating labeled images for synthetic pre-training is an increasingly popular approach. Formula-driven supervised learning (FDSL) generates images from a generating formula and labels from its parameters. Within the FDSL framework, the synthetic pre-training dataset can be adjusted by changing the formula. While FractalDB constructed a dataset of one million images, the paper finds that synthetic pre-training can in fact be reduced to far fewer fractal images.
Inspired by these findings, the paper believes it is possible to identify the essential ingredients of pre-training for image recognition. Previous work has completed ViT training with only \(1{,}000\) synthetically generated images, so the paper hypothesizes that the same performance can be achieved with even fewer images. This question is important as we approach the minimization of synthetic pre-training datasets in image recognition, which runs counter to the trend of ever-larger datasets for foundation models.
In this paper, the authors introduce a minimal synthetic dataset, 1-parameter Fractal as Data (1p-frac), shown in Figure 1, which consists of a single fractal together with a loss function for pre-training.
The paper's contributions to minimalist synthetic pre-training are as follows:
- Minimalism in number: The paper introduces the local perturbation cross-entropy (LPCE) loss to perform pre-training on a single fractal. Training uses perturbed fractal images, and the neural network learns to classify the small perturbations. The experiments demonstrate that pre-training can be performed with only one fractal, and that the pre-training results of 1p-frac are comparable to those of datasets with a million labeled images.
- Minimalism in distribution: The paper introduces the locally integrated empirical probability (LIEP) distribution \(p_{\Delta}\) with a controlled perturbation level \(\Delta\) to investigate the minimal support of the probability distribution of synthetic images. Even a \(\Delta\) that produces shape differences indistinguishable to humans yields a positive pre-training effect, but if \(\Delta\) is too small, visual pre-training breaks down. Based on these observations, the paper establishes general bounds for generating good pre-training images from mathematical formulas.
- Minimalism in instances: According to the experimental results, synthetic images for visual pre-training should contain not merely complex shapes but recursive patterns similar to those of objects in nature. Experiments on augmentation-based classification of real images show that good pre-training results can be achieved by applying affine transformations to grayscale images of objects with prominent edges, operations that are almost identical to the proposed 1p-frac configuration (see the sketch after this list).
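As a rough illustration of this single-real-image idea (grayscale plus affine shape augmentation), the sketch below builds such an augmentation pipeline with torchvision; the parameter ranges and file name are illustrative assumptions, not the paper's settings.

```python
from PIL import Image
from torchvision import transforms

# Augmentation pipeline: grayscale conversion followed by a random affine
# transformation (rotation, translation, scaling, shear). The parameter
# ranges below are illustrative, not the paper's exact configuration.
augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.RandomAffine(degrees=30, translate=(0.2, 0.2),
                            scale=(0.7, 1.3), shear=15),
    transforms.ToTensor(),
])

image = Image.open("single_image.jpg")        # hypothetical path to the one real image
views = [augment(image) for _ in range(8)]    # eight augmented "pseudo-samples"
```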
In summary, the paper dramatically reduces the size of the pre-training dataset, from the original one million images (the fractal database, FractalDB) or 1,000 images (the one-instance fractal database, OFDB) to a single image, and shows that this can even improve the pre-training effect, which motivates the idea of scaling backwards.
Scaling Backwards with a Single Fractal
1-parameter Fractal as Data (1p-frac) contains only a single fractal, and the paper proposes a method for pre-training neural networks on it. The key idea is to introduce a locally integrated empirical probability (LIEP) distribution \(p_{\Delta}\), which enables pre-training even when only one fractal image is available. The LIEP distribution is designed so that, as the perturbation level \(\Delta \in \mathbb{R}_{\geq 0}\) approaches zero, it converges to the empirical distribution of the single image \(I\), \(p_{\text{data}}(x) = \delta(x-I)\); the support of the distribution can therefore be narrowed by decreasing \(\Delta\), as shown in Figure 2a.
Preliminaries
- FractalDB

Neural networks can be effectively pre-trained on FractalDB, a database of fractal images generated by iterated function systems (IFSs). Specifically, FractalDB \(\mathcal{F}\) contains one million synthetic images: \(\mathcal{F} = \{(\Omega_{c}, \{I_{i}^{c}\}_{i=0}^{M-1})\}_{c=0}^{C-1}\), where \(\Omega_{c}\) is an IFS and \(I^{c}_{i}\) is a fractal image generated from \(\Omega_{c}\); \(C = 1{,}000\) is the number of fractal categories and \(M = 1{,}000\) is the number of images per category. Each IFS generates one specific fractal category \(c\) and is defined as \(\Omega_{c} = \{(w_{j}, p_{j})\}_{j=1}^{N_{c}}\),
where \(w_{j} : \mathbb{R}^{2} \to \mathbb{R}^{2}\) is a two-dimensional affine transformation whose parameters are sampled from the uniform distribution on \([-1, 1]^6\), \(N_c\) is drawn at random from a set of candidate values and determines the number of affine transformations in the current IFS, and \(p_{j}\) is a probability mass computed from \(w_j\) (normalized to sum to 1) that gives the probability of \(w_j\) being drawn. Each \(w_j\) has the form \(w_{j}(\mathbf{v}) = A_{j}\mathbf{v} + \mathbf{b}_{j}\), where \(A_{j} \in \mathbb{R}^{2 \times 2}\) and \(\mathbf{b}_{j} \in \mathbb{R}^{2}\) hold the six sampled parameters.
Each fractal image \(I^{c}_{i}\) is a two-dimensional image rendered by plotting the point set \(F = \{\mathbf{v}_{t}\}_{t=0}^{T} \subset \mathbb{R}^{2}\), where each point \(\mathbf{v}_{t}\) is determined by the recurrence \(\mathbf{v}_{t+1} = w_{\sigma_{t}} (\mathbf{v}_{t})\) for \(t = 0, 1, 2, \cdots, T-1\). The initial point is set to \(\mathbf{v}_{0} = (0, 0)^{\top}\), and the index \(\sigma_{t}\) is sampled independently at each step \(t\) from the probability mass distribution \(p(\sigma_{t} = j) = p_{j}\).
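To make the rendering procedure concrete, here is a minimal NumPy sketch of this recurrence: it repeatedly applies randomly chosen affine maps and rasterizes the visited points. The function name, image size, and IFS parameters below are arbitrary illustrative choices, not values taken from FractalDB.

```python
import numpy as np

def render_ifs(transforms, probs, T=100_000, size=256):
    """Render a fractal by iterating v_{t+1} = w_{sigma_t}(v_t) and plotting the points.

    transforms: list of (A, b) pairs with A a 2x2 matrix and b a 2-vector.
    probs:      sampling probabilities p_j for each affine map.
    """
    rng = np.random.default_rng(0)
    v = np.zeros(2)                               # initial point v_0 = (0, 0)
    points = np.empty((T, 2))
    for t in range(T):
        j = rng.choice(len(transforms), p=probs)  # sigma_t sampled with p(sigma_t = j) = p_j
        A, b = transforms[j]
        v = A @ v + b                             # v_{t+1} = w_j(v_t)
        points[t] = v
    # Normalize points into pixel coordinates and rasterize as a binary image.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    pix = ((points - mins) / (maxs - mins + 1e-8) * (size - 1)).astype(int)
    image = np.zeros((size, size), dtype=np.uint8)
    image[pix[:, 1], pix[:, 0]] = 255
    return image

# Example IFS with two affine maps (illustrative parameters only).
transforms = [(np.array([[0.6, -0.3], [0.3, 0.6]]), np.array([0.1, 0.0])),
              (np.array([[0.4, 0.2], [-0.2, 0.4]]), np.array([-0.3, 0.4]))]
image = render_ifs(transforms, probs=[0.6, 0.4])
```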
For pre-training with FractalDB, a cross-entropy loss can be used:

\[\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, y) \sim p_{\text{data}}}\left[\log p_{\theta}(y \mid x)\right],\]

where \(p_{\theta}\) is the category distribution predicted by the neural network and \(\theta\) is its set of learnable parameters. The joint empirical distribution \(p_{\text{data}}\) over the sampled dataset can be defined as

\[p_{\text{data}}(x, y) = \frac{1}{CM}\sum_{c=0}^{C-1}\sum_{i=0}^{M-1}\delta\left(x - I_{i}^{c}\right)\,\delta(y - c),\]

where \(\delta\) is Dirac's delta function. A model pre-trained on such a dataset performs comparably on some downstream tasks to models pre-trained on real-world image datasets such as ImageNet-1k and Places365.
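As a minimal sketch of what minimizing this cross-entropy objective looks like in practice (assuming a PyTorch classifier `model`, a data loader yielding (fractal image, category) batches, and an `optimizer`; these names are illustrative and not taken from the paper's released code):

```python
import torch
import torch.nn.functional as F

def pretrain_one_epoch(model, loader, optimizer, device="cuda"):
    """One epoch of supervised pre-training with cross-entropy on (image, category) pairs."""
    model.train()
    for images, labels in loader:              # images: (B, 3, H, W), labels: (B,) category ids
        images, labels = images.to(device), labels.to(device)
        logits = model(images)                 # (B, C) class scores
        loss = F.cross_entropy(logits, labels) # -log p_theta(y | x), averaged over the batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```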
- OFDB

The OFDB dataset consists of 1,000 fractal images. Specifically, OFDB \(\mathcal{F}_{\text{OF}}\) contains only one representative image from each category: \(\mathcal{F}_{\text{OF}} = \{\Omega_{c}, I_{c}\}_{c=0}^{C-1}\). Thus, the joint empirical distribution simplifies to

\[p_{\text{data}}(x, y) = \frac{1}{C}\sum_{c=0}^{C-1}\delta\left(x - I_{c}\right)\,\delta(y - c).\]

A model pre-trained on this dataset performs as well as, or even better than, a model pre-trained on FractalDB. This suggests that there exists a small critical mass of images for visual pre-training. However, reducing the number of fractal images \(C\) below 1,000 degrades performance.
Pre-training with a Single Fractal
- Scaling backwards

To further analyze the minimum number of images required for successful visual pre-training, the paper proposes 1p-frac, which reduces the number of IFSs and images to one, i.e. \(\mathcal{F}_{\text{OP}} = (\Omega, I)\). The empirical distribution for this dataset becomes

\[p_{\text{data}}(x, y) = \delta(x - I)\,\delta(y - 0),\]

i.e. there is only a single category \(y = 0\).
However, the paper notes that training the neural network with the cross-entropy loss is not effective in this setting, because \(p_{\theta}(y=0|x) \equiv 1~(\forall x)\) is a trivial solution that minimizes the loss. To solve this problem, a local perturbation cross-entropy (LPCE) loss \(\mathcal{L}_{\Delta}\) is introduced, a variant of the cross-entropy loss defined through the LIEP distribution.
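Concretely (a short expansion of this point, not reproduced from the paper): with a single class \(y = 0\), the cross-entropy objective reduces to

\[\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\theta}(y=0 \mid x)\big] \geq 0,\]

and the minimum value of zero is reached by any network that outputs \(p_{\theta}(y=0 \mid x) = 1\) for every input, regardless of image content, so no useful representation needs to be learned.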
Definition 1: Let \(I_{\mathbf{\epsilon}} \in \mathcal{X}\) be a perturbed image, where \(\mathcal{X}\) is the set of images, \(\mathbf{\epsilon} \in \mathbb{R}^{d}\) is a small perturbation with \(d \in \mathbb{N}_{>0}\), and \(I_{\mathbf{0}} = I\) is the original image. The LIEP distribution is defined as

\[p_{\Delta}(x) = \frac{1}{\left|\mathcal{R}_{\Delta}\right|}\int_{\mathcal{R}_{\Delta}} \delta\left(x - I_{\mathbf{\epsilon}}\right)\, d\mathbf{\epsilon},\]

where \(\mathcal{R}_{\Delta} \subset \mathbb{R}^{d}\) is a set containing the origin and \(\left|\mathcal{R}_{\Delta}\right|\) is its volume, of order \(O(|\Delta|^{d})\).
Definition 2: The LPCE loss \(\mathcal{L}_{\Delta}\) is defined as the cross-entropy loss taken with respect to the LIEP distribution \(p_{\Delta}\) rather than the empirical distribution \(p_{\text{data}}\), so that the network is trained to distinguish the small perturbations applied to the single image.
If \(\mathcal{R}_{\Delta}\) is a small hypercube or hypersphere, then as \(\Delta\) converges to zero, \(p_{\Delta}\) approaches the single-image empirical distribution \(\delta(x - I)\). This loss therefore makes it possible to analyze the visual pre-training effect while narrowing the support of the distribution around a single image.
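Intuitively, assuming the map \(\mathbf{\epsilon} \mapsto I_{\mathbf{\epsilon}}\) is continuous at the origin (an assumption stated here only to make the argument explicit), the local average in the LIEP distribution concentrates on the original image as the region shrinks:

\[\lim_{\Delta \to 0}\ \frac{1}{\left|\mathcal{R}_{\Delta}\right|}\int_{\mathcal{R}_{\Delta}} \delta\left(x - I_{\mathbf{\epsilon}}\right)\, d\mathbf{\epsilon} = \delta(x - I),\]

so \(p_{\Delta}\) interpolates between the single-image empirical distribution (small \(\Delta\)) and a broader distribution of perturbed fractals (larger \(\Delta\)).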
In 1p-frac, the perturbation is applied to the affine transformations, so the overall perturbation \(\mathbf{\epsilon} = (\mathbf{\epsilon}_1, \ldots, \mathbf{\epsilon}_N)\) lies in \(\mathbb{R}^{6N}\), where \(N\) is the number of affine transformations in \(\Omega\). A perturbed image \(I_{\mathbf{\epsilon}}\) is obtained by rendering the fractal with the noisy affine transformations, i.e. with the six parameters of each \(w_j\) shifted by \(\mathbf{\epsilon}_j\). Here \(\mathbf{\epsilon}_j \in [-\Delta/2, \Delta/2]^{6}\), so \(\mathcal{R}_{\Delta} = [-\Delta/2, \Delta/2]^{6N} \subset \mathbb{R}^{6N}\) is a hypercube of side length \(\Delta\) and \(\left|\mathcal{R}_{\Delta}\right| = \Delta^{6N}\). Note that in practice the expectation over \(\mathcal{R}_{\Delta}\) in the LPCE loss is approximated numerically by uniformly sampling \(L\) points, with a default of \(L = 1{,}000\).
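The following sketch illustrates one way such a sampled approximation could look, reusing the `render_ifs` helper and example `transforms` from the sketch above: sample \(L\) perturbations of the affine parameters and treat each one as its own pseudo-class, so that a classifier trained with cross-entropy (as in the earlier training-loop sketch) learns to distinguish the perturbations. The perturbation layout and class assignment here are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def sample_perturbed_dataset(transforms, probs, delta, L=1000, size=256, seed=0):
    """Render L perturbed versions of a single IFS; each perturbation is one pseudo-class.

    Each affine map's six parameters (the entries of A and b) are shifted by noise
    drawn uniformly from [-delta/2, delta/2].
    """
    rng = np.random.default_rng(seed)
    images, labels = [], []
    for label in range(L):
        perturbed = []
        for A, b in transforms:
            eps = rng.uniform(-delta / 2, delta / 2, size=6)
            perturbed.append((A + eps[:4].reshape(2, 2), b + eps[4:]))
        images.append(render_ifs(perturbed, probs, size=size))  # from the earlier sketch
        labels.append(label)
    return np.stack(images), np.array(labels)

# L perturbed fractals from the single IFS; a classifier is then trained to
# predict which of the L perturbations produced each image (cross-entropy).
images, labels = sample_perturbed_dataset(transforms, probs=[0.6, 0.4], delta=0.1, L=16)
```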
- Visualization

Figure 3b shows examples of the perturbed images used to compute the LPCE loss. While most of the shape differences are indistinguishable to humans, by minimizing the LPCE loss the neural network learns to distinguish the perturbations applied to the single image.
- Complexity of \(\Omega\)

The \(\sigma\) factor is used to evaluate the complexity of the IFS. As shown in Figure 3c, smaller \(\sigma\) values produce more complex fractal shapes.
Experiments