
1p-frac: open-sourced, rivals ImageNet pre-training with just a single fractal image | ECCV 2024

Popularity: 527 · 2024-09-04 09:48:47

Fractal geometry is a branch of mathematics mainly applied to generating graphics. In general, a fractal is the result of many recursive iterations. For example, take a line segment and erase its middle third: you obtain two segments, each one-third the length of the original, separated by a gap of the same length. Repeating this on every remaining segment indefinitely yields an infinite set of points separated by gaps in a fixed pattern: the Cantor set.
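The middle-third construction above can be sketched in a few lines of Python (an illustrative sketch; the function name is ours):

```python
def cantor_intervals(depth):
    """Return the intervals of the Cantor set after `depth` middle-third removals."""
    intervals = [(0.0, 1.0)]
    for _ in range(depth):
        next_intervals = []
        for a, b in intervals:
            third = (b - a) / 3.0
            # Keep the two outer thirds, drop the middle third.
            next_intervals.append((a, a + third))
            next_intervals.append((b - third, b))
        intervals = next_intervals
    return intervals
```

After `depth` steps there are `2**depth` intervals, each of length `3**-depth`, matching the recursive description above.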

There are many studies on pre-training models with generated fractal images, which can match the effect of pre-training on large-scale datasets without using any real images at all, or even any images related to the downstream task.

The paper searches for a minimal, purely synthetic pre-training dataset capable of matching the performance of ImageNet-1k and its roughly one million images. It constructs such a dataset, containing only a single fractal image, by generating perturbations of one fractal.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)

Paper: Scaling Backwards: Minimal Synthetic Pre-training?

  • Paper: /abs/2408.00677
  • Code: /SUPER-TADORY/1p-frac

Abstract


  Pre-training and transfer learning are important building blocks of current computer vision systems. While pre-training is usually performed on large-scale real-world image datasets, this paper raises an important question: is it really necessary to use such datasets? To answer it, we present the following three findings as our main contributions.

(i) Even with a very limited number of synthetic images, we show that pre-training remains effective, with full fine-tuning performance comparable to that obtained with large-scale pre-training datasets such as ImageNet-1k.

(ii) We investigate how the individual parameters used in constructing the dataset define artificial categories. We find that, although the shape differences appear almost indistinguishable to humans, it is precisely these differences that are critical for obtaining robust performance.

(iii) Finally, we examine the minimum requirements for successful pre-training. Surprisingly, reducing the number of synthetic images dramatically, from 1k to just one, can even improve pre-training performance, inspiring us to further explore the possibility of "scaling backwards".

Finally, we extend from synthetic to real images and find that even a single real image can yield a similar pre-training effect through shape augmentation: real images can be effectively "scaled backwards" using grayscale conversion and affine transformations. The source code is available at https://SUPER-TADORY/1p-frac.

Introduction


  In image recognition, pre-training helps discover underlying visual representations for downstream tasks, improving performance on visual tasks and allowing small task-specific datasets to be exploited. Recently, pre-training has become a key technique for building foundation models trained on hundreds of millions of images. In some cases, the foundation model can even perform zero-shot recognition without additional data.

  Pre-training is often interpreted as the discovery of generic structures in large-scale datasets that subsequently help adaptation to downstream tasks. The paper challenges this interpretation by providing a framework that generates a minimal pre-training dataset from a single fractal yet achieves similar downstream performance. The central question of the paper is whether pre-training may simply provide better weight initialization rather than the discovery of useful visual concepts. If true, expensive pre-training with hundreds of millions of images may be unnecessary, which would also insulate pre-training from licensing and ethical issues.

  Since the rise of deep neural networks, the ImageNet dataset has been one of the most commonly used pre-training datasets. Initially, pre-training was performed by supervised learning (SL) using manually provided annotations. However, it is now clear that self-supervised learning (SSL) also enables pre-training without manually supplied labels.

  In this context, Asano et al. succeeded in obtaining visual representations while greatly reducing the number of images required. They concluded that SSL can produce adequate image representations even with only one training example, but only for the earlier layers of the recognition model. However, it is unclear how these findings translate to modern architectures and representation-learning methods. Building on this, it was shown that vision transformers (ViT) can be pre-trained with only \(2{,}040\) real images, using instance discrimination as the learning signal.

  Recent studies have shown that basic visual representations can be obtained even without real images and manually provided labels, and the trend of generating labeled images for synthetic pre-training is on the rise. Formula-driven supervised learning (FDSL) generates an image from a generating formula and a label from its parameters. In the FDSL framework, the synthesized pre-training image dataset can be adjusted by changing the formula. While FractalDB was constructed as a dataset of millions of images, the paper finds that synthetic pre-training can in fact be reduced to far fewer fractal images.

  Inspired by these findings, the paper believes it is possible to find the essential ingredients of pre-training for image recognition. Prior work has completed ViT training with only \(1{,}000\) manually generated images, so the paper argues that the same performance may be achievable with even fewer images. This question is important as we approach the minimization of synthetic pre-training datasets in image recognition, which runs counter to the trend of ever-growing foundation-model datasets.

  In this paper, the authors introduce a minimized synthetic dataset, 1-parameter Fractal as Data (1p-frac), as shown in Figure 1, which comprises a single fractal together with a loss function for pre-training.

  The paper's contributions to minimalist synthetic pre-training are as follows:

  1. Ordinal minimalism: The localized perturbation cross-entropy (LPCE) loss is introduced to pre-train on a single fractal: the network is trained on perturbed fractal images and learns to classify the small perturbations. The experiments demonstrate that pre-training can be performed even with only one fractal; moreover, the 1p-frac pre-training results are comparable to those of million-image labeled datasets.
  2. Distributional minimalism: A locally integrated empirical (LIEP) distribution \(p_{\Delta}\) with a controlled perturbation level \(\Delta\) is introduced to investigate the minimal support of the probability density distribution of synthetic images. The paper observes that even a \(\Delta\) producing shape differences indistinguishable to humans yields a positive pre-training effect, but also demonstrates that visual pre-training breaks down if \(\Delta\) is too small. Based on these observations, the paper establishes generic bounds for generating good pre-training images from mathematical formulas.
  3. Instance minimalism: The experimental results indicate that synthetic images should not merely contain complex shapes, but should exhibit recursive patterns similar to those of objects in nature for visual pre-training. Experiments on augmented classification of real images show that good pre-training results can be achieved by applying affine transformations to grayscale images of objects with prominent edges, operations that are almost identical to the proposed 1p-frac configuration.

  In summary, the paper dramatically reduces the size of the pre-training dataset, from the original one million images (FractalDB) or 1,000 images (the one-instance fractal database, OFDB) down to just one, and shows that this can even improve the pre-training effect, which motivates the idea of scaling backwards.

Scaling Backwards with a Single Fractal


1-parameter Fractal as Data (1p-frac) contains only a single fractal, and the paper proposes a method for pre-training neural networks on it. The key idea is to introduce a locally integrated empirical (LIEP) distribution \(p_{\Delta}\) that enables pre-training even when there is only one fractal image. The LIEP distribution is designed so that, as the perturbation level \(\Delta \in \mathbb{R}_{\geq 0}\) approaches zero, it converges to the empirical distribution of a single image \(I\), \(p_{\text{data}}(x) = \delta(x-I)\); the support of the distribution can thus be narrowed by decreasing \(\Delta\), as shown in Figure 2a.

Preliminary

  • FractalDB

FractalDB is a dataset of fractal images generated by iterated function systems (IFSs), and it can be used to pre-train neural networks effectively. Specifically, FractalDB \(\mathcal{F}\) contains one million synthetic images: \(\mathcal{F} = \{(\Omega_{c}, \{I_{i}^{c}\}_{i=0}^{M-1})\}_{c=0}^{C-1}\), where \(\Omega_{c}\) is an IFS and \(I^{c}_{i}\) is a fractal image generated from \(\Omega_{c}\); \(C = 1{,}000\) is the number of fractal categories and \(M = 1{,}000\) is the number of images per category. Each IFS generates one specific fractal category \(c\) and is defined as follows:

\[\begin{align} \Omega_{c} = \{\mathbb{R}^{2}; w_{1}, w_{2}, \ldots, w_{N_{c}}; p_{1}, p_{2}, \ldots, p_{N_{c}} \}, \end{align} \]

  where \(w_{j} : \mathbb{R}^{2} \to \mathbb{R}^{2}\) is a two-dimensional affine transformation whose parameters are sampled from a uniform distribution over \([-1, 1]^6\), \(N_c\) is the number of affine transformations in the current IFS, drawn at random from a set of candidates, and \(p_{j}\) is a probability mass computed from \(w_j\), representing the probability that \(w_j\) is selected. \(w_j\) is defined as

\[\begin{align} w_{j} \left(\mathbf{v} \right) = \begin{bmatrix} a_j & b_j \\ c_j & d_j \end{bmatrix} \mathbf{v} + \begin{bmatrix} e_j\\ f_j \end{bmatrix} ~(\mathbf{v} \in \mathbb{R}^{2}), \end{align} \]

  Each fractal image \(I^{c}_{i}\) is a two-dimensional image rendered by repeatedly drawing the points \(F = \{\mathbf{v}_{t}\}_{t=0}^{T} \subset \mathbb{R}^{2}\), where the point \(\mathbf{v}_{t}\) is determined by the recurrence \(\mathbf{v}_{t+1} = w_{\sigma_{t}} (\mathbf{v}_{t})\) for \(t = 0, 1, 2, \cdots, T-1\). The initial point is \(\mathbf{v}_{0} = (0, 0)^{\top}\), and the index \(\sigma_{t}\) is sampled independently at each step \(t\) from the probability mass distribution \(p(\sigma_{t} = j) = p_{j}\).
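The rendering procedure above can be sketched as a small "chaos game" in Python (a minimal illustration with names of our own choosing; a Sierpinski-triangle IFS stands in for a sampled \(\Omega_{c}\)):

```python
import numpy as np

def render_ifs(transforms, probs, T=10_000, size=64, seed=0):
    """Render a fractal via the chaos game: v_{t+1} = w_{sigma_t}(v_t).

    transforms: list of (A, b) pairs, A a 2x2 matrix and b a 2-vector,
    i.e. the affine maps w_j(v) = A v + b of an IFS.
    probs: selection probabilities p_j for the maps.
    Returns a size x size binary occupancy image of the drawn points.
    """
    rng = np.random.default_rng(seed)
    sigmas = rng.choice(len(transforms), size=T, p=probs)  # indices sigma_t
    v = np.zeros(2)  # initial point v_0 = (0, 0)
    points = np.empty((T, 2))
    for t in range(T):
        A, b = transforms[sigmas[t]]
        v = A @ v + b
        points[t] = v
    # Normalize the point cloud into pixel coordinates and mark hit cells.
    lo, hi = points.min(axis=0), points.max(axis=0)
    scaled = (points - lo) / np.maximum(hi - lo, 1e-9) * (size - 1)
    img = np.zeros((size, size), dtype=np.uint8)
    xs, ys = scaled.astype(int).T
    img[ys, xs] = 1
    return img

# Three half-scale contractions generate the Sierpinski triangle.
tri = [(np.eye(2) * 0.5, np.array([0.0, 0.0])),
       (np.eye(2) * 0.5, np.array([0.5, 0.0])),
       (np.eye(2) * 0.5, np.array([0.25, 0.5]))]
img = render_ifs(tri, [1/3, 1/3, 1/3])
```

The occupied pixels form a sparse self-similar pattern, which is exactly the kind of image FractalDB collects at scale.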

  For pre-training on FractalDB, the cross-entropy loss function can be used:

\[\begin{align} \mathcal{L} = -\mathbb{E}_{x,y \sim p_{\text{data}}}[\log p_{\theta}(y|x)], \end{align} \]

  where \(p_{\theta}\) is the category distribution predicted by the neural network and \(\theta\) is the set of learnable parameters. The joint empirical distribution over the sampled dataset, \(p_{\text{data}}\), is defined as follows:

\[\begin{align} p_{\text{data}}(x,y; \mathcal{F}) = \frac{1}{MC} \sum_{i=0}^{M-1} \sum_{c=0}^{C-1} \delta(x-I^{c}_{i}) \delta(y - c) \end{align} \]

  where \(\delta\) is Dirac's delta function. A model pre-trained on such a dataset can perform as well on some downstream tasks as models pre-trained on real-world image datasets such as ImageNet-1k and Places365.
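As a sanity check, the loss above is simply the average negative log-probability assigned to the correct category; a minimal numerical sketch (function and variable names are ours):

```python
import numpy as np

def cross_entropy(probs, labels):
    """Estimate L = -E[log p_theta(y|x)] over a batch of predictions.

    probs: (B, C) array of predicted category probabilities p_theta(.|x).
    labels: (B,) array of category indices c.
    """
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# A uniform prediction over C = 4 categories gives a loss of log(4).
uniform = np.full((2, 4), 0.25)
loss = cross_entropy(uniform, np.array([0, 3]))  # ≈ log(4) ≈ 1.386
```

A perfect one-hot prediction on the correct label would instead drive the loss to zero.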

  • OFDB

OFDB is a dataset consisting of 1,000 fractal images. Specifically, OFDB \(\mathcal{F}_{\text{OF}}\) includes only one representative image from each category: \(\mathcal{F}_{\text{OF}} = \{\Omega_{c}, I_{c}\}_{c=0}^{C-1}\). The joint empirical distribution thus simplifies to:

\[\begin{align} p_{\text{data}}(x,y; \mathcal{F}_{\text{OF}}) = \frac{1}{C} \sum_{c=0}^{C-1} \delta(x-I_{c}) \delta(y - c). \end{align} \]

  Models pre-trained on this dataset perform as well as or even better than models pre-trained on FractalDB. This work suggests that there exists a small critical set of images for visual pre-training. However, reducing the number of fractal images \(C\) below 1,000 degrades performance.

Pre-training with a Single Fractal

  • Scaling backwards

  To further analyze the minimum number of images required for successful visual pre-training, the paper proposes 1p-frac, which ultimately reduces the number of IFSs and the number of images to one, i.e., \(\mathcal{F}_{\text{OP}} = (\Omega, I)\). The empirical distribution when using this dataset is as follows:

\[\begin{align} \label{eq:emp_one_image} p_{\text{data}}(x,y; \mathcal{F}_{\text{OP}}) = \delta(x - I) \delta(y). \end{align} \]

  However, the paper notes that training the neural network with the cross-entropy loss is ineffective here, because \(p_{\theta}(y=0|x) \equiv 1~(\forall x)\) is a trivial solution that minimizes the loss. To solve this problem, the localized perturbation cross-entropy (LPCE) loss \(\mathcal{L}_{\Delta}\) is introduced, a variant of the cross-entropy loss defined via the LIEP distribution.

Definition 1: Let \(I_{\mathbf{\epsilon}} \in \mathcal{X}\) be a perturbed image, where \(\mathcal{X}\) is the set of images and \(\mathbf{\epsilon} \in \mathbb{R}^{d}\), with \(d \in \mathbb{N}_{>0}\), is a small perturbation; \(I_{\mathbf{0}} = I\) is the original image. The LIEP distribution is defined as:

\[\begin{align} \label{eq:distribution} p_{\Delta}(x, y) &= \frac{1}{|\mathcal{R}_{\Delta}|} \int_{\mathcal{R}_{\Delta}} \delta(x - I_{\mathbf{\epsilon}}) \delta(y - \mathbf{\epsilon}) d \mathbf{\epsilon} \end{align} \]

  where \(\mathcal{R}_{\Delta} \subset \mathbb{R}^{d}\) is a set containing the origin, and \(|\mathcal{R}_{\Delta}|\) is its volume, of order \(O(|\Delta|^{d})\).

Definition 2: The LPCE loss is defined as:

\[\begin{align} \mathcal{L}_{\Delta} &= -\mathbb{E}_{x, y \sim p_{\Delta}} \left[ \log p_{\theta}(y | x) \right], \end{align} \]

  where \(p_{\Delta}\) is the LIEP distribution.

  If \(\mathcal{R}_{\Delta}\) is a small hypercube or hypersphere, then as \(\Delta\) converges to zero, \(p_{\Delta}\) approaches the empirical distribution of Eq. 6. This loss therefore makes it possible to analyze the visual pre-training effect while narrowing the support of the distribution around a single image.
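Concretely, when \(\mathcal{R}_{\Delta} = [-\Delta/2, \Delta/2]^{d}\) is a hypercube, this limit can be written out as

\[\begin{align} \lim_{\Delta \to 0} p_{\Delta}(x, y) = \lim_{\Delta \to 0} \frac{1}{\Delta^{d}} \int_{[-\Delta/2, \Delta/2]^{d}} \delta(x - I_{\mathbf{\epsilon}}) \delta(y - \mathbf{\epsilon}) d \mathbf{\epsilon} = \delta(x - I) \delta(y), \end{align} \]

since \(I_{\mathbf{\epsilon}} \to I_{\mathbf{0}} = I\) as \(\mathbf{\epsilon} \to \mathbf{0}\), recovering the single-image empirical distribution.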

  In 1p-frac, the affine transformations are perturbed, so the perturbation \(\mathbf{\epsilon} = (\mathbf{\epsilon}_{1}, \ldots, \mathbf{\epsilon}_{N})\) lies in \(\mathbb{R}^{6N}\). A perturbed image \(I_{\mathbf{\epsilon}}\) is obtained via the affine transformations with added noise:

\[\begin{align} w_{j} \left(\mathbf{v}; \mathbf{\epsilon}_j \right) = \left( \begin{bmatrix} a_{j} & b_{j} & e_{j}\\ c_{j} & d_{j} & f_{j} \end{bmatrix} + \mathbf{\epsilon}_j \right) \begin{bmatrix} \mathbf{v}\\ 1 \end{bmatrix} \end{align} \]

  where each \(\mathbf{\epsilon}_j\) is a \(2 \times 3\) perturbation matrix, and \(\mathcal{R}_{\Delta} = [-\Delta/2, \Delta/2]^{6N} \subset \mathbb{R}^{6N}\) is a hypercube of side length \(\Delta\) with volume \(\left|\mathcal{R}_{\Delta}\right| = \Delta^{6N}\). Note that in practice the integral in Eq. 7 is approximated numerically by uniformly sampling \(L\) points in \(\mathcal{R}_{\Delta}\), with the default setting \(L = 1{,}000\).
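This numerical approximation amounts to drawing \(L\) perturbations uniformly from the hypercube and adding them to the affine parameters; a minimal sketch (array layout and names are our assumptions, not the authors' code):

```python
import numpy as np

def sample_perturbations(N, delta, L, seed=0):
    """Draw L perturbations uniformly from the hypercube [-delta/2, delta/2]^(6N).

    Each perturbation epsilon consists of N noise matrices of shape (2, 3),
    one per affine transformation of the single IFS.
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(-delta / 2, delta / 2, size=(L, N, 2, 3))

def perturb_params(params, eps):
    """Add one sampled perturbation to the (N, 2, 3) affine parameters."""
    return params + eps

# One IFS with N = 3 affine maps; the L sampled perturbations act as L classes.
params = np.zeros((3, 2, 3))          # placeholder [a_j b_j e_j; c_j d_j f_j]
eps = sample_perturbations(N=3, delta=0.1, L=1000)
perturbed = perturb_params(params, eps[0])
```

Each of the \(L\) sampled perturbations yields one perturbed image, and the network is trained to identify which perturbation produced a given image.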

  • Visualization

  Figure 3b shows examples of the perturbed images used to compute the LPCE loss. While most of the shape differences are indistinguishable to humans, by minimizing the LPCE loss the neural network learns to distinguish the perturbations applied to a single image.

  • Complexity of \(\Omega\)

  The complexity of an IFS is evaluated using the factor \(\sigma\). As shown in Figure 3c, smaller \(\sigma\) values produce more complex fractal shapes.

Experiments



