Vision Transformer (ViT) architectures have traditionally tokenized images on a regular grid without considering semantic content. The paper proposes a modular, superpixel-based, non-regular tokenization strategy that decouples tokenization from feature extraction, in contrast to current approaches that treat the two as an inseparable whole. Using online content-aware tokenization together with scale- and shape-invariant positional embeddings, the method is benchmarked against patch-based tokenization and random partitioning. It demonstrates significant improvements in attribution fidelity and provides pixel-level granularity on a zero-shot unsupervised dense prediction task, while maintaining predictive performance on classification tasks.
Paper: A Spitting Image: Modular Superpixel Tokenization in Vision Transformers
- Paper address: /abs/2408.07680
- Code: /dsb-ifi/SPiT
Introduction
Following convolutional architectures, Vision Transformers (ViTs) have become the focus of visual tasks. In the original language-model Transformer, tokenization is a crucial preprocessing step aimed at optimally segmenting the data according to a predetermined entropy metric. When the model was adapted to visual tasks, tokenization was simplified to cutting images into square patches. This approach proved effective and quickly became a standardized, integral part of the architecture.
Despite this apparent success, the paper argues that patch-based tokenization has inherent limitations. First, token scale is tightly bound to the model architecture through a fixed patch size, ignoring redundancy in the original image. This leads to a significant increase in computation at higher resolutions, since complexity and memory grow quadratically with the number of tokens. In addition, regular partitioning assumes that semantic content is homogeneously distributed, which effectively reduces spatial resolution.
Subsequently, several studies have used attention maps to visualize class-label attribution and improve interpretability, often in dense prediction tasks. However, attention maps produced under square partitioning suffer a loss of resolution in the patch representation and do not inherently recover the resolution of the original image; dense prediction at pixel-level granularity then requires a separate decoder for upsampling.
Motivation
The paper takes a step back from the original ViT architecture and re-evaluates the role of patch-based tokenization. By focusing on this neglected component, it assigns image segmentation the role of an adaptive, modular tokenizer, a potential that remains underutilized in ViTs.
In contrast to square partitioning, superpixels offer an opportunity to mitigate the shortcomings of patch-based tokenization: they adapt in scale and shape while exploiting the inherent redundancy of visual data. Superpixels have been shown to align better with the semantic structure of images, which provides a basis for their potential use in Transformer architectures. The paper compares the classical square tokenization of the standard ViT with superpixel tokenization (SPiT), using randomized Voronoi tokenization (RViT) as a control; the latter was selected because Voronoi diagrams are explicitly defined mathematical objects for tessellating the plane. The three tokenization schemes are illustrated in Figure 1.
Contributions
The research is driven by three specific questions: (a) Is strict adherence to square patches necessary? (b) How does irregular partitioning affect the tokenized representation? (c) Can the tokenization scheme be designed as a modular component of a vision model?
After experimental validation, the paper reaches the following conclusions:
- Generalized framework: Superpixel tokenization is promoted as a modular scheme in ViTs, yielding a richer space of Transformer models for vision tasks in which the Transformer backbone is independent of the tokenization framework.
- Efficient tokenization: An efficient online tokenization method is proposed that is competitive in training and inference time while performing well on classification tasks.
- Refined spatial resolution: Superpixel tokenization provides semantically aligned tokens with pixel-level granularity. The method yields more salient attributions and performs well on unsupervised segmentation compared with existing interpretability methods.
- Visual tokenization: The main contribution is a novel way of thinking about tokenization in ViTs, a neglected yet central component of the modeling process.
The main objective of the paper is to assess tokenization schemes for ViTs, emphasizing the intrinsic properties of the different tokenization methods. For a fair comparative analysis, the underlying ViT architecture and established training protocols are held fixed. The paper therefore designs experiments to ensure fair comparison with well-known baselines and without architectural optimization. Such controlled comparisons are essential for attributing observed differences to the tokenization strategy and for eliminating confounding factors introduced by specific architectures or training protocols.
Methodology
To evaluate and compare different tokenization strategies, images must be segmented and meaningful features extracted from those segments. Although a variety of deep architectures could accomplish these tasks, they would add a layer of complexity to the final model that invalidates any attempt to directly compare tokenization strategies, and they can complicate effective transfer learning between architectures. The paper therefore constructs an efficient heuristic superpixel tokenizer and proposes a non-invasive feature extraction method consistent with the classical ViT architecture, enabling direct comparison.
- Notation

Let \(H \times W = \big\{(y, x) : 1 \leq y \leq h, 1 \leq x \leq w\big\}\) denote the coordinate space of an image of dimensions \((h, w)\), and let \(\mathcal I\) be an index set with a bijective map \(i \mapsto (y, x)\). An image with \(C\) channels is regarded as a signal \(\xi\colon \mathcal I \to \mathbb R^C\). Define the vectorization operator \(\mathrm{vec}\colon \mathbb{R}^{d_1 \times \dots \times d_n} \to \mathbb{R}^{d_1 \cdots d_n}\), and let \(f(g(x)) = (f \circ g)(x)\) denote function composition.
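In code, the index bijection and the vectorization operator amount to row-major reshaping; a minimal NumPy illustration (all names here are my own):

```python
import numpy as np

h, w, C = 4, 6, 3

# Index set I = {0, ..., h*w - 1} with the bijection i -> (y, x), row-major
# (0-based coordinates for convenience).
def index_to_coord(i, w=w):
    return divmod(i, w)

# An image as a signal xi: I -> R^C, stored as an (h*w, C) array.
rng = np.random.default_rng(0)
image = rng.random((h, w, C))
xi = image.reshape(h * w, C)

# vec: R^{d1 x ... x dn} -> R^{d1*...*dn}
def vec(x):
    return x.reshape(-1)

flat = vec(image)
```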
Framework
The paper generalizes the classical ViT architecture by allowing modular tokenizers and different feature extraction methods. Notably, the classical ViT is often presented as a three-part system comprising a token embedder \(g\), a backbone of attention blocks \(f\), and a subsequent predictor head \(h\). In fact, the patch embedding module can itself be rewritten as a modular system of three parts: a tokenizer \(\tau\), a feature extractor \(\phi\), and an embedder \(\gamma\), such that \(g = \gamma \circ \phi \circ \tau\). These components are inherent to the original architecture but are masked by the simplified tokenization strategy. This yields a more complete assessment of the model as a five-part system.
Here \(\theta\) denotes the set of learnable parameters of the model. In the standard ViT, the tokenizer \(\tau\) segments the image into square regions of fixed size. Since these patches have uniform dimensions and ordering, they directly provide vectorized features, so in the standard ViT architecture \(\phi = \mathrm{vec}\). The embedder \(\gamma\) is usually a learnable linear layer mapping features to the embedding dimension of the particular architecture. Equivalently, \(g\) can be viewed as a convolution whose kernel size and stride both equal the desired patch size \(\rho\).
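The decomposition \(g = \gamma \circ \phi \circ \tau\) for the standard ViT can be sketched concretely; a minimal NumPy version, where a random matrix stands in for the learnable projection \(L_\theta\) (function names are my own):

```python
import numpy as np

rho, D = 4, 8            # patch size and embedding dimension
H, W, C = 8, 8, 3
rng = np.random.default_rng(0)
image = rng.random((H, W, C))

def tau(img, rho):
    """Tokenizer: split the image into non-overlapping rho x rho patches."""
    H, W, C = img.shape
    patches = img.reshape(H // rho, rho, W // rho, rho, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, rho, rho, C)

def phi(patches):
    """Feature extractor: phi = vec, flatten each patch."""
    return patches.reshape(patches.shape[0], -1)

# Embedder gamma: a learnable linear layer (random weights stand in for theta).
W_embed = rng.standard_normal((rho * rho * C, D))
def gamma(feats):
    return feats @ W_embed

tokens = gamma(phi(tau(image, rho)))   # g = gamma . phi . tau
```

The same pipeline is what a strided convolution with kernel size and stride \(\rho\) computes in one step.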
Partitioning and Tokenization
Tokenization in language tasks partitions text into optimally informative tokens; the analogue for spatial data is superpixel partitioning into discrete connected regions. Hierarchical superpixels form a highly parallelizable graph-based method suitable for online tokenization. Building on this, the paper proposes a new method in which, at each step \(t\), fully parallel aggregation is performed over a batch of image graphs, with additional regularization of size and compactness. A different number of superpixels is generated at each step, dynamically adapting to the complexity of the image.
- Superpixel Graphs

Let \(E^{(0)} \subset \mathcal I \times \mathcal I\) denote the four-connectivity neighbor edges under \(H \times W\). A superpixel is a set \(S \subset \mathcal I\); \(S\) is connected if for any two pixels \(p\) and \(q\) in \(S\) there exists a sequence of edges \(\big((i_j, i_{j+1}) \in E^{(0)}\big)_{j=1}^{k-1}\) with \(i_1 = p\) and \(i_k = q\). A set of superpixels \(\pi\) forms a partition of the image if any two distinct superpixels \(S, S' \in \pi\) are disjoint, \(S \cap S' = \emptyset\), and the union of all superpixels equals the set of all pixel positions, i.e. \(\bigcup_{S \in \pi} S = \mathcal I\).

Let \(\Pi(\mathcal I) \subset 2^{2^{\mathcal I}}\) denote the space of all partitions of the image, and consider a sequence of partitions \((\pi^{(t)})_{t=0}^T\). If for every superpixel \(S \in \pi^{(t)}\) there exists a superpixel \(S' \in \pi^{(t+1)}\) with \(S \subseteq S'\), the partition \(\pi^{(t)}\) is a refinement of \(\pi^{(t+1)}\), written \(\pi^{(t)} \sqsubseteq \pi^{(t+1)}\). The goal is to construct a pixel-indexed hierarchical partitioning of depth \(T\), \(\mathcal H = \big( \pi^{(t)} \in \Pi(\mathcal I) : \pi^{(t)} \sqsubseteq \pi^{(t+1)} \big)_{t=0}^T\), such that every superpixel is connected.

To construct \(\mathcal H\), vertices are progressively joined by parallel edge contraction (replacing multiple vertices with a single vertex; inner edges of the replaced vertices are removed and outer edges are inherited) to update the partition \(\pi^{(t)} \mapsto \pi^{(t+1)}\). Each level is represented as a graph \(G^{(t)} = (V^{(t)}, E^{(t)})\), where each vertex \(v \in V^{(t)}\) is the index of a superpixel in \(\pi^{(t)}\) and each edge \((u, v) \in E^{(t)}\) represents adjacent superpixels at level \(t = 0, \dots, T\) of the hierarchy. The initial image is thus represented as a grid graph \(G^{(0)} = (V^{(0)}, E^{(0)})\) with the single-pixel partition \(\pi^{(0)} = \big\{\{i\} : i \in \mathcal I \big\}\).
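The initial grid graph \(G^{(0)}\) with four-connectivity edges and the single-pixel partition \(\pi^{(0)}\) can be built directly; a small NumPy sketch (names mine):

```python
import numpy as np

def grid_graph(h, w):
    """Build G^(0): vertices are pixel indices; E^(0) holds the
    4-connectivity (right and down) neighbor pairs."""
    idx = np.arange(h * w).reshape(h, w)
    horiz = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1)
    vert = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1)
    E0 = np.concatenate([horiz, vert], axis=0)
    # Single-pixel partition pi^(0) = {{i} : i in I}.
    pi0 = [{i} for i in range(h * w)]
    return E0, pi0

E0, pi0 = grid_graph(3, 4)
```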
- Weight function

To apply edge contraction, an edge weight function \(w_\xi^{(t)}\colon E^{(t)} \to \mathbb R\) is defined. Self-loops in the graph are retained (a superpixel may point to itself; when a self-loop's weight exceeds those of its neighboring edges, no new pixels are merged into that region at this step), and the variance of region sizes is constrained by weighting self-loop edges by relative size as a regularizer. For non-self-loop edges, the mean feature \(\mu_\xi^{(t)}(v) = \sum_{i \in \pi^{(t)}_v} \xi(i) / \lvert \pi^{(t)}_v \rvert\) is used together with a similarity function \(\mathrm{sim}\colon E^{(t)} \to \mathbb{R}\) as the weight. Self-loop weights are computed from the mean region size \(\mu^{(t)}_{\lvert \pi \rvert}\) and the standard deviation of region sizes \(\sigma^{(t)}_{\lvert \pi \rvert}\) at level \(t\), and the overall weight combines these terms.

Compactness can additionally be tuned by computing an infinity-norm density, where \(\mathrm{per}_\infty\) is the perimeter of the bounding box enclosing superpixels \(u\) and \(v\). This emphasizes how tightly two neighboring superpixels \(u\) and \(v\) fill their bounding box, yielding a regularized weight function with a compactness hyperparameter \(\lambda \in [0,1]\).
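The per-region statistics entering these weights vectorize cleanly; a minimal NumPy sketch of the mean features \(\mu_\xi^{(t)}\) and region sizes, with cosine similarity standing in for \(\mathrm{sim}\) (the paper leaves the similarity function as a design choice; names are mine):

```python
import numpy as np

def region_stats(xi, labels, n_regions):
    """Mean feature mu_xi(v) and size |pi_v| for each superpixel.

    xi: (n_pixels, C) pixel features; labels: (n_pixels,) superpixel index.
    """
    C = xi.shape[1]
    sums = np.zeros((n_regions, C))
    np.add.at(sums, labels, xi)            # scatter-add features per region
    sizes = np.bincount(labels, minlength=n_regions)
    return sums / sizes[:, None], sizes

def edge_similarity(mu, u, v):
    """Weight for a non-self-loop edge (u, v); cosine similarity is an
    assumption here, not necessarily the paper's choice of sim(.)."""
    a, b = mu[u], mu[v]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```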
- Update rule

Edge contraction uses a greedy parallel update rule in which each superpixel is joined to the neighboring superpixel with the highest edge weight, self-loops included. For all \(G^{(t)}\) with \(t \geq 1\), let \(\mathfrak{N}^{(t)}(v)\) denote the vertices adjacent to the superpixel indexed by \(v\) at level \(t\); an intermediate edge set \(\hat E^{(t)}\) is constructed by keeping, for each vertex, its highest-weight incident edge.

The transitive closure \(\hat E_+^{(t)}\) of \(\hat E^{(t)}\) (a transitive closure adds all relations implied by transitivity: from A→B and B→C one derives A→C; here it yields the connected components of \(\hat E^{(t)}\)) can then be computed explicitly as a mapping \(V^{(t)} \mapsto V^{(t+1)}\), where \(\hat{\mathfrak{N}}_+^{(t)}(v)\) denotes the connected component of vertex \(v\) in \(\hat E_+^{(t)}\). This update rule ensures that each partition at level \((t+1)\) is a connected region, since it is formed by merging neighboring superpixels with the highest edge weights, as shown in Figure 3.
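The update rule above can be sketched as follows: keep each vertex's best incident edge, then collapse connected components of the kept edges (a union-find stands in for the transitive closure; names are mine):

```python
import numpy as np

def contract(edges, weights, n_vertices):
    """Greedy parallel edge contraction: each vertex keeps its highest-weight
    incident edge (self-loops allowed); connected components of the kept
    edges give the mapping V^(t) -> V^(t+1) as a relabeling."""
    best_w = np.full(n_vertices, -np.inf)
    best_to = np.arange(n_vertices)
    for (u, v), w in zip(edges, weights):
        if w > best_w[u]:
            best_w[u], best_to[u] = w, v
        if w > best_w[v]:
            best_w[v], best_to[v] = w, u

    # Connected components of the kept edges via union-find
    # (explicit transitive closure of E-hat^(t)).
    parent = list(range(n_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u in range(n_vertices):
        parent[find(u)] = find(best_to[u])

    roots = [find(u) for u in range(n_vertices)]
    relabel = {r: i for i, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

For a path 0-1-2-3 with weights 1.0, 0.5, 2.0, vertices 0 and 1 pick each other, as do 2 and 3, giving two merged regions.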
- Iterative refinement

The steps of computing aggregated means, regularized edge weights, and edge contractions are repeated until the desired number of levels \(T\) is reached. At each level the partition becomes coarser, representing larger homogeneous regions of the image. The hierarchy provides a multi-scale representation, capturing both local and global structure. After \(T\) levels, one obtains a sequence of partitions \((\pi^{(t)})_{t=0}^T\) in which every superpixel at every level \(t\) is a connected region and \(\pi^{(t)} \sqsubseteq \pi^{(t+1)}\) for all \(t\).

To relate the hierarchy to the classical ViT tokenizer, the paper examines how the number of levels \(T\) and the patch size \(\rho\) each determine the number of tokens produced. Let \(N_\mathrm{SPiT}\) and \(N_\mathrm{ViT}\) denote the token counts of the SPiT and ViT tokenizers respectively; the relation is \(\mathbb{E}(T \mid N_\mathrm{SPiT} = N_\mathrm{ViT}) = \log_2 \rho\), regardless of image size.
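One way to see why this relation is plausible (my own simplifying assumption, not the paper's derivation): if each parallel contraction step roughly quadruples the mean region size in 2D, then matching a \(\rho \times \rho\) patch of \(\rho^2\) pixels requires \(4^T = \rho^2\), i.e. \(T = \log_2 \rho\):

```python
import math

def expected_depth(rho):
    """Depth T at which the average superpixel matches a rho x rho patch,
    assuming each contraction step roughly quadruples mean region size:
    4**T = rho**2  =>  T = log2(rho)."""
    return math.log2(rho)
```

For the common patch size \(\rho = 16\) this predicts about four contraction levels.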
Feature Extraction with Irregular Patches
Although the choice of square patches in the ViT architecture was motivated by simplicity, it sidesteps the challenges posed by the alternatives. Irregular patches are unaligned, exhibit varying shapes and dimensions, and are often non-convex. These factors make it nontrivial to embed irregular patches into a common inner product space. Beyond maintaining consistency and uniform dimensionality, the paper proposes the minimal set of attributes any such feature must capture: color, texture, shape, scale, and position.
- Positional Encoding

ViTs typically use a learnable positional embedding for each patch in the image grid. The paper notes that this corresponds to a positional histogram of a downsampled image, and that positional embeddings can be extended to handle more complex shapes, scales, and positions by a kernelization method: for each superpixel \(S_n\) of the partition, \(n = 1, \dots, N\), a joint histogram is applied over its coordinates. First, positions are normalized so that all \((y', x') \in S_n\) fall in the range \([-1, 1]^2\). With \(\beta\) a fixed feature dimension in each spatial direction, the features are extracted with a Gaussian kernel \(K_\sigma\).

Typically the bandwidth \(\sigma\) takes a low value in the range \([0.01, 0.05]\). This effectively encodes the position of the patch in the image, as well as its shape and scale.
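A minimal sketch of such a kernelized joint histogram in NumPy (my reading of the construction: coordinates are normalized image-wide so that position, shape, and scale are all reflected; function name and the exact normalization are assumptions):

```python
import numpy as np

def positional_histogram(coords, image_hw, beta=4, sigma=0.02):
    """Joint histogram positional feature for one superpixel.

    coords: (n, 2) array of (y, x) pixel coordinates of the superpixel.
    Returns a normalized vector in R^{beta^2}.
    """
    h, w = image_hw
    coords = np.asarray(coords, dtype=float)
    # Normalize image coordinates into [-1, 1]^2 (image-wide).
    norm = 2.0 * coords / (np.array([h, w]) - 1.0) - 1.0
    # beta Gaussian-kernel bin centers per spatial direction.
    centers = np.linspace(-1.0, 1.0, beta)
    ky = np.exp(-((norm[:, 0:1] - centers) ** 2) / (2 * sigma ** 2))
    kx = np.exp(-((norm[:, 1:2] - centers) ** 2) / (2 * sigma ** 2))
    # Outer product per pixel, summed over the superpixel.
    hist = (ky[:, :, None] * kx[:, None, :]).sum(axis=0)
    return (hist / (hist.sum() + 1e-12)).ravel()
```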
- Color Features

To encode intensity information from the raw pixel data, the bounding box of each patch is interpolated to a fixed resolution \(\beta \times \beta\) using bilinear interpolation, while masking out pixels belonging to surrounding patches. These features capture the original pixel information of the patch, resampled and scaled to a uniform dimension; the corresponding feature extractor \(\phi\) is called the interpolating feature extractor. The RGB features are also normalized to \([-1, 1]\) and vectorized, so that \(\hat\xi^{\text{col}} \in \mathbb{R}^{3\beta^2}\).
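A sketch of this masked bounding-box resampling in NumPy (a hand-rolled bilinear interpolation; the function name and the zero-fill for masked-out pixels are my assumptions):

```python
import numpy as np

def interpolated_color_features(image, mask, beta=4):
    """Resample a superpixel's masked bounding box to beta x beta (bilinear).

    image: (H, W, 3) array in [0, 1]; mask: (H, W) bool for one superpixel.
    Returns a vector in R^{3 * beta^2}, normalized to [-1, 1].
    """
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    # Zero out pixels belonging to other superpixels, then crop the box.
    crop = np.where(mask[..., None], image, 0.0)[y0:y1 + 1, x0:x1 + 1]

    h, w = crop.shape[:2]
    gy = np.linspace(0, h - 1, beta)
    gx = np.linspace(0, w - 1, beta)
    fy, fx = np.floor(gy).astype(int), np.floor(gx).astype(int)
    cy, cx = np.minimum(fy + 1, h - 1), np.minimum(fx + 1, w - 1)
    wy, wx = (gy - fy)[:, None, None], (gx - fx)[None, :, None]

    # Bilinear blend of the four neighboring samples.
    top = (1 - wx) * crop[fy][:, fx] + wx * crop[fy][:, cx]
    bot = (1 - wx) * crop[cy][:, fx] + wx * crop[cy][:, cx]
    out = (1 - wy) * top + wy * bot
    return (2.0 * out - 1.0).ravel()
```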
- Texture Features

Gradient operators provide a simple and robust method for extracting texture information. Owing to its improved rotational symmetry and lower discretization error, the paper chooses the Scharr gradient operator. The operator is normalized so that \(\nabla \xi \in [-1, 1]^{H \times W \times 2}\), where the last dimension corresponds to the gradient directions \(\nabla y\) and \(\nabla x\). As with the positional features, a Gaussian kernel is applied to the gradients within each superpixel \(S_n\) to construct a joint histogram, giving \(\hat\xi^{\text{grad}}_n \in \mathbb{R}^{\beta^2}\).

The final features are concatenated as \(\hat\xi_n = [\hat\xi^{\text{col}}_n, \hat\xi^{\text{pos}}_n, \hat\xi^{\text{grad}}_n] \in \mathbb{R}^{5\beta^2}\). Although gradient information is implicit in the raw pixels that the standard ViT architecture consumes, the explicit gradient features represent an additional dimension of information, so the paper evaluates the effect of including or omitting them. For models that omit these features, i.e. \(\hat\xi_n \setminus \hat\xi^{\text{grad}}_n = [\hat\xi^{\text{col}}_n, \hat\xi^{\text{pos}}_n] \in \mathbb{R}^{4\beta^2}\), the extractor \(\phi\) is called the gradient-free extractor.
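A minimal sketch of normalized Scharr gradients in NumPy (valid-mode correlation; the 1/16 scaling keeps outputs in \([-1, 1]\) for inputs in \([0, 1]\); names are mine):

```python
import numpy as np

# Scharr kernels; dividing by 16 bounds responses in [-1, 1] for [0, 1] input.
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=float) / 16.0
SCHARR_Y = SCHARR_X.T

def scharr_gradients(gray):
    """Normalized Scharr gradients (grad_y, grad_x) of a grayscale image.

    gray: (H, W) array in [0, 1]; returns shape (H-2, W-2, 2).
    """
    # Sliding 3x3 windows, correlated with each kernel.
    win = np.lib.stride_tricks.sliding_window_view(gray, (3, 3))
    gy = (win * SCHARR_Y).sum(axis=(-1, -2))
    gx = (win * SCHARR_X).sum(axis=(-1, -2))
    return np.stack([gy, gx], axis=-1)
```

A vertical step edge from 0 to 1 produces a maximal \(\nabla x\) response of 1 and zero \(\nabla y\) response.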
Generalization of Canonical ViT
By design, the framework is a generalization of standard ViT tokenization: it is equivalent to it when using a fixed patch size \(\rho\) and the standard patch embedder with interpolated, gradient-free feature extraction.

Let \(\tau^*\) denote the ViT tokenizer with fixed patch size \(\rho\), \(\phi\) the interpolating feature extractor excluding gradients, and \(\gamma^*\) and \(\gamma\) embedding layers with equivalent linear projections, \(L^*_\theta = L_\theta\). Let \(\hat\xi^{\text{pos}} \in \mathbb{R}^{N \times \beta^2}\) denote the joint-histogram positional embedding matrix under the partition induced by \(\tau^*\). Then, for dimensions \(H = W = \beta^2 = \rho^2\), the embedding given by \(\gamma \circ \phi \circ \tau^*\) is numerically equivalent to the standard ViT embedding given by \(\gamma^* \circ \phi^* \circ \tau^*\).
Experiments and Results
For more content, follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].