Paper: Agglomerative Token Clustering
- Paper address: /abs/2409.11923
- Paper code: /JoakimHaurum/ATC
Innovation points
- Proposes Agglomerative Token Clustering (ATC), a novel parameter-free token reduction method based on hierarchical merging of tokens.
- ATC sets a new state of the art on image classification, image synthesis, and object detection and segmentation tasks, surpassing all other token reduction methods, both merging-based and pruning-based.
- On image classification and object detection and segmentation, ATC can be applied without any fine-tuning (i.e., out of the box) and still matches the performance of the previous, fine-tuned state of the art.
Content overview
Agglomerative Token Clustering (ATC) is a new token merging method that consistently outperforms previous token merging and pruning methods on image classification, image synthesis, and object detection and segmentation tasks. ATC merges tokens via bottom-up hierarchical clustering without introducing any additional learnable parameters.
ATC achieves state-of-the-art performance on all of these tasks, and even without fine-tuning it rivals the previous fine-tuned state of the art. ATC is particularly effective at low keep rates, a setting where only a small fraction of the tokens is retained and maintaining task performance is especially difficult.
Agglomerative token clustering
Like previous token merging methods, ATC aims to merge redundant tokens while maintaining or improving the performance of the ViT model. The token merging operation is inserted between the self-attention and multilayer perceptron (MLP) modules of each ViT block, consistent with previous merging-based approaches such as ToMe.
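To make the placement concrete, here is a minimal sketch of a ViT block with a reduction step between self-attention and the MLP. The class name `ViTBlockWithReduction` and the `reduce_fn` hook are illustrative assumptions, not the paper's code; `reduce_fn` stands in for ATC's clustering-based merge.

```python
import torch
import torch.nn as nn

class ViTBlockWithReduction(nn.Module):
    """Hypothetical sketch: token reduction sits between the
    self-attention and MLP modules, as in ToMe-style methods."""
    def __init__(self, dim, num_heads, reduce_fn):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.reduce_fn = reduce_fn  # stand-in for the ATC merge step

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = self.reduce_fn(x)  # token reduction between attention and MLP
        return x + self.mlp(self.norm2(x))

# Example: a stand-in reduction that simply keeps the first half of the tokens.
def keep_half(x):
    return x[:, : x.shape[1] // 2, :]

block = ViTBlockWithReduction(dim=16, num_heads=4, reduce_fn=keep_half)
out = block(torch.randn(2, 8, 16))  # 8 tokens in, 4 tokens out
```

Because reduction happens inside every block, the sequence length (and thus compute) shrinks progressively through the network.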
Agglomerative clustering is a classical bottom-up hierarchical clustering method in which each element initially forms its own cluster. Clusters are compared iteratively using a linkage function based on a distance metric \(D(\cdot)\), and the two closest clusters are merged at each iteration. This continues until a stopping criterion is met, such as reaching a required number of clusters (yielding a static reduction method) or a minimum distance between clusters (yielding a dynamic reduction method).
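The procedure above can be sketched with SciPy's standard agglomerative clustering utilities on toy token features (the data here is random and purely illustrative, not the paper's setup):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "token" features: 8 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))

# Bottom-up agglomerative clustering: every token starts as its own
# cluster, and the two closest clusters (by average linkage over
# cosine distance) are merged at each step.
Z = linkage(tokens, method="average", metric="cosine")

# Static reduction: cut the merge tree when k clusters remain.
k = 3
labels = fcluster(Z, t=k, criterion="maxclust")
print(labels)  # cluster id per token, k distinct ids
```

Swapping `criterion="maxclust"` for `criterion="distance"` with a distance threshold would give the dynamic variant mentioned above.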
The paper considers the static reduction setting, using cosine distance as the distance metric \(D(\cdot)\) and the keys of the self-attention module as token features. The choice of linkage function can have a large impact on how elements are clustered; the three most common linkage functions are single, complete, and average. Here \(I\) and \(J\) denote clusters containing elements \(i \in I\) and \(j \in J\).
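The definitions of the three linkage functions appear to have been lost from the text; the standard forms from the clustering literature, stated in terms of the pairwise distance \(D(\cdot)\) and clusters \(I\), \(J\), are:

```latex
\begin{aligned}
D_{\text{single}}(I, J)   &= \min_{i \in I,\, j \in J} D(i, j) \\
D_{\text{complete}}(I, J) &= \max_{i \in I,\, j \in J} D(i, j) \\
D_{\text{average}}(I, J)  &= \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J} D(i, j)
\end{aligned}
```

Single linkage merges on the closest pair across clusters, complete linkage on the farthest pair, and average linkage on the mean pairwise distance.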
Once the stopping criterion is met, the tokens within each cluster are averaged to obtain an updated cluster representation. However, as tokens are merged, each one comes to represent more than one input image patch. To better exploit tokens that capture a larger spatial extent, a weighted average is used as the cluster representation, and proportional attention is used in the self-attention module.
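A minimal NumPy sketch of these two ideas follows. The function names and the single-head, un-batched shapes are illustrative assumptions; the log-size bias in the attention logits is the proportional-attention formulation used by ToMe-style methods, which this paragraph follows.

```python
import numpy as np

def merge_by_cluster(tokens, sizes, labels):
    """Size-weighted average of token features within each cluster.

    tokens: (N, d) features; sizes: (N,) patches each token already
    represents; labels: (N,) cluster ids. Returns merged tokens and
    their updated sizes (sum of member sizes).
    """
    ids = np.unique(labels)
    merged = np.stack([
        np.average(tokens[labels == c], axis=0, weights=sizes[labels == c])
        for c in ids
    ])
    new_sizes = np.array([sizes[labels == c].sum() for c in ids])
    return merged, new_sizes

def proportional_attention(q, k, sizes):
    """Single-head attention with a log(size) bias on the logits, so a
    merged token counts in proportion to the patches it represents."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(sizes)[None, :]
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    return weights / weights.sum(-1, keepdims=True)

# Tiny example: 4 tokens merged into 2 clusters.
tokens = np.arange(8.0).reshape(4, 2)
sizes = np.array([1.0, 1.0, 2.0, 1.0])
labels = np.array([0, 0, 1, 1])
merged, merged_sizes = merge_by_cluster(tokens, sizes, labels)
attn = proportional_attention(merged, merged, merged_sizes)
```

Tokens that already cover a larger spatial extent thus both dominate the weighted average of their cluster and receive proportionally more attention mass downstream.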
Main experiments
If this article helped you, please give it a like or a "looking"~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].