Paper: Agglomerative Token Clustering
- Paper address: /abs/2409.11923
- Paper code: /JoakimHaurum/ATC
Innovation points
- Proposes Agglomerative Token Clustering (ATC), a novel parameter-free token reduction method based on hierarchical merging of tokens.
- ATC sets a new state of the art on image classification, image synthesis, and object detection and segmentation tasks, surpassing all other token reduction methods, both merging-based and pruning-based.
- On image classification and object detection and segmentation, ATC can be applied without any fine-tuning (i.e., out of the box) and still matches the performance of the previous, fine-tuned state of the art.
Content overview
Agglomerative Token Clustering (ATC) is a new token merging method that consistently outperforms previous token merging and pruning methods on image classification, image synthesis, and object detection and segmentation tasks. ATC merges tokens via bottom-up hierarchical clustering without introducing any additional learnable parameters.
ATC achieves state-of-the-art performance on all of these tasks, and even without fine-tuning it rivals the previous fine-tuned state of the art. ATC is particularly effective at low keep rates, a setting where only a small fraction of the tokens is retained and maintaining task performance is especially difficult.
Agglomerative token clustering
Like previous token merging methods, ATC aims to merge redundant tokens while maintaining or improving the performance of the ViT model. The token merging operation is inserted between the self-attention and multilayer perceptron (MLP) modules of each ViT block, consistent with previous merging-based approaches such as ToMe.
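To make the placement concrete, here is a minimal sketch of a ViT block with a reduction step between self-attention and the MLP. The class name `ViTBlockWithReduction` and the `reduce_fn` hook are illustrative assumptions, not the paper's code; `reduce_fn` stands in for ATC's clustering-based merge.

```python
import torch
import torch.nn as nn

class ViTBlockWithReduction(nn.Module):
    """Hypothetical sketch: token reduction sits between the
    self-attention and MLP modules, as in ToMe-style methods."""
    def __init__(self, dim, num_heads, reduce_fn):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.reduce_fn = reduce_fn  # stand-in for the ATC merge step

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = self.reduce_fn(x)  # token reduction between attention and MLP
        return x + self.mlp(self.norm2(x))

# Example: a stand-in reduction that simply keeps the first half of the tokens.
def keep_half(x):
    return x[:, : x.shape[1] // 2, :]

block = ViTBlockWithReduction(dim=16, num_heads=4, reduce_fn=keep_half)
out = block(torch.randn(2, 8, 16))  # 8 tokens in, 4 tokens out
```

Because reduction happens inside every block, the sequence length (and thus compute) shrinks progressively through the network.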
Agglomerative clustering is a classical bottom-up hierarchical clustering method in which each element initially forms its own cluster. Clusters are compared iteratively using a linkage function based on a distance metric \(D(\cdot)\), and the two closest clusters are merged at each iteration. This continues until a stopping criterion is met, such as reaching a required number of clusters (yielding a static reduction method) or a minimum distance between clusters (yielding a dynamic reduction method).
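The procedure above can be sketched with SciPy's standard agglomerative clustering utilities on toy token features (the data here is random and purely illustrative, not the paper's setup):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "token" features: 8 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))

# Bottom-up agglomerative clustering: every token starts as its own
# cluster, and the two closest clusters (by average linkage over
# cosine distance) are merged at each step.
Z = linkage(tokens, method="average", metric="cosine")

# Static reduction: cut the merge tree when k clusters remain.
k = 3
labels = fcluster(Z, t=k, criterion="maxclust")
print(labels)  # cluster id per token, k distinct ids
```

Swapping `criterion="maxclust"` for `criterion="distance"` with a distance threshold would give the dynamic variant mentioned above.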
The paper considers the static reduction setting, using cosine distance as the distance metric \(D(\cdot)\) and the keys of the self-attention module as token features. The choice of linkage function can have a large impact on how elements are clustered; the three most common linkage functions are single, complete, and average. Here \(I\) and \(J\) denote clusters containing elements \(i \in I\) and \(j \in J\).
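The definitions of the three linkage functions appear to have been lost from the text; the standard forms from the clustering literature, stated in terms of the pairwise distance \(D(\cdot)\) and clusters \(I\), \(J\), are:

```latex
\begin{aligned}
D_{\text{single}}(I, J)   &= \min_{i \in I,\, j \in J} D(i, j) \\
D_{\text{complete}}(I, J) &= \max_{i \in I,\, j \in J} D(i, j) \\
D_{\text{average}}(I, J)  &= \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J} D(i, j)
\end{aligned}
```

Single linkage merges on the closest pair across clusters, complete linkage on the farthest pair, and average linkage on the mean pairwise distance.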
Once the stopping criterion is met, the tokens within each cluster are averaged to obtain an updated cluster representation. However, as tokens are merged, each one comes to represent more than one input image patch. To better exploit tokens that capture a larger spatial extent, a weighted average is used as the cluster representation, and proportional attention is used in the self-attention module.
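A minimal NumPy sketch of these two ideas follows. The function names and the single-head, un-batched shapes are illustrative assumptions; the log-size bias in the attention logits is the proportional-attention formulation used by ToMe-style methods, which this paragraph follows.

```python
import numpy as np

def merge_by_cluster(tokens, sizes, labels):
    """Size-weighted average of token features within each cluster.

    tokens: (N, d) features; sizes: (N,) patches each token already
    represents; labels: (N,) cluster ids. Returns merged tokens and
    their updated sizes (sum of member sizes).
    """
    ids = np.unique(labels)
    merged = np.stack([
        np.average(tokens[labels == c], axis=0, weights=sizes[labels == c])
        for c in ids
    ])
    new_sizes = np.array([sizes[labels == c].sum() for c in ids])
    return merged, new_sizes

def proportional_attention(q, k, sizes):
    """Single-head attention with a log(size) bias on the logits, so a
    merged token counts in proportion to the patches it represents."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + np.log(sizes)[None, :]
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    return weights / weights.sum(-1, keepdims=True)

# Tiny example: 4 tokens merged into 2 clusters.
tokens = np.arange(8.0).reshape(4, 2)
sizes = np.array([1.0, 1.0, 2.0, 1.0])
labels = np.array([0, 0, 1, 1])
merged, merged_sizes = merge_by_cluster(tokens, sizes, labels)
attn = proportional_attention(merged, merged, merged_sizes)
```

Tokens that already cover a larger spatial extent thus both dominate the weighted average of their cluster and receive proportionally more attention mass downstream.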
Main experiments
If this article helped you, please give it a like or a "looking"~
For more content, please follow the WeChat public account [Xiaofei's Algorithm Engineering Notes].