Post-training quantization (PTQ) for Vision Transformers (ViTs) has attracted much attention for its high efficiency in model compression. However, existing methods usually ignore the intricate interdependence between quantized weights and activations, leading to considerable quantization error. The paper proposes ERQ, a two-step PTQ method carefully designed to sequentially reduce the quantization errors arising from activation and weight quantization. ERQ first introduces Activation quantization error reduction (Aqer), which strategically formulates the minimization of the activation quantization error as a ridge regression problem and solves it by updating the full-precision weights. Subsequently, ERQ introduces Weight quantization error reduction (Wqer), which adopts an iterative approach to mitigate the quantization error caused by weight quantization. In each iteration, an empirically derived efficient proxy is used to refine the rounding directions of the quantized weights, combined with a ridge regression solver to reduce the weight quantization error. Experimental results demonstrate the effectiveness of the method. Notably, ERQ outperforms the state-of-the-art GPTQ by up to 22.36% in accuracy on W3A4 ViT-S.
Paper: ERQ: Error Reduction for Post-Training Quantization of Vision Transformers
- Paper Address: https://arxiv.org/abs/2407.06794
Introduction
Vision Transformers (ViTs) have posed a significant challenge to convolutional neural networks (CNNs) and established a new paradigm in computer vision. ViTs use the multi-head self-attention (MHSA) mechanism to capture long-range relationships between image patches, showing impressive progress on a variety of visual tasks.
However, great capability comes with considerable complexity. The inherent architectural complexity of ViTs results in high computational demands and sizable memory requirements, which pose challenges for deployment in resource-constrained environments. To alleviate this dilemma, model quantization has attracted sustained attention from both industry and academia. Quantization reduces model complexity by representing weights and activations with low-bit values, providing a promising avenue for efficient deployment. Recently, researchers have increasingly focused on post-training quantization (PTQ) of Vision Transformers, which aims to quantize a model at low cost using only a small calibration dataset.
To accommodate the unique structure of ViTs, many studies have explored various post-training quantization (PTQ) methods. For example, to handle the long-tailed post-Softmax activations, one study proposed \(\log 2/\log \sqrt{2}\) quantizers and twin uniform quantizers. To manage highly variable activations, some studies adopt reparameterization techniques and power-of-two factors. Other work uses evolutionary search to identify unstable scaling factors. However, existing methods usually ignore the intricate interdependence between weight and activation quantization, which leads to considerable quantization errors under weight-activation quantization.
The paper proposes ERQ, a two-step post-training quantization method tailored for ViTs, which aims to sequentially reduce the quantization errors induced by quantizing activations and weights. As shown in Figure 1, ERQ consists of two steps, namely Activation quantization error reduction (Aqer) and Weight quantization error reduction (Wqer). Aqer formulates the error caused by activation quantization as a ridge regression problem, which is solved in closed form by updating the weights. Subsequently, Wqer is introduced to reduce the error caused by weight quantization in an iterative quantize-and-correct manner. In particular, in each iteration the first half of the remaining full-precision weights is quantized, and the resulting quantization error is reduced by first performing Rounding Refinement and then solving a ridge regression problem again. The former derives an efficient proxy for the output error and uses it to refine the rounding directions of the quantized weights; the latter further reduces the quantization error by updating the remaining full-precision weights. This process continues until all weights are accurately quantized.
Extensive experiments on various ViT variants (ViT, DeiT, and Swin) and tasks (image classification, object detection, and instance segmentation) demonstrate the effectiveness of ERQ. Notably, on the image classification task, ERQ outperforms GPTQ by 22.36% on W3A4 ViT-S.
Method
The intertwined \(\delta{\mathbf{x}}\) and \(\delta\mathbf{W}\) make it challenging to find the optimal solution of Eq. 4. To make the problem tractable, Eq. 4 is relaxed into two sequential subproblems that minimize the errors from quantizing the activations and the weights, respectively. As shown in Figure 1, Activation quantization error reduction (Aqer) is performed first, followed by Weight quantization error reduction (Wqer).
Activation Quantization Error Reduction
To mitigate the error caused by activation quantization, Activation quantization error reduction (Aqer) is introduced, which formulates the error mitigation problem as a ridge regression problem. Specifically, the weights are kept at full precision, and only the mean squared error (MSE) induced by the activation quantization error \(\delta{\mathbf{x}}\) is considered:
To minimize Eq. 5, it is formulated as a ridge regression problem, where the minimization is achieved by adding an adjustment \(\delta\mathbf{W}^*\) to the weights \(\mathbf{W}\):
Here, \(\delta\mathbf{W}^*\) denotes the adjustment term computed by ridge regression, \(\bar{\mathbf{x}}=\mathbf{x}+\delta\mathbf{x}\) is the quantized input, \(\lambda_1\| \delta\mathbf{W}^* \|_2^2\) is the regularization term, and \(\lambda_1\) is a hyperparameter that controls the strength of the regularization. Eq. 6 constitutes a ridge regression problem. To minimize it, first compute its gradient with respect to \(\delta\mathbf{W}^*\):
Then, \(\delta\mathbf{W}^*\) is obtained by setting Eq. 7 to zero:
The regularization term \(\lambda_1 \mathbf{I}\) ensures that the inverse of \(\mathbb{E} \left[\bar{\mathbf{x}}\bar{\mathbf{x}}^T \right] + \lambda_1 \mathbf{I}\) always exists, which is crucial for numerical stability. In addition, it suppresses outliers, which mitigates overfitting and improves the generalization of the model. Suppressing outliers is also crucial for the subsequent weight quantization, since it limits the range of the weights; this restriction prevents quantization points from being allocated to uncovered regions and thus enhances the expressiveness of the quantization.
In practice, given a calibration dataset, \(\mathbb{E}\left[\delta{\mathbf{x}}\bar{\mathbf{x}}^T\right]\) and \(\mathbb{E}\left[\bar{\mathbf{x}}\bar{\mathbf{x}}^T \right]\) are estimated by \(\frac{1}{N}\sum_n^N \delta{\mathbf{x}}_n\bar{\mathbf{x}}_n^T\) and \(\frac{1}{N}\sum_n^N \bar{\mathbf{x}}_n\bar{\mathbf{x}}_n^T\), respectively. Here, \(N = B\times T \gg D_{in}^s\), where \(B\) is the size of the calibration dataset and \(T\) is the number of tokens per image. Note that \(\delta{\mathbf{x}}\) and \(\bar{\mathbf{x}}\) are determined once the input and the quantization parameters are given. After obtaining \(\delta\mathbf{W}^*\), it is merged into the network weights via \(\mathbf{W} = \mathbf{W} + \delta\mathbf{W}^*\). By doing so, the proposed Aqer explicitly absorbs the quantization error of the activations into the weights.
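To make the update concrete, below is a minimal PyTorch-style sketch of Aqer for one linear layer. It assumes the closed-form solution of Eq. 8 is \(\delta\mathbf{W}^* = -\mathbf{W}\,\mathbb{E}[\delta\mathbf{x}\bar{\mathbf{x}}^T](\mathbb{E}[\bar{\mathbf{x}}\bar{\mathbf{x}}^T]+\lambda_1\mathbf{I})^{-1}\), as implied by setting the gradient of Eq. 6 to zero; function and variable names are illustrative, not the authors' code.

```python
import torch

def aqer_update(W, x_fp, x_q, lam1=1e-4):
    """Fold the activation quantization error into the full-precision weights (Aqer sketch).

    W:    (D_out, D_in) full-precision weights of a linear layer.
    x_fp: (N, D_in) full-precision calibration inputs, N = B * T tokens.
    x_q:  (N, D_in) the same inputs after activation quantization (x_bar = x + dx).
    """
    N = x_fp.shape[0]
    dx = x_q - x_fp                                   # activation quantization error dx
    E_dx_xq = dx.T @ x_q / N                          # estimate of E[dx x_bar^T]
    E_xq_xq = x_q.T @ x_q / N                         # estimate of E[x_bar x_bar^T]
    eye = torch.eye(W.shape[1], dtype=W.dtype, device=W.device)
    # assumed closed form: dW* = -W E[dx x_bar^T] (E[x_bar x_bar^T] + lam1 I)^(-1)
    dW = -W @ E_dx_xq @ torch.linalg.inv(E_xq_xq + lam1 * eye)
    return W + dW                                     # merged weights, W = W + dW*
```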
Weight Quantization Error Reduction
After Aqer, the weights need to be quantized, and Weight quantization error reduction (Wqer) is proposed to mitigate the resulting quantization error. Here, the objective is defined as:
Note that, unlike in Aqer, the activations are now quantized. Eq. 9 shows that the minimization is performed independently across output channels, so the minimization of each \(\mathcal{L}^{\text{MSE}}_i\) is analyzed separately. Meanwhile, quantizing all full-precision weights at once would lead to unrecoverable quantization error. Therefore, an iterative quantize-and-correct scheme is used to gradually reduce the quantization error caused by weight quantization.
In each iteration, the first half of the not-yet-quantized weights is quantized first, and then the resulting quantization error is mitigated. Specifically, start from the current full-precision weights \(\mathbf{W}_{i,:}\) and the corresponding \(\bar{\mathbf{x}}\). \(\mathbf{W}_{i,:}\) is split into two parts: the first half \(\mathbf{W}^s_{i,:} \in \mathbb{R}^{ 1\times D_{in}^s}\), which is quantized, and the remainder \(\mathbf{W}^r_{i,:} \in \mathbb{R}^{1 \times D_{in}^r}\), which stays at full precision. Correspondingly, \(\bar{\mathbf{x}}^s \in \mathbb{R}^{D_{in}^s}\) and \(\bar{\mathbf{x}}^r \in \mathbb{R}^{D_{in}^r}\) are derived from \(\bar{\mathbf{x}}\), where \(\bar{\mathbf{x}}^s\) and \(\bar{\mathbf{x}}^r\) contain the rows of \(\bar{\mathbf{x}}\) corresponding to \(\mathbf{W}^s_{i,:}\) and \(\mathbf{W}^r_{i,:}\), respectively. The quantization error of the quantized \(\mathbf{W}^s_{i,:}\) is denoted as \(\delta\mathbf{W}^s_{i,:} = \bar{\mathbf{W}}^s_{i,:} - \mathbf{W}^s_{i,:}\), and the resulting mean squared error (MSE) is:
Here, \(\mathbf{W}_{i,:} = [ \mathbf{W}^s_{i,:},\mathbf{W}^r_{i,:} ]\) and \(\bar{\mathbf{x}} = [ \bar{\mathbf{x}}^s, \bar{\mathbf{x}}^r ]\). To mitigate Eq. 10, Rounding Refinement is first introduced, which refines the rounding directions of the quantized weights: \(\delta\mathbf{W}^s_{i,:}\) is adjusted so as to reduce \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) itself. Then, given the \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) after Rounding Refinement, a Ridge Regression problem is constructed that further mitigates this error by adjusting \(\mathbf{W}^r_{i, :}\).
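Before detailing these two steps, the sketch below roughly illustrates the per-channel split and the empirical error of Eq. 10; the unsigned nearest-rounding quantizer and the segment size are placeholder assumptions, and the actual quantizer is defined by Eq. 1 of the paper.

```python
import torch

def split_channel(W_i, x_q, d_s, scale, b):
    """Split output channel i into a segment to quantize now and a full-precision remainder.

    W_i:   (D_in,) weights of output channel i.
    x_q:   (N, D_in) quantized calibration inputs x_bar.
    d_s:   size of the segment quantized in this iteration (D_in^s).
    scale: step size of a placeholder unsigned uniform quantizer with b bits.
    """
    W_s, W_r = W_i[:d_s], W_i[d_s:]                     # W^s (quantized now), W^r (kept FP)
    x_s, x_r = x_q[:, :d_s], x_q[:, d_s:]               # matching input slices x_bar^s, x_bar^r
    # nearest-rounding baseline quantization of the segment (placeholder for Eq. 1)
    W_s_bar = torch.clamp(torch.round(W_s / scale), 0, 2 ** b - 1) * scale
    dW_s = W_s_bar - W_s                                # dW^s = W_bar^s - W^s
    # W^r is untouched, so the channel's output error reduces to dW^s x_bar^s (Eq. 10)
    mse = (x_s @ dW_s).pow(2).mean()
    return (W_s_bar, W_r), (x_s, x_r), dW_s, mse
```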
Rounding Refinement
Initially, the goal is to adjust the rounding directions of the quantized weights to minimize \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\). Specifically, for the \(j\)-th value of \(\mathbf{W}^s_{i,:}\), denoted as \(\mathbf{W}^s_{i,j}\), the quantization process rounds either down or up. Thus, the quantization error of \(\mathbf{W}^s_{i,j}\), denoted as \(\delta\mathbf{W}^s_{i,j}\), can be expressed as either \(\delta\mathbf{W}^{s\downarrow}_{i, j}\) or \(\delta\mathbf{W}^{s\uparrow}_{i, j}\). Here, \(\delta\mathbf{W}^{s\downarrow}_{i, j} = \mathbf{W}^s_{i,j} - \text{Q}_{un\downarrow}(\mathbf{W}^s_{i,j}, b) > 0\) denotes the error produced by the rounding-down strategy, and \(\delta\mathbf{W}^{s\uparrow}_{i, j} = \mathbf{W}^s_{i,j} - \text{Q}_{un\uparrow}(\mathbf{W}^s_{i,j}, b) < 0\) denotes the error produced by the rounding-up strategy, where \(\downarrow/\uparrow\) means replacing \(\left\lfloor \cdot \right\rceil\) in Eq. 1 with \(\left\lfloor \cdot \right\rfloor\) / \(\left\lceil \cdot \right\rceil\).
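For illustration, the two candidate errors could be computed as below with a simple unsigned uniform quantizer; the scale handling is a simplifying assumption, and the actual quantizer is defined by Eq. 1 of the paper.

```python
import torch

def rounding_error_candidates(W_s, scale, b):
    """Per-element errors for the round-down and round-up choices of a uniform quantizer."""
    q = W_s / scale
    q_down = torch.clamp(torch.floor(q), 0, 2 ** b - 1) * scale   # Q_un_down(W^s, b)
    q_up = torch.clamp(torch.ceil(q), 0, 2 ** b - 1) * scale      # Q_un_up(W^s, b)
    dW_down = W_s - q_down                                        # dW^s_down >= 0
    dW_up = W_s - q_up                                            # dW^s_up   <= 0
    return dW_down, dW_up
```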
Choosing \(\delta\mathbf{W}^s_{i,:}\) is an NP-hard problem, whose solution can be searched by mixed-integer quadratic programming (MIQP). However, the high computational cost of evaluating \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) makes it challenging to find a solution within a reasonable amount of time. As shown in Table 1, using \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) as the MIQP objective incurs a huge time cost of about 130 hours.
- Efficient Proxy
Therefore, the goal is to find an efficient proxy for \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\). First, \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) is rewritten as:
Here, the step marked \(\Delta\) uses the identity \(\mathbb{E}\left[ Z^2 \right] = (\mathbb{E}\left[ Z \right])^2 + \text{Var}\left[ Z \right]\).
According to the central limit theorem, the large number of multiplication and addition operations in neural networks makes the activations typically follow a Gaussian distribution, which is a basic assumption in many previous quantization studies. Meanwhile, Figure 2 shows the channel-wise distributions of full-precision and quantized activations: the quantized activations still exhibit an approximately Gaussian distribution.
Thus, the paper argues that the channel-wise distribution of \(\bar{\mathbf{x}}^s\) can still be captured by a Gaussian distribution, and models \(\bar{\mathbf{x}}^s\) with a \(D_{in}^s\)-dimensional Gaussian \(\mathcal{N}(\boldsymbol{\mu}^s, \boldsymbol{\Sigma}^s)\), where \(D_{in}^s\) is the dimension of \(\bar{\mathbf{x}}^s\) and \(\boldsymbol{\mu}^s \in \mathbb{R}^{D_{in}^s}, \boldsymbol{\Sigma}^s \in \mathbb{R}^{D_{in}^s \times D_{in}^s}\). Then, Eq. 11 becomes:
Here, Eq. 12 is the obtained proxy for \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\). In practice, the empirical \(\hat{\boldsymbol{\mu}}^s\) and \(\hat{\boldsymbol{\Sigma}}^s\) are estimated from the given calibration dataset. Note that \(\hat{\boldsymbol{\mu}}^s\) and \(\hat{\boldsymbol{\Sigma}}^s\) are shared across all output channels and only need to be computed once.
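A small sketch of the proxy under these assumptions is given below. It assumes Eq. 12 takes the form \((\delta\mathbf{W}^s_{i,:}\boldsymbol{\mu}^s)^2 + \delta\mathbf{W}^s_{i,:}\boldsymbol{\Sigma}^s\delta\mathbf{W}^{sT}_{i,:}\), which follows from the \(\mathbb{E}[Z^2]\) decomposition above; names are illustrative.

```python
import torch

def gaussian_stats(x_q_s):
    """Empirical mean and covariance of x_bar^s, shared by all output channels.

    x_q_s: (N, D_in_s) quantized calibration inputs restricted to the columns being quantized.
    """
    mu = x_q_s.mean(dim=0)                                 # mu_hat^s
    centered = x_q_s - mu
    sigma = centered.T @ centered / x_q_s.shape[0]         # Sigma_hat^s
    return mu, sigma

def proxy(dW_s, mu, sigma):
    """Proxy of E[||dW^s x_bar^s||^2] via E[Z^2] = (E[Z])^2 + Var[Z]."""
    return (dW_s @ mu) ** 2 + dW_s @ sigma @ dW_s
```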
Figure 3 shows the relationship between the proxy and \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\): the proposed proxy is proportional to the true value, demonstrating its reliability.
The computational complexity of the proxy is \(O((D_{in}^s)^2)\), whereas that of \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) is \(O(ND_{in}^s)\) with \(N \gg D_{in}^s\). The proxy can therefore serve as a low-cost objective for solving \(\delta\mathbf{W}^s_{i,:}\). As shown in Table 1, using Eq. 12 as the MIQP objective reduces the time cost from about 130 hours to about 10 hours. However, since current open-source MIQP implementations only support CPU and cannot fully exploit GPU capability, this cost is still considerable. Next, Rounding Refinement is described, a GPU-friendly approach that uses the gradient of the proxy to adjust \(\delta\mathbf{W}^s_{i,:}\) more quickly.
- Rounding Refinement
First, the nearest-rounding strategy is used to initialize \(\delta\mathbf{W}^s_{i,j}\); at this point, \(\delta\mathbf{W}^s_{i,j}\) equals either \(\delta\mathbf{W}^{s\downarrow}_{i, j}\) or \(\delta\mathbf{W}^{s\uparrow}_{i, j}\). The goal is then to identify an index set \(\mathcal{S}\) containing the indices of the elements whose rounding directions are to be flipped:
To determine \(\mathcal{S}\), the gradient of the proxy (Eq. 12) with respect to \(\delta\mathbf{W}^s_{i,:}\) is first computed:
Only elements whose gradient has the same sign as the current error are selected, since only for such elements does flipping the rounding direction move the error against the gradient and thus reduce the proxy. For example, when \(\delta\mathbf{W}_{i, j}^s = \delta\mathbf{W}^{s\downarrow}_{i, j}\), it is replaced with \(\delta\mathbf{W}^{s\uparrow}_{i, j}\) only if \(\boldsymbol{G}_{\delta\mathbf{W}_{i, j}^s}\) has the same sign as \(\delta\mathbf{W}_{i, j}^s\). Thus, the index set \(\mathcal{S}\) is defined as:
Here, \(\mathrm{topk\_index}\) returns the indices of the top-\(\mathrm{k}\) elements, \(\mathbb{1}(\cdot)\) returns 1 for non-negative inputs and 0 for negative inputs, and \(\lvert \cdot \rvert\) returns the absolute value of the input.
After obtaining \(\mathcal{S}\), the flips are performed via Eq. 13. This process is iterated until the adjusted \(\delta\mathbf{W}^s_{i, :}\) yields a larger proxy value or the maximum number of iterations is reached. After obtaining \(\delta\mathbf{W}^s_{i, :}\), the quantization is completed via \(\bar{\mathbf{W}}^s_{i, :} = \mathbf{W}^s_{i, :}+\delta\mathbf{W}^s_{i, :}\), and \(\bar{\mathbf{W}}^s_{i, :}\) is added to the set of quantized weights. The overall procedure of Rounding Refinement is given in lines 7 to 18 of Algorithm 1. As shown in Table 1, Rounding Refinement reduces the time overhead by about \(150\times\), from 10 hours down to 4 minutes, with an acceptable loss of accuracy.
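Below is a rough reconstruction of this loop from the description above (gradient of the proxy, sign-matched top-k flips, stop once the proxy stops improving or an iteration budget is exhausted). It is a sketch under those assumptions rather than the reference Algorithm 1; `dW_down`/`dW_up` are the per-element candidates from the earlier rounding-error sketch, and the flip count `k` and iteration limit are illustrative.

```python
import torch

def rounding_refinement(dW_down, dW_up, mu, sigma, k=1, max_iter=100):
    """Refine the rounding directions of one output channel's quantized segment.

    dW_down / dW_up: (D_in_s,) candidate errors for round-down / round-up.
    mu, sigma: empirical Gaussian statistics of x_bar^s (shared across channels).
    Starts from nearest rounding and greedily flips the k most promising elements.
    """
    # nearest rounding: pick whichever candidate has the smaller magnitude
    dW = torch.where(dW_down.abs() <= dW_up.abs(), dW_down, dW_up)
    best = (dW @ mu) ** 2 + dW @ sigma @ dW              # proxy value (Eq. 12)
    for _ in range(max_iter):
        grad = 2 * (dW @ mu) * mu + 2 * (sigma @ dW)     # gradient of the proxy w.r.t. dW^s
        # only elements whose gradient sign matches the current error may be flipped
        score = torch.where(torch.sign(grad) == torch.sign(dW),
                            grad.abs(), torch.zeros_like(grad))
        if score.max() == 0:                             # nothing left to flip
            break
        idx = torch.topk(score, k).indices
        cand = dW.clone()
        # flip: a round-down error becomes the round-up error, and vice versa
        cand[idx] = torch.where(cand[idx] == dW_down[idx], dW_up[idx], dW_down[idx])
        val = (cand @ mu) ** 2 + cand @ sigma @ cand
        if val >= best:                                  # stop once the proxy no longer improves
            break
        dW, best = cand, val
    return dW
```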
- Ridge Regression
After Rounding Refinement, \(\delta\mathbf{W}^{r*}_{i, :}\) is used to adjust \(\mathbf{W}^r_{i, :}\) to further counteract \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\), yielding the following objective:
Here, \(\lambda_2\) is a hyperparameter controlling the strength of the regularization term \(\lambda_2\| \delta\mathbf{W}^{r*}_{i, :} \|_2^2\). The minimization of Eq. 16 forms a ridge regression problem, whose solution is given by:
In practice, \(\mathbb{E}\left[\bar{\mathbf{x}}^r \bar{\mathbf{x}}^{sT}\right]\) and \(\mathbb{E}\left[\bar{\mathbf{x}}^r \bar{\mathbf{x}}^{rT} \right]\) are estimated by \(\frac{1}{N}\sum_n^N \bar{\mathbf{x}}_n^r\bar{\mathbf{x}}_n^{sT}\) and \(\frac{1}{N}\sum_n^N \bar{\mathbf{x}}_n^r\bar{\mathbf{x}}_n^{rT}\), respectively. Subsequently, the error is mitigated via \(\mathbf{W}^r_{i, :} = \mathbf{W}^r_{i, :}+\delta\mathbf{W}^{r*}_{i, :}\). At this point, \(\mathbf{W}^r_{i, :}\) remains at full precision and will be processed in the next iteration. The process continues until all weights are accurately quantized. The proposed Rounding Refinement and Ridge Regression constitute Wqer, whose overall procedure is given in Algorithm 1. In practice, multiple output channels are processed in parallel in Wqer.
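A sketch of this compensation step is shown below, assuming Eq. 17 takes the closed form obtained by setting the gradient of Eq. 16 to zero, \(\delta\mathbf{W}^{r*}_{i,:} = -\delta\mathbf{W}^s_{i,:}\,\mathbb{E}[\bar{\mathbf{x}}^s\bar{\mathbf{x}}^{rT}](\mathbb{E}[\bar{\mathbf{x}}^r\bar{\mathbf{x}}^{rT}]+\lambda_2\mathbf{I})^{-1}\); variable names are illustrative, and `dW_s` is the refined error from Rounding Refinement.

```python
import torch

def wqer_compensate(dW_s, W_r, x_q_s, x_q_r, lam2=1e-4):
    """Absorb the quantized segment's error into the still-full-precision remainder.

    dW_s:  (D_in_s,) refined quantization error of the quantized segment.
    W_r:   (D_in_r,) remaining full-precision weights of this output channel.
    x_q_s: (N, D_in_s) and x_q_r: (N, D_in_r): quantized calibration inputs, split accordingly.
    """
    N = x_q_s.shape[0]
    E_s_r = x_q_s.T @ x_q_r / N                   # estimate of E[x_bar^s x_bar^{rT}]
    E_r_r = x_q_r.T @ x_q_r / N                   # estimate of E[x_bar^r x_bar^{rT}]
    eye = torch.eye(W_r.shape[0], dtype=W_r.dtype, device=W_r.device)
    # assumed closed form: dW^{r*} = -dW^s E[x^s x^{rT}] (E[x^r x^{rT}] + lam2 I)^(-1)
    dW_r = -dW_s @ E_s_r @ torch.linalg.inv(E_r_r + lam2 * eye)
    return W_r + dW_r                             # updated remainder, quantized in later iterations
```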
Experiments