Post-training quantization (PTQ) for Vision Transformers (ViTs) has attracted much attention for its high efficiency in model compression. However, existing methods usually ignore the intricate interdependence between quantized weights and activations, leading to considerable quantization error. The paper proposes ERQ, a two-step PTQ method carefully designed to sequentially reduce the quantization errors arising from activation and weight quantization. ERQ first introduces Activation quantization error reduction (Aqer), which strategically formulates the minimization of the activation quantization error as a ridge regression problem and solves it by updating the full-precision weights. Subsequently, ERQ introduces Weight quantization error reduction (Wqer), which adopts an iterative approach to mitigate the quantization error caused by weight quantization. In each iteration, an empirically derived efficient proxy is used to refine the rounding directions of the quantized weights, combined with a ridge regression solver to reduce the weight quantization error. Experimental results demonstrate the effectiveness of the method. Notably, ERQ outperforms the state-of-the-art GPTQ by up to 22.36% in accuracy on W3A4 ViT-S.
Paper: ERQ: Error Reduction for Post-Training Quantization of Vision Transformers
- Paper Address: https://arxiv.org/abs/2407.06794
Introduction
Vision Transformers (ViTs) have posed a significant challenge to convolutional neural networks (CNNs) and established a new paradigm in computer vision. ViTs use the multi-head self-attention (MHSA) mechanism to capture long-range relationships between image patches, showing impressive progress on a variety of visual tasks.
However, great capability comes with considerable complexity. The inherent architectural complexity of ViTs results in high computational demands and sizable memory requirements, which pose challenges for deployment in resource-constrained environments. To alleviate this dilemma, model quantization has attracted sustained attention from both industry and academia. Quantization reduces model complexity by representing weights and activations with low-bit values, providing a promising avenue for efficient deployment. Recently, researchers have increasingly focused on post-training quantization (PTQ) of Vision Transformers, which aims to quantize a model at low cost using only a small calibration dataset.
To accommodate the unique structure of ViTs, many studies have explored various post-training quantization (PTQ) methods. For example, to handle the long-tailed post-Softmax activations, one study proposed \(\log 2/\log \sqrt{2}\) quantizers and twin uniform quantizers. To manage highly variable activations, some studies adopt reparameterization techniques and power-of-two factors. Other work uses evolutionary search to identify unstable scaling factors. However, existing methods usually ignore the intricate interdependence between weight and activation quantization, which leads to considerable quantization errors under weight-activation quantization.
The paper proposes ERQ, a two-step post-training quantization method tailored for ViTs, which aims to sequentially reduce the quantization errors induced by quantizing activations and weights. As shown in Figure 1, ERQ consists of two steps, namely Activation quantization error reduction (Aqer) and Weight quantization error reduction (Wqer). Aqer formulates the error caused by activation quantization as a ridge regression problem, which is solved in closed form by updating the weights. Subsequently, Wqer is introduced to reduce the error caused by weight quantization in an iterative quantize-and-correct manner. In particular, in each iteration the first half of the remaining full-precision weights is quantized, and the resulting quantization error is reduced by first performing Rounding Refinement and then solving a ridge regression problem again. The former derives an efficient proxy for the output error and uses it to refine the rounding directions of the quantized weights; the latter further reduces the quantization error by updating the remaining full-precision weights. This process continues until all weights are accurately quantized.
Extensive experiments on various ViT variants (ViT, DeiT, and Swin) and tasks (image classification, object detection, and instance segmentation) demonstrate the effectiveness of ERQ. Notably, on the image classification task, ERQ outperforms GPTQ by 22.36% on W3A4 ViT-S.
Method
The intertwined \(\delta{\mathbf{x}}\) and \(\delta\mathbf{W}\) make it challenging to find the optimal solution of Eq. 4. To make the problem tractable, Eq. 4 is relaxed into two sequential subproblems that minimize the errors from quantizing the activations and the weights, respectively. As shown in Figure 1, Activation quantization error reduction (Aqer) is performed first, followed by Weight quantization error reduction (Wqer).
Activation Quantization Error Reduction
To mitigate the error caused by activation quantization, Activation quantization error reduction (Aqer) is introduced, which formulates the error mitigation problem as a ridge regression problem. Specifically, the weights are kept at full precision, and only the mean squared error (MSE) induced by the activation quantization error \(\delta{\mathbf{x}}\) is considered:
To minimize Eq. 5, it is formulated as a ridge regression problem, where the minimization is achieved by adding an adjustment \(\delta\mathbf{W}^*\) to the weights \(\mathbf{W}\):
Here, \(\delta\mathbf{W}^*\) denotes the adjustment term computed by ridge regression, \(\bar{\mathbf{x}}=\mathbf{x}+\delta\mathbf{x}\) is the quantized input, \(\lambda_1\| \delta\mathbf{W}^* \|_2^2\) is the regularization term, and \(\lambda_1\) is a hyperparameter that controls the strength of the regularization. Eq. 6 constitutes a ridge regression problem. To minimize it, first compute its gradient with respect to \(\delta\mathbf{W}^*\):
Then, \(\delta\mathbf{W}^*\) is obtained by setting Eq. 7 to zero:
The regularization term \(\lambda_1 \mathbf{I}\) ensures that the inverse of \(\mathbb{E} \left[\bar{\mathbf{x}}\bar{\mathbf{x}}^T \right] + \lambda_1 \mathbf{I}\) always exists, which is crucial for numerical stability. In addition, it suppresses outliers, which mitigates overfitting and improves the generalization of the model. Suppressing outliers is also crucial for the subsequent weight quantization, since it limits the range of the weights; this restriction prevents quantization points from being allocated to uncovered regions and thus enhances the expressiveness of the quantization.
In practice, given a calibration dataset, \(\mathbb{E}\left[\delta{\mathbf{x}}\bar{\mathbf{x}}^T\right]\) and \(\mathbb{E}\left[\bar{\mathbf{x}}\bar{\mathbf{x}}^T \right]\) are estimated by \(\frac{1}{N}\sum_n^N \delta{\mathbf{x}}_n\bar{\mathbf{x}}_n^T\) and \(\frac{1}{N}\sum_n^N \bar{\mathbf{x}}_n\bar{\mathbf{x}}_n^T\), respectively. Here, \(N = B\times T \gg D_{in}^s\), where \(B\) is the size of the calibration dataset and \(T\) is the number of tokens per image. Note that \(\delta{\mathbf{x}}\) and \(\bar{\mathbf{x}}\) are determined once the input and the quantization parameters are given. After obtaining \(\delta\mathbf{W}^*\), it is merged into the network weights via \(\mathbf{W} = \mathbf{W} + \delta\mathbf{W}^*\). By doing so, the proposed Aqer explicitly absorbs the quantization error of the activations into the weights.
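To make the update concrete, below is a minimal PyTorch-style sketch of Aqer for one linear layer. It assumes the closed-form solution of Eq. 8 is \(\delta\mathbf{W}^* = -\mathbf{W}\,\mathbb{E}[\delta\mathbf{x}\bar{\mathbf{x}}^T](\mathbb{E}[\bar{\mathbf{x}}\bar{\mathbf{x}}^T]+\lambda_1\mathbf{I})^{-1}\), as implied by setting the gradient of Eq. 6 to zero; function and variable names are illustrative, not the authors' code.

```python
import torch

def aqer_update(W, x_fp, x_q, lam1=1e-4):
    """Fold the activation quantization error into the full-precision weights (Aqer sketch).

    W:    (D_out, D_in) full-precision weights of a linear layer.
    x_fp: (N, D_in) full-precision calibration inputs, N = B * T tokens.
    x_q:  (N, D_in) the same inputs after activation quantization (x_bar = x + dx).
    """
    N = x_fp.shape[0]
    dx = x_q - x_fp                                   # activation quantization error dx
    E_dx_xq = dx.T @ x_q / N                          # estimate of E[dx x_bar^T]
    E_xq_xq = x_q.T @ x_q / N                         # estimate of E[x_bar x_bar^T]
    eye = torch.eye(W.shape[1], dtype=W.dtype, device=W.device)
    # assumed closed form: dW* = -W E[dx x_bar^T] (E[x_bar x_bar^T] + lam1 I)^(-1)
    dW = -W @ E_dx_xq @ torch.linalg.inv(E_xq_xq + lam1 * eye)
    return W + dW                                     # merged weights, W = W + dW*
```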
Weight Quantization Error Reduction
After Aqer, the weights need to be quantized, and Weight quantization error reduction (Wqer) is proposed to mitigate the resulting quantization error. Here, the objective is defined as:
Note that, unlike in Aqer, the activations are now quantized. Eq. 9 shows that the minimization is performed independently across output channels, so the minimization of each \(\mathcal{L}^{\text{MSE}}_i\) is analyzed separately. Meanwhile, quantizing all full-precision weights at once would lead to unrecoverable quantization error. Therefore, an iterative quantize-and-correct scheme is used to gradually reduce the quantization error caused by weight quantization.
In each iteration, the first half of the not-yet-quantized weights is quantized first, and then the resulting quantization error is mitigated. Specifically, start from the current full-precision weights \(\mathbf{W}_{i,:}\) and the corresponding \(\bar{\mathbf{x}}\). \(\mathbf{W}_{i,:}\) is split into two parts: the first half \(\mathbf{W}^s_{i,:} \in \mathbb{R}^{ 1\times D_{in}^s}\), which is quantized, and the remainder \(\mathbf{W}^r_{i,:} \in \mathbb{R}^{1 \times D_{in}^r}\), which stays at full precision. Correspondingly, \(\bar{\mathbf{x}}^s \in \mathbb{R}^{D_{in}^s}\) and \(\bar{\mathbf{x}}^r \in \mathbb{R}^{D_{in}^r}\) are derived from \(\bar{\mathbf{x}}\), where \(\bar{\mathbf{x}}^s\) and \(\bar{\mathbf{x}}^r\) contain the rows of \(\bar{\mathbf{x}}\) corresponding to \(\mathbf{W}^s_{i,:}\) and \(\mathbf{W}^r_{i,:}\), respectively. The quantization error of the quantized \(\mathbf{W}^s_{i,:}\) is denoted as \(\delta\mathbf{W}^s_{i,:} = \bar{\mathbf{W}}^s_{i,:} - \mathbf{W}^s_{i,:}\), and the resulting mean squared error (MSE) is:
Here, \(\mathbf{W}_{i,:} = [ \mathbf{W}^s_{i,:},\mathbf{W}^r_{i,:} ]\) and \(\bar{\mathbf{x}} = [ \bar{\mathbf{x}}^s, \bar{\mathbf{x}}^r ]\). To mitigate Eq. 10, Rounding Refinement is first introduced, which refines the rounding directions of the quantized weights: \(\delta\mathbf{W}^s_{i,:}\) is adjusted so as to reduce \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) itself. Then, given the \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) after Rounding Refinement, a Ridge Regression problem is constructed that further mitigates this error by adjusting \(\mathbf{W}^r_{i, :}\).
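Before detailing these two steps, the sketch below roughly illustrates the per-channel split and the empirical error of Eq. 10; the unsigned nearest-rounding quantizer and the segment size are placeholder assumptions, and the actual quantizer is defined by Eq. 1 of the paper.

```python
import torch

def split_channel(W_i, x_q, d_s, scale, b):
    """Split output channel i into a segment to quantize now and a full-precision remainder.

    W_i:   (D_in,) weights of output channel i.
    x_q:   (N, D_in) quantized calibration inputs x_bar.
    d_s:   size of the segment quantized in this iteration (D_in^s).
    scale: step size of a placeholder unsigned uniform quantizer with b bits.
    """
    W_s, W_r = W_i[:d_s], W_i[d_s:]                     # W^s (quantized now), W^r (kept FP)
    x_s, x_r = x_q[:, :d_s], x_q[:, d_s:]               # matching input slices x_bar^s, x_bar^r
    # nearest-rounding baseline quantization of the segment (placeholder for Eq. 1)
    W_s_bar = torch.clamp(torch.round(W_s / scale), 0, 2 ** b - 1) * scale
    dW_s = W_s_bar - W_s                                # dW^s = W_bar^s - W^s
    # W^r is untouched, so the channel's output error reduces to dW^s x_bar^s (Eq. 10)
    mse = (x_s @ dW_s).pow(2).mean()
    return (W_s_bar, W_r), (x_s, x_r), dW_s, mse
```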
Rounding Refinement
Initially, the goal is to adjust the rounding directions of the quantized weights to minimize \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\). Specifically, for the \(j\)-th value of \(\mathbf{W}^s_{i,:}\), denoted as \(\mathbf{W}^s_{i,j}\), the quantization process rounds either down or up. Thus, the quantization error of \(\mathbf{W}^s_{i,j}\), denoted as \(\delta\mathbf{W}^s_{i,j}\), can be expressed as either \(\delta\mathbf{W}^{s\downarrow}_{i, j}\) or \(\delta\mathbf{W}^{s\uparrow}_{i, j}\). Here, \(\delta\mathbf{W}^{s\downarrow}_{i, j} = \mathbf{W}^s_{i,j} - \text{Q}_{un\downarrow}(\mathbf{W}^s_{i,j}, b) > 0\) denotes the error produced by the rounding-down strategy, and \(\delta\mathbf{W}^{s\uparrow}_{i, j} = \mathbf{W}^s_{i,j} - \text{Q}_{un\uparrow}(\mathbf{W}^s_{i,j}, b) < 0\) denotes the error produced by the rounding-up strategy, where \(\downarrow/\uparrow\) means replacing \(\left\lfloor \cdot \right\rceil\) in Eq. 1 with \(\left\lfloor \cdot \right\rfloor\) / \(\left\lceil \cdot \right\rceil\).
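For illustration, the two candidate errors could be computed as below with a simple unsigned uniform quantizer; the scale handling is a simplifying assumption, and the actual quantizer is defined by Eq. 1 of the paper.

```python
import torch

def rounding_error_candidates(W_s, scale, b):
    """Per-element errors for the round-down and round-up choices of a uniform quantizer."""
    q = W_s / scale
    q_down = torch.clamp(torch.floor(q), 0, 2 ** b - 1) * scale   # Q_un_down(W^s, b)
    q_up = torch.clamp(torch.ceil(q), 0, 2 ** b - 1) * scale      # Q_un_up(W^s, b)
    dW_down = W_s - q_down                                        # dW^s_down >= 0
    dW_up = W_s - q_up                                            # dW^s_up   <= 0
    return dW_down, dW_up
```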
Choosing \(\delta\mathbf{W}^s_{i,:}\) is an NP-hard problem, whose solution can be searched by mixed-integer quadratic programming (MIQP). However, the high computational cost of evaluating \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) makes it challenging to find a solution within a reasonable amount of time. As shown in Table 1, using \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) as the MIQP objective incurs a huge time cost of about 130 hours.
- Efficient Proxy
Therefore, the goal is to find an efficient proxy for \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\). First, \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) is rewritten as:
Here, the step marked \(\Delta\) uses the identity \(\mathbb{E}\left[ Z^2 \right] = (\mathbb{E}\left[ Z \right])^2 + \text{Var}\left[ Z \right]\).
According to the central limit theorem, the large number of multiplication and addition operations in neural networks makes the activations typically follow a Gaussian distribution, which is a basic assumption in many previous quantization studies. Meanwhile, Figure 2 shows the channel-wise distributions of full-precision and quantized activations: the quantized activations still exhibit an approximately Gaussian distribution.
Thus, the paper argues that the channel-wise distribution of \(\bar{\mathbf{x}}^s\) can still be captured by a Gaussian distribution, and models \(\bar{\mathbf{x}}^s\) with a \(D_{in}^s\)-dimensional Gaussian \(\mathcal{N}(\boldsymbol{\mu}^s, \boldsymbol{\Sigma}^s)\), where \(D_{in}^s\) is the dimension of \(\bar{\mathbf{x}}^s\) and \(\boldsymbol{\mu}^s \in \mathbb{R}^{D_{in}^s}, \boldsymbol{\Sigma}^s \in \mathbb{R}^{D_{in}^s \times D_{in}^s}\). Then, Eq. 11 becomes:
Here, Eq. 12 is the obtained proxy for \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\). In practice, the empirical \(\hat{\boldsymbol{\mu}}^s\) and \(\hat{\boldsymbol{\Sigma}}^s\) are estimated from the given calibration dataset. Note that \(\hat{\boldsymbol{\mu}}^s\) and \(\hat{\boldsymbol{\Sigma}}^s\) are shared across all output channels and only need to be computed once.
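A small sketch of the proxy under these assumptions is given below. It assumes Eq. 12 takes the form \((\delta\mathbf{W}^s_{i,:}\boldsymbol{\mu}^s)^2 + \delta\mathbf{W}^s_{i,:}\boldsymbol{\Sigma}^s\delta\mathbf{W}^{sT}_{i,:}\), which follows from the \(\mathbb{E}[Z^2]\) decomposition above; names are illustrative.

```python
import torch

def gaussian_stats(x_q_s):
    """Empirical mean and covariance of x_bar^s, shared by all output channels.

    x_q_s: (N, D_in_s) quantized calibration inputs restricted to the columns being quantized.
    """
    mu = x_q_s.mean(dim=0)                                 # mu_hat^s
    centered = x_q_s - mu
    sigma = centered.T @ centered / x_q_s.shape[0]         # Sigma_hat^s
    return mu, sigma

def proxy(dW_s, mu, sigma):
    """Proxy of E[||dW^s x_bar^s||^2] via E[Z^2] = (E[Z])^2 + Var[Z]."""
    return (dW_s @ mu) ** 2 + dW_s @ sigma @ dW_s
```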
Figure 3 shows the relationship between the proxy and \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\): the proposed proxy is proportional to the true value, demonstrating its reliability.
The computational complexity of the proxy is \(O((D_{in}^s)^2)\), whereas that of \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\) is \(O(ND_{in}^s)\) with \(N \gg D_{in}^s\). The proxy can therefore serve as a low-cost objective for solving \(\delta\mathbf{W}^s_{i,:}\). As shown in Table 1, using Eq. 12 as the MIQP objective reduces the time cost from about 130 hours to about 10 hours. However, since current open-source MIQP implementations only support CPU and cannot fully exploit GPU capability, this cost is still considerable. Next, Rounding Refinement is described, a GPU-friendly approach that uses the gradient of the proxy to adjust \(\delta\mathbf{W}^s_{i,:}\) more quickly.
- Rounding Refinement
First, the nearest-rounding strategy is used to initialize \(\delta\mathbf{W}^s_{i,j}\); at this point, \(\delta\mathbf{W}^s_{i,j}\) equals either \(\delta\mathbf{W}^{s\downarrow}_{i, j}\) or \(\delta\mathbf{W}^{s\uparrow}_{i, j}\). The goal is then to identify an index set \(\mathcal{S}\) containing the indices of the elements whose rounding directions are to be flipped:
To determine \(\mathcal{S}\), the gradient of the proxy (Eq. 12) with respect to \(\delta\mathbf{W}^s_{i,:}\) is first computed:
Only elements whose gradient has the same sign as the current error are selected, since only for such elements does flipping the rounding direction move the error against the gradient and thus reduce the proxy. For example, when \(\delta\mathbf{W}_{i, j}^s = \delta\mathbf{W}^{s\downarrow}_{i, j}\), it is replaced with \(\delta\mathbf{W}^{s\uparrow}_{i, j}\) only if \(\boldsymbol{G}_{\delta\mathbf{W}_{i, j}^s}\) has the same sign as \(\delta\mathbf{W}_{i, j}^s\). Thus, the index set \(\mathcal{S}\) is defined as:
Here, \(\mathrm{topk\_index}\) returns the indices of the top-\(\mathrm{k}\) elements, \(\mathbb{1}(\cdot)\) returns 1 for non-negative inputs and 0 for negative inputs, and \(\lvert \cdot \rvert\) returns the absolute value of the input.
After obtaining \(\mathcal{S}\), the flips are performed via Eq. 13. This process is iterated until the adjusted \(\delta\mathbf{W}^s_{i, :}\) yields a larger proxy value or the maximum number of iterations is reached. After obtaining \(\delta\mathbf{W}^s_{i, :}\), the quantization is completed via \(\bar{\mathbf{W}}^s_{i, :} = \mathbf{W}^s_{i, :}+\delta\mathbf{W}^s_{i, :}\), and \(\bar{\mathbf{W}}^s_{i, :}\) is added to the set of quantized weights. The overall procedure of Rounding Refinement is given in lines 7 to 18 of Algorithm 1. As shown in Table 1, Rounding Refinement reduces the time overhead by about \(150\times\), from 10 hours down to 4 minutes, with an acceptable loss of accuracy.
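Below is a rough reconstruction of this loop from the description above (gradient of the proxy, sign-matched top-k flips, stop once the proxy stops improving or an iteration budget is exhausted). It is a sketch under those assumptions rather than the reference Algorithm 1; `dW_down`/`dW_up` are the per-element candidates from the earlier rounding-error sketch, and the flip count `k` and iteration limit are illustrative.

```python
import torch

def rounding_refinement(dW_down, dW_up, mu, sigma, k=1, max_iter=100):
    """Refine the rounding directions of one output channel's quantized segment.

    dW_down / dW_up: (D_in_s,) candidate errors for round-down / round-up.
    mu, sigma: empirical Gaussian statistics of x_bar^s (shared across channels).
    Starts from nearest rounding and greedily flips the k most promising elements.
    """
    # nearest rounding: pick whichever candidate has the smaller magnitude
    dW = torch.where(dW_down.abs() <= dW_up.abs(), dW_down, dW_up)
    best = (dW @ mu) ** 2 + dW @ sigma @ dW              # proxy value (Eq. 12)
    for _ in range(max_iter):
        grad = 2 * (dW @ mu) * mu + 2 * (sigma @ dW)     # gradient of the proxy w.r.t. dW^s
        # only elements whose gradient sign matches the current error may be flipped
        score = torch.where(torch.sign(grad) == torch.sign(dW),
                            grad.abs(), torch.zeros_like(grad))
        if score.max() == 0:                             # nothing left to flip
            break
        idx = torch.topk(score, k).indices
        cand = dW.clone()
        # flip: a round-down error becomes the round-up error, and vice versa
        cand[idx] = torch.where(cand[idx] == dW_down[idx], dW_up[idx], dW_down[idx])
        val = (cand @ mu) ** 2 + cand @ sigma @ cand
        if val >= best:                                  # stop once the proxy no longer improves
            break
        dW, best = cand, val
    return dW
```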
- Ridge Regression
After Rounding Refinement, \(\delta\mathbf{W}^{r*}_{i, :}\) is used to adjust \(\mathbf{W}^r_{i, :}\) to further counteract \(\mathbb{E} \left[ \| \delta\mathbf{W}^s_{i,:}\bar{\mathbf{x}}^s \|_2^2 \right]\), yielding the following objective:
Here, \(\lambda_2\) is a hyperparameter controlling the strength of the regularization term \(\lambda_2\| \delta\mathbf{W}^{r*}_{i, :} \|_2^2\). The minimization of Eq. 16 forms a ridge regression problem, whose solution is given by:
In practice, \(\mathbb{E}\left[\bar{\mathbf{x}}^r \bar{\mathbf{x}}^{sT}\right]\) and \(\mathbb{E}\left[\bar{\mathbf{x}}^r \bar{\mathbf{x}}^{rT} \right]\) are estimated by \(\frac{1}{N}\sum_n^N \bar{\mathbf{x}}_n^r\bar{\mathbf{x}}_n^{sT}\) and \(\frac{1}{N}\sum_n^N \bar{\mathbf{x}}_n^r\bar{\mathbf{x}}_n^{rT}\), respectively. Subsequently, the error is mitigated via \(\mathbf{W}^r_{i, :} = \mathbf{W}^r_{i, :}+\delta\mathbf{W}^{r*}_{i, :}\). At this point, \(\mathbf{W}^r_{i, :}\) remains at full precision and will be processed in the next iteration. The process continues until all weights are accurately quantized. The proposed Rounding Refinement and Ridge Regression constitute Wqer, whose overall procedure is given in Algorithm 1. In practice, multiple output channels are processed in parallel in Wqer.
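A sketch of this compensation step is shown below, assuming Eq. 17 takes the closed form obtained by setting the gradient of Eq. 16 to zero, \(\delta\mathbf{W}^{r*}_{i,:} = -\delta\mathbf{W}^s_{i,:}\,\mathbb{E}[\bar{\mathbf{x}}^s\bar{\mathbf{x}}^{rT}](\mathbb{E}[\bar{\mathbf{x}}^r\bar{\mathbf{x}}^{rT}]+\lambda_2\mathbf{I})^{-1}\); variable names are illustrative, and `dW_s` is the refined error from Rounding Refinement.

```python
import torch

def wqer_compensate(dW_s, W_r, x_q_s, x_q_r, lam2=1e-4):
    """Absorb the quantized segment's error into the still-full-precision remainder.

    dW_s:  (D_in_s,) refined quantization error of the quantized segment.
    W_r:   (D_in_r,) remaining full-precision weights of this output channel.
    x_q_s: (N, D_in_s) and x_q_r: (N, D_in_r): quantized calibration inputs, split accordingly.
    """
    N = x_q_s.shape[0]
    E_s_r = x_q_s.T @ x_q_r / N                   # estimate of E[x_bar^s x_bar^{rT}]
    E_r_r = x_q_r.T @ x_q_r / N                   # estimate of E[x_bar^r x_bar^{rT}]
    eye = torch.eye(W_r.shape[0], dtype=W_r.dtype, device=W_r.device)
    # assumed closed form: dW^{r*} = -dW^s E[x^s x^{rT}] (E[x^r x^{rT}] + lam2 I)^(-1)
    dW_r = -dW_s @ E_s_r @ torch.linalg.inv(E_r_r + lam2 * eye)
    return W_r + dW_r                             # updated remainder, quantized in later iterations
```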
Experiments