Turbo Sparse: an exploration of LLM sparsity

Observations on llama sparsity

The FFN computation in the original Llama model is:

\[f(x) = \left(\text{SiLU}(xW_{gate}) \odot xW_{up}\right) W_{down} \]

import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    def forward(self, x):
        # w1: gate projection, w3: up projection, w2: down projection;
        # SiLU is applied to the gate branch before the element-wise product.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
Model                  Sparsity
Llama-2-7B             40%
ReLULlama-7B           67%
ShiftedReLULlama-7B    71%

The paper measures the sparsity of the FFN layer in the first transformer block: the native (SiLU) FFN is only 40% sparse, replacing SiLU with ReLU raises the sparsity to 67%, and ShiftedReLU improves it further to 71%.
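The counting procedure itself is not spelled out here; as a rough illustration, sparsity can be read as the fraction of FFN intermediate activations whose magnitude is (near) zero for a given input. A minimal sketch, where the function name and the eps threshold are illustrative, not from the paper:

import torch

def activation_sparsity(acts: torch.Tensor, eps: float = 0.0) -> float:
    # `acts` is assumed to be the tensor fed into the down projection,
    # e.g. silu(x @ W_gate) * (x @ W_up). With ReLU-style activations,
    # eps=0 counts exact zeros.
    return (acts.abs() <= eps).float().mean().item()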
Looking at the FFN computation, it appears at first glance that only the Gate branch acts as the gate controlling how sparse the computation is. In fact, the Up and Gate branches together determine the sparsity, which naturally leads to the dReLU enhancement:

\[\text{Combined}_{\text{dReLU}}(x) := \max(0,\ xW_{gate}) \odot \max(0,\ xW_{up}) \]
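Plugged into the FeedForward module above, the dReLU variant only changes the activation on the two branches. A minimal sketch, assuming the same w1 (gate), w3 (up), w2 (down) layer names; it illustrates the formula rather than reproducing the authors' code:

import torch.nn.functional as F

class DReLUFeedForward(FeedForward):
    def forward(self, x):
        # ReLU on both the gate and the up branch: an output element is
        # non-zero only if both branches are positive at that position.
        return self.w2(F.relu(self.w1(x)) * F.relu(self.w3(x)))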

In terms of training, the replacement has no effect on convergence and only a negligible effect on the evaluation metrics of the resulting model.

The next step is to evaluate the sparsity after this modification. Instead of taking the intersection of the two branch masks directly, the evaluation follows a top-k approach:

\[\text{Mask}(x) := \text{Top}_k(|\text{Combined}(x)|) \]

\[\text{Gated-MLP}(x) := (\text{Combined}(x) \odot \text{Mask}(x))\, W_{down} \]
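A sketch of how this top-k evaluation could look in code, assuming `combined` holds Combined(x) for a batch of tokens and `k` is the per-token activation budget (both names, and the dense matmul at the end, are illustrative; a real sparse kernel would skip the masked rows of W_down entirely):

import torch

def gated_mlp_topk(combined: torch.Tensor, w_down: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k largest-magnitude entries of Combined(x) per token.
    topk_idx = combined.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(combined)
    mask.scatter_(-1, topk_idx, 1.0)
    # Zero out the rest, then apply the down projection.
    return (combined * mask) @ w_down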

The effect is clearly significant: without hurting the model's performance, sparsity reaches 80%, and by sacrificing some accuracy it can reach 90%.

Sparsity of Sparsified Models