Observations on Llama sparsity
The FFN computation in the original Llama model is a SwiGLU block: the Gate projection is passed through SiLU and then multiplied elementwise by the Up projection before the Down projection:

```python
import torch.nn as nn, torch.nn.functional as F

class FeedForward(nn.Module):
    def forward(self, x):
        # w1: Gate, w3: Up, w2: Down; SiLU activates the Gate branch
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```
| Model | Sparsity |
|---|---|
| Llama-2-7B | 40% |
| ReLULlama-7B | 67% |
| ShiftedReLULlama-7B | 71% |
The paper measures the activation sparsity of the FFN in the first transformer block: the native FFN (with SiLU) is only 40% sparse, replacing SiLU with ReLU raises sparsity to 67%, and ShiftedReLU pushes it further to 71%.
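For reference, a minimal sketch of how such an activation sparsity number can be computed, assuming sparsity means the fraction of (near-)zero channels in the FFN hidden activation (the function name and `eps` threshold are my own choices, not from the paper):

```python
import torch

def activation_sparsity(hidden: torch.Tensor, eps: float = 1e-6) -> float:
    # `hidden` is the FFN intermediate activation, shape (num_tokens, hidden_dim).
    # Sparsity = fraction of channels whose magnitude is (near) zero.
    return (hidden.abs() <= eps).float().mean().item()
```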
Looking at the FFN computation, it is ostensibly the Gate branch that acts as the gate controlling sparsity. In fact the Up and Gate branches together determine which parts of the computation can be skipped, which naturally leads to the dReLU enhancement: applying ReLU to both the Gate and Up projections, as sketched below.
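A minimal sketch of a dReLU-style FFN under that reading (the w1/w2/w3 naming follows the snippet above; the class name and dimensions are illustrative, not the paper's implementation):

```python
import torch.nn.functional as F
from torch import nn

class DReLUFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # Gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # Up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # Down projection

    def forward(self, x):
        # ReLU on both Gate and Up: a hidden channel contributes only when
        # both branches are positive, so far more channels are exactly zero.
        return self.w2(F.relu(self.w1(x)) * F.relu(self.w3(x)))
```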
In terms of training, the replacement has no effect on convergence and only a minor effect on the final evaluation metrics.
The next step is to evaluate the sparsity after the modification. Instead of directly taking the intersection of the two branches' activation masks, the evaluation follows a top-k approach: only the largest-magnitude hidden activations are kept.
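A minimal sketch of such a top-k evaluation, assuming it means zeroing all but the top-k hidden channels per token before the Down projection (the function name and `keep_ratio` parameter are my own, not from the post):

```python
import torch

def keep_topk(hidden: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    # Zero out all but the largest-magnitude channels of the FFN hidden state.
    # hidden: shape (..., hidden_dim); keep_ratio=0.2 corresponds to 80% sparsity.
    k = max(1, int(hidden.shape[-1] * keep_ratio))
    topk = hidden.abs().topk(k, dim=-1)
    mask = torch.zeros_like(hidden)
    mask.scatter_(-1, topk.indices, 1.0)
    return hidden * mask
```

In the dReLU FFN above, this would sit between the elementwise product and w2, e.g. `self.w2(keep_topk(F.relu(self.w1(x)) * F.relu(self.w3(x)), 0.2))`.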
The effect is clearly significant: sparsity reaches 80% without affecting the model's performance, and can reach 90% at the cost of some accuracy.