Quantization schemes balance inference speed, memory footprint, and model quality by reducing the numerical precision of model parameters.
The following analyzes the differences between the main quantization schemes and the scenarios each is suited to:
1. Quantization naming rules
- Basic format: `Q<bit count>_<variant>`, for example:
  - Q4_K_M: 4-bit quantization with mixed-precision optimization
  - Q5_K_S: 5-bit quantization, simplified mixed-precision variant
  - Q8_0: 8-bit quantization, basic (non-K) format
- Variant meaning:
  - K: marks the k-quant family, which quantizes weights in blocks grouped into super-blocks with per-block scaling factors.
  - S/M/L: mixed-quantization strategy (S = simple, M = medium, L = large/complex), affecting how precision is allocated across layers.

A small name-parsing sketch follows this list.
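The helper below is a purely illustrative sketch of the `Q<bit count>_<variant>` naming convention described above; the function name and output format are my own, not part of any library.

```python
# Hypothetical helper: split a quantization type name into its parts,
# assuming the Q<bits>_<variant> convention described above.
def parse_quant_name(name: str) -> dict:
    parts = name.split("_")
    bits = int(parts[0][1:])          # "Q4" -> 4
    variant = "_".join(parts[1:])     # "K_M", "K_S", "0", ...
    return {"bits": bits, "variant": variant}

for n in ["Q4_K_M", "Q5_K_S", "Q8_0"]:
    print(n, parse_quant_name(n))
```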
2. Comparison of core quantization schemes
Quantization type | Bits | Typical layers applied | Model size (7B) | Perplexity increase (PPL↑) | Applicable scenarios
---|---|---|---|---|---
Q2_K | 2 | Some non-critical layers | 2.67GB | +100% | Extremely VRAM-constrained scenarios
Q3_K_M | 3 | Fully connected layers | 3.06GB | +37.4% | Low-VRAM devices needing fast inference
Q4_0 | 4 | All layers | 3.83GB | +38.3% | General lightweight use (being phased out)
Q4_K_S | 4 | All layers | 3.56GB | +17.6% | Balance of VRAM and quality
Q4_K_M | 4 | Attention layers + some fully connected layers | 4.08GB | +8.2% | Recommended for general scenarios
Q5_K_S | 5 | All layers | 4.65GB | +5.4% | High precision needed, moderate VRAM
Q5_K_M | 5 | Attention layers + some fully connected layers | 4.78GB | +6.36% | High-quality scenarios
Q6_K | 6 | All layers | 5.53GB | +0.1% | Approximates the original F16 model's accuracy
Q8_0 | 8 | All layers | 7.16GB | Almost lossless | Research and debugging; not recommended for production
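As a rough sanity check on the size column, model size scales with parameters × effective bits per weight. The sketch below is only a back-of-the-envelope estimate; the bits-per-weight values are illustrative assumptions that include some scale/metadata overhead, and real GGUF file sizes also depend on which tensors are kept at higher precision.

```python
# Rough size estimate: parameters * effective bits per weight / 8.
# The bits-per-weight figures are assumed, not exact format specifications.
def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{estimated_size_gb(7e9, bits):.2f} GB")
```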
3. Key technical differences
- Mixed-precision strategy
  - Q4_K_M: the attention layer wv and the feed-forward layer w2 use higher precision (such as Q6_K), while the remaining layers use Q4_K, balancing VRAM and quality.
  - Q5_K_S: a simplified hybrid strategy that applies 5-bit quantization across the whole model, trading a small amount of accuracy for faster inference.
- Block structure optimization (a minimal quantization sketch follows this list)
  - Q4_K_M uses super-blocks (8 blocks × 32 weights) with 6-bit quantized scaling factors, giving lower memory usage.
  - Q5_K_M adopts a more complex block layout, suited to tasks that demand high precision (such as code generation).
- Performance
  - Speed: Q4_K_S runs nearly 4× faster than F16 on an RTX 4080; Q5_K_M is slightly slower but more accurate.
  - Error control: the perplexity (PPL) of Q5_K_M is only 6.36% above the original model, versus +8.2% for Q4_K_M.
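To make the block idea concrete, here is a minimal sketch of block-wise 4-bit quantization with one scale per 32-weight block, assuming a simple symmetric scheme. Real k-quants (Q4_K, Q5_K, ...) go further: blocks are grouped into super-blocks and the scales themselves are quantized (e.g. to 6 bits), so this only illustrates the basic principle, not llama.cpp's exact format.

```python
import numpy as np

def quantize_block_q4(weights: np.ndarray):
    # One scale per 32-weight block; values are rounded into the int4 range [-8, 7].
    assert weights.size == 32
    scale = max(np.abs(weights).max() / 7.0, 1e-8)
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_block_q4(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct approximate weights from the quantized integers and the block scale.
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)
q, s = quantize_block_q4(block)
err = np.abs(block - dequantize_block_q4(q, s)).mean()
print(f"scale={s:.4f}  mean abs quantization error={err:.4f}")
```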
4. Selection suggestions
- Tight VRAM: choose Q4_K_M (4.08GB) to balance quality and resource consumption.
- High precision requirements: prefer Q5_K_M or Q6_K, which stay close to the original model's quality.
- Extreme lightweighting: Q3_K_M (3.06GB) outperforms Q4_0 with lower error.
- Debugging and research: use Q8_0 to observe near-lossless quantization, but it is not recommended for actual deployment.
5. Quantization effect example (7B model)
Quantization type | VRAM usage | Generation speed (tokens/s) | Text coherence
---|---|---|---
Q4_K_M | 6.58GB | 40 | Medium
Q5_K_M | 7.28GB | 35 | Higher
Q3_K_M | 5.80GB | 45 | Fair
(Test environment: RTX 4080 + 32GB RAM)
Through flexible per-layer strategies and mixed-precision design, quantization schemes strike a balance between performance and accuracy in resource-constrained scenarios. Q4_K_M and Q5_K_M are currently the most recommended options: the former suits general scenarios, while the latter suits tasks requiring higher accuracy. Developers can choose flexibly according to hardware conditions and task requirements, and can customize quantization strategies with the quantize tool; a brief loading sketch is given below.
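The snippet below is a minimal sketch of loading and running a quantized GGUF model via the llama-cpp-python bindings; the model path is a hypothetical placeholder, and the parameter values are illustrative rather than recommended settings.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-7b.Q4_K_M.gguf",  # hypothetical path to a GGUF file
    n_ctx=2048,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if VRAM allows
)

out = llm("Briefly explain model quantization.", max_tokens=64)
print(out["choices"][0]["text"])
```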
Link:/farwish/p/18768190