
Differences between quantization schemes (such as Q4_K_M and Q5_K_S)

Views: 539 · 2025-03-12 18:32:19

 

Quantization schemes balance inference speed, memory footprint, and model quality by reducing the precision of model parameters.

The main quantization schemes differ as follows, along with the scenarios each suits:


1. Quantization naming rules

  1. Basic format
    Names follow Q<bit count>_<variant type>, for example:

    • Q4_K_M: 4-bit quantization with mixed-precision optimization

    • Q5_K_S: 5-bit quantization, simplified mixed-precision variant

    • Q8_0: 8-bit quantization in the basic type-0 format (one scale per block, no offset)

  2. Variant type meanings

    • K: marks the k-quant family, which groups weights into blocks within super-blocks and quantizes the per-block scale factors themselves, yielding better accuracy per bit than the older flat formats

    • S/M/L: size of the mixed-precision mix (S = small, M = medium, L = large), controlling how much precision is allocated to different layers.
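The naming rule above can be sketched as a small parser. This is a hypothetical helper for illustration, not part of llama.cpp:

```python
# Parse a quantization type name like "Q4_K_M" into its components,
# following the Q<bits>_<variant> convention described above.

def parse_quant_name(name: str):
    """Return (bits, family, mix) for names like 'Q4_K_M' or 'Q8_0'."""
    parts = name.split("_")
    bits = int(parts[0][1:])                        # digits after the leading 'Q'
    family = parts[1] if len(parts) > 1 else None   # 'K' = k-quant family; '0'/'1' = legacy formats
    mix = parts[2] if len(parts) > 2 else None      # optional 'S'/'M'/'L' mixed-precision level
    return bits, family, mix

print(parse_quant_name("Q4_K_M"))  # (4, 'K', 'M')
print(parse_quant_name("Q8_0"))    # (8, '0', None)
```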


2. Comparison of core quantization schemes

| Quantization type | Bits | Typical application layers | Model size (7B) | Perplexity increase (PPL) | Applicable scenarios |
| --- | --- | --- | --- | --- | --- |
| Q2_K | 2 | Some non-critical layers | 2.67 GB | +100% | Extremely VRAM-constrained scenarios |
| Q3_K_M | 3 | Fully connected layers | 3.06 GB | +37.4% | Low-VRAM devices needing fast inference |
| Q4_0 | 4 | All layers | 3.83 GB | +38.3% | Legacy lightweight option (being phased out) |
| Q4_K_S | 4 | All layers | 3.56 GB | +17.6% | Balance of VRAM and performance |
| Q4_K_M | 4 | Attention + some fully connected layers | 4.08 GB | +8.2% | Recommended for general scenarios |
| Q5_K_S | 5 | All layers | 4.65 GB | +5.4% | High precision with moderate VRAM |
| Q5_K_M | 5 | Attention + some fully connected layers | 4.78 GB | +6.36% | High-performance scenarios |
| Q6_K | 6 | All layers | 5.53 GB | +0.1% | Near original F16 model accuracy |
| Q8_0 | 8 | All layers | 7.16 GB | Almost lossless | Research and debugging; not recommended for production |
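The sizes in the table can be roughly reproduced from parameter count × effective bits per weight. Effective bpw is higher than the nominal bit count because scale/min metadata is stored per block; the bpw figures below are the commonly cited rates for llama.cpp k-quants, and the result is only an approximation (real files also contain non-quantized tensors and metadata):

```python
# Back-of-the-envelope model size: parameters * effective bpw / 8 bytes.
# Effective bpw includes per-block scale/min metadata, so it exceeds the
# nominal bit count (e.g. Q4_K is ~4.5 bpw, not 4.0).

EFFECTIVE_BPW = {
    "Q2_K": 2.5625, "Q3_K": 3.4375, "Q4_K": 4.5,
    "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5,
}

def estimate_size_gb(n_params: float, quant: str) -> float:
    """Approximate quantized model size in GB."""
    return n_params * EFFECTIVE_BPW[quant] / 8 / 1e9

print(f"{estimate_size_gb(7e9, 'Q4_K'):.2f} GB")  # ~3.94 GB, near the Q4_K_S/M rows
```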

 


3. Key technical differences

  1. Mixed precision strategy

    • Q4_K_M: the attention wv and feed-forward w2 tensors use higher precision (such as Q6_K), while other layers use Q4_K, balancing VRAM and quality.

    • Q5_K_S: a simplified mixing strategy that applies 5-bit quantization across the whole model, trading a small amount of accuracy for faster inference.

  2. Block structure optimization

    • Q4_K_M uses super-blocks (8 blocks × 32 weights) with 6-bit quantized scale factors, lowering memory usage.

    • Q5_K_M splits blocks more finely, suiting tasks that demand high precision (such as code generation).

  3. Performance

    • Speed: on an RTX 4080, Q4_K_S runs nearly 4× faster than F16; Q5_K_M is slightly slower but more accurate.

    • Error control: Q5_K_M's perplexity (PPL) is only 6.36% above the original model's, versus 8.2% for Q4_K_M.
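The super-block layout explains why Q4_K costs about 4.5 bits per weight rather than 4. Assuming the layout described above (8 sub-blocks × 32 weights, 6-bit sub-block scales and mins, plus two fp16 super-block scale factors), the arithmetic works out as:

```python
# Effective bits-per-weight of a Q4_K super-block: the 4-bit quants plus
# 6-bit sub-block scales/mins plus two fp16 super-block factors, spread
# over 256 weights.

weights = 8 * 32                  # 256 weights per super-block
quant_bits = weights * 4          # 4-bit quantized values
scale_bits = 8 * 6 + 8 * 6        # 6-bit scale + 6-bit min per sub-block
super_scale_bits = 2 * 16         # fp16 d and dmin for the super-block

bpw = (quant_bits + scale_bits + super_scale_bits) / weights
print(bpw)  # 4.5
```

The extra 0.5 bit per weight is the price of the per-block metadata that keeps quantization error low.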


4. Selection recommendations

  1. Tight VRAM: choose Q4_K_M (4.08 GB) to balance performance and resource consumption.

  2. High precision requirements: prefer Q5_K_M or Q6_K, which approach original-model quality.

  3. Extreme lightweighting: Q3_K_M (3.06 GB) beats Q4_0 with lower error.

  4. Debugging and research: use Q8_0 to observe near-lossless quantization, but it is not recommended for actual deployment.
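These recommendations can be encoded as a simple chooser. This is a hypothetical helper whose thresholds come from the size table in this article (7B model), not from llama.cpp itself:

```python
# Map an available-VRAM budget (GB) for a 7B model to a suggested quant
# type, following the recommendations above. Thresholds are rough,
# article-derived figures, not official guidance.

def pick_quant(vram_gb: float) -> str:
    if vram_gb >= 6:
        return "Q6_K"    # near-F16 accuracy
    if vram_gb >= 5:
        return "Q5_K_M"  # high precision
    if vram_gb >= 4.2:
        return "Q4_K_M"  # recommended general choice
    return "Q3_K_M"      # extreme lightweighting

print(pick_quant(4.5))  # Q4_K_M
```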


5. Quantization effect example (7B model)

| Quantization type | VRAM usage | Generation speed (tokens/s) | Text consistency |
| --- | --- | --- | --- |
| Q4_K_M | 6.58 GB | 40 | Medium |
| Q5_K_M | 7.28 GB | 35 | Higher |
| Q3_K_M | 5.80 GB | 45 | Average |

(Test environment: RTX 4080 + 32 GB RAM)



Quantization schemes balance performance and accuracy in resource-constrained scenarios through flexible per-layer strategies and mixed-precision design. Q4_K_M and Q5_K_M are currently the most recommended options: the former suits general scenarios, the latter tasks demanding higher accuracy. Developers can choose flexibly based on hardware conditions and task requirements, and customize quantization strategies with the quantize tool.

 


Link:/farwish/p/18768190