Formula for estimating the GPU memory needed to run a large model:
\(M=\frac{P \times 4\text{B}}{32 / Q} \times 1.2\)
- M : required GPU memory, in GB
- P : number of parameters in the model, e.g. a 7B model has 7 billion parameters
- 4B : 4 bytes, the memory used by each parameter at full 32-bit precision
- 32 : the number of bits in 4 bytes
- Q : the number of bits used to load the model, e.g. 16, 8, or 4 bits
- 1.2 : a factor adding roughly 20% overhead for other content loaded into GPU memory
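As a quick sanity check, the formula can be wrapped in a small helper function. This is a minimal sketch; the function name and the default 1.2 overhead factor are illustrative, not part of any library.

```python
# Minimal sketch of the formula above; the function name and defaults are illustrative.
def estimate_gpu_memory_gb(params_billion: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (GB) needed to load a model with
    `params_billion` billion parameters at `quant_bits` bits per weight."""
    full_precision_gb = params_billion * 4        # 4 bytes per parameter at 32 bits
    return full_precision_gb / (32 / quant_bits) * overhead

print(round(estimate_gpu_memory_gb(7, 16), 1))    # 16.8 -> a 7B model loaded in 16-bit
```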
Memory footprint of commonly used large models
Model size | Quantization bits | GPU memory (GB) |
---|---|---|
1.5B | 4 | 0.9 |
1.5B | 8 | 1.8 |
1.5B | 16 | 3.6 |
7B | 4 | 4.2 |
7B | 8 | 8.4 |
7B | 16 | 16.8 |
9B | 4 | 5.4 |
9B | 8 | 10.8 |
9B | 16 | 21.6 |
40B | 4 | 24 |
40B | 8 | 48 |
40B | 16 | 96 |
70B | 4 | 42 |
70B | 8 | 84 |
70B | 16 | 168 |
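The table above follows directly from the formula; a short loop (a sketch, assuming the same 20% overhead factor) reproduces the rows.

```python
# Reproduce the table rows from the formula (illustrative only).
sizes_billion = [1.5, 7, 9, 40, 70]
bit_widths = [4, 8, 16]

for p in sizes_billion:
    for q in bit_widths:
        mem_gb = p * 4 / (32 / q) * 1.2      # same formula as above
        print(f"{p}B | {q} | {round(mem_gb, 1)}")
```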
The standard notation for quantized large models
It's common to see a quantized model name followed by a suffix such as q2_k, f16, q5_k_s, or q8_0. These suffixes describe how the model was quantized and are interpreted as follows:
Conventional quantization
Includes methods q4_0, q4_1, and q8_0.
E.g. q4_0 means each weight is quantized to 4 bits, and the trailing 0 identifies variant 0 of the scheme (a per-block scale only; q4_1 additionally stores a per-block minimum). A 4-bit weight can take one of 16 integer levels, which are mapped back to real values using the stored scale.
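The idea behind a scale-only 4-bit scheme can be illustrated with a toy round-trip. This is a simplified sketch, not llama.cpp's actual q4_0 block format; the function names are made up for illustration.

```python
import numpy as np

# Toy scale-only 4-bit quantization in the spirit of q4_0 (simplified; the real
# format also packs values and stores the per-block scale as a 16-bit float).
def quantize_q4_0_like(weights: np.ndarray):
    scale = np.abs(weights).max() / 7                              # map the largest weight to +/-7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # 16 integer levels
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)      # one block of 32 weights
q, scale = quantize_q4_0_like(block)
print(np.abs(block - dequantize(q, scale)).max())   # small reconstruction error
```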
K-quant quantization (k-quants)
Examples include q2_k and q5_k_s. Here different layers are quantized at different precisions, with bits allocated more intelligently than in conventional quantization. Dequantization works much like conventional quantization and is similarly fast.
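The "smarter bit allocation" can be pictured as giving different tensors different bit widths. The sketch below is hypothetical; the tensor names and the rule for adding bits are assumptions for illustration, not llama.cpp's real k-quant mixing logic.

```python
# Hypothetical per-tensor bit allocation illustrating the k-quant idea of
# spending more bits on the more quantization-sensitive tensors.
def choose_bits(tensor_name: str, base_bits: int = 2) -> int:
    # Assumption for illustration: treat output and down-projection
    # tensors as more sensitive and give them extra precision.
    if "output" in tensor_name or "ffn_down" in tensor_name:
        return base_bits + 3
    return base_bits

for name in ["attn_q", "attn_output", "ffn_up", "ffn_down"]:
    print(name, choose_bits(name), "bits")
```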