Formula for estimating the GPU memory needed to run a large model:
\(M=\frac{P \times 4\text{B}}{32 / Q} \times 1.2\)
- M : required GPU memory, in GB
- P : number of parameters in the model, e.g. a 7B model has 7 billion parameters
- 4B : 4 bytes, the memory used by each parameter at full 32-bit precision
- 32 : the number of bits in 4 bytes
- Q : the number of bits used to load the model, e.g. 16, 8, or 4 bits
- 1.2 : a factor adding roughly 20% overhead for other content loaded into GPU memory
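As a quick sanity check, the formula can be wrapped in a small helper function. This is a minimal sketch; the function name and the default 1.2 overhead factor are illustrative, not part of any library.

```python
# Minimal sketch of the formula above; the function name and defaults are illustrative.
def estimate_gpu_memory_gb(params_billion: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (GB) needed to load a model with
    `params_billion` billion parameters at `quant_bits` bits per weight."""
    full_precision_gb = params_billion * 4        # 4 bytes per parameter at 32 bits
    return full_precision_gb / (32 / quant_bits) * overhead

print(round(estimate_gpu_memory_gb(7, 16), 1))    # 16.8 -> a 7B model loaded in 16-bit
```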
Memory footprint of commonly used large models
Model size | Quantization bits | GPU memory (GB) |
---|---|---|
1.5B | 4 | 0.9 |
1.5B | 8 | 1.8 |
1.5B | 16 | 3.6 |
7B | 4 | 4.2 |
7B | 8 | 8.4 |
7B | 16 | 16.8 |
9B | 4 | 5.4 |
9B | 8 | 10.8 |
9B | 16 | 21.6 |
40B | 4 | 24 |
40B | 8 | 48 |
40B | 16 | 96 |
70B | 4 | 42 |
70B | 8 | 84 |
70B | 16 | 168 |
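The table above follows directly from the formula; a short loop (a sketch, assuming the same 20% overhead factor) reproduces the rows.

```python
# Reproduce the table rows from the formula (illustrative only).
sizes_billion = [1.5, 7, 9, 40, 70]
bit_widths = [4, 8, 16]

for p in sizes_billion:
    for q in bit_widths:
        mem_gb = p * 4 / (32 / q) * 1.2      # same formula as above
        print(f"{p}B | {q} | {round(mem_gb, 1)}")
```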
The standard notation for quantized large models
It's common to see a quantized model name followed by a suffix such as q2_k, f16, q5_k_s, or q8_0. These suffixes describe how the model was quantized and are interpreted as follows:
Conventional quantization
Includes methods q4_0, q4_1, and q8_0.
E.g. q4_0 means each weight is quantized to 4 bits, and the trailing 0 identifies variant 0 of the scheme (a per-block scale only; q4_1 additionally stores a per-block minimum). A 4-bit weight can take one of 16 integer levels, which are mapped back to real values using the stored scale.
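The idea behind a scale-only 4-bit scheme can be illustrated with a toy round-trip. This is a simplified sketch, not llama.cpp's actual q4_0 block format; the function names are made up for illustration.

```python
import numpy as np

# Toy scale-only 4-bit quantization in the spirit of q4_0 (simplified; the real
# format also packs values and stores the per-block scale as a 16-bit float).
def quantize_q4_0_like(weights: np.ndarray):
    scale = np.abs(weights).max() / 7                              # map the largest weight to +/-7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # 16 integer levels
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)      # one block of 32 weights
q, scale = quantize_q4_0_like(block)
print(np.abs(block - dequantize(q, scale)).max())   # small reconstruction error
```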
K-quant quantization (k-quants)
Examples include q2_k and q5_k_s. Here different layers are quantized at different precisions, with bits allocated more intelligently than in conventional quantization. Dequantization works much like conventional quantization and is similarly fast.
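The "smarter bit allocation" can be pictured as giving different tensors different bit widths. The sketch below is hypothetical; the tensor names and the rule for adding bits are assumptions for illustration, not llama.cpp's real k-quant mixing logic.

```python
# Hypothetical per-tensor bit allocation illustrating the k-quant idea of
# spending more bits on the more quantization-sensitive tensors.
def choose_bits(tensor_name: str, base_bits: int = 2) -> int:
    # Assumption for illustration: treat output and down-projection
    # tensors as more sensitive and give them extra precision.
    if "output" in tensor_name or "ffn_down" in tensor_name:
        return base_bits + 3
    return base_bits

for name in ["attn_q", "attn_output", "ffn_up", "ffn_down"]:
    print(name, choose_bits(name), "bits")
```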