
Open Source Large Model GPU Memory Calculator


Formula for estimating the GPU memory needed to run a large model (a short Python sketch follows the parameter definitions below):

\(M = \frac{P \times 4\text{B}}{32 / Q} \times 1.2\)

  • M : GPU memory required, in GB
  • P : Number of parameters in the model, e.g. a 7B model has 7 billion parameters
  • 4B : 4 bytes, the size of each parameter at full (32-bit) precision
  • 32 : Number of bits in those 4 bytes
  • Q : Number of bits used to load the model, e.g. 16 bits, 8 bits, 4 bits
  • 1.2 : A factor of 1.2, i.e. 20% overhead for other content held in GPU memory
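
A minimal Python sketch of this calculation (the function and argument names are illustrative, not part of the original calculator):

```python
def estimate_gpu_memory_gb(params_billion: float, q_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (GB) needed to load a model.

    params_billion : model size in billions of parameters (P)
    q_bits         : bit width used to load the model (Q), e.g. 16, 8, 4
    overhead       : 1.2 = 20% extra for other content held in GPU memory
    """
    full_precision_gb = params_billion * 4       # P * 4 bytes each = 4 GB per billion parameters
    return full_precision_gb / (32 / q_bits) * overhead

# Example: a 7B model loaded at 16 bits -> 7 * 4 / (32 / 16) * 1.2 = 16.8 GB
print(estimate_gpu_memory_gb(7, 16))             # 16.8
```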

Memory footprint of commonly used large models

Model size    Quantization bits    GPU memory (GB)
1.5B          4                    0.9
1.5B          8                    1.8
1.5B          16                   3.6
7B            4                    4.2
7B            8                    8.4
7B            16                   16.8
9B            4                    5.4
9B            8                    10.8
9B            16                   21.6
40B           4                    24
40B           8                    48
40B           16                   96
70B           4                    42
70B           8                    84
70B           16                   168
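
The table follows directly from the same formula; a short sketch that reproduces it (model sizes and bit widths hard-coded to match the rows above):

```python
# Reproduce the memory-footprint table from the estimation formula.
model_sizes_billion = [1.5, 7, 9, 40, 70]
bit_widths = [4, 8, 16]

print(f"{'Model size':<14}{'Bits':<8}{'GPU memory (GB)'}")
for p in model_sizes_billion:
    for q in bit_widths:
        mem_gb = p * 4 / (32 / q) * 1.2
        print(f"{f'{p}B':<14}{q:<8}{mem_gb:g}")
```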

Standard notation for quantized large models

Quantized large models are often labeled with suffixes such as q2_k, f16, q5_k_s, or q8_0. These suffixes describe how the model was quantized, and are interpreted as follows:

Conventional quantization

Includes methods q4_0, q4_1, and q8_0.

For example, in q4_0 the number 4 is the bit width, and the trailing 0 identifies the quantization variant: q4_0 stores only a per-block scale, while q4_1 additionally stores a per-block minimum. A 4-bit weight is thus mapped to one of 16 integer levels, and an 8-bit weight (q8_0) to one of 256 levels.
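
As a rough illustration of the idea (not llama.cpp's actual implementation), here is a minimal sketch of scale-only block quantization in the spirit of q8_0, using NumPy; a q4_1-style variant would additionally store a per-block minimum:

```python
import numpy as np

def quantize_block(weights: np.ndarray):
    """Quantize one block of float weights to 8-bit integers with a single scale.

    Scale-only ("_0"-style) quantization: store int8 values plus one float
    scale per block, so dequantization is simply q * scale.
    """
    scale = float(np.abs(weights).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                                  # all-zero block, avoid division by zero
    q = np.round(weights / scale).astype(np.int8)    # integers in [-127, 127]
    return q, scale

def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

block = np.random.randn(32).astype(np.float32)       # one 32-weight block
q, scale = quantize_block(block)
print(np.abs(block - dequantize_block(q, scale)).max())  # small reconstruction error
```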

K-quant quantization

Examples include q2_k and q5_k_s. Different layers are quantized at different precisions, with bits allocated in a smarter way than in conventional quantization; dequantization works much like it does for conventional quantization and is just as fast.
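
For reading such suffixes programmatically, here is a small sketch that splits a tag into its parts; the pattern it assumes (qN, an optional k, and an optional trailing letter) is inferred from the examples above rather than taken from any official specification:

```python
import re

def parse_quant_tag(tag: str) -> dict:
    """Split a quantization suffix such as 'q4_0', 'q8_0', 'q2_k' or 'q5_k_s' into its parts."""
    if tag.lower() == "f16":
        return {"bits": 16, "scheme": "16-bit float (not quantized)"}
    m = re.fullmatch(r"q(\d+)_([0-9k])(?:_([a-z]))?", tag.lower())
    if m is None:
        raise ValueError(f"unrecognized quantization tag: {tag}")
    bits, kind, variant = m.groups()
    scheme = "k-quant (mixed per-layer precision)" if kind == "k" else f"conventional, variant {kind}"
    return {"bits": int(bits), "scheme": scheme, "variant": variant}

for tag in ["q4_0", "q8_0", "q2_k", "q5_k_s", "f16"]:
    print(tag, parse_quant_tag(tag))
```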