Technical background
Large Language Models (LLMs) can reduce their memory/VRAM footprint and communication overhead through quantization, which in turn accelerates inference. The most common approach is to convert Float16 floating-point weights into low-precision integers, such as Int4. In the most extreme case, parameters can be reduced to binary values, i.e. only 0 and 1, but quantizing that aggressively usually degrades the model's inference quality noticeably. A common rule of thumb is to use Q8 for models below 70B and Q4 for models above 70B. The underlying principles, including symmetric and asymmetric quantization, are not covered here; we focus on the engineering side and use llama.cpp to carry out the quantization.
Install
Here we build llama.cpp from source on Ubuntu. First, clone it from GitHub:
$ git clone https://github.com/ggerganov/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 43657, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 43657 (delta 3), reused 5 (delta 1), pack-reused 43642 (from 3)
Receiving objects: 100% (43657/43657), 88.26 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (31409/31409), done.
It is best to create a virtual environment to avoid dependency conflicts; Python 3.10 is recommended:
# Create a virtual environment
$ conda create -n llama python=3.10
# Activate the virtual environment
$ conda activate llama
Enter the cloned directory and install all the Python dependencies:
$ cd llama.cpp/
$ python3 -m pip install -e .
Create a compilation directory and execute the compilation instructions:
$ mkdir build
$ cd build/
$ cmake ..
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Configuring done
-- Generating done
-- Build files have been written to: /datb/DeepSeek/llama.cpp/build
$ cmake --build . --config Release
Scanning dependencies of target ggml-base
[ 0%] Building C object ggml/src/CMakeFiles//
[ 1%] Building C object ggml/src/CMakeFiles//
[100%] Linking CXX executable ../../bin/llama-vdot
[100%] Built target llama-vdot
At this point the CPU version has been built successfully and can be used directly. If you need a GPU-accelerated build, refer to the next section; if that feels like too much trouble, it is fine to skip it.
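Before moving on, a quick sanity check can confirm that the binaries used later in this post were actually produced; this assumes the default layout, where the build places executables under build/bin/ (as the linking step above suggests):
# Run from inside the build/ directory
$ ls bin/ | grep llama-quantize
$ ./bin/llama-cli --version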
CUDA acceleration
To install the GPU version, you need to install some dependencies first:
$ sudo apt install curl libcurl4-openssl-dev
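Before configuring, it can also save time to confirm that a CUDA compiler and a working driver are visible at all; the nvcc location on your system may differ from the one used in the cmake command below:
$ nvcc --version     # the CUDA compiler that will be passed via -DCMAKE_CUDA_COMPILER
$ nvidia-smi         # confirms the driver and GPU are visible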
The main difference from the CPU version lies in the cmake configuration command (if the CPU version has already been compiled, it is best to clear the build directory first):
$ cmake .. -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
The extra flag -DCMAKE_CUDA_STANDARD=17 added here works around an issue reported in the llama.cpp repository; without it, the following error may occur:
CMake Error in ggml/src/ggml-cuda/CMakeLists.txt:
Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler
extensions), but CMake does not know the compile flags to use to enable it.
If all goes well, the following command will now compile everything successfully:
$ cmake --build . --config Release
But if you hit error messages like the ones I got, they need to be handled separately as follows.
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/vendors/cuda.h:6:10: fatal error: cuda_bf16.h: No such file or directory
#include <cuda_bf16.h>
^~~~~~~~~~~~~~~~
Compilation terminated.
This error means that the header file cannot be found, so I searched the environment with find / -name cuda_bf16.h and, after a while, it turned out that the header does exist:
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_bf16.h
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/triton/backends/nvidia/include/cuda_bf16.h
The fix is to add this path to CPATH:
$ export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/
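If you do not want to re-export this variable in every new shell, the same line can be appended to ~/.bashrc (purely a convenience; the path shown is specific to this machine):
$ echo 'export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/' >> ~/.bashrc
$ source ~/.bashrc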
If this error occurs:
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_fp16.h:4100:10: fatal error: nv/target: No such file or directory
#include <nv/target>
^~~~~~~~~~~~~
Compilation terminated.
This means the nv/target directory cannot be found. If a copy of it exists locally, you can add its location to CPATH as well:
$ export CPATH=/home/dechin/anaconda3/pkgs/cupy-core-13.3.0-py310h5da974a_2/lib/python3.10/site-packages/cupy/_core/include/cupy/_cccl/libcudacxx/:$CPATH
If the following errors are reported:
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/(138): error: identifier "cublasGetStatusString" is undefined
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/(417): error: A __device__ variable cannot be marked constexpr
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/(745): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000a126f_00000000-9_acc.compute_75.".
make[2]: *** [ggml/src/ggml-cuda/CMakeFiles//:82:ggml/src/ggml-cuda/CMakeFiles//] Error 1
make[1]: *** [CMakeFiles/Makefile2:1964:ggml/src/ggml-cuda/CMakeFiles//all] Error 2
make: *** [Makefile:160: all] Error 2
Then it is most likely a cuda-toolkit version problem; try installing CUDA 12:
$ conda install nvidia::cuda-toolkit
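Once the install finishes, check where nvcc now lives, since that is the path that goes into the -DCMAKE_CUDA_COMPILER option used below:
$ which nvcc
/home/dechin/anaconda3/envs/llama/bin/nvcc
$ nvcc --version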
If this problem comes up during the conda installation:
Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 132, in conda_http_errors
yield
File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 101, in repodata
response.raise_for_status()
File "/home/dechin/anaconda3/lib/python3.8/site-packages/requests/", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: /defaults/linux-64/current_repodata.json
That is likely a problem with the conda channels. You can delete the old channel configuration and use the default channels, or configure a mirror that is reachable from China:
$ conda config --remove-key channels
$ conda config --remove-key default_channels
$ conda config --append channels conda-forge
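You can verify the resulting channel configuration (and clear any stale cached metadata) before retrying the install:
$ conda config --show channels
$ conda clean -i     # drops the cached repodata that produced the 404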
After reinstalling, the path to nvcc changes, so remember to update the -DCMAKE_CUDA_COMPILER parameter at compile time:
$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
If the following error occurs:
-- Unable to find cuda_runtime.h in "/home/dechin/anaconda3/envs/llama/include" for CUDAToolkit_INCLUDE_DIR.
-- Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR)
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:151 (message):
CUDA Toolkit not found
-- Configuring incomplete, errors occurred!
See also "/datb/DeepSeek/llama//build/CMakeFiles/".
See also "/datb/DeepSeek/llama//build/CMakeFiles/".
This means the CUDAToolkit_INCLUDE_DIR path was not found; just add the include path explicitly to the cmake command:
$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17 -DCUDAToolkit_INCLUDE_DIR=/home/dechin/anaconda3/envs/llama/targets/x86_64-linux/include/ -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/
If errors still show up after all of the above, then I suggest using Docker, or simply running the quantization with the CPU build and serving the model through Ollama, which is more convenient.
Download Hugging Face Model
Since many of the already-quantized GGUF model files cannot be re-quantized, it is recommended to download the original safetensors model files from Hugging Face, convert them to a GGUF model with the Python script that ships with llama.cpp, and then use llama.cpp to quantize the model.
As for downloading the model: since access to Hugging Face is sometimes restricted, the first recommendation here is the ModelScope platform in China. To download models from ModelScope, first install the modelscope Python package:
$ python3 -m pip install modelscope
Looking in indexes: /simple
Requirement already satisfied: modelscope in /home/dechin/anaconda3/lib/python3.8/site-packages (1.22.3)
Requirement already satisfied: requests>=2.25 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (2.25.1)
Requirement already satisfied: urllib3>=1.26 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (1.26.5)
Requirement already satisfied: tqdm>=4.64.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from modelscope) (4.67.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2021.5.30)
Requirement already satisfied: chardet<5,>=3.0.2 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2.10)
Then use modelscope to download the model:
$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
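By default the files go into the ModelScope cache directory; if you prefer to download into a specific directory, the CLI also accepts a target path (check modelscope download --help in case your version differs):
$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --local_dir ./DeepSeek-R1-Distill-Qwen-32B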
If an error like the following occurs (if it does not, just ignore this and wait for the download to finish):
safetensors integrity check failed, expected sha256 signature is xxx
In that case, switch to downloading with git-lfs instead. Install it first:
$ sudo apt install git-lfs
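After installing the package, git-lfs normally has to be activated once for the current user, otherwise the large safetensors files come down as small pointer stubs instead of the real weights:
$ git lfs install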
Download the model:
$ git clone /deepseek-ai/
Cloning into 'DeepSeek-R1-Distill-Qwen-32B'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 52 (delta 17), reused 42 (delta 13), pack-reused 0
Unpacking objects: 100% (52/52), 2.27 MiB | 2.62 MiB/s, done.
Filtering content: 100% (8/8), 5.02 GiB | 912.00 KiB/s, done.
Encountered 8 file(s) that may not have been copied correctly on Windows:
See: `git lfs help smudge` for more details.
This process takes a long time, so please wait patiently until the model finishes downloading. Once the download is complete, inspect the directory:
$ cd DeepSeek-R1-Distill-Qwen-32B/
$ ll
total 63999072
drwxrwxr-x 4 dechin dechin 4096 February 12 19:22 ./
drwxrwxr-x 3 dechin dechin 4096 February 12 17:46 ../
-rw-rw-r-- 1 dechin dechin 664 February 12 17:46
-rw-rw-r-- 1 dechin dechin 73 February 12 17:46
drwxrwxr-x 2 dechin dechin 4096 February 12 17:46 figures/
-rw-rw-r-- 1 dechin dechin 181 February 12 17:46 generation_config.json
drwxrwxr-x 9 dechin dechin 4096 February 12 19:22 .git/
-rw-rw-r-- 1 dechin dechin 1519 February 12 17:46 .gitattributes
-rw-rw-r-- 1 dechin dechin 1064 February 12 17:46 LICENSE
-rw-rw-r-- 1 dechin dechin 8792578462 February 12 19:22
-rw-rw-r-- 1 dechin dechin 8776906899 February 12 19:03
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 19:18
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 18:56
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 18:38
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 19:19
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 19:15
-rw-rw-r-- 1 dechin dechin 4073821536 February 12 19:02
-rw-rw-r-- 1 dechin dechin 64018 February 12 17:46
-rw-rw-r-- 1 dechin dechin 18985 February 12 17:46
-rw-rw-r-- 1 dechin dechin 3071 February 12 17:46 tokenizer_config.json
-rw-rw-r-- 1 dechin dechin 7031660 February 12 17:46
This means the download succeeded.
HF model to GGUF model
In the compiled llama.cpp directory there is a Python script, convert_hf_to_gguf.py; first take a look at its usage:
$ python3 convert_hf_to_gguf.py --help
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}] [--bigendian] [--use-temp-file] [--no-lazy]
[--model-name MODEL_NAME] [--verbose] [--split-max-tensors SPLIT_MAX_TENSORS] [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
[--no-tensor-first-split] [--metadata METADATA] [--print-supported-models]
[model]
Convert a huggingface model to a GGML compatible file
positional arguments:
model directory containing model file
options:
-h, --help show this help message and exit
--vocab-only extract only the vocab
--outfile OUTFILE path to write to; default: based on input. {ftype} will be replaced by the outtype.
--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-
fidelity 16-bit float type depending on the first loaded tensor type
--bigendian model is executed on big endian machine
--use-temp-file use the tempfile library while processing (helpful when running out of memory, process killed)
--no-lazy use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
--model-name MODEL_NAME
name of the model
--verbose increase output verbosity
--split-max-tensors SPLIT_MAX_TENSORS
max tensors in each split
--split-max-size SPLIT_MAX_SIZE
max size per split N(M|G)
--dry-run only print out a split plan and exit, without writing any new files
--no-tensor-first-split
do not add tensors to the first split (disabled by default)
--metadata METADATA Specify the path for an authorship metadata override file
--print-supported-models
Print the supported models
Then run the conversion to build the GGUF file:
$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B --outfile /datb/DeepSeek/models/
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/datb/DeepSeek/models/: n_tensors = 771, total_size = 65.5G
Writing: 100%|██████████████████████████████████████████████████████████████| 65.5G/65.5G [19:42<00:00, 55.4Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /datb/DeepSeek/models/
After the conversion completes, a GGUF file is generated at the specified path, i.e. an all-in-one model file. For this 32B model it comes out at roughly 65.5 GB (16-bit weights), and it can now be used for the quantization step that follows.
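Since the converter accepts --outtype (see the help output above), it is also possible to write a smaller 16-bit or Q8_0 file directly at conversion time instead of quantizing afterwards; the output file name below is only an illustration:
$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B \
    --outtype q8_0 \
    --outfile /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf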
GGUF model quantization
In the build/bin/ path of the compiled llama.cpp you can find the quantization executable, llama-quantize:
$ ./llama-quantize --help
usage: ./llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] [] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
--imatrix file_name: use data in file_name as importance matrix for quant optimizations
--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: use importance matrix for this/these tensor(s)
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
--keep-split: will generate quantized model in the same shards as input
--override-kv KEY=TYPE:VALUE
Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together
Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
Here you can see the complete list of supported quantization precisions. For example, we can quantize the 32B model to q4_0 precision:
$ ./llama-quantize /datb/DeepSeek/models/ /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf q4_0
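The usage above also shows an optional thread count, and the K-quant types in the table generally give better quality at a similar size than the legacy q4_0; here is a variant of the same command (the input and output file names are placeholders for the files from the conversion step):
$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-F16.gguf \
    /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf Q4_K_M 8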
Comparison of the resulting file sizes (the Q8_0 file here is a pre-quantized Q8_0 model downloaded directly from the model repository):
-rw-rw-r-- 1 dechin dechin 65535969184 February 13 09:33
-rw-rw-r-- 1 dechin dechin 18640230304 February 13 09:51 DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
-rw-rw-r-- 1 dechin dechin 34820884384 February 9 01:44 DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf
From the unquantized GGUF to Q8 and then Q4, the memory footprint drops significantly. We can decide how aggressively to quantize based on the local hardware resources available.
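For reference, importing the quantized GGUF into Ollama only needs a minimal Modelfile; a sketch is shown here (the file path and model tag mirror the listing below, and the full walkthrough is in the article referenced in the next paragraph):
$ cat Modelfile
FROM /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
$ ollama create deepseek-r1:32b-q40 -f Modelfile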
After quantization is complete, you can refer to this article for the full steps to build and import the model into Ollama. Once the import succeeds, use ollama list to view all local models:
$ ollama list
NAME ID SIZE MODIFIED
deepseek-r1:32b-q2k 8d2a0c19f6e0 12 GB 5 seconds ago
deepseek-r1:32b-q40 13c7c287f615 18 GB 3 minutes ago
deepseek-r1:32b 91f2de3dd7fd 34 GB 42 hours ago
nomic-embed-text-v1.5:latest 5b3683392ccb 274 MB 43 hours ago
deepseek-r1:14b ea35dfe18182 9.0 GB 7 days ago
The q2k entry here is another locally quantized model, at Q2_K precision. Going from Q4_0 down to Q2_K does not shrink the memory footprint by much, which is why most people stop at Q4_0, a level that offers a good balance between performance and accuracy.
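If you want to sanity-check a quantized file with llama.cpp directly, without going through Ollama, the llama-cli binary built earlier can load the GGUF; a minimal smoke test run from the llama.cpp directory (the prompt and token count are arbitrary):
$ ./build/bin/llama-cli -m /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf -p "Hello, who are you?" -n 64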
Other error handling
If running the llama-quantize executable reports an error like:
./xxx/llama-quantize: error while loading shared libraries: : cannot open shared object file: No such file or directory
This means the dynamic-library search path LD_LIBRARY_PATH has not been set. You can either set it, or simply change into the build/bin/ directory and run the executable from there.
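Setting the variable is a one-liner; the directory below assumes the shared libraries ended up next to the binaries under build/bin/ (depending on the build options they may sit elsewhere under build/), so adjust it to wherever the .so files actually are:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/datb/DeepSeek/llama.cpp/build/bin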
Summary
This article mainly introduced how to use the llama.cpp toolchain for large models. Since Ollama is already being used to run the model, only the HF-to-GGUF conversion and the quantization of large models are covered here. Parameter quantization makes it possible to run DeepSeek's distilled models on limited local hardware budgets.
Copyright Statement
Original link of this article: /dechinphy/p/
Author ID: DechinPhy
More original articles: /dechinphy/
Buy the blogger a coffee: /dechinphy/gallery/image/