Technical background
Large Language Models (LLMs) can reduce their memory/VRAM footprint and communication overhead through quantization, which in turn accelerates inference. The most common approach is to convert Float16 floating-point weights into low-precision integers, such as Int4. In the most extreme case, parameters can be reduced to binary values, i.e. only 0 and 1, but quantizing that aggressively usually degrades the model's inference quality noticeably. A common rule of thumb is to use Q8 for models below 70B and Q4 for models above 70B. The underlying principles, including symmetric and asymmetric quantization, are not covered here; we focus on the engineering side and use llama.cpp to carry out the quantization.
Install
Here we build llama.cpp from source on Ubuntu. First, clone it from GitHub:
$ git clone https://github.com/ggerganov/llama.cpp.git
Cloning into 'llama.cpp'...
remote: Enumerating objects: 43657, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 43657 (delta 3), reused 5 (delta 1), pack-reused 43642 (from 3)
Receiving objects: 100% (43657/43657), 88.26 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (31409/31409), done.
It is best to create a virtual environment to avoid dependency conflicts; Python 3.10 is recommended:
# Create a virtual environment
$ conda create -n llama python=3.10
# Activate the virtual environment
$ conda activate llama
Enter the cloned directory and install all the Python dependencies:
$ cd llama.cpp/
$ python3 -m pip install -e .
Create a compilation directory and execute the compilation instructions:
$ mkdir build
$ cd build/
$ cmake ..
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- Including CPU backend
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- x86 detected
-- Adding CPU backend variant ggml-cpu: -march=native
-- Configuring done
-- Generating done
-- Build files have been written to: /datb/DeepSeek/llama.cpp/build
$ cmake --build . --config Release
Scanning dependencies of target ggml-base
[ 0%] Building C object ggml/src/CMakeFiles//
[ 1%] Building C object ggml/src/CMakeFiles//
[100%] Linking CXX executable ../../bin/llama-vdot
[100%] Built target llama-vdot
At this point the CPU version has been built successfully and can be used directly. If you need a GPU-accelerated build, refer to the next section; if that feels like too much trouble, it is fine to skip it.
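Before moving on, a quick sanity check can confirm that the binaries used later in this post were actually produced; this assumes the default layout, where the build places executables under build/bin/ (as the linking step above suggests):
# Run from inside the build/ directory
$ ls bin/ | grep llama-quantize
$ ./bin/llama-cli --version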
CUDA acceleration
To install the GPU version, you need to install some dependencies first:
$ sudo apt install curl libcurl4-openssl-dev
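Before configuring, it can also save time to confirm that a CUDA compiler and a working driver are visible at all; the nvcc location on your system may differ from the one used in the cmake command below:
$ nvcc --version     # the CUDA compiler that will be passed via -DCMAKE_CUDA_COMPILER
$ nvidia-smi         # confirms the driver and GPU are visible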
The main difference from the CPU version lies in the cmake configuration command (if the CPU version has already been compiled, it is best to clear the build directory first):
$ cmake .. -DCMAKE_CUDA_COMPILER=/usr/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
The extra flag -DCMAKE_CUDA_STANDARD=17 added here works around an issue reported in the llama.cpp repository; without it, the following error may occur:
CMake Error in ggml/src/ggml-cuda/CMakeLists.txt:
Target "ggml-cuda" requires the language dialect "CUDA17" (with compiler
extensions), but CMake does not know the compile flags to use to enable it.
If all goes well, the following command will now compile everything successfully:
$ cmake --build . --config Release
But if you hit error messages like the ones I got, they need to be handled separately as follows.
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/vendors/cuda.h:6:10: fatal error: cuda_bf16.h: No such file or directory
#include <cuda_bf16.h>
^~~~~~~~~~~~~~~~
Compilation terminated.
This error means that the header file cannot be found, so I searched the environment with find / -name cuda_bf16.h and, after a while, it turned out that the header does exist:
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_bf16.h
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/triton/backends/nvidia/include/cuda_bf16.h
The fix is to add this path to CPATH:
$ export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/
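If you do not want to re-export this variable in every new shell, the same line can be appended to ~/.bashrc (purely a convenience; the path shown is specific to this machine):
$ echo 'export CPATH=$CPATH:/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/' >> ~/.bashrc
$ source ~/.bashrc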
If this error occurs:
/home/dechin/anaconda3/envs/llama/lib/python3.10/site-packages/nvidia/cuda_runtime/include/cuda_fp16.h:4100:10: fatal error: nv/target: No such file or directory
#include <nv/target>
^~~~~~~~~~~~~
Compilation terminated.
This means the nv/target directory cannot be found. If a copy of it exists locally, you can add its location to CPATH as well:
$ export CPATH=/home/dechin/anaconda3/pkgs/cupy-core-13.3.0-py310h5da974a_2/lib/python3.10/site-packages/cupy/_core/include/cupy/_cccl/libcudacxx/:$CPATH
If the following errors are reported:
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/(138): error: identifier "cublasGetStatusString" is undefined
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/(417): error: A __device__ variable cannot be marked constexpr
/datb/DeepSeek/llama.cpp/ggml/src/ggml-cuda/(745): error: identifier "CUBLAS_TF32_TENSOR_OP_MATH" is undefined
3 errors detected in the compilation of "/tmp/tmpxft_000a126f_00000000-9_acc.compute_75.".
make[2]: *** [ggml/src/ggml-cuda/CMakeFiles//:82:ggml/src/ggml-cuda/CMakeFiles//] Error 1
make[1]: *** [CMakeFiles/Makefile2:1964:ggml/src/ggml-cuda/CMakeFiles//all] Error 2
make: *** [Makefile:160: all] Error 2
Then it is most likely a cuda-toolkit version problem; try installing CUDA 12:
$ conda install nvidia::cuda-toolkit
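Once the install finishes, check where nvcc now lives, since that is the path that goes into the -DCMAKE_CUDA_COMPILER option used below:
$ which nvcc
/home/dechin/anaconda3/envs/llama/bin/nvcc
$ nvcc --version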
If this problem comes up during the conda installation:
Collecting package metadata (current_repodata.json): failed
# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
Traceback (most recent call last):
File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 132, in conda_http_errors
yield
File "/home/dechin/anaconda3/lib/python3.8/site-packages/conda/gateways/repodata/__init__.py", line 101, in repodata
response.raise_for_status()
File "/home/dechin/anaconda3/lib/python3.8/site-packages/requests/", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: /defaults/linux-64/current_repodata.json
That is likely a problem with the conda channels. You can delete the old channel configuration and use the default channels, or configure a mirror that is reachable from China:
$ conda config --remove-key channels
$ conda config --remove-key default_channels
$ conda config --append channels conda-forge
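You can verify the resulting channel configuration (and clear any stale cached metadata) before retrying the install:
$ conda config --show channels
$ conda clean -i     # drops the cached repodata that produced the 404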
After reinstalling, the path to nvcc changes, so remember to update the -DCMAKE_CUDA_COMPILER parameter at compile time:
$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17
If the following error occurs:
-- Unable to find cuda_runtime.h in "/home/dechin/anaconda3/envs/llama/include" for CUDAToolkit_INCLUDE_DIR.
-- Could NOT find CUDAToolkit (missing: CUDAToolkit_INCLUDE_DIR)
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:151 (message):
CUDA Toolkit not found
-- Configuring incomplete, errors occurred!
See also "/datb/DeepSeek/llama//build/CMakeFiles/".
See also "/datb/DeepSeek/llama//build/CMakeFiles/".
This means the CUDAToolkit_INCLUDE_DIR path was not found; just add the include path explicitly to the cmake command:
$ cmake .. -DCMAKE_CUDA_COMPILER=/home/dechin/anaconda3/envs/llama/bin/nvcc -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF -DLLAMA_CURL=ON -DCMAKE_CUDA_STANDARD=17 -DCUDAToolkit_INCLUDE_DIR=/home/dechin/anaconda3/envs/llama/targets/x86_64-linux/include/ -DCURL_LIBRARY=/usr/lib/x86_64-linux-gnu/
If errors still show up after all of the above, then I suggest using Docker, or simply running the quantization with the CPU build and serving the model through Ollama, which is more convenient.
Download Hugging Face Model
Since many of the already-quantized GGUF model files cannot be re-quantized, it is recommended to download the original safetensors model files from Hugging Face, convert them to a GGUF model with the Python script that ships with llama.cpp, and then use llama.cpp to quantize the model.
As for downloading the model: since access to Hugging Face is sometimes restricted, the first recommendation here is the ModelScope platform in China. To download models from ModelScope, first install the modelscope Python package:
$ python3 -m pip install modelscope
Looking in indexes: /simple
Requirement already satisfied: modelscope in /home/dechin/anaconda3/lib/python3.8/site-packages (1.22.3)
Requirement already satisfied: requests>=2.25 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (2.25.1)
Requirement already satisfied: urllib3>=1.26 in /home/dechin/.local/lib/python3.8/site-packages (from modelscope) (1.26.5)
Requirement already satisfied: tqdm>=4.64.0 in /home/dechin/anaconda3/lib/python3.8/site-packages (from modelscope) (4.67.1)
Requirement already satisfied: certifi>=2017.4.17 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2021.5.30)
Requirement already satisfied: chardet<5,>=3.0.2 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /home/dechin/.local/lib/python3.8/site-packages (from requests>=2.25->modelscope) (2.10)
Then use modelscope to download the model:
$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
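By default the files go into the ModelScope cache directory; if you prefer to download into a specific directory, the CLI also accepts a target path (check modelscope download --help in case your version differs):
$ modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --local_dir ./DeepSeek-R1-Distill-Qwen-32B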
If an error like the following occurs (if it does not, just ignore this and wait for the download to finish):
safetensors integrity check failed, expected sha256 signature is xxx
In that case, switch to downloading with git-lfs instead. Install it first:
$ sudo apt install git-lfs
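After installing the package, git-lfs normally has to be activated once for the current user, otherwise the large safetensors files come down as small pointer stubs instead of the real weights:
$ git lfs install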
Download the model:
$ git clone /deepseek-ai/
Cloning into 'DeepSeek-R1-Distill-Qwen-32B'...
remote: Enumerating objects: 52, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 52 (delta 17), reused 42 (delta 13), pack-reused 0
Unpacking objects: 100% (52/52), 2.27 MiB | 2.62 MiB/s, done.
Filtering content: 100% (8/8), 5.02 GiB | 912.00 KiB/s, done.
Encountered 8 file(s) that may not have been copied correctly on Windows:
See: `git lfs help smudge` for more details.
This process takes a long time, so please wait patiently until the model finishes downloading. Once the download is complete, inspect the directory:
$ cd DeepSeek-R1-Distill-Qwen-32B/
$ ll
total 63999072
drwxrwxr-x 4 dechin dechin 4096 February 12 19:22 ./
drwxrwxr-x 3 dechin dechin 4096 February 12 17:46 ../
-rw-rw-r-- 1 dechin dechin 664 February 12 17:46
-rw-rw-r-- 1 dechin dechin 73 February 12 17:46
drwxrwxr-x 2 dechin dechin 4096 February 12 17:46 figures/
-rw-rw-r-- 1 dechin dechin 181 February 12 17:46 generation_config.json
drwxrwxr-x 9 dechin dechin 4096 February 12 19:22 .git/
-rw-rw-r-- 1 dechin dechin 1519 February 12 17:46 .gitattributes
-rw-rw-r-- 1 dechin dechin 1064 February 12 17:46 LICENSE
-rw-rw-r-- 1 dechin dechin 8792578462 February 12 19:22
-rw-rw-r-- 1 dechin dechin 8776906899 February 12 19:03
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 19:18
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 18:56
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 18:38
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 19:19
-rw-rw-r-- 1 dechin dechin 8776906927 February 12 19:15
-rw-rw-r-- 1 dechin dechin 4073821536 February 12 19:02
-rw-rw-r-- 1 dechin dechin 64018 February 12 17:46
-rw-rw-r-- 1 dechin dechin 18985 February 12 17:46
-rw-rw-r-- 1 dechin dechin 3071 February 12 17:46 tokenizer_config.json
-rw-rw-r-- 1 dechin dechin 7031660 February 12 17:46
This means the download succeeded.
HF model to GGUF model
In the compiled llama.cpp directory there is a Python script, convert_hf_to_gguf.py; first take a look at its usage:
$ python3 convert_hf_to_gguf.py --help
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}] [--bigendian] [--use-temp-file] [--no-lazy]
[--model-name MODEL_NAME] [--verbose] [--split-max-tensors SPLIT_MAX_TENSORS] [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
[--no-tensor-first-split] [--metadata METADATA] [--print-supported-models]
[model]
Convert a huggingface model to a GGML compatible file
positional arguments:
model directory containing model file
options:
-h, --help show this help message and exit
--vocab-only extract only the vocab
--outfile OUTFILE path to write to; default: based on input. {ftype} will be replaced by the outtype.
--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
output format - use f32 for float32, f16 for float16, bf16 for bfloat16, q8_0 for Q8_0, tq1_0 or tq2_0 for ternary, and auto for the highest-
fidelity 16-bit float type depending on the first loaded tensor type
--bigendian model is executed on big endian machine
--use-temp-file use the tempfile library while processing (helpful when running out of memory, process killed)
--no-lazy use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
--model-name MODEL_NAME
name of the model
--verbose increase output verbosity
--split-max-tensors SPLIT_MAX_TENSORS
max tensors in each split
--split-max-size SPLIT_MAX_SIZE
max size per split N(M|G)
--dry-run only print out a split plan and exit, without writing any new files
--no-tensor-first-split
do not add tensors to the first split (disabled by default)
--metadata METADATA Specify the path for an authorship metadata override file
--print-supported-models
Print the supported models
Then run the conversion to build the GGUF file:
$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B --outfile /datb/DeepSeek/models/
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/datb/DeepSeek/models/: n_tensors = 771, total_size = 65.5G
Writing: 100%|██████████████████████████████████████████████████████████████| 65.5G/65.5G [19:42<00:00, 55.4Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /datb/DeepSeek/models/
After the conversion completes, a GGUF file is generated at the specified path, i.e. an all-in-one model file. For this 32B model it comes out at roughly 65.5 GB (16-bit weights), and it can now be used for the quantization step that follows.
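Since the converter accepts --outtype (see the help output above), it is also possible to write a smaller 16-bit or Q8_0 file directly at conversion time instead of quantizing afterwards; the output file name below is only an illustration:
$ python3 convert_hf_to_gguf.py /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B \
    --outtype q8_0 \
    --outfile /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf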
GGUF model quantization
In the build/bin/ path of the compiled llama.cpp you can find the quantization executable, llama-quantize:
$ ./llama-quantize --help
usage: ./llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] [] type [nthreads]
--allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
--leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
--pure: Disable k-quant mixtures and quantize all tensors to the same type
--imatrix file_name: use data in file_name as importance matrix for quant optimizations
--include-weights tensor_name: use importance matrix for this/these tensor(s)
--exclude-weights tensor_name: use importance matrix for this/these tensor(s)
--output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
--token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
--keep-split: will generate quantized model in the same shards as input
--override-kv KEY=TYPE:VALUE
Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together
Allowed quantization types:
2 or Q4_0 : 4.34G, +0.4685 ppl @ Llama-3-8B
3 or Q4_1 : 4.78G, +0.4511 ppl @ Llama-3-8B
8 or Q5_0 : 5.21G, +0.1316 ppl @ Llama-3-8B
9 or Q5_1 : 5.65G, +0.1062 ppl @ Llama-3-8B
19 or IQ2_XXS : 2.06 bpw quantization
20 or IQ2_XS : 2.31 bpw quantization
28 or IQ2_S : 2.5 bpw quantization
29 or IQ2_M : 2.7 bpw quantization
24 or IQ1_S : 1.56 bpw quantization
31 or IQ1_M : 1.75 bpw quantization
36 or TQ1_0 : 1.69 bpw ternarization
37 or TQ2_0 : 2.06 bpw ternarization
10 or Q2_K : 2.96G, +3.5199 ppl @ Llama-3-8B
21 or Q2_K_S : 2.96G, +3.1836 ppl @ Llama-3-8B
23 or IQ3_XXS : 3.06 bpw quantization
26 or IQ3_S : 3.44 bpw quantization
27 or IQ3_M : 3.66 bpw quantization mix
12 or Q3_K : alias for Q3_K_M
22 or IQ3_XS : 3.3 bpw quantization
11 or Q3_K_S : 3.41G, +1.6321 ppl @ Llama-3-8B
12 or Q3_K_M : 3.74G, +0.6569 ppl @ Llama-3-8B
13 or Q3_K_L : 4.03G, +0.5562 ppl @ Llama-3-8B
25 or IQ4_NL : 4.50 bpw non-linear quantization
30 or IQ4_XS : 4.25 bpw non-linear quantization
15 or Q4_K : alias for Q4_K_M
14 or Q4_K_S : 4.37G, +0.2689 ppl @ Llama-3-8B
15 or Q4_K_M : 4.58G, +0.1754 ppl @ Llama-3-8B
17 or Q5_K : alias for Q5_K_M
16 or Q5_K_S : 5.21G, +0.1049 ppl @ Llama-3-8B
17 or Q5_K_M : 5.33G, +0.0569 ppl @ Llama-3-8B
18 or Q6_K : 6.14G, +0.0217 ppl @ Llama-3-8B
7 or Q8_0 : 7.96G, +0.0026 ppl @ Llama-3-8B
1 or F16 : 14.00G, +0.0020 ppl @ Mistral-7B
32 or BF16 : 14.00G, -0.0050 ppl @ Mistral-7B
0 or F32 : 26.00G @ 7B
COPY : only copy tensors, no quantizing
Here you can see the complete list of supported quantization precisions. For example, we can quantize the 32B model to q4_0 precision:
$ ./llama-quantize /datb/DeepSeek/models/ /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf q4_0
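The usage above also shows an optional thread count, and the K-quant types in the table generally give better quality at a similar size than the legacy q4_0; here is a variant of the same command (the input and output file names are placeholders for the files from the conversion step):
$ ./llama-quantize /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-F16.gguf \
    /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf Q4_K_M 8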
Comparison of the resulting file sizes (the Q8_0 file here is a pre-quantized Q8_0 model downloaded directly from the model repository):
-rw-rw-r-- 1 dechin dechin 65535969184 February 13 09:33
-rw-rw-r-- 1 dechin dechin 18640230304 February 13 09:51 DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
-rw-rw-r-- 1 dechin dechin 34820884384 February 9 01:44 DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf
From the unquantized GGUF to Q8 and then Q4, the memory footprint drops significantly. We can decide how aggressively to quantize based on the local hardware resources available.
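For reference, importing the quantized GGUF into Ollama only needs a minimal Modelfile; a sketch is shown here (the file path and model tag mirror the listing below, and the full walkthrough is in the article referenced in the next paragraph):
$ cat Modelfile
FROM /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf
$ ollama create deepseek-r1:32b-q40 -f Modelfile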
After quantization is complete, you can refer to this article for the full steps to build and import the model into Ollama. Once the import succeeds, use ollama list to view all local models:
$ ollama list
NAME ID SIZE MODIFIED
deepseek-r1:32b-q2k 8d2a0c19f6e0 12 GB 5 seconds ago
deepseek-r1:32b-q40 13c7c287f615 18 GB 3 minutes ago
deepseek-r1:32b 91f2de3dd7fd 34 GB 42 hours ago
nomic-embed-text-v1.5:latest 5b3683392ccb 274 MB 43 hours ago
deepseek-r1:14b ea35dfe18182 9.0 GB 7 days ago
The q2k entry here is another locally quantized model, at Q2_K precision. Going from Q4_0 down to Q2_K does not shrink the memory footprint by much, which is why most people stop at Q4_0, a level that offers a good balance between performance and accuracy.
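If you want to sanity-check a quantized file with llama.cpp directly, without going through Ollama, the llama-cli binary built earlier can load the GGUF; a minimal smoke test run from the llama.cpp directory (the prompt and token count are arbitrary):
$ ./build/bin/llama-cli -m /datb/DeepSeek/models/DeepSeek-R1-Distill-Qwen-32B-Q4_0.gguf -p "Hello, who are you?" -n 64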
Other error handling
If running the llama-quantize executable reports an error like:
./xxx/llama-quantize: error while loading shared libraries: : cannot open shared object file: No such file or directory
This means the dynamic-library search path LD_LIBRARY_PATH has not been set. You can either set it, or simply change into the build/bin/ directory and run the executable from there.
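Setting the variable is a one-liner; the directory below assumes the shared libraries ended up next to the binaries under build/bin/ (depending on the build options they may sit elsewhere under build/), so adjust it to wherever the .so files actually are:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/datb/DeepSeek/llama.cpp/build/bin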
Summary
This article mainly introduced how to use the llama.cpp toolchain for large models. Since Ollama is already being used to run the model, only the HF-to-GGUF conversion and the quantization of large models are covered here. Parameter quantization makes it possible to run DeepSeek's distilled models on limited local hardware budgets.
Copyright Statement
Original link of this article: /dechinphy/p/
Author ID: DechinPhy
More original articles: /dechinphy/
Buy the blogger a coffee: /dechinphy/gallery/image/