llama.cpp is the underlying implementation of Ollama, LM Studio, and many other popular projects, and it is also one of the inference engines supported by GPUStack. It provides the GGUF model file format, a binary format for storing models for inference that is designed for fast loading and running of models.
Quantized models are also supported, reducing model storage and computation requirements while maintaining high accuracy, which allows large models to be deployed efficiently, and with faster inference, on desktops, embedded devices, and other resource-constrained environments.
This post is a tutorial on how to create and quantize GGUF models and upload them to the HuggingFace and ModelScope model repositories.
Registering and Configuring HuggingFace and ModelScope
- Register for HuggingFace
Visit https://huggingface.co/join and sign up for a HuggingFace account (access to huggingface.co may require a proxy in some regions).
- Configure the HuggingFace SSH public key
Add the SSH public key of your local environment to HuggingFace. Check the SSH public key of your local environment (if you don't have one, generate it with ssh-keygen -t rsa -b 4096):
cat ~/.ssh/id_rsa.pub
Click your avatar in the upper right corner of HuggingFace, select Settings - SSH and GPG Keys, and add the public key above. It will be used for authentication when uploading the model later.
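Optionally, you can check that the key works before uploading later (a hedged example; hf.co is the SSH host HuggingFace uses for Git, and a successful check greets you by username):
ssh -T git@hf.co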
- Register for ModelScope
Visit https://www.modelscope.cn/register?back=%2Fhome and register a ModelScope account.
- Get the ModelScope Git access token
Visit https://www.modelscope.cn/my/myaccesstoken, copy the Git access token, and save it for authentication when uploading the model later.
Prepare the Environment
Create and activate a Conda environment (if Conda is not installed, refer to the Miniconda installation guide: https://docs.anaconda.com/miniconda/):
conda create -n llama-cpp python=3.12 -y
conda activate llama-cpp
which python
pip -V
Clone the llama.cpp code (the b4034 tag is used here) and compile the binaries needed for quantization:
cd ~
git clone -b b4034 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
brew install cmake
make
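The build can take a while; if you want to speed it up, make accepts a parallel-jobs flag (adjust the number to your CPU core count):
make -j 8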
Once the compilation is complete, run the following command to confirm that the llama-quantize binary needed for quantization is available:
./llama-quantize --help
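You can also confirm that the conversion script used later is available; convert_hf_to_gguf.py ships in the root of the llama.cpp repository (a quick hedged check):
python3 convert_hf_to_gguf.py --help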
Download the original model
Download the original model that needs to be converted to GGUF format and quantized.
Download the model from HuggingFace using the huggingface-cli command provided by HuggingFace. First install the dependency:
pip install -U huggingface_hub
If you are downloading from mainland China, configure a mirror endpoint (hf-mirror.com is a commonly used mirror):
export HF_ENDPOINT=https://hf-mirror.com
Here we download the meta-llama/Llama-3.2-3B-Instruct model, which is a Gated model. You need to fill out an access request on HuggingFace and confirm that you have been granted access:
Click your avatar in the upper right corner of HuggingFace, select Access Tokens, create a token with Read permission, and save it:
Download the meta-llama/Llama-3.2-3B-Instruct model into a working directory (~/models is used as an example here; adjust it to your environment). --local-dir specifies the directory to save to, and --token specifies the Access Token created above:
mkdir ~/models
cd ~/models/
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct --token hf_abcdefghijklmnopqrstuvwxyz
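Once the download finishes, it is worth listing the directory to make sure the weights and configuration files are all present before converting (a simple sanity check):
ls -lh Llama-3.2-3B-Instruct/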
Convert to GGUF Format and Quantize the Model
Create a script for converting to GGUF format and quantizing the model (the script name convert-and-quantize.sh is used as an example here):
cd ~/models/
vim convert-and-quantize.sh
Fill in the script contents below. Change the llama_cpp and b directory paths to the actual paths of your environment (they must be absolute paths), and change gpustack in the d variable to your HuggingFace username:
#!/usr/bin/env bash
llama_cpp="/Users/gpustack/llama.cpp"   # absolute path to the llama.cpp checkout
b="/Users/gpustack/models"              # absolute path to the working directory with the downloaded model
export PATH="$PATH:${llama_cpp}"
s="$1"
n="$(echo "${s}" | cut -d'/' -f2)"
d="gpustack/${n}-GGUF"
# prepare
mkdir -p ${b}/${d} 1>/dev/null 2>&1
pushd ${b}/${d} 1>/dev/null 2>&1
git init . 1>/dev/null 2>&1
if [[ ! -f .gitattributes ]]; then
cp -f ${b}/${s}/.gitattributes . 1>/dev/null 2>&1 || true
echo "*.gguf filter=lfs diff=lfs merge=lfs -text" >> .gitattributes
fi
if [[ ! -d assets ]]; then
cp -rf ${b}/${s}/assets . 1>/dev/null 2>&1 || true
fi
if [[ ! -d images ]]; then
cp -rf ${b}/${s}/images . 1>/dev/null 2>&1 || true
fi
if [[ ! -d imgs ]]; then
cp -rf ${b}/${s}/imgs . 1>/dev/null 2>&1 || true
fi
if [[ ! -f README.md ]]; then
cp -f ${b}/${s}/README.md . 1>/dev/null 2>&1 || true
fi
set -e
pushd ${llama_cpp} 1>/dev/null 2>&1
# convert
[[ -f venv/bin/activate ]] && source venv/bin/activate
echo "#### convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-"
python3 convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-
# quantize
qs=(
"Q8_0"
"Q6_K"
"Q5_K_M"
"Q5_0"
"Q4_K_M"
"Q4_0"
"Q3_K"
"Q2_K"
)
for q in "${qs[@]}"; do
echo "#### llama-quantize ${b}/${d}/${n}- ${b}/${d}/${n}-${q}.gguf ${q}"
llama-quantize ${b}/${d}/${n}- ${b}/${d}/${n}-${q}.gguf ${q}
ls -lth ${b}/${d}
sleep 3
done
popd 1>/dev/null 2>&1
set +e
Run the script to start the conversion. It converts the model to an FP16-precision GGUF model and then quantizes it with the Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K, and Q2_K methods:
bash convert-and-quantize.sh Llama-3.2-3B-Instruct
After the script finishes, confirm that the FP16-precision GGUF model and the quantized GGUF models were generated successfully.
The models are stored in the directory named after the user name:
ll gpustack/Llama-3.2-3B-Instruct-GGUF/
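Optionally, you can smoke-test one of the quantized files with llama-cli, which is built alongside llama-quantize in the llama.cpp directory (a hedged example; the paths assume the working directory and llama.cpp checkout used above):
~/llama.cpp/llama-cli -m gpustack/Llama-3.2-3B-Instruct-GGUF/Llama-3.2-3B-Instruct-Q4_K_M.gguf -p "Hello" -n 32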
Uploading models to HuggingFace
Click your avatar in the upper right corner of HuggingFace, select New Model, and create a model repository named after the original model in the format <original model name>-GGUF:
Update the README of the model:
cd ~/models/gpustack/Llama-3.2-3B-Instruct-GGUF
vim README.md
For maintainability, after the opening metadata, record the original model and the llama.cpp commit that was used. Note that these should be changed to match the actual original model and the actual llama.cpp commit you used:
# Llama-3.2-3B-Instruct-GGUF
**Model creator**: [meta-llama](https://huggingface.co/meta-llama)<br/>
**Original model**: [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)<br/>
**GGUF quantization**: based on llama.cpp release [b8deef0e](https://github.com/ggerganov/llama.cpp/commit/b8deef0ec0af5febac1d2cfd9119ff330ed0b762)
---
To prepare for the upload, install Git LFS to manage large file uploads:
brew install git-lfs
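After installing, Git LFS typically also needs to be initialized once for your Git configuration (harmless to run if it has already been done):
git lfs install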
Add a remote repository:
git remote add origin git@hf.co:gpustack/Llama-3.2-3B-Instruct-GGUF
Add the files, confirm the files to be committed with git ls-files, and confirm with git lfs ls-files that all .gguf files are managed and uploaded by Git LFS:
git add .
git ls-files
git lfs ls-files
To upload files larger than 5 GB to HuggingFace, you need to enable large file uploads. Log in to HuggingFace from the command line and enter the Token created in the download section above:
huggingface-cli login
Enable large file uploads for the current directory:
huggingface-cli lfs-enable-largefiles .
Upload the model to the HuggingFace repository:
git commit -m "feat: first commit" --signoff
git push origin main -f
Once the upload is complete, confirm in HuggingFace that the model file was uploaded successfully.
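If you want to double-check from the command line as well, one hedged way is to pull a single quantized file back from the new repository with huggingface-cli (replace gpustack with your own username):
huggingface-cli download gpustack/Llama-3.2-3B-Instruct-GGUF --include "*Q4_K_M.gguf" --local-dir /tmp/gguf-check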
Uploading models to ModelScope
Click your avatar in the upper right corner of ModelScope, select Create Model, and create a model repository named after the original model in the format <original model name>-GGUF. Fill in the model description and other configuration such as the License, model type, AI framework, and whether the model is public:
Upload the local repository's README.md file and create the repository:
Add the remote repository. The ModelScope Git access token obtained at the beginning of this article is used for authentication when uploading the model:
git remote add modelscope https://oauth2:xxxxxxxxxxxxxxxxxxxx@www.modelscope.cn/gpustack/Llama-3.2-3B-Instruct-GGUF.git
Fetch the files that already exist in the remote repository:
git fetch modelscope master
Since ModelScope uses the master branch rather than the main branch, you need to switch to the master branch and use cherry-pick to bring the files from the main branch onto the master branch. First check and note the current Commit ID:
git log
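If you prefer to capture the Commit ID directly rather than copying it out of the log, a convenient alternative is to ask Git for the tip of main:
git rev-parse main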
Switch to the master branch and, using the Commit ID of the main branch, bring the files from the main branch onto the master branch:
git checkout FETCH_HEAD -b master
git cherry-pick -n 833fb20e5b07231e66c677180f95e27376eb25c6
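At this point git status shows which files are in conflict; in this tutorial's setup that is typically the .gitattributes copied from the original model:
git status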
Modify the conflicting file to resolve the conflict (you can merge *.gguf filter=lfs diff=lfs merge=lfs -text into the original model's .gitattributes; refer to the related logic in the script above):
vim .gitattributes
Add the files, confirm the files to be committed with git ls-files, and confirm with git lfs ls-files that all .gguf files are managed and uploaded by Git LFS:
git add .
git ls-files
git lfs ls-files
Upload the model to the ModelScope repository:
git commit -m "feat: first commit" --signoff
git push modelscope master -f
When the upload is complete, confirm in ModelScope that the model file was uploaded successfully.
Summary
The above is a tutorial on creating and quantizing GGUF models and uploading them to the HuggingFace and ModelScope model repositories.
GGUF is the model file format that llama.cpp requires to run models. Its flexibility and efficiency make it well suited to model inference in resource-constrained scenarios, and it is widely used. I hope the tutorial above helps you manage GGUF model files.
If you found it useful, feel free to like, share, and follow.