Installing large models on Linux through Hugging Face
Hugging Face official website
/
wget /miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/ ; bash ~/ -b -p $HOME/miniconda ; eval "$($HOME/miniconda/bin/conda hook)" ; echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> ~/.bashrc ; source ~/.bashrc
conda create -n huggingface python=3.12
conda activate huggingface
pip install -U huggingface_hub
export HF_ENDPOINT=
export HF_HOME=/root/autodl-tmp/huggingface
(huggingface) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/huggingface# huggingface-cli login
_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
To log in, `huggingface_hub` requires a token generated from /settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) n
huggingface-cli download --resume-download deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
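The download lands in the Hugging Face cache under HF_HOME, and the later `vllm serve` commands point at a folder inside it. As a rough sketch of how that folder name is formed (the helper `hf_repo_cache_dir` is hypothetical and only mirrors the `models--<org>--<name>` naming seen in the paths below; it does not call `huggingface_hub`):

```python
import os
from pathlib import Path
from typing import Optional


def hf_repo_cache_dir(repo_id: str, hf_home: Optional[str] = None) -> Path:
    """Return the cache folder used for a model repo (naming convention only)."""
    # Use the explicit argument, then HF_HOME, then the usual default location.
    home = hf_home or os.environ.get("HF_HOME") or str(Path.home() / ".cache" / "huggingface")
    # "org/name" becomes "models--org--name" inside <HF_HOME>/hub.
    return Path(home) / "hub" / ("models--" + repo_id.replace("/", "--"))


repo_dir = hf_repo_cache_dir(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    hf_home="/root/autodl-tmp/huggingface",
)
print(repo_dir.as_posix())
# Each downloaded revision sits in <repo_dir>/snapshots/<commit hash>/
print((repo_dir / "snapshots").as_posix())
```

The snapshot hash itself is not predictable from the repo id, so it still has to be read from disk (for example with `ls` on the snapshots folder).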
Running the model with vLLM
/vllm-project/vllm
/en/latest/getting_started/
conda create -n vllm python=3.12
conda activate vllm
pip install vllm
After the installation succeeds, check the CUDA environment
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~#
python -c "import torch; print(torch.__version__)"
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.version.cuda)"
2.6.0+cu124
True
12.4
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~#
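The three one-liners above can also be folded into a single script. This is only a convenience sketch; the function name `cuda_report` is made up for illustration, and it returns placeholder values instead of failing when PyTorch is not installed:

```python
def cuda_report() -> dict:
    """Collect the PyTorch version, CUDA availability and CUDA build version."""
    try:
        import torch  # imported lazily so the script degrades gracefully without PyTorch
    except ImportError:
        return {"torch": None, "cuda_available": False, "cuda_version": None}
    return {
        "torch": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda_version": torch.version.cuda,  # None on CPU-only builds
    }


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```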
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~# vllm serve /root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60 --max_model_len 4096
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/# nvidia-smi
Wed Mar 26 17:00:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02 Driver Version: 550.107.02 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 0000000:5A:00.0 Off | Off |
| 30% 31C P8 27W / 450W | 22495MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/#
(base) root@autodl-container-16494bbe83-56d7d7c3:~# curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"/root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60","object":"model","created":1742994474,"owned_by":"vllm","root":"/root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-9fd7c968ad76451bad886268be0ad783","object":"model_permission","created":1742994474,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}(base) root@autodl-container-16494bbe83-56d7d7c3:~#
(base) root@autodl-container-16494bbe83-56d7d7c3:~#
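The model id returned by /v1/models is the value every later request has to pass as "model". A small sketch of pulling it out programmatically instead of copying the long path by hand (both function names are illustrative; `fetch_model_ids` assumes the server is listening on localhost:8000 and is not called here):

```python
import json
from urllib.request import urlopen


def model_ids(models_response: str) -> list:
    """Extract the model ids from the JSON body returned by GET /v1/models."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1") -> list:
    """Query a running server; needs the vLLM server to be up."""
    with urlopen(base_url + "/models") as resp:
        return model_ids(resp.read().decode("utf-8"))


# Offline demonstration on a trimmed-down, made-up response body.
sample = '{"object": "list", "data": [{"id": "my-model", "object": "model"}]}'
print(model_ids(sample))
```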
(base) root@autodl-container-16494bbe83-56d7d7c3:~# curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60",
"prompt": "I was caught by the teacher when I cheated in the exam. The teacher asked me to write a review. I am a middle school student.",
"max_tokens": 3096,
"temperature": 0.7
}'
{"id":"cmpl-a713436c844d4d93a8aa84cd2809ea70","object":"text_completion","created":1742994986,"model":"/root/autodl-tmp/huggingface/hub/models--deepseek-ai-DeepSeek-R1-Distill-Qwen-7B/snap shots/916b56a44061fd5cd7d6a8fb632557ed4f724f60","choices":[{"index":0,"text":", how to write it? I need specific structure and content, not too general. \nOkay, I need to write a review for cheating in the exam. I am a middle school student, so the content should be in line with the expression of this age group, not too complicated or formal, but it should also reflect sincerity and reflection. \n\nFirst of all, the review should express regret at the beginning The mood of the past shows that you realize the seriousness of the mistake. Then, describe in detail the specific process of cheating in the exam, including the environment at that time, your own psychological state, and why there was such a decision. This part should be specific, such as insomnia on the eve of the exam, physical and psychological struggles, and finally completing the answer to the test paper in a short time. \n\nNext, the review should express your gratitude to the teacher, explain that you deeply realize your mistakes, and hope that the teacher can forgive yourself. At the same time, you can mention that you will study harder, not cheat, not peek, and maintain good study habits. \n\nThe ending part can be promised to work harder in the future, and the grades will not decline like this time, expressing the determination to reflect on the future and improve. \n\nThe overall structure should be: express regret at the beginning, describe the error process in detail in the middle, and express gratitude and commitment at the end. The language should be sincere, not too exaggerated, and it is in line with the expression of middle school students. \n\nNow, I need to organize these ideas into a specific review content to ensure that each part has sufficient details while maintaining overall consistency and logic. 
\n</think>\n\n### Review\n\nSorry, teacher.\n\nI know that cheating in the exam is an unforgivable behavior. This mistake made me feel extremely guilty and regretful. Cheating in the exam not only violated the school rules and regulations, but also undermined the fair competition in the class, hurting the teacher's hard teaching and friendship between classmates.\n\nThe process of cheating in the exam is as follows: That night, I was too stressed to concentrate on reviewing, which caused the next day's exam to fail to perform normally. On the eve of the exam, I had insomnia and my heart was filled with anxiety and uneasiness. In the end, I decided to take the risk and complete the exam by plagiarizing my classmates' answers. Although I tried to deceive myself and the teacher, this behavior completely exposed my nature.\n\n After cheating, I felt very guilty. I know that cheating not only harms the teacher's teaching achievements, but also makes me lose the self-esteem and confidence I should have. I realized that cheating was wrong. It made me lose the opportunity to compete fairly with my classmates and also made me lose my sincere efforts on the road to learning. \n\nTeacher, I keep your words in mind. You taught us to be honest. Cheating not only hurts the teacher, but also hurts the honor of the entire class. I deeply realize that the consequences of cheating are irreversible. It not only makes me lose my score, but also makes me lose my chance to grow. \n\nFrom now on, I am determined to change. I won't cheat anymore, nor do I peek. I will treat my study harder, review the knowledge of each class seriously, and no longer touch the "shortcuts" in learning. I believe that through my own efforts, I can achieve better results and live up to the teacher's expectations. \n\nPlease believe that I will definitely prove with my own practical actions that cheating is wrong, and I will not make such mistakes again. 
I promise that in the days to come, I will treat my studies with a more sincere attitude, maintain good study habits, and live up to the expectations of my teachers and family. \n\nSorry, teacher. I will definitely try my best to reform myself! \n\nReviewer: \nDate:","logprobs":null,"finish_reason":"stop","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":18,"total_tokens":656,"completion_tokens":638,"prompt_tokens_details":null}}(base) root@autodl-container-16494bbe83-56d7d7c3:~#
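The same completion request can be sent from Python with only the standard library. The sketch below builds the JSON body with `json.dumps`, which rules out hand-written JSON syntax slips; the function names and the placeholder model id are illustrative, and `complete` is defined but not called because it needs the server running:

```python
import json
from urllib.request import Request, urlopen


def build_completion_request(base_url, model, prompt, max_tokens=3096, temperature=0.7):
    """Build a POST request for the OpenAI-compatible /completions endpoint."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return Request(
        base_url.rstrip("/") + "/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request):
    """Send the request and return the generated text of the first choice."""
    with urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))["choices"][0]["text"]


# Placeholder model id: substitute the id reported by /v1/models.
req = build_completion_request("http://localhost:8000/v1", "<model id>", "Tell me a joke.")
print(req.full_url)
print(req.data.decode("utf-8"))
```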
Observing the server log
INFO 03-26 21:16:23 [:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 03-26 21:16:26 [:39] Received request cmpl-a713436c844d4d93a8aa84cd2809ea70-0: prompt: 'I was caught by the teacher in cheating in the exam. The teacher asked me to write a review. I am a middle school student', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.95, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=3096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [151646, 35946, 103960, 114640, 99250, 101049, 104609, 3837, 101049, 104029, 61443, 101555, 97557, 100219, 3837, 104198, 15946, 99720], lora_request: None, prompt_adapter_request: None.
INFO 03-26 21:16:26 [async_llm.py:221] Added request cmpl-a713436c844d4d93a8aa84cd2809ea70-0.
INFO 03-26 21:16:33 [:80] Avg prompt throughput: 1.8 tokens/s, Avg generation throughput: 41.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0% # The model is generating at this point; the generation rate is 41.3 tokens/s, which is good performance.
INFO: 127.0.0.1:49442 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 03-26 21:16:43 [:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
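To track these numbers over time rather than reading them by eye, the throughput lines can be parsed with a regular expression. This is a sketch that assumes the line format shown above stays the same; the name `parse_metrics` and the dictionary keys are made up for illustration:

```python
import re

# Matches the periodic metrics line printed by the server log shown above.
_METRICS = re.compile(
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def parse_metrics(line: str):
    """Return the metrics in a log line as a dict, or None for other lines."""
    m = _METRICS.search(line)
    if m is None:
        return None
    return {
        "prompt_tps": float(m["prompt"]),
        "generation_tps": float(m["gen"]),
        "running": int(m["running"]),
        "waiting": int(m["waiting"]),
        "kv_cache_pct": float(m["kv"]),
    }


line = (
    "INFO 03-26 21:16:33 [:80] Avg prompt throughput: 1.8 tokens/s, "
    "Avg generation throughput: 41.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, "
    "GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%"
)
print(parse_metrics(line))
```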
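--max-model-len caps the context length (prompt plus generated tokens) of a single request, and with it how many requests fit in the KV cache concurrently. It does not directly change how much GPU memory vLLM occupies: by default vLLM reserves a fixed fraction of GPU memory up front (--gpu-memory-utilization, default 0.9), which is consistent with the roughly 22 GiB in use in the nvidia-smi output above.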
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~# vllm serve /root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60 --max_model_len 17600
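A value of 17600 exceeds what fits in GPU memory on this card. The log below comes from a run with 17,500 tokens per request, which just fits: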
INFO 03-26 21:32:32 [:33] takes 7.17 s in total
INFO 03-26 21:32:33 [kv_cache_utils.py:566] GPU KV cache size: 17,536 tokens
INFO 03-26 21:32:33 [kv_cache_utils.py:569] Maximum concurrency for 17,500 tokens per request: 1.00x
The largest value that fits is 17,536 tokens, the KV cache size reported in the log above.
Use --quantization to lower the precision and shrink the model's memory footprint
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~# vllm serve /root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60 --quantization fp8 --max-model-len 4096
INFO 03-26 21:39:48 [:132] Cache the graph of shape None for later use
INFO 03-26 21:40:18 [:144] Compiling a graph for general shape takes 32.74 s
INFO 03-26 21:40:28 [:33] takes 42.36 s in total
INFO 03-26 21:40:29 [kv_cache_utils.py:566] GPU KV cache size: 124,736 tokens
INFO 03-26 21:40:29 [kv_cache_utils.py:569] Maximum concurrency for 4,096 tokens per request: 30.45x
INFO 03-26 21:40:54 [gpu_model_runner.py:1534] Graph capturing finished in 24 secs, took 0.52 GiB
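The "Maximum concurrency" figures in both logs are consistent with a simple ratio: KV cache size in tokens divided by the tokens allowed per request. A small worked check, using only the numbers printed in the logs (the function names are illustrative, and the formula is inferred from these two log lines rather than taken from vLLM's source):

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


def fits(kv_cache_tokens: int, max_model_len: int) -> bool:
    """A single request must fit entirely in the KV cache."""
    return max_model_len <= kv_cache_tokens


# (label, KV cache size in tokens, tokens per request), taken from the logs.
runs = [
    ("unquantized, 17,500 per request", 17536, 17500),
    ("fp8, 4,096 per request", 124736, 4096),
]
for label, kv_tokens, model_len in runs:
    print(f"{label}: {max_concurrency(kv_tokens, model_len):.2f}x")

# 17600 tokens per request does not fit in a 17,536-token cache.
print(fits(17536, 17600))
```

With fp8 weights the KV cache grows from 17,536 to 124,736 tokens on the same GPU, which is where the jump from 1.00x to 30.45x comes from.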
Using a Python script
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~# cat
from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Send a chat request; the model name is the path that was passed to `vllm serve`.
chat_response = client.chat.completions.create(
    model="/root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response)
(vllm) root@autodl-container-16494bbe83-56d7d7c3:~#
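The completions shown earlier contain the model's reasoning first, then a `</think>` marker, then the final answer. When only the answer is wanted, the text can be split on that marker. A minimal sketch, assuming the marker appears at most once and that text without a marker is all answer (the helper name is made up):

```python
def split_reasoning(text: str, marker: str = "</think>"):
    """Split generated text into (reasoning, answer) around the marker."""
    reasoning, sep, answer = text.partition(marker)
    if not sep:
        # No marker found: treat the whole text as the answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()


generated = "Okay, I need to write a review.\n</think>\n\n### Review\n\nSorry, teacher."
reasoning, answer = split_reasoning(generated)
print(answer)
```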
However you invoke vLLM, it takes up most of the GPU memory. If you pass no parameters, the model is loaded with its default maximum context length.
/ggml-org/
/watch?v=tcvXNThnHok&t=435s
LM Studio quantized model search
/lmstudio-community
/ggml-org//blob/master/examples/quantize/
Quantization
git clone /ggerganov/
cd
conda create -n llamacpp python=3.12
conda activate llamacpp
pip install -r
mkdir build
cd build
cmake ..
make
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/# python convert_hf_to_gguf.py /root/autodl-tmp/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Qwen-7B/snapshots/916b56a44061fd5cd7d6a8fb632557ed4f724f60/
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp//build/bin# ./llama-quantize ../../ ../../deepseek7b-Q4_K_M.gguf Q4_K_M
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/# du -sh deepseek7b-Q4_K_M.gguf
4.4G deepseek7b-Q4_K_M.gguf
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/# du -sh
15G
(llamacpp) root@autodl-container-16494bbe83-56d7d7c3:~/autodl-tmp/#
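The two `du -sh` results give a quick measure of what Q4_K_M saves. The sketch below converts the human-readable sizes and compares them; it assumes the 15G entry is the unquantized conversion output (its file name is elided above), and `parse_du_size` is a hypothetical helper:

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_du_size(size: str) -> float:
    """Convert a `du -h` style size such as '4.4G' to bytes."""
    size = size.strip()
    suffix = size[-1].upper()
    if suffix in _UNITS:
        return float(size[:-1]) * _UNITS[suffix]
    return float(size)  # plain number without a unit suffix


original = parse_du_size("15G")    # assumed: the unquantized file
quantized = parse_du_size("4.4G")  # deepseek7b-Q4_K_M.gguf

print(f"size ratio: {original / quantized:.2f}x smaller")
print(f"quantized / original: {quantized / original:.0%}")
```

Since `du -h` rounds its output, the ratio is only approximate.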
/docs/quantization/gguf/
Testing after quantization
vllm serve deepseek7b-Q4_K_M.gguf
(base) root@autodl-container-16494bbe83-56d7d7c3:~# curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "deepseek7b-Q4_K_M.gguf",
"prompt": "I was caught by the teacher when I cheated in the exam. The teacher asked me to write a review. I am a middle school student.",
"max_tokens": 3096,
"temperature": 0.7
}'
{"id":"cmpl-eab600ce936146f1b05bf97c706d6bb8","object":"text_completion","created":1743035371,"model":"deepseek7b-Q4_K_M.gguf","choices":[{"index":0,"text":", the review should be formal, but not too long, about 500 words.\n\nOkay, the user needs to write a review. He is a middle school student, and the teacher caught cheating in the exam. Now he needs to write a review. The review should be formal, but not too long, about 500 words.\n\nFirst of all, The structure of the review usually includes an introduction, admitting errors, performance analysis, correction measures and summary. I have to make sure that the content is comprehensive and the language is formal but not too long. \n\nThe introduction part should explain the reasons for writing a review, such as being found to cheat. Then admit the error and clearly know that you have cheated. \n\nNext, analyze the cause of the error. It may include insufficient trust in the teacher, not realizing the seriousness of the cheating, or too much pressure to cause abnormal performance. \n\nThen, propose corrective measures, such as studying hard, avoiding cheating, actively participating in classroom activities, and apologizing to the teacher. \n\n Finally, summarize the part to express your determination and commitment to the future, and ensure that you will not make mistakes again. \n\nIn terms of language, be more formal, use the words accurately, and avoid being too colloquial. At the same time, Control it to about 500 words, not too long. \n\n When writing in detail, you can first make an outline and then gradually fill in the content. Make sure that each part of the content is concise and clear and the logic is clear. \n\nYou may also pay attention to the appropriate words, such as "deep understanding", "serious reflection", "deep learning lessons", etc. These words can express the sincerity and determination in the heart. 
\n\n Finally, check whether the overall structure is reasonable, whether the content is comprehensive, and whether there are any missing parts. \n\nIn short, when writing a review, you must truly reflect your behavior and understanding, reflect the determination to repent and correct, with clear structure, appropriate language, and meet the requirements of the school. \n</think>\n\n Dear teacher: \n\nToday, I wrote for you with great guilt. This review. After deep reflection, I realized that my cheating behavior in the exam was absolutely wrong. It not only damaged your fair teaching, but also made me lose my due integrity and dignity. \n\nThe cheating happened because of my insufficient understanding of the importance of learning, and the anxiety caused by the pressure of the exam, which made me unable to perform normally. I feel deeply regretful about this and learned lessons. \n\nIn order to get rid of this bad habit, I solemnly promised: First, I will take every exam seriously in the future and ensure that my homework is completed on time; Second, I will strengthen self-management and cultivate good learning habits; Third, I will actively express my apology to you and urge me to give me the opportunity to correct myself. \n\nPlease believe that I have a deep understanding of my mistakes and will take this as a new starting point and strictly demand myself. I hope you can give me a chance and I will prove my determination with practical actions. \n\nSincerely \nSalute! ","logprobs":null,"finish_reason":"stop","stop_reason":151643,"prompt_logprobs":null}],"usage":{"prompt_tokens":17,"total_tokens":522,"completion_tokens":505,"prompt_tokens_details":null}}(base) root@autodl-container-16494bbe83-56d7d7c3:~#
(base) root@autodl-container-16494bbe83-56d7d7c3:~#