Recently, while privately deploying our TorchV system for a customer, we were given ample hardware resources. I'm taking this opportunity to document the process of deploying the Qwen 72B model and share it with you!
I. Basic information
- Operating system: Ubuntu 22.04.3 LTS
- GPU: A800 (80GB) * 8
- RAM: 1TB
II. Software information
- Python: 3.10
- PyTorch: 2.3.0
- Transformers: 4.43.0
- vLLM: 0.5.0
- CUDA: 12.2
- Model: Qwen2-72B-Instruct
III. Installation steps
1. Install Conda
Conda is an open-source package and environment management system designed to simplify installing, configuring, and using software packages.
It makes deploying Python environments and switching between them very easy.
You can download and install it through the link on the official conda website:/download#downloads
# Download
wget /archive/Anaconda3-2023.09-0-Linux-x86_64.sh
# Install
bash Anaconda3-2023.09-0-Linux-x86_64.sh
# Configure environment variables
echo 'export PATH="/path/to/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
After the installation is complete, verify that it was successful with the following command:
conda --version
After the installation is complete, you can configure mirror sources to speed up downloading dependency packages:
# Configure mirror sources
conda config --add channels /anaconda/pkgs/free/
conda config --add channels /anaconda/pkgs/main/
conda config --set show_channel_urls yes
conda config --add channels /anaconda/cloud/conda-forge/
conda config --add channels /anaconda/cloud/msys2/
conda config --add channels /anaconda/cloud/bioconda/
conda config --add channels /anaconda/cloud/menpo/
conda config --add channels /anaconda/cloud/pytorch/
Conda related commands
# Create a virtual environment named llm with Python 3.9
conda create --name llm python=3.9
# Activate the new conda environment
conda activate llm
# View the list of current environments
conda env list
2. Download QWen2-72B-Instruct Model
Huggingface:/Qwen/Qwen2-72B-Instruct
ModelScope:/models/qwen/Qwen2-72B-Instruct
The model can be downloaded from either address; after the download is complete, place the model files on the server.
⚠️ Note the server's disk space.
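If you prefer to script the download, here is a minimal sketch using ModelScope's snapshot_download (this assumes the modelscope package is installed via pip; the cache directory is only an example, point it at a disk with enough free space):
# Minimal download sketch, assuming "pip install modelscope" has been run.
# The cache_dir below is only an example path.
from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen2-72B-Instruct', cache_dir='/mnt/soft/models')
print(f"Model downloaded to: {model_dir}")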
3. Install PyTorch and other dependencies
⚠️ When installing PyTorch, make sure its build matches the CUDA driver version, or you will run into all kinds of inexplicable problems!
Version selection reference:/get-started/locally/
Create a new environment via conda, switch to it, and then install the dependencies. A quick sanity check is shown below.
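After installing, a short script like this (a minimal sketch using standard PyTorch APIs) confirms that the PyTorch build and the CUDA driver see each other and that all GPUs are visible:
# Quick sanity check that PyTorch can see CUDA and all GPUs
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version built against:", torch.version.cuda)
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))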
4. Install vLLM
The vLLM framework is an efficient inference and serving system for large language models, with the following features:
- Efficient memory management: through the PagedAttention algorithm, vLLM manages the KV cache efficiently, reducing memory waste and optimizing the runtime efficiency of the model.
- High throughput: vLLM supports asynchronous processing and continuous batching of requests, significantly improving inference throughput and accelerating text generation and processing.
- Ease of use: vLLM integrates seamlessly with HuggingFace models, supports a wide range of popular large language models, and simplifies model deployment and inference. It is compatible with an OpenAI-style API server.
- Distributed inference: the framework supports distributed inference across multiple GPUs, improving the ability to handle large models through model parallelism and efficient data communication.
- Open source: thanks to its open-source nature, vLLM has active community support, making it easy for developers to contribute improvements and advance the technology together.
GitHub:/vllm-project/vllm
Documentation:/en/latest/
After creating the base environment with conda, you can install it directly with pip:
pip install vllm
For more information on how to install it, you can refer to the documentation on the official website:/en/stable/getting_started/
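Once installed, a one-line check confirms that the package is importable and shows the installed version (a minimal check, nothing vLLM-specific beyond the package itself):
# Verify that vLLM was installed and print its version
import vllm
print(vllm.__version__)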
5. Model validation
A python script can be used to verify that the current model is available.
The script is below:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import os
import json

def get_completion(prompts, model, tokenizer=None, max_tokens=512, temperature=0.8, top_p=0.95, max_model_len=2048):
    stop_token_ids = []
    # Create sampling parameters: temperature controls the diversity of the generated text, top_p controls nucleus sampling
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids)
    # Initialize the vLLM inference engine
    llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len, trust_remote_code=True)
    outputs = llm.generate(prompts, sampling_params)
    return outputs

if __name__ == "__main__":
    # Initialize the vLLM inference engine
    model = '/mnt/soft/models/qwen/Qwen2-72B-Instruct'  # Specify the model path
    # model = "qwen/Qwen2-7B-Instruct"  # Or specify the model name to download it automatically
    tokenizer = None
    # Loading the tokenizer and passing it to the vLLM model is optional
    # tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    text = ["Hi, tell me what a large language model can help me with.",
            "Can you tell me a funny fairy tale?"]
    outputs = get_completion(text, model, tokenizer=tokenizer, max_tokens=512, temperature=1, top_p=1, max_model_len=2048)
    # The output is a list of RequestOutput objects containing the prompt, the generated text, and other information
    # Print the output
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Execute the Python script in the terminal (replace your_script.py with the file name you saved it under) and check that the console output looks correct:
python your_script.py
6. Start the service & wrap it as an OpenAI-format interface
After verifying that the model works, the whole model service can be wrapped as an OpenAI-format HTTP service via the module provided by vLLM, for use by upper-layer applications.
Parameter configurations to be aware of:
- --model: specifies the model name & path.
- --served-model-name: specifies the name the model is served under.
- --max-model-len: specifies the model's maximum context length; if not specified, it is loaded automatically from the model configuration file. The Qwen2-72B model supports up to 128K.
- --tensor-parallel-size: specifies how many GPUs to run across; the Qwen2-72B model cannot fit on a single GPU.
- --gpu-memory-utilization: the fraction of GPU memory used for the model executor, ranging from 0 to 1. For example, a value of 0.5 means 50% GPU memory utilization. If not specified, the default value is 0.9. vLLM pre-allocates memory according to this parameter to avoid frequent memory allocation while the model is running.
For more information about the parameters of vllm, you can refer to the official documentation:/en/stable/models/engine_args.html
Here you can use the tmux command to run the service.
tmux (Terminal Multiplexer) is a powerful terminal multiplexer that lets users run multiple sessions in a single terminal window. Using tmux increases productivity and makes it easy to manage long-running tasks and multitask.
python3 -m vllm.entrypoints.openai.api_server --model /mnt/torchv/models/Qwen2-72B-Instruct --served-model-name QWen2-72B-Instruct --tensor-parallel-size 8 --gpu-memory-utilization 0.7
When the log shows the listening port and related information, the model service has started successfully!
First create a new session:
tmux new -s llm
Then enter the session:
tmux attach -t llm
Start command:
python -m xxx
To detach from the current session, switch the input method to English and press Ctrl + b, then d (if nothing happens, try again a few times).
Verify that the large model's OpenAI interface service is available via the curl command with the following script:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "QWen2-72B-Instruct",
"messages": [
{
"role": "user",
"content": "Tell me a fairy tale."
}
],
"stream": true,
"temperature": 0.9,
"top_p": 0.7,
"top_k": 20,
"max_tokens": 512
}'
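Besides curl, the service can also be called with the official openai Python client (v1.x), since the interface is OpenAI-compatible. This is a minimal sketch; the api_key is just a placeholder because this deployment does not configure authentication:
# Minimal sketch using the openai Python client (pip install openai),
# assuming the service above is running on localhost:8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # api_key is a placeholder

response = client.chat.completions.create(
    model="QWen2-72B-Instruct",
    messages=[{"role": "user", "content": "Tell me a fairy tale."}],
    temperature=0.9,
    top_p=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)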
IV. Summary
The current open-source ecosystem is very mature, and tools such as vLLM make it easy to deploy large models quickly, which greatly improves work efficiency.
V. References
Official website resources and other information
| Resource | Address |
|---|---|
| QWen | GitHub:/QwenLM/Qwen Huggingface:/Qwen ModelScope:/organization/qwen?tab=model docs:/zh-cn/latest/getting_started/# |
| Pytorch | /get-started/locally/ |
| Conda | |
| vLLM | /en/latest/getting_started/ |
Incomplete download of the weights file
During this deployment, an incompletely downloaded model weights file prevented the vLLM deployment from working. You can use the Linux sha256sum tool to check the model weights files and compare their sha256 values with those published on the model page; if they don't match, the files need to be re-downloaded.
The command is as follows:
sha256sum your_local_file
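When there are many weight shards, a small Python helper like the one below (a sketch; the model directory path is only an example) computes the sha256 of every .safetensors file so the values can be compared against those listed on the model page:
# Compute sha256 for every .safetensors shard in the model directory (example path)
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

model_dir = Path("/mnt/torchv/models/Qwen2-72B-Instruct")
for shard in sorted(model_dir.glob("*.safetensors")):
    print(shard.name, sha256_of(shard))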