Recently, while privately deploying our TorchV system for a customer, we were given ample hardware resources. I'm taking this opportunity to document the process of deploying the Qwen 72B model and share it with you!
I. Basic information
- Operating system: Ubuntu 22.04.3 LTS
- GPU: A800 (80GB) * 8
- RAM: 1TB
II. Software information
- Python: 3.10
- PyTorch: 2.3.0
- Transformers: 4.43.0
- vLLM: 0.5.0
- CUDA: 12.2
- Model: Qwen2-72B-Instruct
III. Installation steps
1. Install Conda
Conda is an open-source package and environment management system designed to simplify installing, configuring, and using software packages.
It makes deploying Python environments and switching between them very easy.
You can download and install it through the link on the official conda website:/download#downloads
# Download
wget /archive/Anaconda3-2023.09-0-Linux-x86_64.sh
# Install
bash Anaconda3-2023.09-0-Linux-x86_64.sh
# Configure environment variables
echo 'export PATH="/path/to/anaconda3/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
After the installation is complete, verify that it was successful with the following command:
conda --version
After the installation is complete, you can configure mirror sources to speed up downloading dependency packages:
# Configure mirror sources
conda config --add channels /anaconda/pkgs/free/
conda config --add channels /anaconda/pkgs/main/
conda config --set show_channel_urls yes
conda config --add channels /anaconda/cloud/conda-forge/
conda config --add channels /anaconda/cloud/msys2/
conda config --add channels /anaconda/cloud/bioconda/
conda config --add channels /anaconda/cloud/menpo/
conda config --add channels /anaconda/cloud/pytorch/
Conda related commands
# Create a virtual environment named llm with Python 3.9
conda create --name llm python=3.9
# Activate the new conda environment
conda activate llm
# View the list of current environments
conda env list
2. Download QWen2-72B-Instruct Model
Huggingface:/Qwen/Qwen2-72B-Instruct
ModelScope:/models/qwen/Qwen2-72B-Instruct
The model can be downloaded from either address; after the download is complete, place the model files on the server.
⚠️ Note the server's disk space.
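If you prefer to script the download, here is a minimal sketch using ModelScope's snapshot_download (this assumes the modelscope package is installed via pip; the cache directory is only an example, point it at a disk with enough free space):
# Minimal download sketch, assuming "pip install modelscope" has been run.
# The cache_dir below is only an example path.
from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen2-72B-Instruct', cache_dir='/mnt/soft/models')
print(f"Model downloaded to: {model_dir}")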
3. Install PyTorch and other dependencies
⚠️ When installing PyTorch, make sure its build matches the CUDA driver version, or you will run into all kinds of inexplicable problems!
Version selection reference:/get-started/locally/
Create a new environment via conda, switch to it, and then install the dependencies. A quick sanity check is shown below.
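After installing, a short script like this (a minimal sketch using standard PyTorch APIs) confirms that the PyTorch build and the CUDA driver see each other and that all GPUs are visible:
# Quick sanity check that PyTorch can see CUDA and all GPUs
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version built against:", torch.version.cuda)
print("GPU count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))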
4. Install vLLM
The vLLM framework is an efficient inference and serving system for large language models, with the following features:
- Efficient memory management: through the PagedAttention algorithm, vLLM manages the KV cache efficiently, reducing memory waste and optimizing the runtime efficiency of the model.
- High throughput: vLLM supports asynchronous processing and continuous batching of requests, significantly improving inference throughput and accelerating text generation and processing.
- Ease of use: vLLM integrates seamlessly with HuggingFace models, supports a wide range of popular large language models, and simplifies model deployment and inference. It is compatible with an OpenAI-style API server.
- Distributed inference: the framework supports distributed inference across multiple GPUs, improving the ability to handle large models through model parallelism and efficient data communication.
- Open source: thanks to its open-source nature, vLLM has active community support, making it easy for developers to contribute improvements and advance the technology together.
GitHub:/vllm-project/vllm
Documentation:/en/latest/
After creating the base environment with conda, you can install it directly with pip:
pip install vllm
For more information on how to install it, you can refer to the documentation on the official website:/en/stable/getting_started/
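Once installed, a one-line check confirms that the package is importable and shows the installed version (a minimal check, nothing vLLM-specific beyond the package itself):
# Verify that vLLM was installed and print its version
import vllm
print(vllm.__version__)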
5. Model validation
A python script can be used to verify that the current model is available.
The script is below:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import os
import json

def get_completion(prompts, model, tokenizer=None, max_tokens=512, temperature=0.8, top_p=0.95, max_model_len=2048):
    stop_token_ids = []
    # Create sampling parameters: temperature controls the diversity of the generated text, top_p controls nucleus sampling
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens, stop_token_ids=stop_token_ids)
    # Initialize the vLLM inference engine
    llm = LLM(model=model, tokenizer=tokenizer, max_model_len=max_model_len, trust_remote_code=True)
    outputs = llm.generate(prompts, sampling_params)
    return outputs

if __name__ == "__main__":
    # Initialize the vLLM inference engine
    model = '/mnt/soft/models/qwen/Qwen2-72B-Instruct'  # Specify the model path
    # model = "qwen/Qwen2-7B-Instruct"  # Or specify the model name to download it automatically
    tokenizer = None
    # Loading the tokenizer and passing it to the vLLM model is optional
    # tokenizer = AutoTokenizer.from_pretrained(model, use_fast=False)
    text = ["Hi, tell me what a large language model can help me with.",
            "Can you tell me a funny fairy tale?"]
    outputs = get_completion(text, model, tokenizer=tokenizer, max_tokens=512, temperature=1, top_p=1, max_model_len=2048)
    # The output is a list of RequestOutput objects containing the prompt, the generated text, and other information
    # Print the output
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Execute the Python script in the terminal (replace your_script.py with the file name you saved it under) and check that the console output looks correct:
python your_script.py
6. Start the service & wrap it as an OpenAI-format interface
After verifying that the model works, the whole model service can be wrapped as an OpenAI-format HTTP service via the module provided by vLLM, for use by upper-layer applications.
Parameter configurations to be aware of:
- --model: specifies the model name & path.
- --served-model-name: specifies the name the model is served under.
- --max-model-len: specifies the model's maximum context length; if not specified, it is loaded automatically from the model configuration file. The Qwen2-72B model supports up to 128K.
- --tensor-parallel-size: specifies how many GPUs to run across; the Qwen2-72B model cannot fit on a single GPU.
- --gpu-memory-utilization: the fraction of GPU memory used for the model executor, ranging from 0 to 1. For example, a value of 0.5 means 50% GPU memory utilization. If not specified, the default value is 0.9. vLLM pre-allocates memory according to this parameter to avoid frequent memory allocation while the model is running.
For more information about the parameters of vllm, you can refer to the official documentation:/en/stable/models/engine_args.html
Here you can use the tmux command to run the service.
tmux (Terminal Multiplexer) is a powerful terminal multiplexer that lets users run multiple sessions in a single terminal window. Using tmux increases productivity and makes it easy to manage long-running tasks and multitask.
python3 -m vllm.entrypoints.openai.api_server --model /mnt/torchv/models/Qwen2-72B-Instruct --served-model-name QWen2-72B-Instruct --tensor-parallel-size 8 --gpu-memory-utilization 0.7
When the log shows the listening port and related information, the model service has started successfully!
First create a new session:
tmux new -s llm
Then enter the session:
tmux attach -t llm
Start command:
python -m xxx
To detach from the current session, switch the input method to English and press Ctrl + b, then d (if nothing happens, try again a few times).
Verify that the large model's OpenAI interface service is available via the curl command with the following script:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "QWen2-72B-Instruct",
"messages": [
{
"role": "user",
"content": "Tell me a fairy tale."
}
],
"stream": true,
"temperature": 0.9,
"top_p": 0.7,
"top_k": 20,
"max_tokens": 512
}'
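Besides curl, the service can also be called with the official openai Python client (v1.x), since the interface is OpenAI-compatible. This is a minimal sketch; the api_key is just a placeholder because this deployment does not configure authentication:
# Minimal sketch using the openai Python client (pip install openai),
# assuming the service above is running on localhost:8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # api_key is a placeholder

response = client.chat.completions.create(
    model="QWen2-72B-Instruct",
    messages=[{"role": "user", "content": "Tell me a fairy tale."}],
    temperature=0.9,
    top_p=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)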
IV. Summary
The current open-source ecosystem is very mature, and tools such as vLLM make it easy to deploy large models quickly, which greatly improves work efficiency.
V. References
Official website resources and other information
| Resource | Address |
|---|---|
| QWen | GitHub:/QwenLM/Qwen Huggingface:/Qwen ModelScope:/organization/qwen?tab=model docs:/zh-cn/latest/getting_started/# |
| Pytorch | /get-started/locally/ |
| Conda | |
| vLLM | /en/latest/getting_started/ |
Incomplete download of the weights file
During this deployment, an incompletely downloaded model weights file prevented the vLLM deployment from working. You can use the Linux sha256sum tool to check the model weights files and compare their sha256 values with those published on the model page; if they don't match, the files need to be re-downloaded.
The command is as follows:
sha256sum your_local_file
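When there are many weight shards, a small Python helper like the one below (a sketch; the model directory path is only an example) computes the sha256 of every .safetensors file so the values can be compared against those listed on the model page:
# Compute sha256 for every .safetensors shard in the model directory (example path)
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

model_dir = Path("/mnt/torchv/models/Qwen2-72B-Instruct")
for shard in sorted(model_dir.glob("*.safetensors")):
    print(shard.name, sha256_of(shard))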