This article shares how to implement a large model inference service using vLLM.
1. Overview
There are various ways to run inference on large models, such as:
- HuggingFace Transformers: the most basic approach
- TGI
- vLLM
- Triton + TensorRT-LLM
- ...
Among them, the most popular is vLLM, which offers good performance and is also very simple to use. This article shares how to use vLLM to start a large model inference service.
According to the vLLM official blog post vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention:
In experiments serving LLaMA-7B on NVIDIA A10 GPUs and LLaMA-13B on NVIDIA A100 GPUs (40 GB), vLLM achieved up to 24x higher throughput than the most basic HuggingFace Transformers and up to 3.5x higher than TGI.
2. Install vLLM
The first step is to prepare a GPU environment; the setup is covered in this article: GPU Environment Setup Guide: How to Use GPUs in Bare Metal, Docker, K8s, etc.
Make sure the nvidia-smi command runs properly on the host machine, like this:
root@test:~# nvidia-smi
Thu Jul 18 10:52:01 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 Off | 00000000:00:07.0 Off | 0 |
| 0% 45C P0 88W / 300W | 40920MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 Off | 00000000:00:08.0 Off | 0 |
| 0% 47C P0 92W / 300W | 40916MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1847616 C tritonserver 480MiB |
| 0 N/A N/A 2553571 C python3 40426MiB |
| 1 N/A N/A 1847616 C tritonserver 476MiB |
| 1 N/A N/A 2109313 C python3 40426MiB |
+-----------------------------------------------------------------------------+
Installing conda
To avoid interfering with the system Python environment, we'll use conda to create a separate Python virtual environment and install vLLM into it.
Use the following commands to quickly install the latest Miniconda, or download the installer from the official site, unpack it, and add it to your PATH.
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
Then initialize conda:
# Select a command to execute depending on the shell used
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Then apply the changes:
source ~/.bashrc
Create a virtual environment to install vLLM
Create a virtual environment and activate it
conda create -n vllm_py310 python=3.10
conda activate vllm_py310
# Configure a pip mirror (optional, e.g. the Tsinghua mirror for domestic networks)
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple/
pip config set global.trusted-host pypi.tuna.tsinghua.edu.cn
# Install vLLM 0.4.2 in the virtual environment
pip install vllm==0.4.2
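To confirm the installation succeeded, a quick sanity check can be run inside the virtual environment. The snippet below is a minimal sketch in Python; it only relies on vLLM and its torch dependency, both installed by the command above.
# sanity_check.py - verify that vLLM imports and a GPU is visible
import torch
import vllm

print(vllm.__version__)           # expect 0.4.2
print(torch.cuda.is_available())  # expect True when the GPU driver is working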
3. Prepare the model
Models are generally published on HuggingFace, but given domestic network conditions it is recommended to download them from ModelScope instead.
vLLM supports all the major models; see the official documentation for the full list: vllm-supported-models
Qwen1.5-1.8B-Chat is used here for testing.
Use git lfs to download:
# Install and initialize git-lfs
apt install git-lfs -y
git lfs install
# Download the model
git lfs clone https://www.modelscope.cn/qwen/Qwen1.5-1.8B-Chat.git
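If git lfs is inconvenient, the ModelScope SDK can download the model as well. The sketch below is only an alternative illustration: it assumes pip install modelscope has been run, and the cache directory shown is illustrative.
# Alternative download via the ModelScope SDK
from modelscope import snapshot_download

# Downloads Qwen1.5-1.8B-Chat into /models and returns the local directory
model_dir = snapshot_download("qwen/Qwen1.5-1.8B-Chat", cache_dir="/models")
print(model_dir)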
The downloaded files are listed below:
root@j99cloudvm:~/lixd/models# ls -lhS Qwen1.5-1.8B-Chat/
total 3.5G
-rw-r--r-- 1 root root 3.5G Jul 18 11:20 model.safetensors
-rw-r--r-- 1 root root 6.8M Jul 18 11:08 tokenizer.json
-rw-r--r-- 1 root root 2.7M Jul 18 11:08 vocab.json
-rw-r--r-- 1 root root 1.6M Jul 18 11:08 merges.txt
-rw-r--r-- 1 root root 7.2K Jul 18 11:08 LICENSE
-rw-r--r-- 1 root root 4.2K Jul 18 11:08 README.md
-rw-r--r-- 1 root root 1.3K Jul 18 11:08 tokenizer_config.json
-rw-r--r-- 1 root root  662 Jul 18 11:08 config.json
-rw-r--r-- 1 root root  206 Jul 18 11:08 generation_config.json
-rw-r--r-- 1 root root   51 Jul 18 11:08 configuration.json
This directory contains the files that make up the model. Below is a brief description of what each file does:
- model.safetensors: the main file of the model, containing the model weights.
- tokenizer.json: the tokenizer's configuration and vocabulary. The tokenizer converts input text into a format the model can handle, usually numeric IDs.
- tokenizer_config.json: configuration options for the tokenizer, such as the tokenizer type and parameter settings.
- config.json: the model's configuration parameters, defining the model structure and some training-time settings, typically including the number of layers, hidden size, activation function, and so on.
- generation_config.json: parameters used for text generation, such as generation length and sampling strategy.
- configuration.json: usually an extra configuration file for the model, which may contain information related to the model structure or the training process.
- vocab.json: the tokenizer's vocabulary, usually a mapping from tokens to IDs, used to convert words in the text into IDs the model can handle.
In general, you only need to pay attention to the weights file and the tokenizer.
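As a quick illustration of the tokenizer and config files described above, the sketch below loads them with the transformers library (installed as a vLLM dependency); the local path is assumed to be the directory downloaded earlier.
# Inspect the tokenizer and model config of the downloaded model
from transformers import AutoConfig, AutoTokenizer

model_dir = "./Qwen1.5-1.8B-Chat"  # path to the downloaded model

tokenizer = AutoTokenizer.from_pretrained(model_dir)
config = AutoConfig.from_pretrained(model_dir)

# The tokenizer turns text into the numeric IDs the model consumes
print(tokenizer("Who are you?")["input_ids"])

# config.json carries the model structure, e.g. layer count and hidden size
print(config.num_hidden_layers, config.hidden_size)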
4. Run inference
Start the inference service
vLLM provides an OpenAI-compatible API. The startup command is as follows:
modelpath=/models/Qwen1.5-1.8B-Chat
# single card
python3 -m vllm.entrypoints.openai.api_server \
--model $modelpath \
--served-model-name qwen \
--trust-remote-code
The output is as follows:
INFO 07-18 06:42:31 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/models/Qwen1.5-1.8B-Chat', speculative_config=None, tokenizer='/models/Qwen1.5-1.8B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=qwen)
INFO: Started server process [614]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
vLLM listens on port 8000 by default.
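Before sending chat requests, a simple liveness check can confirm the service is up. The sketch below uses the requests library (pulled in as a vLLM dependency) against the OpenAI-compatible /v1/models endpoint; host and port match the defaults above.
# List the models served by the running vLLM instance
import requests

resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())  # should list the model served under the name "qwen"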
For multiple GPUs, you just add the tensor-parallel-size parameter.
Set this parameter to the number of GPUs and vLLM will start a Ray cluster to shard the model across the GPUs, which is useful for large models.
python3 -m vllm.entrypoints.openai.api_server \
--model $modelpath \
--served-model-name qwen \
--tensor-parallel-size 8 \
--trust-remote-code
Send a test request
Send the request directly in OpenAI format:
# "model" must match the served-model-name parameter used when starting the service
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you??"}
]
}'
The output is as follows:
{"id": "cmpl-07f2f8c70bd44c10bba71d730e6e10a3", "object":"", "created":1721284973, "model": "qwen", "choices":[{"index":0, "message":{" role": "assistant", "content": "I'm a large-scale language model from Aliyun, my name is Tongyi Thousand Questions. I am a hyperscale language model developed by Aliyun independently, which can answer questions, create text, and also express opinions, write code, and write stories, as well as express opinions, write code, and write stories. I am designed to help users answer questions, create text, express opinions, write code, write stories, and perform a variety of other natural language processing tasks. If you have any questions or need help, please feel free to let me know and I'll do my best to provide support."} , "logprobs":null, "finish_reason": "stop", "stop_reason":null}], "usage":{"prompt_tokens":22, "total_tokens":121, "completion_tokens":99 }}
Checking the GPU usage, the card is basically running at full load:
Thu Jul 18 06:45:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 59C P0 69W / 70W | 12833MiB / 15360MiB | 84% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:86:00.0 Off | 0 |
| N/A 51C P0 30W / 70W | 4849MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3376627 C python3 12818MiB |
| 1 N/A N/A 1150487 C /usr/bin/python3 4846MiB |
+---------------------------------------------------------------------------------------+
5. Summary
This article showed how to use vLLM to deploy a large model inference service. Once the environment is installed, vLLM is very simple to use: a single command starts the service.
modelpath=/models/Qwen1.5-1.8B-Chat
# single card
python3 -m vllm.entrypoints.openai.api_server \
--model $modelpath \
--served-model-name qwen \
--trust-remote-code
The [Kubernetes Series] is continuously updated; search for the public account [Explore Cloud Native] to subscribe and read more articles.