
A Guide to Large Model Inference: Efficient Inference with vLLM


This article shares how to implement a large model inference service using vLLM.

1. Overview

There are various ways to run inference on large models, such as:

  • HuggingFace Transformers, the most basic option
  • TGI
  • vLLM
  • Triton + TensorRT-LLM
  • ...

Among them, the most popular is probably vLLM, which delivers good performance and is also very simple to use. In this article, we will share how to use vLLM to start a large model inference service.

According to the vLLM official blog post vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention:

Two experiments were conducted: inference of LLaMA-7B on NVIDIA A10 GPUs and inference of LLaMA-13B on NVIDIA A100 GPUs (40 GB). In terms of throughput, vLLM is up to 24x higher than the most basic HuggingFace Transformers and 3.5x higher than TGI.

2. Install vLLM

The first step is to prepare a GPU environment, which is covered in this article: GPU Environment Setup Guide: How to Use GPUs in Bare Metal, Docker, K8s, etc.

Make sure the nvidia-smi command runs properly on the host machine, like this:

root@test:~# nvidia-smi
Thu Jul 18 10:52:01 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:00:07.0 Off |                    0 |
|  0%   45C    P0    88W / 300W |  40920MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          Off  | 00000000:00:08.0 Off |                    0 |
|  0%   47C    P0    92W / 300W |  40916MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1847616      C   tritonserver                      480MiB |
|    0   N/A  N/A   2553571      C   python3                         40426MiB |
|    1   N/A  N/A   1847616      C   tritonserver                      476MiB |
|    1   N/A  N/A   2109313      C   python3                         40426MiB |
+-----------------------------------------------------------------------------+

Installing conda

To avoid interfering with other environments, we'll use conda to create a separate Python virtual environment for installing vLLM.

Use the following commands to quickly install the latest Miniconda, or download the installer from the official site and add it to your PATH.

mkdir -p ~/miniconda3

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh

bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3

rm -rf ~/miniconda3/miniconda.sh

Then initialize conda:

# Select a command to execute depending on the shell used
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh

Activate it:

source ~/.bashrc

Create a virtual environment to install vLLM

Create a virtual environment and activate it

conda create -n vllm_py310 python=3.10

conda activate vllm_py310

# Configure a pip mirror (optional; replace <mirror-url> and <mirror-host> with your mirror of choice)
pip config set global.index-url <mirror-url>/simple/
pip config set global.trusted-host <mirror-host>

# Install vLLM 0.4.2 in the virtual environment
pip install vllm==0.4.2
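
Before moving on, it can help to confirm that the installation works. Below is a minimal sanity-check sketch, assuming the vllm_py310 environment above is activated; it only imports vLLM and checks that PyTorch (installed as a vLLM dependency) can see the GPUs.

# sanity_check.py -- quick check that vLLM installed correctly
import torch
import vllm

print(f"vLLM version: {vllm.__version__}")            # expect 0.4.2
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")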

3. Prepare the model

Models are generally published on HuggingFace; however, given domestic network conditions, downloading from ModelScope is recommended.

vLLM supports all the major models; see the official documentation for the full list: vllm-supported-models

Qwen1.5-1.8B-Chat is used here for testing.

Use git lfs to download:

# Install and initialize git-lfs
apt install git-lfs -y
git lfs install

# Download the model
git lfs clone https://www.modelscope.cn/qwen/Qwen1.5-1.8B-Chat.git
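
If git lfs is not convenient, the ModelScope Python SDK can also download the model. The sketch below assumes pip install modelscope has been run; the cache_dir is just an example path.

# download_model.py -- alternative download via the ModelScope SDK
from modelscope import snapshot_download

model_dir = snapshot_download(
    "qwen/Qwen1.5-1.8B-Chat",   # model id on ModelScope
    cache_dir="/models",        # example download location, adjust as needed
)
print(f"Model downloaded to: {model_dir}")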

The complete content is below:

root@j99cloudvm:~/lixd/models# ls -lhS Qwen1.5-1.8B-Chat/
total 3.5G
-rw-r--r-- 1 root root 3.5G Jul 18 11:20 model.safetensors
-rw-r--r-- 1 root root 6.8M Jul 18 11:08 tokenizer.json
-rw-r--r-- 1 root root 2.7M Jul 18 11:08 vocab.json
-rw-r--r-- 1 root root 1.6M Jul 18 11:08 merges.txt
-rw-r--r-- 1 root root 7.2K Jul 18 11:08 LICENSE
-rw-r--r-- 1 root root 4.2K Jul 18 11:08 README.md
-rw-r--r-- 1 root root 1.3K Jul 18 11:08 tokenizer_config.json
-rw-r--r-- 1 root root  662 Jul 18 11:08 config.json
-rw-r--r-- 1 root root  206 Jul 18 11:08 generation_config.json
-rw-r--r-- 1 root root   51 Jul 18 11:08 configuration.json

This directory contains files related to a large model. Below is a brief description of what each file does:

  • model.safetensors: This is the main file of the model and contains the model weights.
  • tokenizer.json: This file contains the tokenizer's configuration and vocabulary. The tokenizer converts the input text into a format the model can handle, usually numeric IDs.
  • tokenizer_config.json: This file contains configuration options for the tokenizer, such as the tokenizer type and parameter settings.
  • config.json: This file contains the configuration parameters of the model, defining the model structure and some training-time settings. It usually includes the number of layers, hidden size, activation function, and so on.
  • generation_config.json: This file contains the configuration parameters used for text generation, such as generation length and sampling strategy.
  • configuration.json: This file is usually an additional configuration file for the model and may contain information related to the model structure or the training process.
  • vocab.json: This file contains the tokenizer's vocabulary, usually a mapping table from tokens to IDs. The tokenizer uses this file to convert words in the text into IDs the model can handle.

In general, you only need to pay attention to the weights file and tokenizer.
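
As a quick check that the download is complete and usable, the tokenizer and config can be loaded with transformers (installed as a vLLM dependency). A minimal sketch, assuming the model was downloaded to /models/Qwen1.5-1.8B-Chat:

# check_model.py -- verify the downloaded model directory loads correctly
from transformers import AutoConfig, AutoTokenizer

model_path = "/models/Qwen1.5-1.8B-Chat"   # adjust to your download path

config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(config.model_type)                        # model architecture type
print(tokenizer("Who are you?")["input_ids"])   # token IDs produced by the tokenizer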

4. Start inference

Start the inference service

vLLM provides an OpenAI-compatible API; the startup command is as follows:

modelpath=/models/Qwen1.5-1.8B-Chat

# single card
python3 -m vllm.entrypoints.openai.api_server \
        --model $modelpath \
        --served-model-name qwen \
        --trust-remote-code

The output is as follows:

INFO 07-18 06:42:31 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/models/Qwen1.5-1.8B-Chat', speculative_config=None, tokenizer='/models/Qwen1.5-1.8B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=qwen)
INFO:     Started server process [614]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

vLLM listens on port 8000 by default.
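
Once the server is up, the /v1/models endpoint can be used as a quick health check. A minimal sketch using only the Python standard library, assuming the service is listening on localhost:8000 as above:

# health_check.py -- list the models served by the running vLLM instance
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    data = json.load(resp)

# The name passed via --served-model-name ("qwen") should appear here.
print([m["id"] for m in data["data"]])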

For multiple cards, just add the tensor-parallel-size parameter. If you set this parameter to the number of GPUs, vLLM will start a Ray cluster and shard the model across the GPUs, which is useful for large models.

python3 -m vllm.entrypoints.openai.api_server \
        --model $modelpath \
        --served-model-name qwen \
        --tensor-parallel-size 8 \
        --trust-remote-code

Send test request

Send a request directly in the OpenAI format:

# "model" must match the served-model-name parameter used when starting the service
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who are you??"}
        ]
    }'

The output is as follows:

{"id": "cmpl-07f2f8c70bd44c10bba71d730e6e10a3", "object":"", "created":1721284973, "model": "qwen", "choices":[{"index":0, "message":{" role": "assistant", "content": "I'm a large-scale language model from Aliyun, my name is Tongyi Thousand Questions. I am a hyperscale language model developed by Aliyun independently, which can answer questions, create text, and also express opinions, write code, and write stories, as well as express opinions, write code, and write stories. I am designed to help users answer questions, create text, express opinions, write code, write stories, and perform a variety of other natural language processing tasks. If you have any questions or need help, please feel free to let me know and I'll do my best to provide support."} , "logprobs":null, "finish_reason": "stop", "stop_reason":null}], "usage":{"prompt_tokens":22, "total_tokens":121, "completion_tokens":99 }}

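The same request can also be sent from Python with the official OpenAI SDK, since the vLLM server exposes an OpenAI-compatible API. A minimal sketch, assuming pip install openai has been run; the api_key can be any non-empty string because the server started above does not check it.

# chat_client.py -- call the vLLM service through the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="qwen",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
)
print(completion.choices[0].message.content)
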
Check the GPU usage; it's basically running at full load:

Thu Jul 18 06:45:32 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:3B:00.0 Off |                    0 |
| N/A   59C    P0              69W /  70W |  12833MiB / 15360MiB |     84%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:86:00.0 Off |                    0 |
| N/A   51C    P0              30W /  70W |   4849MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3376627      C   python3                                   12818MiB |
|    1   N/A  N/A   1150487      C   /usr/bin/python3                           4846MiB |
+---------------------------------------------------------------------------------------+

5. Summary

This article shared how to deploy a large model inference service using vLLM. Once the environment is installed, vLLM is very simple to use: a single command starts the service.

modelpath=/models/Qwen1.5-1.8B-Chat

# single card
python3 -m vllm.entrypoints.openai.api_server \
        --model $modelpath \
        --served-model-name qwen \
        --trust-remote-code

The [Kubernetes Series] is continuously updated; search for the public account [Explore Cloud Native] and subscribe to read more articles.