
Xinference Practical Guide: Comprehensively Analyze LLM Large Model Deployment Process, Work Together with Dify to Create Efficient AI Application Practice Cases, Accelerate the AI Project Landing Process


Xorbits Inference (Xinference) is an open-source platform that simplifies running and integrating a wide range of AI models. With Xinference, you can run inference with any open-source LLM, embedding model, or multimodal model and build powerful AI applications in the cloud or on-premises. With Xorbits Inference, you can deploy your own models, or built-in cutting-edge open-source models, with a single click!

  • Official website:/inference
  • github:/xorbitsai/inference/tree/main
  • The Official Handbook:/zh-cn/latest/

  • Xinference feature highlights:
    • Effortless model serving: the deployment process for large language models, speech recognition models, and multimodal models is greatly simplified; a single command completes model deployment.
    • Cutting-edge built-in models: the framework ships with many state-of-the-art open-source models for Chinese and English, including baichuan, chatglm2, and more, ready to try with one click. The list of built-in models is still being updated rapidly!
    • Heterogeneous hardware: reduce latency and increase throughput by using both GPU and CPU for inference via ggml.
    • Flexible interfaces: multiple ways to use the models, including an OpenAI-compatible RESTful API (with Function Calling), RPC, command line, and a web UI, making model management and interaction convenient.
    • Cluster computing and distributed collaboration: supports distributed deployment; a built-in resource scheduler assigns models of different sizes to different machines on demand, making full use of cluster resources.
    • Open, seamless ecosystem: works with popular third-party libraries, including LangChain, LlamaIndex, Dify, FastGPT, RAGFlow, and Chatbox.

1. Model support

1.1 Large model support

Reference Links:/zh-cn/latest/models/builtin/llm/

All mainstream open-source large language models are supported as built-in models.

1.2 Embedding Model

Reference Links:/zh-cn/latest/models/builtin/embedding/

Open-source embedding models are also supported, for example (a usage sketch follows below):

  • BAAI-bge-large-zh-v1.5
    BAAI Embedding semantic vector fine-tuning reference link
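To quickly verify that an embedding model works, you can call it through the Xinference Python client. The snippet below is a minimal sketch: it assumes the Xinference service is already running at http://localhost:9997 and that bge-large-zh-v1.5 is available as a built-in embedding model (adjust the endpoint and model name to your setup).

from xinference.client import Client

# Connect to a running Xinference service (adjust the endpoint to your deployment)
client = Client("http://localhost:9997")

# Launch the embedding model and obtain a handle to it
model_uid = client.launch_model(model_name="bge-large-zh-v1.5", model_type="embedding")
model = client.get_model(model_uid)

# The response follows the OpenAI embedding format
result = model.create_embedding("What is the weather like today?")
print(len(result["data"][0]["embedding"]))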

1.3 Rerank Model (Reranker)

Reference Links:/zh-cn/latest/models/builtin/rerank/

  • bge-reranker-large
    BAAI Cross-Encoder Semantic Vector Fine-Tuning Reference Links

1.4 Image Model

Xinference also supports image models, which can be used for text-to-image and image-to-image generation. Several image models are built in, namely the various versions of Stable Diffusion (SD). The deployment method is similar to that of text models: you can start a model from the WebGUI, and no parameters need to be selected. However, since SD models are quite large, make sure the server has more than 50 GB of free space before deploying an image model.
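As a rough illustration of how an image model can be called once it is deployed, here is a minimal sketch using the Xinference Python client. The model name stable-diffusion-v1.5 and the text_to_image call are assumptions based on the built-in SD models and the client's image handle; check the built-in image model list of your Xinference version before running it.

from xinference.client import Client

# Connect to the local Xinference service (adjust the endpoint to your deployment)
client = Client("http://localhost:9997")

# Launch a built-in Stable Diffusion model (model name is an assumption; pick one from the WebGUI list)
model_uid = client.launch_model(model_name="stable-diffusion-v1.5", model_type="image")
model = client.get_model(model_uid)

# Text-to-image: returns an OpenAI-style response describing the generated image(s)
response = model.text_to_image("an astronaut riding a horse, oil painting")
print(response)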

1.5 Speech Model

Speech models are a recently added feature in Xinference. With speech models, you can implement speech-to-text, speech translation, and other functions. Before deploying a speech model, you need to install the ffmpeg component. Taking Ubuntu as an example, the installation command is as follows:

sudo apt update && sudo apt install ffmpeg

1.6 Model sources

Xinference downloads models from HuggingFace by default. If you want to download models from another site, set the environment variable XINFERENCE_MODEL_SRC. For example, start the Xinference service with the following command and the models will be downloaded from ModelScope when they are deployed:

XINFERENCE_MODEL_SRC=modelscope xinference-local

1.7 Model Exclusive GPU

When deploying models in Xinference, if your server has only one GPU, you can deploy only one LLM, multimodal, image, or speech model at a time, because Xinference currently assigns one whole GPU to one such model. If you try to deploy a second model on the same GPU, you will encounter this error: No available slot found for the model.

1.8 Management model

In addition to starting a model, Xinference provides the ability to manage the entire lifecycle of a model. Again, you can use the command line:

# List all models of the specified type supported by Xinference:
xinference registrations -t LLM
# List all running models:
xinference list
# Stop a running model:
xinference terminate --model-uid "qwen2"

Refer to section 3.1 for more
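The same lifecycle operations are also available from the Python client. The following is a minimal sketch, assuming a service running at http://localhost:9997 and a model whose UID is "qwen2" (replace both with your own values):

from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your endpoint

# List all currently running models (equivalent to `xinference list`)
print(client.list_models())

# Stop a running model by its UID (equivalent to `xinference terminate`)
client.terminate_model(model_uid="qwen2")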

2. Xinference Installation

Install Xinference's base dependencies for inference, as well as dependencies that support inference with ggml and PyTorch.

2.1 Xinference Local Source Installation

First, prepare a Python 3.9+ environment to run Xinference; we recommend installing conda following the official conda documentation. Then create a Python 3.11 environment with the following commands:

conda create --name xinference python=3.11
conda activate xinference 

The following commands install Xinference together with different inference engine backends (Transformers, vLLM, ggml, PyTorch):

pip install "xinference"
pip install "xinference[ggml]"
pip install "xinference[pytorch]"

# Install xinference with all backends included
pip install "xinference[all]"


pip install "xinference[transformers]" -i /simple
pip install "xinference[vllm]" -i /simple
pip install "xinference[transformers,vllm]"  # Install both at once
# Or install all the inference backend engines at once
pip install "xinference[all]"  -i /simple

If you want to use models in GGML format, it is recommended to manually install the required dependencies for your hardware in order to make full use of hardware acceleration.
Note that the Xinference installation may pull in a different version of PyTorch (the vLLM component it depends on installs its own), which can leave the GPU server unable to work properly. After installing Xinference, run the following command to check whether PyTorch works correctly:

python -c "import torch; print(.is_available())"

If the output is True, PyTorch is working correctly; otherwise you need to reinstall PyTorch.
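For a slightly more complete check, the following sketch prints the PyTorch version, the CUDA build it was compiled against, and the GPU it detects, which helps diagnose the version-mismatch problem described above:

import torch

# Verify that the PyTorch installed alongside Xinference still sees the GPU
print("torch version:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))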

2.1.1 llama-cpp-python installation

 ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Reason for the error: pip install llama-cpp-python downloads the source package (llama_cpp_python-0.2.x, about 36.8 MB) and compiles it locally. If the system does not have suitable cmake and gcc versions, this error is thrown.

Instead, download the officially compiled wheel (whl) matching your system and install it offline.

  • Web site:/abetlen/llama-cpp-python/releases

Reference link: Say goodbye to lag and enjoy GitHub: five must-see tips for developers in mainland China to speed up access and downloads

Just use a GitHub download accelerator (mirror).

wget https://git.//abetlen/llama-cpp-python/releases/download/v0.2.88-cu122/llama_cpp_python-0.2.88-cp311-cp311-linux_x86_64.whl
  • Examples of installation commands
 pip install llama_cpp_python-0.2.88-cp311-cp311-linux_x86_64.whl

2.2 Xinference Docker Installation

Reference link: Docker Image Installation Official Manual

Currently, there are two channels for pulling the official Xinference image.

  1. From the xprobe/xinference repository on Dockerhub.
  2. Dockerhub images are also uploaded to the AliCloud public registry for users who have difficulty accessing Dockerhub. Pull command: docker pull /xprobe_xinference/xinference:<tag> . Currently available tags include:
    • nightly-main: this image is built daily from the GitHub main branch and is not guaranteed to be stable.
    • v<release version>: this image is built with each release of Xinference and can usually be considered stable and reliable.
    • latest: this image points to the latest release of Xinference.
    • For the CPU-only version, add the -cpu suffix, e.g. nightly-main-cpu.

Nvidia GPU users can start the Xinference server using the Xinference Docker image. Before running the command, make sure Docker and CUDA are installed on your system. You can start Xinference in a container as follows, mapping container port 9997 to host port 9998, setting the log level to DEBUG, and specifying the required environment variables.

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug

You need to change <your_version> to the actual version used, or it can be latest:

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0 --log-level debug
  • --gpus must be specified, and as described earlier, the image must be running on a machine with a GPU or an error will occur.
  • -H 0.0.0.0 must also be specified, otherwise you will not be able to connect to the Xinference service outside the container.
  • Multiple -e options can be specified to assign multiple environment variables.

2.2.2 Mounting the model directory

By default, the image does not contain any model files, and models are downloaded within the container during use. If you need to use a model that has already been downloaded, you need to mount the host's directory inside the container. In this case, you need to specify the local volume when running the container and configure environment variables for Xinference.

docker run -v </on/your/host>:</on/the/container> -e XINFERENCE_HOME=</on/the/container> -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0

The above command works by mounting the specified directory on the host into the container and setting the XINFERENCE_HOME environment variable to point to that directory within the container. This way, all downloaded model files will be stored in the directory you specified on the host. You don't need to worry about losing these files when the Docker container stops; the next time you run the container, you can just use the existing models without having to download them again.

If you downloaded the models to the default path on the host, you need to mount the directory containing the original files into the container, because the Xinference cache directory only stores symbolic links (soft links) to the models. For example, if you use huggingface and modelscope as model repositories, mount the corresponding directories into the container; they are usually located at <home_path>/.cache/huggingface and <home_path>/.cache/modelscope. Use the following command:

docker run \
  -v </your/home/path>/.xinference:/root/.xinference \
  -v </your/home/path>/.cache/huggingface:/root/.cache/huggingface \
  -v </your/home/path>/.cache/modelscope:/root/.cache/modelscope \
  -p 9997:9997 \
  --gpus all \
  xprobe/xinference:v<your_version> \
  xinference-local -H 0.0.0.0

3. Starting the xinference service (UI)

By default Xinference serves locally on port 9997. Because --host 0.0.0.0 is specified here, non-local clients can access the Xinference service via the machine's IP address (the port is changed to 7861 in this example):

xinference-local --host 0.0.0.0 --port 7861
  • Startup output results

2024-08-14 15:37:36,771  1739661 INFO     Xinference supervisor 0.0.0.0:62536 started
2024-08-14 15:37:36,901  1739661 INFO     Starting metrics export server at 0.0.0.0:None
2024-08-14 15:37:36,903  1739661 INFO     Checking metrics export server...
2024-08-14 15:37:39,192  1739661 INFO     Metrics server is started at: http://0.0.0.0:33423
2024-08-14 15:37:39,193  1739661 INFO     Purge cache directory: /root/.xinference/cache
2024-08-14 15:37:39,194  1739661 INFO     Connected to supervisor as a fresh worker
2024-08-14 15:37:39,205  1739661 INFO     Xinference worker 0.0.0.0:62536 started
2024-08-14 15:37:43,454 .restful_api 1739585 INFO     Starting Xinference at endpoint: http://0.0.0.0:8501
2024-08-14 15:37:43,597  1739585 INFO     Uvicorn running on http://0.0.0.0:8501 (Press CTRL+C to quit)

3.1 Model Download

vLLM Engine

vLLM is a high-performance large model inference engine that supports high concurrency. Xinference automatically selects vLLM as the engine to achieve higher throughput when the following conditions are met:

  • The model format is pytorch, gptq, or awq.
  • When the model format is pytorch, the quantization must be none.
  • When the model format is awq, the quantization must be Int4.
  • When the model format is gptq, the quantization must be Int3, Int4, or Int8.
  • The operating system is Linux and there is at least one CUDA-enabled device.
  • For custom models, the model_family field, and for built-in models, the model_name field, is in the vLLM support list.

llama.cpp Engine

Xinference uses llama-cpp-python to support models in gguf and ggml format. It is recommended to install the dependencies manually according to the hardware you use, in order to get the best acceleration.

Installation commands for different hardware:

  • Apple M Series

    CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
    
    
  • NVIDIA graphics cards:

    CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
    
    
  • AMD graphics cards:

    CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
    
    

SGLang Engine

SGLang provides a high-performance inference runtime based on RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls. It also supports other common inference techniques, such as continuous batching and tensor parallelism.

Installation:

pip install 'xinference[sglang]'

3.2 Model deployment

The following parameters are available for selection when deploying an LLM model:

  • Model Format: the model format; you can choose between quantized and non-quantized formats. The non-quantized format is pytorch, and the quantized formats include ggml, gptq, awq, etc.

  • Model Size: the number of parameters of the model; for Llama3 there are options such as 8B, 70B, and so on.

  • Quantization: the quantization precision, e.g. 4-bit, 8-bit.

  • N-GPU: choose which GPU(s) to use.

  • Model UID (optional): a custom name for the model; if left empty, the original model name is used.

After the parameters are filled in, click the rocket icon button on the left to start deploying the model; the backend downloads the quantized or non-quantized LLM model according to the chosen parameters. Once deployment completes, the page automatically jumps to the Running Models menu, and the deployed model appears in the LANGUAGE MODELS tab.
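The same deployment parameters can also be supplied programmatically through the Python client instead of the WebGUI. The sketch below is a hedged example using the built-in qwen2-instruct model; depending on your Xinference version, an additional model_engine argument (e.g. "vllm" or "transformers") may also be required by launch_model:

from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your endpoint

# Launch an LLM with the same parameters as the WebGUI form:
# model format, size in billions, quantization, and (optionally) a custom UID.
model_uid = client.launch_model(
    model_name="qwen2-instruct",    # built-in model name
    model_type="LLM",
    model_format="pytorch",         # Model Format
    model_size_in_billions=7,       # Model Size
    quantization="none",            # Quantization
)
print("running model uid:", model_uid)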

3.2.1 flashinfer installation

Reference Links:/gh_mirrors/fl/flashinfer/overview

Reference Links:/

  • Pre-compiled wheels for Linux are provided, and FlashInfer can be tried with the following command:
# For CUDA 12.4 and torch 2.4
pip install flashinfer -i /whl/cu124/torch2.4
#For other CUDA and torch versions, please visit / for details
  • Or you can compile and install it from source:
git clone /flashinfer-ai/ --recursive
cd flashinfer/python
pip install -e .
  • If you need to reduce the size of the binary at build and test time, you can do so:
git clone /flashinfer-ai/ --recursive
cd flashinfer/python
# See /docs/stable/generated/.get_device_capability.html#.get_device_capability
export TORCH_CUDA_ARCH_LIST=8.0
pip install -e .

Check out the torch version:

import torch
print(torch.__version__)
#2.4.0+cu121
  • OS: Linux only
  • Python: 3.8, 3.9, 3.10, 3.11, 3.12
  • PyTorch: 2.2/2.3/2.4 with CUDA 11.8/12.1/12.4 (only for torch 2.4)
    • Use python -c "import torch; print(torch.version.cuda)" to check your PyTorch CUDA version.
  • Supported GPU architectures: sm80, sm86, sm89, sm90 (sm75 / sm70 support is working in progress).
pip install flashinfer -i /whl/cu121/torch2.4/

If the download is too slow, use a prebuilt wheel (whl) instead:

  • github url:/flashinfer-ai/flashinfer/releases
Downloading /flashinfer-ai/flashinfer/releases/download/v0.1.4/flashinfer-0.1.4%2Bcu121torch2.4-cp311-cp311-linux_x86_64.whl (1098.5 MB)

wget https://git.//flashinfer-ai/flashinfer/releases/download/v0.1.4/flashinfer-0.1.4+cu121torch2.4-cp311-cp311-linux_x86_64.whl

pip install flashinfer-0.1.4+cu121torch2.4-cp311-cp311-linux_x86_64.whl
  • Another possible problem: the model's default data type (dtype) is not supported by the GPU

Trying to run qwen2 1.5B ran into a problem:

Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-16GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the`dtype` flag in CLI, for example: --dtype=half
  • Compute Capability list for GPUs:

This shows that the Tesla V100 has a Compute Capability of 7.0, which means it cannot compute with Bfloat16. You have to fall back to float16 instead, so set dtype to half or float16 at runtime, otherwise vLLM will report an error.

In mainland China you need to set the environment variable VLLM_USE_MODELSCOPE=True, and then you can start a vLLM large model API service:

CUDA_VISIBLE_DEVICES=0,1 nohup python -m .api_server --model pooka74/LLaMA3-8B-Chat-Chinese --dtype=half --port 8000 &> ~/logs/ &
  • Setting this in the web UI:
    • On the command line it is --dtype=half; in the web UI, click the extra "+" button and add an option with key dtype and value half.

  • View GPU Resource Usage

3.2.2 Distributed deployment

In a distributed scenario, you need to deploy a Xinference supervisor on one server and a Xinference worker on each of the remaining servers. The steps are as follows:

(1) Start the supervisor by executing the command
xinference-supervisor -H "${supervisor_host}", replacing ${supervisor_host} with the actual hostname or IP address of the server where the supervisor is located.

(2) Start a worker on each of the remaining servers by executing the command
xinference-worker -e "http://${supervisor_host}:9997"

When Xinference starts, it prints the endpoint of the service, which is used to manage the model via command line tools or programming interfaces:

In a local deployment, the endpoint defaults to http://localhost:9997

In a cluster deployment, the endpoint defaults to http://${supervisor_host}:9997, where ${supervisor_host} is the hostname or IP address of the server where the supervisor is located.
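Whichever deployment mode you use, the Python client simply points at that endpoint. A minimal sketch for the clustered case (supervisor_host is a placeholder for your supervisor's hostname or IP):

from xinference.client import Client

# In a cluster deployment, connect to the supervisor's endpoint
client = Client("http://supervisor_host:9997")
print(client.list_models())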

3.3 Model utilization

Once the model is downloaded and launched, a local web page automatically opens where you can have a simple conversation with the model to test if it runs successfully.

Copy the Model ID shown on the model card so it can be used in other LLMOps platforms.

3.3.1 Quick Gradio Conversations

3.3.2 Integration of Dify Smart Quiz

Once the model is deployed, connect to it from Dify by filling in Settings > Model Provider > Xinference:

  • Model Name: qwen2-instruct

  • Server URL: http://<Machine_IP>:7861 (replace with your machine's IP address)

  • Model UID: qwen2-instruct

  • "Save" to use the model in your application.

Dify also supports using Xinference embedding models as Embedding models; just select the Embeddings type in the configuration box.

3.4 Customized models

Reference links:

  • Deploying Custom Big Models on Xinference

  • Official Manual - Customized Models

  • Xorbits inference operation in practice

  • built-in model

xinference registrations --model-type LLM --endpoint "http://127.0.0.1:7861"

Type    Name                         Language                                                      Ability             Is-built-in
------  ---------------------------  ------------------------------------------------------------  ------------------  -------------
LLM     aquila2                      ['zh']                                                        ['generate']        True
LLM     aquila2-chat                 ['zh']                                                        ['chat']            True
LLM     aquila2-chat-16k             ['zh']                                                        ['chat']            True
LLM     baichuan                     ['en', 'zh']                                                  ['generate']        True
LLM     baichuan-2                   ['en', 'zh']                                                  ['generate']        True
LLM     baichuan-2-chat              ['en', 'zh']                                                  ['chat']            True
LLM     baichuan-chat                ['en', 'zh']                                                  ['chat']            True
LLM     c4ai-command-r-v01           ['en', 'fr', 'de', 'es', 'it', 'pt', 'ja', 'ko', 'zh', 'ar']  ['chat']            True
LLM     chatglm                      ['en', 'zh']                                                  ['chat']            True
LLM     chatglm2                     ['en', 'zh']                                                  ['chat']            True
LLM     chatglm2-32k                 ['en', 'zh']                                                  ['chat']            True
LLM     chatglm3                     ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     chatglm3-128k                ['en', 'zh']                                                  ['chat']            True
LLM     chatglm3-32k                 ['en', 'zh']                                                  ['chat']            True
LLM     code-llama                   ['en']                                                        ['generate']        True
LLM     code-llama-instruct          ['en']                                                        ['chat']            True
LLM     code-llama-python            ['en']                                                        ['generate']        True
LLM     codegeex4                    ['en', 'zh']                                                  ['chat']            True
LLM     codeqwen1.5                  ['en', 'zh']                                                  ['generate']        True
LLM     codeqwen1.5-chat             ['en', 'zh']                                                  ['chat']            True
LLM     codeshell                    ['en', 'zh']                                                  ['generate']        True
LLM     codeshell-chat               ['en', 'zh']                                                  ['chat']            True
LLM     codestral-v0.1               ['en']                                                        ['generate']        True
LLM     cogvlm2                      ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     csg-wukong-chat-v0.1         ['en']                                                        ['chat']            True
LLM     deepseek                     ['en', 'zh']                                                  ['generate']        True
LLM     deepseek-chat                ['en', 'zh']                                                  ['chat']            True
LLM     deepseek-coder               ['en', 'zh']                                                  ['generate']        True
LLM     deepseek-coder-instruct      ['en', 'zh']                                                  ['chat']            True
LLM     deepseek-vl-chat             ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     falcon                       ['en']                                                        ['generate']        True
LLM     falcon-instruct              ['en']                                                        ['chat']            True
LLM     gemma-2-it                   ['en']                                                        ['chat']            True
LLM     gemma-it                     ['en']                                                        ['chat']            True
LLM     glaive-coder                 ['en']                                                        ['chat']            True
LLM     glm-4v                       ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     glm4-chat                    ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     glm4-chat-1m                 ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     gorilla-openfunctions-v1     ['en']                                                        ['chat']            True
LLM     gorilla-openfunctions-v2     ['en']                                                        ['chat']            True
LLM     gpt-2                        ['en']                                                        ['generate']        True
LLM     internlm-20b                 ['en', 'zh']                                                  ['generate']        True
LLM     internlm-7b                  ['en', 'zh']                                                  ['generate']        True
LLM     internlm-chat-20b            ['en', 'zh']                                                  ['chat']            True
LLM     internlm-chat-7b             ['en', 'zh']                                                  ['chat']            True
LLM     internlm2-chat               ['en', 'zh']                                                  ['chat']            True
LLM     internlm2.5-chat             ['en', 'zh']                                                  ['chat']            True
LLM     internlm2.5-chat-1m          ['en', 'zh']                                                  ['chat']            True
LLM     internvl-chat                ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     llama-2                      ['en']                                                        ['generate']        True
LLM     llama-2-chat                 ['en']                                                        ['chat']            True
LLM     llama-3                      ['en']                                                        ['generate']        True
LLM     llama-3-instruct             ['en']                                                        ['chat']            True
LLM     llama-3.1                    ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th']              ['generate']        True
LLM     llama-3.1-instruct           ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th']              ['chat']            True
LLM     minicpm-2b-dpo-bf16          ['zh']                                                        ['chat']            True
LLM     minicpm-2b-dpo-fp16          ['zh']                                                        ['chat']            True
LLM     minicpm-2b-dpo-fp32          ['zh']                                                        ['chat']            True
LLM     minicpm-2b-sft-bf16          ['zh']                                                        ['chat']            True
LLM     minicpm-2b-sft-fp32          ['zh']                                                        ['chat']            True
LLM     MiniCPM-Llama3-V-2_5         ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     MiniCPM-V-2.6                ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     mistral-instruct-v0.1        ['en']                                                        ['chat']            True
LLM     mistral-instruct-v0.2        ['en']                                                        ['chat']            True
LLM     mistral-instruct-v0.3        ['en']                                                        ['chat']            True
LLM     mistral-large-instruct       ['en', 'fr', 'de', 'es', 'it', 'pt', 'zh', 'ru', 'ja', 'ko']  ['chat']            True
LLM     mistral-nemo-instruct        ['en', 'fr', 'de', 'es', 'it', 'pt', 'zh', 'ru', 'ja']        ['chat']            True
LLM     mistral-v0.1                 ['en']                                                        ['generate']        True
LLM     mixtral-8x22B-instruct-v0.1  ['en', 'fr', 'it', 'de', 'es']                                ['chat']            True
LLM     mixtral-instruct-v0.1        ['en', 'fr', 'it', 'de', 'es']                                ['chat']            True
LLM     mixtral-v0.1                 ['en', 'fr', 'it', 'de', 'es']                                ['generate']        True
LLM     OmniLMM                      ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     OpenBuddy                    ['en']                                                        ['chat']            True
LLM     openhermes-2.5               ['en']                                                        ['chat']            True
LLM     opt                          ['en']                                                        ['generate']        True
LLM     orca                         ['en']                                                        ['chat']            True
LLM     orion-chat                   ['en', 'zh']                                                  ['chat']            True
LLM     orion-chat-rag               ['en', 'zh']                                                  ['chat']            True
LLM     phi-2                        ['en']                                                        ['generate']        True
LLM     phi-3-mini-128k-instruct     ['en']                                                        ['chat']            True
LLM     phi-3-mini-4k-instruct       ['en']                                                        ['chat']            True
LLM     platypus2-70b-instruct       ['en']                                                        ['generate']        True
LLM     qwen-chat                    ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     qwen-vl-chat                 ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     qwen1.5-chat                 ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     qwen1.5-moe-chat             ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     qwen2-instruct               ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     qwen2-moe-instruct           ['en', 'zh']                                                  ['chat', 'tools']   True
LLM     seallm_v2                    ['en', 'zh', 'vi', 'id', 'th', 'ms', 'km', 'lo', 'my', 'tl']  ['generate']        True
LLM     seallm_v2.5                  ['en', 'zh', 'vi', 'id', 'th', 'ms', 'km', 'lo', 'my', 'tl']  ['generate']        True
LLM     Skywork                      ['en', 'zh']                                                  ['generate']        True
LLM     Skywork-Math                 ['en', 'zh']                                                  ['generate']        True
LLM     starchat-beta                ['en']                                                        ['chat']            True
LLM     starcoder                    ['en']                                                        ['generate']        True
LLM     starcoderplus                ['en']                                                        ['generate']        True
LLM     Starling-LM                  ['en', 'zh']                                                  ['chat']            True
LLM     telechat                     ['en', 'zh']                                                  ['chat']            True
LLM     tiny-llama                   ['en']                                                        ['generate']        True
LLM     vicuna-v1.3                  ['en']                                                        ['chat']            True
LLM     vicuna-v1.5                  ['en']                                                        ['chat']            True
LLM     vicuna-v1.5-16k              ['en']                                                        ['chat']            True
LLM     wizardcoder-python-v1.0      ['en']                                                        ['chat']            True
LLM     wizardlm-v1.0                ['en']                                                        ['chat']            True
LLM     wizardmath-v1.0              ['en']                                                        ['chat']            True
LLM     xverse                       ['en', 'zh']                                                  ['generate']        True
LLM     xverse-chat                  ['en', 'zh']                                                  ['chat']            True
LLM     Yi                           ['en', 'zh']                                                  ['generate']        True
LLM     Yi-1.5                       ['en', 'zh']                                                  ['generate']        True
LLM     Yi-1.5-chat                  ['en', 'zh']                                                  ['chat']            True
LLM     Yi-1.5-chat-16k              ['en', 'zh']                                                  ['chat']            True
LLM     Yi-200k                      ['en', 'zh']                                                  ['generate']        True
LLM     Yi-chat                      ['en', 'zh']                                                  ['chat']            True
LLM     yi-vl-chat                   ['en', 'zh']                                                  ['chat', 'vision']  True
LLM     zephyr-7b-alpha              ['en']                                                        ['chat']            True
LLM     zephyr-7b-beta               ['en']                                                        ['chat']            True
  • Registering Models via the Web

4. Terminal commands

If you changed the port above, adjust the corresponding environment variables below accordingly:

# Configure the HuggingFace mirror endpoint
export HF_ENDPOINT=
export XINFERENCE_MODEL_SRC=modelscope
# Set the Xinference home directory
export XINFERENCE_HOME=/root/autodl-tmp
# Reset the endpoint environment variable if the port has been modified
export XINFERENCE_ENDPOINT=http://0.0.0.0:7863

After the modification, you can start the corresponding services. The following commands start the chat / embedding / rerank models; commands for other models can be found on the Xinference homepage. After starting, the UID of the corresponding model is returned (it will be used later in the Dify deployment).

# Deploy chatglm3
xinference launch --model-name chatglm3 --size-in-billions 6 --model-format pytorch --quantization 8-bit
# Deploy the bge-large-zh embedding model
xinference launch --model-name bge-large-zh --model-type embedding
# Deploy the bge-reranker-large rerank model
xinference launch --model-name bge-reranker-large --model-type rerank

API call

If you are not satisfied with the LLM model's web interface, you can also call the API directly. In fact, when the Xinference service is deployed, the WebGUI and the API are both available; open http://localhost:9997/docs in your browser to see the list of API endpoints.

The list contains a large number of endpoints, not only for LLM models but also for other model types (e.g., Embedding or Rerank), and they are all OpenAI API compatible. Taking the chat function of an LLM as an example, we call its endpoint with the curl tool as follows:

curl -X 'POST' \
  'http://localhost:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "chatglm3",
    "messages": [
      {
        "role": "user",
        "content": "hello"
      }
    ]
  }'

#Return results
{
  "model": "chatglm3",
  "object": "",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?",
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 29,
    "completion_tokens": 37
  }
}
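Because the endpoint is OpenAI-compatible, you can also call it with the official openai Python SDK instead of curl. The following is a minimal sketch; the api_key is a placeholder, since Xinference does not require one by default:

from openai import OpenAI

# Point the OpenAI client at the Xinference endpoint; any placeholder api_key works
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="chatglm3",  # the name/UID of the deployed model
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)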

If you want to test whether a model has been deployed locally, for example the rerank model, you can run the following script, or use the command-line commands shown further below.

from xinference.client import Client

# The url can be a local port or an externally exposed port
url = "http://172.19.0.1:6006"
print(url)

client = Client(url)
model_uid = client.launch_model(model_name="bge-reranker-base", model_type="rerank")
model = client.get_model(model_uid)

query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin."
]
print(model.rerank(corpus, query))

  • Or view the deployed models from the command line:
xinference list

  • If resources need to be released
xinference terminate --model-uid "my-llama-2"
  • For external network access, you need the machine's IP address, i.e. http://<Machine_IP>:<port>. You can look up the IP address as follows:
#Windows
ipconfig /all

#Linux
hostname -I

5. Xinference official AI practice cases

Official Link:/zh-cn/latest/examples/

Reference Links:

  • Xinference: LLM, embedding, and rerank large models needed for local deployment of Dify
  • Xinference Large Model Reasoning Framework Deployment and Application
  • Deploying Custom Big Models on Xinference
  • Official Manual - Customized Models
  • Xorbits inference operation in practice