Xinference Practical Guide: A Complete Walkthrough of LLM Deployment, Integrated with Dify to Build Efficient AI Applications and Accelerate AI Project Delivery
Xorbits Inference (Xinference) is an open-source platform that simplifies running and integrating a wide range of AI models. With Xinference, you can run inference with any open-source LLM, embedding model, or multimodal model, in the cloud or on-premises, and build powerful AI applications. Xinference makes it easy to deploy your own models, or to serve built-in state-of-the-art open-source models with a single click.
- Official website: /inference
- GitHub: https://github.com/xorbitsai/inference
- Official handbook: https://inference.readthedocs.io/zh-cn/latest/
- Xinference feature highlights:
  - Simplified model serving: the deployment of large language models, speech recognition models, and multimodal models is greatly simplified; a single command is enough to deploy a model.
  - Cutting-edge built-in models: the framework ships with many state-of-the-art Chinese and English models, including baichuan, chatglm2, and more, ready to try with one click. The list of built-in models is still being updated rapidly.
  - Heterogeneous hardware: reduce latency and increase throughput by using both GPU and CPU for inference via ggml.
  - Flexible interfaces: a variety of interfaces are provided for using models, including an OpenAI-compatible RESTful API (with Function Calling), RPC, the command line, and a web UI, which makes model management and interaction convenient.
  - Cluster computing and distributed collaboration: distributed deployment is supported; a built-in resource scheduler assigns models of different sizes to different machines on demand, making full use of cluster resources.
  - Open ecosystem, seamless integration: works seamlessly with popular third-party libraries, including LangChain, LlamaIndex, Dify, FastGPT, RAGFlow, and Chatbox.
1. Model support
1.1 Large model support
Reference link: https://inference.readthedocs.io/zh-cn/latest/models/builtin/llm/
All mainstream large language models are supported.
1.2 Embedding Model
Reference link: https://inference.readthedocs.io/zh-cn/latest/models/builtin/embedding/
Open-source embedding models are also supported, for example:
- BAAI-bge-large-zh-v1.5
Reference link: BAAI Embedding semantic vector fine-tuning
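For a quick check, an embedding model can also be launched and called through the Python client. This is a minimal sketch, assuming the built-in model name bge-large-zh-v1.5 and a local Xinference service on port 9997:
from xinference.client import Client

client = Client("http://localhost:9997")  # adjust host/port to your deployment
# Launch the built-in embedding model (model name taken from the built-in model list)
model_uid = client.launch_model(model_name="bge-large-zh-v1.5", model_type="embedding")
model = client.get_model(model_uid)
# create_embedding returns an OpenAI-style embedding response
result = model.create_embedding("What is Xinference?")
print(len(result["data"][0]["embedding"]))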
1.3 Rerank Model (Reranker)
Reference link: https://inference.readthedocs.io/zh-cn/latest/models/builtin/rerank/
- bge-reranker-large
Reference link: BAAI Cross-Encoder semantic vector fine-tuning
1.4 Image Model
Xinference also supports image models, which can be used for text-to-image and image-to-image generation. Xinference has several built-in image models, namely the various versions of Stable Diffusion (SD). The deployment method is similar to that of text models: you can start the model from the WebGUI, and no extra parameters need to be selected. However, since SD models are quite large, make sure the server has more than 50 GB of free space before deploying an image model.
1.5 Audio Model
Audio models are a recently added feature in Xinference. With audio models you can implement speech-to-text, speech translation, and similar functions. Before deploying an audio model, you need to install the ffmpeg
component. Taking the Ubuntu operating system as an example, the installation command is as follows:
sudo apt update && sudo apt install ffmpeg
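Once ffmpeg is installed, an audio model can be launched and used through the Python client. The sketch below is illustrative only; the model name whisper-large-v3, the endpoint, and the audio file are assumptions, and the handle methods may differ between Xinference versions:
from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your Xinference endpoint
# Launch a built-in speech-to-text model (model name assumed)
model_uid = client.launch_model(model_name="whisper-large-v3", model_type="audio")
model = client.get_model(model_uid)
# Transcribe a local audio file (speech-to-text); sample.wav is a placeholder
with open("sample.wav", "rb") as f:
    audio = f.read()
print(model.transcriptions(audio))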
1.6 Model sources
By default, Xinference downloads models from HuggingFace. If you need to download models from another site, you can do so by setting the environment variable XINFERENCE_MODEL_SRC.
For example, start the Xinference service with the following command and models will be downloaded from ModelScope when they are deployed:
XINFERENCE_MODEL_SRC=modelscope xinference-local
1.7 One GPU per Model
When deploying models with Xinference, if your server has only one GPU, you can deploy only one LLM, multimodal, image, or audio model at a time, because Xinference currently dedicates one GPU to one model for these model types. If you try to deploy more than one of these models on the same GPU, you will encounter the error: No available slot found for the model.
1.8 Managing Models
In addition to starting a model, Xinference provides the ability to manage the entire lifecycle of a model. Again, you can use the command line:
List all models of the specified type supported by Xinference:
xinference registrations -t LLM
List all running models:
xinference list
Stop a running model:
xinference terminate --model-uid "qwen2"
Refer to section 3.1 for more
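The same lifecycle operations are available programmatically through the Python client. A minimal sketch, with the endpoint and model UID as example values:
from xinference.client import Client

client = Client("http://localhost:9997")   # your Xinference endpoint
print(client.list_models())                # equivalent of `xinference list`
client.terminate_model(model_uid="qwen2")  # equivalent of `xinference terminate --model-uid "qwen2"`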
2. Xinference Installation
Install Xinference's base dependencies for inference, as well as dependencies that support inference with ggml and PyTorch.
2.1 Xinference Local Source Installation
First, prepare a Python 3.9+ environment to run Xinference; we recommend installing conda by following the official conda documentation. Then create a Python 3.11 environment with the following commands:
conda create --name xinference python=3.11
conda activate xinference
The following commands install Xinference together with its optional inference engine backends (such as Transformers and vLLM):
# Base installation
pip install "xinference"
# Install with a specific backend
pip install "xinference[ggml]"
pip install "xinference[pytorch]"
pip install "xinference[transformers]"
pip install "xinference[vllm]"
pip install "xinference[transformers,vllm]"  # install both backends at once
# Or install xinference with all inference backends at once
pip install "xinference[all]"
# Optionally append -i <your-PyPI-mirror>/simple to any command above to use a mirror index
If you want to use models in GGML format, it is recommended to manually install the required dependencies according to the currently used hardware in order to fully utilize the acceleration capabilities of the hardware.
Other versions of PyTorch may be installed during the Xinference installation (the vllm component it depends on may pull in its own build), which can leave the GPU server in a non-working state. After installing Xinference, run the following command to check whether PyTorch works correctly:
python -c "import torch; print(torch.cuda.is_available())"
If the output is True, PyTorch is working properly; otherwise you need to reinstall PyTorch.
2.1.1 llama-cpp-python installation
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
Reason for the error: when installing with pip install llama-cpp-python, pip downloads the source distribution (llama_cpp_python-0.2.x, about 36.8 MB) and compiles it. If the system does not have suitable cmake and gcc versions, this error is thrown.
Instead, download the officially compiled wheel that matches your system and install it offline.
- Website: https://github.com/abetlen/llama-cpp-python/releases
Reference link: Say Goodbye to Lag and Enjoy GitHub: Five Must-See Tips for Developers in Mainland China to Accelerate Access and Downloads
You just need a GitHub download accelerator (mirror/proxy), for example:
wget https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.88-cu122/llama_cpp_python-0.2.88-cp311-cp311-linux_x86_64.whl
- Example installation command:
pip install llama_cpp_python-0.2.88-cp311-cp311-linux_x86_64.whl
2.2 Docker installation xinference
Reference Links:Docker Image Installation Official Manual
Currently, there are two channels for pulling the official Xinference image:
- The xprobe/xinference repository on Dockerhub.
- The Dockerhub images are also uploaded to an AliCloud public registry for users who have difficulty accessing Dockerhub. Pull command: docker pull /xprobe_xinference/xinference:<tag>
Currently available tags include:
- nightly-main: built daily from the GitHub main branch; stability is not guaranteed.
- v<release version>: built for each Xinference release; usually considered stable and reliable.
- latest: points to the most recent release of Xinference.
- For the CPU-only version, append the -cpu suffix, e.g. nightly-main-cpu.
Nvidia GPU users can start the Xinference server using the Xinference Docker image. Before executing the command, make sure that Docker and CUDA are installed on your system. You can start Xinference in a container as follows, mapping container port 9997 to host port 9998, setting the log level to DEBUG, and specifying the required environment variables:
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0 --log-level debug
Replace <your_version> with the actual version you use, or simply use latest:
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0 --log-level debug
- --gpus must be specified, and as described earlier, the image must be running on a machine with a GPU or an error will occur.
- -H 0.0.0.0 must also be specified, otherwise you will not be able to connect to the Xinference service outside the container.
- Multiple -e options can be specified to assign multiple environment variables.
2.2.2 Mounting the model directory
By default, the image does not contain any model files, and models are downloaded within the container during use. If you need to use a model that has already been downloaded, you need to mount the host's directory inside the container. In this case, you need to specify the local volume when running the container and configure environment variables for Xinference.
docker run -v </on/your/host>:</on/the/container> -e XINFERENCE_HOME=</on/the/container> -p 9998:9997 --gpus all xprobe/xinference:v<your_version> xinference-local -H 0.0.0.0
The above command works by mounting the specified directory on the host into the container and setting the XINFERENCE_HOME environment variable to point to that directory within the container. This way, all downloaded model files will be stored in the directory you specified on the host. You don't need to worry about losing these files when the Docker container stops; the next time you run the container, you can just use the existing models without having to download them again.
If you downloaded the models to the default path on the host, you also need to mount the directory containing the original files into the container, because the xinference cache directory stores models via soft links. For example, if you use huggingface and modelscope as model repositories, you need to mount the corresponding directories, which are usually located at <home_path>/.cache/huggingface and <home_path>/.cache/modelscope, into the container with the following command:
docker run \
-v </your/home/path>/.xinference:/root/.xinference \
-v </your/home/path>/.cache/huggingface:/root/.cache/huggingface \
-v </your/home/path>/.cache/modelscope:/root/.cache/modelscope \
-p 9997:9997 \
--gpus all \
xprobe/xinference:v<your_version> \
xinference-local -H 0.0.0.0
3. Starting the xinference service (UI)
By default, Xinference starts the service locally on port 9997. Because --host 0.0.0.0 is specified here, non-local clients can access the Xinference service via the machine's IP address; the port is changed to 7861 with --port:
xinference-local --host 0.0.0.0 --port 7861
- Sample startup output:
2024-08-14 15:37:36,771 1739661 INFO Xinference supervisor 0.0.0.0:62536 started
2024-08-14 15:37:36,901 1739661 INFO Starting metrics export server at 0.0.0.0:None
2024-08-14 15:37:36,903 1739661 INFO Checking metrics export server...
2024-08-14 15:37:39,192 1739661 INFO Metrics server is started at: http://0.0.0.0:33423
2024-08-14 15:37:39,193 1739661 INFO Purge cache directory: /root/.xinference/cache
2024-08-14 15:37:39,194 1739661 INFO Connected to supervisor as a fresh worker
2024-08-14 15:37:39,205 1739661 INFO Xinference worker 0.0.0.0:62536 started
2024-08-14 15:37:43,454 .restful_api 1739585 INFO Starting Xinference at endpoint: http://0.0.0.0:8501
2024-08-14 15:37:43,597 1739585 INFO Uvicorn running on http://0.0.0.0:8501 (Press CTRL+C to quit)
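To verify the service is reachable, you can query the OpenAI-compatible model list endpoint. This is a sketch assuming the --port 7861 used in the start command above; with the default configuration the port is 9997:
curl http://localhost:7861/v1/models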
3.1 Model Download
vLLM Engine
vLLM is a high-performance large model inference engine that supports high concurrency. Xinference automatically selects vLLM as the engine to achieve higher throughput when the following conditions are met:
- The model format is pytorch, gptq, or awq.
- When the model format is pytorch, the quantization must be none.
- When the model format is awq, the quantization must be Int4.
- When the model format is gptq, the quantization must be Int3, Int4, or Int8.
- The operating system is Linux and there is at least one CUDA-capable device.
- For custom models, the model_family field is in the vLLM support list; for built-in models, the model_name field is in the vLLM support list.
llama.cpp Engine
Xinference uses llama-cpp-python to support models in gguf and ggml formats. It is recommended to manually install the dependencies for your hardware to get the best acceleration.
Installation commands for different hardware:
- Apple M series:
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
- NVIDIA graphics cards:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
- AMD graphics cards:
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
SGLang Engine
SGLang features a high-performance inference runtime based on RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls, and it also supports other common inference techniques such as continuous batching and tensor parallelism.
Installation:
pip install 'xinference[sglang]'
3.2 Model deployment
The following parameters are available for selection when deploying an LLM model:
- Model Format: model format; you can choose between quantized and non-quantized formats. The non-quantized format is pytorch; the quantized formats include ggml, gptq, awq, etc.
- Model Size: the number of parameters of the model; for Llama3 the options are 8B, 70B, and so on.
- Quantization: quantization precision, such as 4-bit or 8-bit.
- N-GPU: which GPU(s) to use.
- Model UID (optional): a custom name for the model; if left empty, the original model name is used.
After the parameters are filled in, click the rocket icon button on the left to start deploying the model; in the background, the quantized or non-quantized LLM model is downloaded according to the chosen parameters. After deployment completes, the page automatically jumps to the Running Models menu, where the deployed model appears under the LANGUAGE MODELS tab. The same launch can also be done programmatically, as sketched below.
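A minimal sketch of the equivalent programmatic launch with the Python client; the model name, size, format, and endpoint below are example assumptions to adapt to your deployment:
from xinference.client import Client

client = Client("http://localhost:9997")  # your Xinference endpoint
model_uid = client.launch_model(
    model_name="qwen2-instruct",       # built-in model name
    model_size_in_billions=7,          # Model Size
    model_format="pytorch",            # Model Format
    quantization="none",               # Quantization (none for the pytorch format)
)
print(f"Model launched, UID: {model_uid}")  # this UID is what Dify and API calls refer to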
3.2.1 flashinfer installation
Reference link: /gh_mirrors/fl/flashinfer/overview
Reference link: /
- Pre-compiled wheels for Linux are provided, and FlashInfer can be installed with the following command:
# For CUDA 12.4 and torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA and torch versions, please see the FlashInfer documentation for details
- Or you can compile and install it from source:
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer/python
pip install -e .
- If you need to reduce the binary size at build and test time, you can do the following:
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer/python
# See https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html
export TORCH_CUDA_ARCH_LIST=8.0
pip install -e .
Check out the torch version:
import torch
print(torch.__version__)
#2.4.0+cu121
- OS: Linux only
- Python: 3.8, 3.9, 3.10, 3.11, 3.12
- PyTorch: 2.2/2.3/2.4 with CUDA 11.8/12.1/12.4 (only for torch 2.4)
- Use python -c "import torch; print(torch.version.cuda)" to check your PyTorch CUDA version.
- Supported GPU architectures: sm80, sm86, sm89, sm90 (sm75 / sm70 support is in progress).
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
If the download is too slow, install from a wheel file directly:
- GitHub releases: https://github.com/flashinfer-ai/flashinfer/releases
Downloading flashinfer-0.1.4%2Bcu121torch2.4-cp311-cp311-linux_x86_64.whl (1098.5 MB)
wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.4/flashinfer-0.1.4+cu121torch2.4-cp311-cp311-linux_x86_64.whl
pip install flashinfer-0.1.4+cu121torch2.4-cp311-cp311-linux_x86_64.whl
- Another possible problem: the GPU does not support the model's default data type (bfloat16).
Trying to run qwen2:1.5b hit the following error:
Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla V100-SXM2-16GB GPU has compute capability 7.0. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half
- Compute capability list for GPUs:
The Tesla V100 has a compute capability of 7.0, which means it cannot compute in Bfloat16; it has to fall back to float16. The dtype at runtime must therefore be half or float16, otherwise vLLM will report an error.
In mainland China, you need to set the environment variable VLLM_USE_MODELSCOPE=True; then you can start a vLLM large-model API service:
CUDA_VISIBLE_DEVICES=0,1 nohup python -m vllm.entrypoints.openai.api_server --model pooka74/LLaMA3-8B-Chat-Chinese --dtype=half --port 8000 &> ~/logs/ &
- Interface modification reference:
- On the command line this is --dtype half; in the web UI, click the extra "+" button and add an option with key dtype and value half.
- View GPU Resource Usage
3.2.2 Distributed deployment
In a distributed scenario, you need to deploy a Xinference supervisor on one server and a Xinference worker on each of the remaining servers. The steps are as follows:
(1) Start the supervisor by executing the command:
xinference-supervisor -H "${supervisor_host}", replacing ${supervisor_host} with the actual hostname or IP address of the server where the supervisor runs.
(2) Start the workers on the remaining servers by executing the command:
xinference-worker -e "http://${supervisor_host}:9997"
When Xinference starts, it prints the endpoint of the service, which is used to manage models via command-line tools or programming interfaces:
In a local deployment, the endpoint defaults to http://localhost:9997.
In a cluster deployment, the endpoint defaults to http://${supervisor_host}:9997, where ${supervisor_host} is the hostname or IP address of the server where the supervisor runs.
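After the cluster is up, point the command-line tools at the supervisor endpoint, either through the XINFERENCE_ENDPOINT environment variable or the --endpoint flag (${supervisor_host} is a placeholder for your supervisor's address):
export XINFERENCE_ENDPOINT=http://${supervisor_host}:9997
xinference list
# or equivalently
xinference list --endpoint "http://${supervisor_host}:9997"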
3.3 Model utilization
Once the model is downloaded and launched, a local web page automatically opens where you can have a simple conversation with the model to test if it runs successfully.
Copy the Model UID shown at the bottom of the model card to use it in other LLMOps tools.
3.3.1 Quick Gradio Conversations
3.3.2 Integration of Dify Smart Quiz
Once the model is deployed, add it in Dify under Settings > Model Provider > Xinference and fill in:
- Model Name: qwen2-instruct
- Server URL: http://<Machine_IP>:7861 (replace with your machine's IP address)
- Model UID: qwen2-instruct
- Click "Save" to use the model in your application.
Dify also supports using Xinference embedding models as Embedding models; just select the Embeddings type in the configuration box.
3.4 Customized models
Reference links:
- Deploying Custom Big Models on Xinference
- Official Manual - Customized Models
- Xorbits Inference operation in practice
- List the built-in models:
xinference registrations --model-type LLM --endpoint "http://127.0.0.1:7861"
Type Name Language Ability Is-built-in
------ --------------------------- ------------------------------------------------------------ ------------------ -------------
LLM aquila2 ['zh'] ['generate'] True
LLM aquila2-chat ['zh'] ['chat'] True
LLM aquila2-chat-16k ['zh'] ['chat'] True
LLM baichuan ['en', 'zh'] ['generate'] True
LLM baichuan-2 ['en', 'zh'] ['generate'] True
LLM baichuan-2-chat ['en', 'zh'] ['chat'] True
LLM baichuan-chat ['en', 'zh'] ['chat'] True
LLM c4ai-command-r-v01 ['en', 'fr', 'de', 'es', 'it', 'pt', 'ja', 'ko', 'zh', 'ar'] ['chat'] True
LLM chatglm ['en', 'zh'] ['chat'] True
LLM chatglm2 ['en', 'zh'] ['chat'] True
LLM chatglm2-32k ['en', 'zh'] ['chat'] True
LLM chatglm3 ['en', 'zh'] ['chat', 'tools'] True
LLM chatglm3-128k ['en', 'zh'] ['chat'] True
LLM chatglm3-32k ['en', 'zh'] ['chat'] True
LLM code-llama ['en'] ['generate'] True
LLM code-llama-instruct ['en'] ['chat'] True
LLM code-llama-python ['en'] ['generate'] True
LLM codegeex4 ['en', 'zh'] ['chat'] True
LLM codeqwen1.5 ['en', 'zh'] ['generate'] True
LLM codeqwen1.5-chat ['en', 'zh'] ['chat'] True
LLM codeshell ['en', 'zh'] ['generate'] True
LLM codeshell-chat ['en', 'zh'] ['chat'] True
LLM codestral-v0.1 ['en'] ['generate'] True
LLM cogvlm2 ['en', 'zh'] ['chat', 'vision'] True
LLM csg-wukong-chat-v0.1 ['en'] ['chat'] True
LLM deepseek ['en', 'zh'] ['generate'] True
LLM deepseek-chat ['en', 'zh'] ['chat'] True
LLM deepseek-coder ['en', 'zh'] ['generate'] True
LLM deepseek-coder-instruct ['en', 'zh'] ['chat'] True
LLM deepseek-vl-chat ['en', 'zh'] ['chat', 'vision'] True
LLM falcon ['en'] ['generate'] True
LLM falcon-instruct ['en'] ['chat'] True
LLM gemma-2-it ['en'] ['chat'] True
LLM gemma-it ['en'] ['chat'] True
LLM glaive-coder ['en'] ['chat'] True
LLM glm-4v ['en', 'zh'] ['chat', 'vision'] True
LLM glm4-chat ['en', 'zh'] ['chat', 'tools'] True
LLM glm4-chat-1m ['en', 'zh'] ['chat', 'tools'] True
LLM gorilla-openfunctions-v1 ['en'] ['chat'] True
LLM gorilla-openfunctions-v2 ['en'] ['chat'] True
LLM gpt-2 ['en'] ['generate'] True
LLM internlm-20b ['en', 'zh'] ['generate'] True
LLM internlm-7b ['en', 'zh'] ['generate'] True
LLM internlm-chat-20b ['en', 'zh'] ['chat'] True
LLM internlm-chat-7b ['en', 'zh'] ['chat'] True
LLM internlm2-chat ['en', 'zh'] ['chat'] True
LLM internlm2.5-chat ['en', 'zh'] ['chat'] True
LLM internlm2.5-chat-1m ['en', 'zh'] ['chat'] True
LLM internvl-chat ['en', 'zh'] ['chat', 'vision'] True
LLM llama-2 ['en'] ['generate'] True
LLM llama-2-chat ['en'] ['chat'] True
LLM llama-3 ['en'] ['generate'] True
LLM llama-3-instruct ['en'] ['chat'] True
LLM llama-3.1 ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th'] ['generate'] True
LLM llama-3.1-instruct ['en', 'de', 'fr', 'it', 'pt', 'hi', 'es', 'th'] ['chat'] True
LLM minicpm-2b-dpo-bf16 ['zh'] ['chat'] True
LLM minicpm-2b-dpo-fp16 ['zh'] ['chat'] True
LLM minicpm-2b-dpo-fp32 ['zh'] ['chat'] True
LLM minicpm-2b-sft-bf16 ['zh'] ['chat'] True
LLM minicpm-2b-sft-fp32 ['zh'] ['chat'] True
LLM MiniCPM-Llama3-V-2_5 ['en', 'zh'] ['chat', 'vision'] True
LLM MiniCPM-V-2.6 ['en', 'zh'] ['chat', 'vision'] True
LLM mistral-instruct-v0.1 ['en'] ['chat'] True
LLM mistral-instruct-v0.2 ['en'] ['chat'] True
LLM mistral-instruct-v0.3 ['en'] ['chat'] True
LLM mistral-large-instruct ['en', 'fr', 'de', 'es', 'it', 'pt', 'zh', 'ru', 'ja', 'ko'] ['chat'] True
LLM mistral-nemo-instruct ['en', 'fr', 'de', 'es', 'it', 'pt', 'zh', 'ru', 'ja'] ['chat'] True
LLM mistral-v0.1 ['en'] ['generate'] True
LLM mixtral-8x22B-instruct-v0.1 ['en', 'fr', 'it', 'de', 'es'] ['chat'] True
LLM mixtral-instruct-v0.1 ['en', 'fr', 'it', 'de', 'es'] ['chat'] True
LLM mixtral-v0.1 ['en', 'fr', 'it', 'de', 'es'] ['generate'] True
LLM OmniLMM ['en', 'zh'] ['chat', 'vision'] True
LLM OpenBuddy ['en'] ['chat'] True
LLM openhermes-2.5 ['en'] ['chat'] True
LLM opt ['en'] ['generate'] True
LLM orca ['en'] ['chat'] True
LLM orion-chat ['en', 'zh'] ['chat'] True
LLM orion-chat-rag ['en', 'zh'] ['chat'] True
LLM phi-2 ['en'] ['generate'] True
LLM phi-3-mini-128k-instruct ['en'] ['chat'] True
LLM phi-3-mini-4k-instruct ['en'] ['chat'] True
LLM platypus2-70b-instruct ['en'] ['generate'] True
LLM qwen-chat ['en', 'zh'] ['chat', 'tools'] True
LLM qwen-vl-chat ['en', 'zh'] ['chat', 'vision'] True
LLM qwen1.5-chat ['en', 'zh'] ['chat', 'tools'] True
LLM qwen1.5-moe-chat ['en', 'zh'] ['chat', 'tools'] True
LLM qwen2-instruct ['en', 'zh'] ['chat', 'tools'] True
LLM qwen2-moe-instruct ['en', 'zh'] ['chat', 'tools'] True
LLM seallm_v2 ['en', 'zh', 'vi', 'id', 'th', 'ms', 'km', 'lo', 'my', 'tl'] ['generate'] True
LLM seallm_v2.5 ['en', 'zh', 'vi', 'id', 'th', 'ms', 'km', 'lo', 'my', 'tl'] ['generate'] True
LLM Skywork ['en', 'zh'] ['generate'] True
LLM Skywork-Math ['en', 'zh'] ['generate'] True
LLM starchat-beta ['en'] ['chat'] True
LLM starcoder ['en'] ['generate'] True
LLM starcoderplus ['en'] ['generate'] True
LLM Starling-LM ['en', 'zh'] ['chat'] True
LLM telechat ['en', 'zh'] ['chat'] True
LLM tiny-llama ['en'] ['generate'] True
LLM vicuna-v1.3 ['en'] ['chat'] True
LLM vicuna-v1.5 ['en'] ['chat'] True
LLM vicuna-v1.5-16k ['en'] ['chat'] True
LLM wizardcoder-python-v1.0 ['en'] ['chat'] True
LLM wizardlm-v1.0 ['en'] ['chat'] True
LLM wizardmath-v1.0 ['en'] ['chat'] True
LLM xverse ['en', 'zh'] ['generate'] True
LLM xverse-chat ['en', 'zh'] ['chat'] True
LLM Yi ['en', 'zh'] ['generate'] True
LLM Yi-1.5 ['en', 'zh'] ['generate'] True
LLM Yi-1.5-chat ['en', 'zh'] ['chat'] True
LLM Yi-1.5-chat-16k ['en', 'zh'] ['chat'] True
LLM Yi-200k ['en', 'zh'] ['generate'] True
LLM Yi-chat ['en', 'zh'] ['chat'] True
LLM yi-vl-chat ['en', 'zh'] ['chat', 'vision'] True
LLM zephyr-7b-alpha ['en'] ['chat'] True
LLM zephyr-7b-beta ['en'] ['chat'] True
- Registering Models via the Web
4. Terminal commands
If you changed the port above, adjust the corresponding settings below.
# Optionally point HF_ENDPOINT at a HuggingFace mirror
export HF_ENDPOINT=
export XINFERENCE_MODEL_SRC=modelscope
# Change the Xinference home directory (where models and caches are stored)
export XINFERENCE_HOME=/root/autodl-tmp
# Reset the endpoint environment variable if the port has been modified
export XINFERENCE_ENDPOINT=http://0.0.0.0:7863
After these modifications, you can start the corresponding services. Below are the commands to start the chat / embedding / rerank models; commands for other models can be found on the Xinference homepage. After a model starts, its UID is returned (it will be used later in the Dify deployment).
# Deploy chatglm3
xinference launch --model-name chatglm3 --size-in-billions 6 --model-format pytorch --quantization 8-bit
# Deploy the bge-large-zh embedding model
xinference launch --model-name bge-large-zh --model-type embedding
# Deploy the bge-reranker-large rerank model
xinference launch --model-name bge-reranker-large --model-type rerank
API call
If the web interface of the LLM model does not meet your needs, you can also call the API to use the model. In fact, when the Xinference service is deployed, the WebGUI and the API are both ready. Open http://localhost:9997/docs in your browser to see the list of API endpoints.
The list of interfaces contains a large number of interfaces, not only for the LLM model, but also for other models (e.g., Embedding or Rerank), and these are all compatible with the OpenAI API. Take the chat function of LLM as an example, we use the Curl tool to call its interface, the example is as follows:
curl -X 'POST' \
'http://localhost:9997/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "chatglm3",
"messages": [
{
"role": "user",
"content": "hello"
}
]
}'
#Return results
{
"model": "chatglm3",
"object": "",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?",
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 8,
"total_tokens": 29,
"completion_tokens": 37
}
}
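Since the interface is OpenAI-compatible, the official openai Python SDK can also be pointed at the Xinference endpoint. A minimal sketch (the port, model UID, and placeholder API key are assumptions; by default Xinference does not validate the key):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="chatglm3",  # the model UID of a running model
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)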
If you want to test whether a model has been deployed locally (for example, the rerank model), you can run the following script:
from xinference.client import Client
# The URL can be the local port or an externally exposed port
url = "http://172.19.0.1:6006"
print(url)
client = Client(url)
model_uid = client.launch_model(model_name="bge-reranker-base", model_type="rerank")
model = client.get_model(model_uid)
query = "A man is eating pasta."
corpus = [
"A man is eating food.",
"A man is eating a piece of bread.",
"The girl is carrying a baby.",
"A man is riding a horse.",
"A woman is playing violin."
]
print(model.rerank(corpus, query))
- Or list the deployed models:
xinference list
- If resources need to be released, terminate the model:
xinference terminate --model-uid "my-llama-2"
- For access from outside the machine, use http://<Machine_IP>:<port>; you can find the local IP address as follows:
# Windows
ipconfig /all
# Linux
hostname -I
5. Xinference official AI practice cases
Official link: https://inference.readthedocs.io/zh-cn/latest/examples/
Reference links:
- Xinference: deploying the LLM, embedding, and rerank models needed by Dify locally
- Xinference Large Model Reasoning Framework Deployment and Application
- Deploying Custom Big Models on Xinference
- Official Manual - Customized Models
- Xorbits inference operation in practice