Are you tired of the complexity and high cost of managing multiple AI models? What if you could deploy 30 model inference services in one go? In today's ML world, organizations that want to realize the full value of their data may end up in a "fine-tuned world": they build a large number of models, each highly specialized for a particular task. But how do you deal with the hassle and cost of deploying a model for every segmented application? Multi-LoRA serving offers a promising answer.
Motivation
There are several reasons why it makes sense for organizations to build multiple fine-tuned models:
- Performance - there is sufficient evidence that smaller, specialized models outperform larger, general-purpose models on their target tasks. Predibase's results [5] show that LoRA fine-tuning the mistralai/Mistral-7B-v0.1 base model on specific tasks yields better performance than GPT-4.
- Adaptability - Models such as Mistral or Llama are extremely versatile: you can pick one as the base model and fine-tune a variety of specialized models for various downstream tasks. Another advantage is that you are not locked into a particular model, since you can easily swap out the base model and fine-tune another one with your data (more on that later).
- Independence - For different tasks, different teams can run their own fine-tuning independently, remaining independent and parallel in data preparation, configuration, evaluation criteria, and model update cadence.
- Privacy - Dedicated models offer a great deal of flexibility, allowing us to segregate training data according to privacy requirements instead of exposing all data in the base model's training set. In addition, as running models locally becomes increasingly important, fine-tuning gives small models running on local devices the ability to perform specific tasks.
In short, fine-tuning enables organizations to unlock the value of their data, an advantage that becomes particularly important, if not game-changing, when they work with their unique, highly specialized data.
It looks promising. Any problems? Yes, there are! Deploying a Large Language Model (LLM) service presents multiple challenges. The cost and operational complexity of deploying a single model is enough of a headache, let alone n models. This means that, for all the benefits of fine-tuning, it is also a hard fact that it makes LLM deployment and serving more complex.
TGI has recently launched a new feature that arrives just in time to solve this "want it all" problem: multi-LoRA serving (👏👏👏).
LoRA Background
LoRA, i.e. Low-Rank Adaptation, is a technique for efficiently fine-tuning large pre-trained models. The core idea is that instead of retraining the entire model, you only train a small set of parameters, called adapters, to adapt the pre-trained model to a specific task. These adapters typically add only about 1% of storage and memory overhead on top of the pre-trained LLM, while achieving results comparable to full-model fine-tuning.
The obvious benefit of LoRA is that it reduces the cost of fine-tuning by reducing memory requirements. It also mitigates catastrophic forgetting and works better on small datasets.
Figure 1: LoRA Details
During training, LoRA freezes the original model weights W and only trains the two small matrices A and B, which makes fine-tuning far more efficient. With this in mind, it is easier to understand how LoRA model inference in Figure 1 works: we take the output of the pre-trained model, Wx, and add to it the output of the low-rank adapter, BAx [6].
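To make the Wx + BAx arithmetic concrete, here is a toy PyTorch sketch of that forward pass; the dimensions and rank are made up for illustration, and the usual alpha/r scaling factor is omitted.

```python
import torch

d, r = 1024, 16                 # hidden size and LoRA rank, with r << d
W = torch.randn(d, d)           # frozen pre-trained weight
A = torch.randn(r, d) * 0.01    # trainable low-rank matrix (d -> r)
B = torch.zeros(d, r)           # trainable low-rank matrix (r -> d), zero-initialized
x = torch.randn(d)

h = W @ x + B @ (A @ x)         # output = Wx + BAx, as in Figure 1

# W has d*d parameters; the adapter only adds 2*d*r, a tiny fraction for small r.
print(2 * d * r / (d * d))      # ~0.03 for this toy layer
```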
Multi-LoRA Inference Serving
Having understood the basic idea of low-rank adaptation, we can delve into multi-LoRA serving. The concept is simple: given a base pre-trained model and a set of tasks for which you have fine-tuned specific LoRAs, multi-LoRA serving is a mechanism for dynamically selecting the required LoRA based on the incoming request.
Figure 2: Multi-LoRA Details
Figure 2 shows how this dynamic routing works. Each user request contains the input x together with the id of the corresponding LoRA (we call this a batch of heterogeneous user requests). The LoRA id lets TGI select the correct LoRA adapter for each request.
Multi-LoRA serving lets us deploy just one base model. And since the LoRA adapters are small, you can load multiple adapters without worrying about memory. Note that exactly how many adapters you can load depends on your available GPU resources and the model you deploy. The end result is effectively equivalent to supporting multiple fine-tuned models in a single deployment.
The size of the LoRA weights varies with the rank and the quantization method, but they are usually very small. As a quick intuition: predibase/magicoder is only 13.6 MB, a tiny fraction of the size of mistralai/Mistral-7B-v0.1. In relative terms, loading 30 adapters would increase VRAM usage by only about 3%, which is not a problem for most deployments. This is why we can deploy multiple models at once.
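As a rough sanity check of that figure, here is a back-of-the-envelope calculation; the ~14.5 GB base-model size is our assumption for the 16-bit Mistral-7B weights, while the 13.6 MB adapter size is the one quoted above.

```python
# Rough sanity check of the "~3% VRAM increase" claim for 30 adapters.
base_model_gb = 14.5            # assumed size of Mistral-7B weights in fp16/bf16
adapter_gb = 13.6 / 1024        # predibase/magicoder adapter, 13.6 MB
n_adapters = 30

overhead = n_adapters * adapter_gb / base_model_gb
print(f"{overhead:.1%}")        # -> about 2.7%, i.e. roughly 3%
```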
How to use
Gathering LoRA weights
First, you need to train your LoRA models and export the adapter weights. There are guides available on LoRA fine-tuning. Note that when you push a fine-tuned model to the Hub, only the adapter is pushed, not the full merged model. When loading a LoRA adapter from the Hub, the base model is inferred from the adapter's model card and loaded separately. For more in-depth support, try our Expert Support Program. The real value comes when you create your own LoRAs for your specific use cases.
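As a minimal sketch of what that training setup can look like with the PEFT library (the rank, alpha, and target modules below are illustrative choices, not values prescribed by this post):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                  # LoRA rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which projections receive adapters (illustrative)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only a small fraction of parameters is trainable

# ... run your usual training loop, then push ONLY the adapter weights:
# model.push_to_hub("your-username/your-lora-adapter")
```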
Low-code team
Training a LoRA for their own use cases can be difficult for some organizations, which may lack the necessary expertise or other resources. Even once the base model is selected and the data is ready, you still need to keep up with the latest techniques, explore the hyperparameter space, find the right hardware, write the code, and then evaluate the result. Even for experienced teams, this is no small task.
AutoTrain can significantly lower this barrier. It is a no-code solution that lets you train machine learning models in just a few clicks. There are several ways to use AutoTrain. In addition to a local installation, we also support:
AutoTrain Environment | Hardware configuration | Amount of code | Notes |
---|---|---|---|
Hugging Face Space | Multiple GPUs and other hardware | No code | Flexible and easy to use |
DGX Cloud | Up to 8x H100 GPUs | No code | Better suited for larger models |
Google Colab | Single T4 GPU | Low code | Suitable for small models and quantized models |
Deployment
As an example, this post uses the following two LoRA adapters from Predibase's LoRA Land:
- predibase/customer_support: fine-tuned on the Gridspace-Stanford Harper Valley speech dataset, it strengthens the ability to accurately understand and respond to interactive customer-support tickets, improves the model's performance on tasks such as speech recognition, emotion detection, and dialogue management, and helps deliver more efficient and empathetic customer support.
- predibase/magicoder: fine-tuned on ise-uiuc/Magicoder-OSS-Instruct-75K, a synthetic dataset of code instructions.
TGI
The TGI documentation contains a lot of useful information on how to deploy TGI. Here, we'll just highlight a few key points:
- Use TGI version v2.1.1 or newer
- Deploy the base model mistralai/Mistral-7B-v0.1
- During deployment, add the LORA_ADAPTERS environment variable
  - Example: LORA_ADAPTERS=predibase/customer_support,predibase/magicoder
model=mistralai/Mistral-7B-v0.1
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:2.1.1 \
--model-id $model \
--lora-adapters=predibase/customer_support,predibase/magicoder
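Once the container is running, you can send a quick test request from Python, for example with huggingface_hub's InferenceClient; this is just a sketch, with port 8080 taken from the -p 8080:80 mapping above.

```python
from huggingface_hub import InferenceClient

# Point the client at the local TGI container started above.
client = InferenceClient("http://127.0.0.1:8080")

# adapter_id selects which of the loaded LoRAs handles this request.
print(client.text_generation(
    prompt="Hello who are you?",
    max_new_tokens=20,
    adapter_id="predibase/customer_support",
))
```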
Inference Endpoints GUI
Inference Endpoints supports a wide range of GPUs and other AI accelerator cards, and lets you deploy across AWS, GCP, and Azure with just a few clicks! Deploying through the GUI is quite easy. Its backend uses TGI for text generation by default (you can also optionally use your own Docker image).
To use multi-LoRA serving on Inference Endpoints, simply go to the console and then:
- Select the base model: mistralai/Mistral-7B-v0.1
- Choose your Cloud | Region | Hardware
  - Example: AWS | us-east-1 | Nvidia L4
- Select Advanced Configuration
  - You should see Text Generation already selected as the task
  - You can configure the rest to suit your needs
  - Add the environment variable LORA_ADAPTERS=predibase/customer_support,predibase/magicoder
- Finally, click Create Endpoint!
Please note that the above is only the minimum configuration; you can adjust the other settings as needed.
Figure 3: Multi-LoRA Inference Endpoints
Figure 4: Multi-LoRA Inference Endpoints 2
Inference Endpoints Code
Some people might be a bit mouse-averse and prefer not to use the GUI at all; we won't judge 😂. In that case, it's also very easy to automate the steps above with code, using only the keyboard.
from huggingface_hub import create_inference_endpoint

# Custom Docker image details
custom_image = {
    "health_route": "/health",
    "url": "ghcr.io/huggingface/text-generation-inference:2.1.1",  # This is the min version
    "env": {
        "LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder",  # Add adapters here
        "MAX_BATCH_PREFILL_TOKENS": "2048",  # Set according to your needs
        "MAX_INPUT_LENGTH": "1024",  # Set according to your needs
        "MAX_TOTAL_TOKENS": "1512",  # Set according to your needs
        "MODEL_ID": "/repository"
    }
}

# Creating the inference endpoint
endpoint = create_inference_endpoint(
    name="mistral-7b-multi-lora",
    repository="mistralai/Mistral-7B-v0.1",
    framework="pytorch",
    accelerator="gpu",
    instance_size="x1",
    instance_type="nvidia-l4",
    region="us-east-1",
    vendor="aws",
    min_replica=1,
    max_replica=1,
    task="text-generation",
    custom_image=custom_image,
)
endpoint.wait()  # block until the endpoint is fully deployed
print("Your model is ready to use!")
It takes approximately 3 minutes and 40 seconds to deploy this configuration. Note that other models may take longer. If you run into load-time issues, please let us know in a GitHub issue!
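Since a running endpoint is billed by the hour, one option is to pause it between experiments. A small sketch using the endpoint object returned above (pause, resume, and wait are methods of huggingface_hub's InferenceEndpoint):

```python
# Pause the endpoint when idle to stop incurring charges, and resume it later.
endpoint.pause()

# ... later ...
endpoint.resume()
endpoint.wait()   # wait until it is running again before sending requests
```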
Usage
When consuming the endpoint, you need to specify the adapter_id. Here is a cURL example:
curl 127.0.0.1:3000/generate \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"inputs": "Hello who are you?",
"parameters": {
"max_new_tokens": 40,
"adapter_id": "predibase/customer_support"
}
}'
Here is another example using the InferenceClient from the Hugging Face Hub Python library. Make sure you have huggingface-hub>=0.24.0 installed and, if necessary, that you are logged in to the Hub.
from huggingface_hub import InferenceClient
tgi_deployment = "127.0.0.1:3000"
client = InferenceClient(tgi_deployment)
response = client.text_generation(
prompt="Hello who are you?",
max_new_tokens=40,
adapter_id='predibase/customer_support',
)
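Because the adapter is selected per request, pointing the same deployment at the other fine-tune is just a matter of changing adapter_id; for example:

```python
# Same endpoint, different LoRA: route a coding request to the magicoder adapter.
response = client.text_generation(
    prompt="Write a Python function that reverses a string.",
    max_new_tokens=100,
    adapter_id="predibase/magicoder",
)
print(response)
```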
Practical Considerations
Cost
As discussed below, we are not the first to take the plunge. Be sure to read the excellent blog post by Predibase, the team behind LoRAX, as this section is largely based on their work.
Figure 5: Multi-LoRA cost. We deployed TGI on an NVIDIA L4 with mistralai/Mistral-7B-v0.1 as the base model, which costs $0.80/hour on Inference Endpoints. We handled 75 requests per second, with an average of 450 input tokens and 234 output tokens per request, and compared the cost against GPT-3.5 Turbo with the corresponding configuration.
One of the major benefits of multi-LoRA serving is that it removes the need for multiple deployments for multiple models, and is therefore much cheaper. This matches intuition: multi-model deployments have to load all of the weights, not just tiny adapters. As shown in Figure 5, with TGI multi-LoRA the cost per token stays the same even as more models are added. Without multi-LoRA, however, the TGI cost grows linearly with each additional fine-tuned model deployed.
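A toy calculation makes that scaling argument explicit; it assumes one $0.80/hour L4 instance per deployment, as in Figure 5, and ignores every other cost:

```python
# Compare the hourly cost of serving N fine-tunes with and without multi-LoRA.
price_per_hour = 0.80  # one NVIDIA L4 on Inference Endpoints, as quoted above

for n_models in (1, 5, 10, 30):
    multi_lora_cost = price_per_hour             # all fine-tunes share one base-model deployment
    separate_cost = n_models * price_per_hour    # one dedicated deployment per fine-tuned model
    print(f"{n_models:>2} models: multi-LoRA ${multi_lora_cost:.2f}/h vs separate ${separate_cost:.2f}/h")
```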
Usage patterns
Figure 6: Multi-LoRA Serving Pattern
A real challenge when deploying multiple models is that each model's usage pattern varies greatly: some models may see low usage, some may have bursty usage, and some may be queried at high frequency. This makes scaling very difficult, especially if each model is deployed independently. There is a lot of "rounding" error whenever you have to add a GPU, and it adds up quickly, leading to significant waste. Ideally, you want to maximize the utilization of each GPU and avoid using any extra ones. You need to make sure you have enough GPUs, while knowing that some of them will inevitably sit idle!
Things are much smoother with the multi-LoRA approach. As Figure 6 shows, the aggregate multi-LoRA serving pattern is quite smooth even though some individual LoRAs have unstable usage patterns. By consolidating multiple LoRAs, the overall usage pattern becomes smoother and easier to scale. Note that this is only an example; it is up to you to analyze the usage patterns of your own workloads and how multi-LoRA can help. Our goal is to have to think about scaling only 1 model, not 30!
Changing the base model
How should we respond to the rapid pace of AI development in the real world? What if you want to pick a different or newer base model? Our examples use mistralai/Mistral-7B-v0.1 as the base model, but there are other options: Mistral v0.3, for example, supports function calling, not to mention other model families such as Llama 3. In general, we are happy to see new base models emerge that are more efficient and perform better.
But don't worry! As long as you have a good enough reason to swap the base model, retraining the LoRAs is relatively easy, and training itself is relatively cheap; in fact, Predibase found that training a LoRA costs only about $8.00 [5]. With modern frameworks and common engineering practices, very few code changes are needed. The basic approach is as follows:
- Keep the notebooks/code used to train your models
- Version-control your datasets
- Record every configuration used
- Update the service with the new models and configurations
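For that last step, re-deployment on Inference Endpoints can be as simple as re-running the earlier create_inference_endpoint call with the new base model and the re-trained adapters. In the sketch below, Llama 3 is just one possible choice of new base model, and the adapter names are placeholders for your own re-trained LoRAs:

```python
from huggingface_hub import create_inference_endpoint

new_image = {
    "health_route": "/health",
    "url": "ghcr.io/huggingface/text-generation-inference:2.1.1",
    "env": {
        # Placeholder adapter names -- point these at your re-trained LoRAs.
        "LORA_ADAPTERS": "your-org/customer_support-llama3,your-org/magicoder-llama3",
        "MODEL_ID": "/repository",
    },
}

endpoint = create_inference_endpoint(
    name="llama-3-multi-lora",
    repository="meta-llama/Meta-Llama-3-8B",  # the new base model
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-l4",
    custom_image=new_image,
)
endpoint.wait()
```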
Conclusion
Multi-LoRA serving is a revolutionary approach to AI model deployment, addressing the cost and complexity of managing multiple specialized model deployments. By leveraging a single base model and dynamically applying fine-tuned adapters, organizations can significantly reduce operational overhead while maintaining or even enhancing performance across tasks. We encourage AI leaders to boldly adopt this "base model + multi-LoRA" paradigm and embrace the simplicity and cost savings it brings. Make multi-LoRA the cornerstone of your AI strategy and ensure your organization stays ahead in the rapidly evolving technology landscape.
Acknowledgements
Implementing multi-LoRA serving can be tricky, but thanks to the optimized kernels and frameworks developed by the punica-ai and lorax teams, the process is already efficient. TGI leverages these optimizations to provide fast and efficient inference for multiple LoRA models.
Special thanks to the Punica, LoRAX, and S-LoRA teams for their excellent and open work on multi-LoRA serving.
References
- [1] : Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, John P. Cunningham, LoRA Learns Less and Forgets Less, 2024
- [2] : Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, 2021
- [3] : Sourab Mangrulkar, Sayak Paul, PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware, 2023
- [4] : Travis Addair, Geoffrey Angus, Magdy Saleh, Wael Abid, LoRAX: The Open Source Framework for Serving 100s of Fine-Tuned LLMs in Production, 2023
- [5] : Timothy Wang, Justin Zhao, Will Van Eaton, LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4, 2024
- [6] : Punica: Serving multiple LoRA finetuned LLM as one: https://github.com/punica-ai/punica
Original English article: https://huggingface.co/blog/multi-lora-serving
Original authors: Derek Thomas, Diego Maniloff, David Holtz
Translator: Matrix Yao (Yao Weifeng), Deep Learning Engineer at Intel, working on applying transformer-family models to data of various modalities, and on training and inference of large-scale models.