TL;DR: Many LLMs (e.g. gemma-2-9b, Mixtral-8x22B-Instruct-v0.1) cannot benefit from assisted generation because they lack a corresponding small model. In this post, we present Universal Assisted Generation, a technique developed jointly by Intel Labs and Hugging Face. With it, an LLM can be paired with any SLM to form an assisted-generation setup, so assisted generation can now accelerate any decoder or mixture-of-experts model and deliver speedups of 1.5x-2.0x, with almost zero overhead 🔥🔥🔥🔥🔥! Read on to learn more!
## Introduction
Today, the most popular open-weight LLMs have parameter counts ranging from several billion to hundreds of billions (say hello, Llama-3.1-405B 👋), which raises a number of engineering challenges when deploying these hungry beasts in production. One of them is that text generation with large models is slow. For this reason, the community has developed many different techniques to speed up the decoding process. Assisted generation, also known as speculative decoding, is one of the most common and practical approaches for accelerating LLM inference without loss of accuracy. In this post, we explain how assisted generation works and share our latest research, which makes it possible to accelerate any of the 140,000 language models on the Hugging Face Hub, 🚀!
## Assisted generation
At the heart of assisted generation is a pair of models: the target model and the assistant model, where the assistant model is a much smaller version of the target model. For example, Llama-3.2-1B can serve as the assistant model for the larger Llama-3.1-70B target model. Generation is an iterative process: in each round, the assistant model autoregressively generates several tokens, one at a time; then the target model verifies all of the assistant's tokens from that round in a single forward pass. The speedup comes from the fact that the target model can verify multiple tokens per forward pass, instead of generating only one token at a time as it would on its own. For a more detailed explanation, see the original blog post. Combined with the recently introduced dynamic speculation strategy, assisted generation speeds up text generation by 1.5x to 3x, depending on the task and the models used.
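As a concrete illustration of this loop, here is a minimal sketch of standard (shared-tokenizer) assisted generation with 🤗 Transformers, using the Llama pair mentioned above. The checkpoint choice, `max_new_tokens`, and `device_map="auto"` (which requires the accelerate library and, for a 70B target, several high-memory GPUs) are illustrative assumptions, not a prescription:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and assistant come from the same family, so they share a tokenizer.
target_checkpoint = "meta-llama/Llama-3.1-70B"
assistant_checkpoint = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(target_checkpoint)
model = AutoModelForCausalLM.from_pretrained(target_checkpoint, device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, device_map="auto")

inputs = tokenizer("Alice and Bob", return_tensors="pt").to(model.device)

# The assistant drafts candidate tokens; the target model verifies each round's
# draft in a single forward pass.
outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=32)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```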
One of the most obvious limitations of assisted generation, however, is that it requires the target and assistant models to share the same tokenizer, which means they must come from the same model family. Yet many widely used models lack a small, fast counterpart and therefore cannot enjoy this kind of latency reduction. In our experience, the assistant model generally needs to be at least 50-100 times smaller than the target model for the speedup to be meaningful. To give a few examples: CodeLlama-13b has no small sibling at all, and gemma-2-9b only has a 2b variant, which is neither small enough nor fast enough to deliver a noticeable speedup.
## Universal assisted generation
To address this pain point, Intel Labs, in collaboration with Hugging Face, has developed Universal Assisted Generation (UAG), which allows any target and assistant models to be paired, regardless of differences in their tokenizers. For example, gemma-2-9b can be used as the target model with the tiny vicuna-68m as the assistant model.
The main idea behind this technique is bidirectional tokenizer mapping: in each round, after the assistant model generates its tokens, the output token sequence is decoded into text, and the text is then encoded into a token sequence with the target model's tokenizer; similarly, after the target model's verification step, the target model's token sequence is converted back into the assistant model's token sequence in the same way and appended to the assistant model's context for the next round of iteration.
Since the assistant and target tokenizers have different vocabularies, the discrepancies this creates also need to be handled. To re-encode the assistant model's newly generated token sequence accurately, a few additional tokens of preceding text must be included. The entire sequence is then re-encoded into the target model's token format and aligned with the most recently generated target tokens to pinpoint exactly where the newly generated tokens begin. The video below illustrates this process.
Re-encoding tokens from the target model back to the assistant model follows a similar process to the one shown in the video above. In this case, any mismatched tokens are discarded from the assistant model's key-value (KV) cache to keep it consistent.
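To make the mapping concrete, here is a deliberately naive sketch of the assistant-to-target re-encoding step, assuming plain Python lists of token ids; it is not the actual 🤗 Transformers implementation, which handles alignment and edge cases more carefully:

```python
def reencode_draft_for_target(assistant_tokenizer, target_tokenizer,
                              assistant_ids, target_ids, n_new, context=5):
    """Illustrative only: re-encode the assistant's `n_new` freshly drafted
    tokens into the target model's vocabulary.

    A few preceding assistant tokens (`context`) are decoded together with the
    draft so the text around the boundary is tokenized consistently; the result
    is then aligned against the target model's most recent tokens to find where
    the newly drafted tokens start.
    """
    text = assistant_tokenizer.decode(
        assistant_ids[-(n_new + context):], skip_special_tokens=True
    )
    reencoded = target_tokenizer.encode(text, add_special_tokens=False)

    # Align: find the longest suffix of the target context that matches a
    # prefix of the re-encoded window; whatever follows it is the new draft.
    for overlap in range(min(len(target_ids), len(reencoded)), 0, -1):
        if list(target_ids[-overlap:]) == list(reencoded[:overlap]):
            return reencoded[overlap:]
    return reencoded  # no overlap found; treat the whole window as new
```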
## Benchmarks
The table below shows the latency improvements measured for different target models when they are paired with an assistant model that uses a different tokenizer to form an assisted decoding setup.
| Target model | Assistant model | Dataset | Task | Speedup |
|---|---|---|---|---|
| codellama/CodeLlama-13b-Instruct-hf | bigcode/tiny_starcoder_py | openai/humaneval | code generation | 1.90x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | double7/vicuna-68m | cnn_dailymail | summarization | 1.52x |
| google/gemma-2-9b | double7/vicuna-68m | cnn_dailymail | summarization | 1.76x |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| meta-llama/Llama-3.1-70B | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.78x |
| microsoft/Phi-3-medium-128k-instruct | Qwen/Qwen2-0.5B-Instruct | tau/scrolls | long-context summarization | 1.91x |
Note that none of the target models in the table above have a small enough sibling model (under 1 billion parameters) suitable for the standard assisted decoding scheme.
The experiments above were run on 100 random samples. The experiments with Llama and Mixtral target models used 2 and 4 A100 GPUs, respectively; all other experiments used a single A6000 GPU.
## Code
Universal assisted generation has been integrated into 🤗 Transformers release 4.46.0. To enable it, pass `tokenizer` and `assistant_tokenizer` to `generate()`. Sample code follows.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"

checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

# Each model keeps its own tokenizer; UAG maps tokens between the two vocabularies.
assistant_tokenizer = AutoTokenizer.from_pretrained(assistant_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)

# Passing both tokenizers alongside the assistant model enables universal assisted generation.
outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
The output looks like this:

```
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
```
## Next steps
With do_sample=True, the standard assisted generation scheme uses the speculative sampling algorithm (Algorithm 1 of the speculative decoding paper), but UAG currently only implements multinomial sampling. In multinomial sampling, if the target model does not sample the same token as the assistant, that token is automatically rejected, which differs from how speculative sampling handles this case. In practice, this means that with do_sample=True, UAG achieves lower throughput than the standard scheme with a shared tokenizer. In the future, we plan to add support for speculative sampling with UAG.
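For reference, sampling can still be combined with UAG today by passing the usual sampling flags to generate(); the snippet below reuses the model, tokenizers, and inputs from the code section above, and the temperature value is only illustrative:

```python
# With do_sample=True, UAG currently accepts a drafted token only if the target
# model samples exactly the same token, so acceptance rates (and throughput)
# are lower than with a shared-tokenizer assistant.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
    do_sample=True,
    temperature=0.7,  # illustrative sampling temperature
)
```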
In addition, we intend to integrate UAG into 🤗 Transformers pipelines so that users can leverage it more easily.
## References
- Fast Inference from Transformers via Speculative Decoding
- Assisted Generation: A New Direction for Low-Latency Text Generation
Original English post: /blog/universal_assisted_generation
Original authors: Daniel Korat, Oren Pereg, Moshe Berchansky, Jonathan Mamou, Joao Gante, Lewis Tunstall, Nadav Timor, Moshe Wasserblat
Translator: Matrix Yao (Yao Weifeng), Deep Learning Engineer at Intel, working on the application of transformer-family models to multi-modal data and on the training and inference of large-scale models.