Faster assisted generation: dynamic speculation
⭐ In this blog post, we take a look at dynamic speculative decoding -- a novel method developed by Intel Labs and Hugging Face that accelerates text generation by up to 2.7x, depending on the task. Starting from Transformers 🤗 release 4.45.0, this method is the default mode of operation for assisted generation ⭐
Speculative decoding
Speculative decoding is a widely used technique for accelerating inference of large language models while preserving their accuracy. As shown in the figure below, speculative decoding splits the generation process into two stages. In the first stage, a fast but less accurate draft model (also known as the assistant) autoregressively generates a sequence of tokens. In the second stage, a large but more accurate target model validates the generated draft tokens in a single parallel forward pass. This process speeds up autoregressive decoding because the target model can yield multiple tokens per forward pass. The success of speculative decoding depends heavily on the speculative lookahead (SL), i.e., the number of tokens produced by the draft model in each iteration. In practice, the SL is either a static value or based on heuristics; neither is optimal for maximizing performance during inference.
A single iteration of speculative decoding.
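To make the two stages concrete, here is a minimal, self-contained sketch of a single speculative iteration with greedy verification, using the same Pythia checkpoints as the code section further below. It illustrates the idea only and is not the implementation inside Transformers; the fixed lookahead of 5 and the batch size of 1 are simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "EleutherAI/pythia-1.4b-deduped"   # target model (large, accurate)
draft_name = "EleutherAI/pythia-160m-deduped"    # draft / assistant model (small, fast)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name).to(device).eval()
draft = AutoModelForCausalLM.from_pretrained(draft_name).to(device).eval()

@torch.no_grad()
def speculative_step(input_ids, lookahead=5):
    # Stage 1: the draft model proposes `lookahead` tokens autoregressively (greedy).
    draft_ids = input_ids
    for _ in range(lookahead):
        next_token = draft(draft_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)
    proposed = draft_ids[:, input_ids.shape[1]:]

    # Stage 2: the target model scores all proposed tokens in a single forward pass.
    logits = target(draft_ids).logits
    target_pred = logits[:, input_ids.shape[1] - 1:-1, :].argmax(dim=-1)

    # With greedy decoding, verification reduces to accepting the longest prefix on
    # which draft and target agree, plus one "free" token from the target model.
    n_accepted = int((proposed == target_pred).int().cumprod(dim=-1).sum())
    bonus = logits[:, input_ids.shape[1] - 1 + n_accepted, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, proposed[:, :n_accepted], bonus], dim=-1)

inputs = tokenizer("Alice and Bob", return_tensors="pt").to(device)
output_ids = speculative_step(inputs.input_ids)
print(tokenizer.decode(output_ids[0]))
```

Each call to `speculative_step` costs `lookahead` draft forward passes plus one target forward pass, but can emit up to `lookahead + 1` tokens.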
Dynamic speculative decoding
The Transformers 🤗 library offers two distinct methods for determining the schedule that adjusts the number of draft (assistant) tokens during inference. The straightforward method, based on Leviathan et al., uses a static value for the speculative lookahead and generates a constant number of candidate tokens at each speculative iteration. The other is a heuristics-based approach that adjusts the number of candidate tokens for the next iteration based on the acceptance rate of the current iteration: if all speculative tokens are correct, the number of candidate tokens increases; otherwise, it decreases.
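As a rough sketch, the heuristic schedule boils down to a simple update rule applied after each iteration; the exact constants used inside Transformers may differ, so the +2 / -1 steps below should be read as assumptions.

```python
def heuristic_update(num_assistant_tokens: int, num_accepted: int) -> int:
    """Adjust the number of candidate tokens based on the previous iteration's outcome."""
    if num_accepted == num_assistant_tokens:
        # Every draft token was accepted: be more aggressive next iteration.
        return num_assistant_tokens + 2
    # At least one token was rejected: draft fewer tokens next time (never below 1).
    return max(1, num_assistant_tokens - 1)
```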
We anticipated that an enhanced optimization strategy for managing the number of generated draft tokens could further reduce latency. To test this thesis, we used an oracle that determines the optimal speculative lookahead (SL) for each speculative iteration. The oracle lets the draft model generate tokens autoregressively until a mismatch arises between the predicted tokens of the draft and target models. This process is repeated at each speculative iteration and ultimately determines the optimal (maximum) number of draft tokens accepted per iteration. Draft/target token mismatches are identified using the rejection sampling algorithm proposed by Leviathan et al., with zero temperature. This oracle realizes the full potential of speculative decoding by generating the maximum number of valid draft tokens at each step and minimizing the number of calls to both the draft and target models. We call speculative decoding that uses the SL values obtained from this oracle "oracle speculative decoding".
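In the zero-temperature (greedy) setting, the oracle SL of an iteration is simply the length of the longest prefix on which the draft's proposed tokens match the tokens the target model itself would have produced. A toy helper, with made-up token ids, makes this concrete:

```python
import torch

def oracle_sl(proposed: torch.Tensor, target_greedy: torch.Tensor) -> int:
    """Length of the longest prefix on which the draft and target (greedy) agree."""
    return int((proposed == target_greedy).int().cumprod(dim=0).sum())

# e.g. the draft proposed [5, 9, 2, 7] but the target would have produced [5, 9, 4, 7]
print(oracle_sl(torch.tensor([5, 9, 2, 7]), torch.tensor([5, 9, 4, 7])))  # -> 2
```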
The left figure below illustrates the oracle and static speculative lookahead values across speculative iterations for a code generation example from the MBPP dataset. High variability in the oracle SL values (orange bars) can be observed. The static SL value (blue bars), where the number of generated draft tokens is fixed at 5, requires 38 target forward passes and 192 draft forward passes, while the oracle SL values require only 27 target forward passes and 129 draft forward passes -- a significant reduction. The right figure shows the oracle and static speculative lookahead values over the entire Alpaca dataset.
Oracle and static speculative lookahead (SL) values on one MBPP example.
Oracle SL values averaged over the entire Alpaca dataset.
Both figures above demonstrate significant variability in the oracle speculative lookahead values, which suggests that static speculative decoding may be suboptimal.
In order to get closer to oracle speculative decoding and gain additional speedup, we developed a straightforward method to dynamically adjust the speculative lookahead value at each iteration. After generating each draft token, we determine whether the draft model should continue generating the next token or switch to the target model for verification. This decision is based on the draft model's confidence in its prediction, estimated by the softmax of the logits. If the draft model's confidence in the current token prediction falls below a predefined threshold, `assistant_confidence_threshold`, it halts the token generation process for that iteration, even if the maximum number of speculative tokens, `num_assistant_tokens`, has not been reached. Once halted, the draft tokens generated in the current iteration are sent to the target model for verification.
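The sketch below shows this stopping rule on top of a greedy draft loop. It is a simplification of what Transformers does internally; the threshold of 0.4, the cap of 20 tokens, and keeping the final low-confidence token are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def draft_with_dynamic_lookahead(draft_model, input_ids, max_draft=20, threshold=0.4):
    """Greedy drafting that stops early when the draft model loses confidence (batch size 1)."""
    draft_ids = input_ids
    for _ in range(max_draft):  # hard cap, akin to num_assistant_tokens
        probs = F.softmax(draft_model(draft_ids).logits[:, -1, :], dim=-1)
        confidence, next_token = probs.max(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)
        # Stop drafting once the softmax probability of the chosen token falls below
        # the threshold (akin to assistant_confidence_threshold) and hand the tokens
        # produced so far to the target model for verification.
        if confidence.item() < threshold:
            break
    return draft_ids[:, input_ids.shape[1]:]  # the proposed draft tokens
```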
Benchmarking
We benchmarked the dynamic approach against the heuristic approach across a range of tasks and model pairings. The dynamic approach showed better performance in all tests.
Notably, with the dynamic approach and `Llama3.2-1B` acting as the assistant for `Llama3.1-8B`, we observe speedups of up to 1.52x, while the heuristic approach shows no significant speedup with the same setup. Another observation is that `codegen-6B-mono` exhibits a slowdown when using the heuristic approach, but a speedup when using the dynamic approach.
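As a rough illustration of such a comparison (this is not the benchmark script linked below), one can time `generate` with the assistant model's `num_assistant_tokens_schedule` set to `'heuristic'` and then to `'dynamic'`. The OPT model pair from the table, the prompt, and the single un-warmed run are purely illustrative.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", torch_dtype=torch.float16).to(device)
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", torch_dtype=torch.float16).to(device)
inputs = tok("Summarize the history of speculative decoding:", return_tensors="pt").to(device)

def tokens_per_second(schedule: str) -> float:
    # The schedule is read from the assistant model's generation config.
    draft.generation_config.num_assistant_tokens_schedule = schedule
    start = time.perf_counter()
    out = target.generate(**inputs, assistant_model=draft, max_new_tokens=128, do_sample=False)
    return (out.shape[1] - inputs.input_ids.shape[1]) / (time.perf_counter() - start)

for schedule in ("heuristic", "dynamic"):  # 'dynamic' is the default since v4.45.0
    print(f"{schedule}: {tokens_per_second(schedule):.1f} tokens/s")
```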
| Target model | Draft (assistant) model | Task | Speedup - heuristic | Speedup - dynamic |
|---|---|---|---|---|
| facebook/opt-6.7b | facebook/opt-125m | summarization | 1.82x | 2.71x |
| facebook/opt-6.7b | facebook/opt-125m | open-ended generation | 1.23x | 1.59x |
| Salesforce/codegen-6B-mono | Salesforce/codegen-350M-mono | code generation (python) | 0.89x | 1.09x |
| google/flan-t5-xl | google/flan-t5-small | summarization | 1.18x | 1.31x |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | summarization | 1.00x | 1.52x |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | open-ended generation | 1.00x | 1.18x |
| meta-llama/Llama-3.1-8B | meta-llama/Llama-3.2-1B | code generation (python) | 1.09x | 1.15x |
- The results in the table reflect greedy decoding (temperature = 0). Similar trends were observed when using sampling (temperature > 0).
- All tests were run on an RTX 4090 GPU.
- Our benchmark is publicly available, allowing anyone to assess further improvements: /gante/huggingface-demos/tree/main/experiments/faster_generation
Code
Dynamic speculation has been integrated into release 4.45.0 of the Hugging Face Transformers library and now serves as the default operation mode for assisted decoding. To use assisted generation with dynamic speculation, no code changes are required -- just run the code as usual:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
```
The default dynamic speculation lookahead parameters reflect the optimal values, but they can be tuned for better performance on specific models and datasets using the code below:
```python
# confidence threshold
assistant_model.generation_config.assistant_confidence_threshold = 0.4

# 'constant' means that num_assistant_tokens stays unchanged during generation
assistant_model.generation_config.num_assistant_tokens_schedule = 'constant'

# the maximum number of tokens generated by the assistant model:
# after 20 tokens the draft halts even if the confidence is above the threshold
assistant_model.generation_config.num_assistant_tokens = 20
```
To revert to the heuristic or constant (as described in Leviathan et al.) approach, simply set `num_assistant_tokens_schedule` to `'heuristic'` or `'constant'` respectively, and set `assistant_confidence_threshold=0` and `num_assistant_tokens=5` as follows:
```python
# Use 'heuristic' or 'constant' or 'dynamic'
assistant_model.generation_config.num_assistant_tokens_schedule = 'heuristic'
assistant_model.generation_config.assistant_confidence_threshold = 0
assistant_model.generation_config.num_assistant_tokens = 5
```
What's next?
We introduced a faster assisted generation strategy named dynamic speculative decoding, which outperforms the heuristic approach as well as approaches that draft a fixed number of candidate tokens.
In an upcoming blog post, we will show a new method for assisted generation: combining any target model with any assistant model! This will open the door to accelerating countless models on the Hugging Face Hub that do not have small enough assistant variants. For example, Phi 3, Gemma 2, CodeLlama, and more will become eligible for speculative decoding. Stay tuned!
References
- Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models.
In this paper, we introduce DISCO, a dynamic speculation lookahead optimization method that uses a classifier to decide whether the draft model should proceed with generating the next token or pause and switch to the target model for verification, instead of relying on a simple threshold of the prediction probability.
- Assisted Generation: a new direction toward low-latency text generation
- Fast Inference from Transformers via Speculative Decoding
Link to the original article: /blog/dynamic_speculation_lookahead
Original authors: Jonathan Mamou, Oren Pereg, Joao Gante, Lewis Tunstall, Daniel Korat, Nadav Timor, Moshe Wasserblat
Translator: Zipxuan