Falcon Mamba is a new model developed by the Technology Innovation Institute (TII) in Abu Dhabi and released under the TII Falcon Mamba 7B License 1.0. The model is open access, so anyone can use it within the Hugging Face ecosystem for research or applications.
In this blog post, we dive into the model's design decisions, examine how competitive it is with other existing SoTA models, and show how it can be used within the Hugging Face ecosystem.
The first general-purpose large-scale pure Mamba model
Currently, all top large language models use the Transformer architecture, which is based on the attention mechanism. However, attention is fundamentally limited when processing long sequences, since its computational and memory costs grow with sequence length. Various alternative architectures, such as State Space Language Models (SSLMs), attempt to address this sequence-scaling limitation, but they have remained behind state-of-the-art Transformer models in performance.
With Falcon Mamba, we demonstrate that this sequence-scaling limitation can indeed be overcome without loss of performance. Falcon Mamba is based on the original Mamba architecture proposed in Mamba: Linear-Time Sequence Modeling with Selective State Spaces, with additional RMS normalization layers to ensure stable training at scale. This architectural choice ensures that Falcon Mamba:
- can process sequences of arbitrary length without any increase in memory storage, in particular fitting on a single A10 24GB GPU;
- takes a constant amount of time to generate a new token, regardless of the size of the context.
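For intuition, the sketch below shows what an RMS normalization layer of this kind looks like. Where exactly Falcon Mamba places the extra layers inside its Mamba blocks is a detail of the released implementation, so treat this as an illustrative stand-in rather than the actual model definition.

```python
# Illustrative sketch only: a standard RMSNorm layer of the kind Falcon Mamba adds
# for training stability. The placement inside the Mamba block follows the released
# implementation and is not reproduced here.
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(dim))  # learned scale
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features (no mean-centering,
        # unlike LayerNorm), then apply the learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```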
Model training
Falcon Mamba was trained on approximately 5500GT of data, consisting mainly of curated web data supplemented with high-quality technical and code data from public sources. We used a constant learning rate for most of the training, followed by a relatively short learning-rate decay stage. In this last stage, we also added a small portion of high-quality curated data to further improve model performance.
Model evaluation
We use the lm-evaluation-harness package to evaluate our model on all the benchmarks of the new leaderboard version, and then normalize the evaluation results with Hugging Face's score normalization.
| Model name | IFEval | BBH | MATH Lvl5 | GPQA | MUSR | MMLU-PRO | Average |
|---|---|---|---|---|---|---|---|
| **Pure SSM models** | | | | | | | |
| Falcon Mamba-7B | 33.36 | 19.88 | 3.63 | 8.05 | 10.86 | 14.47 | 15.04 |
| TRI-ML/mamba-7b-rw* | 22.46 | 6.71 | 0.45 | 1.12 | 5.51 | 1.69 | 6.25 |
| **Hybrid SSM-attention models** | | | | | | | |
| recurrentgemma-9b | 30.76 | 14.80 | 4.83 | 4.70 | 6.60 | 17.88 | 13.20 |
| Zyphra/Zamba-7B-v1* | 24.06 | 21.12 | 3.32 | 3.03 | 7.74 | 16.02 | 12.55 |
| **Transformer models** | | | | | | | |
| Falcon2-11B | 32.61 | 21.94 | 2.34 | 2.80 | 7.53 | 15.44 | 13.78 |
| Meta-Llama-3-8B | 14.55 | 24.50 | 3.25 | 7.38 | 6.24 | 24.55 | 13.41 |
| Meta-Llama-3.1-8B | 12.70 | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 13.78 |
| Mistral-7B-v0.1 | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50 |
| Mistral-Nemo-Base-2407 (12B) | 16.83 | 29.37 | 4.98 | 5.82 | 6.52 | 27.46 | 15.08 |
| gemma-7B | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 | 15.28 |
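An evaluation of this kind can be reproduced with the lm-evaluation-harness Python API roughly as sketched below. The leaderboard_* task names are an assumption about the installed harness version, so check the task list of your installation before running; this is a sketch, not the exact setup we used.

```python
# Hedged sketch: evaluate Falcon Mamba on leaderboard-v2-style tasks with
# lm-evaluation-harness. Task names are assumed to follow the harness's
# "leaderboard_*" naming and may differ by version.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=tiiuae/falcon-mamba-7b,dtype=bfloat16",
    tasks=[
        "leaderboard_ifeval",
        "leaderboard_bbh",
        "leaderboard_math_hard",
        "leaderboard_gpqa",
        "leaderboard_musr",
        "leaderboard_mmlu_pro",
    ],
    batch_size=8,
)
print(results["results"])
```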
In addition, we use the lighteval tool to evaluate the model on the benchmarks of the first version of the LLM Leaderboard.
| Model name | ARC | HellaSwag | MMLU | Winogrande | TruthfulQA | GSM8K | Average |
|---|---|---|---|---|---|---|---|
| **Pure SSM models** | | | | | | | |
| Falcon Mamba-7B* | 62.03 | 80.82 | 62.11 | 73.64 | 53.42 | 52.54 | 64.09 |
| TRI-ML/mamba-7b-rw* | 51.25 | 80.85 | 33.41 | 71.11 | 32.08 | 4.70 | 45.52 |
| **Hybrid SSM-attention models** | | | | | | | |
| recurrentgemma-9b** | 52.00 | 80.40 | 60.50 | 73.60 | 38.60 | 42.60 | 57.95 |
| Zyphra/Zamba-7B-v1* | 56.14 | 82.23 | 58.11 | 79.87 | 52.88 | 30.78 | 60.00 |
| **Transformer models** | | | | | | | |
| Falcon2-11B | 59.73 | 82.91 | 58.37 | 78.30 | 52.56 | 53.83 | 64.28 |
| Meta-Llama-3-8B | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62 |
| Meta-Llama-3.1-8B | 58.53 | 82.13 | 66.43 | 74.35 | 44.29 | 47.92 | 62.28 |
| Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97 |
| gemma-7B | 61.09 | 82.20 | 64.56 | 79.01 | 44.79 | 50.87 | 63.75 |
For models marked with one asterisk (*), we evaluated the tasks internally; for the model marked with two asterisks (**), the results were taken from its paper or model card.
Handling large-scale sequences
Following the theoretical efficiency of SSMs (state space models) in processing long sequences, we use the optimum-benchmark library to compare Falcon Mamba with popular Transformer models in terms of memory usage and generation throughput. For a fair comparison, we rescaled the vocabulary size of all Transformer models to match Falcon Mamba, since it has a significant impact on a model's memory requirements.
Before presenting the results, let us first discuss the difference between the prefill and decode parts of a sequence. As we will see, the details of prefill matter more for state space models than for Transformer models. When a Transformer generates the next token, it needs to attend to the keys and values of all previous tokens in the context, which means both memory requirements and generation time grow linearly with context length. A state space model only attends to and stores its recurrent state, so it needs no extra memory or time to generate long sequences. While this explains the claimed advantage of SSMs over Transformers in the decode phase, the prefill phase requires extra effort to fully exploit the SSM architecture.
The standard approach to prefill is to process the whole prompt in parallel to fully utilize the GPU. This approach is used in the optimum-benchmark library, and we will call it parallel prefill. Parallel prefill requires storing in memory the hidden states of every token of the prompt. For Transformers, this extra memory is mostly taken up by the stored KV cache. For an SSM model, no cache is required, and the memory for storing hidden states becomes the only component that scales with prompt length. As a result, memory requirements grow with prompt length, and the SSM model loses its ability to process arbitrarily long sequences, just like a Transformer.
An alternative to parallel prefill is to process the prompt token by token, which we will call sequential prefill. Similar to sequence parallelism, it can also be performed on larger chunks of the prompt rather than on individual tokens to make better use of the GPU. While sequential prefill makes little sense for a Transformer, it restores the SSM model's ability to process arbitrarily long prompts.
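To make the distinction concrete, here is a small self-contained toy example (not Falcon Mamba's actual implementation) of a recurrent layer where sequential prefill only ever keeps a fixed-size state in memory, however long the prompt is:

```python
# Toy illustration of sequential prefill for a recurrent (SSM-like) layer.
# This is NOT Falcon Mamba's kernel; it only shows why memory stays constant
# with prompt length when the prompt is consumed token by token.
import torch

class ToyRecurrentLayer(torch.nn.Module):
    def __init__(self, d_model: int = 64, d_state: int = 16):
        super().__init__()
        self.in_proj = torch.nn.Linear(d_model, d_state)
        self.out_proj = torch.nn.Linear(d_state, d_model)
        self.decay = torch.nn.Parameter(torch.rand(d_state))

    def step(self, x_t: torch.Tensor, state: torch.Tensor):
        # The state is a fixed-size (batch, d_state) tensor: it does not grow
        # with the number of tokens already processed.
        state = self.decay * state + self.in_proj(x_t)
        return self.out_proj(state), state

def sequential_prefill(layer: ToyRecurrentLayer, prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Consume the prompt one token at a time, keeping only the recurrent state."""
    batch, seq_len, _ = prompt_embeds.shape
    state = prompt_embeds.new_zeros(batch, layer.decay.numel())
    for t in range(seq_len):
        _, state = layer.step(prompt_embeds[:, t], state)
    return state  # everything decoding needs, however long the prompt was

layer = ToyRecurrentLayer()
state = sequential_prefill(layer, torch.randn(1, 4096, 64))
print(state.shape)  # torch.Size([1, 16])
```

A parallel prefill of the same toy layer would instead materialize a hidden state for every prompt position, which is exactly the component that grows with prompt length.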
With these considerations in mind, we first tested the maximum sequence length that fits on a single 24GB A10 GPU, as shown in the figures below. The batch size was fixed to 1 and we used float32 precision. Even with parallel prefill, Falcon Mamba can fit larger sequences than a Transformer, while with sequential prefill it unlocks its full potential and can process arbitrarily long prompts.
Next, we measured the generation throughput for a prompt of length 1, generating up to 130,000 tokens, using a single H100 GPU and a batch size of 1. The results are shown in the figures below. We observe that Falcon Mamba generates all tokens at a constant throughput and without any increase in CUDA peak memory. For the Transformer model, peak memory grows and generation slows down as the number of generated tokens increases.
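The published numbers were obtained with optimum-benchmark; the snippet below is only a simplified stand-in showing how one could measure throughput and CUDA peak memory during generation with plain transformers.

```python
# Simplified measurement sketch (not the optimum-benchmark setup used for the
# reported figures): time generation from a length-1 prompt and record the
# CUDA peak memory for increasing numbers of generated tokens.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda")

inputs = tokenizer("0", return_tensors="pt").to("cuda")  # prompt of (roughly) length 1
for n_tokens in (1_000, 10_000, 100_000):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    output = model.generate(**inputs, min_new_tokens=n_tokens, max_new_tokens=n_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    print(f"{n_tokens} tokens: {output.shape[1] / elapsed:.1f} tok/s, "
          f"peak memory {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```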
How to use Falcon Mamba in Hugging Face transformers?
The Falcon Mamba architecture will be available in the next version of the Hugging Face transformers library (>4.45.0). To use the model, make sure you have the latest version of Hugging Face transformers installed or install the library from source.
Falcon Mamba is compatible with most of the Hugging Face APIs you may already be familiar with, such as AutoModelForCausalLM or pipeline:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
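For the pipeline API mentioned above, a minimal equivalent would look roughly like this (the generation arguments are illustrative):

```python
from transformers import pipeline

# Minimal pipeline-based equivalent of the snippet above.
pipe = pipeline(
    "text-generation",
    model="tiiuae/falcon-mamba-7b",
    torch_dtype="auto",
    device_map="auto",
)
print(pipe("Hello world, today", max_new_tokens=100, do_sample=True)[0]["generated_text"])
```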
Since the model is large, it also supports features such as bitsandbytes quantization, which lets you run the model under smaller GPU memory constraints, for example:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
We are also pleased to introduce an instruction-tuned version of Falcon Mamba, fine-tuned with an additional 5 billion tokens of supervised fine-tuning (SFT) data. This extended training improves the model's accuracy and effectiveness on instruction-following tasks. You can try out the instruct model's capabilities through our demo, available here. For the chat template, we use the following format:
<|im_start|>user
prompt<|im_end|>
<|im_start|>assistant
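In practice, this template can be applied through the standard apply_chat_template API, as sketched below. The instruct model id used here is our assumption of the checkpoint name, so double-check it on the Hub.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The instruct checkpoint name below is an assumption; verify the exact id on the Hub.
model_id = "tiiuae/falcon-mamba-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain state space models in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```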
You can also use 4-bit converted versions of both the base model and the instruct model. Make sure you have access to a GPU that is compatible with the bitsandbytes library in order to run the quantized models.
You can also benefit from faster inference using torch.compile; simply call model = torch.compile(model) after loading the model.
Acknowledgments
We would like to thank the Hugging Face team for their seamless support throughout the integration, with special thanks to:
- Alina Lozovskaya and Clementine Fourrier for helping us evaluate the model on the leaderboard
- Arthur Zucker for the transformers integration
- Vaibhav Srivastav, hysts and Omar Sanseviero for their support with Hub-related questions
The authors would also like to thank Tri Dao and Albert Gu for implementing and open-sourcing the Mamba architecture to the community.
Original article (in English): /blog/falconmamba
Original authors: Jingwei Zuo, Maksim Velikanov, Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch
Translator: Evinci