Llama 3.2 is here! Today, we welcome the next release in the Llama series to Hugging Face, and this time we're happy to partner with Meta on the release of multimodal and small models. Ten open models are available on the Hub (5 multimodal and 5 text-only).
Llama 3.2 Vision is available in two sizes: 11B for efficient deployment and development on consumer GPUs, and 90B for large-scale applications. Both sizes come in base and instruction-tuned variants. In addition to these four multimodal models, Meta has also released a new version of Llama Guard with vision support. Llama Guard 3 is a safeguard model that classifies model inputs and generations, including detecting harmful multimodal prompts or assistant responses.
Llama 3.2 also includes small text-only language models that can run on-device. They come in two new sizes (1B and 3B), in base and instruct variants, with strong capabilities for their size. There is also a small 1B version of Llama Guard that can be deployed alongside these or larger text models in production use cases.
Among the features and integrations being released, we have:
- Model checkpoints on the Hub
- Hugging Face Transformers and TGI integration for the Vision models
- Inference and deployment integration with Google Cloud, Amazon SageMaker, and DELL Enterprise Hub
- Fine-tuning Llama 3.2 11B Vision on a single GPU with 🤗 transformers and TRL
What is the Llama 3.2 Vision model?
Llama 3.2 Vision is the most powerful open multimodal model released by Meta. It has strong visual understanding and reasoning capabilities and can be used for a variety of tasks, including visual reasoning and grounding, document question answering, and image-text retrieval. Chain of Thought (CoT) answers are often very good, which makes visual reasoning particularly powerful.
Llama 3.2 Vision can process text and images, or text only. For image-text prompts, the model accepts English input, while for text-only prompts it can handle multiple languages. In text-only mode, the full list of supported languages is English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
The architecture of these models is based on a combination of a Llama 3.1 LLM with a vision tower and an image adapter. The text model used for Llama 3.2 11B Vision is Llama 3.1 8B, while Llama 3.2 90B Vision uses Llama 3.1 70B. Based on our understanding, the text models were frozen during the training of the vision models in order to preserve text-only performance.
Below are some inference examples from the 11B instruction-tuned model demonstrating real-world knowledge, document reasoning, and infographic comprehension capabilities.
Prompt: Where is this picture located? What other attractions are nearby?
Response: The image depicts the Grand Palace in Bangkok, Thailand. The Grand Palace is a complex of buildings that served as the official residence of the King of Siam and as the administrative center of the government for 150 years. It is one of the most popular tourist attractions in Bangkok and a must-see for anyone visiting the city. Some of the other attractions near the Grand Palace include:
* Wat Phra Kaew: This temple is located within the Grand Palace complex and features the famous Emerald Buddha statue.
* Wat Arun: Located on the west bank of the Chao Phraya River, this temple is known for its stunning architecture and beautiful views of the city.
* Wat Pho: This temple is near the Grand Palace and has a huge reclining Buddha statue.
* Lumpini Park: Located in the center of Bangkok, this park offers a peaceful escape from the hustle and bustle of the city.
* Jim Thompson House: This museum showcases the life and work of American businessman Jim Thompson, who helped revitalize Thailand's silk industry in the 1950s and 1960s.

Prompt: According to this chart, where did the droughts occur?
Response: 2016 Drought Regions. According to the chart, two regions experienced drought in 2016:
* East and Southern Africa: the region experienced a severe drought.
* Chile: Chile experienced its driest weather in more than 50 years in January. These regions faced severe water shortages during that period.

Prompt: How long does it take from the invoice date to the due date? Please be concise and clear.
Response: Invoice Date: February 11, 2019. Due Date: February 26, 2019. Time difference calculation: February 26, 2019 - February 11, 2019 = 15 days. The time difference between the invoice date and the due date is **15 days**.
The Vision models have a context length of 128k tokens, which allows for multi-turn conversations that contain images. However, the model works best when attending to a single image, so the transformers implementation only attends to the last image provided in the input. This preserves quality and saves memory.
The 11B base model supports tile sizes of 448, while the instruct version and the 90B models use tile sizes of 560. The models were trained on a massive dataset of 6 billion image-text pairs drawn from a highly diverse mix of data sources. This makes them excellent candidates for fine-tuning on downstream tasks. The table below shows the performance of the 11B and 90B models and their instruction-tuned versions on a number of benchmarks, with data from Meta. See the model cards for additional benchmarks and details.
Model | 11B | 11B (instruction-tuned) | 90B | 90B (instruction-tuned) | Metric |
---|---|---|---|---|---|
MMMU (val) | 41.7 | 50.7 (CoT) | 49.3 (zero-shot) | 60.3 (CoT) | Micro Average Accuracy |
VQAv2 | 66.8 (val) | 75.2 (test) | 73.6 (val) | 78.1 (test) | Accuracy |
DocVQA | 62.3 (val) | 88.4 (test) | 70.7 (val) | 90.1 (test) | ANLS |
AI2D | 62.4 | 91.1 | 75.3 | 92.3 | Accuracy |
We expect the text-only capabilities of these models to be on par with the 8B and 70B Llama 3.1 models, since, to our understanding, the text models were frozen during Vision model training. Hence, text benchmark results should be consistent with those of 8B and 70B.
Llama 3.2 license change. Sorry, European Union 😦
Regarding the licensing terms, Llama 3.2 comes with a license very similar to Llama 3.1, with one key difference in the Acceptable Use Policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not granted the license rights to use the multimodal models included in Llama 3.2. This restriction does not apply to end users of products or services that incorporate any of these multimodal models, so people can still build global products with the vision variants.
For full details, please make sure to read the official license and the Acceptable Use Policy.
What's so special about the Llama 3.2 1B and 3B?
The Llama 3.2 series includes 1B and 3B text models. These models are intended for on-device use cases such as prompt rewriting, multilingual knowledge retrieval, summarization tasks, tool usage, and locally run assistants. They outperform many of the available open access models at these scales and compete with much larger models. In later sections, we show how to run these models offline.
The models follow the same architecture as Llama 3.1. They were trained on up to 9 trillion tokens and still support the long context length of 128k tokens. The models are multilingual, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
There is also a new small version of Llama Guard, Llama Guard 3 1B, that can be deployed alongside these models to evaluate the last user or assistant response in a multi-turn conversation. It uses a set of pre-defined categories (new to this release) that can be customized or excluded based on the developer's use case. For more details on using Llama Guard, please refer to the model card; a minimal usage sketch follows below.
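As an illustration, here is a minimal sketch of screening a conversation with the 1B safeguard model through transformers. It assumes the checkpoint is published as meta-llama/Llama-Guard-3-1B and that its chat template produces the moderation prompt; please check the model card for the exact message format and category list.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Llama-Guard-3-1B"  # assumed repository name, see the model card
tokenizer = AutoTokenizer.from_pretrained(guard_id)
model = AutoModelForCausalLM.from_pretrained(guard_id, torch_dtype=torch.bfloat16, device_map="auto")

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "How can I adopt my own llama?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Adopting a llama is a big commitment..."}]},
]
# The chat template turns the conversation into Llama Guard's classification prompt.
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=20, do_sample=False)
# The model answers with "safe", or "unsafe" followed by the violated category codes.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))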
Bonus tip: Llama 3.2 has been trained on a broader collection of languages than the 8 listed above. Developers are encouraged to fine-tune Llama 3.2 models for their specific language use cases.
We evaluated the base models with the Open LLM Leaderboard evaluation suite, while the instruct models were evaluated on three popular benchmarks that measure instruction-following and correlate well with the LMSYS Chatbot Arena: IFEval, AlpacaEval, and MixEval-Hard. These are the results for the base models, with Llama-3.1-8B included as a reference:
Model | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO | Average |
---|---|---|---|---|---|---|
Meta-Llama-3.2-1B | 4.37 | 0.23 | 0.00 | 2.56 | 2.26 | 1.88 |
Meta-Llama-3.2-3B | 14.73 | 1.28 | 4.03 | 3.39 | 16.57 | 8.00 |
Meta-Llama-3.1-8B | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 14.00 |
And here are the results for the instruct models, with Llama-3.1-8B-Instruct included as a reference:
Model | AlpacaEval (LC) | IFEval | MixEval-Hard | Average |
---|---|---|---|---|
Meta-Llama-3.2-1B-Instruct | 7.17 | 58.92 | 26.10 | 30.73 |
Meta-Llama-3.2-3B-Instruct | 20.88 | 77.01 | 31.80 | 43.23 |
Meta-Llama-3.1-8B-Instruct | 25.74 | 76.49 | 44.10 | 48.78 |
Remarkably, the 3B model's performance on IFEval is comparable to that of the 8B model! This makes it well suited for agentic applications, where following instructions is crucial for reliability. Such a high IFEval score is very impressive for a model of this size.
Both the 1B and 3B instruction-tuned models support tool use. The user specifies the tools in a zero-shot setting (the model has no prior information about the tools the developer will use). As a result, the built-in tools from the Llama 3.1 models (brave_search and wolfram_alpha) are no longer available. A minimal sketch of zero-shot tool use is shown below.
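For illustration, here is a hedged sketch of passing a tool definition through the chat template with transformers. The get_current_temperature helper is hypothetical, and the snippet assumes the instruct chat template accepts the tools argument (available in recent transformers releases); adapt it to your own tools and verify the tool-call output format against the model card.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for.
    Returns:
        The temperature in degrees Celsius.
    """
    return 22.0  # dummy value, for illustration only

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What's the temperature in Paris right now?"}]
# The tool schema is inferred from the function signature and docstring and injected
# into the prompt by the chat template; the model then emits a tool call for you to parse.
inputs = tokenizer.apply_chat_template(
    messages, tools=[get_current_temperature], add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))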
Due to their small size, these models can be used as assistants to larger models for assisted generation (also known as speculative decoding). Below is a sketch of using the Llama 3.2 1B model as an assistant to the Llama 3.1 8B model. For offline use cases, please check the on-device section later in this post.
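The sketch relies on the assistant_model argument of generate in transformers. It assumes the instruct checkpoints below, which share a tokenizer (a requirement for assisted generation), and that both models fit on your hardware.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"
assistant_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
# The small model drafts candidate tokens that the large model verifies in parallel.
assistant = AutoModelForCausalLM.from_pretrained(assistant_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The key ingredients of a great pizza are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))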
Demos
You can experiment with the three instruct models in the following demos:
- Llama 3.2 11B Vision Instruct in a Gradio Space
- Llama 3.2 3B in a Gradio-powered Space
- Llama 3.2 3B running on WebGPU
Using Hugging Face Transformers
The text-only checkpoints have the same architecture as previous releases, so there is no need to update your environment. However, given the new architecture, Llama 3.2 Vision requires an update to Transformers: be sure to upgrade your installation to 4.45.0 or later.
pip install "transformers>=4.45.0" --upgrade
The upgrade allows you to use the new Llama 3.2 model and utilize all the tools of the Hugging Face ecosystem.
Llama 3.2 1B and 3B Language Models
You can run the 1B and 3B text model checkpoints with just a few lines of code using Transformers. The model checkpoints are uploaded in bfloat16 precision, but you can also use float16 or quantized weights. Memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for inference with different configurations:
Model Size | BF16/FP16 | FP8 | INT4 |
---|---|---|---|
3B | 6.5 GB | 3.2 GB | 1.75 GB |
1B | 2.5 GB | 1.25 GB | 0.75 GB |
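As a rough sanity check (an estimate only, not part of the official figures), weight memory is approximately the parameter count times the bytes per parameter (2 for BF16/FP16, 1 for FP8, 0.5 for INT4), plus overhead for activations and the KV cache:

# Back-of-the-envelope estimate of weight memory for the 3B model (illustrative only;
# real usage adds overhead for activations, the KV cache, and framework buffers).
params = 3.2e9           # assuming ~3.2 B parameters
bytes_per_param = 2      # bfloat16 / float16
print(f"{params * bytes_per_param / 1e9:.1f} GB")  # ~6.4 GB, in line with the table above

The snippet below uses the high-level pipeline API to run the 3B Instruct model: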
from transformers import pipeline
import torch
model_id = "meta-llama/Llama-3.2-3B-Instruct"
pipe = pipeline(
"text-generation",
model=model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
messages,
max_new_tokens=256,
)
response = outputs[0]["generated_text"][-1]["content"]
print(response)
# Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh? Alright then, matey! I be a language-generatin' swashbuckler, a digital buccaneer with a penchant fer spinnin' words into gold doubloons o' knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name, and I be here to help ye navigate the seven seas o' questions and find the hidden treasure o' answers! So hoist the sails and set course fer adventure, me hearty! What be yer first question?
A couple of details:
- We load the model in bfloat16. As mentioned above, this is the type used by the original checkpoints published by Meta, so it's the recommended way to run to ensure the best precision or to conduct evaluations. Depending on your hardware, float16 might be faster.
- By default, transformers uses the same sampling parameters as the original Meta codebase (temperature=0.6 and top_p=0.9). We haven't tested this extensively, so feel free to explore — see the sketch below for overriding them.
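For instance, reusing the pipe and messages defined above, one quick (non-authoritative) way to experiment with different sampling settings is to pass the generation parameters directly in the call:

# Override the default sampling parameters for a single generation
# (these values are just a starting point; tune them for your use case).
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print(outputs[0]["generated_text"][-1]["content"])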
Llama 3.2 Vision Models
The Vision models are larger than the small text models and therefore require more memory to run. For reference, the 11B Vision model requires about 10 GB of GPU RAM for inference in 4-bit mode.
The easiest way to run inference with the instruction-tuned Llama Vision models is to use the built-in chat template. The inputs have user and assistant roles to indicate the conversation turns. One difference with respect to the text models is that the system role is not supported. User turns may include image-text or text-only inputs. To indicate that the input contains an image, add {"type": "image"} to the content part of the turn and then pass the image data to the processor:
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device="cuda",
)
processor = AutoProcessor.from_pretrained(model_id)
url = "/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "Can you please describe this image in just one sentence?"}
]}
]
input_text = processor.apply_chat_template(
messages, add_generation_prompt=True,
)
inputs = processor(
image, input_text, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=70)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))
## The image depicts a rabbit dressed in a blue coat and brown vest, standing on a dirt road in front of a stone house.
You can continue the conversation about the image. Keep in mind, though, that if you provide a new image in a new user turn, the model will refer to the new image from then on; you can't query two different images at the same time. Here's an example of continuing the previous conversation, where we add the assistant turn and ask for some more details:
messages = [
{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": "Can you please describe this image in just one sentence?"}
]},
{"role": "assistant", "content": "The image depicts a rabbit dressed in a blue coat and brown vest, standing on a dirt road in front of a stone house."},
{"role": "user", "content": "What is in the background?"}
]
input_text = processor.apply_chat_template(
messages,
add_generation_prompt=True,
)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=70)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:]))
This is the response we got:
In the background, there is a stone house with a thatched roof, a dirt road, a field of flowers, and rolling hills.
You can also use the bitsandbytes library to automatically quantize the model and load it in 8-bit or even 4-bit mode. Here's an example of how you'd load the generation pipeline in 4-bit:
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
+from transformers import BitsAndBytesConfig
+bnb_config = BitsAndBytesConfig(
+ load_in_4bit=True,
+ bnb_4bit_quant_type="nf4",
+ bnb_4bit_compute_dtype=torch.bfloat16
)
model = MllamaForConditionalGeneration.from_pretrained(
model_id,
- torch_dtype=torch.bfloat16,
- device="cuda",
+ quantization_config=bnb_config,
)
You can then apply the chat template, use the processor, and call the model just like before.
On-device deployment
You can run Llama 3.2 1B and 3B directly on your device's CPU, GPU, or browser using several open-source libraries, as shown below.
& Llama-cpp-python
is the framework of choice for performing machine learning inference on cross-platform devices. We provide 4-bit and 8-bit quantization weights for the 1B and 3B models. We hope the community will adopt these models and create additional quantization and fine-tuning. You can find more information on thehere are Find all quantized Llama 3.2 models.
Here's how you can use these checkpoints directly with llama.cpp.
Install llama.cpp via brew (works on Mac and Linux):

brew install llama.cpp
You can use the CLI to run a single generation, or start a server compatible with the OpenAI messages specification.
You can run the CLI with the following command:
llama-cli --hf-repo hugging-quants/Llama-3.2-3b-instruct-Q8_0-GGUF --hf-file llama-3.2-3b-instruct-q8_0.gguf -p " The meaning of life and the universe is "
You can start the server like this:
llama-server --hf-repo hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF --hf-file llama-3.2-3b-instruct-q8_0.gguf -c 2048
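Once the server is running, you can query its OpenAI-compatible endpoint. The sketch below assumes the default host and port (localhost:8080); adjust them if you launched the server with different settings.

import requests

# Send a chat completion request to the llama-server OpenAI-compatible API.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])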
You can also access these models programmatically in Python using llama-cpp-python:
from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="hugging-quants/Llama-3.2-3B-Instruct-Q8_0-GGUF",
filename="*q8_0.gguf",
)
llm.create_chat_completion(
messages = [
{
"role": "user",
"content": "What is the capital of France?"
}
]
)
You can even run Llama 3.2 in your browser (or any JavaScript runtime such as Node.js, Deno, or Bun) using Transformers.js. You can find the ONNX models on the Hub. If you haven't installed the library yet, you can do so from NPM with the following command:
npm i @huggingface/transformers
You can then run the model as follows:
import { pipeline } from "@huggingface/transformers";
// Create a text generation pipeline
const generator = await pipeline("text-generation", "onnx-community/Llama-3.2-1B-Instruct");
// Define the list of messages
const messages = [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Tell me a joke." },
];
// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
Here's a joke for you:
What do you call a fake noodle?
An impasta!
I hope that made you laugh! Do you want to hear another one?
Fine-tuning Llama 3.2
TRL supports chatting with and fine-tuning the Llama 3.2 text models out of the box:
# Chat
trl chat --model_name_or_path meta-llama/Llama-3.2-3B
# Fine-tune
trl sft --model_name_or_path meta-llama/Llama-3.2-3B \
--dataset_name HuggingFaceH4/no_robots \
--output_dir Llama-3.2-3B-Instruct-sft \
--gradient_checkpointing
TRL also supports fine-tuning Llama 3.2 Vision with this script:
# Tested on 8x H100 GPUs
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
examples/scripts/sft_vlm.py \
--dataset_name HuggingFaceH4/llava-instruct-mix-vsft \
--model_name_or_path meta-llama/Llama-3.2-11B-Vision-Instruct \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 8 \
--output_dir Llama-3.2-11B-Vision-Instruct-sft \
--bf16 \
--torch_dtype bfloat16 \
--gradient_checkpointing
You can also check out this notebook to learn how to do LoRA fine-tuning with Transformers and PEFT.
Hugging Face Partner Integration
We are currently working with our partners at AWS, Google Cloud, Microsoft Azure, and DELL to add the Llama 3.2 11B and 90B models to Amazon SageMaker, Google Kubernetes Engine, Vertex AI Model Catalog, Azure AI Studio, and DELL Enterprise Hub. We'll update this section as these containers become available, and you can subscribe to Hugging Squad for email updates.
Additional resources
- Models on the Hub
- Hugging Face Llama Recipes
- Open LLM Leaderboard
- Meta Blog
- Evaluation datasets
Acknowledgements
The release of this model, along with the support and evaluations in the ecosystem, would not have been possible without the contributions of thousands of community members to transformers, text-generation-inference, vllm, pytorch, LM Eval Harness, and numerous other projects. Special thanks to the vLLM team for their help with testing and issue reporting. This release would not have been possible without the support of Clémentine, Alina, Elie, and Loubna for LLM evaluations; Nicolas Patry, Olivier Dehaene, and Daniël de Kok for their contributions to text-generation-inference; Lysandre, Arthur, Pavel, Edward Beeching, Amy, Benjamin, Joao, Pablo, Raushan Turganbay, Matthew Carrigan, and Joshua Lochner for their support in transformers, TRL, and PEFT; Nathan Sarrazin and Victor for making Llama 3.2 available on Hugging Chat; Brigitte Tousignant and Florent Daudens for communication support; and Julien, Simon, Pierric, Eliott, Lucain, Alvaro, Caleb, and Mishig from the Hub team for development and feature release support.
Special thanks to the Meta team for releasing Llama 3.2 and making it open to the AI community!
Original article: /blog/llama32
Original authors: Merve Noyan, Philipp Schmid, Omar Sanseviero, Vaibhav Srivastav, Lewis Tunstall, Aritra Roy Gosthipaty, Pedro Cuenca
Translators: cheninwang, roseking