
Llama 3.1 - 405B, 70B and 8B with Multilinguality and Long Context


Llama 3.1 has been released! Today we welcome the newest member of the Llama family, Llama 3.1, to the Hugging Face platform. We are excited to work with Meta to ensure the best possible integration in the Hugging Face ecosystem, with eight open-weight models (3 base models and 5 fine-tuned models) available on the Hub.

Llama 3.1 is available in three specifications: 8B for efficient deployment and development on consumer GPUs, 70B for large-scale AI-native applications, and 405B for synthetic data, Large Language Models (LLMs) as judges, or distillation. All three specifications are available in base and instruction-tuned versions.

In addition to the six generative models, Meta has released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects prompt injections and jailbreaks. Llama Guard 3 is a safeguard model that classifies LLM inputs and generated content.

Some of the features and integrations in this release include:

  • Models on the Hub
  • Hugging Face Transformers and TGI Integration
  • Hugging Chat Integration for Meta Llama 3.1 405B Instruct
  • Inference and Deployment Integration with Inference Endpoints, Google Cloud, Amazon SageMaker, and DELL Enterprise Hub
  • FP8, AWQ, and GPTQ quantization for easier inference
  • Fine-tuning Llama 3.1 8B on a single GPU with 🤗 TRL
  • Generating synthetic data with Llama 3.1 70B and 405B using Distilabel

What's new in Llama 3.1

Why is Llama 3.1 exciting? Building on its predecessor, Llama 3.1 adds a number of key new features.

  • Long context capability of 128K tokens (vs. 8K)
  • Multi-language support
  • Tool use capabilities
  • Very Large Dense Model with 405 Billion Parameters
  • A more permissive license

Let's dive into these new features!

Llama 3.1 introduces six new open LLM models based on the Llama 3 architecture. They come in three sizes: 8B, 70B, and 405B parameters, each with base (pre-trained) and instruction-tuned versions. All variants support a context length of 128K tokens and eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Llama 3.1 continues to use Grouped-Query Attention (GQA), an efficient representation that helps handle longer contexts.

  • Meta-Llama-3.1-8B: Base 8B model
  • Meta-Llama-3.1-8B-Instruct: Instruction-tuned version of the base 8B model
  • Meta-Llama-3.1-70B: Base 70B model
  • Meta-Llama-3.1-70B-Instruct: Instruction-tuned version of the base 70B model
  • Meta-Llama-3.1-405B: Base 405B model
  • Meta-Llama-3.1-405B-Instruct: Instruction-tuned version of the base 405B model

In addition to these six language models, Llama Guard 3 and Prompt Guard were released.

  • Llama Guard 3 is the latest iteration of the Llama Guard family, fine-tuned from Llama 3.1 8B. It is built for production use cases, with a 128K context length and multilingual capabilities. Llama Guard 3 can classify LLM inputs (prompts) and outputs to detect content that would be considered unsafe in a risk taxonomy.
  • Prompt Guard, on the other hand, is a small 279M-parameter BERT-based classifier that detects prompt injections and jailbreaks. It was trained on a large corpus of attacks, and Meta suggests further fine-tuning it with application-specific data.

A new addition in Llama 3.1 compared with Llama 3 is that the instruct models are fine-tuned for tool calling, targeting agentic use cases. Two tools are built in (search, and mathematical reasoning with Wolfram Alpha), and they can be extended with custom JSON functions.

The Llama 3.1 models were trained on over 15 trillion tokens on a custom-built GPU cluster, for a total of 39.3M GPU hours (1.46M for 8B, 7.0M for 70B, 30.84M for 405B). We do not know the exact details of the training dataset mix, but we assume it was more broadly curated for multilinguality. Llama 3.1 Instruct is optimized for instruction following and was trained with supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) on publicly available instruction datasets, as well as over 25 million synthetically generated examples. Meta developed LLM-based classifiers to filter and curate high-quality prompts and responses when building the data mix.

Regarding licensing, Llama 3.1 comes with a very similar license, with one key difference: it allows using model outputs to improve other LLMs. This means that synthetic data generation and distillation are allowed, even with different models! This is especially important for the 405B model, as discussed later. The license allows redistribution, fine-tuning, and the creation of derivative works; it still requires derived models to include "Llama" at the beginning of their names, and any derivative work or service must mention "Built with Llama". For complete details, be sure to read the official license.

How much RAM does Llama 3.1 require?

Llama 3.1 brings exciting advances, but running it requires careful consideration of your hardware resources. Below we break down the memory requirements for inference and training across the three model sizes.

Inference memory requirements

For inference, the memory requirement depends on the model size and the precision of the weights. Here are the approximate memory requirements for different configurations:

Model Size   FP16     FP8      INT4
8B           16 GB    8 GB     4 GB
70B          140 GB   70 GB    35 GB
405B         810 GB   405 GB   203 GB

Note: The numbers above represent the GPU VRAM needed just to load the model checkpoint. They do not include torch reserved space for kernels or CUDA graphs.
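
As a rough back-of-the-envelope check, weight memory is approximately the parameter count multiplied by the bytes per parameter. A minimal sketch that reproduces the figures above (parameter counts are nominal):

# Approximate VRAM needed just to hold the weights (no KV cache, no activations).
params = {"8B": 8e9, "70B": 70e9, "405B": 405e9}      # nominal parameter counts
bytes_per_param = {"FP16": 2, "FP8": 1, "INT4": 0.5}  # bytes used per weight at each precision

for size, n in params.items():
    estimates = {fmt: f"{n * b / 1e9:.0f} GB" for fmt, b in bytes_per_param.items()}
    print(size, estimates)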

For example, an H100 node (8x H100) has about 640GB of VRAM, so the 405B model needs to be run in a multi-node setup or at lower precision (e.g. FP8), which is the recommended approach.

Keep in mind that lower precision (e.g. INT4) may result in some loss of accuracy, but can significantly reduce memory requirements and increase inference throughput. In addition to the model weights, you also need to keep the KV cache in memory. It contains the keys and values of all tokens in the model's context so that they do not need to be recomputed when generating new tokens. This becomes especially important when taking advantage of the available long context length. In FP16, the KV cache memory requirements are as follows:

Model Size   1k tokens   16k tokens   128k tokens
8B           0.125 GB    1.95 GB      15.62 GB
70B          0.313 GB    4.88 GB      39.06 GB
405B         0.984 GB    15.38 GB     123.05 GB

Notably, for the smaller models, the cache uses as much memory as the weights when approaching the context-length limit.
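
These per-token figures follow from the GQA cache layout: 2 (keys and values) × number of layers × number of KV heads × head dimension × bytes per value. A minimal sketch, assuming the Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128; these architecture numbers are assumptions not stated above):

# Approximate FP16 KV cache size for the assumed 8B configuration.
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

def kv_cache_gib(num_tokens: int) -> float:
    # Factor of 2 accounts for storing both keys and values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens / 1024**3

for tokens in (1_000, 16_000, 128_000):
    print(f"{tokens:>7} tokens -> {kv_cache_gib(tokens):6.2f} GB")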

Training memory requirements

The following table summarizes the approximate memory requirements for training Llama 3.1 models using different techniques.

Model Size   Full Fine-tuning   LoRA     Q-LoRA
8B           60 GB              16 GB    6 GB
70B          300 GB             160 GB   48 GB
405B         3.25 TB            950 GB   250 GB

Note: These are estimates and may vary depending on implementation details and optimizations.

Llama 3.1 evaluation

Note: We are currently evaluating Llama 3.1 individually on the new Open LLM Leaderboard 2 and will update this section later today. Below is an excerpt from Meta's official evaluation.

Category               Benchmark              # Shots  Metric              Llama 3 8B  Llama 3.1 8B  Llama 3 70B  Llama 3.1 70B  Llama 3.1 405B
General                MMLU                   5        macro_avg/acc_char  66.7        66.7          79.5         79.3           85.2
                       MMLU PRO (CoT)         5        macro_avg/acc_char  36.2        37.1          55.0         53.8           61.6
                       AGIEval English        3-5      avg/acc_char        47.1        47.8          63.0         64.6           71.6
                       CommonSenseQA          7        acc_char            72.6        75.0          83.8         84.1           85.8
                       Winogrande             5        acc_char            -           60.5          -            83.3           86.7
                       BIG-Bench Hard (CoT)   3        avg/em              61.1        64.2          81.3         81.6           85.9
                       ARC-Challenge          25       acc_char            79.4        79.7          93.1         92.9           96.1
Knowledge reasoning    TriviaQA-Wiki          5        em                  78.5        77.6          89.7         89.8           91.8
Reading comprehension  SQuAD                  1        em                  76.4        77.0          85.6         81.8           89.3
                       QuAC (F1)              1        f1                  44.4        44.9          51.1         51.1           53.6
                       BoolQ                  0        acc_char            75.7        75.0          79.0         79.4           80.0
                       DROP (F1)              3        f1                  58.4        59.5          79.7         79.6           84.8

Using Hugging Face Transformers

Llama 3.1 requires a minor modeling update to handle RoPE scaling effectively. With Transformers version 4.43, you can use the new Llama 3.1 models and take advantage of all the tools in the Hugging Face ecosystem. Make sure to use the latest transformers version:

pip install "transformers>=4.43" --upgrade

A few details:

  • Transformers loads models in bfloat16 by default. This is the type used by the original checkpoints published by Meta, so this is the recommended method for ensuring optimal accuracy or for evaluation.
  • An assistant response may end with the special token <|eot_id|>, but we must also stop generation if the regular EOS token is found. We can stop generation early by providing a list of terminators in the eos_token_id parameter (see the sketch after this list).
  • We used the default sampling parameters (temperature and top_p) from Meta's codebase. We haven't had time to test them extensively, so feel free to explore!
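
For reference, a minimal sketch of providing such a list of terminators when calling generate directly (the pipeline example below handles this automatically):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Who are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Stop on either the regular EOS token or the end-of-turn token <|eot_id|>.
terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
outputs = model.generate(input_ids, max_new_tokens=256, eos_token_id=terminators)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))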

The following snippet shows how to use meta-llama/Meta-Llama-3.1-8B-Instruct. It requires approximately 16 GB of VRAM, which fits many consumer GPUs. The same code snippet works for meta-llama/Meta-Llama-3.1-70B-Instruct, which requires 140 GB of VRAM, and meta-llama/Meta-Llama-3.1-405B-Instruct (810 GB of VRAM required), making it a very interesting model for production use cases. Memory consumption can be further reduced by loading the model in 8-bit or 4-bit mode.

from transformers import pipeline
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)
# Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh? Alright then, matey! I be a language-generatin' swashbuckler, a digital buccaneer with a penchant fer spinnin' words into gold doubloons o' knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name, and I be here to help ye navigate the seven seas o' questions and find the hidden treasure o' answers! So hoist the sails and set course fer adventure, me hearty! What be yer first question?

You can also automatically quantize the model and load it in 8-bit or even 4-bit mode using bitsandbytes. Loading the large 70B version in 4-bit takes about 34 GB of memory to run. This is how you would create the generation pipeline in 4-bit mode:

pipeline = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": {"load_in_4bit": True}
    },
)

For more detailed information on using the model with transformers, please see the model card.

Note: Transformers handles all the tricky prompt template issues, so if you want to learn more about prompts, check out the next section.

How to use Llama 3.1

Base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning on your own use cases.
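
For illustration, a minimal sketch of plain text continuation with the base 8B model (the prompt is just an example):

from transformers import pipeline
import torch

# Base model: no chat template, it simply continues the input text.
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
print(generator("The key ingredients of a good paella are", max_new_tokens=64)[0]["generated_text"])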

The instruct versions support a conversational format with four roles:

  1. system: Sets the context for the conversation. It allows including rules, guidelines, or information needed to respond effectively. It is also used to enable tool use when appropriate.
  2. user: User inputs, commands, and questions for the model.
  3. assistant: The assistant's response, based on the context provided in the system and user prompts.
  4. ipython: A new role introduced in Llama 3.1. It is used to mark the output of a tool call when it is sent back to the LLM.

The instruct versions use the following conversation structure for simple conversations:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>

The Llama 3.1 instruct models now support tool calling, including three built-in tools (brave_search, wolfram_alpha, and code_interpreter) and custom tool calling via JSON function definitions. The built-in tools use Python syntax. The ability to generate Python code for function calling is part of the code interpreter tool and must be enabled in the system prompt using the Environment keyword, as shown below.

Built-in tool calls

Including "Environment: ipython" turns on code interpreter mode, where the model can generate the Python code it expects to be executed. The message body of the helper response is marked with the special<|python_tag|> begins with<|eom_id|> Ending, not standard<|eot_id|>. The latter indicates the end of the round, while the former indicates the continuation of multi-step reasoning.

Example of a built-in tool call
<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Tools: brave_search, wolfram_alpha

Cutting Knowledge Date: 01 March 2023
Today's Date: 13 July 2024


You are a helpful Assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Weather in Menlo Park, California<|eot_id|><|start_header_id|>assistant<|end_header_id|>

At this point, the model's response will include Python code calling one of the supported tools (brave_search in this case):

<|python_tag|>brave_search.call(query="current weather in Menlo Park, California")<|eom_id|>

The response from the tool call is executed and then sent back to the model to retrieve the final response. For brevity, the following would be appended to the message shown in the previous snippet:

<|python_tag|>brave_search.call(query="Menlo Park California weather")<|eom_id|><|start_header_id|>ipython<|end_header_id|>

{"query": "Menlo Park California weather", "top_k": [{"title": "10-Day Weather Forecast for West Menlo Park, CA - The Weather Channel | ", "url": "/weather/tenday/l/West+Menlo+Park+CA?canonicalCityId=b2375713aa1943aad7d1a13a85e1c0adad13c1b10563b2bbaad70734dc61cf11", "description": "Be prepared with the most accurate 10-day forecast for West <strong>Menlo</strong> <strong>Park</strong>, CA with highs, lows, chance of precipitation from The <strong>Weather</strong> Channel and <strong>Weather</strong>.com", "type": "search_result"},....}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The final response from the LLM would then be:

The current weather in Menlo Park, California is mostly sunny with a high of 77°F and a low of 56°F.<|eot_id|>

Custom tool calls

Llama 3.1 instruct supports custom function calls from a single user message. The following prompt provides an example of how to call a custom function from the model's output. For custom function calls, the model outputs <|eot_id|> instead of <|eom_id|>. The system prompt needs to be adjusted to inform the model how to handle the function call output.

Example of a custom tool call with JSON functions
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant with tool calling capabilities. When you receive a tool call response, use the output to format an answer to the orginal user question.<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{
    "type": "function",
    "function": {
    "name": "get_current_conditions",
    "description": "Get the current weather conditions for a specific location",
    "parameters": {
        "type": "object",
        "properties": {
        "location": {
            "type": "string",
            "description": "The city and state, ., San Francisco, CA"
        },
        "unit": {
            "type": "string",
            "enum": ["Celsius", "Fahrenheit"],
            "description": "The temperature unit to use. Infer this from the user's location."
        }
        },
        "required": ["location", "unit"]
    }
    }
}

Question: what is the weather like in Menlo Park?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{"name": "get_current_conditions", "parameters": {"location": "Menlo Park, CA", "unit": "Fahrenheit"}}<|eot_id|><|start_header_id|>ipython<|end_header_id|>

When we retrieve the output from the selected tool, we pass it back to the model using the same <|python_tag|> delimiter. <|python_tag|> does not imply Python use; it is only meant to signal the beginning of the output from any tool.

<|python_tag|>{
    "tool_call_id": "get_current_conditions"
    "output": "Clouds giving way to sun Hi: 76° Tonight: Mainly clear early, then areas of low clouds forming Lo: 56°"
}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The weather in Menlo Park is currently cloudy with a high of 76° and a low of 56°, with clear skies expected tonight.<|eot_id|>

This format has to be reproduced exactly to be used effectively. The chat template available in transformers makes it easy to format prompts correctly.
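
For example, a quick way to inspect the formatted prompt that the chat template produces (the messages below are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Weather in Menlo Park, California"},
]
# Returns the fully formatted prompt string, including the header and <|eot_id|> tokens.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)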

Demos

You can experiment with the three instruct models in the following demos:

  • Hugging Chat for Llama 3.1 405B: /chat/models/meta-llama/Meta-Llama-3.1-405b-instruct/
  • Hugging Chat for Llama 3.1 70B: /chat/models/meta-llama/Meta-Llama-3.1-70b-instruct/
  • Llama 3.1 8B demo in a Gradio-powered Space: /spaces/ysharma/Chat_with_Meta_llama3_1_8b

The whole stack is open source. Hugging Chat is powered by chat-ui and text-generation-inference.

FP8, AWQ, and GPTQ Quantization for Llama 3.1 405B

Meta created an official FP8 quantized version of Llama 3.1 405B. To achieve this, FP8 quantization was applied only to the major linear operators of the model, such as the gates and the up/down projections of the FFNs (covering 75% of the inference FLOPs). We made a concerted effort to ensure that this FP8 quantized checkpoint is compatible across the community (transformers, TGI, vLLM).

In addition, we created AWQ and GPTQ INT4 quantized variants using AutoAWQ and AutoGPTQ. For AWQ, all linear layers were quantized using the GEMM kernel, performing zero-point quantization down to 4 bits with a group size of 128; for GPTQ, the same settings were used with the GPTQ kernel. We made sure the INT4 checkpoints are compatible with both transformers and TGI, including Marlin kernel support to speed up inference in TGI for the GPTQ quants.

Available quantized weights for Llama 3.1 405B:

  • meta-llama/Meta-Llama-3.1-405B-Base-FP8: Official FP8 quantized weights, runs on 8xH100
  • meta-llama/Meta-Llama-3.1-405B-Instruct-FP8: Official FP8 quantized weights, runs on 8xH100
  • hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4: Hugging Face quantized weights, runs on 8x A100 80GB, 8x H100 80GB, and 8x A100 40GB (with a reduced KV cache and without CUDA graphs)
  • hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4: Hugging Face quantized weights, runs on 8x A100 80GB, 8x H100 80GB, and 8x A100 40GB (with a reduced KV cache and without CUDA graphs)
  • hugging-quants/Meta-Llama-3.1-405B-BNB-NF4: Hugging Face quantized weights for QLoRA fine-tuning
  • hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4: Hugging Face quantized weights for inference on 8x A100 and 4x H100

The Hugging Quants organization also contains quantized checkpoints for the 70B and 8B versions.
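
These quantized checkpoints load through the same transformers API as the full-precision weights. A minimal sketch, assuming you have enough GPUs for the INT4 405B weights and autoawq installed (the 8B and 70B quants from the same organization work the same way):

from transformers import pipeline
import torch

quantized_model_id = "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4"
pipe = pipeline(
    "text-generation",
    model=quantized_model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the INT4 weights across all available GPUs
)
messages = [{"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."}]
print(pipe(messages, max_new_tokens=64)[0]["generated_text"][-1]["content"])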

Inference Integrations

Hugging Face Inference API

Hugging Face PRO users now have access to exclusive API endpoints hosting Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, and Llama 3.1 405B Instruct AWQ, powered by text-generation-inference. All versions support the Messages API, so they are compatible with OpenAI client libraries, including LangChain and LlamaIndex.

Note: Update to the latest huggingface_hub version with pip install "huggingface_hub>=0.24.1".

from huggingface_hub import InferenceClient

# Initialize the client, pointing it to one of the available models
client = InferenceClient()

chat_completion = client.chat_completion(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)

# Iterate and print the stream
for message in chat_completion:
    print([0]., end="")

For more details on using the Messages API, check out this post.
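
Because the Messages API is OpenAI-compatible, you can also point an OpenAI client at it. A minimal sketch, where the base URL is a placeholder you would replace with your endpoint or API URL, and HF_TOKEN is your Hugging Face token:

import os
from openai import OpenAI

# Placeholder base URL: point this at your Inference Endpoint (with /v1/) or the hosted API.
client = OpenAI(
    base_url="<ENDPOINT_URL>/v1/",
    api_key=os.environ["HF_TOKEN"],  # a Hugging Face token is used instead of an OpenAI key
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Is Rust better than Python?"}],
    stream=True,
    max_tokens=500,
)
for chunk in chat_completion:
    print(chunk.choices[0].delta.content or "", end="")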

Hugging Face Inference Endpoints

You can deploy Llama 3.1 on Hugging Face's Inference Endpoints, which use Text Generation Inference as the backend - a production-ready inference container developed by Hugging Face that supports FP8, continuous batching, token streaming, and tensor parallelism for fast inference on multiple GPUs. To deploy Llama 3.1, go to the model page and click on the Deploy -> Inference Endpoints widget.

  • Meta-Llama-3.1-8B-Instruct: recommended on 1x NVIDIA A10G or L4 GPUs
  • Meta-Llama-3.1-70B-Instruct: recommended on 4x NVIDIA A100 GPUs, or quantized to AWQ/GPTQ on 2x A100s
  • Meta-Llama-3.1-405B-Instruct-FP8: recommended on 8x NVIDIA H100 in FP8, or quantized to AWQ/GPTQ on 8x A100s
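
Once an endpoint is deployed, you can query it with the same huggingface_hub client by pointing base_url at your endpoint URL:
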
from huggingface_hub import InferenceClient

# Initialize the client, pointing it to the endpoint URL
client = InferenceClient(
    base_url="<ENDPOINT_URL>",
)

# Create a chat completion
chat_completion = client.chat_completion(
    model="ENDPOINT",
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)

# Iterate and print the stream
for message in chat_completion:
    print([0]., end="")

Hugging Face Partner Integration

Note: We are currently working with our partners at AWS, Google Cloud, Microsoft Azure, and DELL to add Llama 3.1 8B, 70B, and 405B to Amazon SageMaker, Google Kubernetes Engine, Vertex AI Model Catalog, Azure AI Studio, and DELL Enterprise Hub. We will update this section as the containers become available - you can subscribe to Hugging Squad for email updates.

Fine-tuning with Hugging Face TRL

In this section, we'll look at the tools available in the Hugging Face ecosystem to efficiently train Llama 3.1 on consumer GPUs. Below is an example command to fine-tune Llama 3.1 8B on OpenAssistant's chat dataset. We use 4-bit quantization and QLoRA to conserve memory, targeting all the linear layers of the attention blocks.

Fine-tuning example with Hugging Face TRL

First, install the latest version of 🤗 TRL and clone the repo to access the training scripts:

pip install "transformers>=4.43" --upgrade
pip install --upgrade bitsandbytes
pip install --upgrade peft
pip install git+https://github.com/huggingface/trl
git clone https://github.com/huggingface/trl
cd trl

Then you can run the script:

python \
    examples/scripts/sft.py \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to "none" \
    --bf16 \
    --max_seq_length 1024 \
    --lora_r 16 --lora_alpha 32 \
    --lora_target_modules q_proj k_proj v_proj o_proj \
    --load_in_4bit \
    --use_peft \
    --attn_implementation "flash_attention_2" \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir llama31

If you have more GPUs available, you can run the training with DeepSpeed and ZeRO Stage 3:

accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft.py \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation eager \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir models/llama
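
After training finishes, a minimal sketch for smoke-testing the resulting QLoRA adapter, assuming it was saved to the llama31 output directory used in the first command:

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Loads the base model and applies the trained LoRA adapter on top of it.
model = AutoPeftModelForCausalLM.from_pretrained(
    "llama31", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

inputs = tokenizer("The OpenAssistant dataset is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))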

Generating synthetic data with distilabel

One of the major changes in the Llama 3.1 license is that it allows the use of model outputs to improve other LLMs, meaning that you can generate synthetic datasets using Llama 3.1 models and use them to fine-tune smaller, more specialized models.

Let's look at an example of how to use distilabel, an open-source framework for synthetic data generation, to build a preference dataset. The resulting dataset can be used to fine-tune models with the preference optimization methods offered by TRL, such as DPO or KTO.

First, install the latest version of distilabel, including the hf-inference-endpoints extra, using pip as follows:

pip install "distilabel[hf-inference-endpoints]" --upgrade

Then define a pipeline that:

  • Load a dataset with instructions from the Hugging Face Hub.
  • Generates responses with Llama 3.1 70B Instruct and Llama 3.1 405B Instruct via Hugging Face Inference Endpoints.
  • Finally, uses Llama 3.1 405B Instruct as a judge to rate the responses with the UltraFeedback prompt. From these ratings, chosen and rejected responses can be selected and used to fine-tune a model with preference optimization methods.

Check out the code below to define the pipeline, or run it yourself using this Colab notebook and explore the generated dataset.

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, CombineColumns
from distilabel.steps.tasks import TextGeneration, UltraFeedback

llama70B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
)
llama405B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
)

with Pipeline(name="synthetic-data-with-llama3") as pipeline:
    # Load a dataset with prompts from the Hub
    load_dataset = LoadDataFromHub(
        repo_id="argilla/10Kprompts-mini"
    )
    # Generate two responses for each prompt
    generate = [
        TextGeneration(llm=llama70B),
        TextGeneration(llm=llama405B)
    ]
    # Combine responses into one column
    combine = CombineColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"]
    )
    # Rate the responses with the 405B model as an LLM-as-a-judge
    rate = UltraFeedback(aspect="overall-rating", llm=llama405B)
    # Define the pipeline flow
    load_dataset >> generate >> combine >> rate

if __name__ == "__main__":
    distiset = pipeline.run()

What's next? Beyond the example above, distilabel provides exciting approaches for generating synthetic data with LLMs across a wide range of scenarios and topics. It includes implementations from the current SOTA literature for tasks such as evaluating outputs with LLM-as-a-judge methods, evolving instructions, data filtering, and defining custom components.

Additional resources

  • Models on the Hub
  • Hugging Face Llama Recipes
  • Open LLM Leaderboard
  • Hugging Chat Demo for Llama 3.1 405B Instruct
  • Meta Blog

Acknowledgements

The release of these models and the support and evaluation in the ecosystem would not have been possible without the contributions of thousands of community members to transformers, TGI, vLLM, PyTorch, LM Eval Harness, and many other projects. This release could not have happened without Clémentine and Nathan's support for LLM evaluation; Nicolas, Olivier Dehaene, and Daniël de Kok's contributions to Text Generation Inference support; Arthur, Matthew Carrigan, Zachary Mueller, Joao, Joshua Lochner, and Lysandre's integration of Llama 3.1 into transformers; Matthew Douglas's contributions to quantization support; Gabriel Martín Blázquez's contributions to distilabel support; Merve Noyan and Aymeric Roucher's contributions to the review; hysts and Yuvi's contributions to the demos; Ellie's contributions to fine-tuning tests; Brigitte Tousignant and Florent Daudens's contributions to communication; and Nathan and Victor for making Llama 3.1 available in Hugging Chat.

Thanks to the Meta team for releasing Llama 3.1 and making it available to the open source AI community!


Original English post: /blog/llama31

Original authors: Philipp Schmid, Omar Sanseviero, Alvaro Bartolome, Leandro von Werra, Daniel Vila, Vaibhav Srivastav, Marc Sun, Pedro Cuenca

Translator: AdinaY