Summary
In this article, we present SmolLM, a family of state-of-the-art small language models with 135M, 360M, and 1.7B parameters, all trained on a new, high-quality dataset. We cover the data curation, model evaluation, and usage.
Introduction
Recently, there has been growing interest in small language models that can run on local devices. This trend has not only pushed industry players to explore compression techniques for large models, such as distillation and quantization, but has also prompted a lot of work on training small models from scratch on large datasets.
Microsoft's Phi series, Alibaba's Qwen2 (the variants under 2B parameters), and Meta's MobileLLM all demonstrate that small models can achieve good performance when properly designed and trained. However, most of the details about their data curation and training are not disclosed.
In this article, we present SmolLM, a collection of top-performing small language models with 135M, 360M, and 1.7B parameters. These models are trained on SmolLM-Corpus, a carefully curated, high-quality dataset built from the following three subsets:
- Cosmopedia v2: textbooks, stories, and other content synthesized with the Mixtral model (28B tokens)
- Python-Edu: data samples from The Stack dataset, filtered by an educational-value score (4B tokens)
- FineWeb-Edu: the FineWeb dataset, deduplicated and filtered by an educational-value score (220B tokens)
Our evaluation results show that, within their respective parameter ranges, the SmolLM models outperform existing models on a range of common-sense reasoning and world-knowledge benchmarks. In this post, we describe how the three subsets of the training corpus were curated and discuss the training and evaluation of SmolLM.
Data curation
Cosmopedia dataset: from v1 to v2
Cosmopedia v2 is an enhanced version of the Cosmopedia dataset, the largest synthetic dataset currently available and one that is often used for pretraining. It contains over three million textbook-style texts, blog posts, stories, and more, generated by the Mixtral-8x7B-Instruct-v0.1 model. Most of the data is generated as follows: web content is collected (the "seed samples"), annotated with the topic category it belongs to, and then expanded by the model. Figure 1 shows an example of such a sample. Here we use a large number of web samples to increase data diversity and broaden the range of topics covered by the prompts. The Cosmopedia dataset is described in detail in this article.
To further improve data quality in the v2 dataset, we tried the following two strategies:
- Use multiple high-performance models to generate data for the same prompt
- Optimize the prompts themselves
For the first strategy, we tried Llama3-70B-Instruct, Mixtral-8x22B-Instruct-v0.1, and Qwen1.5-72B-Chat, but found that training on the data they generated brought only marginal improvements. In the rest of this section, we therefore focus on the second strategy: how we improved the prompts.
Finding better topics and seed samples
Each prompt has three main components: the topic, the seed sample, and the generation style, which defines the intended audience and the type of content we want the model to generate.
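To make this concrete, here is a minimal sketch of how such a prompt might be assembled from the three components. The template wording is illustrative, not the exact prompt used for Cosmopedia v2.

```python
# Illustrative prompt builder combining topic, seed sample, and generation style.
# The template text is an assumption, not the exact Cosmopedia v2 prompt.
def build_prompt(topic: str, seed_sample: str, audience: str, style: str) -> str:
    return (
        f"Here is an extract from a webpage about {topic}:\n\n"
        f"{seed_sample}\n\n"
        f"Write a {style} related to this extract, aimed at {audience}. "
        "Make it self-contained and educational."
    )

prompt = build_prompt(
    topic="Science / Physics / Astrophysics",
    seed_sample="Neutron stars are the collapsed cores of massive stars...",
    audience="middle school students",
    style="detailed textbook section",
)
print(prompt)
```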
To ensure consistent generations, we need to pair each topic with highly relevant seed samples. In Cosmopedia v1, we clustered the FineWeb samples to keep topics aligned with their corresponding samples (Figure 2). However, this approach has two limitations:
- While these topics comprehensively reflect the clusters found in the web/FineWeb data, they may not fully reflect the distribution of topics in the real world.
- The samples within each cluster are not filtered further, so they may contain many low-quality samples.
Therefore, in the v2 dataset we replace unsupervised clustering with 34,000 topics defined by the BISAC book classification, a standard commonly used to categorize books by subject. This approach not only provides comprehensive topic coverage, but also lets us use topics that lean toward educational value. Specifically, we started from 5,000 topics across the 51 BISAC categories and asked the Mixtral model to generate multiple subtopics for each one. The figure below shows the final distribution of subtopics across the broad categories.
After defining the topics, we also needed to find data entries related to them. Much like a search engine, we built a search tool to retrieve the data most relevant to each topic. Using BISAC's broad categories and subcategories as search queries, we searched the CC-MAIN-2024-10 and CC-MAIN-2023-50 folders of the FineWeb dataset, which together contain over 520 million samples. For each search query, we retrieved the 1,000 closest data entries. The code can be found here.
In the end, we gathered 34 million data entries covering 34,000 topics. The next step was to determine which generation style works best.
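As an illustration of such a retrieval step, the sketch below embeds topic strings and web samples with a sentence-transformers encoder and searches a FAISS index. The encoder name and the document loader are assumptions, not the exact setup used to build the corpus.

```python
# Illustrative topic-to-seed retrieval: embed BISAC topic strings and web
# documents with a sentence-transformers encoder and search a FAISS index.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder choice

def build_index(documents: list[str]) -> faiss.IndexFlatIP:
    """Embed the documents and store them in an inner-product (cosine) index."""
    embeddings = encoder.encode(documents, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(np.asarray(embeddings, dtype="float32"))
    return index

def retrieve_for_topic(topic: str, index: faiss.IndexFlatIP,
                       documents: list[str], k: int = 1000) -> list[str]:
    """Return the k documents closest to a 'category / subcategory' query."""
    query = encoder.encode([topic], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype="float32"), k)
    return [documents[i] for i in ids[0]]

# documents = load_fineweb_samples(...)   # hypothetical loader for FineWeb text
# index = build_index(documents)
# seeds = retrieve_for_topic("Science / Physics / Astrophysics", index, documents)
```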
Generation style
To determine the most effective generation style, we ran comparative experiments by training a 1.8B-parameter model on 8 billion tokens drawn from different subsets of Cosmopedia v1. To save time, we generated only 2 billion tokens of training data and trained for 4 epochs (generating 2 billion tokens with Mixtral takes about 1,000 GPU hours). The training and evaluation configurations match those of the FineWeb ablation models. We ran each training twice with different random seeds and averaged the final evaluation scores.
For the comparison, we used the following subsets of Cosmopedia v1:
- The two web sample sets: web_samples_v1 and web_samples_v2
- The stories subset
- The stanford and openstax subsets
We found that overall performance is best when the training texts are based on topics and seed samples from stanford and openstax, with both MMLU and ARC scores higher than for the two web sample sets, while stories only helps on the common-sense benchmarks. After implementing the retrieval of new topics and seed samples for the v2 dataset, we could also compare against the results of this experiment to gauge the quality of our newly generated prompts.
Next, we explored which audience style works best. We generated content from the same web-based prompts but for two target audiences: middle school students and college students. We found that training on the data generated for the middle school audience gave the best scores on every benchmark except MMLU. A plausible explanation is that these benchmarks mostly test elementary or intermediate scientific knowledge, whereas MMLU also contains questions that require advanced or even expert knowledge.
For the v2 data, we generated 40% of the content for a middle school audience, 30% for a college audience, and the remaining 30% as a mix of other audiences and styles, including the stories and stanford styles from v1. In addition, we generated 1 billion tokens of code-related text based on the Python subset of the AutoMathText dataset.
In the end, we generated 39 million synthetic documents, amounting to 28 billion tokens, covering textbooks, stories, articles, and code, with a wide diversity of target audiences and more than 34,000 topics.
FineWeb-Edu Dataset
We released the FineWeb-Edu dataset a few months ago together with the technical report on the FineWeb dataset. It contains 1.3 trillion tokens of education-related web pages filtered out of the 🍷 FineWeb dataset.
To filter the data, we developed an educational-quality classifier trained on annotations produced by Llama3-70B-Instruct, and used it to identify the web content with the highest educational value in FineWeb. The experiments in the figure below show that a model trained on the filtered FineWeb-Edu clearly outperforms one trained on FineWeb on common benchmarks, which also shows that our classifier works.
In the Smollm-Corpus dataset, we include 220 billion deduplicated tokens from FineWeb.
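As an illustration, here is a minimal sketch of scoring a piece of text with such an educational-quality classifier via transformers, assuming a regression-style sequence-classification head. The checkpoint name is an assumption and may differ from the released classifier.

```python
# Illustrative scoring of text with an educational-quality classifier.
# The checkpoint name below is an assumption about how the classifier is
# published on the Hub; adjust it to the actual released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceTB/fineweb-edu-classifier"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def edu_score(text: str) -> float:
    """Return the predicted educational-value score for a piece of text."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.squeeze().item()

print(edu_score("Photosynthesis converts light energy into chemical energy."))
```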
Stack-Edu-Python Dataset
Here we use the same approach as for FineWeb-Edu. We used Llama3 to score 500,000 Python snippets from The Stack dataset according to their educational value, then used the scored data to train a classifier. We then applied this classifier to the Python subset of the StarCoder model's training corpus, keeping only samples with a score of 4 or above. Out of 40 billion tokens, this left us with a new dataset of 4 billion tokens.
The figure below shows the effect of training the model on the different datasets (filtered with a threshold of 4 or 3, and unfiltered). The model converges more than 3 times faster on Python-Edu than on unfiltered data, and reaches a 16% pass@1 rate after only 12 billion training tokens.
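A minimal sketch of the threshold filtering described above, using the datasets library; the file name and score column name are illustrative assumptions.

```python
# Illustrative threshold filtering: keep only Python samples with an
# educational score of 4 or above. The file and column names are assumptions.
from datasets import load_dataset

scored = load_dataset("json", data_files="python_snippets_scored.jsonl", split="train")
python_edu = scored.filter(lambda ex: ex["edu_score"] >= 4)
print(f"kept {len(python_edu)} of {len(scored)} samples")
```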
Model training
SmolLM contains three models with different parameter sizes, all of which are trained on the mixture of data shown below.
- The 135M and 360M parameter models were trained on 600 billion tokens from Smollm-Corpus
- The 1.7B parameter model was trained on 1 trillion tokens from Smollm-Corpus
Selection of hyperparameters
We use a trapezoidal learning rate schedule, with the final 20% of the total training duration used as the cooldown phase. Note that the original validation experiments for this schedule were run only at small scale, while our work extends it to larger models.
In terms of architecture, our 135M and 360M models follow a design similar to MobileLLM, incorporating grouped-query attention and prioritizing depth over width, while the 1.7B model uses a more traditional design. In addition, all three models use embedding tying and a context length of 2048 tokens; long-context fine-tuning can extend this further.
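The sketch below shows what such a trapezoidal (warmup-stable-decay) schedule looks like, with the final 20% of steps used as a linear cooldown; the warmup fraction is an illustrative choice rather than SmolLM's exact setting.

```python
# Illustrative trapezoidal (warmup-stable-decay) learning rate schedule.
def trapezoidal_lr(step: int, total_steps: int, max_lr: float,
                   warmup_frac: float = 0.01, decay_frac: float = 0.2) -> float:
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:          # linear warmup
        return max_lr * step / max(warmup_steps, 1)
    if step < decay_start:           # constant plateau
        return max_lr
    # linear cooldown to zero over the last 20% of training
    remaining = total_steps - step
    return max_lr * remaining / (total_steps - decay_start)
```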
Detailed information on the specific model structure can be found in the following table.
The tokenizer we used was trained on Smollm-Corpus with a vocabulary size of 49152.
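As an illustration of these design choices (grouped-query attention, depth over width, embedding tying, 2048-token context, 49152-token vocabulary), here is a hedged config sketch; the hidden size, depth, and head counts are placeholders, not the released SmolLM hyperparameters.

```python
# Illustrative config in the spirit of the smaller SmolLM models.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=49152,
    hidden_size=576,             # placeholder: "deep and thin" design
    intermediate_size=1536,      # placeholder
    num_hidden_layers=30,        # placeholder: prioritize depth over width
    num_attention_heads=9,       # placeholder
    num_key_value_heads=3,       # grouped-query attention: fewer KV heads than Q heads
    max_position_embeddings=2048,
    tie_word_embeddings=True,    # embedding tying
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```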
Experiments
One advantage of the trapezoidal learning rate schedule is that it allows us to run scaling-law-style experiments more quickly (see Hägele et al.). We performed a small scaling-law experiment with SmolLM-135M to verify this, launching learning rate cooldowns from different checkpoints of the main training run. We observed that performance keeps improving as the model is trained for longer, even beyond the Chinchilla optimal point (the optimal ratio of parameter count to training data). Based on these observations, we decided to train the 1.7B model on 1T tokens, while the 135M and 360M models were trained on 600B tokens, because after 400B tokens the two smaller models were already improving slowly on some benchmarks.
We also tried adding instruction datasets and upsampling a subset of Cosmopedia during the learning rate cooldown phase, but this had little effect. A likely reason is that the quality of our data mixture is already high enough that these changes bring limited gains.
During the training of the two smaller models, we tracked the evolution of the various benchmarks; see the figure below.
Model evaluation
We evaluated the SmolLM models at the different parameter sizes and compared them with some of the best current models, using a variety of benchmarks covering common-sense reasoning and world knowledge. We use lighteval and these configurations. For Python code evaluation (HumanEval), we use bigcode-evaluation-harness with temperature set to 0.2, top-p to 0.95, and 20 samples per problem. For MobileLLM, which is not open source, the results are taken from the paper.
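To make the sampling setup concrete, here is a sketch of generating 20 completions per problem with temperature 0.2 and top-p 0.95 using transformers; the actual evaluation was run with bigcode-evaluation-harness, and the model id here is an assumption.

```python
# Illustrative sampling setup matching the code-evaluation configuration above.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM-1.7B"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=20,   # 20 samples per problem
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```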
We found that:
- Among models with fewer than 200M parameters, SmolLM-135M outperforms the current best model, MobileLLM-125M, on all benchmarks, while using only 600B training tokens compared to MobileLLM-125M's 1T.
- SmolLM-360M likewise outperforms the other models with fewer than 500M parameters, despite having fewer parameters and less training data than MobileLLM-350M and Qwen2-500M.
- Among models with up to 2B parameters, SmolLM-1.7B also outperforms models including Phi1.5 and MobileLLM-1.5B.
- SmolLM-1.7B also performs well on Python coding (the Qwen2-1.5B score we measured differs from the one reported by the Qwen team; our configuration was temperature 0.2, top-p 0.95, and 20 samples).
We also instruction-tuned the models using publicly available datasets. All three models were trained for one epoch on the WebInstructSub dataset and StarCoder2-Self-OSS-Instruct. We then performed DPO training, using HelpSteer for the 135M and 1.7B models and argilla/dpo-mix-7k for the 360M model. The training configuration is the same as in the Zephyr-Gemma documentation, except that we changed the SFT learning rate to 3e-4.
The table below shows how the instruction-tuned SmolLM models (SmolLM-Instruct) compare to other models on IFEval. Qwen2-1.5B-Instruct achieves the highest score, while the SmolLM-Instruct models offer a good trade-off between model size and performance, using only publicly available datasets.
How do I run the SmolLM model locally?
Our small models can run on a variety of local hardware. For example, the iPhone 15 has 6GB of RAM and the iPhone 15 Pro has 8GB, so many devices, from phones to laptops, are sufficient to run them. The table below lists the actual memory usage we measured when running our models.
In addition to the model weights, which can be loaded directly with the transformers library, we have also released ONNX checkpoints and plan to provide GGUF versions of the models. WebGPU demo pages for SmolLM-135M and SmolLM-360M are also available.
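For example, a minimal way to try a checkpoint locally with transformers (the model id is assumed to follow the names in the SmolLM collection):

```python
# Minimal local inference sketch with the transformers pipeline API.
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM-360M")  # assumed model id
print(pipe("Gravity is", max_new_tokens=50)[0]["generated_text"])
```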
Conclusion
This post introduced the SmolLM family of models and showed experimentally that small models can achieve strong performance when they are trained well on sufficiently high-quality data. SmolLM demonstrates that model size and performance can be traded off effectively.
Other resources
- SmolLM model collection: /collections/HuggingFaceTB/smollm-models-6695016cad7167254ce15966
- SmolLM-Corpus dataset: /datasets/HuggingFaceTB/smollm-corpus
- WebGPU demo pages: /spaces/HuggingFaceTB/SmolLM-135M-Instruct-WebGPU and /spaces/HuggingFaceTB/SmolLM-360M-Instruct-WebGPU
Original English article: /blog/smollm
Original authors: Loubna Ben Allal, Anton Lozhkov, Elie Bakouch
Translator: hugging-hoi2022