Paper: Target-Aware Language Modeling via Granular Data Sampling
- Paper address: /abs/2409.14705
Innovation points
- An algorithm is proposed that merges pre-trained tokens with multi-granularity tokens to produce efficient n-gram features that are highly correlated with downstream task performance.
- Building on this finding, the importance-based data sampling technique is improved by adapting the generic vocabulary to a target vocabulary. This yields a better representation of the data and improves model performance on the target tasks while maintaining good performance on non-target tasks.
Content overview
Language model pre-training usually targets a wide range of usage scenarios and combines data from many sources. However, models sometimes need to perform well in a specific domain without sacrificing performance elsewhere. This calls for data selection methods that identify candidate core data, and for effective ways to sample the selected data for training.
The paper uses multi-granularity token n-gram features for importance sampling, which strikes a good balance between sentence compression and representational power. The sampled data are highly correlated with performance on the target downstream tasks while remaining effective on other tasks, allowing the language model to be pre-trained more efficiently on the selected documents.
Across eight benchmarks, models pre-trained on about 1% of the data perform comparably to models trained on the full RefinedWeb dataset, and they outperform randomly selected samples at model sizes ranging from 125M to 1.5B parameters.
Methodology
Selecting samples from large-scale datasets such as RefinedWeb is slow and expensive. A practical solution is to encode each document as a vector of easily computable n-gram features.
Assume a small set of target text examples \(D_{task}\) drawn from a target distribution \(p\), and a large set of raw data \(D_{raw}\) with \(N\) examples drawn from a distribution \(q\). The goal is to select \(k\) examples (\(k \ll N\)) from the raw dataset that are similar to the target.
Importance sampling
Importance sampling selects examples aligned with the target distribution and provides a tractable importance estimate for each text. It is applied in a feature space \(\mathbb{Z}\) that provides the necessary structure.
A feature extractor \(h: \mathbb{X} \rightarrow \mathbb{Z}\) converts inputs into features, inducing a raw feature distribution \(q_{\text{feat}}\) and a target feature distribution \(p_{\text{feat}}\). The goal is to select data whose features align with the target feature distribution \(p_{\text{feat}}\).
To estimate \(q_{\text{feat}}\) and \(p_{\text{feat}}\), n-grams are extracted from each tokenized document. Each n-gram is mapped to a key in a hash table, and each key maps to that n-gram's count. For each of the \(N\) raw examples, the features \(z_i = h(x_i)\) are computed and the importance weight is \(w_i = \frac{\hat{p}_{\text{feat}}(z_i)}{\hat{q}_{\text{feat}}(z_i)}\).
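A minimal sketch of how the hashed n-gram features and importance weights might be computed; the bucket count, n-gram orders, and add-epsilon smoothing below are illustrative assumptions rather than the paper's exact settings (the overall scheme follows DSIR-style hashed n-gram importance sampling):

```python
import numpy as np

def ngram_features(tokens, n_max=2, num_buckets=10_000):
    """Hash the uni- and bi-grams of one tokenized document into a count vector z = h(x)."""
    z = np.zeros(num_buckets)
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            z[hash(tuple(tokens[i:i + n])) % num_buckets] += 1
    return z

def importance_weights(raw_feats, target_feats, eps=1e-8):
    """w_i = p_feat(z_i) / q_feat(z_i) under a bag-of-ngrams model over hash buckets."""
    # Estimate bucket probabilities for the target (p_feat) and raw (q_feat) data.
    p_hat = target_feats.sum(axis=0) + eps
    p_hat /= p_hat.sum()
    q_hat = raw_feats.sum(axis=0) + eps
    q_hat /= q_hat.sum()
    # log w_i = z_i . (log p_hat - log q_hat); the constant shift cancels after normalization.
    log_w = raw_feats @ (np.log(p_hat) - np.log(q_hat))
    return np.exp(log_w - log_w.max())
```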
Finally, \(k\) examples are sampled without replacement from the distribution given by \(\frac{w_i}{\sum_{i=1}^N w_i}\).
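Continuing the sketch above, the final selection step could use the Gumbel-top-\(k\) trick, which is equivalent to sequentially sampling without replacement with probabilities proportional to the weights; the paper does not prescribe this particular implementation:

```python
def sample_without_replacement(weights, k, seed=0):
    """Select k indices without replacement, with probability proportional to w_i."""
    rng = np.random.default_rng(seed)
    probs = weights / weights.sum()
    # Gumbel-top-k: perturb log-probabilities with Gumbel noise and take the top k.
    keys = np.log(probs) + rng.gumbel(size=len(probs))
    return np.argsort(-keys)[:k]

# selected = sample_without_replacement(importance_weights(raw_feats, target_feats), k=1000)
```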
Tokenizer adaptation
To derive the target vocabulary \(V(t)\), the Llama-3 tokenizer vocabulary \(V_{start}\) is used as the starting point, and \(V_{start}\) is merged with the vocabulary \(V_{task}\) learned from the task data \(D_{task}\). When building \(V_{task}\), multi-granularity tokens (i.e., words and multi-word combinations) are included; \(V_{task}\) and \(V_{start}\) are then merged into \(v(t-1)\).
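A rough illustration of building the multi-granularity task vocabulary and merging it with the starting vocabulary; the whitespace tokenization, the use of adjacent word pairs, and the frequency cutoff are simplifying assumptions for illustration only:

```python
from collections import Counter

def build_task_vocab(task_docs, max_size=5_000):
    """Collect multi-granularity tokens (words and adjacent word pairs) from D_task."""
    counts = Counter()
    for doc in task_docs:
        words = doc.split()
        counts.update(words)                                             # single words
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))  # multi-word combinations
    return {tok for tok, _ in counts.most_common(max_size)}

def merge_vocabularies(v_start, v_task):
    """v(t-1): union of the starting (e.g., Llama-3) tokenizer vocabulary and the task vocabulary."""
    return set(v_start) | set(v_task)
```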
Next, tokens are removed step by step from \(v(t-1)\) to obtain \(v(t)\), minimizing the distance to the original vocabulary so that less biased document features can be extracted as n-gram vectors.
A metric is first defined to measure the quality of a vocabulary on the corpus, and the optimal vocabulary is learned by maximizing this vocabulary utility metric \(\mathcal{H}_{v}\), formulated as:

\[\mathcal{H}_{v} = -\frac{1}{l_{v}} \sum_{j \in v} P(j) \log P(j)\]

where \(P(j)\) is the relative frequency of token \(j\) in the target data and \(l_{v}\) is the average length of the tokens in vocabulary \(v\). For any vocabulary, the entropy score \(\mathcal{H}_{v}\) is computed relative to the vocabulary of the previous step, so the optimization problem can be formulated as:

\[\max_{v(t),\, v(t-1)} \; \mathcal{H}_{v(t)} - \mathcal{H}_{v(t-1)}\]

where \(v(t)\) and \(v(t-1)\) are two token sets with upper bounds on their sizes \(|v(t)|\) and \(|v(t-1)|\). The sizes are set to \(|v(t)| = 10k\) with \(t = 10\), and \(|v(0)|\) is the default Llama-3 tokenizer vocabulary size.
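A sketch of the entropy score and one pruning step from \(v(t-1)\) to \(v(t)\); the greedy criterion below (dropping the lowest-frequency tokens, whose \(-P(j)\log P(j)\) contribution is smallest) is an illustrative reading of the procedure, not code from the paper:

```python
import math

def entropy_score(vocab, token_counts):
    """H_v = -(1 / l_v) * sum_j P(j) log P(j), with P(j) the relative frequency of
    token j in the target data and l_v the average token length in the vocabulary."""
    counts = [token_counts.get(t, 0) for t in vocab]
    total = sum(counts) or 1
    l_v = sum(len(t) for t in vocab) / max(len(vocab), 1)
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / l_v

def prune_step(vocab, token_counts, remove_k=1_000):
    """One step v(t-1) -> v(t): drop the remove_k tokens that contribute least to H_v."""
    ranked = sorted(vocab, key=lambda t: token_counts.get(t, 0))
    return set(vocab) - set(ranked[:remove_k])
```

Comparing `entropy_score` before and after each `prune_step` gives the entropy gain that the maximization above is meant to track.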
Main experiments