Paper: Target-Aware Language Modeling via Granular Data Sampling
- Paper address: /abs/2409.14705
Innovation points
- An algorithm is proposed that merges pre-trained tokens with multi-granularity tokens to produce efficient n-gram features that are highly correlated with downstream task performance.
- Building on this finding, the importance-based data sampling technique is improved by adapting the generic vocabulary to a target vocabulary. This yields a better representation of the data and improves model performance on the target tasks while maintaining good performance on non-target tasks.
Content overview
Language model pre-training usually targets a wide range of usage scenarios and combines data from many sources. However, models sometimes need to perform well in a specific domain without sacrificing performance elsewhere. This calls for data selection methods that identify candidate core data, and for effective ways to sample the selected data for training.
The paper uses multi-granularity token n-gram features for importance sampling, which strikes a good balance between sentence compression and representational power. The sampled data are highly correlated with performance on the target downstream tasks while remaining effective on other tasks, allowing the language model to be pre-trained more efficiently on the selected documents.
Across eight benchmarks, models pre-trained on about 1% of the data perform comparably to models trained on the full RefinedWeb dataset, and they outperform randomly selected samples at model sizes ranging from 125M to 1.5B parameters.
Methodology
Selecting samples from large-scale datasets such as RefinedWeb is slow and expensive. A practical solution is to encode each document as a vector of easily computable n-gram features.
Assume a small set of target text examples \(D_{task}\) drawn from a target distribution \(p\), and a large set of raw data \(D_{raw}\) with \(N\) examples drawn from a distribution \(q\). The goal is to select \(k\) examples (\(k \ll N\)) from the raw dataset that are similar to the target.
Importance sampling
Importance sampling selects examples aligned with the target distribution and provides a tractable importance estimate for each text. It is applied in a feature space \(\mathbb{Z}\) that provides the necessary structure.
A feature extractor \(h: \mathbb{X} \rightarrow \mathbb{Z}\) converts inputs into features, inducing a raw feature distribution \(q_{\text{feat}}\) and a target feature distribution \(p_{\text{feat}}\). The goal is to select data whose features align with the target feature distribution \(p_{\text{feat}}\).
To estimate \(q_{\text{feat}}\) and \(p_{\text{feat}}\), n-grams are extracted from each tokenized document. Each n-gram is mapped to a key in a hash table, and each key maps to that n-gram's count. For each of the \(N\) raw examples, the features \(z_i = h(x_i)\) are computed and the importance weight is \(w_i = \frac{\hat{p}_{\text{feat}}(z_i)}{\hat{q}_{\text{feat}}(z_i)}\).
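A minimal sketch of how the hashed n-gram features and importance weights might be computed; the bucket count, n-gram orders, and add-epsilon smoothing below are illustrative assumptions rather than the paper's exact settings (the overall scheme follows DSIR-style hashed n-gram importance sampling):

```python
import numpy as np

def ngram_features(tokens, n_max=2, num_buckets=10_000):
    """Hash the uni- and bi-grams of one tokenized document into a count vector z = h(x)."""
    z = np.zeros(num_buckets)
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            z[hash(tuple(tokens[i:i + n])) % num_buckets] += 1
    return z

def importance_weights(raw_feats, target_feats, eps=1e-8):
    """w_i = p_feat(z_i) / q_feat(z_i) under a bag-of-ngrams model over hash buckets."""
    # Estimate bucket probabilities for the target (p_feat) and raw (q_feat) data.
    p_hat = target_feats.sum(axis=0) + eps
    p_hat /= p_hat.sum()
    q_hat = raw_feats.sum(axis=0) + eps
    q_hat /= q_hat.sum()
    # log w_i = z_i . (log p_hat - log q_hat); the constant shift cancels after normalization.
    log_w = raw_feats @ (np.log(p_hat) - np.log(q_hat))
    return np.exp(log_w - log_w.max())
```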
Finally, \(k\) examples are sampled without replacement from the distribution given by \(\frac{w_i}{\sum_{i=1}^N w_i}\).
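Continuing the sketch above, the final selection step could use the Gumbel-top-\(k\) trick, which is equivalent to sequentially sampling without replacement with probabilities proportional to the weights; the paper does not prescribe this particular implementation:

```python
def sample_without_replacement(weights, k, seed=0):
    """Select k indices without replacement, with probability proportional to w_i."""
    rng = np.random.default_rng(seed)
    probs = weights / weights.sum()
    # Gumbel-top-k: perturb log-probabilities with Gumbel noise and take the top k.
    keys = np.log(probs) + rng.gumbel(size=len(probs))
    return np.argsort(-keys)[:k]

# selected = sample_without_replacement(importance_weights(raw_feats, target_feats), k=1000)
```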
Tokenizer adaptation
To derive the target vocabulary \(V(t)\), the Llama-3 tokenizer vocabulary \(V_{start}\) is used as the starting point, and \(V_{start}\) is merged with the vocabulary \(V_{task}\) learned from the task data \(D_{task}\). When building \(V_{task}\), multi-granularity tokens (i.e., words and multi-word combinations) are included; \(V_{task}\) and \(V_{start}\) are then merged into \(v(t-1)\).
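A rough illustration of building the multi-granularity task vocabulary and merging it with the starting vocabulary; the whitespace tokenization, the use of adjacent word pairs, and the frequency cutoff are simplifying assumptions for illustration only:

```python
from collections import Counter

def build_task_vocab(task_docs, max_size=5_000):
    """Collect multi-granularity tokens (words and adjacent word pairs) from D_task."""
    counts = Counter()
    for doc in task_docs:
        words = doc.split()
        counts.update(words)                                             # single words
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))  # multi-word combinations
    return {tok for tok, _ in counts.most_common(max_size)}

def merge_vocabularies(v_start, v_task):
    """v(t-1): union of the starting (e.g., Llama-3) tokenizer vocabulary and the task vocabulary."""
    return set(v_start) | set(v_task)
```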
Next, tokens are removed step by step from \(v(t-1)\) to obtain \(v(t)\), minimizing the distance to the original vocabulary so that less biased document features can be extracted as n-gram vectors.
A metric is first defined to measure the quality of a vocabulary on the corpus, and the optimal vocabulary is learned by maximizing this vocabulary utility metric \(\mathcal{H}_{v}\), formulated as:

\[\mathcal{H}_{v} = -\frac{1}{l_{v}} \sum_{j \in v} P(j) \log P(j)\]

where \(P(j)\) is the relative frequency of token \(j\) in the target data and \(l_{v}\) is the average length of the tokens in vocabulary \(v\). For any vocabulary, the entropy score \(\mathcal{H}_{v}\) is computed relative to the vocabulary of the previous step, so the optimization problem can be formulated as:

\[\max_{v(t),\, v(t-1)} \; \mathcal{H}_{v(t)} - \mathcal{H}_{v(t-1)}\]

where \(v(t)\) and \(v(t-1)\) are two token sets with upper bounds on their sizes \(|v(t)|\) and \(|v(t-1)|\). The sizes are set to \(|v(t)| = 10k\) with \(t = 10\), and \(|v(0)|\) is the default Llama-3 tokenizer vocabulary size.
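A sketch of the entropy score and one pruning step from \(v(t-1)\) to \(v(t)\); the greedy criterion below (dropping the lowest-frequency tokens, whose \(-P(j)\log P(j)\) contribution is smallest) is an illustrative reading of the procedure, not code from the paper:

```python
import math

def entropy_score(vocab, token_counts):
    """H_v = -(1 / l_v) * sum_j P(j) log P(j), with P(j) the relative frequency of
    token j in the target data and l_v the average token length in the vocabulary."""
    counts = [token_counts.get(t, 0) for t in vocab]
    total = sum(counts) or 1
    l_v = sum(len(t) for t in vocab) / max(len(vocab), 1)
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / l_v

def prune_step(vocab, token_counts, remove_k=1_000):
    """One step v(t-1) -> v(t): drop the remove_k tokens that contribute least to H_v."""
    ranked = sorted(vocab, key=lambda t: token_counts.get(t, 0))
    return set(vocab) - set(ranked[:remove_k])
```

Comparing `entropy_score` before and after each `prune_step` gives the entropy gain that the maximization above is meant to track.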
Main experiments