
The Origins of Large Language Models for Artificial Intelligence (II): From Universal Language Model Fine-Tuning to Harnessing LLMs


(5) "Universal Language Model Fine-tuning for Text Classification" by Howard and Ruder in 2018./abs/1801.06146

This paper is very interesting from a historical perspective. Although it was written a year after the release of the original Attention Is All You Need transformer, it does not deal with transformers but instead focuses on recurrent neural networks. It is still noteworthy, however, because it effectively demonstrates pre-training of language models and transfer learning for downstream tasks.

While transfer learning was already well established in computer vision, it was not yet prevalent in natural language processing (NLP). ULMFiT was one of the first papers to show that pre-training a language model and fine-tuning it leads to state-of-the-art results on many NLP tasks.

The three-stage process for fine-tuning a language model proposed by ULMFiT is as follows:

  1. Train a language model on a large-scale text corpus.

  2. Fine-tune this pre-trained language model on task-specific data, allowing it to adapt to the specific style and vocabulary of the text.

  3. Fine-tune a classifier on the task-specific data, gradually unfreezing the layers to avoid catastrophic forgetting.

This process - training a language model on a large-scale corpus and then fine-tuning it on downstream tasks - is the core methodology used by transformer-based foundation models such as BERT, GPT-2/3/4, RoBERTa, and others.

However, the gradual unfreezing that is a key part of ULMFiT is usually not performed in practice, especially with transformer architectures, where all layers are typically fine-tuned at once.
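To make the idea concrete, below is a minimal PyTorch sketch of gradual unfreezing; the tiny LSTM classifier, layer grouping, and dummy data are illustrative stand-ins, not the exact setup from the ULMFiT paper.

```python
# Minimal sketch of ULMFiT-style gradual unfreezing in PyTorch.
# The tiny model, layer grouping, and dummy data are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn1 = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.rnn2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        h = self.embedding(x)
        h, _ = self.rnn1(h)
        h, _ = self.rnn2(h)
        return self.head(h[:, -1])            # classify from the last time step

model = TinyClassifier()
# Layer groups ordered from the output (head) back toward the input (embedding).
groups = [model.head, model.rnn2, model.rnn1, model.embedding]

# Start with every parameter frozen, as if loaded from a pre-trained language model.
for p in model.parameters():
    p.requires_grad = False

x = torch.randint(0, 1000, (8, 20))           # dummy batch of token ids
y = torch.randint(0, 2, (8,))                 # dummy class labels

# Unfreeze one more layer group per stage, training briefly after each unfreeze.
for stage in range(len(groups)):
    for group in groups[: stage + 1]:
        for p in group.parameters():
            p.requires_grad = True
    optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
    loss = F.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Unfreezing from the output layer downward is what lets the task-specific head adapt first while the earlier, more general layers are only perturbed later, which is how the paper avoids catastrophic forgetting.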


Source: /abs/1801.06146

(6) "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin, Chang, Lee and Toutanova in 2018, /abs/1810.04805

Following the original transformer architecture, research on large-scale language modeling split into two directions: encoder-style transformers for predictive modeling tasks such as text classification, and decoder-style transformers for generative modeling tasks such as translation, summarization, and other forms of text generation.

The BERT paper above introduced the concepts of masked language modeling and next-sentence prediction, and it remains the most influential encoder-style architecture. If you are interested in this line of research, I recommend following up with RoBERTa, which simplifies the pre-training objective by removing the next-sentence prediction task.
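As a quick illustration of the masked-language-modeling objective, the sketch below uses the Hugging Face transformers library to fill in a masked token with a pre-trained BERT checkpoint; the checkpoint name and example sentence are arbitrary demo choices, and the snippet assumes the weights can be downloaded.

```python
# Illustration of the masked-language-modeling objective using the Hugging Face
# transformers library; the checkpoint and sentence are arbitrary examples.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, vocab_size)

# Locate the [MASK] position and take the highest-scoring token for it.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))         # typically "paris"
```

During pre-training, roughly 15% of tokens are masked this way and the model is trained to recover them, which is why BERT can attend to context on both sides of each position.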


Source: /abs/1810.04805

(7) "Improving Language Understanding by Generative Pre-Training" by Radford and Narasimhan in 2018, /paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035

The original GPT paper describes the popular decoder-style architecture and pre-training via next-word prediction. Whereas BERT can be viewed as a bidirectional transformer because its pre-training objective is masked language modeling, GPT is a unidirectional, autoregressive model. Although GPT embeddings can also be used for classification tasks, the GPT approach is at the heart of some of today's most influential large language models (LLMs), such as ChatGPT.
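To see what next-word (autoregressive) pre-training amounts to, here is a small sketch of the training loss: each position is trained to predict the token that follows it. The random logits are a placeholder standing in for the output of a causally masked decoder, so the snippet runs on its own.

```python
# Sketch of the next-word (autoregressive) pre-training objective: the target at
# each position is simply the next token in the sequence. Random logits stand in
# for the output of a causally masked decoder so the snippet is self-contained.
import torch
import torch.nn.functional as F

vocab_size = 1000
batch, seq_len = 4, 16
token_ids = torch.randint(0, vocab_size, (batch, seq_len))   # dummy token ids

# Placeholder for decoder outputs of shape (batch, seq_len, vocab_size).
logits = torch.randn(batch, seq_len, vocab_size)

# Shift by one: positions 0..n-2 predict tokens 1..n-1.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)
print(loss.item())
```

The causal mask ensures each position only sees earlier tokens, which is what makes the model unidirectional and lets it generate text one token at a time at inference.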

If you are interested in this line of research, I suggest reading the GPT-2 paper, /paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe, and the GPT-3 paper, /abs/2005.14165. These two papers demonstrate that LLMs are capable of zero-shot and few-shot learning and highlight the emergent capabilities of LLMs. GPT-3 also remains a popular baseline and base model for training the current generation of LLMs (e.g., ChatGPT); we will discuss the InstructGPT approach that led to ChatGPT later as a separate entry.

Source: /paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035

(8) "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" by Lewis, Liu, Goyal, Ghazvininejad, Mohamed, Levy, Stoyanov and Zettlemoyer in 2019, /abs/1910.13461

As mentioned earlier, BERT-type encoder-style LLMs are typically better suited for predictive modeling tasks, while GPT-type decoder-style LLMs are better at generating text. To get the best of both worlds, the BART paper above combines the encoder and decoder parts, much like the original transformer architecture (the second paper in this list).
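As a hands-on illustration of the encoder-decoder setup, the sketch below runs a pre-trained BART checkpoint from the Hugging Face transformers library on an arbitrary input sentence; the checkpoint choice and input text are demo assumptions, not taken from the paper.

```python
# Illustration of an encoder-decoder (BART-style) model using the Hugging Face
# transformers library; checkpoint and input sentence are demo choices only.
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = ("Encoder-style models such as BERT excel at predictive tasks, while "
           "decoder-style models such as GPT excel at text generation; BART "
           "combines an encoder and a decoder in one sequence-to-sequence model.")
inputs = tokenizer(article, return_tensors="pt")

# The encoder reads the full input; the decoder then generates output token by token.
summary_ids = model.generate(**inputs, max_length=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

The same split of roles appears during pre-training: the encoder consumes corrupted (noised) text, and the decoder is trained to reconstruct the original sequence autoregressively.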


Source: /abs/1910.13461

(9) "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond" by Yang, Jin, Tang, Han, Feng, Jiang, Yin and Hu in 2023, /abs/2304.13712

This is not a research paper, but it is probably the best architecture overview to date, showing how the different architectures evolved. In addition to discussing BERT-style masked language models (encoders) and GPT-style autoregressive language models (decoders), it also provides useful discussion and guidance on pre-training and fine-tuning data.

Evolutionary tree of modern LLMs, from /abs/2304.13712.