A Comprehensive Explanation of How LLM Artificial Intelligence Really Works (III)
Preface: It is only fitting that this section, as the finale of the whole article, ends with the most advanced AI models of today. Here we briefly introduce the fundamentals of the GPT model from OpenAI, the company that pushed AI-generated human language into the mainstream. If you too wish to contribute to the development of humanity and get into the AI industry, this is certainly an excellent starting point: everything covered so far is foundational, and understanding this model is a must. OpenAI's founding team included tech giant Elon Musk and Ilya Sutskever, a student of 2024 Nobel Prize winner Geoffrey Hinton; they count among the world's wealthiest, most intelligent, and most forward-thinking figures. OpenAI initially released the source code of its GPT-2 large language model (LLM), but stopped open-sourcing the models from GPT-3 onward, gradually departing from its initial commitment to openness and contributing to the departure of core members of the company. The model presented in this section is based on the GPT implementation (nanoGPT) by Andrej Karpathy, a founding member of OpenAI and former PhD student of Stanford professor Fei-Fei Li.
GPT Architecture
Next, let's talk about the GPT architecture. Most GPT models (although there are different variations) use this architecture. If you followed the article up to this point, this part should be relatively easy to understand. Using a block diagram representation, this is a high-level schematic of the GPT architecture:
At this point, we have discussed all the modules in detail, except for the "GPT Transformer block". The + sign here just means that the two vectors are added together (which means that both embeddings must be the same size). Let's take a look at this GPT Transformer block:
That's it. It's called a "Transformer" block because it comes from, and belongs to, the Transformer architecture - we'll learn more about that in the next section. The name doesn't really matter, because we've already discussed every module shown here. Let's review how we built up this GPT architecture so far:
- We learned that neural networks receive numbers and output other numbers, and that weights are trainable parameters
- We can interpret these input/output numbers to give real-world meaning to neural networks
- We can connect neural networks in series to create larger networks, and we can call each one a "block" and represent it as a box to simplify the illustration. The role of each block is to take in one set of numbers and output another set of numbers
- We learned about many different types of blocks, each with its own different purpose
- GPT is just a particular arrangement of these blocks, laid out as shown above, with generation working the way we discussed in Part 1
Various modifications have been made over time to make the modern LLM more powerful, but the basic principles remain the same.
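To make that arrangement concrete, here is a heavily simplified numpy sketch of how such blocks could be wired together. The block internals are stubbed out, and all names (`gpt_forward`, `transformer_block`) and sizes are illustrative only, not taken from any real implementation:

```python
import numpy as np

def transformer_block(x):
    # Stand-in for the "GPT Transformer block": self-attention and a feed-forward
    # network, each followed by an addition and a normalization (see the text).
    return x  # placeholder: a real block returns a same-shaped set of numbers

def gpt_forward(tokens, token_emb, pos_emb, unembed, n_blocks=4):
    # 1. Turn token ids into vectors and add the positional embeddings
    #    (the "+" from the diagram: both must have the same size).
    x = token_emb[tokens] + pos_emb[np.arange(len(tokens))]
    # 2. Pass the numbers through a stack of identical Transformer blocks.
    for _ in range(n_blocks):
        x = transformer_block(x)
    # 3. Map the last position's vector back to one score per vocabulary word.
    return x[-1] @ unembed

vocab, ctx, emb = 1000, 8, 16
scores = gpt_forward(np.array([1, 2, 3]),
                     token_emb=np.random.randn(vocab, emb),
                     pos_emb=np.random.randn(ctx, emb),
                     unembed=np.random.randn(emb, vocab))
print(scores.shape)  # (1000,): the highest-scoring word is the model's next-word prediction
```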
Now, this GPT Transformer was actually called a "decoder" in the original Transformer paper. Let's take a look at that.
Transformer Architecture
The Transformer is one of the key innovations behind the rapid increase in language-modeling capability. It not only improves prediction accuracy, it is also more efficient (and easier to train) than previous models, which allows much larger models to be built. This is the foundation of the GPT architecture.
Looking at the GPT architecture, you'll see that it's well suited for generating the next word in the sequence. It basically follows the logic we discussed in Part 1: start with a few words and generate words one by one. But what if you want to do a translation? Let's say you have a German sentence (e.g. "Wo wohnst du?" = "Where do you live?") that you want to translate into English. How do we train the model to accomplish this task?
As a first step, we need a way to feed in German words, which means the embeddings must be extended to cover both German and English. A simple approach would be to concatenate the German sentence with the English words generated so far and feed that in as the context. To make things easier for the model, we can add a separator between the two. Each step then looks like this:
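A minimal sketch of those steps, where `<SEP>` is just an illustrative name for the separator token:

```python
german = "Wo wohnst du ?".split()
sep = ["<SEP>"]            # hypothetical separator token
generated = []             # English words produced so far

# At every step the model sees the German sentence, the separator and the
# English words generated so far, and predicts the next English word.
for next_word in ["Where", "do", "you", "live", "?"]:
    context = german + sep + generated
    print(context, "->", next_word)
    generated.append(next_word)
```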
This could work, but there is still room for improvement:
- With a fixed context length, the original German sentence can sometimes be partly pushed out of the context as more words are generated
- The model has a lot to learn here: two languages at once, plus the fact that the separator marks where the translation should begin
- Each time a word is generated, the whole German sentence is reprocessed at a different offset in the context. This means the model sees different internal representations of the same sentence and must be able to translate from all of them
The Transformer was originally created for exactly this task and consists of an "encoder" and a "decoder" - essentially two separate modules. One module processes only the German sentence and produces an intermediate representation (still a collection of numbers) - this is called the encoder. The second module generates words (we have already seen several modules like this). The only difference is that, in addition to feeding the words generated so far into it, the encoder's output for the German sentence is fed in as extra input. That is, when generating language, its context is all the words generated so far plus the German sentence. This module is called the decoder.
The encoder and decoder each consist of a number of blocks, most notably the attention block sandwiched between other layers. Let's look at the diagram of the Transformer architecture from the "Attention Is All You Need" paper and try to understand it:
The set of vertical blocks on the left is called the "encoder" and the one on the right is called the "decoder". Let's understand each part one by one:
Feed-forward networks: a feed-forward network is a network without loops. The original network discussed in Part 1 is a feed-forward network, and this block uses a very similar structure. It contains two linear layers, each followed by a ReLU (see Part 1 on ReLUs) and a dropout layer. Keep in mind that this feed-forward network is applied to each position independently: it processes position 0, position 1, and so on, separately, and a neuron at position x is never connected to the feed-forward computation at position y. This matters because it keeps positions from mixing information here; positions exchange information only inside the attention blocks, which (in the decoder) are masked so that the network cannot "peek" at later positions during training.
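Here is a minimal numpy sketch of such a position-wise feed-forward block, following the description above. The names and sizes are illustrative; note that in the actual Transformer the same weights are shared across all positions - the key point is simply that positions do not mix here:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def dropout(x, p=0.1, training=True):
    # Randomly zero out a fraction p of the values during training.
    if not training:
        return x
    mask = rng.random(x.shape) > p
    return x * mask / (1.0 - p)

def feed_forward(x, W1, b1, W2, b2):
    # As described above: two linear layers, each followed by a ReLU and a dropout.
    h = dropout(relu(x @ W1 + b1))
    return dropout(relu(h @ W2 + b2))

emb, hidden, n_pos = 16, 64, 5
W1, b1 = rng.normal(size=(emb, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, emb)), np.zeros(emb)

x = rng.normal(size=(n_pos, emb))   # one vector per position
# The same weights are applied to every position independently:
# row i of the output depends only on row i of the input.
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)                    # (5, 16)
```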
Cross-attention: you'll notice that the decoder has a multi-head attention block with arrows coming from the encoder. What's happening here? Remember the value, key and query in self-attention and multi-head attention? They all came from the same sequence (and, in the simple scheme we used, the query was just the last word in the sequence). So what happens if we keep the query but take the value and key from a completely different sequence? That's exactly what happens here: the value and key come from the output of the encoder. The math doesn't change, only where the key and value come from.
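A minimal numpy sketch of the difference; the learned projection matrices and the masking are left out so the focus stays on where the key and value come from:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Same math in both cases: scores = Q.K^T / sqrt(d), softmax, weighted sum of V.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

d = 16
enc_out = np.random.randn(4, d)    # encoder output: one vector per German word
dec_x   = np.random.randn(3, d)    # decoder vectors: words generated so far

# Self-attention in the decoder: Q, K and V all come from the decoder sequence.
self_attended = attention(dec_x, dec_x, dec_x)

# Cross-attention: Q comes from the decoder, but K and V come from the encoder output.
cross_attended = attention(dec_x, enc_out, enc_out)
print(cross_attended.shape)        # (3, 16): one vector per decoder position
```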
Nx: the Nx means that the block is repeated N times. You are basically stacking blocks on top of each other, with the output of one block serving as the input to the next; this makes the neural network deeper. Looking at the diagram, it can be confusing how the encoder output is passed to the decoder. Suppose N=5: do we pass the output of each encoder layer to the corresponding decoder layer? No. You actually run the encoder only once and then provide the same representation to all 5 decoder layers.
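In sketch form, with the block internals stubbed out and all names illustrative:

```python
import numpy as np

def encoder_block(x):
    return x                     # placeholder: self-attention + feed-forward + add & norm

def decoder_block(x, enc_out):
    return x                     # placeholder: also cross-attends over enc_out

N = 5
enc_x = np.random.randn(4, 16)   # vectors for the German sentence
dec_x = np.random.randn(3, 16)   # vectors for the English words generated so far

# Run the encoder stack exactly once...
for _ in range(N):
    enc_x = encoder_block(enc_x)

# ...then hand the *same* final encoder output to every decoder layer.
for _ in range(N):
    dec_x = decoder_block(dec_x, enc_x)
```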
Add & normalize blocks: this is the same as below (the author seems to be just trying to save space).
We have already discussed everything else. You now have a complete understanding of the Transformer architecture, built up step by step from nothing but addition and multiplication. You know how to build every layer, every addition, every piece of the Transformer from scratch. If you are interested, take a look at the open-source repository karpathy/nanoGPT, which implements the GPT architecture described above from scratch.
Appendix
Matrix Multiplication
In the embedding section, we introduced the concepts of vectors and matrices. A matrix has two dimensions (a number of rows and a number of columns); a vector can be seen as a matrix with a single row or a single column. The product of an m×n matrix A and an n×p matrix B is an m×p matrix C, defined as:
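$$
c_{ij} = a_{i1} \cdot b_{1j} + a_{i2} \cdot b_{2j} + \cdots + a_{in} \cdot b_{nj}
$$

In words: entry (i, j) of C is the dot product of row i of A with column j of B, which is why A must have as many columns as B has rows.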
The dots indicate multiplication. Now let's look again at the computation for the blue and orange neurons in the first figure. If we write the weights as a matrix and the inputs as a vector, the whole operation can be represented as follows:
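The specific weight values from Part 1 are not repeated here, so the version below uses generic symbols, with three inputs and two middle-layer neurons purely for illustration:

$$
\begin{pmatrix} h_{\text{blue}} \\ h_{\text{orange}} \end{pmatrix}
=
\begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix} w_{11} x_1 + w_{12} x_2 + w_{13} x_3 \\ w_{21} x_1 + w_{22} x_2 + w_{23} x_3 \end{pmatrix}
$$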
If the weight matrix is called "W" and the input is called "x", then Wx is the result (in this case the middle layer). We can also transpose both to write xW - it's a matter of personal preference.
Standard Deviation
In the Layer Normalization section, we used the concept of standard deviation. The standard deviation is a statistic that describes how spread out a set of values is: if all values are the same, the standard deviation is zero; if the values lie far from their mean, the standard deviation is high. To compute the standard deviation of a set of numbers a1, a2, a3, ... (assuming there are N numbers): subtract the mean from each number, square each of the N results, add them all up, divide by N, and finally take the square root.
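Written as a formula, with $\mu$ denoting the mean of the numbers:

$$
\mu = \frac{1}{N}\sum_{i=1}^{N} a_i, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(a_i - \mu\right)^2}
$$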
Positional Encoding
We mentioned positional embedding above. The position encoding is the same length as the embedding vector, with the difference that it is not an embedding and does not require training. We assign a unique vector to each position. For example, position 1 is one vector, position 2 is another, and so on.
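One common fixed (untrained) choice is the sinusoidal encoding from the original Transformer paper; here is a minimal numpy sketch, assuming an even embedding size:

```python
import numpy as np

def positional_encoding(n_positions, emb_size):
    # Sinusoidal encoding from "Attention Is All You Need":
    # each position gets a fixed, unique vector; no training needed.
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(0, emb_size, 2)[None, :]       # even dimension indices
    angles = pos / (10000 ** (i / emb_size))
    pe = np.zeros((n_positions, emb_size))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(8, 16)
print(pe.shape)   # (8, 16): one unique vector per position, same length as the embedding
```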
(concluded)
You are welcome to discuss in the comments section; the author is also happy to explain any of the model's principles, implementation process, and details.