
A comprehensive explanation of how AI LLM models really work (III)


PREVIOUS: "A Comprehensive Explanation of How the LLM Model of Artificial Intelligence Really Works (II)

Preface: In the previous two sections, we presented the design diagram of a large language model and the implementation of a neural network capable of generating natural language. This is the prototype of a modern advanced AI language model. However, the language models currently on the market are far more complex than the one we designed. So what exactly is so complex about them? This section gives you a detailed overview of the key techniques that make these neural networks perform at, or even beyond, human level in specific domains. They are summarized in the following nine areas.


What is it about large language models that makes them so effective?

The earliest model we built generated 'Humpty Dumpty sat on a wall' character by character, which is a far cry from the functionality of current state-of-the-art large language models, but it captures their core principle. Through a series of innovations and improvements, generative AI has evolved from this simple form into a powerful tool for solving real-world problems, with bots capable of human-like conversation, AI customer service, virtual employees, and more. So where exactly have current state-of-the-art models made improvements? Let's break it down one by one.

Embedding

Remember when we mentioned that the way we input characters is not optimal? Earlier we randomly assigned a number to each character. If we could find better numbers, we might be able to train a better network. So how do we find these better numbers? Here is a clever way:

In the previous model training, we adjusted the weights and checked whether the loss decreased, repeating this over and over. At each step, we would:

- Input the data

- Compute the output layer

- Compare with the desired output and calculate the average loss

- Adjust the weights and start over

In this process, the input is fixed, which made sense when RGB and volume were the inputs. But now the numbers for the input characters a, b, c, etc. are arbitrarily chosen by us. Would it be possible to adjust not only the weights, but also the input representation in each iteration to see if using a different number for "a" would reduce the loss? This would indeed reduce the loss and make the model better (which is the direction we designed it to go). Basically, gradient descent is applied not only to the weights, but also to the numeric representation of the inputs, which are themselves arbitrarily chosen numbers. This is called "embedding". It is an input-to-number mapping that needs to be trained like a parameter. Once the embedding is trained, it can be reused in other models. Note that the same symbols/characters/words are always represented by the same embeddings.
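To make this concrete, here is a minimal sketch (in Python with numpy, not code from this article) of the idea: the per-character numbers are treated as trainable parameters and updated by gradient descent alongside the network weight. The tiny "network", the targets, and the learning rate are all made-up illustrative choices.

```python
# Minimal sketch (not the article's actual code): treat character
# embeddings as trainable parameters, updated by gradient descent
# together with the network weights. The tiny "network" here is just
# y = w * x, and the loss is squared error against a made-up target.
import numpy as np

rng = np.random.default_rng(0)
embedding = {c: rng.normal() for c in "abc"}   # one trainable number per character
w = rng.normal()                               # a single network weight
target = {"a": 1.0, "b": 2.0, "c": 3.0}        # made-up desired outputs
lr = 0.01

for step in range(500):
    for ch in "abc":
        x = embedding[ch]
        y = w * x                      # forward pass
        dy = 2 * (y - target[ch])      # d(loss)/dy for the squared-error loss
        grad_w = dy * x                # gradient w.r.t. the weight
        grad_x = dy * w                # gradient w.r.t. the character's embedding
        w -= lr * grad_w               # update the weight...
        embedding[ch] -= lr * grad_x   # ...and the input's embedding too

print(embedding, w)
```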

The embeddings we have discussed have only one number per character. However, in practice embeddings usually consist of multiple numbers because it is difficult to express the richness of a concept with a single number. Recalling our leaf and flower examples, each object has four numbers (the size of the input layer), which each express a property of the object and which the model can effectively use to recognize the object. If there is only one number, such as the red channel, the model may have a harder time judging it. It takes more than one number to capture the complexity of human language.

So, can we represent each character with multiple numbers to capture more richness? Let's assign each character a set of numbers, called a "vector" (the numbers are arranged in order, so swapping two positions gives a different vector; for example, in the leaf and flower data, swapping the red and green numbers changes the color and results in a different vector). The length of a vector is simply how many numbers it contains. We will assign a vector to each character. There are two issues here:

- How do you feed "humpty dumpt" into the network if you assign vectors instead of single numbers to each character? The answer is simple. Assuming we assign a vector of 10 numbers to each character, the 12 neurons in the input layer become 120 neurons, because each of the 12 characters in "humpty dumpt" now contributes 10 numbers. We just lay the neurons side by side.

- How do we find these vectors? Fortunately, we just learned about embedding training. Training embedding vectors is similar to training weights, except that now there are 120 inputs instead of 12, and the goal is still to minimize the loss. The first 10 numbers form the vector corresponding to "h", and so on.

All embedding vectors must be the same length, otherwise different character combinations could not be fed in consistently. For example, any two 12-character inputs must both map onto the same 120-neuron input layer, which only works if every character's vector has the same length. Let's look at a visualization of the embedding vectors:

Let us call this set of vectors of the same length a matrix. The matrix pictured above is called the embedding matrix. You give it a column number representing a character, and the corresponding column of the matrix is the vector for that character. This approach works for embedding any collection of things; you just need enough columns, one per thing.
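As a rough illustration (an assumption-laden sketch, not the article's code), here is what an embedding-matrix lookup could look like, with a 26-letter vocabulary, 10-dimensional embeddings, and an arbitrary but fixed character-to-column mapping:

```python
# Minimal sketch of an embedding-matrix lookup. Column i holds the vector
# for character i; the character-to-column assignment is arbitrary but fixed.
import numpy as np

embedding_dim, vocab_size = 10, 26
embedding_matrix = np.random.randn(embedding_dim, vocab_size)

def char_to_column(ch):
    return ord(ch) - ord('a')          # 'a' -> 0, 'b' -> 1, ... (0-indexed here)

vector_for_h = embedding_matrix[:, char_to_column('h')]
print(vector_for_h.shape)              # (10,)
```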

Tokenization

So far, we have used characters as the basic building blocks of language, but this approach has limitations. The weights of the neural network must do a lot of work to understand certain sequences of characters (i.e., words) and the relationships between them. What if we simply assigned embeddings to words and let the network predict the next word? The network only understands numbers anyway, so we could assign a 10-dimensional vector to each of the words "humpty", "dumpty", "sat", "on", and so on, then input two words and ask it to predict the next one. A "token" is the unit we embed: our model used characters as tokens before, and now we propose using whole words as tokens (in principle you could even use whole sentences or phrases as tokens).

Using word-level tokens has profound implications for the model. There are over 180,000 words in the English language, so if each possible output is represented by a single neuron, the output layer would need hundreds of thousands of neurons instead of about 26. With the hidden layer sizes of modern networks, this problem becomes less intractable. Note that since each word is processed independently and the initial embeddings are random numbers, the initial representations of similar words (e.g., "cat" and "cats") are unrelated. We can expect the model to learn the similarity between the two words, but can this apparent similarity be exploited to simplify learning?

It can. The most common embedding scheme in today's language models is to split words into subwords and embed those. For example, we split "cats" into two tokens, "cat" and "s". This makes it easier for the model to understand the meaning of an "s" following other words, and it also keeps the vocabulary small (sentencepiece is a commonly used tokenizer, with vocabulary sizes in the tens of thousands rather than the hundreds of thousands of words in English). The tokenizer splits the input text (e.g., "Humpty Dumpt") into tokens and returns the corresponding numbers, which are used to look up each token's vector in the embedding matrix. For example, "humpty dumpt" is split into the character array ['h','u', ... 't'], and the tokenizer returns the corresponding numbers [8,21,...,20], because you look up column 8 of the embedding matrix to get the embedding vector for 'h' (it is the embedding vector that is fed into the model, not the number 8 itself, unlike our earlier setup). The arrangement of the matrix columns is irrelevant; assigning any column to 'h' is fine, as long as you look up the same vector every time 'h' appears. The tokenizer gives us an arbitrary (but fixed) number to look up; what we really need from it is to slice the sentence into tokens.
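Here is a minimal sketch of this pipeline under the same assumptions as before (character tokens, 10-dimensional embeddings). It is not sentencepiece, just an illustration of tokenize, look up, and concatenate:

```python
# Minimal sketch: split "humpty dumpt" into character tokens, map each token
# to an arbitrary-but-fixed column index, look up its embedding, and
# concatenate everything into one long input vector. 12 characters with
# 10-dimensional embeddings give the input layer 120 numbers.
import numpy as np

text = "humpty dumpt"
vocab = sorted(set("abcdefghijklmnopqrstuvwxyz "))      # 27 tokens incl. space
token_to_id = {tok: i for i, tok in enumerate(vocab)}   # fixed lookup table

embedding_dim = 10
embedding_matrix = np.random.randn(embedding_dim, len(vocab))

token_ids = [token_to_id[ch] for ch in text]                      # tokenizer output
input_vector = np.concatenate([embedding_matrix[:, i] for i in token_ids])
print(len(token_ids), input_vector.shape)                         # 12 (120,)
```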

Using embeddings and subword tokenization, the model might look like this:

The next few sections cover recent advances in language modeling, which are what make LLMs so powerful. However, there are a few basic mathematical concepts you need before understanding them. These concepts are summarized below:

- Matrices and matrix multiplication

- Basic Concepts of Functions in Mathematics

- Powers of numbers (e.g., a³ = a·a·a)

- Sample mean, variance and standard deviation

A summary of these concepts can be found in the appendix.

Self-attention mechanism

So far, we have only discussed a simple neural network structure (called a feedforward network), which consists of a number of layers, each fully connected to the next (i.e., there is a line between any two neurons in neighboring layers) and connected only to the next layer (e.g., no connecting lines between layer 1 and layer 3). However, as you can imagine, nothing prevents us from removing or adding other connections, or even building more complex structures. Let's explore one particularly important structure: the self-attention mechanism.

Observing the structure of human language, the next word we want to predict usually depends on all the preceding words. However, it may depend more heavily on some of them than on others. For example, suppose we want to predict the blank in "Damian had a secret child, a girl, and he wrote in his will that all his property, along with the magic ball, would go to ____". The word here could be "she" or "he", and it depends on a word much earlier in the sentence: girl/boy.

The good news is that our simple feedforward model connects to all the words in the context, so it can learn appropriate weights for the important words. The problem, however, is that the weights connecting a given position through the feedforward layers are fixed (per position). If the important word always appeared in the same position, the model could learn the appropriate weights and everything would be fine. However, the relevant word for the next prediction can appear anywhere in the sentence. We could rewrite the sentence above so that, when guessing "she or he", the word boy/girl is decisive no matter where it appears. So we need weights that depend not only on the position, but also on the content at that position. How do we accomplish this?

The self-attention mechanism does something like adding up the embedding vectors of the words, but instead of adding them directly, it applies a weight to each one. For example, if the embedding vectors for humpty, dumpty, and sat are x1, x2, and x3, respectively, it multiplies each vector by a weight (a number) before adding them: output = 0.5 * x1 + 0.25 * x2 + 0.25 * x3, where output is the output of self-attention. If we write the weights as u1, u2, and u3, then output = u1 * x1 + u2 * x2 + u3 * x3. So how are these weights u1, u2, and u3 obtained?

Ideally, we'd like these weights to depend on the vectors we're adding - as mentioned earlier, some words may matter more than others. But matter more to what? To the word we are about to predict. So we also want the weights to depend on the word we are about to predict. There is a problem, though: we obviously do not know that word until we have predicted it. So the self-attention mechanism uses the word immediately preceding the one we're about to predict, i.e., the last word currently available in the sentence (I'm not sure why this rather than some other word, but much of deep learning is arrived at through trial and error, and I'm guessing this simply works well).

So we want weights for these vectors, and we want each weight to depend on the word being aggregated and on the word preceding the one to be predicted. Basically, we want a function u1 = F(x1, x3), where x1 is the word we want to weight and x3 is the last word in our existing sequence (assuming we have only 3 words). A straightforward way to implement this is to give x1 a vector (call it k1) and x3 a separate vector (call it q3), and take their dot product. This yields a number, and it depends on both x1 and x3. So how are k1 and q3 obtained? We can build a tiny single-layer neural network that maps x1 to k1 (x2 to k2, x3 to k3, etc.), and another network that maps x3 to q3, and so on. In matrix form, we essentially get weight matrices Wk and Wq such that k1 = Wk * x1, q1 = Wq * x1, and so on. Now we can take the dot product of k1 and q3 to get a scalar: u1 = F(x1, x3) = (Wk * x1) · (Wq * x3).

In the self-attention mechanism, there is an additional step where instead of taking the weighted sum of the embedding vectors directly, we take the weighted sum of some "value" of the embedding vectors, which is obtained through another small single-layer network. This means that similar to k1 and q1, we also need a v1 for the word x1, which is obtained from the matrix Wv, i.e. v1 = Wv * x1. These v1's are then aggregated. So if we only have 3 words and are trying to predict a fourth, the whole thing looks like this:

The plus sign in the figure indicates a simple summation of vectors, meaning that they must have the same length. The last modification not shown is that the scalars u1, u2, u3, etc. do not necessarily sum to 1. If we need them as weights, we should let them sum to 1. So here we will apply the familiar trick of using the softmax function.
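Putting the pieces together, here is a minimal sketch of this self-attention step (illustrative shapes, query taken from the last available word, softmax applied to the u values as just described):

```python
# Minimal sketch of self-attention for 3 words with 10-dimensional
# embeddings x1, x2, x3; the query comes from the last word (x3).
# Wk, Wq, Wv are the block's only parameters. Shapes are illustrative.
import numpy as np

d = 10
x = np.random.randn(3, d)              # x1, x2, x3 stacked as rows
Wk, Wq, Wv = (np.random.randn(d, d) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

q3 = Wq @ x[2]                         # query from the last word
keys = x @ Wk.T                        # k1, k2, k3
values = x @ Wv.T                      # v1, v2, v3

u = keys @ q3                          # u_i = k_i . q3
weights = softmax(u)                   # make the u's sum to 1
output = weights @ values              # weighted sum of the values
print(weights, output.shape)
```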

This is self-attention. There is also a kind of cross-attention, where one can have q3 come from the last word, but k and v can come from completely different sentences. This is valuable in translation tasks, for example. Now we have understood what the attention mechanism is.

We can encapsulate this whole thing into a "self-attention block". Basically, this block takes the embedding vector and outputs a single vector of any user-selected length. This block has three parameters, Wk, Wq, and Wv - it doesn't need to be any more complex. There are many such blocks in the machine learning literature, usually represented in a diagram by a box labeled with their name. Something like this:

You'll notice in self-attention that the order of the words doesn't seem to matter much. We use the same W matrices for every word, so swapping Humpty and Dumpty makes no material difference: the values get swapped accordingly and the output is essentially unchanged. This means that while attention can figure out what to attend to, it does not depend on word position. However, we know that word position matters in English, and we can improve performance by giving the model some positional information.

Therefore, when using the attention mechanism, we usually do not feed the embedding vectors directly into the attention block. Later we will see how "positional encoding" adds position information to the embedding vectors before they enter the attention block.

Note: For those of you who already know about self-attention, you may notice that we haven't mentioned any K and Q matrices or applied masks, etc. This is because those are implementation details of how models are commonly trained: data is fed in batches and the model is trained simultaneously to predict dumpty from humpty, sat from humpty dumpty, and so on. This is purely for efficiency and does not affect comprehension or model output, so we chose to skip these training-efficiency optimizations.

Softmax

We briefly mentioned softmax at the very beginning. This is the problem that softmax tries to solve: in the output layer we have as many neurons as there are possible options, and we said we would choose the neuron with the highest value as the network's output. We then calculate the loss as the difference between the value the network provides and the ideal value we expect. But what is that ideal value? In the leaf/flower example, we set it to 0.8. But why 0.8? Why not 5, 10, or 10 million? In theory, the higher the better! Ideally, we'd want infinity! But that would make the problem unsolvable: all losses would be infinite, and our plan of minimizing the loss by tweaking parameters (remember "gradient descent") would break down. What can we do?

A simple fix is to restrict the ideal value to some range, say between 0 and 1. That makes all losses finite, but a new problem arises: what if the network outputs values outside this range? Say in one example it outputs (5, 1) for (leaf, flower) and in another it outputs (0, 1). The first example makes the right choice, yet its loss is higher! So we need a way to convert the output-layer values into the range (0, 1) while preserving their order. We can use any mathematical "function" to achieve this (a "function" is just a rule that maps one number to another: you put a number in and get a number out). One viable option is the logistic function (shown below), which maps every number into (0, 1) while keeping the order unchanged:

Now that each neuron in the output layer has a value between 0 and 1, we can calculate the loss by setting the correct neuron to 1 and the others to 0. This allows us to compare the difference between the output of the network and the ideal value. This works, but could it be better?

Going back to our "Humpty Dumpty" example, suppose we are generating "dumpty" character by character and the model makes a mistake on "m": the highest value in the output layer is "u", with "m" a close second. If we continue with "duu" and ask the model to predict the next character, its confidence will be low, because very little is likely to follow "humpty duu...". However, since "m" was the second-highest value, we could give "m" a chance to predict the next character instead and see what happens. It might give us a more reasonable word.

So, what we are talking about here is not blindly choosing the maximum, but trying a few possibilities. How do we do that? We'd have to give each possibility a probability - for example, picking the highest with a probability of 50%, the next highest with 25%, and so on, which is pretty good. But perhaps we also want the probabilities to be associated with the model's predictions. If the model's predictions for m and u are fairly close (relative to other values), then a 50-50 chance might be good.

We need a nice rule that converts these values into probabilities. Softmax does the job. It is a generalization of the logistic function above, but with some added properties: if you give it 10 arbitrary numbers, it returns 10 outputs, each between 0 and 1 and summing to 1, so we can interpret them as probabilities. You'll notice that softmax appears as the last layer in almost every language model.
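A minimal sketch of softmax, using the standard formula exp(z_i) / sum_j exp(z_j):

```python
# Minimal sketch of softmax: turn arbitrary output-layer values into
# numbers between 0 and 1 that sum to 1, preserving their order.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max keeps exp() from overflowing
    return e / e.sum()

logits = np.array([5.0, 1.0, -2.0])    # e.g. raw scores for 3 possible characters
probs = softmax(logits)
print(probs, probs.sum())              # largest score gets the largest probability; sums to 1
```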

Residual connections

As the chapter progresses, we increasingly use boxes/modules to represent concepts in the network. This representation is particularly handy for describing a useful concept called the "residual connection". Let's look at residual connections in combination with self-attention blocks:

Note that we have boxed the "Inputs" and "Outputs" to simplify the content, but they are still basically just a collection of numbers or neurons, similar to the one shown above.

What is happening here? We are simply adding the output of the self-attention block to the original input before passing it on to the next block. The first thing to note is that this requires the output of the self-attention block to have the same dimension as its input. That isn't a problem, because the output dimension of the self-attention block is chosen by the user. But why is this necessary? Without going into all the details, the key point is that as networks get deeper (more layers between input and output), training becomes significantly harder. It has been shown that residual connections help alleviate these training difficulties.
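As a sketch, a residual connection is just one addition; the block below is a stand-in for self-attention or any other block:

```python
# Minimal sketch of a residual connection: the block's output is added back
# onto its input, which requires both to have the same dimension.
import numpy as np

d = 10
W = np.random.randn(d, d)

def some_block(x):
    return np.tanh(W @ x)              # placeholder for a self-attention block

x = np.random.randn(d)
out = x + some_block(x)                # residual connection: input + block output
print(out.shape)
```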

Layer normalization

Layer normalization is a relatively simple layer that normalizes the incoming data by subtracting the mean and dividing by the standard deviation (plus a little more, as described below). For example, if we apply layer normalization right after the input, it computes the mean and standard deviation over all the neurons in the input layer. If the mean is M and the standard deviation is S, layer normalization replaces each neuron's value x with (x - M) / S.

So how does this help? It essentially stabilizes the input vectors and helps with training deep networks. One concern: by normalizing the inputs, do we lose useful information that might help with the target? To address this, the layer normalization layer has a "scale" and a "bias" parameter. Basically, for each neuron, you multiply by a scale value and then add a bias. The scale and bias values are trainable parameters, which lets the network learn any variation that might be valuable for prediction. Since these are the only parameters, the layer normalization block does not add many parameters to train. The whole process looks something like this:

Scale and bias are trainable parameters. As you can see, layer normalization is a relatively simple block, and its operations are mostly performed point by point (after the initial mean and standard deviation computation). It is reminiscent of activation layers (e.g., ReLU); the only difference is that here we have a few trainable parameters (far fewer than in other layers, since it is a simple point-by-point operation).

The standard deviation is a statistical measure of how spread out a set of values is: if all the values are the same, the standard deviation is zero, and if the values are far from their mean, the standard deviation is high. To compute the standard deviation of numbers a1, a2, a3, ... (N numbers in total): subtract the mean from each number and square the result, add up these N squares, divide by N, and finally take the square root.
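Here is a minimal sketch of layer normalization with trainable scale and bias (the small eps is a common practical detail added to avoid division by zero; it is not part of the description above):

```python
# Minimal sketch of layer normalization: subtract the mean, divide by the
# standard deviation, then apply a trainable per-neuron scale and bias.
import numpy as np

def layer_norm(x, scale, bias, eps=1e-5):
    mean = x.mean()
    std = np.sqrt(((x - mean) ** 2).mean())   # the standard deviation described above
    return scale * (x - mean) / (std + eps) + bias

x = np.array([2.0, 4.0, 6.0, 8.0])
scale = np.ones_like(x)      # trainable parameters, initialised to "do nothing"
bias = np.zeros_like(x)
print(layer_norm(x, scale, bias))   # result has mean ~0 and standard deviation ~1
```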

Dropout

Dropout is a simple but effective way to prevent model overfitting. Overfitting is when a model works well on training data, but does not generalize well to examples that the model has not seen before. Techniques that help avoid overfitting are called "regularization techniques" and Dropout is one of them.

If you train a model, it may make errors on the data, or overfit in some way. If you train another model, it may also make errors, but in different ways. What if you trained multiple models and averaged their outputs? Such a combination is usually called an "ensemble", because it makes predictions by combining the outputs of several models, and ensembles usually perform better than any single model.

In neural networks, you can do the same thing: build several (slightly different) models and combine their outputs to get a better model. However, this is computationally expensive. Dropout is a technique that doesn't actually build an ensemble, yet captures some of its essence.

The concept is simple: by inserting a dropout layer during training, you randomly remove a certain percentage of neuron connections. Using our initial network as an example, inserting a 50% dropout layer between the input and middle layers might look something like this:



This forces the network to train with a lot of redundancy. In essence, you are training many different models at once, but they share weights.

At inference time, we could do something like ensembling: make several predictions with dropout applied and then combine the results. However, this is computationally expensive, and since the models share weights, why not simply use all the weights for prediction (rather than only 50% of them each time)? This should approximate the effect of the ensemble.

There is one problem, though: a model trained with 50% of the weights will have very different intermediate neuron values than a model using all the weights. What we want is something closer to the ensemble's average behaviour. How do we achieve that? A simple way is to multiply all the weights by 0.5, since we are now using twice as many weights. This is exactly what dropout does during inference: it uses all the weights of the full network and multiplies them by (1 - p), where p is the probability of dropping. Studies have shown that this works quite well as a regularization technique.
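A minimal sketch of dropout as described above: zero out values randomly during training, and scale by (1 - p) at inference. (Many modern libraries instead rescale during training, which has an equivalent effect; the version below follows the description in this section.)

```python
# Minimal sketch of dropout. During training, randomly zero out a fraction p
# of the values; at inference, keep everything but scale by (1 - p).
import numpy as np

p = 0.5                                   # dropout probability
rng = np.random.default_rng(0)

def dropout(x, training):
    if training:
        mask = rng.random(x.shape) >= p   # keep each value with probability 1 - p
        return x * mask
    return x * (1 - p)                    # inference: use all values, scaled down

x = np.ones(8)
print(dropout(x, training=True))          # roughly half the values zeroed
print(dropout(x, training=False))         # all values, multiplied by 0.5
```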

Multi-head attention

This is a key module in the Transformer architecture. We've already learned what the attention block is. Remember that the length of an attention block's output is chosen by the user, namely the length of v. Multi-head attention runs multiple attention heads in parallel (all taking the same input) and simply concatenates their outputs, which looks like this:

Note that the arrows from v1 -> v1h1 represent linear layers: there is a matrix on each arrow that performs the transformation. I haven't drawn them all, to avoid cluttering the diagram.

What happens here is that the same keys, queries, and values are generated for each head, but then a linear transformation is applied on top of them (separately for each of k, q, v, and separately for each head) before those k, q, v values are used. This extra layer does not exist in plain self-attention.
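Here is a minimal sketch of multi-head attention as just described: shared k, q, v, one extra per-head linear transformation, and concatenation of the head outputs (all shapes and the number of heads are illustrative choices, not values from the article):

```python
# Minimal sketch of multi-head attention: the shared k, q, v are computed
# once, each head applies its own extra linear transform (the arrows
# v1 -> v1h1 etc.), runs attention, and the head outputs are concatenated.
import numpy as np

d, n_words, n_heads = 10, 3, 2
x = np.random.randn(n_words, d)
Wk, Wq, Wv = (np.random.randn(d, d) for _ in range(3))
heads = [{r: np.random.randn(d, d) for r in "kqv"} for _ in range(n_heads)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

keys, queries, values = x @ Wk.T, x @ Wq.T, x @ Wv.T   # shared across heads

outputs = []
for h in heads:
    k = keys @ h["k"].T                # per-head transform of the keys
    q = (queries @ h["q"].T)[-1]       # query from the last word, as before
    v = values @ h["v"].T
    w = softmax(k @ q)
    outputs.append(w @ v)

print(np.concatenate(outputs).shape)   # (n_heads * d,): concatenated head outputs
```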

On a side note, I think this method of creating multiple attention heads is a bit peculiar. For example, why add a new layer and share those weights, instead of creating separate Wk, Wq, and Wv matrices for each head? If you know why, let me know - I haven't really figured it out.

Positional encoding and embeddings

In the self-attention section, we briefly discussed the motivation for using positional encoding. So what is it? Although the figure shows positional encoding, it is now more common to use positional embeddings than positional encoding. We therefore discuss the common "positional embedding" here; the appendix also covers the "positional encoding" used in the original paper. A positional embedding is no different from any other embedding, except that instead of embedding words from a vocabulary, it embeds the position numbers 1, 2, 3, and so on. So this embedding is a matrix with the same embedding length as the word-embedding matrix, with each column corresponding to a position. It is that simple.
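A minimal sketch of a positional embedding: a second embedding matrix whose columns correspond to positions rather than vocabulary tokens. Adding it to the word embedding is one common way to combine the two, assumed here for illustration:

```python
# Minimal sketch of positional embeddings: a second embedding matrix whose
# columns correspond to positions 0, 1, 2, ... Here it is simply added to
# the word embeddings (one common choice, assumed for illustration).
import numpy as np

embedding_dim, vocab_size, max_positions = 10, 26, 100
word_embeddings = np.random.randn(embedding_dim, vocab_size)
position_embeddings = np.random.randn(embedding_dim, max_positions)

token_ids = [7, 20, 12]    # e.g. 'h', 'u', 'm' under some fixed 0-indexed mapping
vectors = [word_embeddings[:, t] + position_embeddings[:, pos]
           for pos, t in enumerate(token_ids)]
print(np.stack(vectors).shape)                      # (3, 10)
```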

To be continued. In the next section, the final part of this post, we will briefly cover the current state-of-the-art language model, GPT, and its architecture, and share fully open-sourced AI model code by a former OpenAI engineer...