Previous: "A Comprehensive Explanation of How the LLM Model of Artificial Intelligence Really Works (I)"
Preface: In the previous post, we built a neural network that can, in principle, recognize "leaves" and "flowers", and explained in detail how it works: each input number is multiplied by a weight, the products are summed, a bias is added, and the output is produced through a nonlinearity and a statistical distribution. All of this uses only simple mathematics (multiplication, addition, and nonlinear functions). This section answers two questions: how are the network's weights and biases obtained, and, most crucially, how can a neural network be made to output ChatGPT-like sentences? For the network to learn suitable weights and biases, we must provide a large amount of training data (e.g., many "leaf" and "flower" pictures) so that it can adjust each neuron's weights and biases during learning and ultimately classify correctly. (Please take a moment to subscribe to the author!)
How do we train this neural network (model)?
In the example above, we preset the model with suitable weights and biases for testing purposes, so we could get accurate outputs. But in practice, how are the weights and biases obtained? The process of finding suitable weights and biases is called "training the model" or "training the neural network", and it can also be understood as the AI teaching itself; yes, this process is "training the AI". All humans need to do is provide the model with quality data to train on.
Suppose we have collected some data covering various kinds of "leaves" and "flowers". We then use a tool to convert their colors and volumes into numbers, and tag each data sample as a "leaf" or a "flower" (attaching these tags is called "labeling" the data). The resulting collection is our "training dataset".
Training a neural network works as follows:
- Initialize the weights
First, set each parameter/weight of the network to a random number. (Memory that is not initialized when the training program starts effectively contains random values, so often nothing needs to be set explicitly.)
- Input data and get initial output
We feed the network the numerical representation of a "leaf" (e.g., R=32, G=107, B=56, Vol=11.2) and expect the first output neuron's value to be larger than the second's, which would mean the "leaf" is recognized. Suppose the desired value for the "leaf" neuron is 0.8 and for the "flower" neuron 0.2.
- Calculate the loss
Because the initial weights are random, the actual output usually differs from what we expect. For example, if the two output neurons initially produce 0.6 and 0.4, we can compute the loss by squaring each difference and summing: (0.8 - 0.6)² + (0.2 - 0.4)² = 0.04 + 0.04 = 0.08. Ideally, we want the loss to be close to zero; this goal is called "minimizing the loss".
- Calculate the gradient and update the weights
Compute how each weight affects the loss (this is the "gradient"), which tells us in which direction to adjust each weight to reduce the loss. Each weight is then nudged slightly in the direction that lowers the loss. This update rule is called "gradient descent".
- Iterate
Repeat these steps over and over: the loss gradually decreases as the weights are updated, and the result is a set of "trained" weights or parameters. This is the training process of a neural network.
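The steps above can be sketched end to end in plain Python. Everything here is illustrative: a tiny linear model stands in for the network, the samples are made up, and the gradients are estimated by nudging each parameter (real frameworks compute them by formula).

```python
import random

# Toy training sketch: one linear "neuron" with two weights and a bias,
# trained by finite-difference gradient descent on made-up samples.
random.seed(0)
params = [random.random() for _ in range(3)]  # [w0, w1, bias], random start

def forward(p, x):
    return p[0] * x[0] + p[1] * x[1] + p[2]

# (inputs, expected output) pairs -- illustrative numbers only
samples = [((0.3, 1.1), 0.8), ((0.9, 0.2), 0.2)]

def loss(p):
    # average squared difference between actual and expected outputs
    return sum((forward(p, x) - y) ** 2 for x, y in samples) / len(samples)

lr, eps = 0.1, 1e-6
for step in range(500):
    grads = []
    for i in range(len(params)):
        nudged = params[:]
        nudged[i] += eps  # nudge one parameter to measure its effect on the loss
        grads.append((loss(nudged) - loss(params)) / eps)
    for i, g in enumerate(grads):
        params[i] -= lr * g  # step each parameter against its gradient

print(round(loss(params), 6))  # loss ends up close to zero
```

Running this, the loss shrinks from its random starting value toward zero, which is exactly the "minimize the loss by iterating" procedure described above.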
Supplementary notes
- Multiple training samples
Multiple samples are usually used in training. Fine-tuning the weights to minimize the loss on one sample may cause the loss on other samples to increase. To solve this, the average loss over all samples is computed and the weights are updated based on the gradient of that average. Each full pass over the samples is called an "epoch", and training for multiple epochs helps find better weights over time.
- Automatic gradient calculation
In fact, there is no need to manually nudge each weight to estimate the gradient; mathematical formulas (calculus) give the adjustment direction for each parameter directly. For example, if a weight is 0.17 and we want a neuron's output to increase, the formula might tell us that adjusting the weight toward 0.18 is the efficient move.
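As a tiny illustration of computing the direction by formula rather than by trial nudges (all numbers here are made up): for a one-weight model with loss L = (w·x − y)², calculus gives dL/dw = 2(wx − y)x directly.

```python
# Gradient by formula instead of trial-and-error nudging.
# For loss L = (w*x - y)^2, calculus gives dL/dw = 2*(w*x - y)*x.
x, y = 2.0, 1.0   # illustrative input and target
w = 0.17          # current weight

grad = 2 * (w * x - y) * x   # analytic gradient: 2*(0.34-1)*2 = -2.64
w_new = w - 0.01 * grad      # one gradient-descent step (learning rate 0.01)
print(round(w_new, 4))       # → 0.1964
```

The formula hands us the sign and size of the adjustment in one evaluation, with no need to perturb the weight and re-run the network.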
In practice, training a deep network is a complex process, and the gradients can go out of control: the gradient values may shrink toward zero or blow up toward infinity, problems known as "vanishing gradients" and "exploding gradients", respectively. And while the loss defined above works, in practice it is common to use a loss function better suited to the specific task to improve training results.
How do these principles help neural networks generate language?
Keep in mind that a neural network can only take one set of numbers as input, perform mathematical operations based on its trained parameters, and output another set of numbers. Everything hinges on how we interpret those numbers and how we adjust the parameters through training. If two numbers can be interpreted as "leaf/flower" or "sunny or rainy in an hour", they can just as well be interpreted as "the next character of a sentence".
However, there are far more than two possible characters, so we need to enlarge the output layer, e.g., to 26-plus neurons (the letters plus symbols such as space and period). Each neuron corresponds to one letter or symbol; we then find the output neuron with the largest value and emit its corresponding character. Now we have a network that can take input and produce characters as output.
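Picking the output character amounts to an argmax over the output neurons. A minimal sketch (the 28-symbol alphabet and the output values below are made up for illustration; a real model computes the values itself):

```python
import string

# Illustrative output vocabulary: 26 letters plus space and period.
alphabet = list(string.ascii_lowercase) + [" ", "."]

def decode_output(values):
    """Return the character whose output neuron has the largest value."""
    best = max(range(len(values)), key=lambda i: values[i])
    return alphabet[best]

# Pretend the network produced these values, with 'y' the largest.
outputs = [0.01] * len(alphabet)
outputs[alphabet.index("y")] = 0.9
print(decode_output(outputs))  # → y
```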
If we feed the network the string "Humpty Dumpt" and interpret its output character as "the next character predicted by the network", we can train the network so that it outputs the letter "y" whenever it receives "Humpty Dumpt" as input, achieving our desired result "Humpty Dumpty".
But here's the catch: how do you feed strings into the network? After all, neural networks only accept numbers! In general practice, we convert strings into arrays of numbers the network can process, using "one-hot encoding" or some other encoding method.
Here we use one of the simplest encodings: directly assign a number to each character, e.g., a=1, b=2, and so on. Now we can input "humpty dumpt" and train the network to output "y". The network works as follows:
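This a=1, b=2 encoding fits in a few lines (assigning 27 to the space character is an arbitrary choice for this sketch):

```python
def encode(text):
    """Map each character to a number: a=1, b=2, ..., z=26, space=27."""
    mapping = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}
    return [mapping[c] for c in text.lower()]

print(encode("humpty dumpt"))
# → [8, 21, 13, 16, 20, 25, 27, 4, 21, 13, 16, 20]
```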
First, feed a string into the input layer, and the network predicts the next character at the output layer. This lets us build complete sentences: once "y" is predicted, we append it to the end of the input string and feed the longer string back into the input layer to predict the following character. If trained properly, the network will next predict a space; and so on, eventually producing the complete sentence "Humpty Dumpty sat on a wall". With this, we have a generative AI (a language model), and the neural network can generate natural human language!
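The predict-append-repeat loop can be sketched like this. The network itself is replaced by a stand-in `predict_next` that simply looks up the target sentence, since the point here is the loop, not the model:

```python
TARGET = "humpty dumpty sat on a wall"

def predict_next(context):
    # Stand-in for a trained network: return the character that
    # follows `context` in the target sentence.
    return TARGET[len(context)]

text = "humpty dumpt"
while len(text) < len(TARGET):
    text += predict_next(text)  # append the prediction, feed it back in
print(text)  # → humpty dumpty sat on a wall
```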
Of course, real applications such as ChatGPT do not use this simple character-numbering scheme. We will introduce a more reasonable encoding method later; if you can't wait, see the "Encoding" section in the appendix.
The attentive reader may notice that we can't directly input "Humpty Dumpty", because, as shown in the figure, the input layer has only 12 neurons, one for each character (including the space) of "humpty dumpt"; there is no neuron left over for the "y". So how do we feed in the "y" at the next step? Adding a 13th input neuron would require rewiring the entire network, which is clearly impractical. The solution is simple: drop the oldest character, the "h", and keep only the 12 most recent ones. For example, when we input "umpty dumpty", the network predicts a space; we then input "mpty dumpty " and it outputs "s"; and so on. The process is shown below:
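The sliding window can be added to the earlier generation sketch: keep only the 12 most recent characters before each prediction. As before, a lookup stands in for the trained network, and the window length of 12 matches the example's input layer:

```python
CONTEXT_LEN = 12  # matches the 12 input neurons in the example
TARGET = "humpty dumpty sat on a wall"

def predict_next(window):
    # Stand-in for a trained network: find where this 12-character
    # window sits in the target sentence and return the next character.
    i = TARGET.index(window)
    return TARGET[i + len(window)]

text = "humpty dumpt"
while len(text) < len(TARGET):
    window = text[-CONTEXT_LEN:]   # drop the oldest characters
    text += predict_next(window)
print(text)  # → humpty dumpty sat on a wall
```

Each prediction sees only the fixed-size window, yet the full sentence still comes out; the cost, as noted below, is that anything older than the window is forgotten.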
One problem with this approach is that by the time we input "sat on the wal", much of the earlier information has been lost. So how do modern state-of-the-art networks handle this? The principle is basically the same. The input to a neural network has a fixed length (determined by the size of the input layer), called the "context length": the window of text the network can consult when predicting what comes next. Modern networks can have very long contexts, often tens or even hundreds of thousands of tokens; for example, OpenAI's GPT-4o supports 128,000 tokens and Anthropic's Claude supports 200,000. That means more than 100,000 input positions receiving user input. (Imagine how many neurons are involved in the computation when a model has hundreds of billions of parameters!) Longer context is very helpful for result quality. Although some methods allow input sequences of unbounded length, models with large fixed context lengths have outperformed them in practice.
The careful reader may also notice that we interpret the same letter differently on the input and output sides! For example, we input "h" as the number 8, but at the output layer we don't ask the model to emit the number 8 to mean "h"; instead, the model produces 26-odd values and we pick the letter corresponding to the largest one. If the 8th value is the largest, we read it as "h". Why not use the same representation at both ends? Because different interpretations of inputs and outputs leave more room to improve the model, and practice has shown that such separate representations work better for language generation. In fact, our numeric representation at the input is not optimal either; a better approach will be presented later.
The goal of this section was to lay out the core principles by which ChatGPT outputs natural human-language sentences. If you're interested but something is unclear, read it again or leave a message in the comments to discuss with the author; I will answer every comment without exception.
To be continued...