A comprehensive explanation of how AI LLM models really work (I)
#Artificial Intelligence #Large Language Models (LLM) #Machine Learning (ML) #Deep Learning #Data Mining
Preface: To help more people understand how a Large Language Model (LLM) really works, we will explain it in small subsections, starting from scratch and assuming no prior knowledge beyond middle-school math (just addition and multiplication). This article contains all the knowledge and concepts needed to understand LLMs and is completely self-contained, with no reliance on external sources. We will first build a simple generative language model on paper, then dissect each step in detail to help you master the modern AI language model and the Transformer architecture. The text strips away complex terminology and machine-learning jargon, reducing everything to plain multiplication and addition. We have not discarded the details, though: relevant terms are pointed out at appropriate places so that you can connect them to more specialized material.
Going from knowing only how to add and multiply to understanding today's state-of-the-art AI models means covering a lot of ground. This is not a toy explanation of LLMs: someone determined enough could, in principle, rebuild a modern LLM from it. I have cut out every superfluous word and phrase, so the piece does not lend itself to quick skimming.
This article will cover the following 15 topics:
- A simple neural network
- How are these neural network models trained?
- How does the model generate output language?
- What makes the LLM model work so well?
- Embeddings
- Tokenization
- Self-attention
- Softmax
- Residual connections
- Layer normalization
- Dropout
- Multi-head attention
- Positional embeddings
- The GPT architecture
- The Transformer architecture
Here we go.
The first thing to note is that neural networks can only take numbers as input and produce numbers as output, without exception. The key to the magic lies in how everything the user provides (text, images, video, sound) is converted into numbers, and in how the numbers the network outputs are interpreted for the task at hand. We then build the network itself so that it takes the inputs you provide and gives you the outputs you want (under whatever decoding of the outputs you have chosen). Let's see how we can get from the basic arithmetic of addition and multiplication all the way to the capabilities of an AI language model like Llama 3.1.
Constructing a simple neural network:
Let's start by building a simple neural network that can classify objects. Here is the task that defines the network we are going to design:
- The objects to be recognized are described by their color (RGB) and volume (ml).
- The network must accurately distinguish whether an object is a "leaf" or a "flower."
Here is an example of using numbers to represent a "leaf" and a "flower":
The leaf's color is the RGB triple (32, 107, 56) and its volume is 11.2 ml. The flower's color is the RGB triple (241, 200, 4) and its volume is 59.5 ml. This data is used to train the neural network to recognize leaves and flowers from their color and volume, as in the snippet below.
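In code, each object is nothing more than four numbers; the plain lists below are just one convenient way to hold them (the network only ever sees the four values):

```python
# Each object is four numbers: R, G, B, and volume in ml.
leaf   = [32, 107, 56, 11.2]   # dark green, small volume
flower = [241, 200, 4, 59.5]   # yellow, larger volume
```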
Let's now build a neural network to accomplish this classification task. We start by deciding the format of the inputs and outputs and how the outputs will be interpreted. The leaf and the flower above are already represented by numbers, so they can be passed directly to the neurons of the network. Since the network can only output numbers, we also need to decide what the output numbers mean, i.e., which numbers correspond to "leaf" and which to "flower", because the network cannot directly tell us the classification result by outputting the words themselves. So we need to define an interpretation scheme that maps the output numbers to object categories:
- If the network has only one output neuron, the sign of its output can represent the recognized category: when the neuron outputs a positive number, the network recognizes the object as a "leaf"; when it outputs a negative number, it recognizes a "flower".
- Alternatively, the network can have two output neurons, one per category: say the first represents "leaf" and the second represents "flower". If the first neuron outputs a larger number than the second, the network recognizes the current object as a "leaf"; if the first outputs a smaller number, the network recognizes a "flower".
Both schemes let the network recognize whether an object is a "leaf" or a "flower". In this article we choose the second, because its structure adapts more easily to what we explain later. Below is the neural network designed according to the second scheme; let's analyze it in detail:
The circles in the figure are the network's neurons, and each vertical column of circles is one layer. All the data enters at the first layer and then flows layer by layer through multiplications and additions, passing through the hidden layer (three neurons) and finally arriving at the output layer (two neurons), where we predict the current object from the values of the last layer's two neurons. Note the arrows and the numbers in the figure, and the multiply-and-add relationship between them.
The calculation in the blue circle is as follows: (32 * 0.10) + (107 * -0.29) + (56 * -0.07) + (11.2 * 0.46) = -26.6
Some jargon (technical terms):
- Neurons/nodes: the circles containing numbers
- Weights: the numbers written on the arrows
- Layers: a group (column) of neurons is called a layer. This network can be thought of as having three layers: an input layer of 4 neurons, a middle (hidden) layer of 3 neurons, and an output layer of 2 neurons.
To calculate the network's prediction/output (called "forward propagation"), start from the left. Fill the neurons in the first layer with the numbers representing the leaf. To advance to the next layer, multiply the numbers in the circles by the weights on the corresponding arrows and add the products. We demonstrated this computation for the blue and orange circles. After running the entire network, the first number in the output layer comes out larger, so we interpret this as "the network classifies these (RGB, Vol) values as a leaf". A well-trained network can take a variety of (RGB, Vol) inputs and classify the objects correctly, as in the sketch below.
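To make the forward pass concrete, here is a minimal Python sketch of the 4-3-2 network described above. Only the first hidden neuron's weights (the blue circle) are given in the text; every other weight below is a made-up placeholder chosen just so that this one input lands on "leaf", and the activation layer and biases discussed later are omitted:

```python
# A minimal sketch of forward propagation for the 4-3-2 network above
# (no activation layer or biases; those are discussed further below).

def forward(inputs, hidden_weights, output_weights):
    # Each hidden neuron: multiply the inputs by its weights and add them up.
    hidden = [sum(x * w for x, w in zip(inputs, ws)) for ws in hidden_weights]
    # Each output neuron does the same over the hidden values.
    return [sum(h * w for h, w in zip(hidden, ws)) for ws in output_weights]

leaf = [32, 107, 56, 11.2]  # R, G, B, volume (ml)

hidden_weights = [
    [0.10, -0.29, -0.07, 0.46],  # the "blue circle" from the text -> about -26.6
    [0.03, 0.12, -0.20, 0.08],   # placeholder values
    [-0.15, 0.05, 0.11, -0.02],  # placeholder values
]
output_weights = [
    [-0.30, 0.20, 0.10],         # placeholder values ("leaf" neuron)
    [0.25, -0.10, 0.05],         # placeholder values ("flower" neuron)
]

scores = forward(leaf, hidden_weights, output_weights)
print(scores)                                         # first score is larger
print("leaf" if scores[0] > scores[1] else "flower")  # -> leaf
```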
The neural network model itself has no concept of "leaf", "flower", or (RGB, Vol); it does not understand what a leaf or a flower is. It is simply designed to take in 4 numbers and put out 2 numbers. It is we who specify that the 4 input numbers represent an object's color and volume, and we who specify how the values of the 2 output neurons correspond to "leaf" and "flower". Ultimately, the training process adjusts the network's weights automatically so that the numbers the model outputs for a given input match our interpretation.
As an interesting side effect, the same network could just as well predict the weather an hour from now. We could feed in four numbers representing, say, cloud cover and humidity, and interpret the two outputs as "sunny in 1 hour" or "rainy in 1 hour". If the weights were calibrated well, one network could perform the leaf/flower classification and the weather prediction at the same time. The network simply outputs two numbers; what those numbers actually mean depends entirely on how you define them, whether as an object classification, a weather forecast, or anything else.
In writing this subsection, I omitted some of the following technical terms to keep it understandable to a wider audience. Even without them, you can still follow the basic concepts of neural networks:
- Activation layer:
Neural networks often have an "activation layer", which applies a nonlinear function to the result of each node's calculation to increase the network's capacity to handle complex cases. A common activation function is ReLU, which sets negative numbers to zero and leaves positive numbers unchanged (see the sketch below). In the example above, we would replace the negative numbers in the hidden layer with zeros before passing them on to the next layer. Without an activation layer, all the additions and multiplications in the network collapse into a single layer: the output of the green node could then be written directly as a weighted sum of the inputs, with no hidden layer needed. The nonlinearity of the activation layer is what allows a neural network to handle more complex patterns.
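A minimal sketch of ReLU in Python; aside from the -26.6 from the worked example, the hidden values here are made up for illustration:

```python
def relu(values):
    # ReLU: negative numbers become 0, positive numbers pass through unchanged.
    return [max(0.0, v) for v in values]

print(relu([-26.6, 3.1, -0.4]))  # -> [0.0, 3.1, 0.0]
```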
- Bias:
Each node in a neural network usually also has a "bias" value associated with it, which is added to the node's weighted sum to adjust the output. For example, if the top blue node had a bias of 0.25, the formula would become: (32 * 0.10) + (107 * -0.29) + (56 * -0.07) + (11.2 * 0.46) + 0.25 = -26.35 (checked in the snippet below). Biases allow the network to fit the data more flexibly, and the term "parameters" usually refers to these weights and bias values of the model.
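As a quick check, here is the biased version of the blue node's computation in Python, using the weights and the 0.25 bias from the text:

```python
inputs  = [32, 107, 56, 11.2]
weights = [0.10, -0.29, -0.07, 0.46]
bias    = 0.25  # the example bias from the text

# Weighted sum of the inputs, plus the bias.
value = sum(x * w for x, w in zip(inputs, weights)) + bias
print(round(value, 2))  # -> -26.35
```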
- Softmax:
At the output layer, we often want to convert the results into probabilities, and the Softmax function is a common way to do this: it converts all the output values into a probability distribution (summing to 1). Softmax divides the exponential of each output value by the sum of the exponentials of all the output values, so the results at the output layer can be interpreted as the probabilities of the respective classes (a small sketch follows below). For example, if the Softmax-processed values are 0.8 and 0.2, that means an 80% probability of "leaf" and a 20% probability of "flower".
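A minimal sketch of Softmax in plain Python. The two input scores below are made-up values chosen so the result lands near the 0.8/0.2 split mentioned above; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import math

def softmax(values):
    # Shift by the max for numerical stability (result is unchanged).
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    # Each exponential divided by the sum of all exponentials.
    return [e / total for e in exps]

print(softmax([1.6, 0.21]))  # -> roughly [0.80, 0.20]
```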
To be continued...