>Follow the official account and reply **1**
>
>to obtain **"Takes on Management"**
Until recently, everyone assumed there was a wide gap between China and the United States in AI. Who would have expected DeepSeek to burst onto the scene during the Spring Festival, catching OpenAI off guard and leaving many domestic practitioners just as stunned!
As an AI enthusiast, I was very curious: how did **DeepSeek pull this off?**
So I dug through a lot of material over the Spring Festival and learned a few things. Today let's explore how it was achieved.
## "Copy homework", copy the second place in the grade
First of all, DeepSeek's success stands on the shoulders of giants; it has two teachers:
1. The first is **OpenAI**. Although OpenAI has not published the implementation details of its models, it has released a large number of research papers, blog posts, and APIs, which help everyone better understand the basic working principles of large models;
1. The second is the **open-source model community**. Whether GPT-2, BERT, or the powerful LLaMA, it provides reference code and implementations, along with tools and datasets;
We should certainly marvel at DeepSeek's rapid progress, but without the contributions of these two, it would have been hard for DeepSeek to succeed. DeepSeek has in turn given back to open source, and as a result the entire AI field has taken a big step forward.
## A brief explanation of how large models work
To make this easier to understand, let's first explain how a large model works in a very simple way. Imagine you are playing an idiom chain game (idiom solitaire).
Every idiom is a "high-dimensional vector": its meaning is like the vector of a word, and the connections between idioms are the model's reasoning process.
The rules of the game require you to start from the last character of the previous idiom, and the sequence of idioms gradually builds a complete logical chain:
**Labeled data** is like knowing the correct continuation for each idiom in advance, helping you quickly connect to the right one.
It directly tells you **"what the next idiom is"**, so you can complete the chain more accurately, as if playing with a clear sense of direction.
This is like training the model to correctly infer the result of a task (such as determining a charge or a verdict).
**Unlabeled data** is like chaining idioms without any hints, relying on your intuition and experience to infer the next suitable idiom.
By being exposed to a large number of idioms, you develop a feel for which idioms are closer in meaning and which fit better in certain contexts, so you can follow up with appropriate idioms more flexibly.
This is like a model learning on its own from unlabeled data: it not only knows the meaning of individual idioms, but also adjusts its understanding through repeated chaining, so that it can reason and respond sensibly when facing new tasks.
**Finally**, labeled data helps the model establish task-specific inference rules through supervised learning, optimizing its accuracy on specific tasks.
Through self-supervised or unsupervised learning, unlabeled data helps the model discover latent semantic structure in the high-dimensional vector space, improving its generalization ability and reasoning flexibility.
Combining the two, the model can not only perform tasks accurately, but also respond effectively to new and complex challenges.
### An example
**Labeled data** helps the model establish task-specific inference rules through supervised learning, because labeled data contains **inputs and their corresponding correct outputs (labels).**
From this labeled data, the model learns how to infer correct results from the inputs. In other words, **labeled data provides the "standard answer" for the model to learn from**, allowing it to optimize its inference ability and complete specific tasks accurately.
For example, suppose you are training a legal judgment prediction model.
You have many legal judgments as input data, and each judgment has been annotated with the correct charge (such as "theft") and the sentencing result (such as "three years of imprisonment"). This is the labeled data.
By learning from this annotated data, the model identifies the key facts of the case (such as "theft") and the sentencing result (such as "imprisonment") from the judgment text. During training, the model gradually adjusts its parameters based on the input judgment and the correct label, so that it can accurately infer the charge and the sentence.
Ultimately, **through labeled data the model learns how to reason out the correct charge and sentence from the content of a verdict**, so that it can also make correct judgments when encountering similar, unseen data in the future.
In short, **with supervised fine-tuning, the model is trained on a large amount of labeled data.**
The model learns the key features of the input data (the verdict), such as "theft", and the corresponding sentencing results, such as "three years of imprisonment", gradually optimizing its reasoning process by adjusting parameters. After training, the model can infer accurate charges and sentences from new verdicts.
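
To make this concrete, here is a minimal, runnable sketch of the supervised fine-tuning idea in Python (PyTorch). The tiny hand-written dataset, the bag-of-words features, and the linear classifier are all illustrative stand-ins for a real pretrained model and real judgment data, not DeepSeek's actual setup.

```python
# A minimal sketch of supervised fine-tuning, assuming a tiny hypothetical dataset
# of judgment texts labeled with charges. Real systems would start from a large
# pretrained model; here a bag-of-words linear classifier stands in for it.
import torch
import torch.nn as nn

# Hypothetical labeled data: (judgment text, charge label)
labeled_data = [
    ("the defendant secretly took the victim's property", "theft"),
    ("the defendant struck the victim causing injury", "assault"),
]
charges = ["theft", "assault"]

# Build a toy vocabulary from the training texts
vocab = sorted({w for text, _ in labeled_data for w in text.split()})
word_to_idx = {w: i for i, w in enumerate(vocab)}

def featurize(text: str) -> torch.Tensor:
    """Bag-of-words vector: counts of each vocabulary word in the text."""
    vec = torch.zeros(len(vocab))
    for w in text.split():
        if w in word_to_idx:
            vec[word_to_idx[w]] += 1.0
    return vec

model = nn.Linear(len(vocab), len(charges))      # stand-in for a pretrained model + head
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                  # compares prediction with the "standard answer"

for epoch in range(50):
    for text, charge in labeled_data:
        x = featurize(text).unsqueeze(0)
        y = torch.tensor([charges.index(charge)])
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)              # supervised signal from the label
        loss.backward()
        optimizer.step()

# Inference on a new, unseen judgment
with torch.no_grad():
    logits = model(featurize("secretly took property from the victim").unsqueeze(0))
    print(charges[logits.argmax(dim=-1).item()])
```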
**Without supervised fine-tuning, would things go off the rails?**
Without supervised fine-tuning, the pretrained model acquires a great deal of general knowledge about language, such as grammar and relationships between words, but it never learns how to apply that knowledge to the reasoning task of legal judgment (it lacks a purpose).
The model may understand the basic structure of a judgment, but it does not know which details (such as the specific facts of the case or the relevant legal provisions) are key to determining the charge and the sentence.
For example, it may recognize the word "theft" but fail to associate it correctly with the legal concept of theft.
Finally, because the model has not been fine-tuned under the supervision of labeled data, it may ignore key legal details in the judgment, so the predicted charges or sentences may be wrong or incomplete.
For example, the model may recognize the term "theft" in the judgment, but without proper supervision it may treat it as the everyday notion of "stealing" rather than the legal "crime of theft", which distorts the predicted verdict.
### To sum up
**Pretraining** lets the model learn the basic rules of language from large-scale unlabeled data (such as web text and books), including grammatical structure, semantic understanding, and contextual reasoning.
At this stage the model builds a general language-understanding ability. By adjusting the underlying vector space, it can recognize relationships between words, infer contextual meaning, and reason broadly, but it is not focused on any specific task.
**Fine-tuning** builds on pretraining, using a small amount of labeled domain data (such as legal judgments or medical reports) for targeted training. The goal is to focus the model's reasoning ability on a specific task or field.
For example, in the legal field, fine-tuning lets the model learn how to reason out charges and sentences from judgment texts.
Pretraining provides a general language-understanding framework by shaping the model's underlying vector structure; fine-tuning optimizes the model's inference ability on specific tasks and improves its accuracy and adaptability. The two work together to help the model handle complex tasks efficiently.
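
As a contrast with the fine-tuning sketch above, here is a minimal sketch of the self-supervised pretraining objective: next-token prediction on raw, unlabeled text. The toy bigram-style model (an embedding plus a linear layer) is an illustrative stand-in for a real transformer, and the example sentence is made up.

```python
# A minimal sketch of the self-supervised pretraining objective (next-token
# prediction) on unlabeled text. The "label" at each step is simply the next
# token of the raw text itself; no human annotation is required.
import torch
import torch.nn as nn

text = "the court found the defendant guilty of theft"
tokens = text.split()
vocab = sorted(set(tokens))
tok_to_id = {t: i for i, t in enumerate(vocab)}
ids = torch.tensor([tok_to_id[t] for t in tokens])

embed_dim = 16
model = nn.Sequential(
    nn.Embedding(len(vocab), embed_dim),   # token -> vector ("high-dimensional vector")
    nn.Linear(embed_dim, len(vocab)),      # vector -> scores over the next token
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    inputs, targets = ids[:-1], ids[1:]    # predict token t+1 from token t
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
```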
With this basic understanding of how large models work, let's continue.
>>Popular science: **Unlabeled data** refers to raw data without explicit labels or classifications.
>>
>>For example: Article 231, Theft: whoever steals public or private property, where the amount involved is large, shall be punished according to law.
>>Popular science: **Labeled data** means every item in the dataset has an explicit label or classification.
>>
>>Annotations explain or supplement a passage of text; they help the model understand and learn the key parts and information in the text, so that it can make reasonable predictions or classifications on unlabeled data.
>>
>>Original text: According to Article 234 of the Criminal Law of the People's Republic of China, the crime of theft is a criminal offense involving property loss, and criminal liability shall be pursued in accordance with the law.
>>
>>Labels: legal provision (Article 234 of the Criminal Law of the People's Republic of China), charge (theft).
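
To make the two notes above concrete, here is a small hypothetical illustration of how the same kind of legal text looks as unlabeled versus labeled data; the field names are invented for illustration.

```python
# A small illustration of the difference between unlabeled and labeled data,
# using hypothetical records in the legal domain.

# Unlabeled data: raw text only, no annotations attached.
unlabeled_example = (
    "Theft of public or private property in a large amount is punished according to law."
)

# Labeled data: the same kind of text plus explicit annotations the model can
# learn from (here, the legal provision cited and the charge).
labeled_example = {
    "text": "According to Article 234 of the Criminal Law, the defendant is guilty of theft.",
    "labels": {"legal_provision": "Criminal Law Article 234", "charge": "theft"},
}
```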
## DeepSeek's innovation in training methods
### 1. Traditional training methods
As mentioned earlier, traditional model training is usually based on supervised learning and fine-tuning, and it mainly includes two steps:
- **First, pre-training**
Train a general model on large datasets (such as books, articles, and research reports).
For example, the pretraining stage of GPT, BERT, and other models uses massive amounts of unlabeled data, with the goal of letting the model learn the basic rules of language.
- **Second, supervised fine-tuning**
On top of the pretrained base model, a small amount of labeled data is used, according to task requirements, to fine-tune the model so that it achieves higher accuracy on specific tasks (such as text classification or question-answering systems).
This process is called supervised fine-tuning, which means optimizing the model with labeled data to make it more accurate on a specific task.
Traditional methods are usually applied to tasks that require a large amount of annotated data, such as sentiment analysis and machine translation. The model is fine-tuned on labeled data so that it performs very well on these tasks.
Although this yields high-accuracy results, **it requires large amounts of labeled data and computing resources, which often leads to high training costs.**
So what does DeepSeek do differently? Let's take a look.
### 2. DeepSeek training method
The training approach proposed by DeepSeek is a training paradigm built on reinforcement learning. Its core is to **combine reinforcement learning with model training to enhance the model's reasoning ability.**
The training process of DeepSeek R1 focuses more on the following aspects:
- **First, pre-training**
Similar to traditional methods, DeepSeek also uses a base model (such as DeepSeek-V3) pretrained to learn basic language (or image) features.
On this basis, however, DeepSeek emphasizes improving the model's decision-making and reasoning capabilities through **reinforcement learning**.
The core of reinforcement learning is learning by interacting with an environment and receiving rewards or penalties; the model optimizes its own behavior through this feedback.
- **Second, rule-based reinforcement learning**
To cope with large-scale training and inference tasks, DeepSeek adopts a rule-based approach, which allows reinforcement learning to scale effectively on large models.
Rules help the system design a reasonable reward mechanism, guiding the model to optimize its reasoning ability without relying entirely on traditionally annotated data (a sketch of such a rule-based reward appears after this list).
- **Third, DeepSeek-R1-Zero**
The key innovation of this model is that it enhances reasoning ability through reinforcement learning's "self-learning", rather than relying solely on traditional annotated data.
**This approach reduces dependence on labeled data and gradually improves reasoning ability with less manual intervention.**
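
Here is what such a rule-based reward could look like in code. This is a hedged sketch under my own assumptions: the `<think>` tag format, the exact-match check, and the reward weights are illustrative and are not claimed to be DeepSeek's actual reward rules.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score a model response with simple rules instead of a learned reward model.

    Two illustrative rules: a format reward for wrapping reasoning in tags,
    and an accuracy reward for a verifiable final answer (e.g. a math result).
    """
    reward = 0.0

    # Format rule: reasoning should appear inside <think> ... </think> tags.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.2

    # Accuracy rule: the final answer (text after the closing tag) must match
    # the reference answer exactly; for math problems this is checkable by rule.
    final_answer = response.split("</think>")[-1].strip()
    if final_answer == reference_answer.strip():
        reward += 1.0

    return reward

# Example: a response that reasons in tags and ends with the right answer.
resp = "<think>2 apples plus 3 apples is 5 apples.</think>5"
print(rule_based_reward(resp, "5"))   # 1.2
```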
Let's discuss this in more detail.
## Reinforcement learning and reward mechanism
In DeepSeek, the way reinforcement learning is combined with a reward mechanism is one of the cores of its innovative training method. Here is a brief introduction to reinforcement learning:
### Reinforcement Learning
Reinforcement learning is a paradigm of machine learning that is different from traditional supervised learning.
In supervised learning, the model learns from a set of labeled input-output pairs, trained on input data with the corresponding labels;
In reinforcement learning, the model learns through interaction with the environment: it receives rewards or penalties based on the actions it takes, and then adjusts its behavior to maximize future rewards.
In fact, reinforcement learning shares many similarities with the core principles of traditional supervised learning (including fine-tuning), especially in that both adjust the model's behavior through feedback.
This feedback usually improves the model's performance by updating its parameters (such as weights and biases). From this perspective, both reinforcement learning and fine-tuning essentially optimize the model's output by changing its internal structure (vectors or parameters).
The biggest difference between reinforcement learning and fine-tuning lies in the training goals and the feedback mechanism.
Fine-tuning relies on labeled data and a one-off optimization to solve a specific task, while reinforcement learning optimizes behavior through continuous interaction with the environment and reward feedback, allowing the model to keep adapting to dynamic tasks and changing environments.
Therefore, reinforcement learning is more flexible and effective in dealing with tasks that require long-term decision-making and self-adjustment.
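
To show what "learning from rewards instead of labels" means mechanically, here is a minimal REINFORCE-style sketch on a toy two-action problem. The environment and learning rate are made up for illustration; real systems use far more sophisticated policy-optimization algorithms.

```python
# A minimal sketch of the reinforcement-learning loop described above, using a
# toy two-action problem and a REINFORCE-style update. There is no labeled
# "correct answer"; the only feedback is a scalar reward after each action.
import torch

logits = torch.zeros(2, requires_grad=True)          # the model's "parameters"
optimizer = torch.optim.Adam([logits], lr=0.1)

def environment(action: int) -> float:
    """Hypothetical environment: action 1 is rewarded, action 0 is not."""
    return 1.0 if action == 1 else 0.0

for step in range(200):
    probs = torch.softmax(logits, dim=-1)            # current behavior policy
    action = torch.multinomial(probs, 1).item()      # act, then observe feedback
    reward = environment(action)
    # REINFORCE: push up the log-probability of actions in proportion to reward.
    loss = -reward * torch.log(probs[action])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))   # probability mass should shift toward action 1
```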
Finally, traditional training methods face a huge challenge: **high costs**, which are not only reflected in data collection and annotation, but also in the consumption of computing resources during training.
DeepSeek significantly reduces training costs through a series of innovative strategies, especially the clever combination of reinforcement learning, model distillation, and feedback mechanisms:
### Model distillation
Model distillation is a technique that transfers the knowledge learned in a large model (teacher model) to a smaller model (student model).
In the initial stage of model distillation, the teacher model (a large pretrained model such as GPT) typically generates a large amount of high-quality data.
This data includes not only traditional "hard labels" (i.e., the correct answers) but also "soft labels" (i.e., the model's output probability distributions or intermediate representations). Soft labels can carry more information than hard labels, such as the relative probabilities of different candidate answers and the model's confidence in its reasoning.
The teacher model processes a set of inputs and produces outputs, for example Q&A pairs or soft labels, which are given to the student model as "learning targets".
This process transfers the teacher model's knowledge through distillation, but the goal is no longer for the student model to imitate hard labels directly; instead, it imitates the teacher model's reasoning process, behavior, or output distribution.
After obtaining the Q&A pairs and soft labels generated by the teacher model, the student model (such as DeepSeek) fine-tunes on this high-quality data as its "training data". At this stage, the student model's goals are:
1. **Imitate the teacher model's behavior and output as closely as possible.** Specifically, the student model adjusts its parameters by comparing its output with the teacher model's, so that the two become close.
1. Adjust its parameters and structure through the fine-tuning process, so that its reasoning process and decision paths move closer to the teacher model's.
This fine-tuning is usually done by optimizing a loss function that measures the gap between the student model's output and the teacher model's output (typically a measure of the difference between probability distributions, such as KL divergence).
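
Here is a minimal sketch of such a distillation loss, assuming stand-in linear layers for the teacher and student; the temperature and loss weighting are illustrative choices, not DeepSeek's actual values.

```python
# A minimal sketch of the distillation loss described above: the student is
# trained to match the teacher's "soft labels" (output probability distribution)
# via a KL-divergence term, optionally mixed with the ordinary hard-label loss.
import torch
import torch.nn.functional as F

temperature = 2.0
x = torch.randn(4, 16)                       # a batch of input features
hard_labels = torch.tensor([0, 1, 2, 1])     # the usual "correct answers"

teacher = torch.nn.Linear(16, 3)             # stand-in for a large teacher model
student = torch.nn.Linear(16, 3)             # smaller student model being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-2)

with torch.no_grad():
    teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)   # soft labels

student_logits = student(x)
soft_loss = F.kl_div(                        # divergence between softened outputs
    F.log_softmax(student_logits / temperature, dim=-1),
    teacher_probs,
    reduction="batchmean",
) * (temperature ** 2)
hard_loss = F.cross_entropy(student_logits, hard_labels)

loss = 0.5 * soft_loss + 0.5 * hard_loss     # weights are illustrative
optimizer.zero_grad()
loss.backward()
optimizer.step()
```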
The advantage of this approach is that the student model does not need the teacher model's computing and storage requirements to achieve similar reasoning capability, which saves significant computing resources.
Note, however, that even with model distillation, the student model itself must already be reasonably capable, and open-source models help a great deal here.
### Reinforcement learning optimization and dialogue process
After the initial distillation and fine-tuning, for further optimization, especially in scenarios involving dynamic interaction or complex reasoning tasks, the student model may enter a "dialogue" phase, continuously interacting with and adjusting against the teacher model.
At this stage, reinforcement learning comes into play: the "dialogue" between the student and teacher models is not just a static fine-tuning process; instead, **continuous interaction and a reward-and-penalty mechanism are used to optimize the student model's behavior.**
In each "interaction", the student model adjusts its strategy based on the feedback (reward or penalty); a small code sketch follows this list:
1. **Reward mechanism:** whenever the student model's output approaches the teacher model's behavior, it receives a reward. For example, if the student model generates a high-quality output similar to the teacher model's, it receives a reward signal; if it deviates from the teacher model's output, it is penalized.
1. **Goal of reinforcement learning:** through this reward mechanism, the student model continuously optimizes its behavioral strategy based on its interactions with the teacher model, so that its reasoning and decision-making ability steadily approaches the teacher model's.
1. **Continuously adjusted weights:** the student model's weights are adjusted throughout this "dialogue". Through reinforcement learning, the student model keeps refining its decision-making via dialogue and feedback with the teacher model, becoming more flexible and precise on more complex tasks.
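
As a toy illustration of turning "closeness to the teacher" into a reward signal, here is a small sketch; the choice of negative KL divergence as the similarity measure is a hypothetical example of mine, not DeepSeek's documented mechanism.

```python
# A minimal sketch of the reward idea in this "dialogue" phase: the student's
# output distribution is compared with the teacher's, and closeness is turned
# into a scalar reward (here, the negative KL divergence).
import torch
import torch.nn.functional as F

def closeness_reward(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> float:
    """Higher reward the closer the student's distribution is to the teacher's."""
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return float(-kl)   # small divergence -> reward near 0, large divergence -> negative

teacher_out = torch.tensor([[2.0, 0.5, -1.0]])
good_student = torch.tensor([[1.8, 0.6, -0.9]])
bad_student = torch.tensor([[-1.0, 2.0, 0.5]])
print(closeness_reward(good_student, teacher_out))  # close to 0 (rewarded)
print(closeness_reward(bad_student, teacher_out))   # strongly negative (penalized)
```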
## Cost Advantage
In summary, DeepSeek's core advantage lies in how efficiently it combines innovative training techniques such as reinforcement learning, model distillation, and self-learning.
This greatly reduces the computing resources required by traditional methods and improves the model's reasoning ability on dynamic, complex tasks.
>But note that this low cost and high efficiency owe a great deal to GPT and other high-quality base models
### 1. Application of reinforcement learning
DeepSeek trains through continuous interaction with the environment via reinforcement learning, avoiding heavy dependence on large amounts of labeled data. The model adjusts its behavior through real-time feedback (rewards or penalties) and gradually improves its decision strategy.
This approach optimizes the model step by step, reducing the large data requirements of traditional training methods and improving task adaptability and efficiency.
### 2. Model distillation
DeepSeek uses model distillation technology to pass on the knowledge of the large-scale teacher model optimized by reinforcement learning to smaller student models.
In this way, the student model retains strong reasoning ability without needing the teacher model's computing resources, which significantly reduces training cost and compute consumption.
### 3. Rule-based reinforcement learning
DeepSeek's training adopts rule-based reinforcement learning, designing a reasonable reward mechanism through rules to help the model optimize its reasoning process without relying entirely on labeled data.
Guided by rules, the model maintains stable performance on complex tasks, reduces the cost of collecting large-scale annotated data, and scales better to large tasks.
### 4. Reduce manual intervention
DeepSeek optimizes the model's inference ability through feedback mechanisms of self-learning and reinforcement learning.
Through interaction with the environment or with other models, the model adjusts itself via rewards and penalties, without frequent human intervention or data cleaning.
Through training and feedback with the teacher model, DeepSeek can keep improving its reasoning ability with less manual intervention, reducing the cost of manual tuning and data labeling.
### 5. Low-cost innovation system
DeepSeek combines reinforcement learning, model distillation and self-learning to form an efficient training system.
In this system, the model reduces the need for labeled data and high computing resources through self-optimization, and maintains efficient inference with fewer computing resources.
This systematic innovation improves training speed while maintaining efficiency on complex tasks, significantly reducing training costs.
## Conclusion
In the discussion above, we analyzed DeepSeek's innovations in detail, especially how the combination of reinforcement learning, model distillation, and self-learning significantly reduces training costs and improves efficiency.
In practice, many experts have already reproduced DeepSeek's approach, for example **Datawhale Master Luo**.
Among them, the articles from **Internet Reading Notes** are relatively easy to read, and I hold a view very similar to the author's: **in model training, data is king**.
That so many people can reproduce these models points to one thing: **the bar for model training is not high and will get lower as tooling platforms mature; the biggest problem is how to prepare high-quality data through engineering methods.**
That brings this article to a close. To be clear: **this article is my digestion and interpretation of more than 100 articles, plus a good deal of hands-on experience from the past two years. There are bound to be mistakes and omissions, and corrections are welcome.**
One last thought: **the core of AI product development is engineering practice grounded in industry know-how, and the key is cycling high-quality data into a flywheel system; so today's content is, strictly speaking, optional for engineering applications.** Yes, we can treat the model as a black box, but it doesn't hurt to know more.
