Location>code7788 >text

Model distillation case--Model distillation from DeepSeek-R1-1.5B to Qwen-2.5-1.5B

Popularity:280 ℃/2025-03-20 22:18:00

DeepSeek-R1-1.5B to Qwen-2.5-1.5BModelDistillationDistillation

This article focuses on model distillation from DeepSeek-R1-1.5B to Qwen-2.5-1.5B (Distillation), Due to limited hardware resources, we can only use CPU for model distillation.

1. Distillation target

1.1. Knowledge Transfer

Migrate DeepSeek’s reasoning capabilities (such as multiple rounds of logical reasoning and code generation) to Qwen-2.5;

1.2. Efficiency optimization

Reduce inference costs (such as memory usage and latency) while maintaining performance;

1.3. compatibility

Ensure that the student model is compatible with the original features of Qwen-2.5 (such as dialogue, multilingual support).

2. Environmental preparation

2.1. PycharmInstall

Download address:/en-us/pycharm/download/?section=windows

Select version: PyCharm Community Edition

 

Installation: Just follow the prompts to install.

2.2. Dependency library installation

Make sure to install the following Python libraries:

pip install torch torchvision transformers datasets

pip install accelerate # Accelerate distributed training
pip install evaluate # Evaluation indicators

2.3. Hardware requirements

GPU: It is recommended to use a single or multiple NVIDIA GPUs (such as V100, A100) to ensure sufficient video memory (at least 24GB is recommended).

CUDA: Install CUDA versions compatible with PyTorch (such as CUDA 11.7).

 

Due to limited machine resources, this time, a 2-core Intel CPU (Intel(R) Core(TM) i7-10700F CPU @ 2.90GHz 2.90GHz) and 16G memory and virtual 20G memory are adopted, and the distillation time is about 30 days. Setting virtual memory is as follows:

 

2.4. Models and datasets

2.4.1. Teacher Modeldownload

DeepSeek-R1-1.5B (need to be downloaded from official or trusted sources). Offline download method:

$env:HF_ENDPOINT = ""

huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir ./models/DeepSeek-R1-Distill-Qwen-1.5B --local-dir-use-symlinks False

 

2.4.2. Student Modeldownload

Qwen-2.5-1.5B (need to be obtained from Alibaba Cloud or Hugging Face). Offline download method (fromDownload offline):

$env:HF_ENDPOINT = ""

huggingface-cli download Qwen/Qwen2.5-1.5B --local-dir ./models/qwen2.5-1.5B --local-dir-use-symlinks False

 

2.4.3. DatasetDatasetsdownload

It is recommended to use large-scale text datasets (such as wikitex, Wikipedia, BooksCorpus, OpenWebText, etc.). Offline download address (downloaded from /datasets/jayanthbontha/wikitext)

 

3. Process log

3.1. Log and current file path

#Configuration log(level=, format='%(asctime)s - %(levelname)s - %(message)s')
logger = (__name__)

 # Get the absolute path to the current script file
 current_script_path = (__file__)
(f"Current script path: {current_script_path}")

 # Get the directory where the current script file is located
 current_script_dir = (current_script_path)
(f"Current script directory: {current_script_dir}")

 

4. Model Loading and Configuration

4.1. Loading the teacher model

AutoTokenizer.from_pretrained is the core tool for handling text preprocessing, simplifying the loading and configuration of word segmenters. By setting parameters reasonably (such as use_fast, cache_dir), you can adapt to the needs of different scenarios. In complex tasks such as knowledge distillation, it is necessary to ensure the consistency of word participle between teachers and students' models to ensure training results.

 

# Loading the teacher model (DeepSeek-R1:1.5B)teacher_model_name = (current_script_dir, "../models/DeepSeek-R1-Distill-Qwen-1.5B")
(f"Loading teacher model: {teacher_model_name}")
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name,
    local_files_only=True
)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name,
    local_files_only=True
)

 

 

Key parameter description

Parameter name

describe

Example Values

pretrained_model_name_or_path

The pretrained model name (such as bert-base-uncased) or local path.

"DeepSeek/r1-1.5b"

use_fast

Whether to use the fast word segmenter based on the tokenizers library (default True).

True / False

tokenizer_type

Manually specify the word parter type (such as BertTokenizer).

"BertTokenizer"

revision

Specify the model version (such as "v1.0").

"main"

subfolder

The subdirectory path in the model repository (if the model file is not in the root directory).

"models/tokenizer"

cache_dir

Specify the cache directory (default is ~/.cache/huggingface/transformers).

"/path/to/cache"

force_download

Whether to force redownload the model file (overwrite existing files).

False

local_files_only

Use only local files and do not try to download them from the network.

False

trust_remote_code

Allows execution of remote code (such as when a custom model is required).

False

 

4.2. loadStudent Model

# Loading the student model (Qwen)student_model_name = (current_script_dir, "../models/qwen2.5-1.5B")  # Make sure the model name is correct(f"Loading student model: {student_model_name}")
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name,
    local_files_only=True
)
student_model = AutoModelForCausalLM.from_pretrained(student_model_name,
    local_files_only=True
)

 

 

Key parameter description

Parameter name

describe

Example Values

pretrained_model_name_or_path

The pretrained model name (such as bert-base-uncased) or local path.

"DeepSeek/r1-1.5b"

use_fast

Whether to use the fast word segmenter based on the tokenizers library (default True).

True / False

tokenizer_type

Manually specify the word parter type (such as BertTokenizer).

"BertTokenizer"

revision

Specify the model version (such as "v1.0").

"main"

subfolder

The subdirectory path in the model repository (if the model file is not in the root directory).

"models/tokenizer"

cache_dir

Specify the cache directory (default is ~/.cache/huggingface/transformers).

"/path/to/cache"

force_download

Whether to force redownload the model file (overwrite existing files).

False

local_files_only

Use only local files and do not try to download them from the network.

False

trust_remote_code

Allows execution of remote code (such as when a custom model is required).

False

4.3. Data preprocessing function

() is the core method used in the Hugging Face datasets library to batch preprocess datasets. When batched=True, it passes the dataset in batches (batch) to the preprocess_function instead of processing it on a sample-by-sample basis. This batch processing method is more efficient and is especially suitable for large-scale data sets.

 

# Data preprocessing(f"Preprocess_function")
def preprocess_function(examples):
    return teacher_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

("Preprocessing train dataset")
train_dataset = train_dataset.map(preprocess_function, batched=True)
("Preprocessing eval dataset")
eval_dataset = eval_dataset.map(preprocess_function, batched=True)

 

 

preprocess_function must return a dictionary, and its value must be a list of the same size as the input batch. For example, if the input batch has 3 samples, the corresponding list length of each key returned must also be 3.

4.4. Data Collector

DataCollatorForLanguageModeling is a data collation class (Data Collator) in the Hugging Face transformers library, which is used to dynamically generate training samples when training language models (such as BERT, GPT, etc.). It can preprocess the input data according to task requirements such as mask language model (MLM) or causal language model (CLM).

 

# Data Collector("DataCollatorForLanguageModeling")
data_collator = DataCollatorForLanguageModeling(tokenizer=teacher_tokenizer, mlm=False)

 

 

mlm (key parameters): Function: Controls whether **mask language model (MLM)** mode is enabled.

mlm=True: Some tokens in the random mask input (such as BERT training method), generate the [MASK] tag.

mlm=False: Disable mask, suitable for causal language model (CLM) (such as GPT training method), the input and label are the original token sequence.

4.5. Define training parameters

# Define training parameters("Creating trainer")
training_args =TrainingArguments(
     output_dir="./results", # Training result saving path
     eval_strategy="epoch", # Evaluate at the end of each epoch
     learning_rate=5e-5, # Learning rate (default 5e-5 is a common choice)
     per_device_train_batch_size=2, # Training batch size for each device (GPU single card)
     per_device_eval_batch_size=2, # Evaluation of each device batch size
     num_train_epochs=3, # Training rounds (3 rounds may be short and need to be adjusted according to the task)
     weight_decay=0.01, # Weight decay (L2 regularization)
     logging_dir="./logs", # log saving path
     logging_steps=100, # Logging every 100 steps
     fp16=False, # Whether to enable mixed precision training (recommended to enable)
     gradient_accumulation_steps=4, # gradient accumulation steps (equivalent batch_size=8)
     report_to="tensorboard", # Use TensorBoard to record the training process
     # tensorboard_dir="./tensorboard" # Optional: Specify the TensorBoard log directory
 )

 

Core optimization direction: Adjust batch size, learning rate, video memory strategy and saving strategy to meet the needs of distillation tasks.

Key parameters: fp16, gradient_accumulation_steps, save_strategy and metric_for_best_model need to be adjusted according to hardware and task characteristics.

Recommended practice: Combined with TensorBoard monitoring training process, regularly evaluate model performance and adjust hyperparameters.

4.6. Define distillation configuration

# Define distillation configuration weight: add weight, "loss": "mse"("Creating distillation config")
distill_config =DistillationConfig(

     temperature=2.0, # Temperature parameter, controls the smoothness of soft labelshard_label_weight=0.5, # Real tag loss weightkd_loss_type="ce", # Knowledge Distillation Loss Type (Cross Entropy)intermediate_matches=[ # Intermediate layer matching configuration{

             "layer_T": 6, # The 6th level of the teacher model

             "layer_S": 6, # The 6th layer of the student model

             "feature": "hidden", # Match hidden layer features

             "weight": 1.0, # Intermediate layer loss weight

             "loss": "mse" # Use mean square error loss
        }

    ]

)

 

4.7. Define training configuration

# Define training configuration("Creating training config")
train_config =TrainingConfig(

     device="cuda" if .is_available() else "cpu", # Device selectionlog_dir="./logs", # log directoryoutput_dir="./outputs" # Model output directory

     # save_best_model=True, # Whether to save the best model (comment status)

     # save_last_model=True, # Whether to save the last model (comment status)

     # save_model_every_epoch=True, # Whether to save the model every round (comment status)

     # tensorboard_dir="./tensorboard" # TensorBoard log directory (comment status)
)

 

4.8. Create a distiller

# Create a distiller("Creating distiller")
distiller =GeneralDistiller(
     train_config=train_config, # Training configuration (including devices, paths, etc.)
     distill_config=distill_config, # Distillation configuration (temperature, loss weight, etc.)
     model_T=teacher_model, # teacher model
     model_S=student_model, # Student Model
     adaptor_T=None, # Teacher Model Adapter (not configured)
     adaptor_S=None # Student Model Adapter (not configured)
 )

 

 

4.9. Start distillation

# Start distillationwith distiller:  # Use the still context manager to ensure that resources are initialized and released correctly("Starting training") # Record training start log

    

     # Initialize Trainer, integrated model distillation configuration
    trainer =Trainer(

         model=student_model, # student model (small model that needs training)args=training_args, # Training parameters (such as learning rate, batch size, equipment, etc.)train_dataset=train_dataset, # training dataset (including input and label)eval_dataset=eval_dataset, # Verify dataset (used to evaluate model performance)data_collator=data_collator, # Data batch processing function (combining single pieces of data into batches)

         # processing_class=teacher_tokenizer # Note: There may be problems here (see instructions below)

         # Correct way: Adapter or data processing logic should be processed in a distillation configuration)

    

     # Start model training() # Start the training loop, including forward propagation, loss calculation, backpropagation, etc.("Training finished") # Record training end log

 

5. Results Analysis

Through the above steps, the knowledge of DeepSeek-R1-1.5B can be distilled onto Qwen-2.5-1.5B, significantly improving the performance of the student model while maintaining lightweight. In actual applications, hyperparameters and data sets need to be adjusted according to specific tasks. At the same time, reduce calculation costs. The key lies in adapter design, loss function optimization and distributed training strategies. Pay attention to model architecture differences, task adaptability and legal compliance to ensure that the final model balances performance and cost.

index

Teacher Model (DeepSeek-R1-1.5B)

Student Model (Qwen-2.5-1.5B)

Model after distillation

Verify loss

1.23

2.15

1.45

Generate text quality

high

medium

Approaching the teacher model

Speed ​​of reasoning

Slow (150ms/sample)

Fast (80ms/sample)

70ms/sample

6. Appendix: Complete code
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, \
    TrainingArguments
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig
from datasets import load_dataset
importLogging

 #Configuration log
 (level=, format='%(asctime)s - %(levelname)s - %(message)s')
logger = (__name__)

 # Get the absolute path to the current script file
 current_script_path = (__file__)
(f"Current script path: {current_script_path}")

 # Get the directory where the current script file is located
 current_script_dir = (current_script_path)
(f"Current script directory: {current_script_dir}")

 # Loading the teacher model (DeepSeek-R1:1.5B)
 teacher_model_name = (current_script_dir, "../models/DeepSeek-R1-Distill-Qwen-1.5B")
(f"Loading teacher model: {teacher_model_name}")
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_model_name,
    local_files_only=True
)
teacher_model = AutoModelForCausalLM.from_pretrained(teacher_model_name,
    local_files_only=True
 )

 # Loading the student model (Qwen)
 student_model_name = (current_script_dir, "../models/qwen2.5-1.5B") # Make sure the model name is correct
 (f"Loading student model: {student_model_name}")
student_tokenizer = AutoTokenizer.from_pretrained(student_model_name,
    local_files_only=True
)
student_model = AutoModelForCausalLM.from_pretrained(student_model_name,
    local_files_only=True
 )

 # Prepare the dataset
 datasets_name = (current_script_dir, "../models/Dataset/wikitext-2-raw/") # Make sure the model name is correct
 data_files = {
    "train": datasets_name+"",
    "test": datasets_name+""
}
(f"Loading dataset from local files: {data_files}")
dataset = load_dataset("text", data_files=data_files)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]


 # Data preprocessing
 (f"Preprocess_function")
def preprocess_function(examples):
    return teacher_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)


("Preprocessing train dataset")
train_dataset = train_dataset.map(preprocess_function, batched=True)
("Preprocessing eval dataset")
eval_dataset = eval_dataset.map(preprocess_function, batched=True)

 # Data Collector
 ("DataCollatorForLanguageModeling")
data_collator = DataCollatorForLanguageModeling(tokenizer=teacher_tokenizer, mlm=False)

 # Define training parameters
 ("Creating trainer")
training_args =TrainingArguments(
     output_dir="./results", # Training result saving path
     eval_strategy="epoch", # Evaluate at the end of each epoch
     learning_rate=5e-5, # Learning rate (default 5e-5 is a common choice)
     per_device_train_batch_size=2, # Training batch size for each device (GPU single card)
     per_device_eval_batch_size=2, # Evaluation of each device batch size
     num_train_epochs=3, # Training rounds (3 rounds may be short and need to be adjusted according to the task)
     weight_decay=0.01, # Weight decay (L2 regularization)
     logging_dir="./logs", # log saving path
     logging_steps=100, # Logging every 100 steps
     fp16=False, # Whether to enable mixed precision training (recommended to enable)
     gradient_accumulation_steps=4, # gradient accumulation steps (equivalent batch_size=8)
     report_to="tensorboard", # Use TensorBoard to record the training process
     # tensorboard_dir="./tensorboard" # Optional: Specify the TensorBoard log directory)

 # Define distillation configuration weight: add weight, "loss": "mse"
 ("Creating distillation config")
distill_config =DistillationConfig(
     temperature=2.0, # Temperature parameter, controls the smoothness of soft labels
     hard_label_weight=0.5, # Real tag loss weight
     kd_loss_type="ce", # Knowledge Distillation Loss Type (Cross Entropy)
     intermediate_matches=[ # Intermediate layer matching configuration{
             "layer_T": 6, # The 6th level of the teacher model
             "layer_S": 6, # The 6th layer of the student model
             "feature": "hidden", # Match hidden layer features
             "weight": 1.0, # Intermediate layer loss weight
             "loss": "mse" # Use mean square error loss}
     ]
 )

 # Define training configuration
 ("Creating training config")
train_config =TrainingConfig(
     device="cuda" if .is_available() else "cpu", # Device selection
     log_dir="./logs", # log directory
     output_dir="./outputs" # Model output directory
     # save_best_model=True, # Whether to save the best model (comment status)
     # save_last_model=True, # Whether to save the last model (comment status)
     # save_model_every_epoch=True, # Whether to save the model every round (comment status)
     # tensorboard_dir="./tensorboard" # TensorBoard log directory (comment status))

 # Create a distiller
 ("Creating distiller")
distiller =GeneralDistiller(
     train_config=train_config, # Training configuration (including devices, paths, etc.)
     distill_config=distill_config, # Distillation configuration (temperature, loss weight, etc.)
     model_T=teacher_model, # teacher model
     model_S=student_model, # Student Model
     adaptor_T=None, # Teacher Model Adapter (not configured)
     adaptor_S=None # Student Model Adapter (not configured))

 # Start distillation
 with distiller: # Use the still context manager to ensure that resources are initialized and released correctly
     ("Starting training") # Record training start log

     # Initialize Trainer, integrated model distillation configuration
     trainer =Trainer(
         model=student_model, # student model (small model that needs training)
         args=training_args, # Training parameters (such as learning rate, batch size, equipment, etc.)
         train_dataset=train_dataset, # training dataset (including input and label)
         eval_dataset=eval_dataset, # Verify dataset (used to evaluate model performance)
         data_collator=data_collator, # Data batch processing function (combining single pieces of data into batches)
         # processing_class=teacher_tokenizer # Note: There may be problems here (see instructions below)
         # Correct way: Adapter or data processing logic should be processed in a distillation configuration)

     # Start model training
     () # Start the training loop, including forward propagation, loss calculation, backpropagation, etc.trainer.save_model()

     ("Training finished") # Record training end log