
[LLM Training Series] Phi2-Mini-Chinese Project Interpretation for Training Large Models from Scratch


I. Preface

This article briefly analyzes the Phi2-mini-Chinese project after reproducing and practicing it, and serves as a practical learning summary.

Originally posted on Zhihu: /p/718307193, reproduced with permission.

Introduction to Phi2-Mini-Chinese

Phi2-Chinese-0.2B: train your own small Phi2 Chinese chat model from scratch, with support for plugging into LangChain to load a local knowledge base for retrieval-augmented generation (RAG).

Project started: December 22, 2023
Repository: /charent/Phi2-mini-Chinese

Process Steps

  • Data processing
  • Tokenizer training
  • Pre-training
  • SFT
  • DPO

Data processing steps are omitted. Open source datasets are generally used.

II. Tokenizer training


It simply uses the tokenizers library to train a BPE tokenizer; there is not much to it.
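For reference, here is a minimal sketch of training a BPE tokenizer with the tokenizers library. The corpus path, vocabulary size, and special tokens below are assumptions for illustration and not the project's actual settings:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer (the project's actual script differs in its details)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32000,  # assumed value
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["./data/corpus.txt"], trainer=trainer)  # hypothetical corpus file
tokenizer.save("./model_save/tokenizer/tokenizer.json")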

III. Pre-training code

import os, platform, time
from typing import Optional
import numpy as np
import pandas as pd
from dataclasses import dataclass,field
from datasets import load_dataset, Dataset
import torch
from transformers.trainer_callback import TrainerControl, TrainerState
from transformers import PreTrainedTokenizerFast, DataCollatorForLanguageModeling, PhiConfig, PhiForCausalLM, Trainer, TrainingArguments, TrainerCallback

# Pre-training data (plain text data)
TRAIN_FILES = ['./data/wiki_chunk_320_2.',]
EVAL_FILE = './data/pretrain_eval_400_1w.parquet'

@dataclass
class PretrainArguments:
    tokenizer_dir: str = './model_save/tokenizer/'
    model_save_dir: str = './model_save/pre/'
    logs_dir: str = './logs/'
    train_files: list[str] = field(default_factory=lambda: TRAIN_FILES)
    eval_file: str = EVAL_FILE
    max_seq_len: int = 512
    # eager attention on Windows; 'flash_attention_2' assumed otherwise
    attn_implementation: str = 'eager' if platform.system() == 'Windows' else 'flash_attention_2'

pretrain_args = PretrainArguments()
# Load the trained tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(pretrain_args.tokenizer_dir)
# Correct the vocab size: round it up to a multiple of 64
vocab_size = len(tokenizer)
if vocab_size % 64 != 0:
    vocab_size = (vocab_size // 64 + 1) * 64
# If the vocab size is less than 65535, store token ids as uint16 to save disk space; otherwise use uint32
map_dtype = np.uint16 if vocab_size < 65535 else np.uint32

def token_to_id(samples: dict[str, list]) -> dict:
    batch_txt = samples['text']
    outputs = tokenizer(batch_txt, truncation=False, padding=False, return_attention_mask=False)
    input_ids = [np.array(item, dtype=map_dtype) for item in outputs["input_ids"]]
    return {"input_ids": input_ids}

# Load Dataset
def get_maped_dataset(files: str|list[str]) -> Dataset:
    dataset = load_dataset(path='parquet', data_files=files, split='train', cache_dir='.cache')
    maped_dataset = dataset.map(token_to_id, batched=True, batch_size=10_000, remove_columns=dataset.column_names)
    return maped_dataset

train_dataset = get_maped_dataset(pretrain_args.train_files)
eval_dataset = get_maped_dataset(pretrain_args.eval_file)
# Define the data_collator. `mlm=False` means causal LM (CLM) training; `mlm=True` means masked LM (MLM) training
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

phi_config = PhiConfig(
    vocab_size=vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    hidden_size=960,
    num_attention_heads=16,
    num_hidden_layers=24,
    max_position_embeddings=512,
    intermediate_size=4096,
    attn_implementation=pretrain_args.attn_implementation,
)
model = PhiForCausalLM(phi_config)
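# `MyTrainerCallback` is not shown in this excerpt. A minimal sketch that matches how it is
# used below (and mirrors the EmptyCudaCacheCallback in the DPO section) might look like this;
# the every-2-logs interval is an assumption:
class MyTrainerCallback(TrainerCallback):
    log_cnt = 0
    def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, logs=None, **kwargs):
        # empty the CUDA cache every couple of logging steps to curb memory growth
        self.log_cnt += 1
        if self.log_cnt % 2 == 0:
            torch.cuda.empty_cache()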

# Define training arguments
my_trainer_callback = MyTrainerCallback()  # callback that periodically empties the CUDA cache
args = TrainingArguments(
    output_dir=pretrain_args.model_save_dir, per_device_train_batch_size=4,
    gradient_accumulation_steps=32, num_train_epochs=4, weight_decay=0.1,
    warmup_steps=1000, learning_rate=5e-4, evaluation_strategy='steps',
    eval_steps=2000, save_steps=2000, save_strategy='steps', save_total_limit=3,
    report_to='tensorboard', optim="adafactor", bf16=True, logging_steps=5,
    log_level='info', logging_first_step=True,
)
trainer = Trainer(model=model, tokenizer=tokenizer,args=args,
    data_collator=data_collator, train_dataset=train_dataset,
    eval_dataset=eval_dataset, callbacks=[my_trainer_callback],
)
trainer.train()
trainer.save_model(pretrain_args.model_save_dir)

This code is basically no different from any other training code that uses the transformers Trainer.
The main difference lies in which tokenizer and CausalLM model are used.

Change PhiConfig and PhiForCausalLM to:

from transformers import LlamaConfig as PhiConfig
from transformers import LlamaForCausalLM as PhiForCausalLM

Or:

from transformers import Qwen2Config as PhiConfig
from transformers import Qwen2ForCausalLM as PhiForCausalLM

and it turns into a simple pre-training script for another model; you can swap in more or less any architecture.

Regarding training-data construction, this is where DataCollatorForLanguageModeling comes in:
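A quick sketch (my own, not from the original post) of what the collator produces, using the data_collator defined in the pre-training code above: with mlm=False it pads a batch of input_ids and copies them into labels (any padding positions are set to -100), and the causal LM shifts them internally for next-token prediction.

features = [{"input_ids": [5, 6, 7, 8]}, {"input_ids": [9, 10, 11, 12]}]
batch = data_collator(features)
print(batch["input_ids"].shape)  # torch.Size([2, 4])
print(batch["labels"])           # same ids as input_ids here; pad positions would be -100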

(Note: the load_dataset/map calls in get_maped_dataset do not set num_proc, which makes loading slow; set it to the number of CPU cores.)

This part of the code is very similar to a previous post I wrote about training GPT2 based on the transformers library:
/p/685851459


IV. SFT code

It is basically the same as pre-training; the only difference is how the labels for the output are set.

import time
import pandas as pd
import numpy as np
import torch
from datasets import load_dataset
from transformers import PreTrainedTokenizerFast, PhiForCausalLM, TrainingArguments, Trainer, TrainerCallback
from trl import DataCollatorForCompletionOnlyLM

# 1. Define the training data, tokenizer, and pre-trained model paths, plus the maximum sequence length
sft_file = './data/sft_train_data.parquet'
tokenizer_dir = './model_save/tokenizer/'
sft_from_checkpoint_file = './model_save/pre/'
model_save_dir = './model_save/sft/'
max_seq_len = 512

# 2. Load training dataset
dataset = load_dataset(path='parquet', data_files=sft_file, split='train', cache_dir='.cache')
tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)
print(f"vicab size: {len(tokenizer)}")

# ## 2.1 Define the instruction templates for the sft data_collator
# You can also manually compute `instruction_template_ids` and `response_template_ids` and add them to the input_ids yourself, because a byte-level tokenizer may merge `:` with the characters that follow it, so `instruction_template_ids` and `response_template_ids` can no longer be found in the encoded text.
# Alternatively, as done below, solve it by manually adding `'\n'` before and after the `'#'` and `':'` markers (see the token-id sketch after the templates).

# %%
instruction_template = "##ask questions:"
response_template = "##responsive:"
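# A sketch of the first alternative mentioned above (based on the trl documentation, not this
# repo's code): DataCollatorForCompletionOnlyLM also accepts pre-encoded token ids instead of a
# template string, which gives you full control over the exact id sequence the collator searches
# for. Illustration only; the collator actually used below is built from the string templates.
response_template_ids = tokenizer.encode(response_template, add_special_tokens=False)
alt_collator = DataCollatorForCompletionOnlyLM(response_template=response_template_ids, tokenizer=tokenizer, mlm=False)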

map_dtype = np.uint16 if len(tokenizer) < 65535 else np.uint32

def batched_formatting_prompts_func(example: list[dict]) -> list[str]:
    batch_txt = []
    for i in range(len(example['instruction'])):
        text = f"{instruction_template}\n{example['instruction'][i]}\n{response_template}\n{example['output'][i]}[EOS]"
        batch_txt.append(text)

    outputs = tokenizer(batch_txt, return_attention_mask=False)
    input_ids = [np.array(item, dtype=map_dtype) for item in outputs["input_ids"]]
    return {"input_ids": input_ids}

dataset = dataset.map(batched_formatting_prompts_func, batched=True,
                        remove_columns=dataset.column_names).shuffle(23333)

# 2.2 Define the data_collator
data_collator = DataCollatorForCompletionOnlyLM(
  instruction_template=instruction_template,
  response_template=response_template,
  tokenizer=tokenizer,
  mlm=False
)
empty_cuda_cahce = EmptyCudaCacheCallback()  # callback that empties the CUDA cache during training (its definition appears in the DPO section below)
my_datasets = dataset.train_test_split(test_size=4096)

# 5. Define training arguments
model = PhiForCausalLM.from_pretrained(sft_from_checkpoint_file)
args = TrainingArguments(
    output_dir=model_save_dir, per_device_train_batch_size=8, gradient_accumulation_steps=8,
    num_train_epochs=3, weight_decay=0.1, warmup_steps=1000, learning_rate=5e-5,
    evaluation_strategy='steps', eval_steps=2000, save_steps=2000, save_total_limit=3,
    report_to='tensorboard', optim="adafactor", bf16=True, logging_steps=10,
    log_level='info', logging_first_step=True, group_by_length=True,
)
trainer = Trainer(
    model=model, tokenizer=tokenizer, args=args,
    data_collator=data_collator,
    train_dataset=my_datasets['train'],
    eval_dataset=my_datasets['test'],
    callbacks=[empty_cuda_cahce],
)
trainer.train()
trainer.save_model(model_save_dir)

In short, it is all one set of code; the real details are hidden inside the implementations of
DataCollatorForLanguageModeling, Trainer, the tokenizer, and the CausalLM model.
Underneath all of that is PyTorch's implementation, although this post does not analyze the framework internals.

Compared with the SFT example in Hugging Face's trl library, the only difference is that this project still uses the plain Trainer:
/docs/trl/main/en/sft_trainer#train-on-completions-only

Here, DataCollatorForCompletionOnlyLM automatically constructs the samples for completion-only instruction fine-tuning:

You can use the DataCollatorForCompletionOnlyLM to train your model on the generated prompts only. Note that this works only in the case when packing=False.

For instruction fine-tuning data, instantiate the data collator by passing in the response template and the tokenizer.

Internally, it locates the token ids of the response section and marks only those as the prediction labels.

Here is the official Hugging Face example of using DataCollatorForCompletionOnlyLM with SFTTrainer for instruction fine-tuning:

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

instruction_template = "### Human:"
response_template = "### Assistant:"
collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)

trainer = SFTTrainer(
    model,
    args=SFTConfig(
        output_dir="/tmp",
        dataset_text_field = "text",
    ),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
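With the collator from the example above, a quick check (my own sketch, not part of the official docs) shows how the labels are constructed: every token up to and including the response template is set to -100, so only the assistant's reply contributes to the loss.

sample = tokenizer("### Human: What is 1+1?\n### Assistant: 2")
batch = collator([sample["input_ids"]])
print(batch["labels"][0])  # -100 for the prompt and template tokens; real token ids only for " 2"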

As for the difference between Trainer and SFTTrainer, there does not seem to be much of one:
/@sujathamudadla1213/difference-between-trainer-class-and-sfttrainer-supervised-fine-tuning-trainer-in-hugging-face-d295344d73f7


V. DPO code

import time
import pandas as pd
from typing import List, Optional, Dict
from dataclasses import dataclass, field
import torch
from trl import DPOTrainer
from transformers import PreTrainedTokenizerFast, PhiForCausalLM, TrainingArguments, TrainerCallback
from datasets import load_dataset

# 1. Define the SFT model path and the DPO data path
dpo_file = './data/dpo_train_data.json'
tokenizer_dir = './model_save/tokenizer/'
sft_from_checkpoint_file = './model_save/sft/'
model_save_dir = './model_save/dpo/'
max_seq_len = 320

# 2. Load the dataset

# Dataset token formatting
# DPO data format: [prompt (model input), chosen (positive example), rejected (negative example)]
# Append the `eos` token to all three columns of the DPO dataset; `bos` may or may not be added
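# A hypothetical example of a single record in dpo_train_data.json (illustration only, not taken from the repo):
# {"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}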
def split_prompt_and_responses(samples: dict[str, str]) -> Dict[str, str]:
    prompts, chosens, rejects = [], [], []
    batch_size = len(samples['prompt'])
    for i in range(batch_size):
        # add an eos token to signal the end of a sentence; used during generation
        prompts.append(f"[BOS]{samples['prompt'][i]}[EOS]")
        chosens.append(f"[BOS]{samples['chosen'][i]}[EOS]")
        rejects.append(f"[BOS]{samples['rejected'][i]}[EOS]")
    return {'prompt': prompts, 'chosen': chosens, 'rejected':rejects,}

tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)
dataset = load_dataset(path='json', data_files=dpo_file, split='train', cache_dir='.cache')
dataset = dataset.map(split_prompt_and_responses, batched=True).shuffle(2333)

# 4. Load the models
# `model` and `model_ref` start out as the same model; only `model` is trained, while `model_ref`'s parameters stay fixed
model = PhiForCausalLM.from_pretrained(sft_from_checkpoint_file)
model_ref = PhiForCausalLM.from_pretrained(sft_from_checkpoint_file)

# 5. Define a callback used during training
# Empty the CUDA cache periodically: DPO keeps two models loaded and uses a lot of GPU memory, so this helps mitigate the gradual memory growth on machines with limited VRAM
class EmptyCudaCacheCallback(TrainerCallback):
    log_cnt = 0
    def on_log(self, args, state, control, logs=None, **kwargs):
        self.log_cnt += 1
        if self.log_cnt % 5 == 0:
            torch.cuda.empty_cache()
            
empty_cuda_cahce = EmptyCudaCacheCallback()

# Training arguments
args = TrainingArguments(
    output_dir=model_save_dir, per_device_train_batch_size=2, gradient_accumulation_steps=16,
    num_train_epochs=4, weight_decay=0.1, warmup_steps=1000, learning_rate=2e-5, save_steps=2000, save_total_limit=3, report_to='tensorboard', bf16=True, logging_steps=10, log_level='info',
    logging_first_step=True, optim="adafactor", remove_unused_columns=False, group_by_length=True,
)
trainer = DPOTrainer(
    model, model_ref, args=args, beta=0.1,
    train_dataset=dataset,tokenizer=tokenizer, callbacks=[empty_cuda_cahce],
    max_length=max_seq_len * 2 + 16, # 16 for eos bos
    max_prompt_length=max_seq_len,
)
trainer.train()
trainer.save_model(model_save_dir)

VI. Closing Thoughts

Going deeper

Training code built on the transformers Trainer or on the trl library looks basically the same, because transformers and trl are well encapsulated.

To go a little deeper into the details, it is recommended to read or debug the following repositories; both are implemented directly in PyTorch:

  • /DLLXW/baby-llama2-chinese/tree/main
  • /jzhang38/TinyLlama/blob/main/pretrain/

Modifications

The following project is based on Phi2-mini-Chinese; it mainly replaces phi2 with qwen and uses qwen's tokenizer directly:
/jiahe7ay/MINI_LLM/

On my side, I tried pulling the qwen2 implementation out of the transformers library for training.
In practice, that is no different from using the Qwen2 model classes in transformers directly.

Those interested can swap in any mainstream model and adjust the configuration; the result is not far off.
The code is mainly for learning purposes: with a bit of time, a GPU or two, and a little effort, you can gather some data and run through the whole pipeline.

However, training a small-scale LLM that actually works well is not simple.

If you need to cite this article, please refer to:

LeonYi. (Aug. 25, 2024). An Interpretation of the Phi2-Mini-Chinese Project for Training Large Models from Scratch [LLM Training Series].

@online{title={[LLM Training Series] Phi2-Mini-Chinese Project Interpretation for Training Large Models from Scratch},
author={LeonYi},
year={2024},
month={Sep},
url={/justLittleStar/p/18405618},
}