This article is the first in the [LLM Training Series]; it mainly covers the NanoGPT code and Chinese/English pre-training practice. For the latest version, see my Zhihu: /p/716442447
In addition to running the original NanoGPT code, I used Dream of the Red Chamber, the Four Great Classic Novels, and dozens of popular web novels to attempt Chinese GPT training with character-level tokenization, a self-trained tokenizer, and Qwen2's tokenizer respectively, and show the text-continuation results.
It can be used for quick Chinese GPT pre-training.
The second article will study NanoGPT through debug-style analysis. Follow-ups will also introduce some open-source small-LLM training projects, again in the same practice-plus-code-analysis fashion.
0. Preface
Open-source results keep getting better, large-model training is monopolized by the big players, and the vast majority of people will never need to train an LLM from scratch.
Most scenarios are application work aligned to business outcomes: more time goes into prompt optimization and data plus LoRA fine-tuning for various tasks, while domain fine-tuning or full-parameter fine-tuning is done less and less, or not at all.
The purpose of learning small-LLM training is to master the principles, and it also builds guiding experience for training your own LLMs:
- Projects accumulate real, hands-on experience, but it is just as important to go deeper into LLM training principles, which support the design of more feasible and reliable solutions. Practice and principles complement each other.
- Even after running small-model pre-training and incremental pre-training of an open-source LLM, something still feels missing. The reason is not having gone deep into the concrete training process and digested the principles. That kind of pre-training amounts to organizing data and launching scripts, and the room for tuning hyperparameters is really limited; it only counts as getting things to run.
- Progress changes day by day and the ToDo list keeps growing; rather than spending ever more time and energy tracking the frontier, it is better to spend the time digesting the fundamentals. Model algorithms are ever-changing but never stray from their roots; mastering the fundamentals lets the unchanging cope with all the changes. Some fundamentals that are stable and transferable: basic model algorithms, tokenizers, optimizers, the underlying PyTorch principles, the mathematics involved (statistics, matrices, calculus), and the principles of high-performance computing and computer systems.
Using an open-source LLM can feel like using a black box; by mastering the principles and practicing training small LLMs yourself, the black box of large models gets opened.
In particular, there is a very special feeling when a Base model you trained yourself, SFT'd on data you prepared, answers questions.
Anyway, I have been practicing small-scale LLM training recently. The learning approach: first run the project, then study the code, then modify it and practice.
1、Introduction of NanoGPT
NanoGPT is Karpathy's open-source 2023 reproduction of a GPT-2-scale LLM: /karpathy/nanoGPT.
The project has no special dependencies; given a corpus, a local notebook can quickly train its own small-scale causal language model.
1.1 Project Analysis
Project main code:
- data
  - Stores the raw data, as well as the tokenized data produced by preprocessing
  - Contains the preprocessing/tokenization code (supports very simple character-level tokenization, and GPT-2 tokenization via tiktoken)
  - Logic of the preprocessing code: split the data into training and validation sets, tokenize, save the token ids in numpy int format, and persist them as bin files; at training time, numpy's memmap reads the large tokenized files from disk batch by batch for mini-batch training (see the sketch after this list)
  - It is also very easy to train your own tokenizer with the Transformers tokenizers library, or to reuse an existing tokenizer such as Qwen2's
- config
  - Here you can customize the model configuration, as well as the training hyperparameters
  - Stores the configs for training and fine-tuning GPT-2, as well as for evaluating OpenAI's GPT-2
- model.py
  - The GPT implementation
- train.py
  - The GPT training code. Supports PyTorch Distributed Data Parallel (DDP) for single-machine multi-GPU and multi-machine multi-GPU distributed training
- sample.py
  - GPT inference/sampling
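As a concrete illustration of the memmap-based batching mentioned above, here is a minimal sketch along the lines of NanoGPT's get_batch; the paths and hyperparameters are my own illustrative assumptions:

import numpy as np
import torch

block_size, batch_size = 256, 64   # assumed values

def get_batch(split, data_dir="data/shakespeare_char"):
    # re-open the memmap on each call so the OS page cache, not Python, holds the large file
    fname = f"{data_dir}/{'train' if split == 'train' else 'val'}.bin"
    data = np.memmap(fname, dtype=np.uint16, mode="r")
    # pick random starting offsets and slice out contiguous windows of block_size tokens
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
    return x, y  # y is x shifted one position to the right (next-token targets)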
Characteristics:
- Particularly suitable for stepping through each stage of the training process in PyCharm's debugger, to gain a deeper understanding of the Transformer decoder training steps.
- Suitable for hacking on the code (directly swapping in modeling_qwen.py, or modifying the GPT model structure step by step).
Next, a brief description of the code.
Actually, the GPT model-structure code is very similar to the numpy GPT implementation from my earlier article (/p/679330102); the only real difference is migrating from numpy to torch.
1.2 Model code implementation of NanoGPT
The LLM-visualization project: /llm
Combining LLM-visualization with debugging the NanoGPT code in PyCharm works best.
The core of GPT is the Transformer decoder built from CausalSelfAttention + LayerNorm + MLP, which is the Block in the code below.
import math
import inspect
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F
class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False """
    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
    def forward(self, x):
        x = self.c_fc(x)    # project up to a higher dimension, (B, T, C) -> (B, T, 4*C)
        x = self.gelu(x)
        x = self.c_proj(x)  # project back down, (B, T, 4*C) -> (B, T, C)
        x = self.dropout(x)
        return x
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                        .view(1, 1, config.block_size, config.block_size))
    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None,
                    dropout_p=self.dropout if self.training else 0, is_causal=True)
        else:
            # manual implementation of attention
            ## 1. compute the attention scores
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))) # (B, nh, T, hs) x (B, nh, hs, T)
            ## 2. apply the causal mask: masked (upper-triangular) positions are set to -inf
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf')) # (B, nh, T, T)
            ## 3. softmax to obtain the attention coefficients
            att = F.softmax(att, dim=-1) # (B, nh, T, T)
            ## 4. attention dropout
            att = self.attn_dropout(att)
            ## 5. compute the attention output (each position is a weighted sum over the value vectors
            ##    of itself and the positions before it, using the causal attention coefficients)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        # re-assemble all head outputs side by side: concatenate the nh heads, then reshape
        y = y.transpose(1, 2).contiguous().view(B, T, C) # (B, T, nh, hs) -> (B, T, C)
        # output projection + residual dropout
        y = self.resid_dropout(self.c_proj(y))
        return y
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # pre-norm + residual around attention
        x = x + self.mlp(self.ln_2(x))   # pre-norm + residual around the MLP
        return x
GPT network structure configuration and GPT implementation
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304 # GPT-2's vocab_size is 50257; padded up to a multiple of 64 for training efficiency (reportedly up to +30%)
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0 # 0 for pre-training; a small dropout value can be set when fine-tuning
    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight # weight tying, see /method/weight-tying
        # init all weights, applied recursively to all submodules via self.apply
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
        # report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params()/1e6,))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size() # shape (b, t)
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)
        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        if targets is not None:
            # if targets (labels) are given
            # x of shape (b, t, d) is multiplied with the lm_head weight of shape (d, vocab_size) (inner-product scoring)
            logits = self.lm_head(x)
            # targets are the input token ids shifted one position to the right, i.e. next-token prediction
            # (one could shift by more than one position for multi-token prediction, e.g. predicting the 2nd-next token 🤣)
            # during training, predicted logits are produced at every position of the input sequence (a huge multi-class
            # classification); torch's cross_entropy takes unnormalized logits (they are normalized internally)
            # once every position is paired with its label, this is no different from classic sequence prediction
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference mode: only the logits at the last position are needed, i.e. the distribution of the next token
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None
        return logits, loss
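To sanity-check the shapes and the loss, a quick smoke test can be run against the full model.py (this snippet is my own illustration; it assumes the complete class, including get_num_params, is available):

# instantiate a tiny config and run one forward pass with random data
config = GPTConfig(block_size=64, vocab_size=100, n_layer=2, n_head=2, n_embd=64, dropout=0.0)
model = GPT(config)
idx = torch.randint(0, config.vocab_size, (4, 32))      # (b, t) token ids
targets = torch.randint(0, config.vocab_size, (4, 32))  # next-token labels, same shape
logits, loss = model(idx, targets)
print(logits.shape, loss.item())  # torch.Size([4, 32, 100]); loss near ln(100) ≈ 4.6 at init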
GPT prediction code
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    """
    Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
    the sequence max_new_tokens times, feeding the predictions back into the model each time.
    Most likely you'll want to make sure to be in model.eval() mode of operation for this.
    """
    for _ in range(max_new_tokens):
        # if the sequence context is growing too long we must crop it at block_size
        idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
        # call the model's forward to get the classification logits for the input sequence
        logits, _ = self(idx_cond)
        # scale the last-position logits by the temperature before the softmax
        logits = logits[:, -1, :] / temperature
        # top-k truncation
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the categorical distribution (a multinomial with a single draw);
        # under the hood, a pseudo-random number roughly uniform on [0, 1) is generated from the
        # random seed, and the category is determined by which probability interval it falls into
        idx_next = torch.multinomial(probs, num_samples=1)
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
The only slightly tricky parts to understand are the torch split and merge operations on 3- and 4-dimensional tensors; the actual GPT is implemented compactly in no more than 150 lines.
This GPT model-structure code is basically no different from my earlier article explaining the picoGPT numpy implementation; it is simply rewritten in torch, whose computations can be automatically differentiated:
/p/679330102
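For completeness, here is an illustrative end-to-end sampling call (my own snippet, mirroring what NanoGPT's sample.py does; it assumes the character-level meta.pkl written by data/shakespeare_char/prepare.py is available, and in practice a trained checkpoint would be loaded into the model first):

import pickle

with open('data/shakespeare_char/meta.pkl', 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']  # character <-> id lookup tables

config = GPTConfig(block_size=256, vocab_size=meta['vocab_size'], n_layer=2, n_head=4, n_embd=128)
model = GPT(config)   # load a trained state_dict here in real use
model.eval()

start_ids = [stoi[c] for c in "ROMEO: "]
x = torch.tensor(start_ids, dtype=torch.long)[None, ...]          # shape (1, t)
y = model.generate(x, max_new_tokens=200, temperature=0.8, top_k=50)
print(''.join(itos[int(i)] for i in y[0]))                         # decode ids back to characters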
1.3 NanoGPT training code
Three ways to initialize the model:
- 1. Training from scratch
- 2. Loading a previously trained checkpoint and continuing training (which can also be used for fine-tuning)
- 3. Continuing training from OpenAI's GPT-2 pre-trained weights
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout) # start with model_args from command line
if init_from == 'scratch':
    # init a new model from scratch
    print("Initializing a new model from scratch")
    # determine the vocab size we'll use for from-scratch training
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
elif init_from == 'resume':
    print(f"Resuming training from {out_dir}")
    # resume training from a checkpoint.
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    checkpoint_model_args = checkpoint['model_args']
    # force these config attributes to be equal otherwise we can't even resume training
    # the rest of the attributes (e.g. dropout) can stay as desired from command line
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = checkpoint_model_args[k]
    # create the model
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    # fix the keys of the state dictionary :(
    # honestly no idea how checkpoints sometimes get this prefix, have to debug more
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)
    iter_num = checkpoint['iter_num']
    best_val_loss = checkpoint['best_val_loss']
elif init_from.startswith('gpt2'):
    print(f"Initializing from OpenAI GPT-2 weights: {init_from}")
    # initialize from OpenAI GPT-2 weights
    override_args = dict(dropout=dropout)
    model = GPT.from_pretrained(init_from, override_args)
    # read off the created config params, so we can store them into checkpoint correctly
    for k in ['n_layer', 'n_head', 'n_embd', 'block_size', 'bias', 'vocab_size']:
        model_args[k] = getattr(model.config, k)
Training code
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)

X, Y = get_batch('train') # fetch the very first batch
t0 = time.time()
local_iter_num = 0 # number of iterations in the lifetime of this process
raw_model = model.module if ddp else model # unwrap DDP container if needed
while True:
    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break
    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        if ddp:
            model.require_backward_grad_sync = (micro_step == gradient_accumulation_steps - 1)
        with ctx:
            logits, loss = model(X, Y) # forward propagation
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward() # backward propagation
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and update the GradScaler
    scaler.step(optimizer)
    scaler.update()
    # zero the gradients and release the memory they occupy
    optimizer.zero_grad(set_to_none=True)
    # (timing / MFU logging omitted here)
    iter_num += 1
    # stop once the maximum number of training steps is reached
    if iter_num > max_iters:
        break
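The get_lr called at the top of the loop is NanoGPT's warmup-plus-cosine-decay learning-rate schedule; an approximate reproduction is shown below for reference (learning_rate, min_lr, warmup_iters, and lr_decay_iters are the config globals):

import math

def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return the minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to the minimum learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff ranges 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)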
2、Project practice
NanoGPT's code (and, by extension, any GPT torch code) is inherently agnostic to which tokenizer is used; after all, it only requires the input to be a sequence of integer token ids.
No matter what modality the input data is (time series, audio, images, videos, user click-behavior sequences, etc.), it can be trained as soon as it is tokenized.
So, whether it is this code or another project, the changes outside the model essentially come down to how the tokenizer encodes during preprocessing and decodes during inference.
2.1 Shakespeare character level shakespeare_char
Here, download the tinyshakespeare text and reproduce the result with the following 3 steps:
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py
python sample.py --out_dir=out-shakespeare-char
Configuration
root@eais-bjyo5z6grbpbqo36a86q-85b696d67f-zhx46:/mnt/workspace/nanoGPT-master# python train.py config/train_shakespeare_char.py
Overriding config with config/train_shakespeare_char.py:
# train a miniature character-level shakespeare model
# good for debugging and playing on macbooks and such
out_dir = 'out-shakespeare-char'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often
# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False
wandb_log = False # override via command line if you like
wandb_project = 'shakespeare-char'
wandb_run_name = 'mini-gpt'
dataset = 'shakespeare_char'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters
# baby GPT model :)
# n_layer = 6
# n_head = 6
# n_embd = 384
##### debug ####
n_layer = 2
n_head = 4
n_embd = 128
###############
dropout = 0.2
learning_rate = 1e-3 # with baby networks can afford to go a bit higher
max_iters = 5000
lr_decay_iters = 5000 # make equal to max_iters usually
min_lr = 1e-4 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small
warmup_iters = 100 # not super necessary potentially
# on macbook also add
device = 'cpu' # run on cpu only
compile = False # do not torch compile the model
tokens per iteration will be: 16,384
found vocab_size = 65 (inside data/shakespeare_char/)
Initializing a new model from scratch
number of parameters: 0.40M
/usr/local/lib/python3.10/site-packages/torch/amp/grad_scaler.py:131: UserWarning: torch.cuda.amp.GradScaler is enabled, but CUDA is not available. Disabling.
  warnings.warn(
num decayed parameter tensors: 10, with 434,304 parameters
num non-decayed parameter tensors: 5, with 640 parameters
using fused AdamW: False
step 0: train loss 4.1860, val loss 4.1832
iter 0: loss 4.1887, time 38322.89ms, mfu -100.00%
iter 10: loss 3.9036, time 417.94ms, mfu 0.04%
iter 20: loss 3.6104, time 400.67ms, mfu 0.04%
iter 30: loss 3.4437, time 452.36ms, mfu 0.04%
iter 40: loss 3.2072, time 445.09ms, mfu 0.04%
iter 50: loss 2.9770, time 397.14ms, mfu 0.04%
iter 60: loss 2.8188, time 482.85ms, mfu 0.04%
iter 70: loss 2.7120, time 426.90ms, mfu 0.04%
iter 80: loss 2.6770, time 434.85ms, mfu 0.04%
iter 90: loss 2.6223, time 410.70ms, mfu 0.04%
iter 100: loss 2.5966, time 444.17ms, mfu 0.04%
iter 110: loss 2.5599, time 378.64ms, mfu 0.04%
iter 120: loss 2.5563, time 459.60ms, mfu 0.04%
iter 130: loss 2.5470, time 482.69ms, mfu 0.04%
Continuation results after training for ~200 steps; the English characters are not very coherent:
root@eais-bjyo5z6grbpbqo36a86q-85b696d67f-zhx46:/mnt/workspace/nanoGPT-master# python sample.py --out_dir=out-shakespeare-char --device=cpu
Overriding: out_dir = out-shakespeare-char
Overriding: device = cpu
number of parameters: 0.40M
Loading meta from data/shakespeare_char/...
HWAF heve wabeacar
F s sele in
WAnd hur stout arined ato ce h, tofow, couriourtheald ath as and mathise m tetil f chom, s ke ar tanetr ieaifo he s mat y dsakthe go irsten ar:
Anoupat he, pit thinougaris inge veeathed 'oilithangis, wey, ored sh toe t,
Thende me m pon.
I me beng wed timy youlre tshofundt,
And t corst thand?
MENSTUDUCWIO:
Pane mashele he y hayouthe.
Y I w INIxpare fetodis be mimicos w INILARY we atidses fand s h andathe tofery dad ge wn withisthatod br fr anfanor hith, he shat t
---------------
BEENoo arot, toth ly whond wistore thou hed we che tif d win chathere he clomapal ond br t inde heeronghen'theathowng ourat ads the maimes one shiten he ove whese thitha mbath fr ir thanore acheit t hes d inoce t whinkert sackitrmome t ond hend ind,
And n t s areroras y eethit fass ad foueld se nout as ce oore aldeath nchitherds w om y owin d hasing spllore
We thoururake,
An be mame tipy ine rar, m.
I burisorthe ng ce
MENEROLENCEN:
Achollamucais y the co as sererrury by s yodounor theer itorean
2.2 The Dream of the Red Chamber Character Level GPT
It's easy: download the text of Dream of the Red Chamber, place it as the input text under data/shakespeare_char/, and then train with the same 3 steps as shakespeare_char.
Because shakespeare_char tokenizes at the character level, i.e. at the granularity of individual Chinese characters, the resulting vocabulary is roughly 4,000 characters.
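For reference, the character-level preprocessing follows roughly this pattern (a sketch modeled on data/shakespeare_char/prepare.py; the file names here are assumptions):

import pickle
import numpy as np

with open('input.txt', 'r', encoding='utf-8') as f:   # the renamed Dream of the Red Chamber text
    data = f.read()

# every unique character (Chinese character or punctuation mark) becomes one token id
chars = sorted(list(set(data)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# 90/10 train/val split, encoded as uint16 ids and persisted as bin files for memmap reading
n = len(data)
train_ids = np.array([stoi[c] for c in data[:int(n * 0.9)]], dtype=np.uint16)
val_ids = np.array([stoi[c] for c in data[int(n * 0.9):]], dtype=np.uint16)
train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'itos': itos, 'stoi': stoi}, f)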
Part of the character-level vocabulary:
length of dataset in characters: 875,372
all the unique characters: (the full list of ~4,400 unique Chinese characters and punctuation marks in the corpus is omitted here)
vocab size: 4,432
train has 787,834 tokens
val has 87,538 tokens
Training configuration
tokens per iteration will be: 16,384
found vocab_size = 4432 (inside data/shakespeare_char/)
Initializing a new model from scratch
number of parameters: 0.96M
num decayed parameter tensors: 10, with 993,280 parameters
num non-decayed parameter tensors: 5, with 640 parameters
using fused AdamW: False
Continuation results after training for ~250 steps; the text is not very coherent and repeats overlapping phrases many times:
python sample.py --out_dir=out-shakespeare-char --device=cpu --start="Daiyu saw Baoyu and said"
Overriding: out_dir = out-shakespeare-char
Overriding: device = cpu
Overriding: start = Daiyu saw Baoyu and said
number of parameters: 0.96M
Loading meta from data/shakespeare_char/...
Daiyu saw Baoyu and said, "You're here for this!" Baoyu listened to the clothes, then not with the attacker busy said: "You are interesting to be sure that he is not?" Daiyu laughed:
A bite, I came back she came back, in the garden laughed: "You are not I said:" I wait for her? Today, is, put your grandmother grandmother grandmother, I want to invite you a Jin Gui again to scold Baoyu quietly said: "You do not good words." Baoyu busy in also not interesting, do not use some 'so good, she, it is not: "You still have to!"
Xiangling said: "my original, and then we ate. Now years really drink, shall not."
Said: "I myself laughed at 'what?" Bao Yu." Bao toad smiled and said: "grandmother's home to." Bao Yu looked at this original.
Bao Yu listened.
Phoenix said: "We." Said: "He only saw things there, really if some things." Phoenix said: "Miss Lin came back to look at Dai Hairpin said:" Baoyu said: "Well. You can't laugh at home," here there is a sound, then asked: "We are not good coincidence you pain you have general, good sister, also don't know, only you went to his, only afraid to get? Where we called you are not these outlaws, you say is: "bring, anxious, not, you actually is what kind of, call me no no not say?" Qingwen said: "You are just there." Ming Yan under it?" Baoyu, Baochai listened, then asked what it means." Bao Yu said: "not yet made what kind of children, is this person laughed:" I do not know, said: "you a while, then laugh." Xiang Yun to this one is not
---------------
Daiyu saw Baoyu said a small boy, then to Baoyu laugh: "He will be son, I can strongly said we are I am very you call me is not something, you have to be good, I am in my. When out good or bad gas good to say know, you what you I see people listen." Aunt Xue laughed: "There is nothing good to talk about, called ay yo, but also do not allow you to I did not go up quickly to come, I wait." This said you take me here to play laugh: "This, fast is the slave inverted is me is this person also good laugh: "You come, but there." Baoyu said: "ay? Siqi just promised, you and I and do not dare to ask us to know say, and do not know: "You so! You also can't say, have to eat, then say I say." Xiangyun: "Your lordship too words, and took a handful of what is this? I go." And not just how!" Aunt Xue said: "even no milk grandmother grandmother side body a handful of what not, you are not he sat words, so go just a not now have to, no one said: 'Moreover do not know: "you, good thing is that I to the past!" said, and no one have heard the milk grandmother. Said, also no one have hit hear milk call you guys I was, just up." Jia mother said: "your mother-in-law, also do not come over, just I only pick, Bao hairpin, etc. you also begged again here of our do the old lady are remembered on the body is not good, looked." Phoenix second master said a girl's, as long as you guys also only in the heart, themselves to the road: "You came to come back." Jia Lian then from I just brother brother, no 'but said: "where to Fang Niang and we have a good deal to play, call people busy road:
Changed the model configuration and trained a little longer. The result looks much smoother
batch_size = 32
block_size = 256 # context of up to 256 previous characters
n_layer = 24
n_head = 8
n_embd = 64 * n_head
step 0: train loss 8.4835, val loss 8.4883
...
step 2250: train loss 1.2916, val loss 4.3947
iter 2470: loss 1.4080, time 446.68ms, mfu 1.51%
iter 2480: loss 1.3687, time 447.08ms, mfu 1.51%
iter 2490: loss 1.4792, time 447.05ms, mfu 1.51%
root@eais-bjsrvif50zbuisrd5dfe-64847b6ff7-496mt:/mnt/workspace/buildLLM/nanoGPT-master# python sample.py --out_dir=out-tonestory-char
Overriding: out_dir = out-tonestory-char
number of parameters: 40.03M
Loading meta from data/shakespeare_char/...
This You insisted on not wanting You to come over and call: "Ai yo, this can be your life?" You laugh: "I don't believe it, it's the treasure girl is the treasure brother." You laughed: "No, you just talk. It's the kid, isn't it isn't it?" You're laughing: "Now because of the little child, you say it is not 'jin yu' 'jin' 'jin' 'jin' 'up' ', 'Jin' Huixiang ", Feng sister-in-law asked with a smile:" Who? Don't with you are also seen, tell your old lady. Now in addition to your family, is your old man, to here, nothing." You smiled and said, "What things?" You laughed: "I do not and you. You and not sorry, and not so, just say yes. You deliberated on to the room, but someone." Phoenix sister said: "The old lady said, I often say, but it is to be."
Because she said: "This girl, the masters asked what she is blessed. I heard that you say, we really doubt." A wall walk, one side sitting, one side and sent a girl to come, but a granny to go. You then said: "sister-in-law, you do not complain." Mrs. Wang heard, the heart is again happy, because asked: "The old lady does not know." Mrs. Wang laughed and said, "Our words are not afraid. You small, where know."
Are talking between, only to see the two Mrs. Wang's room to the sister sister Feng. Sister Feng will send Caixia and a mother-in-law to report that: "Here a girls, have got." Zhou Rui family said: "your rules, we only in front of the crowd is not, do not talk to Ping'er, can not be." Zhou Rui's family asked: "This time where to invite the Grand Master?" Zhou Rui's family said: "You guys
---------------
This day is exactly, only to hear a burst of Yu Village laugh. Come in, suddenly saw Baoyu came in, then lifted Daiyu, laughed: "You good ghost, where do I go?" Bao Yu heard, then called: "Lin sister sister said this, how do you not worry?" Baoyu heard, and then realized what it means, the heart is puzzled.
Suddenly heard the courtyard door on the people said: "Lin sister sister where to go?" Baoyu heard, heart really hesitant. Anxious to think, it was: "Lin sister sister suddenly someone came to be those girls. Back I was sick, do not dare to sleep, slept all night, lazy to pick up, everyone scattered." Then go out. To know the end, let's listen to the next chapter.
The 18th time the idiot woman fabricated hibiscus fetish idiotic gentleman sad deep idiotic love
Bao Yu's heart, Xiangyun's heart, and her heart.
Bao Hairpin married Shichun princely death Xiaoxiang Yun disease
That said, Bao Yu would like to come, see Bao Yu back, square out, then make a resignation, not no words. Baoyu at this time in mind doubt, and he spoke also dare not answer, then went out and walked to Mrs. Wang's room, rushed over, called to ask: "Bao brother called me to go." Bao Yu heard, and ordered people to come: "Mrs. Please come out." Back to one side to go back, one side to come back to jia mother. Jia mother met again, asked her brother's jade is good, and then over into the garden fell in the garden, not to mention.
It is to go back, only to see Zhou Rui walked in, one side came to ask: "the family's sister-in-law, where she remembered me?" Zhou Rui, Zhou Rui family said: "How do you and children say Zhou Rui?" Zhou Rui's family sneered: "Today the family big gone, you see this silver, they will recognize!" Zhou Rui's family to Zhou Rui's family's maid to see her mother back to the room.
---------------
There is a small girl holding the box, laughed: "This is the wife, I do not know where, take this money quickly take it!" The little girl ignored, only to take the money, handed with his little girl, said: "I will not be allowed! Finished, quickly please you go. The old lady does not know."
Jia mother asked, that Qiao Sister children embarrassed, only to follow, suddenly saw a granny out back said: "Grandma is not big." Jia mother listened to the next pull down a small girl to, while saying: "Mrs. can send someone to send someone." And said: "What is it? Our two girls and more than different, still have time." Back to order people to take out, said a mother-in-law to come, come in, resign.
Here Phoenix just to go into the house, see came in, see Feng Sister Er is wondering not very bad, then said: "My words do not weigh, instantly called me to hear." Sister Feng heard can not help, said: "You do not have to be busy, it is so, I can not care very much." Ping'er said: "It is not good to see, I thought about it, has gone, I just look at you." Ping'er said: "You came, I have gone over, I also want to invite you back to the old Mrs. Ann." Feng sister heard, namely get up to let, and said: "You only don't come." Ping'er Yi'er then ordered: "Go tell the second grandmother, do not stand slightly a break to go in." Ping'er promised to go out, just to come, only to see the girl came, said: "big grandma's house here, we ask to understand." Jia mother asked: "That girl's family is not good, just say some of what happened." Ping'er heard and laughed: "You do not believe, tell your old lady." Ping'er heard, and said, "This is the people at Feng Ya-tou's place, what else is going on." Ping'er
---------------
Eat, Bao Yu because asked: "shall not be, attacker sister's hand, where the rouge bamboo shoots sweet, just steep some, and fried." Baoyu laughed: "Not good, what to eat, just wake up to go." Assailant said: "These things, we call her good to send." Attacker laughed: "We give her this." Attacker said: "Here is not what." Bao Yu said, "You don't need to bring it up, just follow, I don't want it." Assailant laughed: "You go naturally." Attacker said: "You do not go, I do not believe. He also does not go, do not have to bother him to come." Assailant: "Don't be angry, I also have a little." Attacker: "Good sister, I just went to sleep, sleep down." Baoyu said: "How are you?" Assailant: "Waiting in the house." Attacker: "Tomorrow come back, there is something to say is the old lady reward up." Assailant: "Assailant sister also do not have to say that one." Baoyu said: "Yesterday here, the night has been four, to the old lady to rest in the morning, the old lady rest to rest, not here, but also to us." Attacker said: "You can only rest assured that you sleep." Attacker: "This time to sleep, carefully said." Attacker: "I don't want to go out to go back, let's stroll around to stroll around." Baoyu said: "Tomorrow then sleep." Attacker: "You don't have to go." Attacker: "Both lazy, just you should also be assured." Assailant: "I'm not sleepy, you should sleep." Assailant: "I also fell asleep, let's sleep again." Assailant: "Sleep is not necessary to sleep, and late a rest." Assailant: "It is already, and good." Baoyu said: "you also sleep, days and sleep." Attacker
---------------
One day, only to see Mrs. Wang's words with tear stains, forgetting all the things. The people heard, all bluffed, so busy asked: "What about your words?" Mrs. Wang said: "You know that outside the daughter is not that a talking child?" Jia mother said: "I heard a daughter there again yesterday." Daiyu said: "What is it called?" Baoyu said: "I yell drink wine, you to the south." And reached out to move. Jia mother also said: "no, you mix outside said. You here and bad gas, no need to talk, sooner or later on the back." Shi Xiangyun said: "No." Baoyu laughed: "not, what else to say, we yesterday a class of big wine, are set to drink." Jia mother asked: "This word?" Jia mother said: "your family sit more, you also big dinner." Yuanyang said: "There house also do not eat, our meals are set wine." Yuanyang laughed: "Not necessary. Yesterday said, I said that the reward crabs to go." Yuanyang promised. mandarin ducks and said: "today a come, you talk, each eat, do not eat, our drink a cup." mandarin ducks laughed: "there is before in the garden, all sung." Yuanyang early fly washed his hands, according to Jia mother said: "You just please, we came." Jia mother said: "you here also do not need to go over to stroll, otherwise; we are not laughing." Yuanyang said, "Yes." Yuanyang said, "Rewarded, rewarded you for eating these." Yuanyang said: "The last crab, we ate two." Yuanyang promised, with then walked out. Yuanyang asked: "Mrs. asked not dare to eat, we said, just take it back." Yuanyang said: "Today only came back, said over there did not eat, do not have to
---------------
At this time, only to see a person in the day one day, are like from where to go. Bao Yu laughed: "I heard said Mrs. words, or is it?" Attacker busy said: "You heard Miss Lin, such a black heart fine, why am I sick, I went over." Baoyu said: "Miss Lin yesterday only master told Xueyan to tell Xueyan, only just told Xueyan it." Tangchun said: "You know your second brother and Ping'er said, tell me, I'm sick." Tangchun said: "Not allowed and your old man told the second milk said, only that I dare not say." Said then ordered to tanchun room. Ping'er see her like this, the heart can not think of their own heart, they died, they said: "Fortunately, good, people do not know, over for news." Ping'er panicked, then turned back to Ping'er said: "Sister sister so, big grandma and then go out to ask for peace, and sent people to see." Ping'er said with tears in her eyes: "She is not an important matter, she is also a small matter. Where is she? Both so, you do not want her to go, not tired of her to go, first tell people still." Ping'er heard, surprised: "Naturally, you are such a nonsense, but also in the past." Tangchun said: "you say, we naturally also not fast, I do not know is her here, not how willing to talk to her?" Ping'er said: "No, also they can not find." Tangchun said: "People learn the words, Mrs. said is too house living it." Ping'er busy promised, really a word to solve. Tang Chun only had to go again, with Ping'er to the inner room, Tang Chun again to accompany the smile. Ping'er laughed and said: "You eat, just tell you." Ping'er said: "Ping girl although in the old lady there, I do not know what is the reason?" Ping'er said: "Why not go home?
2.3 Chinese GPT based on BBPE tokenizer
This is also simple: write a script yourself, set the vocabulary size, train a byte-level byte-pair-encoding (BBPE) tokenizer directly with the tokenizers library, save the vocabulary and configuration, and then load it during sampling/inference.
The data is then tokenized with this trained tokenizer.
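A minimal sketch of what such a script can look like with the HuggingFace tokenizers library; the paths, vocabulary size, and special tokens below are my own illustrative choices, not the exact ones used here:

from tokenizers import ByteLevelBPETokenizer

# train a byte-level BPE tokenizer on the raw corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/hongloumeng/input.txt"],   # assumed corpus path
    vocab_size=10000,                        # chosen vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("out-tokenizer")        # writes vocab.json and merges.txt

# later, in preprocessing and in sampling/inference:
tokenizer = ByteLevelBPETokenizer.from_file(
    "out-tokenizer/vocab.json", "out-tokenizer/merges.txt"
)
ids = tokenizer.encode("Daiyu saw Baoyu and said").ids   # encode text to token ids for training
text = tokenizer.decode(ids)                             # decode token ids back to text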
Here I separately trained on Dream of the Red Chamber, on the Four Great Classic Novels, and on dozens of popular web novels, with different model configurations (referencing the deep-and-narrow architecture of the recent MobileLLM), on single or multiple GPUs.
torchrun is used to run PyTorch's built-in single-machine multi-GPU DDP training; on small models there is no real difference from DeepSpeed.
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
Training experience:
- The training here mainly pursues overfitting; the validation-set loss is not a very meaningful reference (I was too lazy to split the dataset properly)
- The Adam optimizer's state takes up 2x the model's size in the checkpoint (first- and second-moment estimates of the gradients)
- A tokenizer trained on a reasonably large corpus generates somewhat smoother results than a character-level tokenizer, presumably because its vocabulary is larger than the character set and compresses the text more; I did not do a detailed comparison, however
- Continuity of the continuations:
  - On a single novel, the model fits its characters well enough to continue the story
  - Trained only on the Four Great Classic Novels, it can just barely mix characters from different novels into one continuation, e.g. Lin Chong appearing alongside Daiyu
  - On the web-novel corpus, forcing a mix of characters from different novels into one continuation is basically beyond the model 😅
(Continuation samples for the various corpora are omitted here.)
2.4 GPT based on Qwen2 tokenizer
Following the self-trained BBPE tokenizer of section 2.3, here the Qwen2 tokenizer is used directly (a sketch of the prepare step is given at the end of this subsection).
The training result's output is basically garbled 💔
This is the result both on Dream of the Red Chamber and on a web-novel corpus of nearly 100 million characters. Presumably the corpus is too small: Qwen2's vocabulary has roughly 150,000 tokens, the corpus covers only a single genre and is far too small for it, so most token embeddings never get trained and the model trains poorly.
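For reference, a sketch of how the Qwen2 tokenizer can be plugged into the prepare step; the checkpoint name and paths are assumptions, and note the dtype change, since Qwen2's vocabulary exceeds the uint16 range used by NanoGPT's default prepare scripts:

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")  # assumed checkpoint name

with open("data/hongloumeng/input.txt", "r", encoding="utf-8") as f:  # assumed corpus path
    text = f.read()

ids = tokenizer(text, add_special_tokens=False)["input_ids"]
# Qwen2's vocab (~150k) does not fit in uint16, so uint32 is used here;
# the matching read in get_batch must then use the same dtype
arr = np.array(ids, dtype=np.uint32)
arr.tofile("data/hongloumeng/train.bin")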
2.5 Replication of GPT2 based on openwebtext dataset
This part directly follows the original project's procedure. The difficulty is getting the data, because openwebtext and the tiktoken GPT-2 files cannot be downloaded from behind the firewall.
Here I found a NanoGPT project on Kaggle, downloaded its preprocessed data (6.76GB), and trained on it directly.
Kaggle program link:/code/carrot1500/nanogpt-trained-on-openwebtext
python3 train.py config/train_gpt2.py --wandb_log=False \
--max_iters=1000 \
--log_interval=1 \
--eval_interval=200 \
--eval_iters=20 \
--learning_rate="0.0001" \
--gradient_accumulation_steps=4 \
--batch_size=12 \
--n_layer=8 \
--n_head=8 \
--n_embd=512 \
--compile=False \
--out_dir=out
Overriding config with config/train_gpt2.py:
# config for training GPT-2 (124M) down to very nice loss of ~2.85 on 1 node of 8X A100 40GB
# launch as the following (e.g. in a screen session) and wait ~5 days:
# $ torchrun --standalone --nproc_per_node=8 config/train_gpt2.py
wandb_log = True
wandb_project = 'owt'
wandb_run_name='gpt2-124M'
# these make the total batch size be ~0.5M
# 12 batch size * 1024 block size * 5 gradaccum * 8 GPUs = 491,520
batch_size = 12
block_size = 1024
gradient_accumulation_steps = 5 * 8
# this makes total number of tokens be 300B
max_iters = 600000
lr_decay_iters = 600000
# eval stuff
eval_interval = 1000
eval_iters = 200
log_interval = 10
# weight decay
weight_decay = 1e-1
Overriding: wandb_log = False
Overriding: max_iters = 1000
Overriding: log_interval = 1
Overriding: eval_interval = 200
Overriding: eval_iters = 20
Overriding: learning_rate = 0.0001
Overriding: gradient_accumulation_steps = 4
Overriding: batch_size = 12
Overriding: n_layer = 8
Overriding: n_head = 8
Overriding: n_embd = 512
Overriding: compile = False
Overriding: out_dir = out
tokens per iteration will be: 49,152
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 50.93M
num decayed parameter tensors: 34, with 51,445,760 parameters
num non-decayed parameter tensors: 17, with 8,704 parameters
using fused AdamW: True
step 0: train loss 10.8790, val loss 10.8800
iter 0: loss 10.8837, time 12969.53ms, mfu -100.00%
iter 1: loss 10.8779, time 2646.03ms, mfu -100.00%
iter 2: loss 10.8718, time 2645.71ms, mfu -100.00%
iter 3: loss 10.8618, time 2647.04ms, mfu -100.00%
iter 4: loss 10.8776, time 2643.33ms, mfu -100.00%
iter 5: loss 10.8743, time 2644.41ms, mfu 2.12%
...
iter 996: loss 6.2718, time 2657.69ms, mfu 2.11%
iter 997: loss 6.2626, time 2654.19ms, mfu 2.11%
iter 998: loss 6.3724, time 2657.96ms, mfu 2.11%
iter 999: loss 6.1987, time 2657.67ms, mfu 2.11%
step 1000: train loss 6.3108, val loss 6.2706
saving checkpoint to out
iter 1000: loss 6.3944, time 13053.12ms, mfu 1.94%
3. Summary
The code is basically off-the-shelf, but training a good LLM that is actually usable on a small scale is not easy.
Research determines the model architecture and how much data is needed; then come data preparation, data mixing, deduplication, filtering and cleaning, and data augmentation, followed by collecting, cleaning, labeling and generating SFT data, and alignment.
ToDo
- The modified code will be put on a git repo later.
- NanoGPT is just an implementation of the classic GPT-2; its structure still differs from current LLMs, e.g.:
  - Rotary position embedding (RoPE), the key to long-context modeling
  - GQA (grouped-query attention)
  - RMSNorm, a LayerNorm variant that only rescales and does not re-center (see the sketch below)
  - SwiGLU
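As an illustration of the RMSNorm point above, a minimal sketch (my own, not NanoGPT code) of the difference from LayerNorm:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, ndim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(ndim))
    def forward(self, x):
        # unlike LayerNorm, no mean subtraction: only rescale by the root mean square, then apply a learned gain
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms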
If you need to cite this article, please refer to:
LeonYi. (Aug. 25, 2024). [LLM Training Series] NanoGPT Source Code Detailed Explanation and Chinese GPT Training Practice.
@online{leonyi2024nanogpt,
  title={[LLM Training Series] NanoGPT Source Code Detailed Explanation and Chinese GPT Training Practice},
  author={LeonYi},
  year={2024},
  month={Aug},
  url={/justLittleStar/p/18379771},
}
Bibliography
[1] Most of the illustrations are from /illu