In .NET 9, Microsoft introduced Microsoft.ML.Tokenizers, a library that provides .NET developers with powerful text tokenization capabilities.
1. What is Microsoft.ML.Tokenizers?
Microsoft.ML.Tokenizers is a powerful text tokenization library in the .NET ecosystem, designed to convert text into tokens for use in natural language processing (NLP) tasks. The library supports multiple tokenization algorithms, including Byte Pair Encoding (BPE), SentencePiece, and WordPiece, to meet the needs of different models and applications.
2. Main application scenarios
- Natural language processing (NLP): converting text into a token format the model can process, during both the training and inference phases.
- Preprocessing: tokenizing input text in tasks such as text analysis, sentiment analysis, and machine translation.
- Custom vocabularies: developers can import a custom vocabulary and use the BPE tokenizer to process domain-specific text data.
3. Supported models and services
The library is optimized for several popular model families, including:
- GPT series: such as GPT-4, o1, etc.
- Llama series.
- Phi series.
- BERT series.
In addition, the library integrates with other AI services, such as Azure and OpenAI, providing developers with a unified C# abstraction layer that simplifies interaction with AI services.
4. Main classes
1. Tokenizer class
The Tokenizer class acts as a pipeline for text processing: it accepts raw text as input and outputs a TokenizerResult object. Different models, pre-tokenizers, and normalizers can be plugged in to meet specific needs.
Main methods:
- `Encode(string text)`: encodes the input text into an object containing the list of tokens, the token IDs, and the token offset mappings.
- `Decode(IEnumerable<int> ids, bool skipSpecialTokens = true)`: decodes the given token IDs back into a string.
- `TrainFromFiles(Trainer trainer, ReportProgress reportProgress, params string[] files)`: trains the tokenizer model from input files.
Main properties:
- `Model`: gets or sets the model used by the tokenizer.
- `PreTokenizer`: gets or sets the pre-tokenizer used by the tokenizer.
- `Normalizer`: gets or sets the normalizer used by the tokenizer.
- `Decoder`: gets or sets the decoder used by the tokenizer.
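As a quick illustration of this pipeline, the sketch below wires a Tokenizer from a Bpe model and round-trips a string. This is a minimal sketch, assuming local vocab.json/merges.txt files and the preview-package API shapes described in this article; exact signatures may differ between package versions.

```csharp
using System;
using Microsoft.ML.Tokenizers;

class TokenizerPipelineSketch
{
    static void Main()
    {
        // Assumption: local BPE vocabulary and merges files exist at these paths.
        var model = new Bpe("vocab.json", "merges.txt");

        // The Tokenizer routes text through an optional pre-tokenizer/normalizer
        // and then through the model.
        var tokenizer = new Tokenizer(model);

        // Encode returns the tokens, token IDs, and offset mappings.
        var result = tokenizer.Encode("Hello, tokenizers!");
        Console.WriteLine(string.Join(", ", result.Ids));

        // Decode maps the token IDs back to a string.
        Console.WriteLine(tokenizer.Decode(result.Ids));
    }
}
```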
2. Model class
The Model class is the abstract base class for the models used in the tokenization process, such as BPE, WordPiece, or Unigram. Concrete models (e.g., Bpe) inherit from this class and implement its methods.
Main methods:
- `GetTrainer()`: gets the trainer object used to train the model.
- `GetVocab()`: gets the vocabulary mapping tokens to IDs.
- `GetVocabSize()`: gets the size of the vocabulary.
- `TokenToId(string token)`: maps a token to its token ID.
- `IdToToken(int id, bool skipSpecialTokens = true)`: maps a token ID back to its token.
- `Tokenize(string sequence)`: tokenizes a string sequence into a list of tokens.
- `Save(string vocabPath, string mergesPath)`: saves the model data to vocabulary and merges files.
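To make these methods concrete, here is a small hedged sketch that queries a vocabulary through the Model base-class API described above, using Bpe as the concrete model (the file paths are placeholders, and the nullable return of TokenToId is an assumption):

```csharp
using System;
using Microsoft.ML.Tokenizers;

class ModelVocabSketch
{
    static void Main()
    {
        // Any concrete model can be used through the abstract Model type.
        Model model = new Bpe("vocab.json", "merges.txt");

        Console.WriteLine($"Vocabulary size: {model.GetVocabSize()}");

        // Token -> ID lookup (assumed to return null/absent for
        // out-of-vocabulary tokens).
        var id = model.TokenToId("hello");
        Console.WriteLine($"'hello' -> {id}");

        // Tokenize splits a raw sequence into token objects; Value is assumed
        // to hold the token text.
        foreach (var token in model.Tokenize("unbelievable"))
        {
            Console.WriteLine(token.Value);
        }
    }
}
```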
3. Bpe class
The Bpe class represents the Byte Pair Encoding model and is one of the concrete implementations of the Model class. It splits text into subword units to improve the handling of out-of-vocabulary words.
Main properties:
- `UnknownToken`: gets or sets the unknown token, used when unknown characters are encountered.
- `FuseUnknownTokens`: gets or sets whether multiple consecutive unknown tokens may be fused into one.
- `ContinuingSubwordPrefix`: an optional prefix applied to any subword that only appears after another subword.
- `EndOfWordSuffix`: an optional suffix used to mark word-final subwords.
Main methods:
- `Save(string vocabPath, string mergesPath)`: saves the model data to vocabulary and merges files.
- `Tokenize(string sequence)`: tokenizes a string sequence into a list of tokens.
- `GetTrainer()`: gets the trainer object used to train the model and generate the vocabulary and merges data.
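The options above can be combined when setting up the model. The following is a minimal sketch assuming the get/set surface described in this article (the token string and file paths are placeholders; the exact surface may vary by package version):

```csharp
using Microsoft.ML.Tokenizers;

// Load a BPE model and configure how unknown input is handled.
var bpe = new Bpe("vocab.json", "merges.txt")
{
    // Token emitted when a character cannot be matched against the vocabulary.
    UnknownToken = "<unk>",

    // Collapse runs of consecutive unknown tokens into a single token.
    FuseUnknownTokens = true
};
```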
4. EnglishRoberta class
The EnglishRoberta class is a tokenizer model designed specifically for the English RoBERTa model. It inherits from the Model class and implements RoBERTa-specific tokenization logic.
Main properties:
- `PadIndex`: gets the index of the padding symbol in the symbol list.
- `SymbolsCount`: gets the length of the symbol list.
Main methods:
- `AddMaskSymbol(string maskSymbol)`: adds a mask symbol to the symbol list.
- `IdsToOccurrenceRanks(IReadOnlyList<int> ids)`: converts a list of token IDs into occurrence ranks (highest occurrence first).
- `OccurrenceRanksIds(IReadOnlyList<int> ranks)`: converts a list of occurrence ranks back into token IDs.
- `Save(string vocabPath, string mergesPath)`: saves the model data to the vocabulary, merges, and occurrence-mapping files.
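A hedged sketch of putting EnglishRoberta to work follows; the three-file constructor shape mirrors the Save method above but is an assumption, as are the file names:

```csharp
using Microsoft.ML.Tokenizers;

// Assumption: the RoBERTa vocabulary, merges, and occurrence-mapping files
// have been downloaded locally; the constructor parameters mirror Save above.
var roberta = new EnglishRoberta("vocab.json", "merges.txt", "dict.txt");

// Pair the model with its matching pre-tokenizer (see the next section).
var tokenizer = new Tokenizer(roberta, new RobertaPreTokenizer());
var encoded = tokenizer.Encode("RoBERTa tokenization in .NET");
```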
5. RobertaPreTokenizer class
The RobertaPreTokenizer class is a pre-tokenizer designed for the English RoBERTa tokenizer. It is responsible for the initial splitting and processing of text before tokenization.
Main methods:
- `PreTokenize(string text)`: pre-tokenizes the input text.
6. Split class
The Split class represents a substring obtained by splitting the original string. Each substring is represented by a token, which may ultimately correspond to part of the original input string.
Main properties:
- `TokenString`: gets the underlying split token string.
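The last two classes work together: RobertaPreTokenizer.PreTokenize produces Split values, and TokenString exposes each underlying substring. A minimal sketch, assuming PreTokenize returns an enumerable of Split as described above:

```csharp
using System;
using Microsoft.ML.Tokenizers;

class PreTokenizeSketch
{
    static void Main()
    {
        var preTokenizer = new RobertaPreTokenizer();

        // Pre-tokenization splits raw text into Split values before the model runs.
        foreach (var split in preTokenizer.PreTokenize("Hello, .NET tokenizers!"))
        {
            Console.WriteLine(split.TokenString);
        }
    }
}
```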
5. Sample code
To tokenize text for the GPT-4 model with the Microsoft.ML.Tokenizers library, you can follow these steps:
- Install the necessary NuGet package: make sure the project references the Microsoft.ML.Tokenizers package.
- Load the GPT-4 vocabulary and merge pair files: obtain the vocabulary (vocab.json) and merge pair (merges.txt) files for the GPT-4 model from an official or trusted source.
- Initialize the BPE model and load the vocabulary: use the `Bpe` class from the library to load the vocabulary and merge pair files.
- Create a tokenizer and perform tokenization and decoding: use the `Tokenizer` class to tokenize the input text and, when needed, decode it back to the original text.
The following is sample code:
```csharp
using System;
using Microsoft.ML.Tokenizers;

class Program
{
    static void Main(string[] args)
    {
        // Initialize the BPE model and load the GPT-4 vocabulary and merge pair files
        var bpe = new Bpe("path_to_vocab.json", "path_to_merges.txt");

        // Create a tokenizer
        var tokenizer = new Tokenizer(bpe);

        // Input text
        var inputText = "This is a text for testing.";

        // Tokenize the text
        var encoded = tokenizer.Encode(inputText);

        // Output the tokenization results
        Console.WriteLine("Tokens:");
        foreach (var token in encoded.Tokens)
        {
            Console.WriteLine(token);
        }

        // Decode back to the original text
        var decodedText = tokenizer.Decode(encoded.Ids);
        Console.WriteLine($"Decoded Text: {decodedText}");
    }
}
```
- Path settings: replace `"path_to_vocab.json"` and `"path_to_merges.txt"` with the actual file paths.
- Obtaining the vocabulary and merges files: make sure the vocabulary and merge pair files come from an official or trusted source and are compatible with the GPT-4 model.
- Model compatibility: although this code uses a generic BPE tokenizer, real applications may need adjustments for the specific requirements of the GPT-4 model.
Zhou Guoqing
2025/1/6