In .NET 9, Microsoft introduced Microsoft.ML.Tokenizers, a library that provides .NET developers with powerful text tokenization capabilities.
1. What is Microsoft.ML.Tokenizers?
Microsoft.ML.Tokenizers is a powerful text tokenization library in the .NET ecosystem, designed to convert text into tokens for use in natural language processing (NLP) tasks. The library supports multiple tokenization algorithms, including Byte Pair Encoding (BPE), SentencePiece, and WordPiece, to meet the needs of different models and applications.
2. Main application scenarios
- Natural language processing (NLP): converting text into a token format the model can process, during both the training and inference phases.
- Preprocessing: tokenizing input text in tasks such as text analysis, sentiment analysis, and machine translation.
- Custom vocabularies: developers can import a custom vocabulary and use the BPE tokenizer to process domain-specific text data.
3. Supported models and services
The library is optimized for several popular model families, including:
- GPT series: such as GPT-4, o1, etc.
- Llama series.
- Phi series.
- BERT series.
In addition, the library integrates with other AI services, such as Azure and OpenAI, providing developers with a unified C# abstraction layer that simplifies interaction with AI services.
4. Main classes
1. Tokenizer class
The Tokenizer class acts as a pipeline for text processing: it accepts raw text as input and outputs a TokenizerResult object. Different models, pre-tokenizers, and normalizers can be plugged in to meet specific needs.
Main methods:
- `Encode(string text)`: encodes the input text into an object containing the list of tokens, the token IDs, and the token offset mappings.
- `Decode(IEnumerable<int> ids, bool skipSpecialTokens = true)`: decodes the given token IDs back into a string.
- `TrainFromFiles(Trainer trainer, ReportProgress reportProgress, params string[] files)`: trains the tokenizer model from input files.
Main properties:
- `Model`: gets or sets the model used by the tokenizer.
- `PreTokenizer`: gets or sets the pre-tokenizer used by the tokenizer.
- `Normalizer`: gets or sets the normalizer used by the tokenizer.
- `Decoder`: gets or sets the decoder used by the tokenizer.
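As a quick illustration of this pipeline, the sketch below wires a Tokenizer from a Bpe model and round-trips a string. This is a minimal sketch, assuming local vocab.json/merges.txt files and the preview-package API shapes described in this article; exact signatures may differ between package versions.

```csharp
using System;
using Microsoft.ML.Tokenizers;

class TokenizerPipelineSketch
{
    static void Main()
    {
        // Assumption: local BPE vocabulary and merges files exist at these paths.
        var model = new Bpe("vocab.json", "merges.txt");

        // The Tokenizer routes text through an optional pre-tokenizer/normalizer
        // and then through the model.
        var tokenizer = new Tokenizer(model);

        // Encode returns the tokens, token IDs, and offset mappings.
        var result = tokenizer.Encode("Hello, tokenizers!");
        Console.WriteLine(string.Join(", ", result.Ids));

        // Decode maps the token IDs back to a string.
        Console.WriteLine(tokenizer.Decode(result.Ids));
    }
}
```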
2. Model class
The Model class is the abstract base class for the models used in the tokenization process, such as BPE, WordPiece, or Unigram. Concrete models (e.g., Bpe) inherit from this class and implement its methods.
Main methods:
- `GetTrainer()`: gets the trainer object used to train the model.
- `GetVocab()`: gets the vocabulary mapping tokens to IDs.
- `GetVocabSize()`: gets the size of the vocabulary.
- `TokenToId(string token)`: maps a token to its token ID.
- `IdToToken(int id, bool skipSpecialTokens = true)`: maps a token ID back to its token.
- `Tokenize(string sequence)`: tokenizes a string sequence into a list of tokens.
- `Save(string vocabPath, string mergesPath)`: saves the model data to vocabulary and merges files.
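To make these methods concrete, here is a small hedged sketch that queries a vocabulary through the Model base-class API described above, using Bpe as the concrete model (the file paths are placeholders, and the nullable return of TokenToId is an assumption):

```csharp
using System;
using Microsoft.ML.Tokenizers;

class ModelVocabSketch
{
    static void Main()
    {
        // Any concrete model can be used through the abstract Model type.
        Model model = new Bpe("vocab.json", "merges.txt");

        Console.WriteLine($"Vocabulary size: {model.GetVocabSize()}");

        // Token -> ID lookup (assumed to return null/absent for
        // out-of-vocabulary tokens).
        var id = model.TokenToId("hello");
        Console.WriteLine($"'hello' -> {id}");

        // Tokenize splits a raw sequence into token objects; Value is assumed
        // to hold the token text.
        foreach (var token in model.Tokenize("unbelievable"))
        {
            Console.WriteLine(token.Value);
        }
    }
}
```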
3. Bpe class
The Bpe class represents the Byte Pair Encoding model and is one of the concrete implementations of the Model class. It splits text into subword units to improve the handling of out-of-vocabulary words.
Main properties:
- `UnknownToken`: gets or sets the unknown token, used when unknown characters are encountered.
- `FuseUnknownTokens`: gets or sets whether multiple consecutive unknown tokens may be fused into one.
- `ContinuingSubwordPrefix`: an optional prefix applied to any subword that only appears after another subword.
- `EndOfWordSuffix`: an optional suffix used to mark word-final subwords.
Main methods:
- `Save(string vocabPath, string mergesPath)`: saves the model data to vocabulary and merges files.
- `Tokenize(string sequence)`: tokenizes a string sequence into a list of tokens.
- `GetTrainer()`: gets the trainer object used to train the model and generate the vocabulary and merges data.
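The options above can be combined when setting up the model. The following is a minimal sketch assuming the get/set surface described in this article (the token string and file paths are placeholders; the exact surface may vary by package version):

```csharp
using Microsoft.ML.Tokenizers;

// Load a BPE model and configure how unknown input is handled.
var bpe = new Bpe("vocab.json", "merges.txt")
{
    // Token emitted when a character cannot be matched against the vocabulary.
    UnknownToken = "<unk>",

    // Collapse runs of consecutive unknown tokens into a single token.
    FuseUnknownTokens = true
};
```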
4. EnglishRoberta class
The EnglishRoberta class is a tokenizer model designed specifically for the English RoBERTa model. It inherits from the Model class and implements RoBERTa-specific tokenization logic.
Main properties:
- `PadIndex`: gets the index of the padding symbol in the symbol list.
- `SymbolsCount`: gets the length of the symbol list.
Main methods:
- `AddMaskSymbol(string maskSymbol)`: adds a mask symbol to the symbol list.
- `IdsToOccurrenceRanks(IReadOnlyList<int> ids)`: converts a list of token IDs into occurrence ranks (highest occurrence first).
- `OccurrenceRanksIds(IReadOnlyList<int> ranks)`: converts a list of occurrence ranks back into token IDs.
- `Save(string vocabPath, string mergesPath)`: saves the model data to the vocabulary, merges, and occurrence-mapping files.
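A hedged sketch of putting EnglishRoberta to work follows; the three-file constructor shape mirrors the Save method above but is an assumption, as are the file names:

```csharp
using Microsoft.ML.Tokenizers;

// Assumption: the RoBERTa vocabulary, merges, and occurrence-mapping files
// have been downloaded locally; the constructor parameters mirror Save above.
var roberta = new EnglishRoberta("vocab.json", "merges.txt", "dict.txt");

// Pair the model with its matching pre-tokenizer (see the next section).
var tokenizer = new Tokenizer(roberta, new RobertaPreTokenizer());
var encoded = tokenizer.Encode("RoBERTa tokenization in .NET");
```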
5. RobertaPreTokenizer class
The RobertaPreTokenizer class is a pre-tokenizer designed for the English RoBERTa tokenizer. It is responsible for the initial splitting and processing of text before tokenization.
Main methods:
- `PreTokenize(string text)`: pre-tokenizes the input text.
6. Split class
The Split class represents a substring obtained by splitting the original string. Each substring is represented by a token, which may ultimately correspond to part of the original input string.
Main properties:
- `TokenString`: gets the underlying split token string.
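The last two classes work together: RobertaPreTokenizer.PreTokenize produces Split values, and TokenString exposes each underlying substring. A minimal sketch, assuming PreTokenize returns an enumerable of Split as described above:

```csharp
using System;
using Microsoft.ML.Tokenizers;

class PreTokenizeSketch
{
    static void Main()
    {
        var preTokenizer = new RobertaPreTokenizer();

        // Pre-tokenization splits raw text into Split values before the model runs.
        foreach (var split in preTokenizer.PreTokenize("Hello, .NET tokenizers!"))
        {
            Console.WriteLine(split.TokenString);
        }
    }
}
```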
5. Sample code
To tokenize text for the GPT-4 model with the Microsoft.ML.Tokenizers library, you can follow these steps:
- Install the necessary NuGet package: make sure the project references the Microsoft.ML.Tokenizers package.
- Load the GPT-4 vocabulary and merge pair files: obtain the vocabulary (vocab.json) and merge pair (merges.txt) files for the GPT-4 model from an official or trusted source.
- Initialize the BPE model and load the vocabulary: use the `Bpe` class from the library to load the vocabulary and merge pair files.
- Create a tokenizer and perform tokenization and decoding: use the `Tokenizer` class to tokenize the input text and, when needed, decode it back to the original text.
The following is sample code:
```csharp
using System;
using Microsoft.ML.Tokenizers;

class Program
{
    static void Main(string[] args)
    {
        // Initialize the BPE model and load the GPT-4 vocabulary and merge pair files
        var bpe = new Bpe("path_to_vocab.json", "path_to_merges.txt");

        // Create a tokenizer
        var tokenizer = new Tokenizer(bpe);

        // Input text
        var inputText = "This is a text for testing.";

        // Tokenize the text
        var encoded = tokenizer.Encode(inputText);

        // Output the tokenization results
        Console.WriteLine("Tokens:");
        foreach (var token in encoded.Tokens)
        {
            Console.WriteLine(token);
        }

        // Decode back to the original text
        var decodedText = tokenizer.Decode(encoded.Ids);
        Console.WriteLine($"Decoded Text: {decodedText}");
    }
}
```
- Path settings: replace `"path_to_vocab.json"` and `"path_to_merges.txt"` with the actual file paths.
- Obtaining the vocabulary and merges files: make sure the vocabulary and merge pair files come from an official or trusted source and are compatible with the GPT-4 model.
- Model compatibility: although this code uses a generic BPE tokenizer, real applications may need adjustments for the specific requirements of the GPT-4 model.
Zhou Guoqing
2025/1/6