
Hugging Face NLP Course Notes - 2. Using Hugging Face Transformers

Description:

  • First published: 2024-09-19
  • Official website: /learn/nlp-course/zh-CN/chapter2
  • About: reading notes; only the highlights are kept, mostly excerpts from the original text, lightly polished

2. Using Hugging Face Transformers

Behind the pipeline

Start with an example:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Raw text --> Tokenizer --> Model --> Post-processing/Predictions

Preprocessing with a tokenizer

Like other neural networks, the Transformer model cannot process raw text directly, so the first step in our pipeline is to convert the textual input into numbers that the model can understand. To do this, we use a tokenizer, which is responsible for:

  • Splitting the input into words, sub-words, or symbols (e.g., punctuation) called tokens
  • Mapping each token to an integer
  • Adding other inputs that may be useful to the model

We use the AutoTokenizer class and its from_pretrained() method to get the same tokenizer that was used when the model was trained:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Once we have the tokenizer, we can pass sentences directly to it and we will get a dictionary that is ready to be fed into our model! The only thing left to do is to convert the list of input IDs into a tensor.

The 🤗 Transformers backend can be PyTorch, TensorFlow, or Flax.

Transformer models only accept tensors as input.

To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the return_tensors argument:

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172, 2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]), 
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}

The output itself is a dictionary containing two keys, input_ids and attention_mask. input_ids contains two rows of integers (one for each sentence) that are unique identifiers of the tokens in each sentence. We'll explain what attention_mask is later in this chapter.
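
To see concretely what those IDs stand for, here is a small illustrative sketch (not part of the course text) that maps the first sentence's IDs back to tokens, reusing the tokenizer and inputs from the cells above:

# Illustrative: convert the first sentence's IDs back into tokens.
first_ids = inputs["input_ids"][0].tolist()
print(tokenizer.convert_ids_to_tokens(first_ids))
# Should show the wordpiece tokens plus the special [CLS]/[SEP] tokens, e.g.
# ['[CLS]', 'i', "'", 've', 'been', 'waiting', ..., '[SEP]']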

Go through the model

We can download our pretrained model the same way we did with our tokenizer.

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call hidden states, also known as features. For each model input, we’ll retrieve a high-dimensional vector representing the contextual understanding of that input by the Transformer model.

A high-dimensional vector?

Transformer output vectors are generally large. There are usually 3 dimensions:

  • Batch size: The number of sequences to process at once (2 in our example).
  • Sequence length: The length of the numerical representation of the sequence (16 in our example).
  • Hidden size: The vector dimension of each model input.

It is said to be high-dimensional because of the hidden size, which can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more).
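
As a quick, optional check (an illustrative sketch, not from the course), you can print the loaded model's configuration; the hidden dimension appears among its attributes (DistilBERT calls it dim, while most other architectures call it hidden_size):

# Illustrative: inspect the configuration of the model loaded above.
# The hidden dimension (768 for this checkpoint) is listed among its attributes.
print(model.config)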

We can see this if we feed the preprocessed inputs into the model:

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
torch.Size([2, 16, 768])

Note that the outputs of 🤗 Transformers models behave like namedtuples or dictionaries. You can access the elements by attribute (as we did), by key (outputs["last_hidden_state"]), or even by index if you know exactly where what you are looking for is (outputs[0]).
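
A small sketch of those three equivalent access styles (assuming the outputs object from the previous cell):

# Illustrative: attribute, key, and index access all point to the same tensor.
by_attribute = outputs.last_hidden_state
by_key = outputs["last_hidden_state"]
by_index = outputs[0]
print(by_attribute.shape == by_key.shape == by_index.shape)  # True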

Model heads: Making sense out of numbers

Model heads take as input high-dimensional vectors of hidden states and project them into different dimensions. They usually consist of one or several linear layers:

![[en_chapter2_transformer_and_head.svg]]

The output of the Transformer model is sent directly to the model head to be processed.

In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences.

There are many different architectures in Transformers, each designed around handling a specific task. The following is a non-exhaustive list:

  • *Model (retrieve the hidden states)
  • *ForCausalLM
  • *ForMaskedLM
  • *ForMultipleChoice
  • *ForQuestionAnswering
  • *ForSequenceClassification
  • *ForTokenClassification
  • and others

For our example, we need a model with a sequence classification head (able to classify the sentences as positive or negative). So we won't actually use the AutoModel class, but AutoModelForSequenceClassification:

from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of the output, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

print(outputs.logits.shape)
torch.Size([2, 2])

Postprocessing the output

The values we get as output from our model don't necessarily make sense by themselves. Let's take a look:

print(outputs.logits)
tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

Our model predicted [-1.5607, 1.6123] for the first sentence and [4.1692, -3.3464] for the second one. Those are not probabilities but logits, the raw, unnormalized scores output by the last layer of the model. To be converted to probabilities, they need to go through a SoftMax layer (all 🤗 Transformers models output logits, as the loss function used for training generally fuses the last activation function, such as SoftMax, with the actual loss function, such as cross-entropy):

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

Now we can see that the model predicted [0.0402, 0.9598] for the first sentence and [0.9995, 0.0005] for the second one. These are recognizable probability scores.

To get the labels corresponding to each position, we can inspect the id2label attribute of the model config (more on this in the next section):

model.config.id2label
{0: 'NEGATIVE', 1: 'POSITIVE'}

We can now conclude that the model predicts the following:

  • First sentence: negative: 0.0402, positive: 0.9598
  • Second sentence: negative: 0.9995, positive: 0.0005
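
The same mapping can be done in code; a minimal sketch reusing raw_inputs, predictions, and model from the cells above:

# Illustrative: pair each sentence with its most likely label and score.
for sentence, probs in zip(raw_inputs, predictions):
    label_id = int(probs.argmax())
    print(sentence, "->", model.config.id2label[label_id], float(probs[label_id]))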

Models

Creating a Transformer

The first thing you need to do to initialize a BERT model is to load a configuration object:

from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

The configuration contains a number of attributes that are used to build the model:

print(config)
BertConfig {
  [...]
  "hidden_size": 768,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  [...]
}
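
The configuration can also be customized before building the model. Here is a minimal sketch (the attribute values are just examples, not recommendations); such a model is still randomly initialized and would need to be trained:

from transformers import BertConfig, BertModel

# Illustrative: build a smaller BERT variant by overriding config attributes.
small_config = BertConfig(
    num_hidden_layers=6,
    hidden_size=384,
    num_attention_heads=6,
    intermediate_size=1536,
)
small_model = BertModel(small_config)
print(small_config)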

Different loading methods

Creating a model from the default configuration initializes it with random values:

from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

The model can be used in this state, but will output gibberish; it first needs to be trained.

Loading an already trained Transformer model is simple - we can use the from_pretrained() method:

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

As you saw earlier, we could replace BertModel with the equivalent AutoModel class.

The distinction:

  • With the AutoModel class, the code is checkpoint-agnostic: it does not need to know which architecture the checkpoint uses.
  • With BertModel.from_pretrained("bert-base-cased"), the specified identifier loads a pretrained model from the bert-base-cased checkpoint, which was trained by the BERT authors themselves.

Saving methods

We use the save_pretrained() method to save a model:

model.save_pretrained("directory_on_my_computer")
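
This writes the model's configuration and weights to that directory, and the model can be reloaded from it later. A minimal sketch (the exact weights file name depends on the library version, e.g. pytorch_model.bin or model.safetensors):

import os
from transformers import BertModel

# Illustrative: inspect what was saved, then reload the model from disk.
print(os.listdir("directory_on_my_computer"))  # config.json plus a weights file
reloaded_model = BertModel.from_pretrained("directory_on_my_computer")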

Using a Transformer model for inference

Transformer models can only process numbers - numbers generated by the tokenizer. But before we discuss tokenizers, let's explore what inputs the model accepts.

The tokenizer is responsible for converting the inputs into tensors for the appropriate framework (PyTorch, TensorFlow, or Flax).

Suppose we have several sequences:

sequences = ["Hello!", "Cool.", "Nice!"]

The tokenizer converts them into vocabulary indices, typically called input IDs. Each sequence is now a list of numbers! The result:

encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

This is a list of encoded sequences. Let's convert it to a tensor:

import torch

model_inputs = torch.tensor(encoded_sequences)

Using the tensors as inputs to the model

output = model(model_inputs)

Tokenizers

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

In NLP tasks, the data usually processed is raw text. Such as:

Jim Henson was a puppeteer

However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That's what tokenizers do, and there are many ways to go about it. The goal is to find the most meaningful representation - that is, the one that makes the most sense to the model - and, if possible, the smallest one.

Let's take a look at some examples of tokenization algorithms.

Word-based

The first type of tokenizer that comes to mind is word-based. It is generally very easy to set up and use with only a few rules, and it often yields decent results.

For example, for the raw text Let's do tokenization!:

Splitting on spaces: Let's | do | tokenization!
Splitting on punctuation: Let | 's | do | tokenization | !

We can do the space-based split with Python's split() function:

tokenized_text = "Jim Henson was a puppeteer".split()
print(tokenized_text)
['Jim', 'Henson', 'was', 'a', 'puppeteer']

There are many variants of word tokenizers that have extra rules for punctuation.

With this kind of tokenizer, we can end up with some pretty large "vocabularies," where a vocabulary is defined as the total number of independent tokens in our corpus.

Each word is assigned an ID, starting at 0 and going up to the size of the vocabulary. The model uses these IDs to recognize each word.

If we want to completely cover a language with a word-based tokenizer, we need an identifier for each word in the language, which generates a huge number of tokens. For example, there are over 500,000 words in English, so to map each word to an input ID we would need to keep track of that many IDs. Furthermore, words like "dog" are represented differently from words like "dogs", and the model initially has no way of knowing that "dog" and "dogs" are similar: it will identify the two words as unrelated. The same applies to other similar words, like "run" and "running", which the model will not initially see as similar.

Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the "unknown" token, often represented as "[UNK]" or "<unk>". It's generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it was not able to retrieve a sensible representation of a word and you're losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer splits as few words as possible into the unknown token.
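
As a small illustrative check (not part of the course text), the tokenizers in 🤗 Transformers expose their unknown token and vocabulary size as attributes:

from transformers import AutoTokenizer

# Illustrative: inspect the unknown token and vocabulary size of a pretrained tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.unk_token)   # "[UNK]" for BERT-style tokenizers
print(tokenizer.vocab_size)  # roughly 29,000 entries for this checkpoint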

One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer.

Character-based

Character-based tokenizers split the text into characters rather than words. This has two primary benefits:

  1. The vocabulary is much smaller.
  2. There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters.

But here too, some questions arise concerning spaces and punctuation.

This approach isn't perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it's less meaningful: each character doesn't mean a lot on its own, whereas that is the case with words. However, this differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin language.

Another thing to consider is that our model will end up with a very large number of tokens to process: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted to characters.
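
A plain-Python illustration of that token explosion (this is just character splitting, not a real tokenizer):

# Illustrative: a single word becomes many character-level tokens.
word = "tokenization"
char_tokens = list(word)
print(char_tokens)       # ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
print(len(char_tokens))  # 12 tokens for one word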

To get the best of both worlds, we can use a third technique that combines the two approaches: subword tokenization.

Subword tokenization

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, while rare words should be decomposed into meaningful subwords.

For example, "annoyingly" may be considered a rare word that can be broken down into "annoying" and "ly". These two parts may occur more frequently as separate subwords, while retaining the meaning of "annoyingly" through the combined meaning of "annoying" and "ly". ".

Here is an example showing how a subword tokenization algorithm would tokenize the sequence "Let's do tokenization!":

Let's</w> do</w> token ization</w> !</w>

These subwords end up providing a lot of semantic meaning: in the example above, "tokenization" was split into "token" and "ization", two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies and close to no unknown tokens.

And more!

Unsurprisingly, there are more techniques. To name just a few:

  • Byte-level BPE, for GPT-2
  • WordPiece, for BERT
  • SentencePiece or Unigram, used by several multilingual models

Loading and saving

Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: from_pretrained() and save_pretrained(). These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model).

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Similar to AutoModel, the AutoTokenizer class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer("Using a Transformer network is simple")

Saving a tokenizer:

tokenizer.save_pretrained("directory_on_my_computer")

Encoding

Translating text to numbers is known as encoding. Encoding is done in a two-step process: tokenization, followed by the conversion to input IDs.

  • The first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called tokens.
  • The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained() method. Again, we need to use the same vocabulary that was used when the model was pretrained.

Tokenization

The tokenization process is done by the tokenize() method of the tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

print(tokens)
['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

This tokenizer is a subword tokenizer: it splits words until it obtains tokens that can be represented by its vocabulary. That's the case here with "Transformer", which is split into two tokens: Trans and ##former.

From tokens to input IDs

The conversion to input IDs is handled by the convert_tokens_to_ids() method of the tokenizer:

ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)
[7993, 170, 11303, 1200, 2443, 1110, 3014]

Decoding

Decoding goes the other way around: from vocabulary indices to a string. This can be done with the decode() method:

decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])

print(decoded_string)
Using a transformer network is simple

Handling multiple sequences

In the previous section, we explored the simplest use case: doing inference on a single sequence of small length. However, some questions emerge already:

  • How do we handle multiple sequences?
  • How do we handle multiple sequences of different lengths?
  • Is the vocabulary index the only input to make the model work?
  • Is there a problem with the sequence being too long?

Models expect a batch of inputs

In the previous exercise, you saw how a sequence can be converted into a list of numbers. Now let's convert this list of numbers into a tensor and pass it to the model:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail
model(input_ids)

This raises an error:

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
print(input_ids)
tensor([ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012])
tokenized_inputs = tokenizer(sequence, return_tensors="pt")
print(tokenized_inputs["input_ids"])
tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102]])
print(input_ids.shape, tokenized_inputs["input_ids"].shape)
torch.Size([14]) torch.Size([1, 16])

Comparing the shapes, we can see that the tokenizer added a new dimension (the batch dimension) on top of the list of input IDs. We need to add it ourselves:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids])
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", )
Input IDs: [[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607, 2026,  2878,  2166,  1012]]
Logits: [[-2.7276,  2.8789]]

Batching refers to passing multiple sentences to the model at once. If you only have one sentence, you can build a batch with a single sequence.

This is a batch containing two identical sequences:

batched_ids = [ids, ids]

Batch processing enables the model to process multiple sentences. Using multiple sequences is just as easy as building a batch of individual sequences. However, there is another problem. When you try to put two (or more) sentences together in a batch, they may be of different lengths. If you've ever worked with tensors, you know that they need to be rectangular in shape, so it's not possible to directly convert a list of input IDs to a tensor. To get around this, we usually pad the input.

Padding the inputs

The following list of lists cannot be converted to a tensor:

batched_ids = [
    [200, 200, 200],
    [200, 200]
]

To solve this problem, we will use padding to give our tensor a rectangular shape. Padding ensures that all sentences have the same length by adding a special word (called a padding token) to shorter sentences. For example, if you have 10 sentences with 10 words each and 1 sentence with 20 words, padding will ensure that all sentences have 20 words. In our example, the generated tensor looks like this:

padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

The padding token ID can be found in tokenizer.pad_token_id. Let's use it and send our two sentences through the model individually and batched together:

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

There's something wrong with the logits in our batched prediction: the second row should contain the same logits as for the second sentence, but we've got completely different values!

This is because the key feature of Transformer models is attention layers that contextualize each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model, or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.

Attention masks

Attention masks are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).

Let's complete the previous example with an attention mask:

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

Now we get the same logits for the second sentence in the batch: [0.5803, -0.4125] matches the output we got when passing sequence2_ids to the model on its own.

Notice how the last value of the second sequence is a padding ID, which corresponds to a 0 value in the attention mask. The longer example below applies the same idea to our two original sentences:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)


sequence_1 = "I've been waiting for a HuggingFace course my whole life."
sequence_2 = "I hate this so much!"

tokens_1 = tokenizer.tokenize(sequence_1)
ids_1 = tokenizer.convert_tokens_to_ids(tokens_1)
input_ids_1 = torch.tensor([ids_1])
output_1 = model(input_ids_1)

tokens_2 = tokenizer.tokenize(sequence_2)
ids_2 = tokenizer.convert_tokens_to_ids(tokens_2)
input_ids_2 = torch.tensor([ids_2])
output_2 = model(input_ids_2)

print(input_ids_1)
print(input_ids_2)
print(output_1.logits)
print(output_2.logits)
tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
tensor([[1045, 5223, 2023, 2061, 2172,  999]])
tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)
tensor([[ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>)
batched_ids = [
    ids_1,
    ids_2 + [tokenizer.pad_token_id] * (input_ids_1.shape[1] - input_ids_2.shape[1])
]

attention_mask = [
    [1] * input_ids_1.shape[1],
    [1] * input_ids_2.shape[1] + [0] * (input_ids_1.shape[1] - input_ids_2.shape[1]),
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[-2.7276,  2.8789],
        [ 3.1931, -2.6685]], grad_fn=<AddmmBackward0>)

Longer sequences

With Transformer models, there is a limit to the length of the sequences we can pass to the model. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:

  1. Use models that support longer sequence lengths.
  2. Truncate your sequence.

Models have different supported sequence lengths, and some specialize in handling very long sequences. Longformer is one example, and another is LED. If you're working on a task that requires very long sequences, we recommend you take a look at those models.

Otherwise, we recommend you truncate your sequences by specifying the max_sequence_length parameter:

sequence = sequence[:max_sequence_length]

Putting it all together

When you call the tokenizer directly on the sentence, you get back inputs that are ready to pass to the model:

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs)
{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

As we'll see in some examples below, this method is very powerful. First, it can tokenize a single sequence:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

Multiple sequences can also be processed at once:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

It can pad according to several objectives:

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can also truncate sequences:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

The tokenizer object can handle the conversion to specific framework tensors, which can then be sent directly to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks: "pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays:

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

Special tokens

If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

One token ID was added at the beginning and one at the end. Let's decode the two sequences of IDs above to see what this is about:

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don't add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
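
If you are curious which special tokens a given tokenizer uses, they are exposed as attributes (a small illustrative sketch, assuming the tokenizer loaded earlier in this section):

# Illustrative: inspect the special tokens of the tokenizer.
print(tokenizer.cls_token, tokenizer.sep_token)  # '[CLS]' '[SEP]' for this checkpoint
print(tokenizer.all_special_tokens)              # also includes [PAD], [UNK], [MASK]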

Wrapping up: from tokenizer to model

Now that we've seen all the individual steps the tokenizer object uses when applied to texts, let's take one final look at how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
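
To close the loop, here is a small follow-up sketch (reusing the post-processing steps from earlier in this chapter) that turns the logits into labeled probabilities:

# Illustrative: convert the logits to probabilities and map them to labels.
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
for sentence, probs in zip(sequences, probabilities):
    label_id = int(probs.argmax())
    print(sentence, "->", model.config.id2label[label_id], round(float(probs[label_id]), 4))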