Hugging Face NLP Course Notes - 0. Installing the transformers library & 1. Transformer models

Description:

First published: 2024-09-14
Official website: /learn/nlp-course/zh-CN/chapter1
Approach: read and take notes, keeping only the highlights; mostly excerpts from the original text, lightly polished

0. Install the transformers library

Create a conda environment and install packages:

conda create -n hfnlp python=3.12
conda activate hfnlp
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.44.2

# Additional packages used later in these notes
pip install seqeval
pip install sentencepiece

Use a Hugging Face mirror (see/ ):

export HF_ENDPOINT=

Or set the Hugging Face mirror from Python:

import os
os.environ["HF_ENDPOINT"] = ""

1. Transformer models

What can Transformers do?

Using pipelines

The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, so we can pass in any text directly and get an intelligible answer:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

Tip:

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b ....

Output:

[{'label': 'POSITIVE', 'score': 0.9598047137260437}]

Pass multiple sentences:

classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

Passing text to a pipeline involves three main steps:

The text is preprocessed into a format that the model can understand.
The preprocessed inputs are passed to the model.
The model's predictions are post-processed into a result that humans can understand.
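
The three steps can be mimicked with a toy, dependency-free sketch. Everything in it is a hypothetical stand-in: the word lists and the functions preprocess, toy_model, and postprocess are invented for illustration and do not reflect how the Transformers library or its models work internally. Only the shape of the flow (text, then model inputs, then raw scores, then a labeled result) mirrors the description above.

```python
import math

# Hypothetical word lists standing in for what a trained model has learned.
POSITIVE_WORDS = {"love", "great", "waiting"}
NEGATIVE_WORDS = {"hate", "awful", "terrible"}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into inputs the 'model' understands (lowercase tokens)."""
    return [word.strip(".,!?'\"").lower() for word in text.split()]


def toy_model(tokens):
    """Step 2: map the inputs to raw scores (logits), one per label."""
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    return [float(negative), float(positive)]


def postprocess(logits):
    """Step 3: softmax the logits into probabilities and pick a readable label."""
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three steps, the way pipeline() chains its own stages."""
    return postprocess(toy_model(preprocess(text)))


print(toy_pipeline("I hate this so much!"))
```

The returned dictionary has the same label/score shape as the classifier output shown earlier, although the numbers come from the toy scoring rule rather than a real model.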

Zero-shot classification

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Tip:

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.

Output:

{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445952534675598, 0.11197696626186371, 0.043427806347608566]}

This pipeline is called zero-shot because you don't need to fine-tune the model on your data to use it!
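
The returned dictionary lists the candidate labels next to their scores, and in the output above they appear in descending order of score. A small sketch of reading it, using that output as literal data (the variable names are illustrative):

```python
# Output of the zero-shot pipeline, copied from the run above.
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8445952534675598, 0.11197696626186371, 0.043427806347608566],
}

# The two lists are aligned by position, so zip pairs each label with its score.
ranked = list(zip(result["labels"], result["scores"]))
best_label, best_score = ranked[0]

for label, score in ranked:
    print(f"{label:<10} {score:.3f}")

print("top label:", best_label)
```

In this run the three scores add up to roughly 1, so they can be read as a distribution over the candidate labels.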

Text generation

Now let's see how to use a pipeline to generate some text. The main idea is that you provide a prompt and the model auto-completes it by generating the remaining text.

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

Tip:

No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.

Output:

[{'generated_text': 'In this course, we will teach you how to create a simple Python script that uses the default Python scripts for the following tasks, such as adding a linker at the end of a file to a file, editing an array, etc.\n\n'}]

Using any model from the Hub in a pipeline

The previous examples used the default model, but you can also select a specific model from the Hub to use in a pipeline for a given task, for example text generation. Go to the Model Hub and click the appropriate tag on the left to display only the models that support that task.

Let's try the distilgpt2 model! Here's how to load it in the same pipeline as before:

from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2
)

[{'generated_text': 'In this course, we will teach you how to make your world better. Our courses focus on how to make an improvement in your life or the things'},
 {'generated_text': 'In this course, we will teach you how to properly design your own design using what is currently in place and using what is best in place. By'}]

Mask filling

The next pipeline you will try is fill-mask. The idea of this task is to fill in the blanks in a given text:

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

[{'score': 0.19198445975780487,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04209190234541893,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'}]

The top_k parameter controls how many candidates are displayed. Note that the model fills in the special <mask> word, which is often called the mask token. Other mask-filling models may use a different mask token, so check which one is correct when exploring other models.
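
One way to avoid hard-coding the wrong mask word is to write prompts with a neutral placeholder and substitute the mask token that a given model expects. The sketch below is a plain-Python helper; fill_placeholder and the {mask} placeholder are hypothetical names chosen for illustration. The two tokens are the ones that appear in these notes: <mask> for the default fill-mask model and [MASK] for bert-base-uncased. With a loaded pipeline, the token can be read from unmasker.tokenizer.mask_token.

```python
def fill_placeholder(template, mask_token, placeholder="{mask}"):
    """Replace a neutral placeholder with the mask token a given model expects."""
    if template.count(placeholder) != 1:
        raise ValueError("template must contain exactly one placeholder")
    return template.replace(placeholder, mask_token)


template = "This course will teach you all about {mask} models."

# Mask token used by the default fill-mask model in these notes.
default_prompt = fill_placeholder(template, "<mask>")
# Mask token used by bert-base-uncased.
bert_prompt = fill_placeholder(template, "[MASK]")

print(default_prompt)
print(bert_prompt)
```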

Named Entity Recognition

Named Entity Recognition (NER) is a task in which the model must find which parts of the input text correspond to entities such as people, locations, or organizations. Let's look at an example:

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english ...

[{'entity_group': 'PER',
  'score': 0.9981694,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.9796019,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9932106,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]

We pass the option grouped_entities=True to the pipeline creation function to tell the pipeline to regroup the parts of the sentence that correspond to the same entity: here the model correctly groups "Hugging" and "Face" into a single organization, even though the name consists of more than one word.
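
The regrouping idea can be illustrated without the library. The sketch below merges adjacent tokens that carry the same entity type into one span and recovers the text through character offsets, which is what the start and end fields in the output refer to. It is a simplified illustration and not the pipeline's actual aggregation code; group_entities, max_gap, and the per-token predictions are hypothetical.

```python
text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Hypothetical per-token predictions: (start, end, entity type), end exclusive.
token_predictions = [
    (11, 18, "PER"),  # Sylvain
    (33, 40, "ORG"),  # Hugging
    (41, 45, "ORG"),  # Face
    (49, 57, "LOC"),  # Brooklyn
]


def group_entities(text, predictions, max_gap=1):
    """Merge neighbouring tokens of the same type into a single entity span."""
    groups = []
    for start, end, entity_type in predictions:
        last = groups[-1] if groups else None
        if last and last["entity_group"] == entity_type and start - last["end"] <= max_gap:
            last["end"] = end  # extend the span that is already open
        else:
            groups.append({"entity_group": entity_type, "start": start, "end": end})
    # Recover each entity's text from its character offsets.
    for group in groups:
        group["word"] = text[group["start"]:group["end"]]
    return groups


for entity in group_entities(text, token_predictions):
    print(entity)
```

Newer releases of Transformers also accept aggregation_strategy="simple" on the ner pipeline for this kind of regrouping.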

Named Entity Recognition (Chinese)

Run the code from the README of /shibing624/bert4ner-base-chinese:

pip install seqeval

import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']

# "Wang Hongwei is from Beijing, is a police officer, and likes to visit Wangfujing."
sentence = "王宏伟来自北京，是个警察，喜欢去王府井游玩儿。"

def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # predictions still include [CLS]/[SEP], so each printed token sits one position
    # after the character its label belongs to; the label order itself matches the sentence.
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]
    print(sentence)
    print(char_tags)

    pred_labels = [i[1] for i in char_tags]
    entities = []
    # get_entities returns (entity_type, start, end) with an inclusive end index
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)


get_entity(sentence)

Output (the Chinese characters are shown in English translation):

Wang Hongwei is from Beijing, a police officer who likes to visit Wangfujing.
[('Hong', 'B-PER'), ('Wei', 'I-PER'), ('Lai', 'I-PER'), ('Since', 'O'), ('Bei', 'O'), ('Jing', 'B-LOC'), (',', 'I-LOC'), ('Is', 'O'), ('A', 'O'), ('Police', 'O'), ('Cha', 'O'), (',', 'O'), ('happy', 'O'), ('cheerful', 'O'), ('go', 'O'), ('king', 'O'), ('house', 'B-LOC'), ('well', 'I-LOC'), ('swim', 'I-LOC'), ('play', 'O'), ('child', 'O')]
Sentence entity:
[('Wang Hongwei', 'PER'), ('Beijing', 'LOC'), ('Wangfujing', 'LOC')]
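
The get_entities call turns a sequence of B-/I-/O tags into (type, start, end) spans whose end index is inclusive, which is why the code above slices the sentence with i[2] + 1. The following is a simplified pure-Python sketch of that decoding; decode_bio is a hypothetical helper written for illustration, not seqeval's implementation, and the tag list is the one printed in the output above.

```python
def decode_bio(tags):
    """Collect (entity_type, start, end) spans from B-/I-/O tags; end is inclusive."""
    spans = []
    current_type, start = None, None
    for index, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if current_type is not None and not continues:
            spans.append((current_type, start, index - 1))
            current_type, start = None, None
        # A "B-" tag opens a span; a stray "I-" tag is treated leniently as an opener too.
        if prefix == "B" or (prefix == "I" and current_type is None):
            current_type, start = entity_type, index
    return spans


# Labels from the printed char_tags, in order.
pred_labels = (
    ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
    + ["O"] * 9
    + ["B-LOC", "I-LOC", "I-LOC", "O", "O"]
)

spans = decode_bio(pred_labels)
print(spans)
```

Slicing the sentence from start to end + 1 for each span then yields the three entities listed in the output.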

Alternatively, use the nerpy library to run the shibing624/bert4ner-base-chinese model.

In addition, LTP can be used for Chinese named entity recognition; its GitHub repository, /HIT-SCIR/ltp, has 4.9K stars.

Question answering

from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6949753761291504, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}

Note that this pipeline works by extracting information from the provided context; it does not generate answers out of thin air.
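
Because the answer is extracted rather than generated, the start and end values are character offsets into the context. A quick check with the values from the output above (the variable names are illustrative):

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result copied from the question-answering run above.
result = {"score": 0.6949753761291504, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slicing the context with the reported offsets reproduces the answer text.
extracted = context[result["start"]:result["end"]]
print(extracted)
```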

Summarization

Summarization is the task of reducing a text to a shorter one while keeping the main (important) information it contains. Here is an example:

from transformers import pipeline

summarizer = pipeline("summarization", device=0)
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

As with text generation, you can specify a max_length or a min_length for the result.

Translation

For translation, you can use a default model if you provide a language pair in the task name (e.g. "translation_en_to_fr"), but the easiest way is to pick the model you want to use on the Model Hub. Here we will try translating from French to English:

pip install sentencepiece

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=0)
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

Translate from English to Chinese (the Chinese output below is shown in English translation):

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh", device=0)
translator("America has changed dramatically during recent years.")

[{'translation_text': 'In recent years, the United States has changed dramatically.'}]

Bias and limitations

If you intend to use a pre-trained or fine-tuned model in a production project, keep in mind that these models, while powerful, have limitations. One of the biggest is that, in order to pre-train on large amounts of data, researchers usually collect all the content they can find, which may be interspersed with stereotypes about ideologies or values.

To give a quick illustration, let's go back to the fill-mask pipeline, this time with the BERT model:

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']

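When asked to fill in the missing word in these two sentences, the model's suggestions split along gender stereotypes. The course text observes that only one answer is gender-free (waiter/waitress); in the run shown here, "waiter" does not appear among the top five for "man", and the two lists share no occupation at all.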