Hugging Face NLP Course Study Notes - 0. Installing the transformers library & 1. Transformer models
Description:
- First published: 2024-09-14
- Official website: https://huggingface.co/learn/nlp-course/zh-CN/chapter1
- About: reading notes that keep only the highlights; mostly excerpts from the original text, lightly polished
0. Install the transformers library
Create a conda environment and install packages:
conda create -n hfnlp python=3.12
conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers==4.44.2
# More
pip install seqeval
pip install sentencepiece
Use a Hugging Face mirror (see https://hf-mirror.com):
export HF_ENDPOINT=https://hf-mirror.com
Or set up a Hugging Face mirror in python:
import os
["HF_ENDPOINT"] = ""
1. Transformer models
What can Transformers do?
Using pipelines
The most basic object in the Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
Tip:
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b ....
Output:
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]
We can also pass several sentences:
classifier(
["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
{'label': 'NEGATIVE', 'score': 0.9994558691978455}]
There are three main steps involved when you pass some text to a pipeline (see the sketch after this list):
- The text is preprocessed into a format the model can understand.
- The preprocessed inputs are passed to the model.
- The model's predictions are post-processed into a final, human-understandable result.
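Under the hood, a pipeline chains these three steps together. Here is a minimal sketch of doing them by hand, assuming the same default sentiment-analysis checkpoint as above:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Preprocess: text -> token IDs
inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")

# 2. Forward pass: token IDs -> logits
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Post-process: logits -> human-readable label and score
probs = torch.softmax(logits, dim=-1)
pred = probs.argmax(dim=-1).item()
print(model.config.id2label[pred], probs[0, pred].item())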
Zero-shot classification
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
classifier(
"This is a course about the Transformers library",
candidate_labels=["education", "politics", "business"],
)
Tip:
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Output:
{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445952534675598, 0.11197696626186371, 0.043427806347608566]}
This pipeline is called zero-shot because you don't need to fine-tune the model on your data to use it!
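One variation worth knowing: by default the candidate labels compete with each other, so the scores sum to 1. Passing multi_label=True makes the pipeline score each label independently. A quick sketch:
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# With multi_label=True each label is scored on its own,
# so the scores no longer need to sum to 1
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
    multi_label=True,
)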
Text generation
Now let's see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model auto-completes it by generating the remaining text.
from transformers import pipeline
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
Tip:
No model was supplied, defaulted to openai-community/gpt2 and revision 6c0e608 (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Output:
[{'generated_text': 'In this course, we will teach you how to create a simple Python script that uses the default Python scripts for the following tasks, such as adding a linker at the end of a file to a file, editing an array, etc.\n\n'}]
Using other models from the Hub in a pipeline
The previous example used the default model for the task, but you can also pick a specific model from the Hub to use in a pipeline for a particular task, such as text generation. Go to the Model Hub and click the corresponding tag on the left to display only the models supported for that task.
Let's try the distilgpt2 model! Here's how to load it in the same pipeline as before:
from transformers import pipeline
generator = pipeline("text-generation", model="distilgpt2")
generator(
"In this course, we will teach you how to",
max_length=30,
num_return_sequences=2
)
[{'generated_text': 'In this course, we will teach you how to make your world better. Our courses focus on how to make an improvement in your life or the things'},
{'generated_text': 'In this course, we will teach you how to properly design your own design using what is currently in place and using what is best in place. By'}]
Mask filling
The next pipeline you will try is fill-mask. The idea of this task is to fill in the blanks in a given text:
from transformers import pipeline
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
[{'score': 0.19198445975780487,
'token': 30412,
'token_str': ' mathematical',
'sequence': 'This course will teach you all about mathematical models.'},
{'score': 0.04209190234541893,
'token': 38163,
'token_str': ' computational',
'sequence': 'This course will teach you all about computational models.'}]
The top_k parameter controls how many results are displayed. Note that here the model fills in the special <mask> word, which is often referred to as a mask token. Other mask-filling models may have different mask tokens, so verify what the correct mask word is when exploring other models (see the sketch below).
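A quick way to check a model's mask token before writing the prompt (a minimal sketch; the two checkpoints are just examples):
from transformers import AutoTokenizer

# Mask tokens differ between model families; read them off the tokenizer
for checkpoint in ["distilroberta-base", "bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, "->", tokenizer.mask_token)
# distilroberta-base -> <mask>
# bert-base-uncased  -> [MASK]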
Named Entity Recognition
Named Entity Recognition (NER) is a task in which the model must find which parts of the input text correspond to entities such as people, locations, or organizations. Let's look at an example:
from transformers import pipeline
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in *lyn.")
No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english ...
[{'entity_group': 'PER',
'score': 0.9981694,
'word': 'Sylvain',
'start': 11,
'end': 18},
{'entity_group': 'ORG',
'score': 0.9796019,
'word': 'Hugging Face',
'start': 33,
'end': 45},
{'entity_group': 'LOC',
'score': 0.9932106,
'word': 'Brooklyn',
'start': 49,
'end': 57}]
We pass the option grouped_entities=True in the pipeline creation function to tell the pipeline to regroup the parts of the sentence that correspond to the same entity: here the model correctly groups "Hugging" and "Face" into one organization, even though the name consists of more than one word. A sketch of the ungrouped behavior follows.
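To see what grouping buys you, here is the same call without it; the model then returns one entry per (sub)token, e.g. "Sylvain" comes back as the separate pieces 'S', '##yl', '##va', '##in':
from transformers import pipeline

# Without grouping, each (sub)token is returned as its own entity entry
ner_raw = pipeline("ner", grouped_entities=False)
ner_raw("My name is Sylvain and I work at Hugging Face in Brooklyn.")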
Named Entity Recognition (Chinese)
Run the code from the https://huggingface.co/shibing624/bert4ner-base-chinese README:
pip install seqeval
import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities
["KMP_DUPLICATE_LIB_OK"] = "TRUE"
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-chinese")
label_list = ['I-ORG', 'B-LOC', 'O', 'B-ORG', 'I-LOC', 'I-PER', 'B-TIME', 'I-TIME', 'B-PER']
sentence = "Wang Hongwei from Beijing,It's a cop.,I like to go to Wangfujing。"
def get_entity(sentence):
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Drop the [CLS]/[SEP] positions with [1:-1]
    char_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())][1:-1]
    print(sentence)
    print(char_tags)
    pred_labels = [i[1] for i in char_tags]
    entities = []
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = sentence[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))
    print("Sentence entity:")
    print(entities)

get_entity(sentence)
王宏伟来自北京，是个警察，喜欢去王府井游玩儿。
[('王', 'B-PER'), ('宏', 'I-PER'), ('伟', 'I-PER'), ('来', 'O'), ('自', 'O'), ('北', 'B-LOC'), ('京', 'I-LOC'), ('，', 'O'), ('是', 'O'), ('个', 'O'), ('警', 'O'), ('察', 'O'), ('，', 'O'), ('喜', 'O'), ('欢', 'O'), ('去', 'O'), ('王', 'B-LOC'), ('府', 'I-LOC'), ('井', 'I-LOC'), ('游', 'O'), ('玩', 'O'), ('儿', 'O'), ('。', 'O')]
Sentence entity:
[('王宏伟', 'PER'), ('北京', 'LOC'), ('王府井', 'LOC')]
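The heavy lifting in the snippet above is done by seqeval's get_entities, which turns a BIO tag sequence into (type, start, end) spans; a minimal check:
from seqeval.metrics.sequence_labeling import get_entities

# BIO tags in, (entity_type, start_index, end_index) spans out
print(get_entities(["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]))
# [('PER', 0, 1), ('LOC', 3, 4)]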
Alternatively, you can use the nerpy library with the model shibing624/bert4ner-base-chinese.
In addition, you can use LTP for Chinese named entity recognition; its GitHub repository HIT-SCIR/ltp has 4.9K stars.
Question answering
from transformers import pipeline
question_answerer = pipeline("question-answering")
question_answerer(
question="Where do I work?",
context="My name is Sylvain and I work at Hugging Face in *lyn",
)
{'score': 0.6949753761291504, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
Note that this pipeline works by extracting information from the provided context; it does not generate answers out of thin air.
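A quick check of that extractive behavior: the returned start/end values are character offsets into the context, so slicing the context reproduces the answer (a sketch reusing the pipeline above):
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
result = question_answerer(question="Where do I work?", context=context)
# The answer is literally a span of the context
print(context[result["start"]:result["end"]])  # -> Hugging Face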
Text summarization
Text summarization is the task of reducing a text to a shorter one while retaining the main (important) information. Here is an example:
from transformers import pipeline
summarizer = pipeline("summarization", device=0)
summarizer(
"""
America has changed dramatically during recent years. Not only has the number of
graduates in traditional engineering disciplines such as mechanical, civil,
electrical, chemical, and aeronautical engineering declined, but in most of
the premier American universities engineering curricula now concentrate on
and encourage largely the study of engineering science. As a result, there
are declining offerings in engineering subjects dealing with infrastructure,
the environment, and related issues, and greater concentration on high
technology subjects, largely supporting increasingly complex scientific
developments. While the latter is important, it should not be at the expense
of more traditional engineering.
Rapidly developing economies such as China and India, as well as other
industrial countries in Europe and Asia, continue to encourage and advance
the teaching of engineering. Both China and India, respectively, graduate
six and eight times as many traditional engineers as does the United States.
Other industrial countries at minimum maintain their output, while America
suffers an increasingly serious decline in the number of engineering graduates
and a lack of well-educated engineers.
"""
)
As with text generation, you can specify a max_length or min_length for the result.
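For example (a sketch; the limits are counted in tokens and the values here are arbitrary):
# Bound the summary length between min_length and max_length tokens
summarizer(
    "America has changed dramatically during recent years. ...",  # the same passage as above
    max_length=60,
    min_length=30,
)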
Translation
For translation, you can use a default model by providing the language pair in the task name (e.g. "translation_en_to_fr"), but the easiest way is to pick the model you want on the Model Hub. Here we will try translating from French to English:
pip install sentencepiece
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en", device=0)
translator("Ce cours est produit par Hugging Face.")
[{'translation_text': 'This course is produced by Hugging Face.'}]
Translate from English to Chinese:
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh", device=0)
translator("America has changed dramatically during recent years.")
[{'translation_text': 'In recent years, the United States has changed dramatically.'}]
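The reverse direction works the same way, assuming the Helsinki-NLP/opus-mt-zh-en checkpoint on the Hub:
from transformers import pipeline

# Chinese -> English with the matching OPUS-MT checkpoint
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en", device=0)
translator("近年来，美国发生了巨大变化。")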
Bias and limitations
If you intend to use a pretrained or fine-tuned model in a production project, please note: while these models are powerful, they have limitations. One of the biggest is that, to pretrain on large amounts of data, researchers usually scrape all the content they can find, which may carry ideological or value-based stereotypes.
To illustrate this quickly, let's go back to an example of a fill-mask pipeline using the BERT model:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased", device=0)
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
When asked to fill in the missing word in these two sentences, the model's top answers are heavily gender-stereotyped; in the course's original output, only one answer (waiter/waitress) was not gender-specific.