Preface: How does an artificial intelligence Large Language Model (LLM) understand natural human language? The process centers on converting text into a numerical form that computers can handle; after computing on those numbers, the model ultimately arrives at an understanding of the language. At first, we simply used a random number to represent a word or word root, but as research went deeper, we found that different numerical representations can significantly improve a model's understanding of language. Therefore, when building a Large Language Model (LLM), a crucial step is to transform human language into suitable numerical representations so that the model can receive and process input and generate useful output. With that, let's get to the point.
Working with Real Data Sources
Now that you understand the basics of fetching sentences, encoding them with a word index, and serializing the results, you can take your skills a step further by working with some well-known public datasets and using tools to convert them into formats that are easy to serialize. We'll start with the IMDb dataset in TensorFlow Datasets (TFDS), where most of the processing has already been done for you. After that, we'll get our hands dirty with a JSON-based dataset as well as a couple of comma-separated value (CSV) datasets containing sentiment data.
Getting Text from TensorFlow Datasets
We explored TFDS in Chapter 4, so if you're unfamiliar with some of the concepts in this section, review that chapter. The goal of TFDS is to make accessing data as simple as possible in a standardized way. It provides several text-based datasets; we'll explore imdb_reviews, an IMDb dataset of 50,000 movie reviews, each labeled as positive or negative in sentiment.
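Before loading any data, it can be handy to peek at a dataset's metadata. Here is a minimal sketch (assuming tensorflow_datasets is installed and imported as tfds) that loads imdb_reviews along with its info object and prints its features and splits:
import tensorflow_datasets as tfds

# Load the dataset together with its metadata object
data, info = tfds.load('imdb_reviews', with_info=True)

# Inspect what the dataset provides before iterating over it
print(info.features)   # the 'text' and 'label' features
print(info.splits)     # the available splits, such as 'train' and 'test'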
The following code will load the training split of the IMDb dataset and iterate through it item by item, adding the text field of each review to a list named imdb_sentences. Each review consists of the text plus a label indicating the sentiment. Note that wrapping the call in tfds.as_numpy ensures that the data is loaded as strings rather than tensors:
imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))
for item in train_data:
    imdb_sentences.append(str(item['text']))
Once the sentences are obtained, a tokenizer can be created and fitted to them as before, and a set of sequences can be created:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
You can also print out the word index to inspect it:
print(tokenizer.word_index)
The word index is too large to show it all here, but here are the first 20 words. Note that the tokenizer orders words by their frequency in the dataset, so common words such as "the", "and", and "a" receive the lowest indices:
{'the': 1, 'and': 2, 'a': 3, 'of': 4, 'to': 5, 'is': 6, 'br': 7, 'in': 8, 'it': 9, 'i': 10, 'this': 11, 'that': 12, 'was': 13, 'as': 14, 'for': 15, 'with': 16, 'movie': 17, 'but': 18, 'film': 19, "'s": 20, ...}
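As an aside, based on the standard behavior of the Keras Tokenizer rather than anything specific to this dataset: the word index itself contains every word seen during fitting, and the num_words cap is only applied when texts are converted to sequences, where less common words are simply dropped. A minimal sketch:
# The word index holds every word seen by fit_on_texts, not just the top 5,000
print(len(tokenizer.word_index))

# The num_words limit is applied here: words missing from the index, or ranked
# outside the most common 5,000, are simply omitted from the output sequences
print(tokenizer.texts_to_sequences(["the movie was frabjous"]))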
The words at the top of the list, such as "the", "and", and "a", are stopwords, as described in the previous section. Because these words occur most frequently and are not distinctive, their presence can hurt training accuracy. Note also that "br" is in the list, because it frequently appears in this corpus as the <br> HTML tag.
You can update the code to remove the HTML tags using BeautifulSoup, add a string translation to strip punctuation, and remove stopwords from a given list, as shown below:
from bs4 import BeautifulSoup
import string

stopwords = ["a", ..., "yourselves"]

# Translation table that strips all punctuation characters
table = str.maketrans('', '', string.punctuation)

imdb_sentences = []
train_data = tfds.as_numpy(tfds.load('imdb_reviews', split="train"))
for item in train_data:
    sentence = str(item['text'].decode('UTF-8').lower())
    # Strip HTML tags such as <br>
    soup = BeautifulSoup(sentence)
    sentence = soup.get_text()
    words = sentence.split()
    filtered_sentence = ""
    for word in words:
        word = word.translate(table)
        if word not in stopwords:
            filtered_sentence = filtered_sentence + word + " "
    imdb_sentences.append(filtered_sentence)

tokenizer = Tokenizer(num_words=25000)
tokenizer.fit_on_texts(imdb_sentences)
sequences = tokenizer.texts_to_sequences(imdb_sentences)
print(tokenizer.word_index)
Note that the sentences are converted to lower case before processing, since all the stopwords are stored in lower case. The printed word index now looks like this:
{'movie': 1, 'film': 2, 'not': 3, 'one': 4, 'like': 5, 'just': 6, 'good': 7, 'even': 8, 'no': 9, 'time': 10, 'really': 11, 'story': 12, 'see': 13, 'can': 14, 'much': 15, ...}
As you can see, it is now much cleaner than before. There is still room for improvement, however; when viewing the full index, I noticed that some uncommon entries at the end appear meaningless. Reviewers often combine words, for example with hyphens ("annoying-conclusion") or slashes ("him/her"), and stripping the punctuation incorrectly merges these into a single word.
You can fix this by adding code that puts spaces around these characters as soon as the sentence is created:
sentence = (",", " , ")
sentence = (".", " . ")
sentence = ("-", " - ")
sentence = ("/", " / ")
In this way, a combination like "him/her" is converted to "him / her"; the "/" is then stripped out as punctuation, and the result splits into two separate words. This may lead to better training results.
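To see the effect in isolation, here is a minimal standalone sketch (using the same kind of punctuation-stripping table as above) applied to a made-up review fragment:
import string

# Translation table that strips punctuation, as in the cleanup loop above
table = str.maketrans('', '', string.punctuation)

sentence = "him/her annoying-conclusion"

# Without extra spacing, stripping punctuation merges the joined words
print([w.translate(table) for w in sentence.split()])
# ['himher', 'annoyingconclusion']

# Adding spaces around the separators first lets them split cleanly
sentence = sentence.replace("-", " - ").replace("/", " / ")
print([w.translate(table) for w in sentence.split() if w.translate(table)])
# ['him', 'her', 'annoying', 'conclusion']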
Now that you have a tokenizer fitted to the corpus, you can encode sentences with it. For example, the simple sentences from the previous section would look like this:
sentences = [
'Today is a sunny day',
'Today is a rainy day',
'Is it sunny today?'
]
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
This gives the result:
[[516, 5229, 147], [516, 6489, 147], [5229, 516]]
If you decode these sequences, you will see that the stopwords have been removed and the sentences are encoded as "today sunny day", "today rainy day", and "sunny today".
If you want to decode them in code, you can create a new dictionary with the keys and values reversed (i.e., swap the key-value pairs in the word index) and perform the lookup. The code is as follows:
reverse_word_index = dict(
[(value, key) for (key, value) in tokenizer.word_index.items()])
decoded_review = ' '.join([reverse_word_index.get(i, '?') for i in sequences[0]])
print(decoded_review)
This will output:
today sunny day
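Notice also that words the tokenizer has never seen are silently dropped during encoding. If you would rather keep a visible placeholder for them, the Keras Tokenizer accepts an oov_token argument; the following is just a minimal sketch of that option, not part of the original example:
# Reserve an explicit out-of-vocabulary token instead of silently dropping
# unknown words; when set, the OOV token is assigned index 1
tokenizer = Tokenizer(num_words=25000, oov_token="<OOV>")
tokenizer.fit_on_texts(imdb_sentences)

# Words not found in the index now map to the <OOV> index rather than vanishing
print(tokenizer.texts_to_sequences(['Today is a frabjous day']))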
Using the IMDb Subword Dataset
TFDS also contains several IMDb datasets that use subword preprocessing. With these, you don't need to split the sentences into words yourself; they are already split into subwords. Using subwords is a middle ground between splitting into individual letters (few tokens, each with low semantic meaning) and whole words (many tokens, each with high semantic meaning), and this approach can often be very effective for training language classifiers. These datasets also contain the encoders and decoders used to split and encode the corpus.
To access them, call tfds.load and pass it imdb_reviews/subwords8k or imdb_reviews/subwords32k, for example:
(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k',
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    as_supervised=True,
    with_info=True
)
The encoder can be accessed through the info object. This is helpful, for example, for checking the vocabulary size:
encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))
The output is 8185, since the vocabulary in this instance consists of 8,185 tokens. If you want to see the list of subwords, you can use the encoder's subwords property:
print(encoder.subwords)
The output will look something like this:
['the_', ', ', '. ', 'a_', 'and_', 'of_', 'to_', 's_', 'is_', 'br', 'in_', 'I_', 'that_', ...]
Note that stopwords, punctuation, and grammar are all present in this corpus, as are HTML tags such as <br>. Spaces are represented by underscores, so the first token is the word "the".
If you want to encode a string, you can use the encoder:
sample_string = 'Today is a sunny day'
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
The output will be a list of tokens
Encoded string is [6427, 4869, 9, 4, 2365, 1361, 606]
Your five words are encoded into seven tokens. To see the tokens themselves, you can use the encoder's subwords property, which returns an array. The array is zero-based, so although "Tod" in "Today" is encoded as 6427, it is item 6426 in the array:
print(encoder.subwords[6426])
Output:
Tod
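If you want to see how the whole sample sentence was broken up, a short sketch like this maps every token in encoded_string back to its subword, keeping in mind the offset of one between token IDs and array positions:
# Token IDs are 1-based while the subwords array is 0-based, hence the "- 1"
for token in encoded_string:
    print('{} --> {}'.format(token, encoder.subwords[token - 1]))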
To decode, use the encoder's decode method:
encoded_string = encoder.encode(sample_string)
original_string = encoder.decode(encoded_string)
test_string = encoder.decode([6427, 4869, 9, 4, 2365, 1361, 606])
The last two lines of code will produce the same result, because encoded_string, despite its name, is a list of tokens identical to the one hardcoded on the line that follows it.
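As a quick sanity check, printing both decoded strings should show the same recovered sentence, since the subword encoding is reversible:
print(original_string)   # Today is a sunny day
print(test_string)       # Today is a sunny day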
Section summary: This section introduced how to convert text into numerical representations that a computer can understand. Specifically, we preprocessed text from TensorFlow Datasets, with steps including tokenization and stopword removal, and ultimately converted the text into sequences of numbers in preparation for subsequent natural language processing tasks. The next article supplements this one and focuses on how to extract text from CSV and JSON files for use in training a model.