
Artificial Intelligence - Introduction to Natural Language Processing


Preface: Natural Language Processing (NLP) is a branch of Artificial Intelligence that focuses on understanding content expressed in human language. It encompasses programming techniques for creating models that can understand language, classify content, and even generate new works in human language. We will explore these techniques over the next few chapters. There are also many services that use NLP to build applications, such as chatbots (these are applications and belong to agent application development), but they are outside the scope of this material - we will focus on the fundamentals of NLP (how it is implemented) and on language modeling, so that you can train neural networks that teach computers to understand and classify text.

We will start this section by looking at how language can be broken down into numbers and how those numbers can be used in neural networks. "Breaking down" here means replacing each word (or word root) in a sentence with a number, because computers can only work with numbers; once language has been converted into numbers and processed by the computer, the numbers can be converted back into words so that humans can read the result and understand what the computer has done.

Encoding language into numbers

There are several ways to encode language into numbers. The most common is to encode by letter, which is the natural way strings are stored in a program. In memory, however, you don't store the letter itself but its encoding - an ASCII or Unicode value, or some other form. For example, consider the word "listen". Encoded in ASCII, this word can be represented by the numbers 76, 73, 83, 84, 69, and 78. The advantage of this encoding is that the word is now represented as numbers. The disadvantage shows up with a word like "silent", which is an anagram of "listen": the two words are encoded with the same numbers, just in a different order, which makes it harder to build a model that understands the text.
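As a quick illustration (a minimal sketch of my own, not from the original text), here is how the two anagrams map to the same set of character codes in a different order:

listen = [ord(c) for c in "LISTEN"]
silent = [ord(c) for c in "SILENT"]

print(listen)                            # [76, 73, 83, 84, 69, 78]
print(silent)                            # [83, 73, 76, 69, 78, 84]
print(sorted(listen) == sorted(silent))  # True: same codes, different order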

An "antonym" is a word that is formed by reversing the order of the letters of one word to form another word, and the two have opposite meanings. For example, "united" and "untied" are a pair of antonymic anagrams, as are "restful" and "fluster," "santa" and "santa" and "santa" and "santa" and "santa" and "santa" and "santa" and "fluster. fluster", "Santa" and "Satan", "forty-five" and "over fifty". "over fifty". My previous job title was "Developer Evangelist", which was changed to "Developer Advocate" - a good thing, because "Developer Evangelist" is not the same as "Developer Advocate". This is a good thing, because "Evangelist" is an anagram of "Evil's Agent"!

A better alternative might be to encode the entire word in numbers, rather than letter by letter. In this case, "silent" could be represented by the number x and "listen" by the number y, without overlapping each other.

Using this technique, consider a sentence such as "I love my dog." You can encode it as the numbers [1, 2, 3, 4]. If you wanted to encode "I love my cat.", it would be [1, 2, 3, 5]. You can already see that these sentences are numerically similar - [1, 2, 3, 4] looks a lot like [1, 2, 3, 5], so you can assume that they have similar meanings.
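As a hand-rolled illustration (the dictionary and helper below are my own, not from the original text), here is what such a word-level encoding might look like before handing the job to a library:

word_index = {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}

def encode(sentence):
    # strip the period, lowercase, and look each word up in the dictionary
    return [word_index[word] for word in sentence.lower().replace('.', '').split()]

print(encode('I love my dog.'))   # [1, 2, 3, 4]
print(encode('I love my cat.'))   # [1, 2, 3, 5]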

This process is called tokenization, and you'll see how to implement it in code shortly.

Getting started with tokenization

TensorFlow Keras includes a library called "preprocessing" that provides a number of very useful tools for preparing data for machine learning. One of them is the "Tokenizer", which converts words into tokens. Let's see it in action with a simple example:

import tensorflow as tf

from tensorflow import keras

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [

'Today is a sunny day',

'Today is a rainy day'

]

tokenizer = Tokenizer(num_words=100)

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index

print(word_index)

In this example, we create a Tokenizer object and specify the maximum number of words to keep. This is an upper limit on the number of tokens taken from the vocabulary. Our vocabulary here is very small, containing only six unique words, so it is well below the one hundred we specified.

Once we have a tokenizer, calling fit_on_texts builds the tokenized word index. Printing it shows the word-to-index key/value pairs, similar to this:

{'today': 1, 'is': 2, 'a': 3, 'day': 4, 'sunny': 5, 'rainy': 6}

The tokenizer is quite flexible. For example, if we expand the corpus with another sentence that contains the word "today" followed by a question mark, the results show that it is smart enough to strip the punctuation and treat "today?" as "today":

sentences = [

'Today is a sunny day',

'Today is a rainy day',

'Is it sunny today?'

]

The output is: {'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}

This behavior is controlled by the tokenizer's filters parameter, which by default removes all punctuation except apostrophes. So, with the encoding above, "Today is a sunny day" becomes the sequence [1, 2, 3, 4, 5], while "Is it sunny today?" becomes [2, 7, 4, 1]. Once the words in a sentence have been tokenized, the next step is to convert the sentence into a list of numbers, where each number is the value mapped to that word in the dictionary.
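Before moving on, here is a small optional sketch of the filters parameter in action (overriding it this way is my own illustration, not from the original text): with the default filters the question mark disappears, while an empty filters string keeps it attached to the word.

from tensorflow.keras.preprocessing.text import Tokenizer

sample = ['Is it sunny today?']

default_tokenizer = Tokenizer(num_words=100)              # default filters strip punctuation
default_tokenizer.fit_on_texts(sample)
print(default_tokenizer.word_index)                       # {'is': 1, 'it': 2, 'sunny': 3, 'today': 4}

keep_punctuation = Tokenizer(num_words=100, filters='')   # empty filters keep punctuation
keep_punctuation.fit_on_texts(sample)
print(keep_punctuation.word_index)                        # {'is': 1, 'it': 2, 'sunny': 3, 'today?': 4}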

Converting sentences to sequences

Now that you understand how words are tokenized and converted to numbers, the next step is to encode whole sentences as sequences of numbers. The tokenizer has a method called texts_to_sequences: you simply pass in a list of sentences, and it returns a list of sequences. For example, if you modify the previous code as follows:

sentences = [

'Today is a sunny day',

'Today is a rainy day',

'Is it sunny today?'

]

tokenizer = Tokenizer(num_words=100)

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

print(sequences)

You will get the sequences that represent the three sentences. Recall that the vocabulary index looks like this:

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}

The output will be shown below:

[[1, 2, 3, 4, 5], [1, 2, 3, 6, 5], [2, 7, 4, 1]]

You can then replace the numbers with words so that the sentences make sense.
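For instance, here is a minimal sketch of reversing the lookup using the tokenizer's index_word mapping (the decode helper name is my own):

def decode(sequence, tokenizer):
    # look each token value up in the reverse index built by fit_on_texts
    return ' '.join(tokenizer.index_word.get(token, '?') for token in sequence)

print(decode([1, 2, 3, 4, 5], tokenizer))   # today is a sunny day
print(decode([2, 7, 4, 1], tokenizer))      # is it sunny today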

Now consider what happens when you train a neural network on a set of data. The usual situation is that your training data cannot cover everything you will ever see, only as much as possible. In NLP, your training data may contain thousands of words used in different contexts, but you cannot cover every possible word in every possible context. So what happens when you show your neural network new, previously unseen text containing words it has never seen? You guessed it: the network gets confused, because it has no context for those words, and its predictions suffer as a result.

Use of "out-of-vocabulary" tokens

One tool for dealing with these situations is the out-of-vocabulary (OOV) token. It helps your neural network keep some sense of the structure of data that contains unseen text. For example, suppose you have the small corpus from before and want to process sentences like these:

test_data = [

'Today is a snowy day',

'Will it be rainy tomorrow?'

]

Keep in mind that you are not adding these inputs to the existing text corpus (which you can think of as your training data); you are considering how a pre-trained network would process this text. If you tokenize these sentences using the existing vocabulary and tokenizer, as shown below:

test_sequences = tokenizer.texts_to_sequences(test_data)

print(word_index)

print(test_sequences)

The output is as follows:

{'today': 1, 'is': 2, 'a': 3, 'sunny': 4, 'day': 5, 'rainy': 6, 'it': 7}

[[1, 2, 3, 5], [7, 6]]

Then the new sentences, after swapping the tokens back into words, become "today is a day" and "it rainy".

As you can see, the context and meaning are almost completely lost. An "out-of-vocabulary" token can help here, and you specify it when creating the tokenizer. Just add the oov_token parameter, which you can set to any string you like, as long as it doesn't appear anywhere in your corpus:

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

test_sequences = tokenizer.texts_to_sequences(test_data)

print(word_index)

print(test_sequences)

You will see some improvements in the output:

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}

[[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]

A new item, "<OOV>", has been added to your token list, and your test sentences keep their length. Reverse-encoding them now gives "today is a <OOV> day" and "<OOV> it <OOV> rainy <OOV>".

The former is closer to the original meaning, while the latter still lacks context as most of the words are not in the corpus, but it is a step in the right direction.

Understanding padding

When training neural networks, it is often necessary for all data to have the same shape. Recall from previous chapters that training images requires that they be formatted to the same width and height. A similar problem is faced in text processing - once you have split words and converted sentences into sequences, they may vary in length. To make them the same size and shape, padding can be used.

To explore padding, let's add another longer sentence to the corpus:

sentences = [

'Today is a sunny day',

'Today is a rainy day',

'Is it sunny today?',

'I really enjoyed walking in the snow today'

]

When you convert them to sequences, you will see that the lists of numbers have different lengths:

[

[2, 3, 4, 5, 6],

[2, 3, 4, 7, 6],

[3, 8, 5, 2],

[9, 10, 11, 12, 13, 14, 15, 2]

]

(When you print these sequences, they appear on one line, which I've broken into multiple lines here for clarity.)

If you want these sequences to be the same length, you can use the pad_sequences API. First, you need to import it:

from tensorflow.keras.preprocessing.sequence import pad_sequences

Using this API is very simple. To convert your (unpadded) sequences into padded collections, simply call pad_sequences as follows:

padded = pad_sequences(sequences)

print(padded)

You will get a collection of neatly formatted sequences. They will be on separate lines, like this:

[[ 0 0 0 2 3 4 5 6]

[ 0 0 0 2 3 4 7 6]

[ 0 0 0 0 3 8 5 2]

[ 9 10 11 12 13 14 15 2]]

These sequences are padded with zeros, and zero is not a token in our word list. If you've ever wondered why the token index starts at 1 instead of 0, now you know why!

Now you have a consistently shaped array that can be used for training. But before that, let's explore this API further as it provides many options that can be used to optimize the data.

First, you may notice that for the shorter sentences, the required number of zeros is added at the beginning so that they match the shape of the longest sentence. This is called pre-padding, and it is the default behavior. You can change it with the padding parameter. For example, if you want the sequences padded with zeros at the end instead, use:

padded = pad_sequences(sequences, padding='post')

The output is as follows:

[[ 2 3 4 5 6 0 0 0]

[ 2 3 4 7 6 0 0 0]

[ 3 8 5 2 0 0 0 0]

[ 9 10 11 12 13 14 15 2]]

Now you can see that the words are at the beginning of the padded sequences and the zeros are at the end.

Another default behavior is that all sentences are padded to the length of the longest one. This is a reasonable default because it means you don't lose any data. The tradeoff is that you get a lot of padding. If you don't want that, for example because one very long sentence would force excessive padding, you can specify the desired maximum length with the maxlen parameter, as shown below:

padded = pad_sequences(sequences, padding='post', maxlen=6)

The output is as follows:

[[ 2 3 4 5 6 0]

[ 2 3 4 7 6 0]

[ 3 8 5 2 0 0]

[11 12 13 14 15 2]]

Now your padded sequences have a consistent length without excessive padding. However, notice that some words of the longest sentence have been truncated from its beginning. If you would rather keep the words at the start and truncate from the end of the sentence instead, override the default behavior with the truncating parameter, as shown below:

padded = pad_sequences(sequences, padding='post', maxlen=6, truncating='post')

The results show that the longest sentence is now truncated from the end, not the beginning:

[[ 2 3 4 5 6 0]

[ 2 3 4 7 6 0]

[ 3 8 5 2 0 0]

[ 9 10 11 12 13 14]]

TensorFlow also supports training with ragged (differently shaped) tensors, which is ideal for NLP needs. Using them is a little more advanced than what this material covers, but once you have completed the NLP primer in the next few chapters, you can learn more from the documentation.
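As a brief sketch of what that can look like (assuming the sequences list from above), tf.ragged.constant stores the unpadded sequences directly, with each row keeping its own length:

import tensorflow as tf

sequences = [[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2], [9, 10, 11, 12, 13, 14, 15, 2]]
ragged = tf.ragged.constant(sequences)   # no padding needed; rows keep their own lengths
print(ragged.shape)                      # (4, None)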

Remove stop words and clean up text

In the next sections, we'll look at some real text datasets and realize that there is often unwanted text in the data. You may want to filter out so-called "stop words" that are too common to have any real meaning, such as "the", "and", and "but". You may also come across a lot of HTML tags, and removing them will make the text cleaner. Additionally, other things to filter include foul language, punctuation, or people's names. Later we'll explore a dataset of tweets that often contain user IDs, which we'll also want to remove.

While the details vary with the content of the text, there are three main things you can usually do to clean up text programmatically. The first is to remove HTML tags. Fortunately, there is a library called BeautifulSoup that makes this easy. If your sentences contain HTML tags, the following code removes them:

from bs4 import BeautifulSoup

soup = BeautifulSoup(sentence)

sentence = soup.get_text()

A common way to remove stop words is to keep a list of them and preprocess each sentence to strip out any that appear. Here is a simplified example:

stopwords = ["a", "about", "above", ... "yours", "yourself", "yourselves"]

A complete list of stop words can be found in some of the online examples for this chapter. Then, as you loop through each sentence, you can use the following code to remove its stop words:

words = sentence.split()
filtered_sentence = ""
for word in words:
    if word not in stopwords:
        filtered_sentence = filtered_sentence + word + " "
print(filtered_sentence)   # or collect the cleaned sentence into a list

Another thing to consider is removing punctuation, which can interfere with stop-word removal. The code shown above looks for words surrounded by spaces, so a stop word immediately followed by a period or comma will not be recognized.

Python's string library makes this easy to fix with a translation table. It also provides a constant, string.punctuation, containing the common punctuation marks, so they can be removed from each word with the following code:

import string

table = str.maketrans('', '', string.punctuation)    # maps every punctuation character to None
words = sentence.split()
filtered_sentence = ""
for word in words:
    word = word.translate(table)                      # strip punctuation from the word
    if word not in stopwords:
        filtered_sentence = filtered_sentence + word + " "
print(filtered_sentence)   # or collect the cleaned sentence into a list

Here, punctuation is removed from each word before the stop-word filter is applied. So if splitting a sentence yields "it;", it is converted to "it" and then filtered out as a stop word. Note, however, that you may need to update your stop-word list when processing text this way. Such lists often contain contractions and abbreviations such as "you'll". The translation step turns "you'll" into "youll", so if you want it filtered out, you need to add "youll" to your stop-word list.
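One simple way to handle that (the contraction list below is just an illustration) is to run the contractions you care about through the same translation table and add the results to the stop-word list:

import string

contractions = ["you'll", "you're", "you'd"]                  # hypothetical extras to filter
table = str.maketrans('', '', string.punctuation)
stopwords.extend(c.translate(table) for c in contractions)    # adds "youll", "youre", "youd"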

After following these three steps, you'll have a much cleaner set of text data. Of course, every dataset is unique, and you'll need to adjust these steps on a case-by-case basis.

This concludes the section. It introduced the basic concepts of Natural Language Processing (NLP), including text encoding, tokenization, stop-word removal, and text cleaning. First, it explored how language can be converted into numbers for computer processing and how words can be mapped to numeric values through encoding. Then, tokenization tools (e.g., Tokenizer) were described for assigning and managing word indexes during text preprocessing. Strategies for dealing with unseen, out-of-vocabulary (OOV) words to reduce model errors were also discussed. For cleaning text, HTML tags are removed with the BeautifulSoup library, and the data is further cleaned using a stop-word list and a punctuation filter. In addition, padding was introduced to give the data a consistent shape suitable for model training. These steps provide a solid foundation for text cleaning and modeling, but should be adapted flexibly to the needs of each dataset in practice.