
Recurrent neural networks can also use pre-trained word embeddings


Preface: Training large AI models from scratch is a complex and costly undertaking, especially for today's LLMs (Large Language Models), and it is unaffordable for 99.99% of organizations worldwide. Training a model demands an enormous investment of resources, complex technical processes, and a great deal of human support. For this reason, in both research and practical applications, people usually rely on open source pre-trained models and the feature representations they have already learned, much as they rely on open source Linux. This section explains how to use the embeddings contained in these pre-trained models to solve real-world problems.

Using pre-trained embeddings with RNNs

In all the previous examples, we collected the full set of words used in the training set and trained embeddings for them. Those embeddings were initially aggregated and fed into a dense network, and in the most recent section we explored how an RNN could improve the results. In doing so, we were restricted to the words that already existed in the dataset, and to embeddings learned using that dataset's labels. Recall that earlier in the chapter we discussed transfer learning. What if, instead of learning the embeddings yourself, you could use embeddings that have already been learned for you, where researchers have done the hard work of converting words into vectors, and where those vectors have been validated? One example is the GloVe (Global Vectors for Word Representation) model developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford University.

In this case, the researchers shared the word vectors they pre-trained for various datasets:

- A vocabulary of 400,000 words from 6 billion tokens, in 50, 100, 200, and 300 dimensions, drawn from Wikipedia and Gigaword

- A vocabulary of 1.9 million words from 42 billion tokens, in 300 dimensions, drawn from Common Crawl

- A vocabulary of 2.2 million words from 840 billion tokens, in 300 dimensions, drawn from Common Crawl

- A vocabulary of 1.2 million words from 27 billion tokens, in 25, 50, 100, and 200 dimensions, drawn from a crawl of 2 billion tweets on Twitter

Given that these vectors are already pre-trained, we can easily reuse them in our TensorFlow code instead of learning them from scratch. First, we need to download the GloVe data. Here we'll use the Twitter dataset, with its vocabulary of 1.2 million words built from 27 billion tokens. The download is an archive containing the 25-, 50-, 100-, and 200-dimensional versions.

To make the whole process a little easier, I've hosted the 25-dimensional version, which you can download into your Colab notebook like this:

# Substitute the URL where the hosted glove.twitter.27B.25d.zip archive lives;
# the original link is not preserved in this copy.
!wget --no-check-certificate \
    <URL-of-hosted-glove.twitter.27B.25d.zip> \
    -O /tmp/glove.zip

This is a ZIP file, which you can extract like this to get a file called glove.twitter.27B.25d.txt:

Unpacking the GloVe embeddings

import os
import zipfile

# Path to the downloaded archive (matches the -O target of the wget command above)
local_zip = '/tmp/glove.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp/glove')
zip_ref.close()

Each line in the file consists of a word followed by the coefficients learned for each dimension. The simplest way to use it is to create a dictionary whose keys are words and whose values are the embeddings. You can build that dictionary like this:

import numpy as np

# Build a dictionary mapping each word to its GloVe coefficient vector
glove_embeddings = dict()
f = open('/tmp/glove/glove.twitter.27B.25d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    glove_embeddings[word] = coefs
f.close()

At this point, you can find the set of coefficients for any word simply by using the word as a key. For example, to see the embedding of "frog", you can use:

glove_embeddings['frog']
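Note that looking up a word that isn't in the GloVe vocabulary with square brackets raises a KeyError. Here is a minimal sketch of a safer lookup using the dictionary's get method, which returns None for unknown words:

vector = glove_embeddings.get('frog')
if vector is not None:
    print(vector.shape)  # (25,) for the 25-dimensional Twitter embeddings
    print(vector[:5])    # first few coefficients
else:
    print("word not in the GloVe vocabulary")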

With this resource, you can use the tokenizer to get the word index for your corpus as before, but now you can create a new matrix, which I call the embedding matrix. It will use the GloVe embeddings (taken from glove_embeddings) as its values. So, if you examine the words in the dataset's word index, like this:

{'<OOV>': 1, 'new': 2, … 'not': 5, 'just': 6, 'will': 7}

Then the first row of the embedding matrix should hold the GloVe coefficients for "<OOV>", the next row the coefficients for "new", and so on.

You can create this matrix using the following code:

# Matrix of shape (vocab_size, embedding_dim); rows stay zero when a word has no GloVe vector
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, index in tokenizer.word_index.items():
    if index > vocab_size - 1:
        break
    else:
        embedding_vector = glove_embeddings.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector

This simply creates a matrix with the dimensions of your desired vocabulary size and embedding dimension. Then, for each item in the tokenizer's word index, you look up the GloVe coefficients (from glove_embeddings) and copy those values into the matrix.
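As a quick sanity check, here is a short sketch that counts how many rows of embedding_matrix are still all zeros, i.e., how many vocabulary words received no GloVe vector (row 0 is reserved for padding, so it always stays zero):

# Count rows that never received a GloVe vector
missing = int(np.sum(~embedding_matrix.any(axis=1)))
print(f"{missing} of {vocab_size} rows have no GloVe embedding")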

Next, you need to modify the embedding layer to use the pre-trained embeddings by setting its weights parameter, and specify that the layer should not be trained by setting trainable=False:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              weights=[embedding_matrix], trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Now you can train as you did before. However, you need to consider the size of your vocabulary. In the previous chapter you made some optimizations to avoid overfitting, aimed at preventing the embeddings from having to learn too many low-frequency words; you did this by using a smaller vocabulary of only commonly used words. In this case, since the word embeddings have already been learned for you through GloVe, you can expand the vocabulary. But by how much?

The first thing to explore is how many of the words in your corpus are actually in the GloVe set. GloVe has 1.2 million words, but there is no guarantee that it contains all of yours. So, here is some code for a quick comparison that lets you explore how big your vocabulary should be.

First, organize the data. Create lists of Xs and Ys, where X is the word index and Y is 1 if the word is in the GloVe embeddings and 0 if it is not. You can also build a cumulative set that tracks, at each index, the proportion of words seen so far that are in GloVe. For example, the "<OOV>" token at index 1 is not in GloVe, so its cumulative Y value is 0. The next word, "new", at index 2 is in GloVe, so its cumulative Y value is 0.5 (half of the words seen so far are in GloVe), and you continue counting this way through the entire dataset:

xs = []
ys = []
cumulative_x = []
cumulative_y = []
total_y = 0
for word, index in tokenizer.word_index.items():
    xs.append(index)
    cumulative_x.append(index)
    if glove_embeddings.get(word) is not None:
        total_y = total_y + 1
        ys.append(1)
    else:
        ys.append(0)
    cumulative_y.append(total_y / index)

You can then plot Xs versus Ys using the following code:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 2))
ax.spines['top'].set_visible(False)
plt.margins(x=0, y=None, tight=True)
# plt.axis([13000, 14000, 0, 1])  # uncomment to zoom in on the 13,000-14,000 range
plt.fill(ys)

This will give you a word frequency graph that looks like Figure 7-17.

Figure 7-17. Word frequency chart

As the chart shows, the density changes somewhere between 10,000 and 15,000: it is around the 13,000 mark that words not in the GloVe embeddings begin to outnumber those that are.

You will be able to understand the change better if you then plot cumulative_x vs cumulative_y. Here is the code:

import matplotlib.pyplot as plt

plt.plot(cumulative_x, cumulative_y)
plt.axis([0, 25000, .915, .985])

You can see the results in Figure 7-18.


Figure 7-18. Plotting word index frequency vs. GloVe

Now you can adjust the parameters in plt.axis to zoom in and find the inflection point, where words that don't appear in GloVe start to outnumber those that do. This is a good starting point for choosing your vocabulary size.
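For example, here is a minimal zoom sketch; the axis bounds are illustrative assumptions, and you would tune them to your own corpus:

# Zoom the cumulative-coverage plot in on the region where the curve bends
plt.plot(cumulative_x, cumulative_y)
plt.axis([10000, 16000, .955, .975])  # hypothetical window around the inflection point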

Using this approach, I chose a vocabulary size of 13,200 (instead of the 2,000 used previously to avoid overfitting) and used the following model architecture, where embedding_dim is 25 because I am using the GloVe set:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              weights=[embedding_matrix], trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(embedding_dim)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Then, use the Adam optimizer:

adam = tf.keras.optimizers.Adam(learning_rate=0.00001, beta_1=0.9,
                                beta_2=0.999, amsgrad=False)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
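Training then proceeds as before. Here is a minimal sketch of the fit call, assuming the padded sequences and labels (training_padded, training_labels, testing_padded, testing_labels) were prepared as in the earlier sarcasm examples; those variable names are assumptions, not taken from this text:

num_epochs = 30
history = model.fit(training_padded, training_labels,
                    epochs=num_epochs,
                    validation_data=(testing_padded, testing_labels),
                    verbose=2)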

After training for 30 epochs, good results were obtained. The accuracy is shown in Figure 7-19. The validation accuracy is very close to the training accuracy, indicating that we are no longer overfitting.


Figure 7-19. Stacked LSTM accuracy using GloVe embeddings

This is further confirmed by the loss curve, shown in Figure 7-20. The validation loss is no longer diverging, which suggests that although our accuracy is only about 73%, we can be reasonably confident the model is accurate to that degree.

Figure 7-20. Stacked LSTM loss using GloVe embeddings

Training the model for a longer period gives very similar results and shows that it remains quite stable, even though overfitting begins to appear around epoch 80.

The accuracy metrics (Figure 7-21) show that the model is well trained.

The loss curves (Figure 7-22) show divergence beginning around epoch 80, but the model still fits well.


Figure 7-21. Accuracy of the stacked LSTM with GloVe over 150 epochs


Figure 7-22. Loss of the stacked LSTM with GloVe over 150 epochs

This tells us that this model is a good candidate for early stopping and that you only need to train it for 75 to 80 epochs to get the best results.
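One way to act on this is a Keras EarlyStopping callback. The following sketch is illustrative only; the monitored metric, patience value, and data variable names are assumptions:

# Stop training when validation loss stops improving and keep the best weights
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=5,
                                              restore_best_weights=True)
history = model.fit(training_padded, training_labels,
                    epochs=150,
                    validation_data=(testing_padded, testing_labels),
                    callbacks=[early_stop])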

I tested this with headlines from The Onion (the source of the satirical headlines in the sarcasm dataset) mixed in with some other sentences, using the following test code:

test_sentences = [
    "It Was, For, Uh, Medical Reasons, Says Doctor To Boris Johnson, Explaining Why They Had To Give Him Haircut",
    "It's a beautiful sunny day",
    "I lived in Ireland, so in high school they made me learn to speak and write in Gaelic",
    "Census Foot Soldiers Swarm Neighborhoods, Kick Down Doors To Tally Household Sizes"
]
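To score these sentences, they must pass through the same tokenization and padding as the training data. Here is a minimal sketch, assuming the tokenizer and the max_length, padding_type, and trunc_type settings from the earlier preprocessing steps are still in scope (those names are assumptions):

# Convert the sentences to padded sequences and run them through the model
sequences = tokenizer.texts_to_sequences(test_sentences)
padded = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=max_length,
    padding=padding_type, truncating=trunc_type)
print(model.predict(padded))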

The results for these sentences are shown below. Remember that values close to 50% (0.5) are considered neutral, values close to 0 not sarcastic, and values close to 1 sarcastic:

[[0.8170955 ]

[0.08711044]

[0.61809343]

[0.8015281 ]]

The first and fourth sentences from The Onion showed a probability of sarcasm of over 80%. The statement about the weather, on the other hand, appears to be very non-sarcastic (9%), while the sentence about going to high school in Ireland is perceived as possibly sarcastic, but with low confidence (62%).

Summary

In this section we introduced recurrent neural networks (RNNs), which use sequence-oriented logic in their design and can help you understand the sentiment of a sentence based not only on the words in it, but also on the order in which they appear. You saw how a basic RNN works and how an LSTM improves on it by preserving long-term context. You used these techniques to improve the sentiment analysis model you have been working on, then looked at the overfitting problems of RNNs and techniques to mitigate them, including transfer learning from pre-trained embeddings. In the next chapters, we'll use everything learned so far to explore how to predict words and, in turn, build a model that can generate text and even write poetry for you!