In natural language processing (NLP), text sentiment analysis is the task of automatically identifying the emotional tendency of a text (positive, negative, or neutral). Accurate sentiment analysis depends on careful preprocessing. This article walks you through the preprocessing pipeline step by step: data collection, tokenization, stop-word removal, and word frequency statistics. It then uses Python's NLTK/SpaCy, WordCloud, and Seaborn libraries to generate word clouds and high-frequency word distribution charts.
1. Data collection
Before performing text sentiment analysis, you need to obtain text data first. A commonly used dataset is the IMDB movie review dataset, which contains 50,000 movie reviews, divided into two categories: positive and negative.
- Data source: the IMDB dataset can be downloaded from several open platforms, such as Kaggle and the UCI Machine Learning Repository.
- Download the data: taking Kaggle as an example, visit the Kaggle website, search for the IMDB dataset, and download the CSV file containing both positive and negative reviews.
- Data preparation: unzip the downloaded dataset into a local directory and make sure each file contains the reviews of the corresponding category.
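Once downloaded, the two classes can be separated with pandas. This is a minimal sketch: the column names 'review' and 'sentiment' match the common "IMDB Dataset.csv" release on Kaggle, but check your own download; the tiny in-memory DataFrame below stands in for the real pd.read_csv call.

```python
import pandas as pd

# Stand-in for: df = pd.read_csv('IMDB Dataset.csv')
# (file name and column names are assumptions -- verify against your download)
df = pd.DataFrame({
    'review': ['Great movie, loved it.', 'Terrible plot and weak acting.'],
    'sentiment': ['positive', 'negative'],
})

# Split the reviews into two lists by their sentiment label
pos_reviews = df.loc[df['sentiment'] == 'positive', 'review'].tolist()
neg_reviews = df.loc[df['sentiment'] == 'negative', 'review'].tolist()
print(len(pos_reviews), len(neg_reviews))
```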
2. Environment setup
Before starting coding, make sure your development environment has the following Python libraries installed:
- NLTK or SpaCy: for text processing, such as tokenization and stop-word removal.
- Seaborn: for data visualization.
- Matplotlib: used together with Seaborn to render charts.
- WordCloud: for generating word clouds.
These libraries can be installed with the following command:
pip install nltk spacy seaborn matplotlib wordcloud
For SpaCy, you also need to download the English model:
python -m spacy download en_core_web_sm
3. Text preprocessing
1. Read data
First, write code to read the reviews in the IMDB dataset:
def read_reviews(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        reviews = file.readlines()
    return reviews

pos_reviews = read_reviews('path/to/')  # replace with the positive-review file
neg_reviews = read_reviews('path/to/')  # replace with the negative-review file
2. Tokenization
Tokenization is the process of splitting text into words or phrases. Below we tokenize with both NLTK and SpaCy.
Using NLTK:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the tokenizer models

def tokenize_reviews(reviews):
    # Lowercase each review, then split it into word tokens
    tokenized_reviews = [word_tokenize(review.lower()) for review in reviews]
    return tokenized_reviews

pos_tokenized = tokenize_reviews(pos_reviews)
neg_tokenized = tokenize_reviews(neg_reviews)
Using SpaCy:
import spacy

nlp = spacy.load('en_core_web_sm')

def spacy_tokenize_reviews(reviews):
    tokenized_reviews = []
    for review in reviews:
        doc = nlp(review.lower())
        tokenized_reviews.append([token.text for token in doc])
    return tokenized_reviews
pos_spacy_tokenized = spacy_tokenize_reviews(pos_reviews)
neg_spacy_tokenized = spacy_tokenize_reviews(neg_reviews)
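As a point of reference, the core idea behind both tokenizers can be illustrated with a plain regular expression. This is a rough approximation only; it does not handle punctuation-sensitive cases the way NLTK or SpaCy do:

```python
import re

def simple_tokenize(text):
    # Lowercase the text, then pick out runs of letters and apostrophes
    return re.findall(r"[a-z']+", text.lower())

print(simple_tokenize("This movie isn't bad at all!"))
# ['this', 'movie', "isn't", 'bad', 'at', 'all']
```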
3. Stop-word removal
Stop words are words that appear frequently in text but contribute little to sentiment analysis, such as "the" and "is". Here, NLTK's stop-word list is used to remove them.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # Download the stop-word list
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokenized_reviews):
    filtered_reviews = []
    for review in tokenized_reviews:
        # Keep alphabetic tokens that are not stop words
        filtered_review = [word for word in review if word.isalpha() and word not in stop_words]
        filtered_reviews.append(filtered_review)
    return filtered_reviews

pos_filtered = remove_stopwords(pos_tokenized)  # pos_spacy_tokenized works as well
neg_filtered = remove_stopwords(neg_tokenized)
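To see what the filter does, here is the same logic run on a hand-made token list with a tiny illustrative stop set (the real code above uses NLTK's full English list):

```python
# Illustrative stop set; NLTK's English list is much larger
demo_stop_words = {'the', 'is', 'a', 'and'}

def demo_remove_stopwords(tokenized_reviews):
    # Same rule as above: keep alphabetic tokens that are not stop words
    return [[w for w in review if w.isalpha() and w not in demo_stop_words]
            for review in tokenized_reviews]

print(demo_remove_stopwords([['the', 'movie', 'is', 'great', '!']]))
# [['movie', 'great']]
```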
4. Word frequency statistics
Count how often each word appears in the reviews for later analysis.
from collections import Counter

def get_word_frequencies(filtered_reviews):
    all_words = [word for review in filtered_reviews for word in review]
    word_freq = Counter(all_words)
    return word_freq
pos_word_freq = get_word_frequencies(pos_filtered)
neg_word_freq = get_word_frequencies(neg_filtered)
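A tiny worked example shows what Counter returns and how most_common can be used to inspect the result:

```python
from collections import Counter

# Two already-filtered reviews stand in for pos_filtered
filtered = [['great', 'movie'], ['great', 'acting']]
freq = Counter(word for review in filtered for word in review)
print(freq.most_common(1))  # [('great', 2)]
```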
4. Data visualization
1. Generate word clouds
A word cloud is a visualization that shows the high-frequency words in a text at a glance.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def generate_wordcloud(word_freq, title):
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq)
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()
generate_wordcloud(pos_word_freq, 'Positive Reviews Word Cloud')
generate_wordcloud(neg_word_freq, 'Negative Reviews Word Cloud')
2. Plot the high-frequency word distribution
Use the Seaborn library to plot how often the most frequent words appear in the positive and negative reviews.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_top_words(word_freq, title, num_words=20):
    top_words = word_freq.most_common(num_words)
    df = pd.DataFrame(top_words, columns=['Word', 'Frequency'])
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Frequency', y='Word', data=df, palette='viridis')
    plt.title(title)
    plt.xlabel('Frequency')
    plt.ylabel('Word')
    plt.show()
plot_top_words(pos_word_freq, 'Top 20 Words in Positive Reviews')
plot_top_words(neg_word_freq, 'Top 20 Words in Negative Reviews')
5. Summary and extension
Through this tutorial we have completed the entire pipeline from data collection through text preprocessing to data visualization. The steps were:
- Data collection: obtain positive and negative reviews from the IMDB dataset.
- Tokenization: split reviews into words with NLTK or SpaCy.
- Stop-word removal: remove uninformative words with NLTK's stop-word list.
- Word frequency statistics: count the occurrences of each word.
- Data visualization: generate word clouds and high-frequency word distribution charts.
Extension suggestions:
- Sentiment analysis model: After completing the preprocessing, sentiment analysis can be further performed using machine learning or deep learning models (such as LSTM, BERT).
- Multilingual support: Explore how to deal with non-English texts, such as Chinese, Spanish, etc.
- Real-time analysis: Integrate preprocessing and analysis processes into real-time systems such as social media monitoring tools.
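Before reaching for an LSTM or BERT, a useful baseline is a simple lexicon scorer run on the tokenized, stop-word-filtered output from the steps above. The word lists here are illustrative only, not a real sentiment lexicon:

```python
# Illustrative word lists; real lexicons are far larger
POSITIVE_WORDS = {'great', 'good', 'love', 'excellent', 'wonderful'}
NEGATIVE_WORDS = {'bad', 'terrible', 'awful', 'boring', 'weak'}

def score_review(tokens):
    # Count positive vs negative hits and label by the sign of the difference
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    return 'positive' if score > 0 else 'negative' if score < 0 else 'neutral'

print(score_review(['great', 'movie', 'love', 'it']))  # 'positive'
```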
Through continuous learning and practice, you will be able to master the preprocessing techniques of text sentiment analysis and apply them to various practical scenarios. I hope this article can provide you with valuable reference and guidance.