TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to a document within a collection. It is the product of two components, TF and IDF, each defined below; the full TF-IDF formula is given at the end of this section.
The first component is TF, the term frequency: a measure of how often a word appears in a document. If a word appears $n$ times in a document that contains $N$ words in total, the TF of the word is defined as:

$$TF(t, d) = \frac{n}{N}$$
Note: in $TF(t, d)$, $t$ denotes the term and $d$ denotes the document, so computing TF is simply word-frequency counting. The implementation is below.
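As a quick worked example of the formula: if a word appears 3 times in a document of 6 words, its TF is

$$TF(t, d) = \frac{3}{6} = 0.5$$

which matches the value printed for `example` in the second sample document below.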
```python
from collections import Counter


def compute_tf(word_dict, doc_words):
    """
    :param word_dict: word counts for the document
    :param doc_words: list of words in the document
    :return: dict mapping each word to its term frequency
    """
    tf_dict = {}
    words_len = len(doc_words)
    for word_i, count_i in word_dict.items():
        tf_dict[word_i] = count_i / words_len
    return tf_dict


# example documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# count word occurrences in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# calculate TF
tf1 = compute_tf(word_dict1, doc1_words)
tf2 = compute_tf(word_dict2, doc2_words)
tf3 = compute_tf(word_dict3, doc3_words)
print(f'tf1:{tf1}')
print(f'tf2:{tf2}')
print(f'tf3:{tf3}')
# tf1:{'this': 0.25, 'is': 0.25, 'a': 0.25, 'sample': 0.25}
# tf2:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'another': 0.16666666666666666, 'example': 0.5}
# tf3:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'different': 0.16666666666666666, 'example': 0.3333333333333333}
```
Having covered TF, let's look at IDF, the inverse document frequency, along with its formula and implementation. IDF reflects how rare a word is: the fewer documents in the collection a word appears in, the higher its IDF and the more discriminative it is. Function words such as "the" and "a" occur in nearly every document and carry little information, so IDF down-weights them. Take a look at the IDF formula:
$$IDF(t) = \log\frac{|D|}{df_t + 1}$$

where $|D|$ is the total number of documents and $df_t$ is the number of documents containing the term $t$. Taking the logarithm avoids overly large values and preserves the monotonically decreasing behaviour of IDF, while the $+1$ in the denominator guards against division by zero for unseen terms. The implementation is below:
```python
import math


def compute_idf(doc_list):
    """
    :param doc_list: list of documents, each a list of words
    :return: dict mapping each word to its IDF
    """
    # vocabulary across all documents
    sum_list = list(set(word_i for doc_i in doc_list for word_i in doc_i))
    # count the number of documents containing each word
    idf_dict = {word_i: 0 for word_i in sum_list}
    for word_j in sum_list:
        for doc_j in doc_list:
            if word_j in doc_j:
                idf_dict[word_j] += 1
    return {k: math.log(len(doc_list) / (v + 1)) for k, v in idf_dict.items()}


# example documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# calculate IDF over the whole document collection
idf = compute_idf([doc1_words, doc2_words, doc3_words])
# idf:{'different': 0.4054651081081644, 'another': 0.4054651081081644, 'a': 0.0, 'example': 0.0, 'this': -0.2876820724517809, 'sample': 0.4054651081081644, 'is': -0.2876820724517809}
```
From the results, we can see that different, another, and sample all have higher IDF values than words such as is and a, meaning they are more distinctive.
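Note that the negative values come from the $+1$ smoothing in the denominator: a word that appears in every document produces a ratio below 1, whose logarithm is negative. For `this`, which occurs in all three documents:

$$IDF(\text{this}) = \log\frac{3}{3 + 1} = \log 0.75 \approx -0.288$$

This matches the value printed above; some formulations instead use $\log\frac{|D|}{df_t} + 1$ or add 1 to both numerator and denominator to keep IDF non-negative.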
Okay, let's take a final look at the TF-IDF formula.
$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
Multiplying the two combines how frequent a word is within a given document (TF) with how rare it is across the collection (IDF), giving an overall measure of the word's importance to that document.
Finally, here is the complete code:
```python
import math
from collections import Counter


def compute_tf(word_dict, doc_words):
    """
    :param word_dict: word counts for the document
    :param doc_words: list of words in the document
    :return: dict mapping each word to its term frequency
    """
    tf_dict = {}
    words_len = len(doc_words)
    for word_i, count_i in word_dict.items():
        tf_dict[word_i] = count_i / words_len
    return tf_dict


def compute_idf(doc_list):
    """
    :param doc_list: list of documents, each a list of words
    :return: dict mapping each word to its IDF
    """
    sum_list = list(set(word_i for doc_i in doc_list for word_i in doc_i))
    idf_dict = {word_i: 0 for word_i in sum_list}
    for word_j in sum_list:
        for doc_j in doc_list:
            if word_j in doc_j:
                idf_dict[word_j] += 1
    return {k: math.log(len(doc_list) / (v + 1)) for k, v in idf_dict.items()}


def compute_tfidf(tf_dict, idf_dict):
    """Combine TF and IDF: tf-idf = tf * idf for each word."""
    tfidf = {}
    for word, tf_value in tf_dict.items():
        tfidf[word] = tf_value * idf_dict[word]
    return tfidf


# example documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# count word occurrences in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# calculate TF
tf1 = compute_tf(word_dict1, doc1_words)
tf2 = compute_tf(word_dict2, doc2_words)
tf3 = compute_tf(word_dict3, doc3_words)
print(f'tf1:{tf1}')
print(f'tf2:{tf2}')
print(f'tf3:{tf3}')

# calculate IDF over the whole document collection
idf = compute_idf([doc1_words, doc2_words, doc3_words])
print(f'idf:{idf}')

# calculate TF-IDF for each document
tfidf1 = compute_tfidf(tf1, idf)
tfidf2 = compute_tfidf(tf2, idf)
tfidf3 = compute_tfidf(tf3, idf)
print("TF-IDF for Document 1:", tfidf1)
print("TF-IDF for Document 2:", tfidf2)
print("TF-IDF for Document 3:", tfidf3)
"""
tf1:{'this': 0.25, 'is': 0.25, 'a': 0.25, 'sample': 0.25}
tf2:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'another': 0.16666666666666666, 'example': 0.5}
tf3:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'different': 0.16666666666666666, 'example': 0.3333333333333333}
idf:{'example': 0.0, 'different': 0.4054651081081644, 'this': -0.2876820724517809, 'another': 0.4054651081081644, 'is': -0.2876820724517809, 'a': 0.0, 'sample': 0.4054651081081644}
TF-IDF for Document 1: {'this': -0.07192051811294523, 'is': -0.07192051811294523, 'a': 0.0, 'sample': 0.1013662770270411}
TF-IDF for Document 2: {'this': -0.047947012075296815, 'is': -0.047947012075296815, 'another': 0.06757751801802739, 'example': 0.0}
TF-IDF for Document 3: {'this': -0.047947012075296815, 'is': -0.047947012075296815, 'a': 0.0, 'different': 0.06757751801802739, 'example': 0.0}
"""
```