TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to a document within a collection. It is the product of two components, TF and IDF, each defined below; the full TF-IDF formula is given at the end of this section.
The first component is TF, the term frequency: a measure of how often a word appears in a document. If a word appears $n$ times in a document that contains $N$ words in total, the TF of the word is defined as:

$$TF(t, d) = \frac{n}{N}$$
Note: in $TF(t, d)$, $t$ denotes the term and $d$ denotes the document, so computing TF is simply word-frequency counting. The implementation is below.
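As a quick worked example of the formula: if a word appears 3 times in a document of 6 words, its TF is

$$TF(t, d) = \frac{3}{6} = 0.5$$

which matches the value printed for `example` in the second sample document below.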
```python
from collections import Counter


def compute_tf(word_dict, doc_words):
    """
    :param word_dict: word counts for the document
    :param doc_words: list of words in the document
    :return: dict mapping each word to its term frequency
    """
    tf_dict = {}
    words_len = len(doc_words)
    for word_i, count_i in word_dict.items():
        tf_dict[word_i] = count_i / words_len
    return tf_dict


# example documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# count word occurrences in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# calculate TF
tf1 = compute_tf(word_dict1, doc1_words)
tf2 = compute_tf(word_dict2, doc2_words)
tf3 = compute_tf(word_dict3, doc3_words)
print(f'tf1:{tf1}')
print(f'tf2:{tf2}')
print(f'tf3:{tf3}')
# tf1:{'this': 0.25, 'is': 0.25, 'a': 0.25, 'sample': 0.25}
# tf2:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'another': 0.16666666666666666, 'example': 0.5}
# tf3:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'different': 0.16666666666666666, 'example': 0.3333333333333333}
```
Having covered TF, let's look at IDF, the inverse document frequency, along with its formula and implementation. IDF reflects how rare a word is: the fewer documents in the collection a word appears in, the higher its IDF and the more discriminative it is. Function words such as "the" and "a" occur in nearly every document and carry little information, so IDF down-weights them. Take a look at the IDF formula:
$$IDF(t) = \log\frac{|D|}{df_t + 1}$$

where $|D|$ is the total number of documents and $df_t$ is the number of documents containing the term $t$. Taking the logarithm avoids overly large values and preserves the monotonically decreasing behaviour of IDF, while the $+1$ in the denominator guards against division by zero for unseen terms. The implementation is below:
```python
import math


def compute_idf(doc_list):
    """
    :param doc_list: list of documents, each a list of words
    :return: dict mapping each word to its IDF
    """
    # vocabulary across all documents
    sum_list = list(set(word_i for doc_i in doc_list for word_i in doc_i))
    # count the number of documents containing each word
    idf_dict = {word_i: 0 for word_i in sum_list}
    for word_j in sum_list:
        for doc_j in doc_list:
            if word_j in doc_j:
                idf_dict[word_j] += 1
    return {k: math.log(len(doc_list) / (v + 1)) for k, v in idf_dict.items()}


# example documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# calculate IDF over the whole document collection
idf = compute_idf([doc1_words, doc2_words, doc3_words])
# idf:{'different': 0.4054651081081644, 'another': 0.4054651081081644, 'a': 0.0, 'example': 0.0, 'this': -0.2876820724517809, 'sample': 0.4054651081081644, 'is': -0.2876820724517809}
```
From the results, we can see that different, another, and sample all have higher IDF values than words such as is and a, meaning they are more distinctive.
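Note that the negative values come from the $+1$ smoothing in the denominator: a word that appears in every document produces a ratio below 1, whose logarithm is negative. For `this`, which occurs in all three documents:

$$IDF(\text{this}) = \log\frac{3}{3 + 1} = \log 0.75 \approx -0.288$$

This matches the value printed above; some formulations instead use $\log\frac{|D|}{df_t} + 1$ or add 1 to both numerator and denominator to keep IDF non-negative.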
Okay, let's take a final look at the TF-IDF formula.
$$TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)$$
Multiplying the two combines how frequent a word is within a given document (TF) with how rare it is across the collection (IDF), giving an overall measure of the word's importance to that document.
Finally, here is the complete code:
```python
import math
from collections import Counter


def compute_tf(word_dict, doc_words):
    """
    :param word_dict: word counts for the document
    :param doc_words: list of words in the document
    :return: dict mapping each word to its term frequency
    """
    tf_dict = {}
    words_len = len(doc_words)
    for word_i, count_i in word_dict.items():
        tf_dict[word_i] = count_i / words_len
    return tf_dict


def compute_idf(doc_list):
    """
    :param doc_list: list of documents, each a list of words
    :return: dict mapping each word to its IDF
    """
    sum_list = list(set(word_i for doc_i in doc_list for word_i in doc_i))
    idf_dict = {word_i: 0 for word_i in sum_list}
    for word_j in sum_list:
        for doc_j in doc_list:
            if word_j in doc_j:
                idf_dict[word_j] += 1
    return {k: math.log(len(doc_list) / (v + 1)) for k, v in idf_dict.items()}


def compute_tfidf(tf_dict, idf_dict):
    """Combine TF and IDF: tf-idf = tf * idf for each word."""
    tfidf = {}
    for word, tf_value in tf_dict.items():
        tfidf[word] = tf_value * idf_dict[word]
    return tfidf


# example documents
doc1 = "this is a sample"
doc2 = "this is another example example example"
doc3 = "this is a different example example"

# split into words
doc1_words = doc1.split()
doc2_words = doc2.split()
doc3_words = doc3.split()

# count word occurrences in each document
word_dict1 = Counter(doc1_words)
word_dict2 = Counter(doc2_words)
word_dict3 = Counter(doc3_words)

# calculate TF
tf1 = compute_tf(word_dict1, doc1_words)
tf2 = compute_tf(word_dict2, doc2_words)
tf3 = compute_tf(word_dict3, doc3_words)
print(f'tf1:{tf1}')
print(f'tf2:{tf2}')
print(f'tf3:{tf3}')

# calculate IDF over the whole document collection
idf = compute_idf([doc1_words, doc2_words, doc3_words])
print(f'idf:{idf}')

# calculate TF-IDF for each document
tfidf1 = compute_tfidf(tf1, idf)
tfidf2 = compute_tfidf(tf2, idf)
tfidf3 = compute_tfidf(tf3, idf)
print("TF-IDF for Document 1:", tfidf1)
print("TF-IDF for Document 2:", tfidf2)
print("TF-IDF for Document 3:", tfidf3)
"""
tf1:{'this': 0.25, 'is': 0.25, 'a': 0.25, 'sample': 0.25}
tf2:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'another': 0.16666666666666666, 'example': 0.5}
tf3:{'this': 0.16666666666666666, 'is': 0.16666666666666666, 'a': 0.16666666666666666, 'different': 0.16666666666666666, 'example': 0.3333333333333333}
idf:{'example': 0.0, 'different': 0.4054651081081644, 'this': -0.2876820724517809, 'another': 0.4054651081081644, 'is': -0.2876820724517809, 'a': 0.0, 'sample': 0.4054651081081644}
TF-IDF for Document 1: {'this': -0.07192051811294523, 'is': -0.07192051811294523, 'a': 0.0, 'sample': 0.1013662770270411}
TF-IDF for Document 2: {'this': -0.047947012075296815, 'is': -0.047947012075296815, 'another': 0.06757751801802739, 'example': 0.0}
TF-IDF for Document 3: {'this': -0.047947012075296815, 'is': -0.047947012075296815, 'a': 0.0, 'different': 0.06757751801802739, 'example': 0.0}
"""
```