Simple topic identification
This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and Tf-idf, using NLTK and a new library, Gensim. This is the summary of the lecture "Introduction to Natural Language Processing in Python", via DataCamp.
from pprint import pprint
from collections import Counter
from nltk.tokenize import word_tokenize

my_string = "The cat is in the box. The cat box."

# Tokenize the string and count how often each token appears
tokens = word_tokenize(my_string)
Counter(tokens).most_common(len(tokens))
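most_common(n) returns the n highest-count (token, count) pairs in descending order, so passing the total number of tokens simply lists every token with its frequency. A minimal illustration of the return format, on a hand-made token list rather than the tokenizer output:

# Toy token list, for illustration only
Counter(['the', 'cat', 'the']).most_common()   # [('the', 2), ('cat', 1)]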
Building a Counter with bag-of-words
In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as article. Try building the bag-of-words without looking at the full article text, and guess what the topic is! If you'd like to peek at the title at the end, we've included it as article_title. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.
from nltk.tokenize import word_tokenize

# Read the pre-downloaded Wikipedia article
with open('./dataset/wikipedia_articles/wiki_text_debugging.txt', 'r') as file:
    article = file.read()

# Tokenize the article: tokens
tokens = word_tokenize(article)

# The article title is the third token of the raw text: article_title
article_title = tokens[2]
# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]
# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)
# Print the 10 most common tokens
pprint(bow_simple.most_common(10))
Simple text preprocessing
- Preprocessing
  - Helps make for better input data
    - When performing machine learning or other statistical methods
- Examples
  - Tokenization to create a bag of words
  - Lowercasing words
  - Lemmatization / Stemming
    - Shorten words to their root stems (see the short sketch after this list)
  - Removing stop words, punctuation, or unwanted tokens
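Before applying these steps to the article, here is a minimal sketch of how stemming and lemmatizing differ, using NLTK's PorterStemmer and WordNetLemmatizer on a single hand-picked word (illustration only; the lemmatizer relies on the wordnet data downloaded just below):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming crudely chops the word down to a root form
print(stemmer.stem('running'))                   # run
# Lemmatizing maps it to a dictionary form; the default part of speech is noun
print(lemmatizer.lemmatize('running'))           # running
# Telling the lemmatizer the word is a verb recovers the root
print(lemmatizer.lemmatize('running', pos='v'))  # run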
Note: the WordNetLemmatizer relies on NLTK's wordnet package, so download it first:
import nltk
nltk.download('wordnet')
# Load the English stop words (assuming a comma- or whitespace-separated file): english_stops
with open('./dataset/english_stopwords.txt', 'r') as file:
    english_stops = set(file.read().replace(',', ' ').split())
from nltk.stem import WordNetLemmatizer
# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]
# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
# Create the bag-of-words: bow
bow = Counter(lemmatized)
# Print the 10 most common tokens
pprint(bow.most_common(10))
Creating and querying a corpus with gensim
It's time to apply the methods you learned in the previous video to create your first gensim dictionary and corpus! You'll use these data structures to investigate word trends and potential interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which are preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These are then stored in a list of document tokens called articles. You'll need to do some light preprocessing and then generate the gensim dictionary and corpus.
import glob

# Gather the paths of the pre-downloaded Wikipedia articles
path_list = glob.glob('dataset/wikipedia_articles/*.txt')

# Preprocess each article: lowercase, keep alphabetic tokens, drop stop words
articles = []
for article_path in path_list:
    with open(article_path, 'r') as file:
        a = file.read()
    tokens = word_tokenize(a)
    lower_tokens = [t.lower() for t in tokens]
    # Retain alphabetic words: alpha_only
    alpha_only = [t for t in lower_tokens if t.isalpha()]
    # Remove all stop words: no_stops
    no_stops = [t for t in alpha_only if t not in english_stops]
    articles.append(no_stops)
from gensim.corpora.dictionary import Dictionary
# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)
# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")
# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))
# Create a bag-of-words corpus from the articles: corpus
corpus = [dictionary.doc2bow(article) for article in articles]
# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])
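Each entry of corpus is itself a list of (token_id, token_count) tuples, one per unique token in that article. As a minimal sketch of what doc2bow produces, using a toy document rather than the real articles:

# Toy illustration only: ids come from the dictionary, counts from the document
toy_dict = Dictionary([['computer', 'software', 'computer']])
print(toy_dict.token2id)                                  # e.g. {'computer': 0, 'software': 1}
print(toy_dict.doc2bow(['computer', 'computer', 'cat']))  # [(0, 2)] - tokens missing from the dictionary are ignored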
Gensim bag-of-words
Now, you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are, and feel free to explore more documents in the IPython Shell!

You have access to the dictionary and corpus objects you created in the previous exercise, as well as the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.

- defaultdict allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we ensure that any non-existent key is automatically assigned a default value of 0. This makes it ideal for storing the counts of words in this exercise.
- itertools.chain.from_iterable() allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
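A tiny, self-contained sketch of how these two helpers behave, on toy data rather than the Wikipedia corpus:

from collections import defaultdict
from itertools import chain

# defaultdict(int) returns 0 for keys it has never seen
counts = defaultdict(int)
counts['python'] += 1
print(counts['python'], counts['never-seen'])              # 1 0

# chain.from_iterable() flattens a list of lists into one continuous stream
print(list(chain.from_iterable([[1, 2], [3], [4, 5]])))    # [1, 2, 3, 4, 5]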
from collections import defaultdict
import itertools
# Save the fifth document: doc
doc = corpus[4]
# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)
# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)
# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)
Tf-idf with gensim
- TF-IDF
  - Term Frequency - Inverse Document Frequency
  - Allows you to determine the most important words in each document
  - Each corpus may have shared words beyond just stop words
    - These words should be down-weighted in importance
  - Ensures the most common words don't show up as key words
  - Keeps document-specific frequent words weighted high
- Formula (a worked example with made-up numbers follows this list)
  $$ w_{i, j} = \text{tf}_{i, j} \cdot \log \left( \frac{N}{\text{df}_i} \right) $$
  - $ w_{i, j} =$ tf-idf weight for token $i$ in document $j$
  - $ \text{tf}_{i, j} =$ number of occurrences of token $i$ in document $j$
  - $ \text{df}_{i} =$ number of documents that contain token $i$
  - $ N =$ total number of documents
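As a quick sanity check with made-up numbers: suppose a token appears 5 times in a document, the corpus contains 10 documents, and the token occurs in 2 of them. Taking the natural logarithm:

$$ w_{i, j} = 5 \cdot \log\left(\frac{10}{2}\right) = 5 \cdot \log 5 \approx 8.05 $$

A token that appears in every document gets $\log(N/N) = \log 1 = 0$, so it is down-weighted to nothing no matter how often it occurs. (Gensim's TfidfModel differs in some implementation details, such as the logarithm base and vector normalization, so the exercise output below won't match a hand computation exactly.)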
Tf-idf with Wikipedia
Now it's your turn to determine new significant terms for your corpus by applying gensim's tf-idf. You will again have access to the same corpus and dictionary objects you created in the previous exercises - dictionary, corpus, and doc. Will tf-idf make for more interesting results on the document level?
from gensim.models.tfidfmodel import TfidfModel
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)
# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]
# Print the first five weights
print(tfidf_weights[:5])
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)