from pprint import pprint

Word counts with bag-of-words

  • Bag-of-words
    • Basic method for finding topics in a text
    • Need to first create tokens using tokenization
    • ... and then count up all the tokens
    • The more frequent a word, the more important it might be
    • Can be a great way to determine the significant words in a text
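
As a minimal sketch of the steps above, the snippet below tokenizes a short string and counts the tokens using only the standard library. The sample sentence and the crude regex tokenizer are illustrative assumptions; the exercises that follow use NLTK's word_tokenize instead.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen only for illustration
text = "The dog chased the ball. The ball rolled away."

# Crude tokenizer (an assumption): lowercase, then take runs of word characters
tokens = re.findall(r"\w+", text.lower())

# The bag-of-words is simply a mapping from token to count
bag = Counter(tokens)

# Most frequent tokens first
top_two = bag.most_common(2)
print(top_two)
```

Note that the most frequent token here is a function word, which is why the preprocessing steps later in these notes matter.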

Bag-of-words picker

It's time for a quick check on your understanding of bag-of-words.

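from nltk.tokenize import word_tokenize
from collections import Counter

my_string = "The cat is in the box. The cat box."

# Tokenize, count, and list every token from most to least frequent
pprint(Counter(word_tokenize(my_string)).most_common())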

[('The', 2),
 ('cat', 2),
 ('box', 2),
 ('.', 2),
 ('is', 1),
 ('in', 1),
 ('the', 1)]

Building a Counter with bag-of-words

In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as article. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as article_title. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

import re
from nltk.tokenize import word_tokenize

with open('./dataset/wikipedia_articles/wiki_text_debugging.txt', 'r') as file:
    article = file.read()
    article_title = word_tokenize(article)[2]
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
pprint(bow_simple.most_common(10))
[(',', 151),
 ('the', 150),
 ('.', 89),
 ('of', 81),
 ("''", 66),
 ('to', 63),
 ('a', 60),
 ('``', 47),
 ('in', 44),
 ('and', 41)]

Simple text preprocessing

  • Preprocessing
    • Helps make for better input data
      • When performing machine learning or other statistical methods
    • Examples
      • Tokenization to create a bag of words
      • Lowercasing words
      • Lemmatization / stemming
        • Shorten words to their root stems
      • Removing stop words, punctuation, or unwanted tokens
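
The steps listed above can be sketched with the standard library alone. This is a rough illustration, not the course's solution: the tiny STOP_WORDS set and the preprocess helper are hypothetical, and lemmatization is left out because it needs NLTK's WordNet data.

```python
from collections import Counter

# Hypothetical stop word list, far smaller than a real one
STOP_WORDS = {"the", "is", "in", "a", "and", "of"}

def preprocess(tokens):
    """Lowercase tokens, keep alphabetic ones, and drop stop words."""
    lowered = [t.lower() for t in tokens]
    alpha_only = [t for t in lowered if t.isalpha()]
    return [t for t in alpha_only if t not in STOP_WORDS]

# Tokens as a tokenizer might produce them, punctuation included
raw_tokens = ["The", "cat", "is", "in", "the", "box", ".", "The", "cat", "box", "."]

clean_tokens = preprocess(raw_tokens)
clean_counts = Counter(clean_tokens)
print(clean_counts.most_common())
```

After cleaning, only the content words remain, so the counts point at the topic rather than at punctuation and function words.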

Text preprocessing practice

Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.

Note: Before lemmatizing tokens with NLTK, you must download the wordnet package.
import nltk
nltk.download('wordnet')
[nltk_data] Downloading package wordnet to /home/chanseok/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
with open('./dataset/english_stopwords.txt', 'r') as file:
    # Kept as a single string, so the `in` test below is a substring match
    english_stops = file.read()
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
pprint(bow.most_common(10))
[('debugging', 40),
 ('system', 25),
 ('bug', 17),
 ('software', 16),
 ('problem', 15),
 ('tool', 15),
 ('computer', 14),
 ('process', 13),
 ('term', 13),
 ('debugger', 13)]

Introduction to gensim

  • gensim
    • Popular open-source NLP library
    • Uses top academic models to perform complex tasks
      • Building document or word vectors
      • Performing topic identification and document comparison

Creating and querying a corpus with gensim

It's time to apply the methods you learned in the previous video to create your first gensim dictionary and corpus!

You'll use these data structures to investigate word trends and potential interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These were then stored in a list of document tokens called articles. You'll need to do some light preprocessing and then generate the gensim dictionary and corpus.

import glob

path_list = glob.glob('dataset/wikipedia_articles/*.txt')
articles = []
for article_path in path_list:
    with open(article_path, 'r') as file:
        a = file.read()
    tokens = word_tokenize(a)
    lower_tokens = [t.lower() for t in tokens]
    # Retain alphabetic words: alpha_only
    alpha_only = [t for t in lower_tokens if t.isalpha()]

    # Remove all stop words: no_stops
    no_stops = [t for t in alpha_only if t not in english_stops]

    # Store the cleaned tokens for this document
    articles.append(no_stops)

from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create the bag-of-words corpus with doc2bow: corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])
[(0, 1), (1, 1), (3, 4), (4, 1), (5, 2), (8, 2), (13, 2), (20, 1), (21, 1), (22, 1)]
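
To make the (id, count) pairs above less opaque, here is a hand-rolled sketch of the same idea in plain Python. It is not gensim's implementation, and the ids it assigns will not match gensim's; toy_docs, toy_token2id, and to_bow are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical pre-tokenized documents
toy_docs = [
    ["computer", "software", "bug", "computer"],
    ["software", "engineering", "reverse", "engineering"],
]

# Assign each distinct token an integer id in order of first appearance
toy_token2id = {}
for toy_doc in toy_docs:
    for token in toy_doc:
        toy_token2id.setdefault(token, len(toy_token2id))

def to_bow(tokens):
    """Return a list of (token_id, count) pairs sorted by id."""
    counts = Counter(toy_token2id[t] for t in tokens)
    return sorted(counts.items())

toy_bows = [to_bow(d) for d in toy_docs]
print(toy_token2id)
print(toy_bows)
```

Each document becomes a sparse list: tokens that do not occur in it simply have no entry, which is why the ids in the output above skip numbers.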

Gensim bag-of-words

Now, you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!

You have access to the dictionary and corpus objects you created in the previous exercise, as well as the Python defaultdict and itertools to help with the creation of intermediate data structures for analysis.

  • defaultdict allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we are able to ensure that any non-existent keys are automatically assigned a default value of 0. This makes it ideal for storing the counts of words in this exercise.

  • itertools.chain.from_iterable() allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
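
Both helpers are easier to follow on a tiny input. The sketch below uses a made-up two-document corpus (toy_corpus is a hypothetical name) in the same list-of-lists shape as the gensim corpus.

```python
from collections import defaultdict
import itertools

# Hypothetical corpus: two documents as lists of (word_id, count) pairs
toy_corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]

# Missing keys start at int() == 0, so += works without initialization
totals = defaultdict(int)

# chain.from_iterable flattens the list of lists into one stream of pairs
for word_id, count in itertools.chain.from_iterable(toy_corpus):
    totals[word_id] += count

print(dict(totals))
```

Word id 1 appears in both documents, so its counts are summed across them.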

from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)
engineering 91
reverse 73
software 51
cite 26
computer 22
computer 597
software 450
cite 322
ref 259
code 235

Tf-idf with gensim

  • TF-IDF
    • Term Frequency - Inverse Document Frequency
    • Allows you to determine the most important words in each document
    • Each corpus may have shared words beyond just stop words
    • These words should be down-weighted in importance
    • Ensures most common words don't show up as key words
    • Keeps document-specific frequent words weighted high
  • Formula

$$ w_{i, j} = \text{tf}_{i, j} \cdot \log \left( \frac{N}{\text{df}_i} \right) $$

  • $ w_{i, j}=$ tf-idf for token $i$ in document $j$
  • $ \text{tf}_{i, j} =$ number of occurrences of token $i$ in document $j$
  • $ \text{df}_{i} =$ number of documents that contain token $i$
  • $ N =$ total number of documents
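
The formula can be checked by hand on a tiny corpus. The sketch below is a direct transcription of it, assuming the natural logarithm (the formula does not state a base); toy_docs and tf_idf are hypothetical names, and the numbers will not match gensim's TfidfModel output, which may use a different log base and normalization.

```python
import math

# Hypothetical corpus of three tokenized documents
toy_docs = [
    ["computer", "software", "computer"],
    ["software", "engineering"],
    ["software", "missile"],
]

N = len(toy_docs)  # total number of documents

def tf_idf(token, doc):
    """w_ij = tf_ij * log(N / df_i); the token must occur in at least one document."""
    tf = doc.count(token)                         # occurrences of the token in this document
    df = sum(1 for d in toy_docs if token in d)   # documents containing the token
    return tf * math.log(N / df)

w_computer = tf_idf("computer", toy_docs[0])
w_software = tf_idf("software", toy_docs[0])
print(w_computer, w_software)
```

Because "software" occurs in every document, its weight is log(3/3) = 0, while "computer" occurs twice in a single document and gets 2 * log(3). This is the down-weighting of shared words described above.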

Tf-idf with Wikipedia

Now it's your turn to determine new significant terms for your corpus by applying gensim's tf-idf. You will again have access to the same corpus and dictionary objects you created in the previous exercises - dictionary, corpus, and doc. Will tf-idf make for more interesting results on the document level?

from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])
[(0, 0.005989539479861883), (1, 0.004742182360351364), (3, 0.023958157919447533), (4, 0.005989539479861883), (5, 0.015032362425476509)]
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
reverse 0.5486812285298925
infringement 0.20400655120129182
engineering 0.17910469922476718
interoperability 0.13600436746752786
missile 0.11900382153408688