- Building a bag of words model
- Building a BoW Naive Bayes classifier
- Building n-gram models
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy

plt.rcParams['figure.figsize'] = (8, 8)
In this exercise, you have been provided with a corpus of more than 7000 movie taglines. Your job is to generate the bag of words representation bow_matrix for these taglines. For this exercise, we will ignore the text preprocessing step and generate bow_matrix directly.
movies = pd.read_csv('./dataset/movie_overviews.csv').dropna()
movies['tagline'] = movies['tagline'].str.lower()
movies.head()
| | id | title | overview | tagline |
|---|---|---|---|---|
| 1 | 8844 | Jumanji | When siblings Judy and Peter discover an encha... | roll the dice and unleash the excitement! |
| 2 | 15602 | Grumpier Old Men | A family wedding reignites the ancient feud be... | still yelling. still fighting. still ready for... |
| 3 | 31357 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | friends are the people who let you be yourself... |
| 4 | 11862 | Father of the Bride Part II | Just when George Banks has recovered from his ... | just when his world is back to normal... he's ... |
| 5 | 949 | Heat | Obsessive master thief, Neil McCauley leads a ... | a los angeles crime saga |
corpus = movies['tagline']
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)
You now know how to generate a bag of words representation for a given corpus of documents. Notice that the word vectors created have more than 6600 dimensions. However, most of these dimensions have a value of zero since most words do not occur in a particular tagline.
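To make the sparsity point concrete, here is a minimal pure-Python sketch (not CountVectorizer itself) that builds a count matrix for three shortened taglines and measures how many cells are zero. The names toy_corpus, toy_matrix and sparsity are made up for this illustration, and tokens are split on whitespace only.

```python
# Pure-Python sketch: measure how sparse a bag-of-words matrix is
# on a tiny corpus of shortened taglines (illustrative only).
toy_corpus = [
    "roll the dice and unleash the excitement",
    "a los angeles crime saga",
    "still yelling still fighting",
]

# Vocabulary: every distinct whitespace-separated token, sorted.
vocab = sorted({word for doc in toy_corpus for word in doc.split()})

# One row per document, one column per vocabulary word.
toy_matrix = [[doc.split().count(word) for word in vocab] for doc in toy_corpus]

total_cells = len(toy_matrix) * len(vocab)
zero_cells = sum(row.count(0) for row in toy_matrix)
sparsity = zero_cells / total_cells

print(f"{len(vocab)} features, {sparsity:.0%} of the cells are zero")
```

Even with three short documents most cells are zero, and the proportion grows as the vocabulary grows, which is why scikit-learn returns a sparse matrix from fit_transform.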
In this exercise, you have been provided with lem_corpus, which contains the pre-processed versions of the movie taglines from the previous exercise. In other words, the taglines have been lowercased and lemmatized, and stopwords have been removed.
Your job is to generate the bag of words representation bow_lem_matrix for these lemmatized taglines and compare its shape with that of bow_matrix obtained in the previous exercise.
nlp = spacy.load('en_core_web_sm')
stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Lemmatize each tagline, dropping stopwords and non-alphabetic tokens
lem_corpus = corpus.apply(lambda row: ' '.join([t.lemma_ for t in nlp(row) if t.lemma_ not in stopwords and t.lemma_.isalpha()]))
lem_corpus
1       roll dice unleash excitement
2       yell fight ready love
3       friend people let let forget
4       world normal surprise life
5       los angeles crime saga
        ...
9091    kingsglaive final fantasy xv
9093    happen vegas stay vegas happen
9095    decorate officer devoted family man defend hon...
9097    god incarnate city doom
9098    band know story
Name: tagline, Length: 7033, dtype: object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)
sentences = ['The lion is the king of the jungle', 'Lions have lifespans of a decade', 'The lion is an endangered species']
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(sentences)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
bow_df

Observe that the column names refer to the token whose frequency is being recorded. Therefore, since the first column name is an, the first feature represents the number of times the word 'an' occurs in a particular sentence.
get_feature_names_out() essentially gives us an array which represents the mapping of the feature indices to the feature names in the vocabulary.
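The index-to-name mapping can be reproduced by hand. The sketch below is a pure-Python approximation, not scikit-learn's code: it lowercases the text, keeps tokens of two or more word characters (mirroring CountVectorizer's default token pattern) and sorts the vocabulary alphabetically. The helper tokenize and the names feature_names, index_of and rows are hypothetical.

```python
import re
from collections import Counter

sentences = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters, which
    # mirrors CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary: position in this list == column index in the matrix.
feature_names = sorted({tok for s in sentences for tok in tokenize(s)})
index_of = {name: i for i, name in enumerate(feature_names)}

# Count matrix built by hand, one row per sentence.
rows = []
for s in sentences:
    counts = Counter(tokenize(s))
    rows.append([counts[name] for name in feature_names])

print(feature_names)
print(rows[0][index_of["the"]])  # occurrences of 'the' in the first sentence
```

Note that the single-character word 'a' never becomes a feature under this token pattern, which is why 'an' ends up as the first column.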
In this exercise, you have been given two pandas Series, X_train and X_test, which consist of movie reviews. They represent the training and the test review data respectively. Your task is to preprocess the reviews and generate BoW vectors for these two sets using CountVectorizer.
Once we have generated the BoW vector matrices X_train_bow and X_test_bow, we will be in a very good position to apply a machine learning model to them and conduct sentiment analysis.
movie_reviews = pd.read_csv('./dataset/movie_reviews_clean.csv')
movie_reviews.head()
| | review | sentiment |
|---|---|---|
| 0 | this anime series starts out great interesting... | 0 |
| 1 | some may go for a film like this but i most as... | 0 |
| 2 | i ve seen this piece of perfection during the ... | 1 |
| 3 | this movie is likely the worst movie i ve ever... | 0 |
| 4 | it ll soon be 10 yrs since this movie was rele... | 1 |
X = movie_reviews['review']
y = movie_reviews['sentiment']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)
(750, 14859)
(250, 14859)
You now have a good idea of how to preprocess text and transform it into its bag-of-words representation using CountVectorizer. In this exercise, you have set the lowercase argument to True. However, note that this is the default value of lowercase and passing it explicitly is not necessary. Also, note that both X_train_bow and X_test_bow have 14859 features. There were words present in X_test that were not in X_train. CountVectorizer chose to ignore them in order to ensure that the dimensions of both sets remain the same.
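The fit-on-train, transform-on-test behaviour can be illustrated with a small pure-Python sketch. This is an approximation of the idea, not scikit-learn's implementation, and the helpers fit_vocabulary and transform as well as the toy documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a sorted word -> column index mapping from the training docs."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocabulary):
    """Count only words already in the vocabulary; unseen words are skipped."""
    matrix = []
    for doc in docs:
        counts = Counter(w for w in doc.lower().split() if w in vocabulary)
        row = [0] * len(vocabulary)
        for w, c in counts.items():
            row[vocabulary[w]] = c
        matrix.append(row)
    return matrix

train_docs = ["great movie", "terrible movie"]
test_docs = ["great soundtrack but terrible pacing"]

vocabulary = fit_vocabulary(train_docs)
train_bow = transform(train_docs, vocabulary)
test_bow = transform(test_docs, vocabulary)

# Both matrices have one column per training-vocabulary word.
print(len(train_bow[0]), len(test_bow[0]))
print(test_bow[0])
```

Words such as 'soundtrack' and 'pacing' appear only in the test document, so they contribute nothing to its vector. This is the price of keeping the train and test matrices the same width.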
In the previous exercise, you generated the bag-of-words representations for the training and test movie review data. In this exercise, we will use these representations to train a Naive Bayes classifier that can detect the sentiment of a movie review, and then compute its accuracy. Note that since this is a binary classification problem, the model is only capable of classifying a review as either positive (1) or negative (0). It is incapable of detecting neutral reviews.
from sklearn.naive_bayes import MultinomialNB

# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = 'The movie was terrible. The music was underwhelming and the acting mediocre.'
prediction = clf.predict(vectorizer.transform([review]))

# predict() returns an array with one label per input, so take the first element
print("The sentiment predicted by the classifier is %i" % prediction[0])
The accuracy of the classifier on the test set is 0.836
The sentiment predicted by the classifier is 0
You have successfully performed basic sentiment analysis. Note that the accuracy of the classifier is about 84% (0.836 on this run). Considering that it was trained on only 750 reviews, this is reasonably good performance. The classifier also correctly predicts the sentiment of the short negative review that we passed into it.
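For intuition about what the classifier does with the counts, the following is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is a simplified illustration rather than scikit-learn's MultinomialNB. The functions train_nb and predict_nb and the toy reviews are hypothetical, and words not seen in training are simply skipped at prediction time.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    vocab = {w for doc in docs for w in doc.split()}
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter(labels)       # label -> number of documents
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(docs))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)

toy_reviews = [
    "great acting great plot",
    "wonderful movie",
    "terrible acting",
    "boring plot terrible music",
]
toy_labels = [1, 1, 0, 0]

nb_model = train_nb(toy_reviews, toy_labels)
print(predict_nb(nb_model, "terrible boring movie"))
print(predict_nb(nb_model, "great wonderful plot"))
```

Each word contributes independently to the score, so the model only sees which words occur and how often, never their order.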
- BoW shortcomings
The movie was good and not boring -> positive
The movie was not good and boring -> negative
- Exactly the same BoW representation!
- Context of the words is lost.
- Sentiment dependent on the position of 'not'
- Contiguous sequence of n elements (or words) in a given document.
- Bi-grams / Tri-grams
- n-grams Shortcomings
- Increases the number of dimensions, which leads to the curse of dimensionality
- Higher order n-grams are rare
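The two example sentences above can be checked directly. The sketch below uses a hypothetical helper ngrams that splits on whitespace, a simplification of what CountVectorizer does, to show that their unigram counts are identical while their bigrams differ.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of contiguous n-word sequences in `text`."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

positive = "The movie was good and not boring"
negative = "The movie was not good and boring"

# Unigram (bag-of-words) counts are identical, so BoW cannot tell them apart.
same_bow = Counter(ngrams(positive, 1)) == Counter(ngrams(negative, 1))

# Bigrams keep local word order, so 'not boring' and 'not good' show up.
pos_bigrams = set(ngrams(positive, 2))
neg_bigrams = set(ngrams(negative, 2))

print(same_bow)
print(sorted(pos_bigrams - neg_bigrams))
print(sorted(neg_bigrams - pos_bigrams))
```

The bigrams 'not boring' and 'not good' are exactly the features a classifier would need to separate the two reviews.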
In this exercise, we have been provided with the same corpus of more than 7000 movie taglines. Our job is to generate n-gram models for n up to 1, n up to 2 and n up to 3 for this data and discover the number of features for each model.
We will then compare the number of features generated for each model.
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features (columns) for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))
ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively
You now know how to generate n-gram models containing higher order n-grams. Notice that ng2 has over 37,000 features whereas ng3 has over 76,000 features. This is much greater than the roughly 6,600 dimensions obtained for ng1. As the n-gram range increases, so does the number of features, leading to increased computational costs and a problem known as the curse of dimensionality.
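The same growth can be seen on a toy scale. This pure-Python sketch counts distinct n-grams for ranges up to 1, 2 and 3 on two shortened taglines. The helper ngrams_up_to and the variable names are hypothetical, and tokenization is plain whitespace splitting rather than CountVectorizer's pattern.

```python
def ngrams_up_to(text, max_n):
    """All contiguous word sequences of length 1..max_n in `text`."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]

toy_taglines = [
    "roll the dice and unleash the excitement",
    "a los angeles crime saga",
]

# Number of distinct features for each n-gram range.
feature_counts = {}
for max_n in (1, 2, 3):
    vocab = {g for doc in toy_taglines for g in ngrams_up_to(doc, max_n)}
    feature_counts[max_n] = len(vocab)

print(feature_counts)
```

On a real corpus the growth is steeper than in this toy case, because longer n-grams repeat far less often across documents than single words do.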
ng_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_ng = ng_vectorizer.fit_transform(X_train)
X_test_ng = ng_vectorizer.transform(X_test)
clf_ng = MultinomialNB()

# Fit the classifier
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy
accuracy = clf_ng.score(X_test_ng, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = 'The movie was not good. The plot had several holes and the acting lacked panache'
prediction = clf_ng.predict(ng_vectorizer.transform([review]))

# predict() returns an array with one label per input, so take the first element
print("The sentiment predicted by the classifier is %i" % prediction[0])
The accuracy of the classifier on the test set is 0.824
The sentiment predicted by the classifier is 0
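In this run the classifier scores 0.824, roughly on par with the 0.836 of the BoW version rather than clearly better. Because the train/test split is random, the exact figures vary from run to run. The classifier does succeed at correctly identifying the sentiment of the mini-review as negative.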
You now know how to conduct sentiment analysis by converting text into various n-gram representations and feeding them to a classifier. In this exercise, we will conduct sentiment analysis for the same movie reviews from before using two n-gram models: unigrams and n-grams up to n equal to 3.
We will then compare the performance using three criteria: accuracy of the model on the test set, time taken to execute the program and the number of features created when generating the n-gram representation.
import time

start_time = time.time()

# Split the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(movie_reviews['review'], movie_reviews['sentiment'], test_size=0.5, random_state=42, stratify=movie_reviews['sentiment'])

# Generate n-grams (unigrams only)
vectorizer = CountVectorizer(ngram_range=(1,1))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. " % (time.time() - start_time, clf.score(test_X, test_y)))
print("The ngram representation had %i features." % train_X.shape[1])
The program took 0.127 seconds to complete. The accuracy on the test set is 0.75.
The ngram representation had 12347 features.
start_time = time.time()

# Split the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(movie_reviews['review'], movie_reviews['sentiment'], test_size=0.5, random_state=42, stratify=movie_reviews['sentiment'])

# Generate n-grams up to n=3
vectorizer = CountVectorizer(ngram_range=(1,3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took %.3f seconds to complete. The accuracy on the test set is %.2f. " % (time.time() - start_time, clf.score(test_X, test_y)))
print("The ngram representation had %i features." % train_X.shape[1])
The program took 0.681 seconds to complete. The accuracy on the test set is 0.77.
The ngram representation had 178240 features.
The program took around 0.13 seconds in the case of the unigram model and roughly five times longer (about 0.68 seconds) for the higher order n-gram model. The unigram model had over 12,000 features whereas the n-gram model for up to n=3 had over 178,000! Despite taking more computation time and generating more features, the classifier only performs marginally better in the latter case, producing an accuracy of 77% in comparison to the 75% for the unigram model.