Dealing with Text Data
Finally, in this chapter, you will work with unstructured text data, understanding ways in which you can engineer columnar features out of a text corpus. You will compare how different approaches may impact how much context is extracted from a text, and how to balance the need for context without creating too many features. This is the summary of the lecture "Feature Engineering for Machine Learning in Python", via datacamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)
Cleaning up your text
Unstructured text data cannot be directly used in most analyses. Multiple steps need to be taken to go from a long free form string to a set of numeric columns in the right format that can be ingested by a machine learning model. The first step of this process is to standardize the data and eliminate any characters that could cause problems later on in your analytic pipeline.
In this chapter you will be working with a new dataset containing the inaugural speeches of the presidents of the United States, loaded as speech_df, with the speeches stored in the text column.
speech_df = pd.read_csv('./dataset/inaugural_speeches.csv')
speech_df['text'].head()
# Replace all non-letter characters with a whitespace
speech_df['text_clean'] = speech_df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)
# Change to lower case
speech_df['text_clean'] = speech_df['text_clean'].str.lower()
# Print the first 5 rows of text_clean column
print(speech_df['text_clean'].head())
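Replacing every non-letter character with a space can leave runs of repeated whitespace in the cleaned text. As an optional extra step (not part of the original exercise), a minimal sketch that collapses those runs into single spaces:
# Collapse runs of whitespace into a single space and trim the ends
# (assumes speech_df['text_clean'] already exists from the step above)
speech_df['text_clean'] = (speech_df['text_clean']
                           .str.replace(r'\s+', ' ', regex=True)
                           .str.strip())
print(speech_df['text_clean'].head())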
High level text features
Once the text has been cleaned and standardized you can begin creating features from the data. The most fundamental information you can calculate about free form text is its size, such as its length and number of words. In this exercise (and the rest of this chapter), you will focus on the cleaned/transformed text column (text_clean) you created in the last exercise.
# Find the length of each text
speech_df['char_cnt'] = speech_df['text_clean'].str.len()
# Count the number of words in each text
speech_df['word_cnt'] = speech_df['text_clean'].str.split().str.len()
# Find the average word length
speech_df['avg_word_length'] = speech_df['char_cnt'] / speech_df['word_cnt']
# Print the first 5 rows of these columns
speech_df[['text_clean', 'char_cnt', 'word_cnt', 'avg_word_length']].head()
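Since matplotlib and seaborn are already imported above, here is a quick, optional sketch (not part of the original exercise) of how you might eyeball one of these new features:
# Visualize how speech length varies across the inaugural addresses
sns.histplot(speech_df['word_cnt'])
plt.xlabel('Words per speech')
plt.show()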
Counting words (I)
Once high level information has been recorded you can begin creating features based on the actual content of each text. One way to do this is to approach it in a similar way to how you worked with categorical variables in the earlier lessons.
- For each unique word in the dataset a column is created.
- For each entry, the number of times this word occurs is counted and the count value is entered into the respective column.
These "count"
columns can then be used to train machine learning models.
from sklearn.feature_extraction.text import CountVectorizer
# Instantiate CountVectorizer
cv = CountVectorizer()
# Fit the vectorizer
cv.fit(speech_df['text_clean'])
# Print feature names
print(cv.get_feature_names_out()[:10])
cv_transformed = cv.transform(speech_df['text_clean'])
# Print the full array
cv_array = cv_transformed.toarray()
print(cv_array)
print(cv_array.shape)
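To connect the raw count array back to actual words, here is a small optional sketch (not part of the original exercise) that looks up the most frequent terms in the first speech:
# Pair each feature name with its count in the first speech,
# then show the ten most frequent terms
first_speech_counts = pd.Series(cv_array[0], index=cv.get_feature_names_out())
print(first_speech_counts.sort_values(ascending=False).head(10))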
Limiting your features
As you have seen, using the CountVectorizer with its default settings creates a feature for every single word in your corpus. This can create far too many features, often including ones that will provide very little analytical value.
For this purpose CountVectorizer has parameters that you can set to reduce the number of features:

- min_df: Use only words that occur in more than this percentage of documents. This can be used to remove outlier words that will not generalize across texts.
- max_df: Use only words that occur in less than this percentage of documents. This is useful to eliminate very common words that occur in every document without adding value, such as "and" or "the".
from sklearn.feature_extraction.text import CountVectorizer
# Specify arguments to limit the number of features generated
cv = CountVectorizer(min_df=0.2, max_df=0.8)
# Fit, transform, and convert into array
cv_transformed = cv.fit_transform(speech_df['text_clean'])
cv_array = cv_transformed.toarray()
# Print the array shape
print(cv_array.shape)
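Note that min_df and max_df accept either a float (a proportion of documents, as above) or an int (an absolute document count). A brief sketch using an absolute count instead, purely for illustration:
# Keep only words that appear in at least 5 different speeches
cv_int = CountVectorizer(min_df=5)
cv_int_array = cv_int.fit_transform(speech_df['text_clean']).toarray()
print(cv_int_array.shape)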
Text to DataFrame
Now that you have generated these count-based features in an array you will need to reformat them so that they can be combined with the rest of the dataset. This can be achieved by converting the array into a pandas DataFrame, with the feature names you found earlier as the column names, and then concatenating it with the original DataFrame.
cv_df = pd.DataFrame(cv_array, columns=cv.get_feature_names_out()).add_prefix('Counts_')
# Add the new columns to the original DataFrame
speech_df_new = pd.concat([speech_df, cv_df], axis=1, sort=False)
speech_df_new.head()
Term frequency-inverse document frequency
- TF-IDF: Term Frequency - Inverse Document Frequency $$ \text{TF-IDF} = \frac{\text{count of word occurrences}}{\text{total words in document}} \times \log\left(\frac{\text{total number of documents}}{\text{number of documents containing the word}}\right) $$
- The IDF term measures how rare a word is across all documents: words that appear in nearly every document are down-weighted, while words that appear in only a few documents are boosted.
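To make the formula concrete, a small hand-worked sketch on a toy example (the numbers are made up for illustration; sklearn's TfidfVectorizer uses a smoothed variant of this formula plus normalization, so its values will differ slightly):
import numpy as np

# Score the word "people" for document 0 of a 4-document corpus
count_in_doc = 3          # "people" appears 3 times in document 0
total_words_in_doc = 100  # document 0 has 100 words
docs_containing_word = 2  # "people" appears in 2 of the 4 documents
total_docs = 4

tf = count_in_doc / total_words_in_doc           # term frequency
idf = np.log(total_docs / docs_containing_word)  # inverse document frequency
print(tf * idf)  # the TF-IDF score for "people" in document 0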
Tf-idf
While counts of occurrences of words can be useful to build models, words that occur many times may skew the results undesirably. To keep these common words from overpowering your model, a form of normalization can be used. In this lesson you will be using Term frequency-inverse document frequency (Tf-idf) as was discussed in the video. Tf-idf has the effect of reducing the value of common words, while increasing the weight of words that do not occur in many documents.
from sklearn.feature_extraction.text import TfidfVectorizer
# Instantiate TfidfVectorizer
tv = TfidfVectorizer(max_features=100, stop_words='english')
# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(speech_df['text_clean'])
# Create a DataFrame with these features
tv_df = pd.DataFrame(tv_transformed.toarray(),
                     columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
tv_df.head()
sample_row = tv_df.iloc[0]
# Print the top 5 words of the sorted output
print(sample_row.sort_values(ascending=False).head())
Transforming unseen data
When creating vectors from text, any transformation that you perform before training a machine learning model also needs to be applied to the new, unseen (test) data. To achieve this, follow the same approach as in the last chapter: fit the vectorizer only on the training data, and apply it to the test data.
For this exercise the speech_df DataFrame has been split in two:

- train_speech_df: The training set consisting of the first 45 speeches.
- test_speech_df: The test set consisting of the remaining speeches.
train_speech_df = speech_df.iloc[:45]
test_speech_df = speech_df.iloc[45:]
tv = TfidfVectorizer(max_features=100, stop_words='english')
# Fit the vectorizer and transform the data
tv_transformed = tv.fit_transform(train_speech_df['text_clean'])
# Transform test data
test_tv_transformed = tv.transform(test_speech_df['text_clean'])
# Create new features for the test set
test_tv_df = pd.DataFrame(test_tv_transformed.toarray(),
                          columns=tv.get_feature_names_out()).add_prefix('TFIDF_')
test_tv_df.head()
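One thing to watch out for (not covered in the original exercise): toarray() drops the original row labels, so test_tv_df gets a fresh 0-based index while test_speech_df keeps indices starting at 45. A small sketch that realigns them before concatenating (test_speech_df_new is just an illustrative name):
# Give the new feature frame the same index as the test speeches
# so a column-wise concat lines the rows up correctly
test_tv_df.index = test_speech_df.index
test_speech_df_new = pd.concat([test_speech_df, test_tv_df], axis=1)
print(test_speech_df_new.shape)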
Using longer n-grams
So far you have created features based on individual words in each of the texts. This can be quite powerful when used in a machine learning model but you may be concerned that by looking at words individually a lot of the context is being ignored. To deal with this when creating models you can use n-grams, which are sequences of n consecutive words grouped together. For example:
- bigrams: Sequences of two consecutive words
- trigrams: Sequences of three consecutive words

These can be automatically created in your dataset by specifying the ngram_range argument as a tuple (n1, n2), where all n-grams in the n1 to n2 range are included.
cv_trigram_vec = CountVectorizer(max_features=100,
stop_words='english',
ngram_range=(3, 3))
# Fit and apply trigram vectorizer
cv_trigram = cv_trigram_vec.fit_transform(speech_df['text_clean'])
# Print the trigram features
cv_trigram_vec.get_feature_names_out()[:10]
cv_tri_df = pd.DataFrame(cv_trigram.toarray(),
                         columns=cv_trigram_vec.get_feature_names_out()).add_prefix('Counts_')
# Print the top 5 trigrams in the sorted output
cv_tri_df.sum().sort_values(ascending=False).head()
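As a final optional sketch (not part of the original exercise), setting ngram_range=(1, 2) keeps both individual words and bigrams in a single vocabulary, which is a common middle ground between capturing context and limiting feature count:
# Combine unigrams and bigrams in one vectorizer
cv_bigram_vec = CountVectorizer(max_features=100,
                                stop_words='english',
                                ngram_range=(1, 2))
cv_bigram = cv_bigram_vec.fit_transform(speech_df['text_clean'])
print(cv_bigram_vec.get_feature_names_out()[:10])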