Text preprocessing, POS tagging and NER
In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. Upon mastering these concepts, you will proceed to make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify the people mentioned in a TechCrunch article. This is a summary of the lecture "Feature Engineering for NLP in Python", via DataCamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 8)
Tokenization and Lemmatization
- Text preprocessing techniques (a toy sketch of these string-level steps follows this list)
  - Converting words to lowercase
  - Removing leading and trailing whitespace
  - Removing punctuation
  - Removing stopwords
  - Expanding contractions
- Tokenization
  - the process of splitting a string into its constituent tokens
- Lemmatization
  - the process of converting a word into its lowercased base form, or lemma
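As a toy illustration of the string-level steps above (the contraction map below is invented for this sketch and is not part of the course code):
import string
# Illustrative contraction map; real pipelines use a much larger dictionary
contractions = {"isn't": "is not", "don't": "do not", "can't": "cannot"}
def basic_clean(text):
    # Lowercase and strip leading/trailing whitespace
    text = text.lower().strip()
    # Expand a few contractions
    for short, full in contractions.items():
        text = text.replace(short, full)
    # Remove punctuation characters
    return text.translate(str.maketrans('', '', string.punctuation))
print(basic_clean("  Don't panic!  "))  # do not panic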
with open('./dataset/gettysburg.txt', 'r') as f:
gettysburg = f.read()
import spacy
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')
# create a Doc object
doc = nlp(gettysburg)
# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)
Lemmatizing the Gettysburg Address
In this exercise, we will perform lemmatization on the same Gettysburg Address from before. This time, however, we will also look at the speech before and after lemmatization and assess the kind of changes that take place to make the piece more machine-friendly.
print(gettysburg)
lemmas = [token.lemma_ for token in doc]
# Convert lemmas into a string
print(' '.join(lemmas))
Observe the lemmatized version of the speech. It isn't very readable to humans but it is in a much more convenient format for a machine to process.
with open('./dataset/blog.txt', 'r') as file:
blog = file.read()
stopwords = spacy.lang.en.stop_words.STOP_WORDS
blog = blog.lower()
doc = nlp(blog)
# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))
Take a look at the cleaned text; it is lowercased and devoid of numbers, punctuation, and commonly used stopwords. Also, note that the word U.S. was present in the original text. Since it contained periods, our text cleaning process removed it entirely. This may not be ideal behavior. It is always advisable to use custom functions in place of isalpha() for more nuanced cases.
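One possible approach, sketched here rather than taken from the course, is a small filter that keeps alphabetic lemmas as well as dotted abbreviations such as "u.s.":
import re
def keep_lemma(lemma):
    # Keep alphabetic lemmas as well as dotted abbreviations like "u.s."
    return lemma.isalpha() or re.fullmatch(r'(?:[a-zA-Z]\.)+', lemma) is not None
# Remove stopwords while preserving abbreviations
a_lemmas = [lemma for lemma in lemmas if keep_lemma(lemma) and lemma not in stopwords]
print(' '.join(a_lemmas))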
Cleaning TED talks in a dataframe
In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe ted consisting of 5 TED Talks. Your task is to clean these talks using the techniques discussed earlier by writing a function preprocess and applying it to the transcript feature of the dataframe.
ted = pd.read_csv('./dataset/ted.csv')
ted['transcript'] = ted['transcript'].str.lower()
ted.head()
def preprocess(text):
# Create Doc object
doc = nlp(text, disable=['ner', 'parser'])
# Generate lemmas
lemmas = [token.lemma_ for token in doc]
# Remove stopwords and non-alphabetic characters
a_lemmas = [lemma for lemma in lemmas if lemma.isalpha() and lemma not in stopwords]
return ' '.join(a_lemmas)
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])
You have preprocessed all the TED Talk transcripts contained in ted, and they are now in good shape for operations such as vectorization. You now have a good understanding of how text preprocessing works and why it is important. In the next lessons, we will move on to generating word-level features for our texts.
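As a quick preview of what that could look like (vectorization itself is covered later; the use of scikit-learn's CountVectorizer here is only an illustration, not part of this exercise):
from sklearn.feature_extraction.text import CountVectorizer
# Build a bag-of-words matrix from the preprocessed transcripts
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(ted['transcript'])
print(bow_matrix.shape)  # (number of talks, vocabulary size)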
Part-of-speech tagging
- Part-of-Speech (POS) tagging
  - Assigning every word its corresponding part of speech
- Applications
  - Word-sense disambiguation
    - "The bear is a majestic animal"
    - "Please bear with me"
    - POS tagging makes the distinction by identifying one "bear" as a noun and the other as a verb
  - Sentiment analysis
  - Question answering
  - Fake news and opinion spam detection
- POS annotation in spaCy (see the lookup sketch below)
  - PROPN: proper noun
  - DET: determiner
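If any of these labels is unfamiliar, spacy.explain() returns a short, human-readable description:
# Look up descriptions of spaCy's POS labels
print(spacy.explain('PROPN'))   # proper noun
print(spacy.explain('DET'))     # determiner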
with open('./dataset/lotf.txt', 'r') as file:
lotf = file.read()
doc = nlp(lotf)
# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)
Examine the various POS tags attached to each token and evaluate if they make intuitive sense to you. You will notice that they are indeed labelled correctly according to the standard rules of English grammar.
Counting nouns in a piece of text
In this exercise, we will write two functions, nouns() and proper_nouns(), that count the number of other nouns and proper nouns in a piece of text, respectively. These functions take in a piece of text, generate a list containing the POS tag of each word, and then return the number of proper nouns or other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news.
def proper_nouns(text, model=nlp):
# Create doc object
doc = model(text)
# Generate list of POS tags
pos = [token.pos_ for token in doc]
# Return number of proper nouns
return pos.count('PROPN')
print(proper_nouns('Abdul, Bill and Cathy went to the market to buy apples.', nlp))
def nouns(text, model=nlp):
# create doc object
doc = model(text)
# Generate list of POS tags
pos = [token.pos_ for token in doc]
# Return number of other nouns
return pos.count('NOUN')
print(nouns('Abdul, Bill and Cathy went to the market to buy apples.', nlp))
Noun usage in fake news
In this exercise, you have been given a dataframe headlines that contains news headlines which are either fake or real. Your task is to generate two new features, num_propn and num_noun, that represent the number of proper nouns and other nouns contained in the title feature of headlines.
Next, we will compute the mean number of proper nouns and other nouns used in fake and real news headlines and compare the values. If there is a remarkable difference, there is a good chance that including the num_propn and num_noun features in a fake news detector will improve its performance.
headlines = pd.read_csv('./dataset/fakenews.csv')
headlines.head()
headlines['num_propn'] = headlines['title'].apply(proper_nouns)
headlines['num_noun'] = headlines['title'].apply(nouns)
# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()
# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()
# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively" %
(real_propn, fake_propn))
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively" %
(real_noun, fake_noun))
You now know how to construct features using POS tag information. Notice how the mean number of proper nouns is considerably higher for fake news than it is for real news. The opposite seems to be true in the case of other nouns. This fact can be put to great use in designing fake news detectors.
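As a rough, hypothetical sketch of how such features might feed a detector (not part of this exercise), one could train a simple scikit-learn classifier on the two counts:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Use the two noun-count features to predict the label
X = headlines[['num_propn', 'num_noun']]
y = headlines['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))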
text = 'Sundar Pichai is the CEO of Google. Its headquarter is in Mountain View.'
doc = nlp(text)
# Print all named entities and their labels
for ent in doc.ents:
print(ent.text, ent.label_)
Identifying people mentioned in a news article
In this exercise, you have been given an excerpt from a news article published in TechCrunch. Your task is to write a function find_persons that identifies the names of people mentioned in a particular piece of text. You will then use find_persons to identify the people of interest in the article.
with open('./dataset/tc.txt', 'r') as file:
tc = file.read()
def find_persons(text):
# Create Doc object
doc = nlp(text)
# Identify the persons
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
# Return persons
return persons
print(find_persons(tc))
The article was related to Facebook and our function correctly identified both the people mentioned. You can now see how NER could be used in a variety of applications. Publishers may use a technique like this to classify news articles by the people mentioned in them. A question answering system could also use something like this to answer questions such as 'Who are the people mentioned in this passage?'.
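As a toy illustration of the publisher use case (a sketch only; here the single available article is the TechCrunch excerpt tc), one could build an index from each person to the articles mentioning them:
from collections import defaultdict
def index_by_person(articles):
    # Map each PERSON entity to the indices of the articles that mention them
    index = defaultdict(list)
    for i, article in enumerate(articles):
        for person in set(find_persons(article)):
            index[person].append(i)
    return index
person_index = index_by_person([tc])
print(dict(person_index))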