Named-entity recognition
This chapter introduces a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use two new libraries, polyglot and spaCy, to add to your NLP toolbox. This is a summary of the lecture "Introduction to Natural Language Processing in Python" on DataCamp.
from pprint import pprint
import matplotlib.pyplot as plt
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
with open('./dataset/news_articles/uber_apple.txt', 'r') as file:
    article = file.read()
from nltk.tokenize import sent_tokenize, word_tokenize
# Tokenize the article into sentences: sentences
sentences = sent_tokenize(article)
# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]
# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]
# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)
# Print any chunk (subtree) that carries the 'NE' (named entity) label
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == 'NE':
            print(chunk)
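Because ne_chunk_sents returns a one-shot generator that the loop above has already consumed, the following is a minimal sketch (re-creating the generator from the same pos_sentences) that joins each chunk's leaves to recover the entity's surface text instead of printing the whole Tree object:

# Re-create the generator, since the previous loop exhausted it
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label') and chunk.label() == 'NE':
            # Each leaf is a (token, POS) pair; join the tokens into one string
            entity_text = ' '.join(token for token, pos in chunk.leaves())
            print(entity_text)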
Charting practice
In this exercise, you'll use some extracted named entities and their groupings from a series of newspaper articles to chart the diversity of named entity types in the articles.
You'll use a defaultdict called ner_categories, with keys representing every named entity group type and values counting the occurrences of each type. You have a chunked sentence list called chunked_sentences, similar to the last exercise, but this time with non-binary category names.
You can use hasattr() to determine whether each chunk has a 'label', and then simply use the chunk's .label() method as the dictionary key.
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=False)
from collections import defaultdict
# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)
# Create the nested for loop
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())
# Create a list of the values: values
values = [ner_categories.get(l) for l in labels]
# Create the pie chart
fig = plt.figure(figsize=(8, 8))
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140);
Introduction to SpaCy
- spaCy
  - NLP library similar to gensim, with different implementations
  - Focus on creating NLP pipelines to generate models and corpora
  - Open source, with extra libraries and tools
    - Displacy, a built-in visualizer for entities and dependency parses (see the sketch after this list)
- Why use spaCy for NER?
  - Easy pipeline creation
  - Different entity types compared to nltk
  - Informal language corpora
    - Easily find entities in Tweets and chat messages
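As a quick illustration of Displacy, here is a minimal, hedged sketch of highlighting entities with spaCy's visualizer. The example sentence is made up, and it assumes the en_core_web_sm model is installed (the download command appears below):

import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Uber is in talks to buy a mapping startup based in San Francisco.')  # hypothetical example sentence

# Render the entity highlights; in a Jupyter notebook this displays inline,
# otherwise it returns the HTML markup as a string
html = displacy.render(doc, style='ent')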
Comparing NLTK with spaCy NER
Using the same text you used in the first exercise of this chapter, you'll now see the results using spaCy's NER annotator. How will they compare?
python -m spacy download en_core_web_sm
import spacy
# Instantiate the English model: nlp
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])  # newer spaCy versions use disable=[...] rather than tagger=False, parser=False
# Create a new document: doc
doc = nlp(article)
# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)
With NER in nltk, we get categories such as ORGANIZATION, GPE, and PERSON. With spaCy, we additionally get categories such as FAC, CARDINAL, LOC, and MONEY.
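To make that comparison concrete, here is a small sketch (reusing the doc object created above) that tallies how often each spaCy entity label appears in the article:

from collections import Counter

# Count occurrences of each entity label in the spaCy doc
label_counts = Counter(ent.label_ for ent in doc.ents)
print(label_counts.most_common())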
French NER with polyglot I
In this exercise and the next, you'll use the polyglot library to identify French entities. The library functions slightly differently than spaCy, so you'll use a few of the new things you learned in the last video to display the named entity text and category.
pip install pyicu
pip install pycld2
pip install morfessor
!polyglot download ner2.fr
!polyglot download embeddings2.fr
with open('./dataset/news_articles/french.txt', 'r') as file:
    article = file.read()
from polyglot.text import Text
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)
# Print each of the entities found
for ent in txt.entities:
    print(ent)

# Print the type of ent
print(type(ent))
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]
# Print entities
pprint(entities)
Spanish NER with polyglot
You'll continue your exploration of polyglot now with some Spanish annotation. This article is not written by a newspaper, so it is your first example of a more blog-like text. How do you think that might compare when finding entities?
Your specific task is to determine how many of the entities contain the words "Márquez" or "Gabo" - these refer to the same person in different ways!
!polyglot download ner2.es embeddings2.es
with open('./dataset/news_articles/spanish.txt', 'r') as file:
    article = file.read()
txt = Text(article)
# Initialize the count variable: count
count = 0
# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Márquez' or 'Gabo'
    if ('Márquez' in ent) or ('Gabo' in ent):
        # Increment count
        count += 1
# Print count
print(count)
# Calculate the percentage of entities that refer to "Gabo": percentage
percentage = count / len(txt.entities)
print(percentage)