Introduction

Natural-language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to fruitfully process large amounts of natural language data (wikipedia).

This rapidly improving area of artificial intelligence covers tasks such as speech recognition, natural-language understanding, and natural language generation.

In the following projects, we're going to be building a strong NLP foundation by practicing:

Tokenizing - Splitting sentences and words from the body of text.
Part of Speech tagging
Chunking

This foundation will open the door for machine learning in conjunction with NLP. We will cover:

Machine learning in NLP
How to tie in Scikit-learn (sklearn) with NLTK
Training classifiers with a datasets (Next Project)

Let's dive right in! We are going to be using the Natural Language Toolkit (NLTK) which is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language

Required packages

Mentioned before, it is required to import nltk package. But not only install the package itself, we need to import nltk-related packages. In that case, we just use nltk.download() to download required packages.

import nltk
import sys
import sklearn

nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

True

After that, you may see some GUI form to download some packages. The efficient way to process NLP is selecting "popular" option. Then it will download and install required packages and Corpora.

nltk_download

We need more corporas for this post. Choose the next menu in Corpora, and install following things,

state_union
udhr2
udhr

Notation

Before beginning, Some words will not be familiar with us like corpus, Lexicon, and Token. Corpus is the body of text. It is singular. Corpora is the plural of this. Lexicon is the set of words and its meanings. And Token means each "entity" that is a part of whatever was split up based on specific rules. For example, We can tokenize the word based on stem, or space.

Version check

print('Python: {}'.format(sys.version))
print('NLTK: {}'.format(nltk.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))

Python: 3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
NLTK: 3.4.5
Scikit-learn: 0.22.1

Tokenization

When using Natural Language Processing, our goal is to perform some analysis or processing so that a computer can respond to text appropriately.

The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is going to be filtering out useless data. In natural language processing, useless words (data), are referred to as stop words.

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Hello students, how are you doing today? The olympics are inspiring, and Python is awesome. You look nice today."

If we want to tokenize this sentence,

sent_tokenize(text)

['Hello students, how are you doing today?',
 'The olympics are inspiring, and Python is awesome.',
 'You look nice today.']

You can see 3 sentence by tokenizing. Maybe we guess that it is tokenized by the Capital letter.

Next, if you want the word that containing this sentense,

word_tokenize(text)

['Hello',
 'students',
 ',',
 'how',
 'are',
 'you',
 'doing',
 'today',
 '?',
 'The',
 'olympics',
 'are',
 'inspiring',
 ',',
 'and',
 'Python',
 'is',
 'awesome',
 '.',
 'You',
 'look',
 'nice',
 'today',
 '.']

You can see almost all words in sentence is tokenized, but some specific characters are also contained like ",", "?", ".". These are called puncutation. Of course, these are important to understand the intension of sentence. But we'll cover it later.

Stop words

When using Natural Language Processing, our goal is to perform some analysis or processing so that a computer can respond to text appropriately.

The process of converting data to something a computer can understand is referred to as "pre-processing." One of the major forms of pre-processing is going to be filtering out useless data. In natural language processing, useless words (data), are referred to as stop words. Of course, it would be different in each language, in this post we will use english stop words.

from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 "don't",
 'should',
 "should've",
 'now',
 'd',
 'll',
 'm',
 'o',
 're',
 've',
 'y',
 'ain',
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'ma',
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 'shan',
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

Mentioned before, it maybe important to understand the intension. But most of time, words in stopwords appears multiple times and it will be hard to understand the intension of sentence from computer side. So it is helpful to remove these words in advance. See what is different.

example_sent = "This is some sample text, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sent = [w for w in word_tokens if not w in stop_words]

print(word_tokens)
print(filtered_sent)

['This', 'is', 'some', 'sample', 'text', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'text', ',', 'showing', 'stop', 'words', 'filtration', '.']

Stemming

Stemming, which attempts to normalize sentences, is another preprocessing step that we can perform. In the english language, different variations of words and sentences often having the same meaning. Stemming is a way to account for these variations; furthermore, it will help us shorten the sentences and shorten our lookup. For example, consider the following sentence:

I was taking a ride on my horse.
I was riding my horse.

These sentences mean the same thing, as noted by the same tense (-ing) in each sentence; however, that isn't intuitively understood by the computer. To account for all the variations of words in the english language, we can use the Porter stemmer, which has been around since 1979. You can see the details in this pages.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

example_words = ['ride', 'riding', 'rider', 'rides']

for w in example_words:
    print(ps.stem(w))

ride
ride
rider
ride

Usually, we can apply this in token, so we can analyze which words appears occasionally.

text = "When riders are riding their horses, they often think of how cowboys rode horses."

words = word_tokenize(text)

for w in words:
    print(ps.stem(w))

when
rider
are
ride
their
hors
,
they
often
think
of
how
cowboy
rode
hors
.

Part of Speech Tagging (POS)

Part of speech tagging means labeling words as nouns, verbs, adjectives, etc. Even better, NLTK can handle tenses! While we're at it, we are also going to import a new sentence tokenizer (PunktSentenceTokenizer). This tokenizer is capable of unsupervised learning, so it can be trained on any body of text.

In this section, we will use pre-downloaded "Universal declaration of human rights" (udhr for short).

from nltk.corpus import udhr
print(udhr.raw('English-Latin1'))

Universal Declaration of Human Rights
Preamble
Whereas recognition of the inherent dignity and of the equal and inalienable rights of all members of the human family is the foundation of freedom, justice and peace in the world, 

Whereas disregard and contempt for human rights have resulted in barbarous acts which have outraged the conscience of mankind, and the advent of a world in which human beings shall enjoy freedom of speech and belief and freedom from fear and want has been proclaimed as the highest aspiration of the common people, 

Whereas it is essential, if man is not to be compelled to have recourse, as a last resort, to rebellion against tyranny and oppression, that human rights should be protected by the rule of law, 

Whereas it is essential to promote the development of friendly relations between nations, 

Whereas the peoples of the United Nations have in the Charter reaffirmed their faith in fundamental human rights, in the dignity and worth of the human person and in the equal rights of men and women and have determined to promote social progress and better standards of life in larger freedom, 

Whereas Member States have pledged themselves to achieve, in cooperation with the United Nations, the promotion of universal respect for and observance of human rights and fundamental freedoms, 

Whereas a common understanding of these rights and freedoms is of the greatest importance for the full realization of this pledge, 

Now, therefore, 

The General Assembly, 

Proclaims this Universal Declaration of Human Rights as a common standard of achievement for all peoples and all nations, to the end that every individual and every organ of society, keeping this Declaration constantly in mind, shall strive by teaching and education to promote respect for these rights and freedoms and by progressive measures, national and international, to secure their universal and effective recognition and observance, both among the peoples of Member States themselves and among the peoples of territories under their jurisdiction. 

Article 1 
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. 

Article 2 
Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. 

Furthermore, no distinction shall be made on the basis of the political, jurisdictional or international status of the country or territory to which a person belongs, whether it be independent, trust, non-self-governing or under any other limitation of sovereignty. 

Article 3 
Everyone has the right to life, liberty and security of person. 

Article 4 
No one shall be held in slavery or servitude; slavery and the slave trade shall be prohibited in all their forms. 

Article 5 
No one shall be subjected to torture or to cruel, inhuman or degrading treatment or punishment. 

Article 6 
Everyone has the right to recognition everywhere as a person before the law. 

Article 7 
All are equal before the law and are entitled without any discrimination to equal protection of the law. All are entitled to equal protection against any discrimination in violation of this Declaration and against any incitement to such discrimination. 

Article 8 
Everyone has the right to an effective remedy by the competent national tribunals for acts violating the fundamental rights granted him by the constitution or by law. 

Article 9 
No one shall be subjected to arbitrary arrest, detention or exile. 

Article 10 
Everyone is entitled in full equality to a fair and public hearing by an independent and impartial tribunal, in the determination of his rights and obligations and of any criminal charge against him. 

Article 11 
Everyone charged with a penal offence has the right to be presumed innocent until proved guilty according to law in a public trial at which he has had all the guarantees necessary for his defence. 
No one shall be held guilty of any penal offence on account of any act or omission which did not constitute a penal offence, under national or international law, at the time when it was committed. Nor shall a heavier penalty be imposed than the one that was applicable at the time the penal offence was committed. 
Article 12 
No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks. 

Article 13 
Everyone has the right to freedom of movement and residence within the borders of each State. 
Everyone has the right to leave any country, including his own, and to return to his country. 
Article 14 
Everyone has the right to seek and to enjoy in other countries asylum from persecution. 
This right may not be invoked in the case of prosecutions genuinely arising from non-political crimes or from acts contrary to the purposes and principles of the United Nations. 
Article 15 
Everyone has the right to a nationality. 
No one shall be arbitrarily deprived of his nationality nor denied the right to change his nationality. 
Article 16 
Men and women of full age, without any limitation due to race, nationality or religion, have the right to marry and to found a family. They are entitled to equal rights as to marriage, during marriage and at its dissolution. 
Marriage shall be entered into only with the free and full consent of the intending spouses. 
The family is the natural and fundamental group unit of society and is entitled to protection by society and the State. 
Article 17 
Everyone has the right to own property alone as well as in association with others. 
No one shall be arbitrarily deprived of his property. 
Article 18 
Everyone has the right to freedom of thought, conscience and religion; this right includes freedom to change his religion or belief, and freedom, either alone or in community with others and in public or private, to manifest his religion or belief in teaching, practice, worship and observance. 

Article 19 
Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers. 

Article 20 
Everyone has the right to freedom of peaceful assembly and association. 
No one may be compelled to belong to an association. 
Article 21 
Everyone has the right to take part in the government of his country, directly or through freely chosen representatives. 
Everyone has the right to equal access to public service in his country. 
The will of the people shall be the basis of the authority of government; this will shall be expressed in periodic and genuine elections which shall be by universal and equal suffrage and shall be held by secret vote or by equivalent free voting procedures. 
Article 22 
Everyone, as a member of society, has the right to social security and is entitled to realization, through national effort and international co-operation and in accordance with the organization and resources of each State, of the economic, social and cultural rights indispensable for his dignity and the free development of his personality. 

Article 23 
Everyone has the right to work, to free choice of employment, to just and favourable conditions of work and to protection against unemployment. 
Everyone, without any discrimination, has the right to equal pay for equal work. 
Everyone who works has the right to just and favourable remuneration ensuring for himself and his family an existence worthy of human dignity, and supplemented, if necessary, by other means of social protection. 
Everyone has the right to form and to join trade unions for the protection of his interests. 
Article 24 
Everyone has the right to rest and leisure, including reasonable limitation of working hours and periodic holidays with pay. 

Article 25 
Everyone has the right to a standard of living adequate for the health and well-being of himself and of his family, including food, clothing, housing and medical care and necessary social services, and the right to security in the event of unemployment, sickness, disability, widowhood, old age or other lack of livelihood in circumstances beyond his control. 
Motherhood and childhood are entitled to special care and assistance. All children, whether born in or out of wedlock, shall enjoy the same social protection. 
Article 26 
Everyone has the right to education. Education shall be free, at least in the elementary and fundamental stages. Elementary education shall be compulsory. Technical and professional education shall be made generally available and higher education shall be equally accessible to all on the basis of merit. 
Education shall be directed to the full development of the human personality and to the strengthening of respect for human rights and fundamental freedoms. It shall promote understanding, tolerance and friendship among all nations, racial or religious groups, and shall further the activities of the United Nations for the maintenance of peace. 
Parents have a prior right to choose the kind of education that shall be given to their children. 
Article 27 
Everyone has the right freely to participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits. 
Everyone has the right to the protection of the moral and material interests resulting from any scientific, literary or artistic production of which he is the author. 
Article 28 
Everyone is entitled to a social and international order in which the rights and freedoms set forth in this Declaration can be fully realized. 

Article 29 
Everyone has duties to the community in which alone th

Here, we will also import some corpus examples, - George Bush's 2005 and 2006 state of the union addresses.

from nltk.corpus import state_union

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

print(train_text[:1000])

PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION
 
February 2, 2005


9:10 P.M. EST 

THE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: 

As a new Congress gathers, all of us in the elected branches of government share a great privilege: We've been placed in office by the votes of the people we serve. And tonight that is a privilege we share with newly-elected leaders of Afghanistan, the Palestinian Territories, Ukraine, and a free and sovereign Iraq. (Applause.) 

Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. 

Tonight, with a healthy, growing economy, with more Americans going back to work, with our nation an active force for good in the world -- the state of our union is confident and strong. (Applause.) 

Our generati

Now we have some text, we can train the PunktSentenceTokenizer.

from nltk.tokenize import PunktSentenceTokenizer

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokens = custom_sent_tokenizer.tokenize(sample_text)

tokens

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.',
 '(Applause.)',
 'President George W. Bush reacts to applause during his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 "White House photo by Eric DraperEvery time I'm invited to this rostrum, I'm humbled by the privilege, and mindful of the history we've seen together.",
 'We have gathered under this Capitol dome in moments of national mourning and national achievement.',
 'We have served America through one of the most consequential periods of our history -- and it has been my honor to serve with you.',
 'In a system of two parties, two chambers, and two elected branches, there will always be differences and debate.',
 'But even tough debates can be conducted in a civil tone, and our differences cannot be allowed to harden into anger.',
 'To confront the great issues before us, we must act in a spirit of goodwill and respect for one another -- and I will do my part.',
 'Tonight the state of our Union is strong -- and together we will make it stronger.',
 '(Applause.)',
 'In this decisive year, you and I will make choices that determine both the future and the character of our country.',
 'We will choose to act confidently in pursuing the enemies of freedom -- or retreat from our duties in the hope of an easier life.',
 'We will choose to build our prosperity by leading the world economy -- or shut ourselves off from trade and opportunity.',
 'In a complex and challenging time, the road of isolationism and protectionism may seem broad and inviting -- yet it ends in danger and decline.',
 'The only way to protect our people, the only way to secure the peace, the only way to control our destiny is by our leadership -- so the United States of America will continue to lead.',
 '(Applause.)',
 'Abroad, our nation is committed to an historic, long-term goal -- we seek the end of tyranny in our world.',
 'Some dismiss that goal as misguided idealism.',
 'In reality, the future security of America depends on it.',
 'On September the 11th, 2001, we found that problems originating in a failed and oppressive state 7,000 miles away could bring murder and destruction to our country.',
 'Dictatorships shelter terrorists, and feed resentment and radicalism, and seek weapons of mass destruction.',
 'Democracies replace resentment with hope, respect the rights of their citizens and their neighbors, and join the fight against terror.',
 "Every step toward freedom in the world makes our country safer -- so we will act boldly in freedom's cause.",
 '(Applause.)',
 'Far from being a hopeless dream, the advance of freedom is the great story of our time.',
 'In 1945, there were about two dozen lonely democracies in the world.',
 'Today, there are 122.',
 "And we're writing a new chapter in the story of self-government -- with women lining up to vote in Afghanistan, and millions of Iraqis marking their liberty with purple ink, and men and women from Lebanon to Egypt debating the rights of individuals and the necessity of freedom.",
 'At the start of 2006, more than half the people of our world live in democratic nations.',
 'And we do not forget the other half -- in places like Syria and Burma, Zimbabwe, North Korea, and Iran -- because the demands of justice, and the peace of this world, require their freedom, as well.',
 '(Applause.)',
 'President George W. Bush delivers his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 'White House photo by Eric Draper No one can deny the success of freedom, but some men rage and fight against it.',
 'And one of the main sources of reaction and opposition is radical Islam -- the perversion by a few of a noble faith into an ideology of terror and death.',
 'Terrorists like bin Laden are serious about mass murder -- and all of us must take their declared intentions seriously.',
 'They seek to impose a heartless system of totalitarian control throughout the Middle East, and arm themselves with weapons of mass murder.',
 'Their aim is to seize power in Iraq, and use it as a safe haven to launch attacks against America and the world.',
 'Lacking the military strength to challenge us directly, the terrorists have chosen the weapon of fear.',
 'When they murder children at a school in Beslan, or blow up commuters in London, or behead a bound captive, the terrorists hope these horrors will break our will, allowing the violent to inherit the Earth.',
 'But they have miscalculated: We love our freedom, and we will fight to keep it.',
 '(Applause.)',
 'In a time of testing, we cannot find security by abandoning our commitments and retreating within our borders.',
 'If we were to leave these vicious attackers alone, they would not leave us alone.',
 'They would simply move the battlefield to our own shores.',
 'There is no peace in retreat.',
 'And there is no honor in retreat.',
 'By allowing radical Islam to work its will -- by leaving an assaulted world to fend for itself -- we would signal to all that we no longer believe in our own ideals, or even in our own courage.',
 'But our enemies and our friends can be certain: The United States will not retreat from the world, and we will never surrender to evil.',
 '(Applause.)',
 'America rejects the false comfort of isolationism.',
 'We are the nation that saved liberty in Europe, and liberated death camps, and helped raise up democracies, and faced down an evil empire.',
 'Once again, we accept the call of history to deliver the oppressed and move this world toward peace.',
 'We remain on the offensive against terror networks.',
 'We have killed or captured many of their leaders -- and for the others, their day will come.',
 'President George W. Bush greets members of Congress after his State of the Union Address at the Capitol, Tuesday, Jan.',
 '31, 2006.',
 'White House photo by Eric Draper We remain on the offensive in Afghanistan, where a fine President and a National Assembly are fighting terror while building the institutions of a new democracy.',
 "We're on the offensive in Iraq, with a clear plan for victory.",
 "First, we're helping Iraqis build an inclusive government, so that old resentments will be eased and the insurgency will be marginalized.",
 "Second, we're continuing reconstruction efforts, and helping the Iraqi government to fight corruption and build a modern economy, so all Iraqis can experience the benefits of freedom.",
 "And, third, we're striking terrorist targets while we train Iraqi forces that are increasingly capable of defeating the enemy.",
 'Iraqis are showing their courage every day, and we are proud to be their allies in the cause of freedom.',
 '(Applause.)',
 'Our work in Iraq is difficult because our enemy is brutal.',
 'But that brutality has not stopped the dramatic progress of a new democracy.',
 'In less than three years, the nation has gone from dictatorship to liberation, to sovereignty, to a constitution, to national elections.',
 'At the same time, our coalition has been relentless in shutting off terrorist infiltration, clearing out insurgent strongholds, and turning over territory to Iraqi security forces.',
 'I am confident in our plan for victory; I am confident in the will of the Iraqi people; I am confident in the skill and spirit of our military.',
 'Fellow citizens, we are in this fight to win, and we are winning.',
 '(Applause.)',
 'The road of victory is the road that will take our troops home.',
 'As we make progress on the ground, and Iraqi forces increasingly take the lead, we should be able to further decrease our troop levels -- but those decisions will be made by our military commanders, not by politicians in Washington, D.C.',
 '(Applause.)',
 'Our coalition has learned from our experience in Iraq.',
 "We've adjusted our military tactics and changed our approach to reconstruction.",
 'Along the way, we have benefitted from responsible criticism and counsel offered by members of Congress of both parties.',
 'In the coming year, I will continue to reach out and seek your good advice.',
 'Yet, there is a difference between responsible criticism that aims for success, and defeatism that refuses to acknowledge anything but failure.',
 '(Applause.)',
 'Hindsight alone is not wisdom, and second-guessing is not a strategy.',
 '(Applause.)',
 'With so much in the balance, those of us in public office have a duty to speak with candor.',
 'A sudden withdrawal of our forces from Iraq would abandon our Iraqi allies to death and prison, would put men like bin Laden and Zarqawi in charge of a strategic country, and show that a pledge from America means little.',
 'Members of Congress, however we feel about the decisions and debates of the past, our nation has only one option: We must keep our word, defeat our enemies, and stand behind the American military in this vital mission.',
 '(Applause.)',
 'Laura Bush is applauded as she is introduced Tuesday evening, Jan.',
 '31, 2006 during the State of the Union Address at United States Capitol in Washington.',
 'White House photo by Eric Draper Our men and women in uniform are making sacrifices -- and showing a sense of duty stronger than all fear.',
 "They know what it's like to fight house to house in a maze of streets, to wear heavy gear in the desert heat, to see a comrade killed by a roadside bomb.",
 'And those who know the costs also know the stakes.',
 'Marine Staff Sergeant Dan Clay was killed last month fighting in Fallujah.',
 'He left behind a letter to his family, but his words could just as well be addressed to every American.',
 'Here is what Dan wrote: "I know what honor is.',
 '... It has been an honor to protect and serve all of you.',
 'I faced death with the secure knowledge that you would not have to....',
 'Never falter!',
 'Don\'t hesitate to honor and support those of us who have the honor of protecting that which is worth protecting."',
 "Staff Sergeant Dan Clay's wife, Lisa, and his mom and dad, Sara Jo and Bud, are with us this evening.",
 'Welcome.',
 '(Applause.)',
 'Our nation is grateful to the fallen, who live in the memory of our country.',
 "We're grateful to all who volunteer to wear our nation's uniform -- and as we honor our brave troops, let us never forget the sacrifices of America's military families.",
 '(Applause.)',
 'Our offensive against terror involves more than military action.',
 'Ultimately, the only way to defeat the terrorists is to defeat their dark vision of hatred and fear by offering the hopeful alternative of political freedom and peaceful change.',
 'So the United States of America supports democratic reform across the broader Middle East.',
 'Elections are vital, but they are only the beginning.',
 'Raising up a democracy requires the rule of law, and protection of minorities, and strong, accountable institutions that last longer than a single vote.',
 'The great people of Egypt have voted in a multi-party presidential election -- and now their government should open paths of peaceful opposition that will reduce the appeal of radicalism.',
 'The Palestinian people have voted in elections.',
 'And now the leaders of Hamas must recognize Israel, disarm, reject terrorism, and work for lasting peace.',
 '(Applause.)',
 'Saudi Arabia has taken the first steps of reform -- now it can offer its people a better future by pressing forward with those efforts.',
 'Democracies in the Middle East will not look like our own, because they will reflect the traditions of their own citizens.',
 'Yet liberty is the future of every nation in the Middle East, because liberty is the right and hope of all humanity.',
 '(Applause.)',
 'President George W. Bush waves toward the upper visitors gallery of the House Chamber following his State of the Union remarks Tuesday, Jan.',
 '31, 2006 at the United States Capitol.',
 'White House photo by Eric Draper The same is true of Iran, a nation now held hostage by a small clerical elite that is isolating and repressing its people.',
 'The regime in that country sponsors terrorists in the Palestinian territories and in Lebanon -- and that must come to an end.',
 '(Applause.)',
 'The Iranian government is defying the world with its nuclear ambitions, and the nations of the world must not permit the Iranian regime to gain nuclear weapons.',
 '(Applause.)',
 'America will continue to rally the world to confront these threats.',
 'Tonight, let me speak directly to the citizens of Iran: America respects you, and we respect your country.',
 'We respect your right to choose your own future and win your own freedom.',
 'And our nation hopes one day to be the closest of friends with a free and democratic Iran.',
 '(Applause.)',
 'To overcome dangers in our world, we must also take the offensive by encouraging economic progress, and fighting disease, and spreading hope in hopeless lands.',
 'Isolationism would not only tie our hands in fighting enemies, it would keep us from helping our friends in desperate need.',
 'We show compassion abroad because Americans believe in the God-given dignity and worth of a villager with HIV/AIDS, or an infant with malaria, or a refugee fleeing genocide, or a young girl sold into slavery.',
 'We also show compassion abroad because regions overwhelmed by poverty, corruption, and despair are sources of terrorism, and organized crime, and human trafficking, and the drug trade.',
 'In recent years, you and I have taken unprecedented action to fight AIDS and malaria, expand the education of girls, and reward developing nations that are moving forward with economic and political reform.',
 'For people everywhere, the United States is a partner for a better life.',
 'Short-changing these efforts would increase the suffering and chaos of our world, undercut our long-term security, and dull the conscience of our country.',
 'I urge members of Congress to serve the interests of America by showing the compassion of America.',
 'Our country must also remain on the offensive against terrorism here at home.',
 'The enemy has not lost the desire or capability to attack us.',
 'Fortunately, this nation has superb professionals in law enforcement, intelligence, the military, and homeland security.',
 'These men and women are dedicating their lives, protecting us all, and they deserve our support and our thanks.',
 '(Applause.)',
 'They also deserve the same tools they already use to fight drug trafficking and organized crime -- so I ask you to reauthorize the Patriot Act.',
 '(Applause.)',
 'It is said that prior to the attacks of September the 11th, our government failed to connect the dots of the conspiracy.',
 'We now know that two of the hijackers in the United States placed telephone calls to al Qaeda operatives overseas.',
 'But we did not know about their plans until it was too late.',
 'So to prevent another attack -- based on authority given to me by the Constitution and by statute -- I have authorized a terrorist surveillance program to aggressively pursue the international communications of suspected al Qaeda operatives and affiliates to and from America.',
 'Previous Presidents have used the same constitutional authority I have, and federal courts have approved the use of that authority.',
 'Appropriate members of Congress have been kept informed.',
 'The terrorist surveillance program has helped prevent terrorist attacks.',
 'It remains essential to the security of America.',
 'If there are people inside our country who are talking with al Qaeda, we want to know about it, because we will not sit back and wait to be hit again.',
 '(Applause.)',
 'In all these areas -- from the disruption of terror networks, to victory in Iraq, to the spread of freedom and hope in troubled regions -- we need the support of our friends and allies.',
 'To draw that support, we must always be clear in our principles and willing to act.',
 'The only alternative to American leadership is a dramatically more dangerous and anxious world.',
 'Yet we also choose to lead because it is a privilege to serve the values that gave us birth.',
 'American leaders -- from Roosevelt to Truman to Kennedy to Reagan -- rejected isolation and retreat, because they knew that America is always more secure when freedom is on the march.',
 'Our own generation is in a long war against a determined enemy -- a war that will be fought by Presidents of both parties, who will need steady bipartisan support from the Congress.',
 'And tonight I ask for yours.',
 'Together, let us protect our country, support the men and women who defend us, and lead this world toward freedom.',
 '(Applause.)',
 'Here at home, America also has a great opportunity: We will build the prosperity of our country by strengthening our economic leadership in the world.',
 'Our economy is healthy and vigorous, and growing faster than other major industrialized nations.',
 'In the last two-and-a-half years, America has created 4.6 million new jobs -- more than Japan and the European Union combined.',
 '(Applause.)',
 'Even in the face of higher energy prices and natural disasters, the American people have turned in an economic performance that is the envy of the world.',
 'The American economy is preeminent, but we cannot afford to be complacent.',
 "In a dynamic world economy, we are seeing new competitors, like China and India, and this creates uncertainty, which makes it easier to feed people's fears.",
 "So we're seeing some old temptations return.",
 'Protectionists want to escape competition, pretending that we can keep our high standard of living while walling off our economy.',
 'Others say that the government needs to take a larger role in directing the economy, centralizing more power in Washington and increasing taxes.',
 'We hear claims that immigrants are somehow bad for the economy -- even though this economy could not function without them.',
 '(Applause.)',
 'All these are forms of economic retreat, and they lead in the same direction -- toward a stagnant and second-rate economy.',
 'Tonight I will set out a better path: an agenda for a nation that competes with confidence; an agenda that will raise standards of living and generate new jobs.',
 'Americans should not fear our economic future, because we intend to shape it.',
 'Keeping America competitive begins with keeping our economy growing.',
 'And our economy grows when Americans have more of their own money to spend, save, and invest.',
 'In the last five years, the tax relief you passed has left $880 billion in the hands of American workers, investors, small businesses, and families -- and they have used it to help produce more than four years of uninterrupted economic growth.',
 '(Applause.)',
 'Yet the tax relief is set to expire in the next few years.',
 'If we do nothing, American families will face a massive tax increase they do not expect and will not welcome.',
 'Because America needs more than a temporary expansion, we need more than temporary tax relief.',
 'I urge the Congress to act responsibly, and make the tax cuts permanent.',
 '(Applause.)',
 'Keeping America competitive requires us to be good stewards of tax dollars.',
 "Every year of my presidency, we've reduced the growth of non-security discretionary spending, and last year you passed bills that cut this spending.",
 'This year my budget will cut it again, and reduce or eliminate more than 140 programs that are performing poorly or not fulfilling essential priorities.',
 'By passing these reforms, we will save the American taxpayer another $14 billion next year, and stay on track to cut the deficit in half by 2009.',
 '(Applause.)',
 'I am pleased that members of Congress are working on earmark reform, because the federal budget has too many special interest projects.',
 '(Applause.)',
 'And we can tackle this problem together, if you pass the line-item veto.',
 '(Applause.)',
 'We must also confront the larger challenge of mandatory spending, or entitlements.',
 "This year, the first of about 78 million baby boomers turn 60, including two of my Dad's favorite people -- me and President Clinton.",
 '(Laughter.)',
 'This milestone is more than a personal crisis -- (laughter) -- it is a national challenge.',
 'The retirement of the baby boom generation will put unprecedented strains on the federal government.',
 'By 2030, spending for Social Security, Medicare and Medicaid alone will be almost 60 percent of the entire federal budget.',
 'And that will present future Congresses with impossible choices -- staggering tax increases, immense deficits, or deep cuts in every category of spending.',
 'Congress did not act last year on my proposal to save Social Security -- (applause) -- yet the rising cost of entitlements is a problem that is not going away.',
 '(Applause.)',
 'And every year we fail to act, the situation gets worse.',
 'So tonight, I ask you to join me in creating a commission to examine the full impact of baby boom retirements on Social Security, Medicare, and Medicaid.',
 'This commission should include members of Congress of both parties, and offer bipartisan solutions.',
 'We need to put aside partisan politics and work together and get this problem solved.',
 '(Applause.)',
 'Keeping America competitive requires us to open more markets for all that Americans make and grow.',
 'One out of every five factory jobs in America is related to global trade, and we want people everywhere to buy American.',
 'With open markets and a level playing field, no one can out-produce or out-compete the American worker.',
 '(Applause.)',
 'Keeping America competitive requires an immigration system that upholds our laws, reflects our values, and serves the interests of our economy.',
 'Our nation needs orderly and secure borders.',
 '(Applause.)',
 'To meet this goal, we must have stronger immigration enforcement and border protection.',
 '(Applause.)',
 'And we must have a rational, humane guest worker program that rejects amnesty, allows temporary jobs for people who seek them legally, and reduces smuggling and crime at the border.',
 '(Applause.)',
 'Keeping America competitive requires affordable health care.',
 '(Applause.)',
 'Our government has a responsibility to provide health care for the poor and the elderly, and we are meeting that responsibility.',
 '(Applause.)',
 'For all Americans -- for all Americans, we must confront the rising cost of care, strengthen the doctor-patient relationship, and help people afford the insurance coverage they need.',
 '(Applause.)',
 'We will make wider use of electronic records and other health information technology, to help control costs and reduce dangerous medical errors.',
 'We will strengthen health savings accounts -- making sure individuals and small business employees can buy insurance with the same advantages that people working for big businesses now get.',
 '(Applause.)',
 'We will do more to make this coverage portable, so workers can switch jobs without having to worry about losing their health insurance.',
 '(Applause.)',
 'And because lawsuits are driving many good doctors out of practice -- leaving women in nearly 1,500 American counties without a single OB/GYN -- I ask the Congress to pass medical liability reform this year.',
 '(Applause.)',
 'Keeping America competitive requires affordable energy.',
 'And here we have a serious problem: America is addicted to oil, which is often imported from unstable parts of the world.',
 'The best way to break this addiction is through technology.',
 'Since 2001, we have spent nearly $10 billion to develop cleaner, cheaper, and more reliable alternative energy sources -- and we are on the threshold of incredible advances.',
 'So tonight, I announce the Advanced Energy Initiative -- a 22-percent increase in clean-energy research -- at the Department of Energy, to push for breakthroughs in two vital areas.',
 'To change how we power our homes and offices, we will invest more in zero-emission coal-fired plants, revolutionary solar and wind technologies, and clean, safe nuclear energy.',
 '(Applause.)',
 'We must also change how we power our automobiles.',
 'We will increase our research in better batteries for hybrid and electric cars, and in pollution-free cars that run on hydrogen.',
 "We'll also fund additional research in cutting-edge methods of producing ethanol, not just from corn, but from wood chips and stalks, or switch grass.",
 'Our goal is to make this new kind of ethanol practical and competitive within six years.',
 '(Applause.)',
 'Breakthroughs on this and other new technologies will help us reach another great goal: to replace more than 75 percent of our oil imports from the Middle East by 2025.',
 '(Applause.)',
 'By applying the talent and technology of America, this country can dramatically improve our environment, move beyond a petroleum-based economy, and make our dependence on Middle Eastern oil a thing of the past.',
 '(Applause.)',
 'And to keep America competitive, one commitment is necessary above all: We must continue to lead the world in human talent and creativity.',
 "Our greatest advantage in the world has always been our educated, hardworking, ambitious people -- and we're going to keep that edge.",
 "Tonight I announce an American Competitiveness Initiative, to encourage innovation throughout our economy, and to give our nation's children a firm grounding in math and science.",
 '(Applause.)',
 'First, I propose to double the federal commitment to the most critical basic research programs in the physical sciences over the next 10 years.',
 "This funding will support the work of America's most creative minds as they explore promising areas such as nanotechnology, supercomputing, and alternative energy sources.",
 'Second, I propose to make permanent the research and development tax credit -- (applause) -- to encourage bolder private-sector initiatives in technology.',
 'With more research in both the public and private sectors, we will improve our quality of life -- and ensure that America will lead the world in opportunity and innovation for decades to come.',
 '(Applause.)',
 'Third, we need to encourage children to take more math and science, and to make sure those courses are rigorous enough to compete with other nations.',
 "We've made a good start in the early grades with the No Child Left Behind Act, which is raising standards and lifting test scores across our country.",
 'Tonight I propose to train 70,000 high school teachers to lead advanced-placement courses in math and science, bring 30,000 math and science professionals to teach in classrooms, and give early help to students who struggle with math, so they have a better chance at good, high-wage jobs.',
 "If we ensure that America's children succeed in life, they will ensure that America succeeds in the world.",
 '(Applause.)',
 'Preparing our nation to compete in the world is a goal that all of us can share.',
 'I urge you to support the American Competitiveness Initiative, and together we will show the world what the American people can achieve.',
 'America is a great force for freedom and prosperity.',
 'Yet our greatness is not measured in power or luxuries, but by who we are and how we treat one another.',
 'So we strive to be a compassionate, decent, hopeful society.',
 'In recent years, America has become a more hopeful nation.',
 'Violent crime rates have fallen to their lowest levels since the 1970s.',
 'Welfare cases have dropped by more than half over the past decade.',
 'Drug use among youth is down 19 percent since 2001.',
 'There are fewer abortions in America than at any point in the last three decades, and the number of children born to teenage mothers has been falling for a dozen years in a row.',
 '(Applause.)',
 'These gains are evidence of a quiet transformation -- a revolution of conscience, in which a rising generation is finding that a life of personal responsibility is a life of fulfillment.',
 'Government has played a role.',
 'Wise policies, such as welfare reform and drug education and support for abstinence and adoption have made a difference in the character of our country.',
 'And everyone here tonight, Democrat and Republican, has a right to be proud of this record.',
 '(Applause.)',
 'Yet many Americans, especially parents, still have deep concerns about the direction of our culture, and the health of our most basic institutions.',
 "They're concerned about unethical conduct by public officials, and discouraged by activist courts that try to redefine marriage.",
 'They worry about children in our society who need direction and love, and about fellow citizens still displaced by natural disaster, and about suffering caused by treatable diseases.',
 'As we look at these challenges, we must never give in to the belief that America is in decline, or that our culture is doomed to unravel.',
 'The American people know better than that.',
 'We have proven the pessimists wrong before -- and we will do it again.',
 '(Applause.)',
 'A hopeful society depends on courts that deliver equal justice under the law.',
 'The Supreme Court now has two superb new members -- new members on its bench: Chief Justice John Roberts and Justice Sam Alito.',
 '(Applause.)',
 'I thank the Senate for confirming both of them.',
 'I will continue to nominate men and women who understand that judges must be servants of the law, and not legislate from the bench.',
 '(Applause.)',
 'Today marks the official retirement of a very special American.',
 "For 24 years of faithful service to our nation, the United States is grateful to Justice Sandra Day O'Connor.",
 '(Applause.)',
 'A hopeful society has institutions of science and medicine that do not cut ethical corners, and that recognize the matchless value of every life.',
 'Tonight I ask you to pass legislation to prohibit the most egregious abuses of medical research: human cloning in all its forms, creating or implanting embryos for experiments, creating human-animal hybrids, and buying, selling, or patenting human embryos.',
 'Human life is a gift from our Creator -- and that gift should never be discarded, devalued or put up for sale.',
 '(Applause.)',
 'A hopeful society expects elected officials to uphold the public trust.',
 '(Applause.)',
 'Honorable people in both parties are working on reforms to strengthen the ethical standards of Washington -- I support your efforts.',
 'Each of us has made a pledge to be worthy of public responsibility -- and that is a pledge we must never forget, never dismiss, and never betray.',
 '(Applause.)',
 'As we renew the promise of our institutions, let us also show the character of America in our compassion and care for one another.',
 'A hopeful society gives special attention to children who lack direction and love.',
 "Through the Helping America's Youth Initiative, we are encouraging caring adults to get involved in the life of a child -- and this good work is being led by our First Lady, Laura Bush.",
 '(Applause.)',
 "This year we will add resources to encourage young people to stay in school, so more of America's youth can raise their sights and achieve their dreams.",
 "A hopeful society comes to the aid of fellow citizens in times of suffering and emergency -- and stays at it until they're back on their feet.",
 'So far the federal government has committed $85 billion to the people of the Gulf Coast and New Orleans.',
 "We're removing debris and repairing highways and rebuilding stronger levees.",
 "We're providing business loans and housing assistance.",
 'Yet as we meet these immediate needs, we must also address deeper challenges that existed before the storm arrived.',
 'In New Orleans and in other places, many of our fellow citizens have felt excluded from the promise of our country.',
 'The answer is not only temporary relief, but schools that teach every child, and job skills that bring upward mobility, and more opportunities to own a home and start a business.',
 'As we recover from a disaster, let us also work for the day when all Americans are protected by justice, equal in hope, and rich in opportunity.',
 '(Applause.)',
 'A hopeful society acts boldly to fight diseases like HIV/AIDS, which can be prevented, and treated, and defeated.',
 'More than a million Americans live with HIV, and half of all AIDS cases occur among African Americans.',
 'I ask Congress to reform and reauthorize the Ryan White Act, and provide new funding to states, so we end the waiting lists for AIDS medicines in America.',
 '(Applause.)',
 'We will also lead a nationwide effort, working closely with African American churches and faith-based groups, to deliver rapid HIV tests to millions, end the stigma of AIDS, and come closer to the day when there are no new infections in America.',
 '(Applause.)',
 "Fellow citizens, we've been called to leadership in a period of consequence.",
 "We've entered a great ideological conflict we did nothing to invite.",
 'We see great changes in science and commerce that will influence all our lives.',
 'Sometimes it can seem that history is turning in a wide arc, toward an unknown shore.',
 'Yet the destination of history is determined by human action, and every great movement of history comes to a point of choosing.',
 'Lincoln could have accepted peace at the cost of disunity and continued slavery.',
 'Martin Luther King could have stopped at Birmingham or at Selma, and achieved only half a victory over segregation.',
 'The United States could have accepted the permanent division of Europe, and been complicit in the oppression of others.',
 'Today, having come far in our own historical journey, we must decide: Will we turn back, or finish well?',
 'Before history is written down in books, it is written in courage.',
 'Like Americans before us, we will show that courage and we will finish well.',
 "We will lead freedom's advance.",
 'We will compete and excel in the global economy.',
 'We will renew the defining moral commitments of this land.',
 'And so we move forward -- optimistic about our country, faithful to its cause, and confident of the victories to come.',
 'May God bless America.',
 '(Applause.)']

After that, we can tokenize each word in sentence. Now we need to tag each words with part of speech (also known as POS)

def process_content():
    try:
        for t in tokens[:5]:
            words = nltk.word_tokenize(t)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))
        
process_content()

[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]
[('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NN'), (',', ','), ('distinguished', 'JJ'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'VB'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBD'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')]
[('Tonight', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'JJ'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'JJ'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')]
[('(', '('), ('Applause', 'NNP'), ('.', '.'), (')', ')')]
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]

Here, you can see tags are added after each words. Each tag means:

POS: Possesive ending
NNP: Proper noun, singular
IN: Preposition or subordinating conjuntion
NN: Noun, sigular or mass
RB : Adverb
VBP: Verb, non-3rd person sigular present
...

(The detailed list are found in here)

Or you can download the tagsets from nltk.download(). (All packages -> tagsets)

nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``

Chunking

Now that each word has been tagged with a part of speech, we can move onto chunking, meaning that grouping the words into meaningful clusters. The main goal of chunking is to group words into "noun phrases", which is a noun with any associated verbs, adjectives, or adverbs.

The part of speech tags that were generated in the previous step will be combined with regular expressions, such as the following:

$+$ = match 1 or more
$?$ = match 0 or 1 repetitions.
$*$ = match 0 or MORE repetitions
$.$ = Any character except a new line

def process_content():
    try:
        for i in tokens[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # draw the chunks with nltk
            chunked.draw()     

    except Exception as e:
        print(str(e))

We build the chuck rule as follows:

$<\text{RB}.?>*$ = "0 or more of any tense of adverb," followed by:

$<\text{VB}.?>*$ = "0 or more of any tense of verb," followed by:

$<\text{NNP}>+$ = "One or more proper nouns," followed by

$<\text{NN}>?$ = "zero or one singular noun."

See what's going on.

process_content()

Maybe you can this kind of tree diagrams:

nltk_draw

This diagram shows the hierarchical relationship between words and which words are grouping with some tokens.

Or we can print it inline, not showing in GUI.

def process_content():
    try:
        for i in tokens[:10]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # combine the part-of-speech tag with a regular expression
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print the nltk tree
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

    except Exception as e:
        print(str(e))

process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk THE/NNP UNION/NNP January/NNP)
(Chunk THE/NNP PRESIDENT/NNP)
(Chunk Thank/NNP)
(Chunk Mr./NNP Speaker/NNP)
(Chunk Vice/NNP President/NNP Cheney/NNP)
(Chunk Congress/NNP)
(Chunk Supreme/NNP Court/NNP)
(Chunk called/VBD America/NNP)
(Chunk Coretta/NNP Scott/NNP King/NNP)
(Chunk Applause/NNP)
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
(Chunk State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP)
(Chunk Tuesday/NNP)
(Chunk Jan/NNP)
(Chunk White/NNP House/NNP photo/NN)
(Chunk Eric/NNP DraperEvery/NNP time/NN)
(Chunk Capitol/NNP dome/NN)
(Chunk have/VBP served/VBN America/NNP)

Chinking

Another process in NLP is chinking that remove the chunks that we don't want to use.

def process_content():
    try:
        for i in tokens[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            
            # The main difference here is the }{, vs. the {}. This means we're removing 
            # from the chink one or more verbs, prepositions, determiners, or the word 'to'.
            chunkGram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            # print(chunked)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            # chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP 'S/POS ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
(Chunk
  THE/NNP
  UNION/NNP
  January/NNP
  31/CD
  ,/,
  2006/CD
  THE/NNP
  PRESIDENT/NNP
  :/:
  Thank/NNP
  you/PRP)
(Chunk ./.)
(Chunk
  Mr./NNP
  Speaker/NNP
  ,/,
  Vice/NNP
  President/NNP
  Cheney/NNP
  ,/,
  members/NNS)
(Chunk Congress/NNP ,/, members/NNS)
(Chunk
  Supreme/NNP
  Court/NNP
  and/CC
  diplomatic/JJ
  corps/NN
  ,/,
  distinguished/JJ
  guests/NNS
  ,/,
  and/CC
  fellow/JJ
  citizens/NNS
  :/:)
(Chunk our/PRP$ nation/NN)
(Chunk ,/, graceful/JJ ,/, courageous/JJ woman/NN who/WP)
(Chunk America/NNP)
(Chunk its/PRP$ founding/NN ideals/NNS and/CC)
(Chunk noble/JJ dream/NN ./.)
(Chunk Tonight/NN we/PRP)
(Chunk hope/NN)
(Chunk glad/JJ reunion/NN)
(Chunk husband/NN who/WP)
(Chunk so/RB long/RB ago/RB ,/, and/CC we/PRP)
(Chunk grateful/JJ)
(Chunk good/JJ life/NN)
(Chunk Coretta/NNP Scott/NNP King/NNP ./.)
(Chunk (/( Applause/NNP ./. )/))
(Chunk President/NNP George/NNP W./NNP Bush/NNP)
(Chunk his/PRP$ State/NNP)
(Chunk Union/NNP Address/NNP)
(Chunk Capitol/NNP ,/, Tuesday/NNP ,/, Jan/NNP ./.)

For summary, using modified regular expressions, we can define chunk patterns. These are patterns of part-of-speech tags that define what kinds of words make up a chunk. We can also define patterns for what kinds of words should not be in a chunk. These unchunked words are known as chinks.

Named Entity Recognition (NER)

One of the most common forms of chunking in natural language processing is called Named Entity Recognition (NER for short). NLTK is able to identify people, places, things, locations, monetary figures, and more.

There are two major options with NLTK's named entity recognition: either recognize all named entities, or recognize named entities as their respective type, like people, places, locations, etc.

def process_content():
    try:
        for i in tokens[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            
            # print(chunked)
            for subtree in namedEnt.subtrees(filter=lambda t: t.label() == 'NE'):
                print(subtree)
            
#             namedEnt.draw()
            
    except Exception as e:
        print(str(e))

process_content()

(NE GEORGE/NNP)
(NE ADDRESS/NNP)
(NE THE/NNP)
(NE CONGRESS/NNP)
(NE THE/NNP UNION/NNP)
(NE Mr./NNP Speaker/NNP)
(NE Cheney/NNP)
(NE Congress/NNP)
(NE Supreme/NNP Court/NNP)
(NE America/NNP)
(NE Coretta/NNP Scott/NNP King/NNP)
(NE Applause/NNP)
(NE George/NNP)
(NE Union/NNP Address/NNP)
(NE Capitol/NNP)
(NE Jan/NNP)

Same visualization we can see above.

ner_draw

Text Classification

Now, it's time to process text classification. All processes we've done is some kind of preprocessing the text data, like tokenization, stemming, POS tagging, chunking and chinking, and NER.

In this part, we will use movie review dataset in NLTK, one of famous NLP datasets. This datasets are commonly used to sentimental analysis. But we need to classify each words in advance.

from nltk.corpus import movie_reviews
import random

# Build documents
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories() 
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
random.shuffle(documents)

print('Number of Documents: {}'.format(len(documents)))
print('First Review: {}'.format(documents[1]))

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# Generate frequency distribution
all_words = nltk.FreqDist(all_words)

print('\nMost common words: {}'.format(all_words.most_common(15)))
print('\nThe word happy: {}'.format(all_words["happy"]))

Number of Documents: 2000
First Review: (['capsule', ':', 'gal', 'is', 'a', '50s', '-', 'ish', 'london', 'cockney', 'gangster', 'who', 'has', 'retired', 'to', 'spain', '.', 'his', 'old', 'associates', 'want', 'him', 'for', 'one', 'last', 'job', 'and', 'send', 'the', 'vicious', 'don', 'to', 'give', 'him', 'an', 'offer', 'he', 'can', "'", 't', 'refuse', '.', 'a', 'standout', 'performance', 'by', 'ben', 'kingsley', 'as', 'don', 'cannot', 'save', 'what', 'is', 'essentially', 'a', 'set', 'of', 'cliches', 'recycled', 'from', 'old', 'westerns', '.', ',', '0', '(', '-', '4', 'to', '+', '4', ')', 'roger', 'ebert', 'asks', 'in', 'his', 'review', 'of', 'sexy', 'beast', ',', '"', 'who', 'would', 'have', 'guessed', 'that', 'the', 'most', 'savage', 'mad', '-', 'dog', 'frothing', 'gangster', 'in', 'recent', 'movies', 'would', 'be', 'played', 'by', '.', '.', '.', 'ben', 'kingsley', '?', '"', 'my', 'response', 'would', 'be', 'that', 'anyone', 'who', 'has', 'seen', 'alan', 'arkin', 'in', 'wait', 'until', 'dark', ',', 'henry', 'fonda', 'in', 'once', 'upon', 'a', 'time', 'in', 'the', 'west', ',', 'or', 'anthony', 'hopkins', 'in', 'the', 'silence', 'of', 'the', 'lambs', 'should', 'have', 'guessed', 'it', '.', 'they', 'should', 'know', 'that', 'the', 'way', 'for', 'a', 'film', 'to', 'create', 'a', 'really', 'creepy', 'sociopath', 'is', 'cast', 'someone', 'who', 'generally', 'plays', 'mild', ',', 'sympathetic', ',', 'or', 'even', 'ineffectual', 'character', 'roles', '.', 'the', 'same', 'characteristics', 'that', 'make', 'an', 'actor', 'seem', 'gentle', 'in', 'most', 'of', 'his', 'roles', 'can', 'work', 'in', 'his', 'favor', 'when', 'a', 'role', 'calls', 'for', 'him', 'to', 'be', 'fierce', 'and', 'vicious', '.', 'that', 'is', 'the', 'principle', 'that', 'works', 'for', 'kingsley', 'in', 'sexy', 'beast', '.', 'gary', '"', 'gal', '"', 'dove', '(', 'played', 'by', 'ray', 'winstone', ')', 'has', 'retired', 'from', 'a', 'london', 'career', 'of', 'crime', 'and', 'is', 'living', 'on', 'a', 'luxurious', 'villa', 'in', 'spain', '.', 'life', 'has', 'become', 'a', 'routine', 'of', 'sunning', 'himself', 'and', 'relaxing', '.', 'but', 'his', 'paradise', 'is', 'about', 'to', 'be', 'shattered', 'by', 'a', 'one', '-', 'two', '-', 'punch', '.', 'the', 'first', 'punch', 'is', 'a', 'boulder', 'that', 'comes', 'rolling', 'down', 'the', 'hill', 'next', 'to', 'the', 'villa', '.', 'the', 'second', 'punch', 'comes', 'from', 'gal', "'", 's', 'past', '.', 'back', 'in', 'london', 'gang', 'boss', 'teddy', 'bass', '(', 'ian', 'mcshane', ',', 'tv', "'", 's', 'lovejoy', ')', 'is', 'planning', 'to', 'break', 'into', 'a', 'safety', 'deposit', 'room', 'in', 'a', 'bank', 'and', 'he', 'wants', 'gal', '.', 'he', 'sends', 'his', 'most', 'rabid', 'henchman', 'don', 'logan', '(', 'ben', 'kingsley', ')', 'to', 'fetch', 'gal', '.', 'don', 'will', 'accept', 'any', 'decision', 'gal', 'makes', 'from', '"', 'yes', '"', 'to', '"', 'certainly', '.', '"', 'however', ',', 'if', 'gal', 'says', '"', 'no', '"', 'don', 'will', 'do', 'whatever', 'it', 'takes', 'to', 'turn', 'it', 'into', 'a', 'yes', 'including', 'threatening', 'guy', "'", 's', 'ex', '-', 'porn', '-', 'star', 'wife', 'deedee', '(', 'amanda', 'redman', ')', '.', 'in', 'the', 'meantime', 'don', 'knows', 'just', 'how', 'to', 'get', 'under', 'everybody', "'", 's', 'skin', '.', 'kingsley', 'makes', 'don', 'a', 'compact', 'package', 'of', 'fury', 'and', 'nastiness', '.', 'there', 'are', 'some', 'serious', 'problems', 'in', 'louis', 'mellis', "'", 's', 'and', 'david', 'scinto', "'", 's', 'script', 'that', 'should', 'have', 'been', 'caught', 'before', 'filming', '.', 'when', 'we', 'see', 'the', 'actual', 'crime', 'we', 'have', 'no', 'idea', 'why', 'gal', 'was', 'so', 'important', 'to', 'its', 'success', '.', 'beyond', 'an', 'ability', 'to', 'use', 'skin', '-', 'diving', 'gear', ',', 'no', 'special', 'talents', 'are', 'required', 'of', 'him', '.', 'any', 'local', 'hood', 'could', 'have', 'done', 'what', 'gal', 'is', 'needed', 'for', '.', 'additionally', 'the', 'crime', 'involves', 'digging', 'from', 'a', 'swimming', 'pool', 'to', 'the', 'bank', 'vault', ',', 'flooding', 'the', 'vault', '.', 'no', 'only', 'could', 'they', 'have', 'let', 'the', 'water', 'out', 'of', 'the', 'pool', 'and', 'avoided', 'the', 'complication', 'altogether', ',', 'but', 'there', 'is', 'by', 'far', 'too', 'much', 'water', 'to', 'be', 'accounted', 'for', 'by', 'what', 'was', 'in', 'the', 'pool', '.', 'in', 'spite', 'of', 'the', 'provocative', 'title', ',', 'the', 'story', 'is', 'cliched', 'and', 'overly', 'familiar', '.', 'i', 'know', 'i', 'have', 'seen', 'all', 'the', 'plot', 'elements', 'of', 'sexy', 'beast', 'in', 'old', 'westerns', 'like', 'the', 'law', 'and', 'jake', 'wade', '.', 'the', 'story', 'is', 'usually', 'of', 'the', 'reformed', 'outlaw', ',', 'a', 'robert', 'taylor', 'type', ',', 'who', 'has', 'hung', 'up', 'his', 'guns', 'and', 'is', 'trying', 'for', 'a', 'life', 'of', 'peaceful', 'respectability', '.', 'the', 'old', 'gang', ',', 'however', ',', 'wants', 'to', 'do', 'one', 'more', 'job', 'with', 'their', 'old', 'buddy', 'and', 'sends', 'a', 'rabid', 'richard', 'widmark', 'type', 'to', 'go', 'and', 'git', '?', 'im', '.', 'it', 'is', 'not', 'a', 'great', 'plot', '.', 'in', 'sexy', 'beast', 'even', 'the', 'plot', 'twists', 'have', 'gray', 'beards', '.', 'perhaps', 'the', 'film', 'has', 'a', 'little', 'more', 'respectability', 'because', 'it', 'was', 'made', 'not', 'as', 'a', 'western', 'but', 'as', 'a', 'stylish', 'british', 'gangster', 'film', '.', 'it', 'is', 'an', 'old', 'plot', 'dressed', 'up', 'to', 'look', 'new', '.', 'if', 'the', 'plot', 'is', 'old', ',', 'at', 'least', 'the', 'style', 'is', 'creative', '.', 'this', 'is', 'director', 'jonathan', 'glazer', "'", 's', 'first', 'film', ',', 'but', 'he', 'has', 'reputedly', 'done', 'some', 'notable', 'tv', 'ads', 'for', 'guinness', 'stout', '.', 'his', 'style', 'does', 'have', 'some', 'unexpected', 'touches', 'including', 'some', 'very', 'odd', 'dream', 'sequences', '.', 'cinematographer', 'ivan', 'bird', 'uses', 'a', 'lot', 'of', 'half', 'lit', 'scenes', '.', 'we', 'see', 'one', 'side', 'of', 'a', 'person', "'", 's', 'faces', '.', 'but', 'the', 'other', 'side', 'fades', 'into', 'the', 'darkness', ',', 'a', 'sort', 'of', 'metaphor', 'for', 'the', 'half', '-', 'world', 'these', 'characters', 'in', '-', 'habit', '.', 'half', 'of', 'everything', 'that', 'is', 'happening', 'is', 'also', 'kept', 'hidden', '.', 'we', 'yanks', 'will', 'have', 'a', 'hard', 'time', 'with', 'some', 'of', 'the', 'dialog', '.', 'at', 'least', 'in', 'my', 'theater', 'it', 'was', 'difficult', 'to', 'make', 'out', 'the', 'words', 'with', 'the', 'quiet', 'speaking', ',', 'the', 'heavy', 'accents', ',', 'and', 'the', 'cockney', 'language', '.', 'sexy', 'beast', 'is', 'a', 'very', 'and', 'familiar', 'minor', 'plot', 'lent', 'respectability', 'in', 'the', 'us', 'by', 'being', 'done', 'in', 'what', 'is', 'here', 'a', 'still', 'somewhat', 'novel', 'genre', ',', 'the', 'london', 'crime', 'film', '.', 'the', 'plot', 'may', 'be', 'new', 'to', 'british', 'crime', 'films', ',', 'but', 'it', 'would', 'be', 'overly', 'familiar', 'as', 'a', 'western', '.', 'further', 'respectability', 'comes', 'from', 'ben', 'kingsley', "'", 's', 'high', '-', 'powered', 'performance', '.', 'i', 'give', 'it', 'a', '4', 'on', 'the', '0', 'to', '10', 'scale', 'and', 'a', '0', 'on', 'the', '-', '4', 'to', '+', '4', 'scale', '.'], 'neg')

Most common words: [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]

The word happy: 215

You can see that there are 215 "happy" words in movie_reviews. It means that that review maybe positive review. And you can also notice that the common words contain punctuation or stop words and so on.

Now we need to build features. In this post, we will use 4000 high frequent words as features.

word_features = list(all_words.keys())[:4000]

And it will be helpful to define the function that find the features.

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

Then Let's use an example from a negative review.

neg_features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for k, v in neg_features.items():
    if v == True:
        print(k)

plot
:
two
teen
couples
go
to
a
church
party
,
drink
and
then
drive
.
they
get
into
an
accident
one
of
the
guys
dies
but
his
girlfriend
continues
see
him
in
her
life
has
nightmares
what
'
s
deal
?
watch
movie
"
sorta
find
out
critique
mind
-
fuck
for
generation
that
touches
on
very
cool
idea
presents
it
bad
package
which
is
makes
this
review
even
harder
write
since
i
generally
applaud
films
attempt
break
mold
mess
with
your
head
such
(
lost
highway
&
memento
)
there
are
good
ways
making
all
types
these
folks
just
didn
t
snag
correctly
seem
have
taken
pretty
neat
concept
executed
terribly
so
problems
well
its
main
problem
simply
too
jumbled
starts
off
normal
downshifts
fantasy
world
you
as
audience
member
no
going
dreams
characters
coming
back
from
dead
others
who
look
like
strange
apparitions
disappearances
looooot
chase
scenes
tons
weird
things
happen
most
not
explained
now
personally
don
trying
unravel
film
every
when
does
give
me
same
clue
over
again
kind
fed
up
after
while
biggest
obviously
got
big
secret
hide
seems
want
completely
until
final
five
minutes
do
make
entertaining
thrilling
or
engaging
meantime
really
sad
part
arrow
both
dig
flicks
we
actually
figured
by
half
way
point
strangeness
did
start
little
bit
sense
still
more
guess
bottom
line
movies
should
always
sure
before
given
password
enter
understanding
mean
showing
melissa
sagemiller
running
away
visions
about
20
throughout
plain
lazy
!
okay
people
chasing
know
need
how
giving
us
different
offering
further
insight
down
apparently
studio
took
director
chopped
themselves
shows
might
ve
been
decent
here
somewhere
suits
decided
turning
music
video
edge
would
actors
although
wes
bentley
seemed
be
playing
exact
character
he
american
beauty
only
new
neighborhood
my
kudos
holds
own
entire
feeling
unraveling
overall
doesn
stick
because
entertain
confusing
rarely
excites
feels
redundant
runtime
despite
ending
explanation
craziness
came
oh
horror
slasher
flick
packaged
someone
assuming
genre
hot
kids
also
wrapped
production
years
ago
sitting
shelves
ever
whatever
skip
where
joblo
nightmare
elm
street
3
7
/
10
blair
witch
2
crow
9
salvation
4
stir
echoes
8

Now redo this in whole documents.

featuresets = [(find_features(rev), category) for (rev, category) in documents]

After that, we will use Support Vector Classifier for text classification. Before classifcation, we need to make train and test set, same as usual.

from sklearn.model_selection import train_test_split

training, test = train_test_split(featuresets, test_size=0.25, random_state=1)

print(len(training), len(test))

1500 500

Then, we use SVC from sklearn and SklearnClassifier from nltk.

from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

# Instantiate the model
model = SklearnClassifier(SVC(kernel='linear'))

# Train the model
model.train(training)

# Evaluate the model
accuracy = nltk.classify.accuracy(model, test)
print("SVC Accuracy: {}".format(accuracy))

SVC Accuracy: 0.802

As a result, we can build the text classifier model with almost 80% accuracy.

Summary

In this post, we covered tokenization, stemming, POS tagging, chucking/chinking, and NER for data preprocessing. After that, we built the classifier model with Support Vector Machine, the training it with data. As a result we could build the text classifier model with 80% accuracy.