Building a fake news classifier
You'll apply what you've learned so far, along with some supervised machine learning, to build a "fake news" detector. You'll begin by learning the basics of supervised learning with NLP, then move on to choosing a few important features and testing ideas to identify and classify fake news articles. This post is a summary of the lecture "Introduction to Natural Language Processing in Python" on DataCamp.
- Classifying fake news using supervised learning with NLP
- Building word count vectors with scikit-learn
- Training and testing a classification model with scikit-learn
- Simple NLP, complex problems
import pandas as pd
import numpy as np
Classifying fake news using supervised learning with NLP
- Supervised learning with NLP
  - Uses features derived from language rather than geometric features
  - Common representations are bag-of-words counts or tf-idf features (a minimal bag-of-words sketch follows this list)
- Supervised learning steps
  - Collect and preprocess the data
  - Determine a label
  - Split the data into training and test sets
  - Extract features from the text to help predict the label
  - Evaluate the trained model on the test set
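As a quick illustration of the bag-of-words idea, here is a minimal sketch (the two toy sentences are made up for demonstration):
from collections import Counter
# A bag-of-words model keeps only token counts and discards word order
docs = ['the cat sat on the mat', 'the dog sat']
for doc in docs:
    print(Counter(doc.split()))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
# Counter({'the': 1, 'dog': 1, 'sat': 1})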
CountVectorizer for text classification
It's time to begin building your text classifier! The data has been loaded into a DataFrame called df, and the .head() method is particularly informative. In this exercise, you'll use pandas alongside scikit-learn to create a sparse text vectorizer that you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features. (Here, the relevant columns are text, the article body, and label, which marks each article FAKE or REAL.)
df = pd.read_csv('./dataset/fake_or_real_news.csv')
df.head()
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
# Create a series to store the labels: y
y = df.label
# Create training set and test set
X_train, X_test, y_train, y_test = train_test_split(df['text'], y,
test_size=0.33, random_state=53)
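Here, test_size=0.33 holds out a third of the articles for evaluation, and random_state=53 simply makes the split reproducible.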
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')
# Transform the training data using only the 'text' column values: count_train
count_train = count_vectorizer.fit_transform(X_train)
# Transform the test data using only the 'text' column values: count_test
count_test = count_vectorizer.transform(X_test)
# Print the first 10 features of the count_vectorizer
# (get_feature_names() was removed in newer scikit-learn; use get_feature_names_out())
print(count_vectorizer.get_feature_names_out()[:10])
TfidfVectorizer for text classification
Similar to the sparse CountVectorizer created above, you'll now work on creating tf-idf vectors for the same documents.
from sklearn.feature_extraction.text import TfidfVectorizer
# Initialize a TfidfVectorizer object, dropping English stop words and any
# term that appears in more than 70% of the documents: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
# Transform the training data: tfidf_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
# Transform the test data: tfidf_test
tfidf_test = tfidf_vectorizer.transform(X_test)
# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10])
# Print the first 5 vectors of the tfidf training data
print(tfidf_train.A[:5])
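As an aside, a tf-idf matrix is simply a reweighted count matrix. Here is a minimal sketch of that relationship on a made-up toy corpus, using scikit-learn's TfidfTransformer (not used elsewhere in this exercise):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np
toy = ['the cat sat', 'the cat sat on the mat']
# Two steps: raw counts, then tf-idf reweighting
counts = CountVectorizer().fit_transform(toy)
two_step = TfidfTransformer().fit_transform(counts)
# One step: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(toy)
# Both routes yield the same matrix
print(np.allclose(two_step.toarray(), one_step.toarray()))  # True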
Inspecting the vectors
To get a better idea of how the vectors work, you'll investigate them by converting them into pandas DataFrames.
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names_out())
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names_out())
# Print the head of count_df
print(count_df.head())
# Print the head of tfidf_df
print(tfidf_df.head())
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)
# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))
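The final check prints False: even where the vocabularies overlap, tf-idf stores reweighted floats rather than raw integer counts, and max_df=0.7 can additionally drop very frequent terms from the tf-idf vocabulary.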
Training and testing the "fake news" model with CountVectorizer
Now it's time to train and test a Naive Bayes model using the CountVectorizer data.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)
# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)
# Calculate the accuracy score: score
score = accuracy_score(y_test, pred)
print(score)
# Calculate the confusion matrix: cm
cm = confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
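With labels=['FAKE', 'REAL'], the rows of cm are the true labels and the columns are the predictions: cm[0, 0] counts fake articles correctly flagged as fake, cm[0, 1] counts fake articles misclassified as real, and so on.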
Training and testing the "fake news" model with TfidfVectorizer
Now that you have evaluated the model using the CountVectorizer, you'll do the same using the tf-idf vectors.
# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()
# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)
# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)
# Calculate the accuracy score: score
score = accuracy_score(y_test, pred)
print(score)
# Calculate the confusion matrix: cm
cm = confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)
Improving your model
Your job is to test a few different alpha (smoothing) levels for the Naive Bayes model and find the best one.
# Create the list of alphas: alphas
alphas = np.arange(0, 1, 0.1)
# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
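Note that alpha=0 disables smoothing entirely; recent versions of scikit-learn warn about this and clip alpha to a tiny positive value, so treat the score at alpha=0 with some caution.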
Inspecting your model
Now that you have built a "fake news" classifier, you'll investigate what it has learned by mapping the important vector weights back to actual words.
# Get the class labels: class_labels
class_labels = nb_classifier.classes_
# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names_out()
# Zip the feature names together with the per-feature weights
# and sort by weight: feat_with_weights
# (coef_ was removed from MultinomialNB in newer scikit-learn; for this binary
# problem, feature_log_prob_[1] matches the old coef_[0])
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))
# Print the first class label and the 20 lowest-weighted features
print(class_labels[0], feat_with_weights[:20])
# Print the second class label and the 20 highest-weighted features
print(class_labels[1], feat_with_weights[-20:])
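As a hedged alternative sketch (not part of the original exercise, and assuming the binary FAKE/REAL setup above): rather than sorting a single class's log probabilities, you can rank tokens by the gap between the two classes' log probabilities.
# Rank tokens by how much more likely they are under REAL than under FAKE
log_ratio = nb_classifier.feature_log_prob_[1] - nb_classifier.feature_log_prob_[0]
most_real = np.argsort(log_ratio)[-10:]   # tokens leaning REAL
most_fake = np.argsort(log_ratio)[:10]    # tokens leaning FAKE
print('REAL-leaning:', [feature_names[i] for i in most_real])
print('FAKE-leaning:', [feature_names[i] for i in most_fake])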