Introduction to Data Preprocessing
In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data. This is a summary of the lecture "Preprocessing for Machine Learning in Python" from DataCamp.
import pandas as pd
volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')
volunteer.head()
volunteer.info()
volunteer.dropna(axis=1, thresh=3).shape
volunteer.shape
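To make the effect of `thresh` concrete, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its column names are hypothetical, not part of the volunteer dataset). With `axis=1`, `thresh=3` keeps only the columns that have at least three non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" is complete, "b" has one non-null value, "c" has none
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 5.0],
    "c": [np.nan, np.nan, np.nan, np.nan],
})

# Keep only the columns with at least 3 non-null values
kept = toy.dropna(axis=1, thresh=3)
print(kept.columns.tolist())
print(toy.shape, kept.shape)
```

Comparing the two shapes, as the cells above do for `volunteer`, shows how many columns were dropped.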
Missing data - rows
Taking another look at the volunteer dataset, we want to drop rows where the category_desc column is missing. We'll do this with boolean indexing: build a mask that marks which values are not null, then filter the dataset so that only rows with a non-null category_desc remain.
print(volunteer['category_desc'].isnull().sum())
# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]
# Print out the shape of the subset
print(volunteer_subset.shape)
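As a cross-check, the same filtering can probably be expressed with `dropna(subset=...)`. The sketch below uses a small made-up frame (`toy` and its values are hypothetical) and compares the two approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing category_desc values
toy = pd.DataFrame({
    "title": ["w", "x", "y", "z"],
    "category_desc": ["Education", np.nan, "Health", np.nan],
})

# Boolean indexing: keep rows where the mask is True
by_mask = toy[toy["category_desc"].notnull()]

# dropna restricted to a single column
by_dropna = toy.dropna(subset=["category_desc"])

# Both approaches keep the same rows and the same index labels
same = by_mask.equals(by_dropna)
print(same)
```

Boolean indexing is the more general tool, since the mask can combine several conditions; `dropna` is shorter when the only criterion is missingness.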
volunteer.dtypes
print(volunteer['hits'].head())
# Convert the hits column to type int
volunteer['hits'] = volunteer['hits'].astype(int)
# Look at the dtypes of the dataset
print(volunteer.dtypes)
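`astype(int)` raises an error if a value cannot be parsed as an integer. If a column might contain such entries, one possible alternative is `pd.to_numeric` with `errors="coerce"`. The sketch below uses made-up values (`raw` is hypothetical and is not taken from the volunteer dataset).

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, plus one unparseable entry
raw = pd.Series(["737", "22", "n/a", "62"])

# errors="coerce" turns entries that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

# The nullable Int64 dtype holds integers alongside missing values
hits = numeric.astype("Int64")
print(hits.dtype)
print(hits.isna().sum())
```

Whether coercing to missing values is appropriate depends on the data; for a column that is known to be clean, `astype(int)` as used above is simpler.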
volunteer['category_desc'].value_counts()
Stratified sampling
We know that the distribution of values in the category_desc column of the volunteer dataset is uneven. If we wanted to train a model to predict category_desc, we would want to train it on a sample that is representative of the entire dataset. Stratified sampling achieves this by preserving the class proportions in each split.
from sklearn.model_selection import train_test_split
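# Drop rows with a missing category_desc, since every row needs a label for stratification
volunteer_clean = volunteer.dropna(subset=['category_desc'], axis=0)
# Create a dataset with all columns except category_desc
volunteer_X = volunteer_clean.drop(columns='category_desc')
# Create a category_desc labels dataset
volunteer_y = volunteer_clean[['category_desc']]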
# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)
# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
Note: train_test_split cannot handle the NaN values in category_desc when stratifying, so you need to drop those rows before sampling.
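To see what stratification buys you, here is a minimal sketch on a made-up imbalanced dataset (the `toy` frame, the 80/20 class mix, `test_size`, and `random_state` are all assumptions for illustration). It checks that the class shares in the training and test labels match those of the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 rows of class "a", 20 rows of class "b"
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 80 + ["b"] * 20,
})
X = toy.drop(columns="label")
y = toy["label"]

# stratify=y asks for the same class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Share of each class in the training and test labels
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, a random split of a small or heavily skewed dataset can leave a rare class under-represented in one of the splits.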