import pandas as pd

What is data preprocessing?

  • Data Preprocessing
    • Beyond cleaning and exploratory data analysis
    • Prepping data for modeling
    • Modeling in Python requires numerical input (see the sketch below)

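As a quick illustration of that last point, here's a minimal sketch of turning a string column into the numerical input a model expects (the toy color column is made up purely for illustration):

# A toy categorical column (hypothetical data, just for illustration)
df = pd.DataFrame({'color': ['red', 'blue', 'red']})

# One-hot encode the strings into numeric 0/1 columns
print(pd.get_dummies(df['color']))
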
Missing data - columns

We have a dataset of volunteer information from New York City. The dataset has a number of features, but we want to get rid of columns that have fewer than 3 non-missing values. (Note the semantics of dropna's thresh parameter: it counts non-NA values, so thresh=3 keeps a column only if it has at least 3 non-missing entries.)

volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')
volunteer.head()
   opportunity_id  content_id  vol_requests  event_time                                              title  hits
0            4996       37004            50           0  Volunteers Needed For Rise Up & Stay Put! Home...   737
1            5008       37036             2           0                                       Web designer    22
2            5016       37143            20           0      Urban Adventures - Ice Skating at Lasker Rink    62
3            5022       37237           500           0  Fight global hunger and support women farmers ...    14
4            5055       37425            15           0                                      Stop 'N' Swap    31

                                             summary is_priority  category_id              category_desc  ...     end_date_date    status
0  Building on successful events last summer and ...         NaN          NaN                        NaN  ...      July 30 2011  approved
1            Build a website for an Afghan business          NaN          1.0  Strengthening Communities  ...  February 01 2011  approved
2  Please join us and the students from Mott Hall...         NaN          1.0  Strengthening Communities  ...   January 29 2011  approved
3  The Oxfam Action Corps is a group of dedicated...         NaN          1.0  Strengthening Communities  ...     March 31 2012  approved
4  Stop 'N' Swap reduces NYC's waste by finding n...         NaN          4.0                Environment  ...  February 05 2011  approved

   Latitude  Longitude  Community Board  Community Council  Census Tract  BIN  BBL  NTA
0       NaN        NaN              NaN                NaN           NaN  NaN  NaN  NaN
1       NaN        NaN              NaN                NaN           NaN  NaN  NaN  NaN
2       NaN        NaN              NaN                NaN           NaN  NaN  NaN  NaN
3       NaN        NaN              NaN                NaN           NaN  NaN  NaN  NaN
4       NaN        NaN              NaN                NaN           NaN  NaN  NaN  NaN

5 rows × 35 columns

volunteer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   opportunity_id      665 non-null    int64  
 1   content_id          665 non-null    int64  
 2   vol_requests        665 non-null    int64  
 3   event_time          665 non-null    int64  
 4   title               665 non-null    object 
 5   hits                665 non-null    int64  
 6   summary             665 non-null    object 
 7   is_priority         62 non-null     object 
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object 
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object 
 13  org_content_id      665 non-null    int64  
 14  addresses_count     665 non-null    int64  
 15  locality            595 non-null    object 
 16  region              665 non-null    object 
 17  postalcode          659 non-null    float64
 18  primary_loc         0 non-null      float64
 19  display_url         665 non-null    object 
 20  recurrence_type     665 non-null    object 
 21  hours               665 non-null    int64  
 22  created_date        665 non-null    object 
 23  last_modified_date  665 non-null    object 
 24  start_date_date     665 non-null    object 
 25  end_date_date       665 non-null    object 
 26  status              665 non-null    object 
 27  Latitude            0 non-null      float64
 28  Longitude           0 non-null      float64
 29  Community Board     0 non-null      float64
 30  Community Council   0 non-null      float64
 31  Census Tract        0 non-null      float64
 32  BIN                 0 non-null      float64
 33  BBL                 0 non-null      float64
 34  NTA                 0 non-null      float64
dtypes: float64(13), int64(8), object(14)
memory usage: 182.0+ KB
# Keep only columns with at least 3 non-missing values
volunteer.dropna(axis=1, thresh=3).shape
(665, 24)

# The original frame has 35 columns, so 11 all-NaN columns were dropped
volunteer.shape
(665, 35)
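
If you want to see in advance which columns fall below the threshold, you can count the missing values per column first. A quick check on the same DataFrame (the eleven all-NaN columns are exactly the ones dropped above):

# Count missing values in each column
missing = volunteer.isnull().sum()

# Columns that are entirely NaN fall below thresh=3 and get dropped
print(missing[missing == len(volunteer)].index.tolist())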

Missing data - rows

Taking a look at the volunteer dataset again, we want to drop rows where the category_desc column values are missing. We'll do this with boolean indexing: first check how many values are null, then filter the dataset so that we keep only the rows where category_desc is not null.

# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())

# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]

# Print out the shape of the subset
print(volunteer_subset.shape)
48
(617, 35)
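
An equivalent, slightly more compact way to drop those rows is dropna with the subset argument; a quick sketch on the same DataFrame:

# Drop rows where category_desc is missing (same result as the boolean mask)
print(volunteer.dropna(subset=['category_desc']).shape)  # (617, 35)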

Working with data types

  • dtypes in pandas
    • object: string/mixed types
    • int64: integer
    • float64: float
    • datetime64 (or timedelta64): dates and times (see the parsing sketch below)
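
As an aside on that last dtype: the date columns in this dataset (created_date, end_date_date, and so on) load as object. Here's a minimal sketch of parsing one into datetime64 with pd.to_datetime, assuming the date strings parse cleanly (shown without assigning back, so the dtypes printed below are unchanged):

# Parse the object-typed date strings into datetime64 values
print(pd.to_datetime(volunteer['created_date']).head())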

Exploring data types

Taking another look at the dataset of volunteer information from New York City, we want to know which types we'll be working with as we start to do more preprocessing.

volunteer.dtypes
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL                   float64
NTA                   float64
dtype: object

Converting a column type

If you take a look at the volunteer dataset types, the hits column can show up as type object depending on how the file was read, even though, if you actually look at the values, it consists of integers. Let's make sure it's stored as type int by converting it explicitly. (In this copy of the data pandas already inferred int64 on read, so the astype call simply enforces the type.)

print(volunteer['hits'].head())

# Convert the hits column to type int
volunteer['hits'] = volunteer['hits'].astype(int)

# Look at the dtypes of the dataset
print(volunteer.dtypes)
0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL                   float64
NTA                   float64
dtype: object
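
One caveat worth knowing: astype(int) raises an error if the column contains anything that isn't a clean integer. For messier columns, pd.to_numeric with errors='coerce' is a more forgiving alternative; a sketch with made-up values:

# Unparseable entries become NaN instead of raising an error
dirty = pd.Series(['737', '22', 'n/a'])
print(pd.to_numeric(dirty, errors='coerce'))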

Class distribution

  • Stratified sampling
    • A way of sampling that takes into account the distribution of classes or features in your dataset

Class imbalance

In the volunteer dataset, we're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, we need to know what the class distribution (and imbalance) is for that label.

volunteer['category_desc'].value_counts()
Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
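
The imbalance is easier to read as proportions: normalizing the counts shows that Strengthening Communities alone accounts for roughly half of the labeled rows.

# View the class distribution as proportions instead of raw counts
volunteer['category_desc'].value_counts(normalize=True)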

Stratified sampling

We know that the distribution of variables in the category_desc column in the volunteer dataset is uneven. If we wanted to train a model to try to predict category_desc, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.

from sklearn.model_selection import train_test_split

# Drop rows with missing labels, then create a features dataset without category_desc
volunteer_clean = volunteer.dropna(subset=['category_desc'], axis=0)
volunteer_X = volunteer_clean.drop('category_desc', axis=1)

# Create a category_desc labels dataset
volunteer_y = volunteer_clean[['category_desc']]

# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)

# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64

Warning: train_test_split's stratify option cannot handle NaN values in the labels, so you need to drop rows with missing labels before sampling.
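
To verify that stratification worked, compare the class proportions across the full label set and both splits; with stratify they should be nearly identical. A quick check reusing the variables from above:

# Class proportions should closely match across all three
for name, labels in [('all', volunteer_y), ('train', y_train), ('test', y_test)]:
    print(name)
    print(labels['category_desc'].value_counts(normalize=True))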