import pandas as pd
import numpy as np

Why generate features?

  • Different types of data
    • Continuous: either integers (or whole numbers) or floats (decimals)
    • Categorical: one of a limited set of values, e.g., gender, country of birth
    • Ordinal: ranked values often with no details of distance between them
    • Boolean: True/False values
    • Datetime: dates and times

Getting to know your data

You will be working with a modified subset of the Stackoverflow survey response data in the first three chapters of this course. This data set records the details, and preferences of thousands of users of the StackOverflow website.

so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')

# Print the first five rows of the DataFrame
SurveyDate FormalEducation ConvertedSalary Hobby Country StackOverflowJobsRecommend VersionControl Age Years Experience Gender RawSalary
0 2/28/18 20:20 Bachelor's degree (BA. BS. B.Eng.. etc.) NaN Yes South Africa NaN Git 21 13 Male NaN
1 6/28/18 13:26 Bachelor's degree (BA. BS. B.Eng.. etc.) 70841.0 Yes Sweeden 7.0 Git;Subversion 38 9 Male 70,841.00
2 6/6/18 3:37 Bachelor's degree (BA. BS. B.Eng.. etc.) NaN No Sweeden 8.0 Git 45 11 NaN NaN
3 5/9/18 1:06 Some college/university study without earning ... 21426.0 Yes Sweeden NaN Zip file back-ups 46 12 Male 21,426.00
4 4/12/18 22:41 Bachelor's degree (BA. BS. B.Eng.. etc.) 41671.0 Yes UK 8.0 Git 39 7 Male £41,671.00
SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object

Selecting specific data types

Often a data set will contain columns with several different data types (like the one you are working with). The majority of machine learning models require you to have a consistent data type across features. Similarly, most feature engineering techniques are applicable to only one type of data at a time. For these reasons among others, you will often want to be able to access just the columns of certain types when working with a DataFrame.

so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

# Print the column names contained in so_numeric_df
Index(['ConvertedSalary', 'StackOverflowJobsRecommend'], dtype='object')

Dealing with categorical features

  • Encoding categorical features
    • One-hot encoding
    • Dummy encoding
  • One-hot vs. dummies
    • One-hot encoding: Explainable features
    • Dummy encoding: Necessary information without duplication

One-hot encoding and dummy variables

To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables using or to use dummy variables. In this exercise, you will create both types of encoding, and compare the created column sets.

one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the columns names
Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the columns names
Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],

Dealing with uncommon categories

Some features can have many different categories but a very uneven distribution of their occurrences. Take for example Data Science's favorite languages to code in, some common choices are Python, R, and Julia, but there can be individuals with bespoke choices, like FORTRAN, C etc. In these cases, you may not want to create a feature for each value, but only the more common occurrences.

countries = so_survey_df.Country

# Get the counts of each category
country_counts = countries.value_counts()

# Print the count values for each category
South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Ukraine           9
Ireland           5
Name: Country, dtype: int64
mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
0    False
1    False
2    False
3    False
4    False
Name: Country, dtype: bool
countries[mask] = 'Other'

# Print the updated category counts
South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Other            14
Name: Country, dtype: int64
C:\Users\kcsgo\anaconda3\lib\site-packages\ SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:

Numeric variables

Binarizing columns

While numeric values can often be used without any feature engineering, there will be cases when some form of manipulation can be useful. For example on some occasions, you might not care about the magnitude of a value but only care about its direction, or if it exists at all. In these situations, you will want to binarize a column. In the so_survey_df data, you have a large number of survey respondents that are working voluntarily (without pay). You will create a new column titled Paid_Job indicating whether each person is paid (their salary is greater than zero).

so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
so_survey_df[['Paid_Job', 'ConvertedSalary']].head()
Paid_Job ConvertedSalary
0 0 NaN
1 1 70841.0
2 0 NaN
3 1 21426.0
4 1 41671.0

Binning values

For many continuous values you will care less about the exact value of a numeric column, but instead care about the bucket it falls into. This can be useful when plotting values, or simplifying your machine learning models. It is mostly used on continuous variables where accuracy is not the biggest concern e.g. age, height, wages.

Bins are created using pd.cut(df['column_name'], bins) where bins can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.

so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)

# Print the first 5 rows of the equal_binned column
so_survey_df[['equal_binned', 'ConvertedSalary']].head()
equal_binned ConvertedSalary
0 NaN NaN
1 (-2000.0, 400000.0] 70841.0
2 NaN NaN
3 (-2000.0, 400000.0] 21426.0
4 (-2000.0, 400000.0] 41671.0
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
so_survey_df[['boundary_binned', 'ConvertedSalary']].head()
boundary_binned ConvertedSalary
0 NaN NaN
1 Medium 70841.0
2 NaN NaN
3 Low 21426.0
4 Low 41671.0