import pandas as pd
import numpy as np

Why generate features?

Different types of data
- Continuous: either integers (or whole numbers) or floats (decimals)
- Categorical: one of a limited set of values, e.g., gender, country of birth
- Ordinal: ranked values, often with no information about the distance between them
- Boolean: True/False values
- Datetime: dates and times
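
As a rough illustration of the list above, the sketch below builds a small DataFrame with one column per data type and prints the dtype pandas assigns to each. The column names and values are invented for this example and are not taken from the survey file.

```python
import pandas as pd

# Toy data invented for illustration; not part of the survey data set.
toy_df = pd.DataFrame({
    "salary": [52000.0, 61500.5, 48000.0],           # continuous
    "country": ["Spain", "India", "Spain"],          # categorical
    "education": pd.Categorical(                     # ordinal: ranked categories
        ["Bachelor", "Master", "Secondary"],
        categories=["Secondary", "Bachelor", "Master"],
        ordered=True,
    ),
    "hobby": [True, False, True],                    # boolean
    "survey_date": pd.to_datetime(                   # datetime
        ["2018-02-28", "2018-06-28", "2018-06-06"]
    ),
})

# Show the dtype pandas uses to store each kind of column
print(toy_df.dtypes)

# An ordered categorical supports rank comparisons such as min() and max()
print(toy_df["education"].min())  # → Secondary
```

Note that a plain string column such as country is not stored as a categorical unless you ask for it, which is one reason to inspect dtypes before engineering features.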

Getting to know your data

You will be working with a modified subset of the Stack Overflow survey response data in the first three chapters of this course. This data set records the details and preferences of thousands of users of the Stack Overflow website.

# Load the survey data into a DataFrame
so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')

# Print the first five rows of the DataFrame
so_survey_df.head()

print(so_survey_df.dtypes)

SurveyDate                     object
FormalEducation                object
ConvertedSalary               float64
Hobby                          object
Country                        object
StackOverflowJobsRecommend    float64
VersionControl                 object
Age                             int64
Years Experience                int64
Gender                         object
RawSalary                      object
dtype: object

Selecting specific data types

Often a data set will contain columns with several different data types (like the one you are working with). The majority of machine learning models require you to have a consistent data type across features. Similarly, most feature engineering techniques are applicable to only one type of data at a time. For these reasons among others, you will often want to be able to access just the columns of certain types when working with a DataFrame.

# Create a subset containing only the integer and float columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])

# Print the column names contained in so_numeric_df
print(so_numeric_df.columns)

Index(['ConvertedSalary', 'StackOverflowJobsRecommend'], dtype='object')
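
The output above lists only the two float64 columns, even though Age and Years Experience were reported as int64 earlier. One plausible explanation (an assumption, not confirmed by the source) is that the 'int' alias resolved to a 32-bit integer type on the machine where the code was run, so the int64 columns did not match. A minimal sketch of a selection that does not depend on integer width uses include='number'. The toy DataFrame below is invented for illustration.

```python
import pandas as pd

# Toy data invented for illustration; mirrors a few survey column names.
toy_df = pd.DataFrame({
    "ConvertedSalary": [70841.0, 21426.0],
    "Age": [38, 46],
    "Country": ["Sweeden", "UK"],
})
# Force a 64-bit integer column, as in the dtypes listing above
toy_df["Age"] = toy_df["Age"].astype("int64")

# 'number' matches every numeric dtype, whatever its bit width
numeric_df = toy_df.select_dtypes(include="number")
# exclude= gives the complementary set of columns
non_numeric_df = toy_df.select_dtypes(exclude="number")

print(list(numeric_df.columns))      # → ['ConvertedSalary', 'Age']
print(list(non_numeric_df.columns))  # → ['Country']
```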

Dealing with categorical features

Encoding categorical features
- One-hot encoding
- Dummy encoding
One-hot vs. dummies
- One-hot encoding: Explainable features
- Dummy encoding: Necessary information without duplication

One-hot encoding and dummy variables

To use categorical variables in a machine learning model, you first need to represent them in a quantitative way. The two most common approaches are to one-hot encode the variables or to use dummy variables. In this exercise, you will create both types of encoding and compare the resulting column sets.

# One-hot encode the Country column, prefixing the new columns with 'OH'
one_hot_encoded = pd.get_dummies(so_survey_df, columns=['Country'], prefix='OH')

# Print the column names
print(one_hot_encoded.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'OH_France', 'OH_India',
       'OH_Ireland', 'OH_Russia', 'OH_South Africa', 'OH_Spain', 'OH_Sweeden',
       'OH_UK', 'OH_USA', 'OH_Ukraine'],
      dtype='object')

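# Create dummy variables for the Country column, dropping the first category and prefixing with 'DM'
dummy = pd.get_dummies(so_survey_df, columns=['Country'], drop_first=True, prefix='DM')

# Print the column names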
print(dummy.columns)

Index(['SurveyDate', 'FormalEducation', 'ConvertedSalary', 'Hobby',
       'StackOverflowJobsRecommend', 'VersionControl', 'Age',
       'Years Experience', 'Gender', 'RawSalary', 'DM_India', 'DM_Ireland',
       'DM_Russia', 'DM_South Africa', 'DM_Spain', 'DM_Sweeden', 'DM_UK',
       'DM_USA', 'DM_Ukraine'],
      dtype='object')
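
To make the difference between the two column sets concrete, here is a small sketch on an invented three-country DataFrame (not the survey data). With drop_first=True the first category, France here, gets no column of its own and is represented by all of the remaining columns being zero.

```python
import pandas as pd

# Toy data invented for illustration
toy_df = pd.DataFrame({"Country": ["France", "India", "UK", "India"]})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(toy_df, columns=["Country"], prefix="OH")

# Dummy encoding: the first category is dropped and becomes the implicit baseline
dummy = pd.get_dummies(toy_df, columns=["Country"], prefix="DM", drop_first=True)

print(list(one_hot.columns))  # → ['OH_France', 'OH_India', 'OH_UK']
print(list(dummy.columns))    # → ['DM_India', 'DM_UK']

# Row 0 is France: exactly one flag set in the one-hot frame, none set in the dummy frame
print(one_hot.iloc[0].sum(), dummy.iloc[0].sum())  # → 1 0
```

The one-hot version is easier to read column by column, while the dummy version avoids the redundant column that can be derived from the others.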

Dealing with uncommon categories

Some features can have many different categories but a very uneven distribution of their occurrences. Take, for example, data scientists' favorite languages to code in: some common choices are Python, R, and Julia, but there can be individuals with bespoke choices, like FORTRAN, C, etc. In these cases, you may not want to create a feature for each value, but only for the more common ones.

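# Create a Series from the Country column; .copy() keeps later edits from writing into so_survey_df
countries = so_survey_df['Country'].copy()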

# Get the counts of each category
country_counts = countries.value_counts()

# Print the count values for each category
print(country_counts)

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Ukraine           9
Ireland           5
Name: Country, dtype: int64

# Build a mask that is True for countries occurring fewer than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
print(mask.head())

0    False
1    False
2    False
3    False
4    False
Name: Country, dtype: bool

# Relabel all the rare countries as 'Other'
countries[mask] = 'Other'

# Print the updated category counts
print(countries.value_counts())

South Africa    166
USA             164
Spain           134
Sweeden         119
France          115
Russia           97
India            95
UK               95
Other            14
Name: Country, dtype: int64
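
The same grouping can be sketched without assigning into a slice of the original DataFrame, which is one way to avoid pandas' SettingWithCopyWarning. The example below uses invented data and a hypothetical threshold of 2 occurrences. It takes an explicit copy of the column and builds the relabelled Series with where().

```python
import pandas as pd

# Toy data invented for illustration
toy_df = pd.DataFrame({"Country": ["USA"] * 4 + ["Spain"] * 3 + ["Ireland"]})

# Work on an explicit copy so the original DataFrame is left untouched
countries = toy_df["Country"].copy()

# Categories seen fewer than 2 times count as rare (hypothetical threshold)
counts = countries.value_counts()
rare = counts[counts < 2].index

# Keep values that are not rare; replace the rest with 'Other'
countries = countries.where(~countries.isin(rare), "Other")

print(countries.value_counts())
```

Here Ireland is the only category below the threshold, so it is the only one folded into 'Other', and toy_df itself still contains the original labels.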

Numeric variables

Binarizing columns

While numeric values can often be used without any feature engineering, there are cases where some manipulation is useful. For example, you might not care about the magnitude of a value, only about its direction or whether it exists at all. In these situations, you will want to binarize the column. In the so_survey_df data, a large number of survey respondents work voluntarily (without pay). You will create a new column titled Paid_Job indicating whether each person is paid (their salary is greater than zero).

# Create the Paid_Job column filled with zeros
so_survey_df['Paid_Job'] = 0

# Replace all the Paid_Job values where ConvertedSalary is > 0
so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1

# Print the first five rows of the columns
so_survey_df[['Paid_Job', 'ConvertedSalary']].head()
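
An equivalent one-step way to binarize is to cast the boolean comparison to integers. The sketch below uses invented salary values. Because any comparison with NaN evaluates to False, a missing salary ends up as 0, the same as an unpaid respondent, which may or may not be what you want.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration: a missing, a positive, a zero, and a positive salary
toy_df = pd.DataFrame({"ConvertedSalary": [np.nan, 70841.0, 0.0, 21426.0]})

# The comparison yields True/False; astype(int) turns that into 1/0
toy_df["Paid_Job"] = (toy_df["ConvertedSalary"] > 0).astype(int)

print(toy_df["Paid_Job"].tolist())  # → [0, 1, 0, 1]
```

If missing salaries should stay distinguishable from zero salaries, a separate missing-value indicator column is one option.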

Binning values

For many continuous variables, you will care less about the exact value than about the bucket it falls into. Binning can be useful when plotting values or when simplifying your machine learning models. It is mostly used on continuous variables where precision is not the biggest concern, e.g., age, height, wages.

Bins are created using pd.cut(df['column_name'], bins) where bins can be an integer specifying the number of evenly spaced bins, or a list of bin boundaries.

# Bin the continuous variable ConvertedSalary into 5 equal-width bins
so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)

# Print the first 5 rows of the equal_binned column
so_survey_df[['equal_binned', 'ConvertedSalary']].head()

# Bin boundaries
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
so_survey_df[['boundary_binned', 'ConvertedSalary']].head()
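
The sketch below applies the same two binning styles to a handful of invented salary values so the edge behaviour is easier to see. By default pd.cut treats each bin as closed on the right, so a value exactly on a boundary (10000 here) falls into the lower bin, and missing values stay missing.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration, including a boundary value and a missing value
salaries = pd.Series([5000, 10000, 10001, 75000, 200000, np.nan])

# Explicit boundaries with labels, as in the exercise above
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
labels = ["Very low", "Low", "Medium", "High", "Very high"]
boundary_binned = pd.cut(salaries, bins=bins, labels=labels)

# Integer bins: 5 equal-width intervals spanning the observed range
equal_binned = pd.cut(salaries, bins=5)

print(boundary_binned.tolist())
# → ['Very low', 'Very low', 'Low', 'Medium', 'Very high', nan]
print(equal_binned.value_counts().sort_index())
```

With skewed data such as salaries, equal-width bins tend to put most rows into the lowest interval, which is why explicit boundaries are often more informative.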

Output of so_survey_df.head():

   SurveyDate     FormalEducation                                    ConvertedSalary  Hobby  Country       StackOverflowJobsRecommend  VersionControl     Age  Years Experience  Gender  RawSalary
0  2/28/18 20:20  Bachelor's degree (BA. BS. B.Eng.. etc.)           NaN              Yes    South Africa  NaN                         Git                21   13                Male    NaN
1  6/28/18 13:26  Bachelor's degree (BA. BS. B.Eng.. etc.)           70841.0          Yes    Sweeden       7.0                         Git;Subversion     38   9                 Male    70,841.00
2  6/6/18 3:37    Bachelor's degree (BA. BS. B.Eng.. etc.)           NaN              No     Sweeden       8.0                         Git                45   11                NaN     NaN
3  5/9/18 1:06    Some college/university study without earning ...  21426.0          Yes    Sweeden       NaN                         Zip file back-ups  46   12                Male    21,426.00
4  4/12/18 22:41  Bachelor's degree (BA. BS. B.Eng.. etc.)           41671.0          Yes    UK            8.0                         Git                39   7                 Male    £41,671.00

Output of so_survey_df[['Paid_Job', 'ConvertedSalary']].head():

   Paid_Job  ConvertedSalary
0         0              NaN
1         1          70841.0
2         0              NaN
3         1          21426.0
4         1          41671.0

Output of so_survey_df[['equal_binned', 'ConvertedSalary']].head():

          equal_binned  ConvertedSalary
0                  NaN              NaN
1  (-2000.0, 400000.0]          70841.0
2                  NaN              NaN
3  (-2000.0, 400000.0]          21426.0
4  (-2000.0, 400000.0]          41671.0

Output of so_survey_df[['boundary_binned', 'ConvertedSalary']].head():

   boundary_binned  ConvertedSalary
0              NaN              NaN
1           Medium          70841.0
2              NaN              NaN
3              Low          21426.0
4              Low          41671.0