Analyzing the effect of weather on policing
A summary of the DataCamp lecture "Analyzing Police Activity with pandas"
- Exploring the weather dataset
- Categorizing the weather
- Merging datasets
- Does weather affect the arrest rate?
import pandas as pd
import matplotlib.pyplot as plt
Plotting the temperature
In this exercise, you'll examine the temperature columns from the weather dataset to assess whether the data seems trustworthy. First you'll print the summary statistics, and then you'll visualize the data using a box plot.
When deciding whether the values seem reasonable, keep in mind that the temperature is measured in degrees Fahrenheit, not Celsius!
# Read 'weather.csv' into a DataFrame
weather = pd.read_csv('./dataset/weather.csv')
# Describe the temperature columns
print(weather[['TMIN', 'TAVG', 'TMAX']].describe())
# Create a box plot of the temperature columns
weather[['TMIN', 'TAVG', 'TMAX']].plot(kind='box')
plt.savefig('../images/tmin-boxplot.png')
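As an extra sanity check (not part of the original exercise), you can also confirm that the three columns are internally consistent, i.e. that TMIN <= TAVG <= TMAX on each day; note that days with a missing TAVG will show up as violations in this check.
# Extra check (my addition): TMIN <= TAVG <= TMAX should hold on every fully recorded day
consistent = (weather.TMIN <= weather.TAVG) & (weather.TAVG <= weather.TMAX)
print((~consistent).sum())  # rows that violate the ordering (or have a missing TAVG)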
Plotting the temperature difference
In this exercise, you'll continue to assess whether the dataset seems trustworthy by plotting the difference between the maximum and minimum temperatures.
What do you notice about the resulting histogram? Does it match your expectations, or do you see anything unusual?
# Create a 'TDIFF' column that represents the temperature difference
weather['TDIFF'] = weather.TMAX - weather.TMIN
# Describe the 'TDIFF' column
print(weather.TDIFF.describe())
# Create a histogram with 20 bins to visualize 'TDIFF'
weather.TDIFF.plot(kind='hist', bins=20)
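If TMAX and TMIN are recorded correctly, TDIFF should never be negative; here is a one-line check (my addition, using the column created above).
# TDIFF should never be negative if TMAX and TMIN are trustworthy
print((weather.TDIFF < 0).sum())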
Counting bad weather conditions
The weather DataFrame contains 20 columns that start with 'WT', each of which represents a bad weather condition. For example:
- WT05 indicates "Hail"
- WT11 indicates "High or damaging winds"
- WT17 indicates "Freezing rain"
For every row in the dataset, each WT column contains either a 1 (meaning the condition was present that day) or NaN (meaning the condition was not present).
In this exercise, you'll quantify "how bad" the weather was each day by counting the number of 1 values in each row.
# Select the bad weather condition columns, 'WT01' through 'WT22'
WT = weather.loc[:, 'WT01':'WT22']
# Calculate the sum of each row in 'WT'
weather['bad_conditions'] = WT.sum(axis='columns')
# Replace any missing values in 'bad_conditions' with 0 and convert to integer
weather['bad_conditions'] = weather.bad_conditions.fillna(0).astype('int')
# Create a histogram to visualize 'bad_conditions'
weather['bad_conditions'].plot(kind='hist')
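Since .sum() skips NaN values, summing each row is equivalent to counting the non-missing WT flags. The following equivalence check is a hedged sketch added here, not part of the original exercise.
# Equivalent approach: count the non-missing WT flags in each row
print((WT.count(axis='columns') == weather.bad_conditions).all())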
Rating the weather conditions
In the previous exercise, you counted the number of bad weather conditions each day. In this exercise, you'll use the counts to create a rating system for the weather.
The counts range from 0 to 9, and should be converted to ratings as follows:
- Convert 0 to 'good'
- Convert 1 through 4 to 'bad'
- Convert 5 through 9 to 'worse'
# Count the unique values in 'bad_conditions' and sort the index
print(weather.bad_conditions.value_counts().sort_index())
# Create a dictionary that maps integers to strings
mapping = {0:'good', 1:'bad', 2:'bad', 3:'bad', 4:'bad', 5:'worse',
6:'worse', 7:'worse', 8:'worse', 9:'worse'}
# Convert the 'bad_conditions' integers to strings using the 'mapping'
weather['rating'] = weather.bad_conditions.map(mapping)
# Count the unique values in 'rating'
print(weather.rating.value_counts())
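For reference, the same ratings could also be produced with pd.cut, which bins the counts instead of listing every value explicitly; this is an alternative sketch rather than part of the original exercise.
# Alternative sketch: bin the counts with pd.cut instead of an explicit mapping
rating_alt = pd.cut(weather.bad_conditions, bins=[-1, 0, 4, 9], labels=['good', 'bad', 'worse'])
print(rating_alt.value_counts())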
Changing the data type to category
Since the rating column only has a few possible values, you'll change its data type to category in order to store the data more efficiently. You'll also specify a logical order for the categories, which will be useful for future exercises.
NOTE: in recent versions of pandas (including 1.0.3, used here), ordered categories must be defined with pd.api.types.CategoricalDtype
cats = pd.api.types.CategoricalDtype(categories=['good', 'bad', 'worse'], ordered=True)
# cats = pd.CategoricalDtype(categories=['good', 'bad', 'worse'], ordered=True)  # equivalent top-level alias in newer pandas versions
# Change the data type of 'rating' to category
weather['rating'] = weather.rating.astype(cats)
# Examine the head of 'rating'
print(weather.rating.head())
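Because the categories are ordered, comparison operators now work on the rating column; for example (my addition), you can select all days rated worse than 'good'.
# Ordered categories support comparisons: select days rated worse than 'good'
print(weather[weather.rating > 'good'].shape)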
Preparing the DataFrames
In this exercise, you'll prepare the traffic stop and weather rating DataFrames so that they're ready to be merged:
- With the ri DataFrame, you'll make sure stop_datetime is stored as a regular column rather than as the index, since the index would be lost during the merge.
- With the weather DataFrame, you'll select the DATE and rating columns and put them in a new DataFrame.
# Read 'police.csv' into a DataFrame
ri = pd.read_csv('./dataset/police.csv')
# Concatenate 'stop_date' and 'stop_time' (separated by a space) and convert to datetime
combined = ri.stop_date.str.cat(ri.stop_time, sep=' ')
ri['stop_datetime'] = pd.to_datetime(combined)
# Convert 'is_arrested' to boolean and move the index into a regular column
ri['is_arrested'] = ri['is_arrested'].astype(bool)
ri.reset_index(inplace=True)
# Examine the head of 'ri'
print(ri.head())
# Create a DataFrame from the 'DATE' and 'rating' columns
weather_rating = weather[['DATE', 'rating']]
# Examine the head of 'weather_rating'
print(weather_rating.head())
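Before merging, it's worth double-checking that stop_date and DATE use the same format and actually overlap; here is a quick hedged check, assuming both columns hold plain 'YYYY-MM-DD' strings.
# Check that the two date columns share the same format and overlap
print(ri.stop_date.head(3).tolist())
print(weather_rating.DATE.head(3).tolist())
# Fraction of stops whose date appears in the weather data
print(ri.stop_date.isin(weather_rating.DATE).mean())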
Merging the DataFrames
In this exercise, you'll merge the ri and weather_rating DataFrames into a new DataFrame, ri_weather.
The DataFrames will be joined using the stop_date column from ri and the DATE column from weather_rating. Thankfully the date formatting matches exactly, which is not always the case!
Once the merge is complete, you'll set stop_datetime as the index, which is the column you saved in the previous exercise.
print(ri.shape)
# Merge 'ri' and 'weather_rating' using left join
ri_weather = pd.merge(left=ri, right=weather_rating,
left_on='stop_date', right_on='DATE', how='left')
# Examine the shape of 'ri_weather'
print(ri_weather.shape)
# Set 'stop_datetime' as the index of 'ri_weather'
ri_weather.set_index('stop_datetime', inplace=True)
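To see how many stops actually matched a weather record, the merge can be re-run as a side check with indicator=True (a sketch, not part of the exercise).
# Side check: how many rows found a matching 'DATE' in 'weather_rating'?
check = pd.merge(left=ri, right=weather_rating, left_on='stop_date', right_on='DATE', how='left', indicator=True)
print(check['_merge'].value_counts())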
Comparing arrest rates by weather rating
Do police officers arrest drivers more often when the weather is bad? Find out below!
- First, you'll calculate the overall arrest rate.
- Then, you'll calculate the arrest rate for each of the weather ratings you previously assigned.
- Finally, you'll add violation type as a second factor in the analysis, to see if that accounts for any differences in the arrest rate.
Since you previously defined a logical order for the weather categories, good < bad < worse, they will be sorted that way in the results.
# Calculate the overall arrest rate
print(ri_weather.is_arrested.mean())
# Calculate the arrest rate for each 'rating'
print(ri_weather.groupby('rating').is_arrested.mean())
# Calculate the arrest rate for each 'violation' and 'rating'
print(ri_weather.groupby(['violation', 'rating']).is_arrested.mean())
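When comparing rates across groups, it also helps to look at the group sizes, since a rate computed from only a handful of stops is noisy; this is a small addition to the exercise.
# Show group sizes alongside the arrest rates (larger groups give more reliable rates)
print(ri_weather.groupby('rating').is_arrested.agg(['mean', 'count']))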
Selecting from a multi-indexed Series
The output of a single .groupby() operation on multiple columns is a Series with a MultiIndex. Working with this type of object is similar to working with a DataFrame:
- The outer index level is like the DataFrame rows.
- The inner index level is like the DataFrame columns.
In this exercise, you'll practice accessing data from a multi-indexed Series using the .loc[] accessor.
# Save the output of the groupby operation from the previous exercise
arrest_rate = ri_weather.groupby(['violation', 'rating']).is_arrested.mean()
# Print the 'arrest_rate' Series
print(arrest_rate)
# Print the arrest rate for moving violations in bad weather
print(arrest_rate.loc['Moving violation', 'bad'])
# Print the arrest rates for speeding violations in all three weather conditions
print(arrest_rate.loc['Speeding'])
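You can also select on the inner index level alone; for example (my addition), .xs() pulls out one weather rating across every violation type.
# Select on the inner level: arrest rates in 'worse' weather for every violation type
print(arrest_rate.xs('worse', level='rating'))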
Reshaping the arrest rate data
In this exercise, you'll start by reshaping the arrest_rate Series into a DataFrame. This is a useful step when working with any multi-indexed Series, since it enables you to access the full range of DataFrame methods.
Then, you'll create the exact same DataFrame using a pivot table. This is a great example of how pandas often gives you more than one way to reach the same result!
print(arrest_rate.unstack())
# Create the same DataFrame using a pivot table
print(ri_weather.pivot_table(index='violation', columns='rating', values='is_arrested'))
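Once the rates are in this rectangular form, the comparison is easy to visualize; here is a minimal sketch using the matplotlib import from the top of the notebook.
# Plot arrest rate by violation, grouped by weather rating
arrest_rate.unstack().plot(kind='bar')
plt.show()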