Visualization in the data science workflow
Visualization is often taught in isolation, with best practices discussed only in general terms. In reality, you will need to bend the rules for different scenarios, from messy exploratory visualizations to polishing the font sizes of your final product. In this chapter, we dive into how to optimize your visualizations at each step of a data science workflow. This is a summary of the lecture "Improving Your Data Visualizations in Python", via DataCamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
Looking at the farmers market data
Loaded is a new dataset, `markets`. Each row of this DataFrame belongs to an individual farmers market in the continental United States, with various information about the market contained in the columns. In this exercise, explore the columns of the data to get familiar with them for future analysis and plotting.
As a first step, print out the first three lines of `markets` to get an idea of what type of data the columns encode. Then look at the summary descriptions of all of the columns. Since there are so many columns in the DataFrame, you'll want to turn the results 'sideways' by transposing the output to avoid cutting off rows.
markets = pd.read_csv('./dataset/markets_cleaned.csv', index_col=0)
markets.head()
first_rows = markets.head(3).transpose()
print(first_rows)
# Get descriptions of every column
col_descriptions = markets.describe(include='all',
                                    percentiles=[0.5]).transpose()
print(col_descriptions)
It may seem boring, but these preliminary explorations of your data help set up the foundations of a successful data science project. Now that you've investigated the data, you can see that it is very "wide" – with many columns corresponding to the different goods sold. The goods are encoded with 1s and 0s that indicate whether the market sells the good or not.
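As a quick sanity check on that 1s-and-0s encoding, you could confirm a goods column holds only indicator values and sum it to count the markets selling each good. A minimal sketch on a toy frame (the column names here are illustrative stand-ins, not necessarily the dataset's exact ones):

```python
import pandas as pd

# Toy stand-in for the wide markets DataFrame (illustrative columns)
toy = pd.DataFrame({
    'name': ['Market A', 'Market B', 'Market C'],
    'Cheese': [1, 0, 1],
    'Maple': [0, 0, 1],
})

goods_cols = ['Cheese', 'Maple']

# Every goods column should hold only 0/1 indicator values
assert toy[goods_cols].isin([0, 1]).all().all()

# Summing an indicator column counts the markets selling that good
print(toy[goods_cols].sum())
```

The same `.isin([0, 1])` check and column sums would apply unchanged to the real `markets` goods columns.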
Scatter matrix of numeric columns
You've investigated the new farmer's market data, and it's rather wide – with lots of columns of information for each market's row. Rather than painstakingly going through every combination of numeric columns and making a scatter plot to look at correlations, you decide to make a scatter matrix using the pandas built-in function.
Increasing the figure size with the `figsize` argument will help give the dense visualization some breathing room. Since there will be a lot of overlap for the points, decreasing the point opacity will help show the density of these overlaps.
numeric_columns = ['lat', 'lon', 'months_open', 'num_items_sold', 'state_pop']
# Make a scatter matrix of numeric columns
pd.plotting.scatter_matrix(markets[numeric_columns],
                           # Make figure larger to show details
                           figsize=(15, 10),
                           # Lower point opacity to show overlap
                           alpha=0.5);
Scatter matrices can be a lot of information to take in but are super helpful exploration tools. In this plot, we see that, due to many of the variables taking integer values (e.g., days of the week = 1, 2, 3, ...), there is a lot of 'banding', with points clustering in a line along a given axis. Also, you will likely want to log-transform the population values, as the distribution is highly skewed.
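The effect of that log transform on skew is easy to verify numerically before plotting. A minimal sketch with made-up population values (illustrative numbers, not the dataset's):

```python
import numpy as np
import pandas as pd

# Heavily right-skewed toy 'population' values
pop = pd.Series([5e5, 7e5, 9e5, 1.2e6, 3e6, 8e6, 39e6])

log_pop = np.log(pop)

# Skewness shrinks substantially after the log transform
print(pop.skew(), log_pop.skew())
```

Checking `.skew()` before and after transforming is a quick way to decide whether a log axis is worth it for a given column.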
Digging in with basic transforms
You are curious to see if the population of a state correlates to the number of items sold at farmer's markets. To check this, take the log of the population and draw a scatter plot against the number of items sold by a market. From your previous explorations of the dataset, you know there will be a lot of overlap, so to get a better handle on the patterns you want to reduce the marker opacity.
markets['log_pop'] = np.log(markets['state_pop'])
# Draw a scatterplot of log-population to # of items sold
sns.scatterplot(x='log_pop',
                y='num_items_sold',
                # Reduce point opacity to show overlap
                alpha=0.25,
                data=markets);
This plot shows you that, even after transforming the population to remove skew and lowering the opacity, it's hard to see if there's any relationship between the population and the number of items sold.
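When the eye can't resolve a pattern, a correlation coefficient can put a number on it. A minimal sketch on synthetic data mimicking a weak, noisy relationship (the variable names and the generating process are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a faint linear signal buried under heavy noise
log_pop = rng.uniform(13, 18, size=200)
num_items_sold = 10 + 0.5 * log_pop + rng.normal(0, 5, size=200)

# Pearson correlation quantifies what the scatter plot can't show
r = np.corrcoef(log_pop, num_items_sold)[0, 1]
print(round(r, 2))
```

On the real data, `np.corrcoef(markets['log_pop'], markets['num_items_sold'])` would give the same kind of summary.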
Is latitude related to months open?
While exploring the farmers market dataset with a scatter matrix, you noticed a potentially interesting relationship between a market's latitude and the number of months it stays open. Digging into this relationship a bit further, you decide to use Seaborn's regression plot to see if there's any weight to this pattern or if the heavy overlap of the points is playing tricks on your eyes.
To make the regression line stand out, you'll want to lower the overlapping background points opacity and color them a muted gray. Since you're not going to be making any formal inference and want to quickly investigate a pattern, you can turn off the default uncertainty band.
sns.regplot(x='lat',
            y='months_open',
            # Set scatter point opacity & color
            scatter_kws={'alpha': 0.1, 'color': 'gray'},
            # Disable confidence band
            ci=None,
            data=markets);
Here you see that, underneath all the overlapping points, there may be a negative relationship between a market's latitude and the number of months it's open. While you would never take these results and call the relationship true, you now have a path to dig into further to see if the signal is real or simply noise.
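One quick way to follow that path is to fit a least-squares line and check the sign of its slope. A minimal sketch on synthetic data built to resemble the pattern (the generating process is an assumption, not the dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in: higher latitudes tend to have shorter seasons
lat = rng.uniform(25, 49, size=300)
months_open = np.clip(12 - 0.15 * lat + rng.normal(0, 2, size=300), 1, 12)

# The least-squares slope summarizes the direction of the trend
slope, intercept = np.polyfit(lat, months_open, 1)
print(round(slope, 3))
```

A negative slope on the real `lat` and `months_open` columns would back up what the regression plot suggests, though a formal test would still be needed before calling it a real effect.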
What state is the most market-friendly?
While exploring the farmer's market data, you wonder what patterns may show up if you aggregated to the state level. Are some states more market-friendly than others? To investigate this, you group your data by state and get the log-transformed number of markets (`log_markets`) and state populations (`log_pop`).
markets_and_pop = (markets
                   .groupby('state', as_index=False)
                   .agg({
                       'name': lambda d: np.log(len(d)),
                       'state_pop': lambda d: np.log(d.iloc[0])})
                   .rename(columns={
                       'name': 'log_markets',
                       'state_pop': 'log_pop'}))
To visualize, you decide to use a regression plot to get an idea of the 'normal' relationship between market and population numbers and a text-scatter to quickly identify interesting outliers.
fig, ax = plt.subplots(figsize=(15, 10))
g = sns.regplot(x='log_markets',
                y='log_pop',
                # Disable confidence band
                ci=None,
                # Shrink scatter plot points
                scatter_kws={'s': 2},
                data=markets_and_pop,
                ax=ax)
# Iterate over the rows of the data
for _, row in markets_and_pop.iterrows():
    state, log_markets, log_pop = row
    # Place annotation and reduce size for clarity
    g.annotate(state, (log_markets, log_pop), size=10)
The plot you've just made demonstrates how regression plots are not only great for revealing correlations in your data; they can also help you figure out what's out of the ordinary.
Here you see that Vermont and Texas are the largest outliers, with Vermont falling the furthest below the best-fit line and Texas the furthest above, giving you an idea of what to investigate next.
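Rather than eyeballing the distances to the line, you can rank outliers by residual. A minimal sketch on a toy state-level frame (the numbers are illustrative, not the real aggregates):

```python
import numpy as np
import pandas as pd

# Toy state-level frame (values are illustrative, not the real data)
df = pd.DataFrame({
    'state': ['Vermont', 'Texas', 'Ohio', 'Iowa'],
    'log_markets': [4.6, 4.8, 5.0, 4.5],
    'log_pop': [13.3, 17.1, 16.3, 14.9],
})

# Fit log_pop as a function of log_markets, matching the regplot axes
slope, intercept = np.polyfit(df['log_markets'], df['log_pop'], 1)

# Residuals: how far each state sits from the best-fit line
df['residual'] = df['log_pop'] - (slope * df['log_markets'] + intercept)

# The largest |residual| flags the biggest outlier
print(df.loc[df['residual'].abs().idxmax(), 'state'])
```

Applied to `markets_and_pop`, sorting by `residual.abs()` would give a ranked list of states to investigate instead of a visual guess.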
Popularity of goods sold by state
The farmer's market dataset contains columns corresponding to 28 different goods and whether or not they are sold at that market. You're curious to see if there are any interesting stories in this dataset regarding how likely you are to find a given good at a state's markets. To answer this question, you collapse the data into three columns:
- `state` - the name of the state
- `good` - the good of interest
- `prop_selling` - the proportion of markets in that state that sell that good
To quickly determine if patterns emerge, you choose a subset of goods you find interesting and decide to make a simple text-scatter: the good on the x-axis and the proportion of a state's markets that sell that good on the y-axis.
goods_by_state = pd.read_csv('./dataset/goods_by_state.csv', index_col=0)
fig, ax = plt.subplots(figsize=(10, 10))
# Subset goods to interesting ones
to_plot = ['Cheese', 'Maple', 'Fruits', 'Grains', 'Seafood', 'Plants', 'Vegetables']
goods_by_state_small = goods_by_state.query('good in ' + str(to_plot))
g = sns.scatterplot(x='good', y='prop_selling', data=goods_by_state_small,
                    # Hide scatter points by shrinking to nothing
                    s=0, ax=ax)
for _, row in goods_by_state_small.iterrows():
    g.annotate(row['state'], (row['good'], row['prop_selling']),
               # Center annotation on axis
               ha='center',
               size=10)
You are making some advanced plots now. This plot may be very messy, but it is also very interesting. You can see specific states rise above the rest, such as Arizona with grains, while other states like New Mexico consistently lag behind in the availability of goods. This plot provides you with a lot of potential future avenues of exploration, some of which you'll pursue in the coming exercises.
Stacking to find trends
In the farmers market dataset, you are interested in the number of months that a market stays open in relation to its geography, more specifically its longitude. You're curious to see if there are any regions of the country that behave noticeably differently from the others.
To do this, you create a basic map with a scatter plot of the latitude and longitude of each market, coloring each market by the number of months it's open. Further digging into the longitude relationship, you draw a regression plot of longitude against the number of months open, with a flexible fit line to determine if any trends appear. You want to view these simultaneously to get the clearest picture of the trends.
_, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))
# Draw location scatter plot on first plot
sns.scatterplot(x='lon', y='lat', hue='months_open',
                palette=sns.light_palette('orangered', n_colors=12),
                legend=False, data=markets,
                ax=ax1);
# Plot a regression plot on second plot
sns.regplot(x='lon', y='months_open',
            scatter_kws={'alpha': 0.2, 'color': 'gray', 'marker': '|'},
            lowess=True,
            marker='|', data=markets,
            ax=ax2);