Visualizing Two Quantitative Variables
In this chapter, you will create and customize plots that visualize the relationship between two quantitative variables. To do this, you will use scatter plots and line plots to explore how the level of air pollution in a city changes over the course of a day and how horsepower relates to fuel efficiency in cars. You will also see another big advantage of using Seaborn: the ability to easily create subplots in a single figure. This is a summary of the lecture "Introduction to Data Visualization with Seaborn" from DataCamp.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
Creating subplots with col and row
We've seen in prior exercises that students with more absences ("absences") tend to have lower final grades ("G3"). Does this relationship hold regardless of how much time students study each week?
To answer this, we'll look at the relationship between the number of absences that a student has in school and their final grade in the course, creating separate subplots based on each student's weekly study time ("study_time").
student_data = pd.read_csv('./dataset/student-alcohol-consumption.csv', index_col=0)
student_data.head()
sns.relplot(x="absences", y="G3", data=student_data, kind='scatter');
sns.relplot(x="absences", y="G3",
data=student_data,
kind="scatter",
col='study_time');
sns.relplot(x="absences", y="G3",
data=student_data,
kind="scatter",
row="study_time");
Because these subplots had a large range of x values, it's easier to read them arranged in rows instead of columns.
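When a facet variable has several levels, `col_wrap` is a middle ground between one wide row and one tall column. The sketch below is only an illustration: the `toy` DataFrame and its study-time labels are made up, not read from the course dataset, and the Agg backend is selected so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for student_data (illustrative values only)
levels = ["<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours"]
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "absences": rng.integers(0, 40, size=n),
    "G3": rng.integers(0, 21, size=n),
    "study_time": rng.choice(levels, size=n),
})

# Facet by study_time, but start a new row after every 2 panels
g = sns.relplot(x="absences", y="G3", data=toy, kind="scatter",
                col="study_time", col_order=levels, col_wrap=2)

n_panels = g.axes.size  # with col_wrap, g.axes is a flat array of Axes
print(n_panels)
```

Each panel keeps a readable width while the figure stays roughly square, which can help when the x range is large.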
Creating two-factor subplots
Let's continue looking at the student_data dataset of students in secondary school. Here, we want to answer the following question: does a student's first semester grade ("G1") tend to correlate with their final grade ("G3")?
There are many aspects of a student's life that could result in a higher or lower final grade in the class. For example, some students receive extra educational support from their school ("schoolsup") or from their family ("famsup"), which could result in higher grades. Let's try to control for these two factors by creating subplots based on whether the student received extra educational support from their school or family.
sns.relplot(x='G1', y='G3', data=student_data, kind='scatter');
sns.relplot(x="G1", y="G3", data=student_data, kind="scatter", col='schoolsup', col_order=['yes', 'no']);
sns.relplot(x="G1", y="G3",
data=student_data,
kind="scatter",
col="schoolsup",
row='famsup',
col_order=["yes", "no"],
row_order=['yes', 'no']);
It looks like the first semester grade does correlate with the final grade, regardless of what kind of support the student received.
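A visual impression like this can be backed up with a number per panel. The following is a minimal sketch that computes the Pearson correlation between "G1" and "G3" within each combination of "schoolsup" and "famsup". It uses a made-up `toy` DataFrame, so the printed values say nothing about the real students; with the course data you would pass `student_data` instead.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for student_data (illustrative values only)
rng = np.random.default_rng(1)
n = 400
g1 = rng.integers(3, 20, size=n)
toy = pd.DataFrame({
    "G1": g1,
    # Final grade = first-semester grade plus noise, kept on the 0-20 scale
    "G3": np.clip(g1 + rng.normal(0, 2, size=n), 0, 20),
    "schoolsup": rng.choice(["yes", "no"], size=n),
    "famsup": rng.choice(["yes", "no"], size=n),
})

# One correlation per (schoolsup, famsup) subgroup, matching the 2x2 grid of subplots
corr_by_group = {
    key: grp["G1"].corr(grp["G3"])
    for key, grp in toy.groupby(["schoolsup", "famsup"])
}

for key, r in corr_by_group.items():
    print(key, round(r, 2))
```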
Changing the size of scatter plot points
In this exercise, we'll explore Seaborn's mpg dataset, which contains one row per car model and includes information such as the year the car was made, the number of miles per gallon ("M.P.G.") it achieves, the power of its engine (measured in "horsepower"), and its country of origin.
What is the relationship between the power of a car's engine ("horsepower") and its fuel efficiency ("mpg")? And how does this relationship vary by the number of cylinders ("cylinders") the car has? Let's find out.
Let's continue to use relplot() instead of scatterplot() since it offers more flexibility.
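The extra flexibility comes from what each function returns: scatterplot() draws onto a single matplotlib Axes, while relplot() builds a FacetGrid that can hold several subplots. A small sketch with made-up car data (the `toy` values are illustrative, not taken from the mpg dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the mpg dataset (illustrative values only)
toy = pd.DataFrame({
    "horsepower": [95, 110, 150, 165, 88, 200],
    "mpg": [27.0, 24.0, 16.0, 15.0, 30.0, 12.0],
    "cylinders": [4, 4, 8, 8, 4, 8],
})

# Axes-level function: returns one matplotlib Axes
ax = sns.scatterplot(x="horsepower", y="mpg", data=toy)

# Figure-level function: returns a FacetGrid, here with one column per cylinder count
g = sns.relplot(x="horsepower", y="mpg", data=toy, kind="scatter", col="cylinders")

print(type(ax).__name__, type(g).__name__, g.axes.shape)
```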
mpg = pd.read_csv('./dataset/mpg.csv')
mpg.head()
sns.relplot(x='horsepower', y='mpg', data=mpg, size='cylinders', kind='scatter');
sns.relplot(x="horsepower", y="mpg",
data=mpg, kind="scatter",
size="cylinders",
hue='cylinders');
Cars with higher horsepower tend to get a lower number of miles per gallon. They also tend to have a higher number of cylinders.
Changing the style of scatter plot points
Let's continue exploring Seaborn's mpg dataset by looking at the relationship between how fast a car can accelerate ("acceleration") and its fuel efficiency ("mpg"). Do these properties vary by country of origin ("origin")?
Note that the "acceleration" variable is the time to accelerate from 0 to 60 miles per hour, in seconds. Higher values indicate slower acceleration.
sns.relplot(x='acceleration', y='mpg', data=mpg, kind='scatter', style='origin', hue='origin');
Cars from the USA tend to accelerate more quickly and get lower miles per gallon compared to cars from Europe and Japan.
Introduction to line plots
- What are line plots?
  Two types of relational plots: scatter plots and line plots
  - Scatter plots: each plot point is an independent observation
  - Line plots: each plot point represents the same "thing", typically tracked over time
- Multiple observations per x-value
  The line shows the mean, and the shaded region is the confidence interval
  - Assumes the dataset is a random sample
  - We are 95% confident that the mean is within this interval
  - Indicates uncertainty in our estimate
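To make the confidence interval less of a black box, here is a rough sketch of a bootstrap interval for the mean at a single x-value, which is the general idea behind the band Seaborn draws by default. The `sample` values are made up, and this is a hand-rolled illustration rather than Seaborn's own code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up observations that share one x-value (e.g. mpg of 30 cars from one year)
sample = rng.normal(loc=23.0, scale=6.0, size=30)

# Resample with replacement many times and record the mean of each resample
n_boot = 1000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrap means gives the interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

More observations at an x-value generally give a narrower interval, which is why the band is wider where data are sparse.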
Interpreting line plots
In this exercise, we'll continue to explore Seaborn's mpg dataset, which contains one row per car model and includes information such as the year the car was made, its fuel efficiency (measured in "miles per gallon" or "M.P.G"), and its country of origin (USA, Europe, or Japan).
How has the average miles per gallon achieved by these cars changed over time? Let's use line plots to find out!
sns.relplot(x='model_year', y='mpg', data=mpg, kind='line');
sns.relplot(x="model_year", y="mpg", data=mpg, kind="line", ci='sd');
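Unlike the previous plot, whose shaded band is a confidence interval for the mean, this plot's band spans one standard deviation on either side of the mean, so it shows us the spread of miles per gallon across all the cars in each year.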
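The standard-deviation band can be checked by hand with a groupby. The sketch below uses a tiny made-up `toy` table and pandas' default sample standard deviation; treat it as an approximation of what the plot shows rather than Seaborn's exact computation. Note also that in Seaborn 0.12 and later the `ci` parameter is deprecated in favor of `errorbar`, so `errorbar="sd"` is the newer spelling of `ci='sd'`.

```python
import pandas as pd

# Made-up stand-in for the mpg dataset (illustrative values only)
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71],
    "mpg": [18.0, 15.0, 21.0, 25.0, 19.0, 22.0],
})

# Mean and standard deviation of mpg for each model year
summary = toy.groupby("model_year")["mpg"].agg(["mean", "std"])

# Edges of a band that spans one standard deviation around the mean
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]
print(summary)
```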
sns.relplot(x='model_year', y='horsepower', data=mpg, kind='line', ci=None);
sns.relplot(x="model_year", y="horsepower", data=mpg, kind="line", style='origin', hue='origin',
ci=None);
sns.relplot(x="model_year", y="horsepower",
data=mpg, kind="line", ci=None, style="origin", hue="origin",
markers=True, dashes=False);
Now that we've added subgroups, we can see that this downward trend in horsepower was more pronounced among cars from the USA.