Visualizing a Categorical and a Quantitative Variable
Categorical variables are present in nearly every dataset, but they are especially prominent in survey data. In this chapter, you will learn how to create and customize categorical plots such as box plots, bar plots, count plots, and point plots. Along the way, you will explore survey data from young people about their interests, students about their study habits, and adult men about their feelings about masculinity. This is a summary of the lecture "Introduction to Data Visualization with Seaborn" from DataCamp.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
Count plots
In this exercise, we'll return to exploring our dataset that contains the responses to a survey sent out to young people. We might suspect that young people spend a lot of time on the internet, but how much do they report using the internet each day? Let's use a count plot to break down the number of survey responses in each category and then explore whether it changes based on age.
survey_data = pd.read_csv('./dataset/young-people-survey-responses.csv', index_col=0)
survey_data.head()
sns.catplot(x='Internet usage', data=survey_data, kind='count', aspect=2);
plt.tight_layout();
sns.catplot(y="Internet usage", data=survey_data, kind="count");
sns.catplot(y="Internet usage", data=survey_data, kind="count", col='Age Category');
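A count plot is a picture of a frequency table: one bar per category, with height equal to the number of rows in that category, and `col` simply repeats the tally within each subgroup. The plain-Python sketch below shows that tally on a handful of made-up responses (the rows and category labels are assumptions for illustration, not values read from the survey file). On the real DataFrame, `survey_data['Internet usage'].value_counts()` gives the same kind of table.

```python
from collections import Counter

# Hypothetical (age category, internet usage) pairs, not real survey rows.
responses = [
    ("Less than 21", "few hours a day"),
    ("Less than 21", "most of the day"),
    ("Less than 21", "few hours a day"),
    ("21+", "few hours a day"),
    ("21+", "less than an hour a day"),
    ("21+", "few hours a day"),
]

# Bar heights for a single count plot: rows per usage category.
overall = Counter(usage for _, usage in responses)

# Bar heights when the plot is split into one panel per age category.
by_age = Counter(responses)

for usage, n in overall.most_common():
    print(f"{usage}: {n}")
for (age, usage), n in sorted(by_age.items()):
    print(f"{age} | {usage}: {n}")
```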
Bar plots with percentages
Let's continue exploring the responses to a survey sent out to young people. The variable "Interested in Math" is True if the person reported being interested or very interested in mathematics, and False otherwise. What percentage of young people report being interested in math, and does this vary based on gender? Let's use a bar plot to find out.
sns.catplot(x='Gender', y='Interested in Math', data=survey_data, kind='bar');
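When the y-variable is True/False, the bar height is the mean of those values, which equals the proportion of responses reporting True (a value between 0 and 1; multiply by 100 to read it as a percentage). This plot shows us that males report a much higher interest in math compared to females.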
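The reason a bar plot of a True/False column reads as a share is that True counts as 1 and False as 0, so the group mean is the fraction of True values. Here is a minimal sketch of that arithmetic on made-up records (the data and the helper name `proportion_true` are hypothetical, chosen only for illustration).

```python
# Hypothetical (gender, interested_in_math) records, not real survey rows.
records = [
    ("male", True), ("male", False), ("male", True), ("male", True),
    ("female", False), ("female", True), ("female", False), ("female", False),
]


def proportion_true(rows, group):
    """Mean of the True/False flags for one group (True counts as 1)."""
    flags = [flag for g, flag in rows if g == group]
    return sum(flags) / len(flags)


male_share = proportion_true(records, "male")
female_share = proportion_true(records, "female")

print(f"male: {male_share:.0%}, female: {female_share:.0%}")
```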
Customizing bar plots
In this exercise, we'll explore data from students in secondary school. The "study_time" variable records each student's reported weekly study time as one of the following categories: "<2 hours", "2 to 5 hours", "5 to 10 hours", or ">10 hours". Do students who report higher amounts of studying tend to get better final grades? Let's compare the average final grade among students in each category using a bar plot.
student_data = pd.read_csv('./dataset/student-alcohol-consumption.csv', index_col=0)
student_data.head()
sns.catplot(x='study_time', y='G3', data=student_data, kind='bar');
sns.catplot(x="study_time", y="G3", data=student_data, kind="bar",
order = [
'<2 hours',
'2 to 5 hours',
'5 to 10 hours',
'>10 hours'
]);
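# Turn off the confidence intervals.
# Note: seaborn 0.12 and later deprecate ci in favor of errorbar=None.
sns.catplot(x="study_time", y="G3",
data=student_data,
kind="bar",
order=["<2 hours",
"2 to 5 hours",
"5 to 10 hours",
">10 hours"],
ci=None);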
Students in our sample who studied more have a slightly higher average grade, but it's not a strong relationship.
study_time_order = ["<2 hours", "2 to 5 hours",
"5 to 10 hours", ">10 hours"]
# Create a box plot and set the order of the categories
sns.catplot(x='study_time', y='G3',
data=student_data,
kind='box',
order=study_time_order);
Omitting outliers
Now let's use the student_data dataset to compare the distribution of final grades ("G3") between students who have internet access at home and those who don't. To do this, we'll use the "internet" variable, which is a binary (yes/no) indicator of whether the student has internet access at home.
Since internet may be less accessible in rural areas, we'll add subgroups based on where the student lives. For this, we can use the "location" variable, which is an indicator of whether a student lives in an urban ("Urban") or rural ("Rural") location.
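# An empty marker string (sym='') hides the outlier points.
# Newer seaborn releases may not accept sym; showfliers=False is an alternative.
sns.catplot(x='internet', y='G3',
data=student_data,
kind='box',
hue='location',
sym='');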
The median grades are quite similar between each group, but the spread of the distribution looks larger among students who have internet access.
Adjusting the whiskers
In the lesson we saw that there are multiple ways to define the whiskers in a box plot. In this set of exercises, we'll continue to use the student_data dataset to compare the distribution of final grades ("G3") between students who are in a romantic relationship and those who are not. We'll use the "romantic" variable, which is a yes/no indicator of whether the student is in a romantic relationship.
Let's create a box plot to look at this relationship and try different ways to define the whiskers.
sns.catplot(x="romantic", y="G3",
data=student_data,
kind="box",
whis=0.5);
sns.catplot(x="romantic", y="G3",
data=student_data,
kind="box",
whis=[5, 95]);
sns.catplot(x="romantic", y="G3",
data=student_data,
kind="box",
whis=[0, 100]);
The median grade is the same between these two groups, but the max grade is higher among students who are not in a romantic relationship.
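To make the three `whis` settings above concrete, here is a rough plain-Python sketch of how the whisker ends can be derived: a single number is a multiple of the IQR beyond the quartiles, and a pair is a pair of percentiles. It assumes, as matplotlib does to my understanding, that each whisker stops at the most extreme observation inside those limits. The grades and all function names are hypothetical, and edge cases (for example, a limit that falls inside the box) are ignored.

```python
from statistics import quantiles

# Hypothetical final grades, not taken from student_data.
grades = [0, 5, 8, 9, 10, 10, 11, 12, 13, 14, 15, 19]


def clamp_to_data(data, low_limit, high_limit):
    """Whisker ends: the most extreme observations inside the limits."""
    inside = [x for x in data if low_limit <= x <= high_limit]
    return min(inside), max(inside)


def iqr_whiskers(data, whis=1.5):
    """Limits at whis * IQR beyond the quartiles (like whis=0.5)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return clamp_to_data(data, q1 - whis * iqr, q3 + whis * iqr)


def percentile_whiskers(data, low, high):
    """Limits at whole-number percentiles (like whis=[5, 95])."""
    cuts = quantiles(data, n=100, method="inclusive")  # 1st..99th percentile

    def percentile(p):
        if p <= 0:
            return min(data)
        if p >= 100:
            return max(data)
        return cuts[p - 1]

    return clamp_to_data(data, percentile(low), percentile(high))


print(iqr_whiskers(grades, whis=0.5))
print(percentile_whiskers(grades, 5, 95))
print(percentile_whiskers(grades, 0, 100))
```

On this toy data the whiskers widen from (8, 15) with `whis=0.5` to (5, 15) with `[5, 95]` and to the full range (0, 19) with `[0, 100]`, so the last setting never leaves any point drawn as an outlier.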
Point plots
- What are point plots?
  - Points show mean of quantitative variable
  - Vertical lines show 95% confidence intervals
- Point plots vs. line plots
  - Both show:
    - Mean of quantitative variable
    - 95% confidence intervals for the mean
  - Differences:
    - Line plot has quantitative variable (usually time) on x-axis
    - Point plot has categorical variable on x-axis
- Point plots vs. bar plots
  - Both show:
    - Mean of quantitative variable
    - 95% confidence intervals for the mean
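Each point and its vertical line summarize one category: the point is the mean, and the line is a 95% confidence interval for that mean. Seaborn estimates the interval by bootstrapping by default. The sketch below is a simplified percentile bootstrap on made-up absence counts, not seaborn's actual implementation; the data, the function name `bootstrap_ci`, and its parameters are assumptions for illustration.

```python
import random
from statistics import mean

# Hypothetical absence counts for one category, not real student_data rows.
absences = [0, 2, 4, 0, 6, 10, 3, 1, 8, 2, 0, 5]


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean of values."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_boot)
    )
    # Cut off an equal share of resampled means in each tail.
    tail = (100 - level) / 200
    low_index = round(tail * n_boot)
    high_index = round((1 - tail) * n_boot) - 1
    return boot_means[low_index], boot_means[high_index]


point = mean(absences)
low, high = bootstrap_ci(absences)
print(f"mean={point:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only a dozen observations the interval is wide, which is the same effect that makes the intervals in a point plot long for small or highly variable groups.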
Customizing point plots
Let's continue to look at data from students in secondary school, this time using a point plot to answer the question: does the quality of the student's family relationship influence the number of absences the student has in school? Here, we'll use the "famrel" variable, which describes the quality of a student's family relationship from 1 (very bad) to 5 (very good).
sns.catplot(x='famrel', y='absences',
data=student_data,
kind='point');
sns.catplot(x="famrel", y="absences",
data=student_data,
kind="point",
capsize=0.2);
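# Remove the lines joining the points.
# Note: seaborn 0.13 deprecates join in favor of linestyle="none".
sns.catplot(x="famrel", y="absences",
data=student_data,
kind="point",
capsize=0.2,
join=False);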
While the average number of absences is slightly smaller among students with higher-quality family relationships, the large confidence intervals tell us that we can't be sure there is an actual association here.
Point plots with subgroups
Let's continue exploring the dataset of students in secondary school. This time, we'll ask the question: is being in a romantic relationship associated with higher or lower school attendance? And does this association differ by which school the students attend? Let's find out using a point plot.
sns.catplot(x='romantic', y='absences',
data=student_data,
kind='point',
hue='school');
# Turn off the confidence intervals (errorbar=None in seaborn 0.12 and later)
sns.catplot(x="romantic", y="absences",
data=student_data,
kind="point",
hue="school",
ci=None);
import numpy as np
# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
data=student_data,
kind="point",
hue="school",
ci=None,
estimator=np.median);
It looks like students in romantic relationships have a higher average and median number of absences in the GP school, but this association does not hold for the MS school.
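Switching the estimator from the mean to the median matters most when a few students have very many absences, because the mean is pulled toward those values while the median is not. A small sketch with made-up rows shows the two summaries side by side for each school and relationship status (the numbers are invented for illustration and do not reproduce the plot above).

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (school, romantic, absences) rows, not real student_data rows.
rows = [
    ("GP", "yes", 2), ("GP", "yes", 4), ("GP", "yes", 30),
    ("GP", "no", 0), ("GP", "no", 2), ("GP", "no", 4),
    ("MS", "yes", 1), ("MS", "yes", 3), ("MS", "yes", 5),
    ("MS", "no", 2), ("MS", "no", 3), ("MS", "no", 7),
]

# Collect the absences of each (school, romantic) subgroup.
groups = defaultdict(list)
for school, romantic, absences in rows:
    groups[(school, romantic)].append(absences)

# One (mean, median) pair per subgroup: the two values a point plot
# would draw with the default estimator and with a median estimator.
summary = {key: (mean(vals), median(vals)) for key, vals in groups.items()}

for (school, romantic), (avg, mid) in sorted(summary.items()):
    print(f"{school} romantic={romantic}: mean={avg:.1f}, median={mid:.1f}")
```

In the invented GP "yes" group, a single value of 30 lifts the mean to 12 while the median stays at 4.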