import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)

Count plots and bar plots

Count plots

In this exercise, we'll return to exploring our dataset that contains the responses to a survey sent out to young people. We might suspect that young people spend a lot of time on the internet, but how much do they report using the internet each day? Let's use a count plot to break down the number of survey responses in each category and then explore whether that breakdown changes with age.

survey_data = pd.read_csv('./dataset/young-people-survey-responses.csv', index_col=0)
survey_data.head()

sns.catplot(x='Internet usage', data=survey_data, kind='count', aspect=2);
plt.tight_layout();

sns.catplot(y="Internet usage", data=survey_data, kind="count");

sns.catplot(y="Internet usage", data=survey_data, kind="count", col='Age Category');
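A count plot is a visual tally, so the same numbers can be checked directly in pandas. The sketch below uses a small made-up DataFrame (the values are illustrative and not taken from the survey file) and assumes the same column names as above.

```python
import pandas as pd

# Hypothetical toy responses; the real survey has many more rows.
toy = pd.DataFrame({
    "Internet usage": ["few hours a day", "most of the day", "few hours a day",
                       "less than an hour a day", "few hours a day", "most of the day"],
    "Age Category": ["Less than 21", "21+", "Less than 21",
                     "21+", "21+", "Less than 21"],
})

# Bar lengths of the overall count plot: one tally per category.
overall_counts = toy["Internet usage"].value_counts()

# Bar lengths of the faceted plot: one column of tallies per age category.
counts_by_age = pd.crosstab(toy["Internet usage"], toy["Age Category"])

print(overall_counts)
print(counts_by_age)
```

Comparing a table like this with the plot is a quick way to confirm that the facets show what we expect.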

Bar plots with percentages

Let's continue exploring the responses to a survey sent out to young people. The variable "Interested in Math" is True if the person reported being interested or very interested in mathematics, and False otherwise. What percentage of young people report being interested in math, and does this vary based on gender? Let's use a bar plot to find out.

sns.catplot(x='Gender', y='Interested in Math', data=survey_data, kind='bar');

When the y-variable is True/False, the bar height is the proportion of responses that are True, because the values are averaged and True counts as 1. In this plot, males report a much higher interest in math than females.
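To see why averaging a True/False column gives a share, here is a minimal sketch on made-up data (the rows are hypothetical, only the column names match the survey).

```python
import pandas as pd

# Hypothetical toy data, not taken from the survey file.
toy = pd.DataFrame({
    "Gender": ["female", "female", "female", "female",
               "male", "male", "male", "male"],
    "Interested in Math": [True, False, False, False,
                           True, True, False, False],
})

# True counts as 1 and False as 0, so the mean is the share of True responses.
share_interested = toy.groupby("Gender")["Interested in Math"].mean()
print(share_interested)
```

Multiplying these values by 100 gives the percentage in each group.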

Customizing bar plots

In this exercise, we'll explore data from students in secondary school. The "study_time" variable records each student's reported weekly study time as one of the following categories: "<2 hours", "2 to 5 hours", "5 to 10 hours", or ">10 hours". Do students who report higher amounts of studying tend to get better final grades? Let's compare the average final grade among students in each category using a bar plot.

student_data = pd.read_csv('./dataset/student-alcohol-consumption.csv', index_col=0)
student_data.head()

sns.catplot(x='study_time', y='G3', data=student_data, kind='bar');

sns.catplot(x="study_time", y="G3", data=student_data, kind="bar",
            order=['<2 hours',
                   '2 to 5 hours',
                   '5 to 10 hours',
                   '>10 hours']);

sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",
            order=["<2 hours", 
                   "2 to 5 hours", 
                   "5 to 10 hours", 
                   ">10 hours"],
            ci=None);

Students in our sample who studied more have a slightly higher average grade, but it's not a strong relationship.
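As an alternative to passing the category order on every call, the order can be stored in the column as an ordered categorical. The sketch below uses made-up grades (hypothetical values) and only shows the pandas side. As far as I know seaborn follows the category order of a categorical column, but treat that as an assumption to check against your seaborn version.

```python
import pandas as pd

study_time_order = ["<2 hours", "2 to 5 hours", "5 to 10 hours", ">10 hours"]

# Hypothetical toy grades, not taken from the student file.
toy = pd.DataFrame({
    "study_time": ["5 to 10 hours", "<2 hours", ">10 hours",
                   "2 to 5 hours", "<2 hours", "2 to 5 hours"],
    "G3": [12, 8, 14, 10, 10, 12],
})

# Store the intended order in the column itself.
toy["study_time"] = pd.Categorical(
    toy["study_time"], categories=study_time_order, ordered=True
)

# Group means come back in category order; these are the bar heights.
mean_grade = toy.groupby("study_time", observed=False)["G3"].mean()
print(mean_grade)
```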

Box plots

What is a box plot?
- Shows the distribution of quantitative data
- See median, spread, skewness, and outliers
- Facilitates comparisons between groups
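The quantities a box plot draws can be computed by hand. The following sketch uses made-up grades (hypothetical values) and the common 1.5 × IQR rule for flagging outliers.

```python
import numpy as np

# Hypothetical grades, chosen only to make the arithmetic easy to follow.
grades = np.array([0, 8, 9, 10, 10, 11, 12, 13, 14, 20])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(grades, [25, 50, 75])
iqr = q3 - q1

# Common rule: points more than 1.5 * IQR beyond the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = grades[(grades < lower_fence) | (grades > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```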

Create and interpret a box plot

Let's continue using the student_data dataset. In an earlier exercise, we explored the relationship between studying and final grade by using a bar plot to compare the average final grade ("G3") among students in different categories of "study_time". This time, we'll use a box plot to look at the full distribution of final grades within each category.

study_time_order = ["<2 hours", "2 to 5 hours", 
                    "5 to 10 hours", ">10 hours"]

# Create a box plot and set the order of the categories
sns.catplot(x='study_time', y='G3',
            data=student_data,
            kind='box',
            order=study_time_order);

Omitting outliers

Now let's use the student_data dataset to compare the distribution of final grades ("G3") between students who have internet access at home and those who don't. To do this, we'll use the "internet" variable, which is a binary (yes/no) indicator of whether the student has internet access at home.

Since internet may be less accessible in rural areas, we'll add subgroups based on where the student lives. For this, we can use the "location" variable, which is an indicator of whether a student lives in an urban ("Urban") or rural ("Rural") location.

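# showfliers=False hides the outlier markers; older seaborn versions
# also accepted sym='' for the same effect
sns.catplot(x='internet', y='G3',
            data=student_data,
            kind='box',
            hue='location',
            showfliers=False);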

The median grades are quite similar across the groups, but the spread of the distribution looks larger among students who have internet access.

Adjusting the whiskers

In the lesson we saw that there are multiple ways to define the whiskers in a box plot. In this set of exercises, we'll continue to use the student_data dataset to compare the distribution of final grades ("G3") between students who are in a romantic relationship and those who are not. We'll use the "romantic" variable, which is a yes/no indicator of whether the student is in a romantic relationship.

Let's create a box plot to look at this relationship and try different ways to define the whiskers.

sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=0.5);

sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[5, 95]);

sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[0, 100]);

The median grade is the same between these two groups, but the max grade is higher among students who are not in a romantic relationship.
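The three `whis` settings used above can be read as two different rules: a single number is a multiple of the IQR, while a pair of numbers is a pair of percentiles. The helper below is a hypothetical illustration of those rules on made-up grades, not seaborn's own implementation.

```python
import numpy as np

def whisker_limits(values, whis):
    """Return (low, high) whisker ends for an IQR multiplier or a percentile pair."""
    values = np.asarray(values, dtype=float)
    if np.isscalar(whis):
        q1, q3 = np.percentile(values, [25, 75])
        reach = whis * (q3 - q1)
        # Whiskers stop at the most extreme observations inside the reach.
        low = values[values >= q1 - reach].min()
        high = values[values <= q3 + reach].max()
    else:
        # A pair such as [5, 95] is read as lower and upper percentiles.
        low, high = np.percentile(values, whis)
    return float(low), float(high)

grades = [0, 8, 9, 10, 10, 11, 12, 13, 14, 20]  # hypothetical values
print(whisker_limits(grades, 0.5))
print(whisker_limits(grades, [5, 95]))
print(whisker_limits(grades, [0, 100]))
```

With `[0, 100]` the whiskers reach the minimum and maximum, so no point is left over to be drawn as an outlier.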

Point plots

What are point plots?
- Points show the mean of a quantitative variable
- Vertical lines show 95% confidence intervals

Point plots vs. line plots
- Both show:
  - Mean of quantitative variable
  - 95% confidence intervals for the mean
- Differences:
  - Line plot has quantitative variable (usually time) on x-axis
  - Point plot has categorical variable on x-axis

Point plots vs. bar plots
- Both show:
  - Mean of quantitative variable
  - 95% confidence intervals for the mean
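A 95% confidence interval for a mean can be estimated by bootstrapping, which is the approach seaborn takes by default. The sketch below is a simplified illustration rather than seaborn's code: `bootstrap_ci` and its parameters are hypothetical names, and the absence values are made up.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap interval for the mean (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # Keep the central `level` percent of the resampled means.
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return float(low), float(high)

absences = [0, 2, 2, 4, 4, 6, 8, 10, 14, 20]  # hypothetical values
mean_absences = float(np.mean(absences))
ci_low, ci_high = bootstrap_ci(absences)
print(mean_absences, ci_low, ci_high)
```

Small or highly variable groups give wide intervals, which is why the vertical lines in a point plot differ in length from one category to the next.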

Customizing point plots

Let's continue to look at data from students in secondary school, this time using a point plot to answer the question: does the quality of the student's family relationship influence the number of absences the student has in school? Here, we'll use the "famrel" variable, which describes the quality of a student's family relationship from 1 (very bad) to 5 (very good).

sns.catplot(x='famrel', y='absences', 
            data=student_data,
            kind='point');

sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2);

sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2,
            join=False);

While the average number of absences is slightly smaller among students with higher-quality family relationships, the large confidence intervals tell us that we can't be sure there is an actual association here.

Point plots with subgroups

Let's continue exploring the dataset of students in secondary school. This time, we'll ask the question: is being in a romantic relationship associated with higher or lower school attendance? And does this association differ by which school the students attend? Let's find out using a point plot.

sns.catplot(x='romantic', y='absences',
            data=student_data,
            kind='point',
            hue='school');

sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None);

import numpy as np
# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None,
            estimator=np.median);

It looks like students in romantic relationships have a higher average and median number of absences in the GP school, but this association does not hold for the MS school.
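Switching the estimator matters most when a few large values pull the mean upward. The sketch below uses made-up absences (hypothetical values, with the same column names as student_data) to compare both statistics per group.

```python
import pandas as pd

# Hypothetical toy data; one large value skews the GP "yes" group.
toy = pd.DataFrame({
    "school": ["GP", "GP", "GP", "GP", "GP", "GP", "MS", "MS", "MS", "MS"],
    "romantic": ["no", "no", "no", "yes", "yes", "yes", "no", "no", "yes", "yes"],
    "absences": [0, 2, 4, 2, 4, 30, 2, 4, 2, 4],
})

# One row per (school, romantic) group: the values a point plot would mark
# with estimator=np.mean versus estimator=np.median.
summary = toy.groupby(["school", "romantic"])["absences"].agg(["mean", "median"])
print(summary)
```

In this toy table, the single value of 30 raises the GP "yes" mean well above its median, while the other groups are unaffected.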

	Music	Techno	Movies	History	Mathematics	Pets	Spiders	Loneliness	Parents' advice	Internet usage	Finances	Age	Siblings	Gender	Village - town	Age Category	Interested in Math
0	5.0	1.0	5.0	1.0	3.0	4.0	1.0	3.0	4.0	few hours a day	3.0	20.0	1.0	female	village	Less than 21	False
1	4.0	1.0	5.0	1.0	5.0	5.0	1.0	2.0	2.0	few hours a day	3.0	19.0	2.0	female	city	Less than 21	True
2	5.0	1.0	5.0	1.0	5.0	5.0	1.0	5.0	3.0	few hours a day	2.0	20.0	2.0	female	city	Less than 21	True
3	5.0	2.0	5.0	4.0	4.0	1.0	5.0	5.0	2.0	most of the day	2.0	22.0	1.0	female	city	21+	True
4	5.0	2.0	5.0	3.0	2.0	1.0	1.0	3.0	3.0	few hours a day	4.0	20.0	1.0	female	village	Less than 21	False

	school	sex	age	famsize	Pstatus	Medu	Fedu	traveltime	failures	schoolsup	...	goout	Dalc	Walc	health	absences	G1	G2	G3	location	study_time
0	GP	F	18	GT3	A	4	4	2	0	yes	...	4	1	1	3	6	5	6	6	Urban	2 to 5 hours
1	GP	F	17	GT3	T	1	1	1	0	no	...	3	1	1	3	4	5	5	6	Urban	2 to 5 hours
2	GP	F	15	LE3	T	1	1	1	3	yes	...	2	2	3	3	10	7	8	10	Urban	2 to 5 hours
3	GP	F	15	GT3	T	4	2	1	0	no	...	2	1	1	5	2	15	14	15	Urban	5 to 10 hours
4	GP	F	16	GT3	T	3	3	1	0	no	...	2	1	2	5	4	6	10	10	Urban	2 to 5 hours