import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)

Introduction to Seaborn

  • What is seaborn
    • Python data visualization library
    • Easily create the most common types of plots

Making a scatter plot with lists

In this exercise, we'll use a dataset that contains information about 227 countries. This dataset has lots of interesting information on each country, such as the country's birth rates, death rates, and its gross domestic product (GDP). GDP is the value of all the goods and services produced in a year, expressed as dollars per person.

We've created three lists of data from this dataset to get you started. gdp is a list that contains the value of GDP per country, expressed as dollars per person. phones is a list of the number of mobile phones per 1,000 people in that country. Finally, percent_literate is a list that contains the percent of each country's population that can read and write.

df = pd.read_csv('./dataset/countries-of-the-world.csv')
df.head()
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48,0 0,00 23,06 163,07 700.0 36,0 3,2 12,13 0,22 87,65 1 46,6 20,34 0,38 0,24 0,38
1 Albania EASTERN EUROPE 3581655 28748 124,6 1,26 -4,93 21,52 4500.0 86,5 71,2 21,09 4,42 74,49 3 15,11 5,22 0,232 0,188 0,579
2 Algeria NORTHERN AFRICA 32930091 2381740 13,8 0,04 -0,39 31 6000.0 70,0 78,1 3,22 0,25 96,53 1 17,14 4,61 0,101 0,6 0,298
3 American Samoa OCEANIA 57794 199 290,4 58,29 -20,71 9,27 8000.0 97,0 259,5 10 15 75 2 22,46 3,27 NaN NaN NaN
4 Andorra WESTERN EUROPE 71201 468 152,1 0,00 6,6 4,05 19000.0 100,0 497,2 2,22 0 97,78 3 8,71 6,25 NaN NaN NaN
gdp = df['GDP ($ per capita)'].tolist()
phones = df['Phones (per 1000)'].tolist()
percent_literate = df['Literacy (%)'].tolist()
region = df['Region'].tolist()
sns.scatterplot(x=gdp, y=phones);
sns.scatterplot(x=gdp, y=percent_literate);

Making a count plot with a list

In the last exercise, we explored a dataset that contains information about 227 countries. Let's do more exploration of this data - specifically, how many countries are in each region of the world?

To do this, we'll need to use a count plot. Count plots take in a categorical list and return bars that represent the number of list entries per category. You can create one here using a list of regions for each country, which is a variable named region.

sns.countplot(y=region);

Sub-Saharan Africa contains the most countries in this list. We'll revisit count plots later in the course.

Using pandas with Seaborn

  • What is pandas?
    • Python library for data analysis
    • Easily read datasets from csv, txt, and other types of files
    • Datasets take the form of DataFrame objects

"Tidy" vs. "untidy" data

Here, we have a sample dataset from a survey of children about their favorite animals. But can we use this dataset as-is with Seaborn? Let's use Pandas to import the csv file with the data collected from the survey and determine whether it is tidy, which is essential to having it work well with Seaborn.

Making a count plot with a DataFrame

In this exercise, we'll look at the responses to a survey sent out to young people. Our primary question here is: how many young people surveyed report being scared of spiders? Survey participants were asked to agree or disagree with the statement "I am afraid of spiders". Responses vary from 1 to 5, where 1 is "Strongly disagree" and 5 is "Strongly agree".

df = pd.read_csv('./dataset/young-people-survey-responses.csv')

# Create a count plot with "Spiders" on the x-axis
sns.countplot('Spiders', data=df);

Adding a third variable with hue

Hue and scatter plots

In the prior video, we learned how hue allows us to easily make subgroups within Seaborn plots. Let's try it out by exploring data from students in secondary school. We have a lot of information about each student like their age, where they live, their study habits and their extracurricular activities.

For now, we'll look at the relationship between the number of absences they have in school and their final grade in the course, segmented by where the student lives (rural vs. urban area).

student_data = pd.read_csv('./dataset/student-alcohol-consumption.csv', index_col=0)
student_data.head()
school sex age famsize Pstatus Medu Fedu traveltime failures schoolsup ... goout Dalc Walc health absences G1 G2 G3 location study_time
0 GP F 18 GT3 A 4 4 2 0 yes ... 4 1 1 3 6 5 6 6 Urban 2 to 5 hours
1 GP F 17 GT3 T 1 1 1 0 no ... 3 1 1 3 4 5 5 6 Urban 2 to 5 hours
2 GP F 15 LE3 T 1 1 1 3 yes ... 2 2 3 3 10 7 8 10 Urban 2 to 5 hours
3 GP F 15 GT3 T 4 2 1 0 no ... 2 1 1 5 2 15 14 15 Urban 5 to 10 hours
4 GP F 16 GT3 T 3 3 1 0 no ... 2 1 2 5 4 6 10 10 Urban 2 to 5 hours

5 rows × 29 columns

sns.scatterplot(x="absences", y="G3", data=student_data, hue='location');
sns.scatterplot(x="absences", y="G3", 
                data=student_data,
                hue='location',
                hue_order=['Rural', 'Urban']);

Hue and count plots

Let's continue exploring our dataset from students in secondary school by looking at a new variable. The "school" column indicates the initials of which school the student attended - either "GP" or "MS".

In the last exercise, we created a scatter plot where the plot points were colored based on whether the student lived in an urban or rural area. How many students live in urban vs. rural areas, and does this vary based on what school the student attends? Let's make a count plot with subgroups to find out.

palette_colors = {'Rural': "green", 'Urban': "blue"}

# Create a count plot of school with location subgroups
sns.countplot('school', data=student_data, hue='location' ,palette=palette_colors);

Students at GP tend to come from an urban location, but students at MS are more evenly split.