Correlation and Experimental Design
In this chapter, you'll learn how to quantify the strength of a linear relationship between two variables, and explore how confounding variables can affect the relationship between two other variables. You'll also see how a study’s design can influence its results, change how the data should be analyzed, and potentially affect the reliability of your conclusions. This is a summary of the lecture "Introduction to Statistics in Python" on DataCamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Correlation
 Correlation coefficient
 Quantifies the linear relationship between two variables
 Number between -1 and 1
 Magnitude corresponds to strength of relationship
 Sign (+ or -) corresponds to direction of relationship

Pearson product-moment correlation ($r$)
 Most common
 $\bar{x}$ = mean of $x$, $\bar{y}$ = mean of $y$
 $\sigma_x$ = sample standard deviation of $x$, $\sigma_y$ = sample standard deviation of $y$
$$ r = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \times \sigma_y} $$
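The Pearson formula can be checked by hand. The following is a minimal sketch, assuming sample standard deviations (`ddof=1`), so the sum of products is divided by $n - 1$. The helper name `pearson_r` and the data values are made up for illustration; they do not come from the lecture or from world_happiness.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation using sample standard deviations (ddof=1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Sum of the products of deviations from each mean
    cross = np.sum((x - x.mean()) * (y - y.mean()))
    # Normalize by (n - 1) and by the two sample standard deviations
    return cross / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

# Made-up illustrative values (assumption, not real data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # NumPy's built-in Pearson correlation
print(r_manual, r_numpy)
```

The two printed values should agree up to floating-point error, which is a quick way to confirm that the hand-written version matches the library result.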

Variations
 Kendall's tau
 Spearman's rho
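To give a feel for how a rank-based variant differs from Pearson's $r$, here is a small sketch on synthetic data. It relies on the fact that Spearman's rho is the Pearson correlation of the ranks, so only `rank()` and the default `corr()` are needed. The data are invented for illustration. pandas also accepts `method='spearman'` and `method='kendall'` in `.corr()`, which may need SciPy installed.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but non-linear relationship (assumption)
x = pd.Series(np.arange(1, 21, dtype=float))
y = np.exp(x / 4)

pearson = x.corr(y)                 # pandas default: method='pearson'
spearman = x.rank().corr(y.rank())  # Pearson correlation of the ranks

print(round(pearson, 3), round(spearman, 3))
```

Because the relationship is monotonic, the rank-based value reaches 1, while Pearson's $r$ stays lower since the points do not lie on a straight line.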
Relationships between variables
In this chapter, you'll be working with a dataset, world_happiness, containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.
In this exercise, you'll examine the relationship between a country's life expectancy (life_exp) and happiness score (happiness_score) both visually and quantitatively.
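# Load the dataset; the first CSV column is used as the index
world_happiness = pd.read_csv('./dataset/world_happiness.csv', index_col=0)
world_happiness.head()
# Scatterplot of happiness_score vs. life_exp
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness);
# Same scatterplot with a linear trendline (ci=None hides the confidence band)
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None);
# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])
print(cor)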
What can't correlation measure?
While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it's far from perfect. In this exercise, you'll explore one of the caveats of the correlation coefficient by examining the relationship between a country's GDP per capita (gdp_per_cap) and happiness score.
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness);
# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])
print(cor)
# Scatterplot of gdp_per_cap and happiness_score
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness);
# Calculate correlation
cor = world_happiness['gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)
# Create log_gdp_per_cap column (natural log of gdp_per_cap)
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])
# Scatterplot of log_gdp_per_cap and happiness_score
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness);
# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)
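The effect of the log transformation can be reproduced on synthetic data. The sketch below assumes a response that grows with the logarithm of the explanatory variable plus a little noise; the numbers are invented and unrelated to world_happiness. It compares the correlation before and after taking the log.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y grows with log(x), plus a little noise
x = np.linspace(1, 1000, 200)
y = np.log(x) + rng.normal(scale=0.1, size=x.size)

r_raw = np.corrcoef(x, y)[0, 1]          # straight line fitted to a curved relationship
r_log = np.corrcoef(np.log(x), y)[0, 1]  # relationship is linear after the transform

print(round(r_raw, 3), round(r_log, 3))
```

The correlation on the raw scale understates how tightly the two variables are related, because the coefficient only captures the linear part of the relationship.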
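# Load the version of the dataset that includes grams_sugar_per_day
world_happiness = pd.read_csv('./dataset/world_happiness_add_sugar.csv', index_col=0)
world_happiness
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness);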
# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)
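A correlation like the one computed above does not by itself show that one variable causes the other, since a confounding variable may drive both. The following sketch uses purely synthetic data in which a hypothetical hidden variable `z` drives both `x` and `y`, while `x` has no direct effect on `y`. The helper `residuals` is an illustrative name, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical confounder that drives both observed variables
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)  # x depends on z only
y = 1.5 * z + rng.normal(size=n)  # y depends on z only, not on x

r_xy = np.corrcoef(x, y)[0, 1]

def residuals(v, z):
    """Part of v that a straight-line fit on z does not explain."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Partial correlation: correlate what is left after removing z from both
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(round(r_xy, 3), round(r_partial, 3))
```

The raw correlation is clearly positive, yet it shrinks to roughly zero once the confounder is accounted for, which is the pattern to keep in mind when interpreting the sugar and happiness result.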
Design of experiments
 Vocabulary
 Experiment aims to answer: What is the effect of the treatment on the response?
 Treatment: explanatory / independent variable
 Response: response / dependent variable
 E.g.: What is the effect of an advertisement on the number of products purchased?
 Treatment: advertisement
 Response: number of products purchased
 Controlled experiments
 Participants are assigned by researchers to either treatment group or control group
 Treatment group sees advertisement
 Control group does not
 Groups should be comparable so that causation can be inferred
 If groups are not comparable, this could lead to confounding (bias)
 Gold standard of experiments
 Randomized controlled trial
 Participants are assigned to treatment/control randomly, not based on any other characteristics
 Choosing randomly helps ensure that groups are comparable
 Placebo
 Resembles treatment, but has no effect
 Participants will not know which group they're in
 Double-blind trial
 Person administering the treatment/running the study doesn't know whether the treatment is real or a placebo
 Prevents bias in the response and/or analysis of results
 Fewer opportunities for bias = more reliable conclusions about causation
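Random assignment to treatment and control can be sketched in a few lines. This example uses only the standard library; the participant IDs are hypothetical and the fixed seed is just there to make the sketch reproducible.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical participant IDs
participants = [f"p{i:02d}" for i in range(20)]

# Shuffle a copy, then split it in half: assignment ignores every characteristic
shuffled = participants[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment = shuffled[:half]  # e.g. sees the advertisement
control = shuffled[half:]    # does not

print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because the split depends only on the shuffle, any participant characteristic should be balanced across the two groups on average, which is what makes the groups comparable.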
 Observational studies
 Participants are not assigned randomly to groups
 Participants assign themselves, usually based on pre-existing characteristics
 Many research questions are not conducive to a controlled experiment
 Cannot force someone to smoke or have a disease
 Establish association, not causation
 Effects can be confounded by factors that got certain people into the control or treatment group
 There are ways to control for confounders to get more reliable conclusions about association
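One simple way to control for a confounder is to compare groups within each level of it (stratification). The sketch below simulates an observational study under stated assumptions: a hypothetical binary characteristic makes people more likely to select themselves into the treatment group and also raises the response, while the treatment itself has no effect. All column names and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20000

# Hypothetical pre-existing characteristic (0 or 1) that influences both
# who ends up in the treatment group and the response itself
confounder = rng.integers(0, 2, size=n)
p_treated = np.where(confounder == 1, 0.8, 0.2)  # self-selection
group = np.where(rng.random(n) < p_treated, "treated", "untreated")
# The treatment has NO effect here: the response depends on the confounder only
response = 5.0 * confounder + rng.normal(size=n)

df = pd.DataFrame({"confounder": confounder, "group": group, "response": response})

# Naive comparison mixes the treatment with the confounder
means = df.groupby("group")["response"].mean()
naive_diff = means["treated"] - means["untreated"]

# Compare treated vs. untreated within each confounder level, then average
by_level = df.groupby(["confounder", "group"])["response"].mean().unstack("group")
adjusted_diff = (by_level["treated"] - by_level["untreated"]).mean()

print(round(naive_diff, 2), round(adjusted_diff, 2))
```

The naive difference suggests a sizeable treatment effect, while the stratified difference is close to zero. This only works for confounders that were measured, which is one reason observational studies establish association rather than causation.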
 Longitudinal vs. cross-sectional studies
 Longitudinal study
 Participants are followed over a period of time to examine effect of treatment on response
 Effect of age on height is not confounded by generation
 More expensive, results take longer
 Cross-sectional study
 Data on participants is collected from a single snapshot in time
 Effect of age on height is confounded by generation
 Cheaper, faster, more convenient