Thinking probabilistically  Continuous variables
A Summary of lecture "Statistical Thinking in Python (Part 1)", via datacamp
 Probability density functions
 Introduction to the Normal distribution
 The Normal distribution: Properties and warnings
 The Exponential distribution
Introduction to the Normal distribution

Normal distribution

Describes a continuous variable whose PDF has a single symmetric peak.
$$ \begin{align} \text{mean of a Normal distribution} & \neq \text{mean computed from data} \\ \text{st. dev of a Normal distribution} & \neq \text{st. dev computed from data} \end{align} $$

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# with stds of interest: samples_std1, samples_std3, stamples_std10
samples_std1 = np.random.normal(20, 1, 100000)
samples_std3 = np.random.normal(20, 3, 100000)
samples_std10 = np.random.normal(20, 10, 100000)
# Make histograms
_ = plt.hist(samples_std1, histtype='step', density=True, bins=100)
_ = plt.hist(samples_std3, histtype='step', density=True, bins=100)
_ = plt.hist(samples_std10, histtype='step', density=True,bins=100)
# Make a legend, set limits
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(0.01, 0.42)
def ecdf(data):
"""Compute ECDF for a onedimensional array of measurements."""
# Number of data points: n
n = len(data)
# xdata for the ECDF: x
x = np.sort(data)
# ydata for the ECDF: y
y = np.arange(1, n + 1) / n
return x, y
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)
# Plot CDFs
_ = plt.plot(x_std1, y_std1, marker='.', linestyle='none')
_ = plt.plot(x_std3, y_std3, marker='.', linestyle='none')
_ = plt.plot(x_std10, y_std10, marker='.', linestyle='none')
# Make a legend
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.savefig('../images/stdcdf.png')
Are the Belmont Stakes results Normally distributed?
Since 1926, the Belmont Stakes is a 1.5 milelong race of 3year old thoroughbred horses. Secretariat ) ran the fastest Belmont Stakes in history in 1973. While that was the fastest year, 1970 was the slowest because of unusually wet and sloppy conditions. With these two outliers removed from the data set, compute the mean and standard deviation of the Belmont winners' times. Sample out of a Normal distribution with this mean and standard deviation using the np.random.normal()
function and plot a CDF. Overlay the ECDF from the winning Belmont times. Are these close to Normally distributed?
Note: Justin scraped the data concerning the Belmont Stakes from the Belmont Wikipedia page.
df = pd.read_csv('./dataset/belmont.csv')
belmont_no_outliers = np.array([148.51, 146.65, 148.52, 150.7 , 150.42, 150.88, 151.57, 147.54,
149.65, 148.74, 147.86, 148.75, 147.5 , 148.26, 149.71, 146.56,
151.19, 147.88, 149.16, 148.82, 148.96, 152.02, 146.82, 149.97,
146.13, 148.1 , 147.2 , 146. , 146.4 , 148.2 , 149.8 , 147. ,
147.2 , 147.8 , 148.2 , 149. , 149.8 , 148.6 , 146.8 , 149.6 ,
149. , 148.2 , 149.2 , 148. , 150.4 , 148.8 , 147.2 , 148.8 ,
149.6 , 148.4 , 148.4 , 150.2 , 148.8 , 149.2 , 149.2 , 148.4 ,
150.2 , 146.6 , 149.8 , 149. , 150.8 , 148.6 , 150.2 , 149. ,
148.6 , 150.2 , 148.2 , 149.4 , 150.8 , 150.2 , 152.2 , 148.2 ,
149.2 , 151. , 149.6 , 149.6 , 149.4 , 148.6 , 150. , 150.6 ,
149.2 , 152.6 , 152.8 , 149.6 , 151.6 , 152.8 , 153.2 , 152.4 ,
152.2 ])
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)
# Sample out of a normal distribution with this mu and sigma: samples
samples = np.random.normal(mu, sigma, size=10000)
# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(belmont_no_outliers)
x, y = ecdf(samples)
# Plot the CDFs
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
samples = np.random.normal(mu, sigma, size=1000000)
# Compute the fraction that are faster than 144 seconds: prob
prob = np.sum(samples < 144) / len(samples)
# Print the result
print('Probability of besting Secretariat:', prob)
If you have a story, you can simulate it!
Sometimes, the story describing our probability distribution does not have a named distribution to go along with it. In these cases, fear not! You can always simulate it. We'll do that in this and the next exercise.
In earlier exercises, we looked at the rare event of nohitters in Major League Baseball. Hitting the cycle is another rare baseball event. When a batter hits the cycle, he gets all four kinds of hits, a single, double, triple, and home run, in a single game. Like nohitters, this can be modeled as a Poisson process, so the time between hits of the cycle are also Exponentially distributed.
How long must we wait to see both a nohitter and then a batter hit the cycle? The idea is that we have to wait some time for the nohitter, and then after the nohitter, we have to wait for hitting the cycle. Stated another way, what is the total waiting time for the arrival of two different Poisson processes? The total waiting time is the time waited for the nohitter, plus the time waited for the hitting the cycle.
Now, you will write a function to sample out of the distribution described by this story.
def successive_poisson(tau1, tau2, size=1):
"""Compute time for arrival of 2 successive Poisson processes."""
# Draw samples out of first exponential distribution: t1
t1 = np.random.exponential(tau1, size)
# Draw samples out of second exponential distribution: t2
t2 = np.random.exponential(tau2, size)
return t1 + t2
waiting_times = successive_poisson(764, 715, size=100000)
# Make the histogram
_ = plt.hist(waiting_times, bins=100, density=True, histtype='step')
# Label axes
_ = plt.xlabel('waiting times')
_ = plt.ylabel('PDF')