Parameter estimation by optimization
A summary of the lecture "Statistical Thinking in Python (Part 2)", via DataCamp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
How often do we get no-hitters?
The number of games played between each no-hitter in the modern era (1901-2015) of Major League Baseball is stored in the array nohitter_times.
If you assume that no-hitters are described by a Poisson process, then the time between no-hitters is exponentially distributed. As you have seen, the exponential distribution has a single parameter, which we will call $\tau$, the typical interval time. The value of $\tau$ that makes the exponential distribution best match the data is the mean interval time (where time is in units of number of games) between no-hitters.
Compute the value of this parameter from the data. Then, use np.random.exponential() to "repeat" the history of Major League Baseball by drawing inter-no-hitter times from an exponential distribution with the $\tau$ you found, and plot the histogram as an approximation to the PDF.
nohitter_times = np.array([ 843, 1613, 1101, 215, 684, 814, 278, 324, 161, 219, 545,
715, 966, 624, 29, 450, 107, 20, 91, 1325, 124, 1468,
104, 1309, 429, 62, 1878, 1104, 123, 251, 93, 188, 983,
166, 96, 702, 23, 524, 26, 299, 59, 39, 12, 2,
308, 1114, 813, 887, 645, 2088, 42, 2090, 11, 886, 1665,
1084, 2900, 2432, 750, 4021, 1070, 1765, 1322, 26, 548, 1525,
77, 2181, 2752, 127, 2147, 211, 41, 1575, 151, 479, 697,
557, 2267, 542, 392, 73, 603, 233, 255, 528, 397, 1529,
1023, 1194, 462, 583, 37, 943, 996, 480, 1497, 717, 224,
219, 1531, 498, 44, 288, 267, 600, 52, 269, 1086, 386,
176, 2199, 216, 54, 675, 1243, 463, 650, 171, 327, 110,
774, 509, 8, 197, 136, 12, 1124, 64, 380, 811, 232,
192, 731, 715, 226, 605, 539, 1491, 323, 240, 179, 702,
156, 82, 1397, 354, 778, 603, 1001, 385, 986, 203, 149,
576, 445, 180, 1403, 252, 675, 1351, 2983, 1568, 45, 899,
3260, 1025, 31, 100, 2055, 4043, 79, 238, 3931, 2351, 595,
110, 215, 0, 563, 206, 660, 242, 577, 179, 157, 192,
192, 1848, 792, 1693, 55, 388, 225, 1134, 1172, 1555, 31,
1582, 1044, 378, 1687, 2915, 280, 765, 2819, 511, 1521, 745,
2491, 580, 2072, 6450, 578, 745, 1075, 1103, 1549, 1520, 138,
1202, 296, 277, 351, 391, 950, 459, 62, 1056, 1128, 139,
420, 87, 71, 814, 603, 1349, 162, 1027, 783, 326, 101,
876, 381, 905, 156, 419, 239, 119, 129, 467])
np.random.seed(42)
# Compute mean no-hitter time: tau
tau = np.mean(nohitter_times)
# Draw out of an exponential distribution with parameter tau: inter_nohitter_time
inter_nohitter_time = np.random.exponential(tau, 100000)
# Plot the PDF and label axes
_ = plt.hist(inter_nohitter_time, bins=50, density=True, histtype='step')
_ = plt.xlabel('Games between no-hitters')
_ = plt.ylabel('PDF')
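As a quick cross-check (this sketch assumes scipy is available, which the original notebook does not import), the maximum-likelihood estimate of the exponential scale parameter is exactly the sample mean, so fitting with scipy.stats.expon should reproduce tau:
# Hedged check: the exponential MLE of the scale parameter equals the sample mean
# (scipy is an extra dependency, not part of the original exercise)
from scipy import stats
_, scale_mle = stats.expon.fit(nohitter_times, floc=0)  # fix loc at 0
print('tau from np.mean():        ', tau)
print('scale from stats.expon.fit:', scale_mle)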
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n + 1) / n
    return x, y
x, y = ecdf(nohitter_times)
# Create a CDF from theoretical samples: x_theor, y_theor
x_theor, y_theor = ecdf(inter_nohitter_time)
# Overlay the plots
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')
# Margins and axis labels
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')
How is this parameter optimal?
Now sample out of an exponential distribution with $\tau$ being twice as large as the optimal $\tau$. Do it again for $\tau$ half as large. Make CDFs of these samples and overlay them with your data. You can see that they do not reproduce the data as well. Thus, the $\tau$ you computed from the mean inter-no-hitter time is optimal in that it best reproduces the data.
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')
# Take samples with half tau: samples_half
samples_half = np.random.exponential(tau / 2, 10000)
# Take samples with double tau: samples_double
samples_double = np.random.exponential(tau * 2, 10000)
# Generate CDFs from these samples
x_half, y_half = ecdf(samples_half)
x_double, y_double = ecdf(samples_double)
# Plot these CDFs as lines
_ = plt.plot(x_half, y_half)
_ = plt.plot(x_double, y_double)
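To make "best reproduces the data" more quantitative, a minimal sketch (again assuming scipy; this comparison is not part of the original exercise) computes the log-likelihood of the observed inter-no-hitter times under each candidate $\tau$; the sample mean gives the largest value:
# Hedged sketch: exponential log-likelihood of the data for tau/2, tau, and 2*tau
from scipy import stats
for scale in [tau / 2, tau, 2 * tau]:
    log_lik = np.sum(stats.expon.logpdf(nohitter_times, scale=scale))
    print('tau = {:7.1f}   log-likelihood = {:.1f}'.format(scale, log_lik))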
EDA of literacy/fertility data
In the next few exercises, we will look at the correlation between female literacy and fertility (defined as the average number of children born per woman) throughout the world. For ease of analysis and interpretation, we will work with the illiteracy rate.
It is always a good idea to do some EDA ahead of our analysis. To this end, plot fertility versus illiteracy and compute the Pearson correlation coefficient. The NumPy array illiteracy has the illiteracy rate among females for most of the world's nations. The array fertility has the corresponding fertility data.
df = pd.read_csv('./dataset/female_literacy_fertility.csv')
fertility = np.array(df['fertility'])
illiteracy = np.array(100 - df['female literacy'])
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays.

    Args:
        x: array
        y: array
    Returns:
        r: float
    """
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)
    # Return entry [0, 1]
    return corr_mat[0, 1]
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
# Set the margins and label axes
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Show the Pearson correlation coefficient
print(pearson_r(illiteracy, fertility))
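If scipy is installed (an assumption; the exercise itself only uses NumPy), scipy.stats.pearsonr provides a handy cross-check of the hand-rolled pearson_r, and also reports a p-value:
# Hedged cross-check of pearson_r() (requires scipy)
from scipy import stats
r, p_value = stats.pearsonr(illiteracy, fertility)
print('scipy pearsonr:', r, 'p-value:', p_value)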
Linear regression
We will assume that fertility is a linear function of the female illiteracy rate. That is, $f = ai + b$, where $a$ is the slope and $b$ is the intercept. We can think of the intercept as the minimal fertility rate, probably somewhere between one and two. The slope tells us how the fertility rate varies with illiteracy. We can find the best-fit line using np.polyfit().
Plot the data and the best fit line. Print out the slope and intercept. (Think: what are their units?)
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, deg=1)
# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')
# Make theoretical line to plot
x = np.array([0, 100])
y = a * x + b
# Add regression line to your plot
_ = plt.plot(x, y)
plt.savefig('../images/scatter-with-polyfit.png')
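The same slope and intercept can be recovered with scipy.stats.linregress (an alternative, assuming scipy is available; np.polyfit is what the exercise uses), which also reports the correlation coefficient and standard error:
# Hedged alternative fit; should agree with np.polyfit to numerical precision
from scipy import stats
res = stats.linregress(illiteracy, fertility)
print('linregress slope =', res.slope, 'intercept =', res.intercept)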
How is it optimal?
The function np.polyfit() that you used to get your regression parameters finds the optimal slope and intercept by minimizing the sum of the squares of the residuals, also known as the RSS (residual sum of squares). In this exercise, you will plot the function being optimized, the RSS, versus the slope parameter $a$. To do this, fix the intercept to the value you found in the optimization. Then plot the RSS vs. the slope. Where is it minimal?
a_vals = np.linspace(0, 0.1, 200)
# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)
# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a * illiteracy - b) ** 2)
# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
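A quick numerical confirmation (pure NumPy, nothing beyond what is already imported): the slope that minimizes the RSS on this grid should land on the np.polyfit slope, up to the grid spacing. Note that the loop above reused the name a, so the fit is recomputed here:
# Hedged check: the grid minimum of the RSS matches the polyfit slope
a_fit, b_fit = np.polyfit(illiteracy, fertility, deg=1)
a_best = a_vals[np.argmin(rss)]
print('slope minimizing RSS on the grid:', a_best)
print('slope from np.polyfit:           ', a_fit)
Linear regression on Anscombe data
Next, load Anscombe's quartet and perform a linear regression on the first data set, then plot the data together with the fitted line.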
df = pd.read_csv('./dataset/anscombe.csv')
# Anscombe's first data set: x1 and y1 (columns 0 and 1, matching the column layout used below)
x = np.array(df.iloc[1:, 0].astype('float'))
y = np.array(df.iloc[1:, 1].astype('float'))
a, b = np.polyfit(x, y, deg=1)
# Print the slope and intercept
print(a, b)
# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = a * x_theor + b
# Plot the Anscombe data and theoretical line
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.plot(x_theor, y_theor)
# Label the axes
plt.xlabel('x')
plt.ylabel('y')
Linear regression on all Anscombe data
Now, to verify that all four of the Anscombe data sets have the same slope and intercept from a linear regression, you will compute the slope and intercept for each set. The data are stored in lists: anscombe_x = [x1, x2, x3, x4] and anscombe_y = [y1, y2, y3, y4], where, for example, x2 and y2 are the x and y values for the second Anscombe data set.
x1 = df.iloc[1:, 0].astype('float')
x2 = df.iloc[1:, 2].astype('float')
x3 = df.iloc[1:, 4].astype('float')
x4 = df.iloc[1:, 6].astype('float')
y1 = df.iloc[1:, 1].astype('float')
y2 = df.iloc[1:, 3].astype('float')
y3 = df.iloc[1:, 5].astype('float')
y4 = df.iloc[1:, 7].astype('float')
anscombe_x = [x1, x2, x3, x4]
anscombe_y = [y1, y2, y3, y4]
for x, y in zip(anscombe_x, anscombe_y):
    # Compute the slope and intercept: a, b
    a, b = np.polyfit(x, y, deg=1)
    # Print the result
    print('slope:', a, 'intercept:', b)
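The four fits come out nearly identical, which is the whole point of Anscombe's quartet: summary statistics alone hide how different the data sets are. As a follow-up sketch (the 2x2 subplot layout is my own addition, not part of the exercise), plotting each set with its fitted line makes the differences obvious:
# Hedged follow-up: scatter each Anscombe set with its own regression line
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, x, y in zip(axes.ravel(), anscombe_x, anscombe_y):
    a, b = np.polyfit(x, y, deg=1)
    x_line = np.array([3, 20])
    ax.plot(x, y, marker='.', linestyle='none')
    ax.plot(x_line, a * x_line + b)
plt.tight_layout()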