Earthquakes and oil mining in Oklahoma
Of course, earthquakes have a big impact on society, and recently are connected to human activity. In this final chapter, you'll investigate the effect that increased injection of saline wastewater due to oil mining in Oklahoma has had on the seismicity of the region. This is the Summary of lecture "Case Studies in Statistical Thinking", via datacamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import dc_stat_think as dcst
plt.rcParams['figure.figsize'] = (10, 5)
df = pd.read_csv('./dataset/oklahoma_earthquakes_1950-2017.csv', skiprows=2)
df.head()
from datetime import datetime as dt
import time
# Resource: https://stackoverflow.com/questions/6451655/how-to-convert-python-datetime-dates-to-decimal-float-years
def toYearFraction(date):
def sinceEpoch(date): # returns seconds since epoch
return time.mktime(date.timetuple())
s = sinceEpoch
year = date.year
startOfThisYear = dt(year=year, month=1, day=1)
startOfNextYear = dt(year=year+1, month=1, day=1)
yearElapsed = s(date) - s(startOfThisYear)
yearDuration = s(startOfNextYear) - s(startOfThisYear)
fraction = yearElapsed/yearDuration
return date.year + fraction
times = pd.to_datetime(df['time']).apply(toYearFraction).to_numpy()
mags = df['mag'].to_numpy()
_ = plt.plot(times, mags, marker='.', linestyle='none', alpha=0.1)
# Label axes
_ = plt.xlabel('time (year)')
_ = plt.ylabel('magnitude')
Estimates of the mean interearthquake times
The graphical EDA in the last exercise shows an obvious change in earthquake frequency around 2010. To compare, compute the mean time between earthquakes of magnitude 3 and larger from 1980 through 2009 and also from 2010 through mid-2017. Also include 95% confidence intervals of the mean. The variables dt_pre and dt_post respectively contain the time gap between all earthquakes of magnitude at least 3 from pre-2010 and post-2010 in units of days.
df_over3 = df[df['mag'] >= 3]
df_over3['time'] = pd.to_datetime(df_over3['time']).copy()
df_over3
dt_pre_df = df_over3[(df_over3['time'].dt.year < 2010) & (df_over3['time'].dt.year >= 1980)]
dt_post_df = df_over3[df_over3['time'].dt.year >= 2010]
dt_pre = dt_pre_df.time.diff().dt.total_seconds().apply(lambda x: x / 86400).to_numpy()[1:]
dt_post = dt_post_df.time.diff().dt.total_seconds().apply(lambda x: x / 86400).to_numpy()[1:]
dt_pre
dt_post
mean_dt_pre = np.mean(dt_pre)
mean_dt_post = np.mean(dt_post)
# Draw 10,000 bootstrap replicates of the mean
bs_reps_pre = dcst.draw_bs_reps(dt_pre, np.mean, size=10000)
bs_reps_post = dcst.draw_bs_reps(dt_post, np.mean, size=10000)
# Compute the confidence interval
conf_int_pre = np.percentile(bs_reps_pre, [2.5, 97.5])
conf_int_post = np.percentile(bs_reps_post, [2.5, 97.5])
# Print the results
print("""1980 through 2009
mean time gap: {0:.2f} days
95% conf int: [{1:.2f}, {2:.2f}] days""".format(mean_dt_pre, *conf_int_pre))
print("""
2010 through mid-2017
mean time gap: {0:.2f} days
95% conf int: [{1:.2f}, {2:.2f}] days""".format(mean_dt_post, *conf_int_post))
Hypothesis test: did earthquake frequency change?
Obviously, there was a massive increase in earthquake frequency once wastewater injection began. Nonetheless, you will still do a hypothesis test for practice. You will not test the hypothesis that the interearthquake times have the same distribution before and after 2010, since wastewater injection may affect the distribution. Instead, you will assume that they have the same mean. So, compute the p-value associated with the hypothesis that the pre- and post-2010 interearthquake times have the same mean, using the mean of pre-2010 time gaps minus the mean of post-2010 time gaps as your test statistic.
mean_dt_diff = mean_dt_pre - mean_dt_post
# Shift the post-2010 data to have the same mean as the pre-2010 data
dt_post_shift = dt_post - mean_dt_post + mean_dt_pre
# Compute 10,000 bootstrap replicates from arrays
bs_reps_pre = dcst.draw_bs_reps(dt_pre, np.mean, size=10000)
bs_reps_post = dcst.draw_bs_reps(dt_post_shift, np.mean, size=10000)
# Get replicates of difference of means
bs_reps = bs_reps_pre - bs_reps_post
# Compute and print the p-value
p_val = np.sum(bs_reps >= mean_dt_diff) / 10000
print('p =', p_val)
In 10,000 samples, not one had a test statistic greater than was was observed. The p-value is, predictably based on what we have done so far, is tiiiiiny!
mags_pre = mags[times < 2010]
mags_post = mags[times >= 2010]
# Generate ECDFs
_ = plt.plot(*dcst.ecdf(mags_pre), marker='.', linestyle='none')
_ = plt.plot(*dcst.ecdf(mags_post), marker='.', linestyle='none')
# Label axes
_ = plt.xlabel('magnitude')
_ = plt.ylabel('ECDF')
plt.legend(('1980 though 2009', '2010 through mid-2017'), loc='upper left');
def b_value(mags, mt, perc=[2.5, 97.5], n_reps=None):
"""Compute the b-value and optionally its confidence interval."""
# Extract magnitudes above completeness threshold: m
m = mags[mags >= mt]
# Compute b-value: b
b = (np.mean(m) - mt) * np.log(10)
# Draw bootstrap replicates
if n_reps is None:
return b
else:
m_bs_reps = dcst.draw_bs_reps(m, np.mean, n_reps)
# Compute b-value from replicates: b_bs_reps
b_bs_reps = (m_bs_reps - mt) * np.log(10)
# Compute confidence interval: conf_int
conf_int = np.percentile(b_bs_reps, perc)
return b, conf_int
mt = 3
b_pre, conf_int_pre = b_value(mags_pre, mt, perc=[2.5, 97.5], n_reps=10000)
# Compute b-value and confidence interval for post-2010
b_post, conf_int_post = b_value(mags_post, mt, perc=[2.5, 97.5], n_reps=10000)
# Report the results
print("""
1980 through 2009
b-value: {0:.2f}
95% conf int: [{1:.2f}, {2:.2f}]
2010 through mid-2017
b-value: {3:.2f}
95% conf int: [{4:.2f}, {5:.2f}]
""".format(b_pre, *conf_int_pre, b_post, *conf_int_post))
The confidence interval for the b-value for recent earthquakes is tighter than for earlier ones because there are many more recent ones. Still, the confidence intervals overlap, and we can perform a hypothesis test to see if we might get these results if the b-values are actually the same.
mags_pre = mags_pre[mags_pre >= mt]
mags_post = mags_post[mags_post >= mt]
# Observed difference in mean magnitudes: diff_obs
diff_obs = np.mean(mags_post) - np.mean(mags_pre)
# Generate permutation replicates: perm_reps
perm_reps = dcst.draw_perm_reps(mags_post, mags_pre, dcst.diff_of_means, size=10000)
# Compute and print p-value
p_val = np.sum(perm_reps < diff_obs) / 10000
print('p =', p_val)
A p-value around 0.1 suggests that the observed magnitudes are commensurate with there being no change in b-value after wastewater injection began.