Written on April 23, 2021.
Tags: statistics, cool stuffs, extreme values
In this post we are going to play with pyextremes, a Python library that does everything (and more) I have been talking about in the two previous posts. After a bit of EDA, we are going to use BM (block maxima) and POT (peaks over threshold) to compute return values – it could be interesting to know what magnitude to expect at a given horizon, no?
\(m\)-year return values/levels are the “high quantile for which the probability that the annual maximum exceeds this quantile is \(1/m\)” – see 100-year flood (wiki).
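In other words, if \(F\) is the distribution of the annual maximum, the \(m\)-year return level \(z_m\) solves \(1 - F(z_m) = 1/m\), i.e. \(z_m = F^{-1}(1 - 1/m)\). BM and POT below are just two ways of estimating that upper tail.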
Setup and data - Earthquakes in CA
Setup
A bit of setup first…
```python
!pip3.8 install -q -U emcee folium pyextremes statsmodels
```

```python
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib as mpl
from matplotlib import pylab as plt
import matplotlib.dates as mdates

import seaborn as sns
sns.set(rc={'figure.figsize': (18, 6)})
sns.set_style("ticks")
cmap = plt.get_cmap('plasma')

import statsmodels.api as sm

from pyextremes import EVA, __version__
import pyextremes
from pyextremes.plotting import (
    plot_extremes,
    plot_trace,
    plot_corner,
)

vmin = 3.0
vmax = 9.0

print("pyextremes", __version__)
```

```
pyextremes 2.2.1
```
Data
Data are from the Northern California Earthquake Data Center (NCEDC).
- Waldhauser, F. and D.P. Schaff, Large-scale relocation of two decades of Northern California seismicity using cross-correlation and double-difference methods, J. Geophys. Res.,113, B08311, doi:10.1029/2007JB005479, 2008.
- Waldhauser, F., Near-real-time double-difference event location using long-term seismic archives, with application to Northern California, Bull. Seism. Soc. Am., 99, 2736-2848, doi:10.1785/0120080294, 2009.
Go here if you want to download them.
```python
df_ = pd.read_csv('ncedc.csv')
df_
```
|  | time | latitude | longitude | magnitude | magType | … | source | eventID |
|---|---|---|---|---|---|---|---|---|
| 0 | 1967/08/22 08:29:47.47 | 36.40050 | -120.98650 | 3.00 | Mx | … | NCSN | 1001120 |
| 1 | 1967/08/26 10:35:05.61 | 36.53867 | -121.16033 | 3.30 | Mx | … | NCSN | 1001154 |
| 2 | 1967/08/27 13:20:08.75 | 36.51083 | -121.14500 | 3.60 | Mx | … | NCSN | 1001166 |
| 3 | 1968/01/12 22:19:10.35 | 36.64533 | -121.24966 | 3.00 | ML | … | NCSN | 1001353 |
| 4 | 1968/02/09 13:42:37.05 | 37.15267 | -121.54483 | 3.00 | ML | … | NCSN | 1001402 |
| … | … | … | … | … | … | … | … | … |
| 18330 | 2020/12/27 14:44:40.09 | 39.50583 | -122.01767 | 3.86 | Mw | … | NCSN | 73503450 |
| 18331 | 2020/12/27 19:05:47.53 | 36.84417 | -121.57484 | 3.19 | ML | … | NCSN | 73503535 |
| 18332 | 2020/12/30 08:30:22.09 | 36.60500 | -121.20883 | 3.10 | ML | … | NCSN | 73504420 |
| 18333 | 2020/12/31 04:22:03.38 | 38.83883 | -122.82317 | 3.63 | Mw | … | NCSN | 73504795 |
| 18334 | 2020/12/31 13:41:59.04 | 37.79950 | -122.59367 | 3.33 | ML | … | NCSN | 73505175 |

18335 rows × 12 columns
```python
magTypeColors = sns.color_palette("hls", 8)
# one fixed color per magnitude scale
palette = {
    'ML': magTypeColors[0],
    'Md': magTypeColors[1],
    'Mh': magTypeColors[2],
    'Mw': magTypeColors[3],
    'Mx': magTypeColors[4],
    'Unk': magTypeColors[5],
}
sns.histplot(data=df_.sort_values(by='magType'),
             x='magnitude',
             hue='magType',
             bins=20,
             multiple='stack',
             palette=palette)
sns.despine(left=True)
plt.show()
```
And more specifically, above magnitude 5:
```python
sns.histplot(data=df_[df_['magnitude'] > 5].sort_values(by='magType'),
             x='magnitude',
             hue='magType',
             bins=20,
             multiple='stack',
             palette=palette)
sns.despine(left=True)
plt.show()
```
The magnitudes are based on different scales.
```python
df_.groupby(by='magType').size().to_frame('count')
```
| magType | count |
|---|---|
| ML | 6513 |
| Md | 10307 |
| Mh | 12 |
| Mw | 1456 |
| Mx | 46 |
| Unk | 1 |
We will stick to a single scale, as mapping the scales onto one another seems to be difficult. They basically measure different things… – see USGS - what-are-different-magnitude-scales.
We pick the moment magnitude (Mw) scale as “[it is] based on the concept of seismic moment, is uniformly applicable to all sizes of earthquakes but is more difficult to compute than the other types” – see magnitude. The other scales do not cover the entire range of earthquakes, which creates truncated data. Note that this could introduce a selection bias, as we are not sure why a particular event uses one scale or another.
```python
df__ = df_[df_['magType'] == 'Mw']
df__.groupby(by='magType').size()
```

```
Mw    1456
```
```python
df_all = df__[['time', 'latitude', 'longitude', 'magnitude']].copy()
df_all.rename({'magnitude': 'value',
               'time': 'obs'}, inplace=True, axis=1)
df_all = df_all.dropna(axis=0)
df_all = df_all.set_index(pd.DatetimeIndex(df_all['obs'])).drop('obs', axis=1)
df_all.head()
```
| obs | latitude | longitude | value |
|---|---|---|---|
| 1992-04-26 07:41:40.090 | 40.43250 | -124.56600 | 6.45 |
| 1992-04-26 11:18:25.980 | 40.38283 | -124.55500 | 6.57 |
| 1992-09-19 23:04:46.830 | 38.86000 | -122.79183 | 4.47 |
| 1993-01-16 06:29:35.010 | 37.01833 | -121.46317 | 4.91 |
| 1993-01-18 23:27:10.610 | 38.84250 | -122.77800 | 3.92 |
```python
sns.histplot(data=df_all, x='value', bins=20, color=palette['Mw'])
sns.despine(left=True)
plt.show()
```
The data we are looking at cover the following time interval:
```python
pd.DataFrame.from_dict(data={'min': [min(df_all.index)],
                             'max': [max(df_all.index)]},
                       orient='index',
                       columns=['obs'])
```
|  | obs |
|---|---|
| min | 1992-04-26 07:41:40.090 |
| max | 2020-12-31 04:22:03.380 |
and ranges as follows:

```python
pd.DataFrame(df_all['value'].describe())
```
|  | value |
|---|---|
| count | 1456.000000 |
| mean | 3.933427 |
| std | 0.535629 |
| min | 3.000000 |
| 25% | 3.560000 |
| 50% | 3.850000 |
| 75% | 4.200000 |
| max | 6.980000 |
The biggest earthquakes in CA (in this dataset and using this scale…) are:
```python
df_all.sort_values('value', ascending=False) \
      .nlargest(columns=['value'], n=10) \
      .style.background_gradient(subset=['value'],
                                 cmap=cmap,
                                 vmin=vmin,
                                 vmax=vmax)
```
| obs | latitude | longitude | value |
|---|---|---|---|
| 2005-06-15 02:50:51.330000 | 41.232670 | -125.986340 | 6.980000 |
| 2014-03-10 05:18:13.430000 | 40.828670 | -125.133830 | 6.800000 |
| 1995-02-19 04:03:14.940000 | 40.591840 | -125.756670 | 6.600000 |
| 1992-04-26 11:18:25.980000 | 40.382830 | -124.555000 | 6.570000 |
| 2003-12-22 19:15:56.240000 | 35.700500 | -121.100500 | 6.500000 |
| 2010-01-10 00:27:39.320000 | 40.652000 | -124.692500 | 6.500000 |
| 1992-04-26 07:41:40.090000 | 40.432500 | -124.566000 | 6.450000 |
| 2020-05-15 11:03:26.400000 | 38.216170 | -117.844330 | 6.440000 |
| 2014-08-24 10:20:44.070000 | 38.215170 | -122.312330 | 6.020000 |
| 2004-09-28 17:15:24.250000 | 35.818160 | -120.366000 | 5.970000 |
```python
import folium

def transparent(r, g, b, a, alpha=None):
    # Override the alpha channel of an RGBA tuple (when alpha is given).
    if alpha is None:
        return (r, g, b, a)
    return (r, g, b, alpha)

def rgb2hex(r, g, b, a=None):
    # Convert an RGB(A) tuple in [0, 1] to a hex color string.
    if a is None:
        return "#{:02x}{:02x}{:02x}".format(int(r*255),
                                            int(g*255),
                                            int(b*255))
    return "#{:02x}{:02x}{:02x}{:02x}".format(int(r*255),
                                              int(g*255),
                                              int(b*255),
                                              int(a*255))

m = folium.Map(location=[df_all['latitude'].mean(),
                         df_all['longitude'].mean()],
               tiles="Stamen Terrain", zoom_start=6)

def add_to_map(row):
    # One circle per event, sized and colored by magnitude.
    folium.Circle(
        radius=100*row['value'],
        location=[row['latitude'], row['longitude']],
        popup="{obs} {value}".format(obs=row.name,
                                     value=row['value']),
        color=rgb2hex(*transparent(*cmap((row['value']-vmin)/(vmax-vmin)),
                                   alpha=0.5)),
        fill=True,
    ).add_to(m)

df_all.apply(add_to_map, axis=1)
m
```
```python
from matplotlib.dates import YearLocator, DateFormatter

sns.scatterplot(data=df_all, x='obs', y='value',
                marker='.', palette=cmap, hue='value',
                hue_norm=(vmin, vmax), s=150, alpha=0.9)
ax = plt.gca()
ax.xaxis.set_major_locator(YearLocator(2))
ax.xaxis.set_major_formatter(DateFormatter('%Y'))
sns.despine(left=True)
plt.show()
```
The scale (Mw) seems to have been used throughout the period we are looking at.
```python
data_df = df_all.groupby(by='obs')['value'].max()
data_df.index = pd.to_datetime(data_df.index)
```
How independent are these events?
```python
sm.graphics.tsa.plot_acf(data_df, lags=20)
sns.despine(left=True)
plt.show()
```
There is a bit of autocorrelation. Not completely surprising, as “Aftershocks can continue over a period of weeks, months, or years.” – USGS glossary. To remove it, we are going to look at the max in 1-week windows. It is not perfect, as aftershocks can span longer than that and there is no guarantee that all the aftershocks will fall in the same bucket. Let's see how it does.
```python
data_df_ = data_df.resample('1W').max().dropna()
sm.graphics.tsa.plot_acf(data_df_.squeeze(), lags=20)
sns.despine(left=True)
plt.show()
```
Looking better.
```python
sns.scatterplot(data=data_df_.to_frame('value'), x='obs', y='value',
                marker='.', palette=cmap, hue='value',
                hue_norm=(vmin, vmax), s=150, alpha=0.9)
ax = plt.gca()
ax.xaxis.set_major_locator(YearLocator(2))
ax.xaxis.set_major_formatter(DateFormatter('%Y'))
sns.despine(left=True)
plt.show()
```
Let’s put the data in the form pyextremes expects:
```python
data = data_df_.to_frame('value').squeeze()
data
```

```
obs
1992-04-26    6.57
1992-09-20    4.47
1993-01-17    4.91
1993-01-24    3.92
1993-02-21    3.88
              ...
2020-12-06    4.96
2020-12-13    4.77
2020-12-20    4.81
2020-12-27    3.86
2021-01-03    3.63
Name: value, Length: 777, dtype: float64
```
BM model
```python
model = EVA(data=data)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Extreme values have not been extracted
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
model.get_extremes(
    method="BM",
    extremes_type="high",
    block_size="365.2425D",
    errors="ignore",
)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 29                        Extraction method: BM
Type: high                       Block size: 365 days 05:49:12
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
model.plot_extremes()
sns.despine(left=True)
plt.show()
```
MLE
```python
model.fit_model()
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 29                        Extraction method: BM
Type: high                       Block size: 365 days 05:49:12
===========================================================================
Model
---------------------------------------------------------------------------
Model: MLE                       Distribution: gumbel_r
Log-likelihood: -28.048          AIC: 60.558
---------------------------------------------------------------------------
Free parameters: loc=5.340       Fixed parameters: All parameters are free
                 scale=0.550
===========================================================================
```
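As a quick sanity check, the same fit can be reproduced directly with scipy. A minimal sketch, assuming `model.extremes` holds the 29 extracted block maxima (which is how pyextremes exposes them):

```python
from scipy.stats import gumbel_r

# Refit the extracted block maxima with scipy's Gumbel MLE;
# loc/scale should land close to pyextremes' 5.340 / 0.550.
loc, scale = gumbel_r.fit(model.extremes.values)
print(loc, scale)
```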
```python
model.plot_diagnostic(alpha=0.95)
sns.despine(left=True)
plt.show()
```
```python
model.plot_return_values(
    return_period=np.logspace(0.01, 2, 100),
    return_period_size="365.2425D",
    alpha=0.95,
)
sns.despine(left=True)
plt.show()
```
```python
summary_bm_mle = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=1000,
)
summary_bm_mle.style.background_gradient(cmap=cmap, vmin=vmin, vmax=vmax)
```
| return period | return value | lower ci | upper ci |
|---|---|---|---|
| 2.0 | 5.541965 | 5.327484 | 5.777972 |
| 5.0 | 6.165502 | 5.836295 | 6.472050 |
| 10.0 | 6.578338 | 6.165001 | 6.960001 |
| 25.0 | 7.099957 | 6.576074 | 7.581792 |
| 50.0 | 7.486924 | 6.863262 | 8.037220 |
| 100.0 | 7.871033 | 7.156755 | 8.490038 |
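The return values can be reproduced by hand: the \(m\)-year return level is the \(1 - 1/m\) quantile of the fitted annual-maximum distribution. A minimal sketch, plugging in the Gumbel parameters from the fit above (the CIs, on the other hand, come from resampling):

```python
from scipy.stats import gumbel_r

loc, scale = 5.340, 0.550  # MLE parameters from the model summary above
for m in [2, 5, 10, 25, 50, 100]:
    # z_m = F^{-1}(1 - 1/m) for the fitted Gumbel
    print(m, round(gumbel_r.ppf(1 - 1/m, loc=loc, scale=scale), 3))
# m=100 gives ~7.87, matching the table
```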
MCMC
The library also supports fitting the distribution using a Bayesian approach (MCMC via emcee). I won't go over it here; check the tutorials that come with the library.
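For reference, a minimal sketch of what that could look like – it is also what the plot_trace / plot_corner imports at the top are for; treat the keyword values as illustrative and see the tutorials for a proper walkthrough:

```python
# Fit the same extremes with the emcee sampler instead of MLE;
# n_walkers / n_samples are illustrative values only.
model.fit_model(model="Emcee", n_walkers=100, n_samples=500)
model.get_summary(return_period=[2, 5, 10, 25, 50, 100], alpha=0.95)
# plot_trace / plot_corner (imported earlier) can then be used to
# inspect the sampler output.
```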
POT
POT requires selecting a threshold. One way to choose it is to fit the GPD for a range of thresholds and look at the stability of its parameters. This is what the following does:
```python
pyextremes.plot_parameter_stability(ts=data, thresholds=np.linspace(4, 6, 50))
sns.despine(left=True)
plt.show()
```
We then pick a threshold in the range where the scale and shape parameters of the GPD are stable.
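Before settling on a value, another classic diagnostic is the mean residual life plot, which pyextremes also ships. A sketch, assuming it follows the same calling convention as plot_parameter_stability: above a threshold where the GPD holds, the mean excess should be roughly linear in the threshold.

```python
# Mean excess over each candidate threshold; look for the region
# where the curve becomes approximately linear.
pyextremes.plot_mean_residual_life(ts=data)
sns.despine(left=True)
plt.show()
```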
```python
threshold = 4.25
```
```python
model = EVA(data=data)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Extreme values have not been extracted
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
sns.histplot(data=data[data > threshold], bins=20)
sns.despine(left=True)
plt.show()
```
```python
model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold=threshold)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 237                       Extraction method: POT
Type: high                       Threshold: 4.25
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
model.plot_extremes()
sns.despine(left=True)
plt.show()
```
```python
model.fit_model()
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 237                       Extraction method: POT
Type: high                       Threshold: 4.25
===========================================================================
Model
---------------------------------------------------------------------------
Model: MLE                       Distribution: expon
Log-likelihood: -77.084          AIC: 156.185
---------------------------------------------------------------------------
Free parameters: scale=0.509     Fixed parameters: floc=4.250
===========================================================================
```
```python
model.plot_diagnostic(alpha=0.95)
sns.despine(left=True)
plt.show()
```
```python
model.plot_return_values(
    return_period=np.logspace(0.01, 2, 100),
    return_period_size="365.2425D",
    alpha=0.95,
)
sns.despine(left=True)
plt.show()
```
```python
summary_pot_mle = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=1000,
)
summary_pot_mle.style.background_gradient(cmap=cmap, vmin=vmin, vmax=vmax)
```
| return period | return value | lower ci | upper ci |
|---|---|---|---|
| 2.0 | 5.678355 | 5.508168 | 5.870555 |
| 5.0 | 6.145006 | 5.919218 | 6.399999 |
| 10.0 | 6.498014 | 6.230166 | 6.800508 |
| 25.0 | 6.964665 | 6.641216 | 7.329952 |
| 50.0 | 7.317673 | 6.952163 | 7.730461 |
| 100.0 | 7.670680 | 7.263111 | 8.130970 |
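Here too the table can be reproduced by hand. With an exponential excess model, the \(m\)-year return level has a closed form, \(z_m = u + \sigma \ln(\lambda m)\), where \(u\) is the threshold, \(\sigma\) the fitted scale and \(\lambda\) the yearly rate of exceedances. A minimal sketch, taking the 237 exceedances over the ~28.7 observed years from the outputs above:

```python
import numpy as np

u, sigma = 4.25, 0.509  # threshold and fitted exponential scale
lam = 237 / 28.7        # exceedances per year over the observed span
for m in [2, 5, 10, 25, 50, 100]:
    print(m, round(u + sigma * np.log(lam * m), 3))
# m=2 gives ~5.68 and m=100 gives ~7.67, matching the table
```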
Comparison
Let’s compare what both methods give us.
```python
df_bm = summary_bm_mle
df_pot = summary_pot_mle

# bm
plt.errorbar(df_bm.index,
             df_bm['return value'],
             yerr=[df_bm['return value'] - df_bm['lower ci'],
                   df_bm['upper ci'] - df_bm['return value']],
             fmt='-ob', label='return value (bm)')
plt.fill_between(df_bm.index,
                 df_bm['lower ci'],
                 df_bm['upper ci'],
                 color='b',
                 alpha=0.2)

# pot
plt.errorbar(df_pot.index,
             df_pot['return value'],
             yerr=[df_pot['return value'] - df_pot['lower ci'],
                   df_pot['upper ci'] - df_pot['return value']],
             fmt='-or',
             label='return value (pot)')
plt.fill_between(df_pot.index,
                 df_pot['lower ci'],
                 df_pot['upper ci'],
                 color='r',
                 alpha=0.2)

plt.grid(True, which="both", linestyle='--', linewidth=1)
plt.gca().set_xscale('log')
plt.title('CA earthquakes - return values - bm and pot 95% CIs')
plt.legend()
sns.despine(left=True)
plt.show()
```
```python
pd.concat([summary_bm_mle, summary_pot_mle],
          axis=1,
          keys=['bm', 'pot']).style.background_gradient(cmap=cmap,
                                                        vmin=vmin,
                                                        vmax=vmax)
```
| return period | return value (bm) | lower ci (bm) | upper ci (bm) | return value (pot) | lower ci (pot) | upper ci (pot) |
|---|---|---|---|---|---|---|
| 2.0 | 5.541965 | 5.327484 | 5.777972 | 5.678355 | 5.508168 | 5.870555 |
| 5.0 | 6.165502 | 5.836295 | 6.472050 | 6.145006 | 5.919218 | 6.399999 |
| 10.0 | 6.578338 | 6.165001 | 6.960001 | 6.498014 | 6.230166 | 6.800508 |
| 25.0 | 7.099957 | 6.576074 | 7.581792 | 6.964665 | 6.641216 | 7.329952 |
| 50.0 | 7.486924 | 6.863262 | 8.037220 | 7.317673 | 6.952163 | 7.730461 |
| 100.0 | 7.871033 | 7.156755 | 8.490038 | 7.670680 | 7.263111 | 8.130970 |
Let’s see what we had between 1992 and 2021:
```python
df_all.sort_values('value', ascending=False) \
      .nlargest(columns=['value'], n=10) \
      .style.background_gradient(subset=['value'],
                                 cmap=cmap,
                                 vmin=vmin,
                                 vmax=vmax)
```
| obs | latitude | longitude | value |
|---|---|---|---|
| 2005-06-15 02:50:51.330000 | 41.232670 | -125.986340 | 6.980000 |
| 2014-03-10 05:18:13.430000 | 40.828670 | -125.133830 | 6.800000 |
| 1995-02-19 04:03:14.940000 | 40.591840 | -125.756670 | 6.600000 |
| 1992-04-26 11:18:25.980000 | 40.382830 | -124.555000 | 6.570000 |
| 2003-12-22 19:15:56.240000 | 35.700500 | -121.100500 | 6.500000 |
| 2010-01-10 00:27:39.320000 | 40.652000 | -124.692500 | 6.500000 |
| 1992-04-26 07:41:40.090000 | 40.432500 | -124.566000 | 6.450000 |
| 2020-05-15 11:03:26.400000 | 38.216170 | -117.844330 | 6.440000 |
| 2014-08-24 10:20:44.070000 | 38.215170 | -122.312330 | 6.020000 |
| 2004-09-28 17:15:24.250000 | 35.818160 | -120.366000 | 5.970000 |
Quakes greater than 6.5 but below 7.2 (Mw) have definitely been observed between 1992 and 2021 – in line with the CI for the return value at 20 years – though remember that, by the definition of the return value, they didn't have to be.
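To make that last caveat concrete: even if the 20-year return level were estimated perfectly, the chance of seeing it exceeded at least once in a ~29-year window is high. A back-of-the-envelope sketch, assuming independent years:

```python
m, n = 20, 29  # return period (years), observation window (years)
print(1 - (1 - 1/m)**n)  # ~0.77
```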
Conclusion
Cool piece of theory to look at! Nice lib! Nice tool!
References
- An Introduction to Statistical Modeling of Extreme Values, Stuart Coles
- pyextremes - python library