Written on April 23, 2021.
Tags: statistics, cool stuffs, extreme values
In this post we are going to play with pyextremes, a Python library that does everything (and more) I have been talking about in the two previous posts. After a bit of EDA, we are going to use BM (block maxima) and POT (peaks over threshold) to compute return values – it could be interesting to know what magnitude to expect at a given horizon, no?
\(m\)-year return values/levels are the “high quantile for which the probability that the annual maximum exceeds this quantile is \(1/m\)” – see 100-year flood (wiki).
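In other words, if \(F\) is the distribution of the annual maximum, the \(m\)-year return level \(z_m\) solves \(1 - F(z_m) = 1/m\), i.e. \(z_m = F^{-1}(1 - 1/m)\). BM and POT below are just two ways of estimating that upper tail.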
Setup and data - Earthquakes in CA
Setup
A bit of setup first…
```python
!pip3.8 install -q -U emcee folium pyextremes statsmodels
```

```python
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib as mpl
from matplotlib import pylab as plt
import matplotlib.dates as mdates

import seaborn as sns
sns.set(rc={'figure.figsize': (18, 6)})
sns.set_style("ticks")
cmap = plt.get_cmap('plasma')

import statsmodels.api as sm

from pyextremes import EVA, __version__
import pyextremes
from pyextremes.plotting import (
    plot_extremes,
    plot_trace,
    plot_corner,
)

vmin = 3.0
vmax = 9.0

print("pyextremes", __version__)
```

```
pyextremes 2.2.1
```
Data
Data are from the Northern California Earthquake Data Center (NCEDC).
- Waldhauser, F. and D.P. Schaff, Large-scale relocation of two decades of Northern California seismicity using cross-correlation and double-difference methods, J. Geophys. Res.,113, B08311, doi:10.1029/2007JB005479, 2008.
- Waldhauser, F., Near-real-time double-difference event location using long-term seismic archives, with application to Northern California, Bull. Seism. Soc. Am., 99, 2736-2848, doi:10.1785/0120080294, 2009.
Go here if you want to download them.
```python
df_ = pd.read_csv('ncedc.csv')
df_
```
|  | time | latitude | longitude | magnitude | magType | … | source | eventID |
|---|---|---|---|---|---|---|---|---|
| 0 | 1967/08/22 08:29:47.47 | 36.40050 | -120.98650 | 3.00 | Mx | … | NCSN | 1001120 |
| 1 | 1967/08/26 10:35:05.61 | 36.53867 | -121.16033 | 3.30 | Mx | … | NCSN | 1001154 |
| 2 | 1967/08/27 13:20:08.75 | 36.51083 | -121.14500 | 3.60 | Mx | … | NCSN | 1001166 |
| 3 | 1968/01/12 22:19:10.35 | 36.64533 | -121.24966 | 3.00 | ML | … | NCSN | 1001353 |
| 4 | 1968/02/09 13:42:37.05 | 37.15267 | -121.54483 | 3.00 | ML | … | NCSN | 1001402 |
| … | … | … | … | … | … | … | … | … |
| 18330 | 2020/12/27 14:44:40.09 | 39.50583 | -122.01767 | 3.86 | Mw | … | NCSN | 73503450 |
| 18331 | 2020/12/27 19:05:47.53 | 36.84417 | -121.57484 | 3.19 | ML | … | NCSN | 73503535 |
| 18332 | 2020/12/30 08:30:22.09 | 36.60500 | -121.20883 | 3.10 | ML | … | NCSN | 73504420 |
| 18333 | 2020/12/31 04:22:03.38 | 38.83883 | -122.82317 | 3.63 | Mw | … | NCSN | 73504795 |
| 18334 | 2020/12/31 13:41:59.04 | 37.79950 | -122.59367 | 3.33 | ML | … | NCSN | 73505175 |

18335 rows × 12 columns
```python
magTypeColors = sns.color_palette("hls", 8)
# one fixed color per magnitude scale
palette = {
    'ML': magTypeColors[0],
    'Md': magTypeColors[1],
    'Mh': magTypeColors[2],
    'Mw': magTypeColors[3],
    'Mx': magTypeColors[4],
    'Unk': magTypeColors[5],
}
sns.histplot(data=df_.sort_values(by='magType'),
             x='magnitude',
             hue='magType',
             bins=20,
             multiple='stack',
             palette=palette)
sns.despine(left=True)
plt.show()
```
And more specifically, above magnitude 5:
```python
sns.histplot(data=df_[df_['magnitude'] > 5].sort_values(by='magType'),
             x='magnitude',
             hue='magType',
             bins=20,
             multiple='stack',
             palette=palette)
sns.despine(left=True)
plt.show()
```
The magnitudes are based on different scales.
```python
df_.groupby(by='magType').size().to_frame('count')
```
| magType | count |
|---|---|
| ML | 6513 |
| Md | 10307 |
| Mh | 12 |
| Mw | 1456 |
| Mx | 46 |
| Unk | 1 |
We will stick to a single scale, as mapping the scales onto one another seems to be difficult. They basically measure different things… – see USGS - what-are-different-magnitude-scales.
We pick the moment magnitude (Mw) scale as “[it is] based on the concept of seismic moment, is uniformly applicable to all sizes of earthquakes but is more difficult to compute than the other types” – see magnitude. The other scales do not cover the entire range of earthquakes, which creates truncated data. Note that this could introduce a selection bias, as we are not sure why a particular event uses one scale or another.
```python
df__ = df_[df_['magType'] == 'Mw']
df__.groupby(by='magType').size()
```

```
Mw    1456
```
```python
df_all = df__[['time', 'latitude', 'longitude', 'magnitude']].copy()
df_all.rename({'magnitude': 'value',
               'time': 'obs'}, inplace=True, axis=1)
df_all = df_all.dropna(axis=0)
df_all = df_all.set_index(pd.DatetimeIndex(df_all['obs'])).drop('obs', axis=1)
df_all.head()
```
| obs | latitude | longitude | value |
|---|---|---|---|
| 1992-04-26 07:41:40.090 | 40.43250 | -124.56600 | 6.45 |
| 1992-04-26 11:18:25.980 | 40.38283 | -124.55500 | 6.57 |
| 1992-09-19 23:04:46.830 | 38.86000 | -122.79183 | 4.47 |
| 1993-01-16 06:29:35.010 | 37.01833 | -121.46317 | 4.91 |
| 1993-01-18 23:27:10.610 | 38.84250 | -122.77800 | 3.92 |
```python
sns.histplot(data=df_all, x='value', bins=20, color=palette['Mw'])
sns.despine(left=True)
plt.show()
```
The data we are looking at cover the following time interval:
```python
pd.DataFrame.from_dict(data={'min': [min(df_all.index)],
                             'max': [max(df_all.index)]},
                       orient='index',
                       columns=['obs'])
```
|  | obs |
|---|---|
| min | 1992-04-26 07:41:40.090 |
| max | 2020-12-31 04:22:03.380 |
and ranges as follows:

```python
pd.DataFrame(df_all['value'].describe())
```
|  | value |
|---|---|
| count | 1456.000000 |
| mean | 3.933427 |
| std | 0.535629 |
| min | 3.000000 |
| 25% | 3.560000 |
| 50% | 3.850000 |
| 75% | 4.200000 |
| max | 6.980000 |
The biggest earthquakes in CA (in this dataset and using this scale…) are:
```python
df_all.sort_values('value', ascending=False) \
      .nlargest(columns=['value'], n=10) \
      .style.background_gradient(subset=['value'],
                                 cmap=cmap,
                                 vmin=vmin,
                                 vmax=vmax)
```
| obs | latitude | longitude | value |
|---|---|---|---|
| 2005-06-15 02:50:51.330000 | 41.232670 | -125.986340 | 6.980000 |
| 2014-03-10 05:18:13.430000 | 40.828670 | -125.133830 | 6.800000 |
| 1995-02-19 04:03:14.940000 | 40.591840 | -125.756670 | 6.600000 |
| 1992-04-26 11:18:25.980000 | 40.382830 | -124.555000 | 6.570000 |
| 2003-12-22 19:15:56.240000 | 35.700500 | -121.100500 | 6.500000 |
| 2010-01-10 00:27:39.320000 | 40.652000 | -124.692500 | 6.500000 |
| 1992-04-26 07:41:40.090000 | 40.432500 | -124.566000 | 6.450000 |
| 2020-05-15 11:03:26.400000 | 38.216170 | -117.844330 | 6.440000 |
| 2014-08-24 10:20:44.070000 | 38.215170 | -122.312330 | 6.020000 |
| 2004-09-28 17:15:24.250000 | 35.818160 | -120.366000 | 5.970000 |
```python
import folium

def transparent(r, g, b, a, alpha=None):
    # Override the alpha channel of an RGBA tuple (when alpha is given).
    if alpha is None:
        return (r, g, b, a)
    return (r, g, b, alpha)

def rgb2hex(r, g, b, a=None):
    # Convert an RGB(A) tuple in [0, 1] to a hex color string.
    if a is None:
        return "#{:02x}{:02x}{:02x}".format(int(r*255),
                                            int(g*255),
                                            int(b*255))
    return "#{:02x}{:02x}{:02x}{:02x}".format(int(r*255),
                                              int(g*255),
                                              int(b*255),
                                              int(a*255))

m = folium.Map(location=[df_all['latitude'].mean(),
                         df_all['longitude'].mean()],
               tiles="Stamen Terrain", zoom_start=6)

def add_to_map(row):
    # One circle per event, sized and colored by magnitude.
    folium.Circle(
        radius=100*row['value'],
        location=[row['latitude'], row['longitude']],
        popup="{obs} {value}".format(obs=row.name,
                                     value=row['value']),
        color=rgb2hex(*transparent(*cmap((row['value']-vmin)/(vmax-vmin)),
                                   alpha=0.5)),
        fill=True,
    ).add_to(m)

df_all.apply(add_to_map, axis=1)
m
```
```python
from matplotlib.dates import YearLocator, DateFormatter

sns.scatterplot(data=df_all, x='obs', y='value',
                marker='.', palette=cmap, hue='value',
                hue_norm=(vmin, vmax), s=150, alpha=0.9)
ax = plt.gca()
ax.xaxis.set_major_locator(YearLocator(2))
ax.xaxis.set_major_formatter(DateFormatter('%Y'))
sns.despine(left=True)
plt.show()
```
The scale (Mw) seems to have been used throughout the period we are looking at.
```python
data_df = df_all.groupby(by='obs')['value'].max()
data_df.index = pd.to_datetime(data_df.index)
```
How independent are these events?
```python
sm.graphics.tsa.plot_acf(data_df, lags=20)
sns.despine(left=True)
plt.show()
```
There is a bit of autocorrelation. Not completely surprising, as “Aftershocks can continue over a period of weeks, months, or years.” – USGS glossary. To remove it, we are going to look at the max in 1-week windows. It is not perfect, as aftershocks can span longer than that and there is no guarantee that all the aftershocks will fall in the same bucket. Let's see how it does.
```python
data_df_ = data_df.resample('1W').max().dropna()
sm.graphics.tsa.plot_acf(data_df_.squeeze(), lags=20)
sns.despine(left=True)
plt.show()
```
Looking better.
```python
sns.scatterplot(data=data_df_.to_frame('value'), x='obs', y='value',
                marker='.', palette=cmap, hue='value',
                hue_norm=(vmin, vmax), s=150, alpha=0.9)
ax = plt.gca()
ax.xaxis.set_major_locator(YearLocator(2))
ax.xaxis.set_major_formatter(DateFormatter('%Y'))
sns.despine(left=True)
plt.show()
```
Let’s put the data in the form pyextremes expects:
```python
data = data_df_.to_frame('value').squeeze()
data
```

```
obs
1992-04-26    6.57
1992-09-20    4.47
1993-01-17    4.91
1993-01-24    3.92
1993-02-21    3.88
              ...
2020-12-06    4.96
2020-12-13    4.77
2020-12-20    4.81
2020-12-27    3.86
2021-01-03    3.63
Name: value, Length: 777, dtype: float64
```
BM model
```python
model = EVA(data=data)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Extreme values have not been extracted
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
model.get_extremes(
    method="BM",
    extremes_type="high",
    block_size="365.2425D",
    errors="ignore",
)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 29                        Extraction method: BM
Type: high                       Block size: 365 days 05:49:12
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
model.plot_extremes()
sns.despine(left=True)
plt.show()
```
MLE
```python
model.fit_model()
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 29                        Extraction method: BM
Type: high                       Block size: 365 days 05:49:12
===========================================================================
Model
---------------------------------------------------------------------------
Model: MLE                       Distribution: gumbel_r
Log-likelihood: -28.048          AIC: 60.558
---------------------------------------------------------------------------
Free parameters: loc=5.340       Fixed parameters: All parameters are free
                 scale=0.550
===========================================================================
```
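As a quick sanity check, the same fit can be reproduced directly with scipy. A minimal sketch, assuming `model.extremes` holds the 29 extracted block maxima (which is how pyextremes exposes them):

```python
from scipy.stats import gumbel_r

# Refit the extracted block maxima with scipy's Gumbel MLE;
# loc/scale should land close to pyextremes' 5.340 / 0.550.
loc, scale = gumbel_r.fit(model.extremes.values)
print(loc, scale)
```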
```python
model.plot_diagnostic(alpha=0.95)
sns.despine(left=True)
plt.show()
```
```python
model.plot_return_values(
    return_period=np.logspace(0.01, 2, 100),
    return_period_size="365.2425D",
    alpha=0.95,
)
sns.despine(left=True)
plt.show()
```
```python
summary_bm_mle = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=1000,
)
summary_bm_mle.style.background_gradient(cmap=cmap, vmin=vmin, vmax=vmax)
```
| return period | return value | lower ci | upper ci |
|---|---|---|---|
| 2.0 | 5.541965 | 5.327484 | 5.777972 |
| 5.0 | 6.165502 | 5.836295 | 6.472050 |
| 10.0 | 6.578338 | 6.165001 | 6.960001 |
| 25.0 | 7.099957 | 6.576074 | 7.581792 |
| 50.0 | 7.486924 | 6.863262 | 8.037220 |
| 100.0 | 7.871033 | 7.156755 | 8.490038 |
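The return values can be reproduced by hand: the \(m\)-year return level is the \(1 - 1/m\) quantile of the fitted annual-maximum distribution. A minimal sketch, plugging in the Gumbel parameters from the fit above (the CIs, on the other hand, come from resampling):

```python
from scipy.stats import gumbel_r

loc, scale = 5.340, 0.550  # MLE parameters from the model summary above
for m in [2, 5, 10, 25, 50, 100]:
    # z_m = F^{-1}(1 - 1/m) for the fitted Gumbel
    print(m, round(gumbel_r.ppf(1 - 1/m, loc=loc, scale=scale), 3))
# m=100 gives ~7.87, matching the table
```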
MCMC
The library also supports fitting the distribution using a Bayesian approach (MCMC via emcee). I won't go over it here; check the tutorials that come with the library.
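For reference, a minimal sketch of what that could look like – it is also what the plot_trace / plot_corner imports at the top are for; treat the keyword values as illustrative and see the tutorials for a proper walkthrough:

```python
# Fit the same extremes with the emcee sampler instead of MLE;
# n_walkers / n_samples are illustrative values only.
model.fit_model(model="Emcee", n_walkers=100, n_samples=500)
model.get_summary(return_period=[2, 5, 10, 25, 50, 100], alpha=0.95)
# plot_trace / plot_corner (imported earlier) can then be used to
# inspect the sampler output.
```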
POT
POT requires selecting a threshold. One way to choose it is to fit the GPD for a range of thresholds and look at the stability of its parameters. This is what the following does:
```python
pyextremes.plot_parameter_stability(ts=data, thresholds=np.linspace(4, 6, 50))
sns.despine(left=True)
plt.show()
```
We then pick a threshold in the range where the scale and shape parameters of the GPD are stable.
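Before settling on a value, another classic diagnostic is the mean residual life plot, which pyextremes also ships. A sketch, assuming it follows the same calling convention as plot_parameter_stability: above a threshold where the GPD holds, the mean excess should be roughly linear in the threshold.

```python
# Mean excess over each candidate threshold; look for the region
# where the curve becomes approximately linear.
pyextremes.plot_mean_residual_life(ts=data)
sns.despine(left=True)
plt.show()
```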
```python
threshold = 4.25
```
```python
model = EVA(data=data)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Extreme values have not been extracted
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
sns.histplot(data=data[data > threshold], bins=20)
sns.despine(left=True)
plt.show()
```
```python
model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold=threshold)
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 237                       Extraction method: POT
Type: high                       Threshold: 4.25
===========================================================================
Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
```
```python
model.plot_extremes()
sns.despine(left=True)
plt.show()
```
```python
model.fit_model()
model
```

```
Univariate Extreme Value Analysis
===========================================================================
Source Data
---------------------------------------------------------------------------
Data label: value                Size: 777
Start: April 1992                End: January 2021
===========================================================================
Extreme Values
---------------------------------------------------------------------------
Count: 237                       Extraction method: POT
Type: high                       Threshold: 4.25
===========================================================================
Model
---------------------------------------------------------------------------
Model: MLE                       Distribution: expon
Log-likelihood: -77.084          AIC: 156.185
---------------------------------------------------------------------------
Free parameters: scale=0.509     Fixed parameters: floc=4.250
===========================================================================
```
```python
model.plot_diagnostic(alpha=0.95)
sns.despine(left=True)
plt.show()
```
```python
model.plot_return_values(
    return_period=np.logspace(0.01, 2, 100),
    return_period_size="365.2425D",
    alpha=0.95,
)
sns.despine(left=True)
plt.show()
```
```python
summary_pot_mle = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=1000,
)
summary_pot_mle.style.background_gradient(cmap=cmap, vmin=vmin, vmax=vmax)
```
| return period | return value | lower ci | upper ci |
|---|---|---|---|
| 2.0 | 5.678355 | 5.508168 | 5.870555 |
| 5.0 | 6.145006 | 5.919218 | 6.399999 |
| 10.0 | 6.498014 | 6.230166 | 6.800508 |
| 25.0 | 6.964665 | 6.641216 | 7.329952 |
| 50.0 | 7.317673 | 6.952163 | 7.730461 |
| 100.0 | 7.670680 | 7.263111 | 8.130970 |
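Here too the table can be reproduced by hand. With an exponential excess model, the \(m\)-year return level has a closed form, \(z_m = u + \sigma \ln(\lambda m)\), where \(u\) is the threshold, \(\sigma\) the fitted scale and \(\lambda\) the yearly rate of exceedances. A minimal sketch, taking the 237 exceedances over the ~28.7 observed years from the outputs above:

```python
import numpy as np

u, sigma = 4.25, 0.509  # threshold and fitted exponential scale
lam = 237 / 28.7        # exceedances per year over the observed span
for m in [2, 5, 10, 25, 50, 100]:
    print(m, round(u + sigma * np.log(lam * m), 3))
# m=2 gives ~5.68 and m=100 gives ~7.67, matching the table
```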
Comparison
Let’s compare what both methods give us.
```python
df_bm = summary_bm_mle
df_pot = summary_pot_mle

# bm
plt.errorbar(df_bm.index,
             df_bm['return value'],
             yerr=[df_bm['return value'] - df_bm['lower ci'],
                   df_bm['upper ci'] - df_bm['return value']],
             fmt='-ob', label='return value (bm)')
plt.fill_between(df_bm.index,
                 df_bm['lower ci'],
                 df_bm['upper ci'],
                 color='b',
                 alpha=0.2)

# pot
plt.errorbar(df_pot.index,
             df_pot['return value'],
             yerr=[df_pot['return value'] - df_pot['lower ci'],
                   df_pot['upper ci'] - df_pot['return value']],
             fmt='-or',
             label='return value (pot)')
plt.fill_between(df_pot.index,
                 df_pot['lower ci'],
                 df_pot['upper ci'],
                 color='r',
                 alpha=0.2)

plt.grid(True, which="both", linestyle='--', linewidth=1)
plt.gca().set_xscale('log')
plt.title('CA earthquakes - return values - bm and pot 95% CIs')
plt.legend()
sns.despine(left=True)
plt.show()
```
```python
pd.concat([summary_bm_mle, summary_pot_mle],
          axis=1,
          keys=['bm', 'pot']).style.background_gradient(cmap=cmap,
                                                        vmin=vmin,
                                                        vmax=vmax)
```
| return period | return value (bm) | lower ci (bm) | upper ci (bm) | return value (pot) | lower ci (pot) | upper ci (pot) |
|---|---|---|---|---|---|---|
| 2.0 | 5.541965 | 5.327484 | 5.777972 | 5.678355 | 5.508168 | 5.870555 |
| 5.0 | 6.165502 | 5.836295 | 6.472050 | 6.145006 | 5.919218 | 6.399999 |
| 10.0 | 6.578338 | 6.165001 | 6.960001 | 6.498014 | 6.230166 | 6.800508 |
| 25.0 | 7.099957 | 6.576074 | 7.581792 | 6.964665 | 6.641216 | 7.329952 |
| 50.0 | 7.486924 | 6.863262 | 8.037220 | 7.317673 | 6.952163 | 7.730461 |
| 100.0 | 7.871033 | 7.156755 | 8.490038 | 7.670680 | 7.263111 | 8.130970 |
Let’s see what we had between 1992 and 2021:
```python
df_all.sort_values('value', ascending=False) \
      .nlargest(columns=['value'], n=10) \
      .style.background_gradient(subset=['value'],
                                 cmap=cmap,
                                 vmin=vmin,
                                 vmax=vmax)
```
| obs | latitude | longitude | value |
|---|---|---|---|
| 2005-06-15 02:50:51.330000 | 41.232670 | -125.986340 | 6.980000 |
| 2014-03-10 05:18:13.430000 | 40.828670 | -125.133830 | 6.800000 |
| 1995-02-19 04:03:14.940000 | 40.591840 | -125.756670 | 6.600000 |
| 1992-04-26 11:18:25.980000 | 40.382830 | -124.555000 | 6.570000 |
| 2003-12-22 19:15:56.240000 | 35.700500 | -121.100500 | 6.500000 |
| 2010-01-10 00:27:39.320000 | 40.652000 | -124.692500 | 6.500000 |
| 1992-04-26 07:41:40.090000 | 40.432500 | -124.566000 | 6.450000 |
| 2020-05-15 11:03:26.400000 | 38.216170 | -117.844330 | 6.440000 |
| 2014-08-24 10:20:44.070000 | 38.215170 | -122.312330 | 6.020000 |
| 2004-09-28 17:15:24.250000 | 35.818160 | -120.366000 | 5.970000 |
Quakes greater than 6.5 but below 7.2 (Mw) have definitely been observed between 1992 and 2021 – in line with the CI for the return value at 20 years – though remember that, by the definition of the return value, they didn't have to be.
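To make that last caveat concrete: even if the 20-year return level were estimated perfectly, the chance of seeing it exceeded at least once in a ~29-year window is high. A back-of-the-envelope sketch, assuming independent years:

```python
m, n = 20, 29  # return period (years), observation window (years)
print(1 - (1 - 1/m)**n)  # ~0.77
```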
Conclusion
Cool piece of theory to look at! Nice lib! Nice tool!
References
- An Introduction to Statistical Modeling of Extreme Values, Stuart Coles
- pyextremes - python library