Abstract

Some code around BM, POT and quakes in California!

Written on April 23, 2021.
Tags: statistics, cool stuff, extreme values

In this post we are going to play with pyextremes, a Python library that does everything (and more) I have been talking about in the two previous posts. After a bit of EDA, we are going to use BM (block maxima) and POT (peaks over threshold) to compute return values – it could be interesting to know what magnitude to expect at a given horizon, no?

\(m\)-year return values/levels are the “high quantile for which the probability that the annual maximum exceeds this quantile is \(1/m\)” – see 100-year flood (wiki).
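
Concretely, if \(F\) is the distribution of the annual maximum, the \(m\)-year return level \(z_m\) solves

\[ P(M > z_m) = 1 - F(z_m) = \frac{1}{m} \quad\Longleftrightarrow\quad z_m = F^{-1}\left(1 - \frac{1}{m}\right) \]

We will use this below to sanity-check the fitted models.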

Setup and data - Earthquakes in CA

Setup

A bit of setup first…

!pip3.8 install -q -U emcee folium pyextremes statsmodels
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib as mpl
from matplotlib import pylab as plt
import matplotlib.dates as mdates
import seaborn as sns
sns.set(rc={'figure.figsize':(18,6)})
sns.set_style("ticks")

cmap = plt.get_cmap('plasma')

import statsmodels.api as sm

from pyextremes import EVA, __version__
import pyextremes

from pyextremes.plotting import (
    plot_extremes,
    plot_trace,
    plot_corner,
)

vmin=3.0
vmax=9.0

print("pyextremes", __version__)
pyextremes 2.2.1

Data

Data are from the Northern California Earthquake Data Center (NCEDC).

  • Waldhauser, F. and D.P. Schaff, Large-scale relocation of two decades of Northern California seismicity using cross-correlation and double-difference methods, J. Geophys. Res.,113, B08311, doi:10.1029/2007JB005479, 2008.
  • Waldhauser, F., Near-real-time double-difference event location using long-term seismic archives, with application to Northern California, Bull. Seism. Soc. Am., 99, 2736-2848, doi:10.1785/0120080294, 2009.

Go here if you want to download them.

df_ = pd.read_csv('ncedc.csv')
df_
time latitude longitude magnitude magType source eventID
0 1967/08/22 08:29:47.47 36.40050 -120.98650 3.00 Mx NCSN 1001120
1 1967/08/26 10:35:05.61 36.53867 -121.16033 3.30 Mx NCSN 1001154
2 1967/08/27 13:20:08.75 36.51083 -121.14500 3.60 Mx NCSN 1001166
3 1968/01/12 22:19:10.35 36.64533 -121.24966 3.00 ML NCSN 1001353
4 1968/02/09 13:42:37.05 37.15267 -121.54483 3.00 ML NCSN 1001402
... ... ... ... ... ... ... ...
18330 2020/12/27 14:44:40.09 39.50583 -122.01767 3.86 Mw NCSN 73503450
18331 2020/12/27 19:05:47.53 36.84417 -121.57484 3.19 ML NCSN 73503535
18332 2020/12/30 08:30:22.09 36.60500 -121.20883 3.10 ML NCSN 73504420
18333 2020/12/31 04:22:03.38 38.83883 -122.82317 3.63 Mw NCSN 73504795
18334 2020/12/31 13:41:59.04 37.79950 -122.59367 3.33 ML NCSN 73505175

18335 rows × 12 columns

magTypeColors = sns.color_palette("hls", 8)
palette = {
    'ML': magTypeColors[0],
    'Md': magTypeColors[1],
    'Mh': magTypeColors[2],
    'Mw': magTypeColors[3],
    'Mx': magTypeColors[4],
    'Unk': magTypeColors[5],
}
sns.histplot(data=df_.sort_values(by='magType'),
             x='magnitude',
             hue='magType',
             bins=20,
             multiple='stack',
             palette=palette)
sns.despine(left=True)
plt.show()
Earthquake magnitude distribution by type.

And more specifically, above magnitude 5.

sns.histplot(data=df_[df_['magnitude'] > 5].sort_values(by='magType'),
             x='magnitude',
             hue='magType',
             bins=20,
             multiple='stack',
             palette=palette)
sns.despine(left=True)
plt.show()
Tail of the distribution.

The magnitudes are based on different scales.

df_.groupby(by='magType').size().to_frame('count')
magType count
ML 6513
Md 10307
Mh 12
Mw 1456
Mx 46
Unk 1

We will stick to a single scale, as mapping one scale onto another seems difficult. They basically measure different things… – see USGS - what-are-different-magnitude-scales.

We pick the moment magnitude (Mw) scale as “[it is] based on the concept of seismic moment, is uniformly applicable to all sizes of earthquakes but is more difficult to compute than the other types” – see magnitude. The other scales do not cover the entire range of earthquakes, which creates truncated data. Note that this could introduce a selection bias, as we are not sure why a particular event uses one scale rather than another.

df__ = df_[(df_['magType'] == 'Mw')]
df__.groupby(by='magType').size()
Mw    1456
df_all = df__[['time', 'latitude', 'longitude', 'magnitude']].copy()
df_all.rename({'magnitude': 'value',
               'time': 'obs'}, inplace=True, axis=1)
df_all = df_all.dropna(axis=0)
df_all = df_all.set_index(pd.DatetimeIndex(df_all['obs'])).drop('obs', axis=1)
df_all.head()
obs latitude longitude value
1992-04-26 07:41:40.090 40.43250 -124.56600 6.45
1992-04-26 11:18:25.980 40.38283 -124.55500 6.57
1992-09-19 23:04:46.830 38.86000 -122.79183 4.47
1993-01-16 06:29:35.010 37.01833 -121.46317 4.91
1993-01-18 23:27:10.610 38.84250 -122.77800 3.92
sns.histplot(data=df_all, x='value', bins=20, color=palette['Mw'])
sns.despine(left=True)
plt.show()
CA Earthquakes between 1992 and 2021 (NCEDC data, Mw scale).

The data we are looking at cover the following time interval:

pd.DataFrame.from_dict(data={'min': [min(df_all.index)],
                             'max': [max(df_all.index)]},
                       orient='index',
                       columns=['obs'])
obs
min 1992-04-26 07:41:40.090
max 2020-12-31 04:22:03.380

and range as follows:

pd.DataFrame(df_all['value'].describe())
value
count 1456.000000
mean 3.933427
std 0.535629
min 3.000000
25% 3.560000
50% 3.850000
75% 4.200000
max 6.980000

The biggest earthquakes in CA (in this dataset and using this scale…) are:

df_all.sort_values('value', ascending=False) \
      .nlargest(columns=['value'], n=10) \
      .style.background_gradient(subset=['value'],
                                 cmap=cmap,
                                 vmin=vmin,
                                 vmax=vmax)
obs latitude longitude value
2005-06-15 02:50:51.330000 41.232670 -125.986340 6.980000
2014-03-10 05:18:13.430000 40.828670 -125.133830 6.800000
1995-02-19 04:03:14.940000 40.591840 -125.756670 6.600000
1992-04-26 11:18:25.980000 40.382830 -124.555000 6.570000
2003-12-22 19:15:56.240000 35.700500 -121.100500 6.500000
2010-01-10 00:27:39.320000 40.652000 -124.692500 6.500000
1992-04-26 07:41:40.090000 40.432500 -124.566000 6.450000
2020-05-15 11:03:26.400000 38.216170 -117.844330 6.440000
2014-08-24 10:20:44.070000 38.215170 -122.312330 6.020000
2004-09-28 17:15:24.250000 35.818160 -120.366000 5.970000
import folium

def transparent(r, g, b, a, alpha=None):
    """Override the alpha channel of an RGBA tuple if alpha is given."""
    if alpha is None:
        return (r, g, b, a)
    return (r, g, b, alpha)

def rgb2hex(r, g, b, a=None):
    """Convert an RGB(A) tuple with components in [0, 1] to a hex string."""
    if a is None:
        return "#{:02x}{:02x}{:02x}".format(int(r*255),
                                            int(g*255),
                                            int(b*255))
    return "#{:02x}{:02x}{:02x}{:02x}".format(int(r*255),
                                              int(g*255),
                                              int(b*255),
                                              int(a*255))

m = folium.Map(location=[df_all['latitude'].mean(),
                         df_all['longitude'].mean()],
               tiles="Stamen Terrain", zoom_start=6)

def add_to_map(row):
    folium.Circle(
        radius=100*row['value'],
        location=[row['latitude'], row['longitude']],
        popup="{obs} {value}".format(obs=row.name,
                                     value=row['value']),
        color=rgb2hex(*transparent(*cmap((row['value']-vmin)/(vmax - vmin)),
                                   alpha=0.5)),
        fill=True,
    ).add_to(m)

df_all.apply(add_to_map, axis=1)

m
Earthquakes in CA - NCEDC 1992-2021 (Mw scale).
from matplotlib.dates import YearLocator, DateFormatter

sns.scatterplot(data=df_all, x='obs', y='value',
                marker='.', palette=cmap, hue='value',
                hue_norm=(vmin, vmax), s=150, alpha=0.9)
ax = plt.gca()
ax.xaxis.set_major_locator(YearLocator(2))
ax.xaxis.set_major_formatter(DateFormatter('%Y'))
sns.despine(left=True)
plt.show()
Earthquakes in CA over time - NCEDC 1992-2021 (Mw scale).

The scale (Mw) seems to have been used throughout the period we are looking at.
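
One quick way to double-check is to cross-tabulate scale usage by year on the raw catalog – a sketch, using the columns of the CSV above:

# Events per year and per magnitude scale, on the full catalog.
year = pd.to_datetime(df_['time']).dt.year
pd.crosstab(year, df_['magType'])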

data_df = df_all.groupby(by='obs')['value'].max()
data_df.index = pd.to_datetime(data_df.index)

How independent are these events?

sm.graphics.tsa.plot_acf(data_df, lags=20)
sns.despine(left=True)
plt.show()
ACF.

There is a bit of autocorrelation. Not completely surprising, as “Aftershocks can continue over a period of weeks, months, or years.” – USGS glossary. To remove it, we are going to look at the max over 1-week windows. It is not perfect: aftershocks can span longer than that, and there is no guarantee that all the aftershocks of a given quake fall into the same bucket. Let’s see how it does.

data_df_ = data_df.resample('1W').max().dropna()
sm.graphics.tsa.plot_acf(data_df_.squeeze(), lags=20)
sns.despine(left=True)
plt.show()
ACF.

Looking better.
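
To put a number on it, we can run a Ljung-Box test on the resampled series (statsmodels is already imported as sm; the lags are an arbitrary choice):

# Ljung-Box test: large p-values mean we cannot reject independence
# at those lags.
sm.stats.acorr_ljungbox(data_df_, lags=[5, 10, 20], return_df=True)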

sns.scatterplot(data=data_df_.to_frame('value'), x='obs', y='value',
                marker='.', palette=cmap, hue='value',
                hue_norm=(vmin, vmax), s=150, alpha=0.9)
ax = plt.gca()
ax.xaxis.set_major_locator(YearLocator(2))
ax.xaxis.set_major_formatter(DateFormatter('%Y'))
sns.despine(left=True)
plt.show()
Largest earthquakes per 1-week period in CA over time - NCEDC 1992-2021 (Mw scale).

Let’s put the data in the form pyextremes expects.

data = data_df_.to_frame('value').squeeze()
data
    obs
    1992-04-26    6.57
    1992-09-20    4.47
    1993-01-17    4.91
    1993-01-24    3.92
    1993-02-21    3.88
                  ...
    2020-12-06    4.96
    2020-12-13    4.77
    2020-12-20    4.81
    2020-12-27    3.86
    2021-01-03    3.63
    Name: value, Length: 777, dtype: float64

BM model

model = EVA(data=data)
model
                           Univariate Extreme Value Analysis
===========================================================================
                                      Source Data
---------------------------------------------------------------------------
Data label:                         value     Size:                     777
Start:                         April 1992     End:             January 2021
===========================================================================
                                     Extreme Values
---------------------------------------------------------------------------
Extreme values have not been extracted
===========================================================================
                                         Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
model.get_extremes(
    method="BM",
    extremes_type="high",
    block_size="365.2425D",
    errors="ignore",
)
model
                           Univariate Extreme Value Analysis
===========================================================================
                                      Source Data
---------------------------------------------------------------------------
Data label:                         value     Size:                     777
Start:                         April 1992     End:             January 2021
===========================================================================
                                     Extreme Values
---------------------------------------------------------------------------
Count:                                 29     Extraction method:         BM
Type:                                high     Block size: 365 days 05:49:12
===========================================================================
                                         Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
model.plot_extremes()
sns.despine(left=True)
plt.show()
Block Maxima.

MLE

model.fit_model()
model
                           Univariate Extreme Value Analysis
===========================================================================
                         Source Data
---------------------------------------------------------------------------
Data label:            value      Size:                                 777
Start:            April 1992      End:                         January 2021
===========================================================================
                        Extreme Values
---------------------------------------------------------------------------
Count:                    29      Extraction method:                     BM
Type:                   high      Block size:             365 days 05:49:12
===========================================================================
                            Model
---------------------------------------------------------------------------
Model:                   MLE      Distribution:                    gumbel_r
Log-likelihood:      -28.048      AIC:                               60.558
---------------------------------------------------------------------------
Free parameters:   loc=5.340      Fixed parameters: All parameters are free
                 scale=0.550
===========================================================================
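
fit_model picked a Gumbel (gumbel_r) here, i.e. a GEV with the shape parameter fixed at zero, selected by AIC. If you would rather fit the full GEV with a free shape parameter, the distribution can be passed explicitly – a sketch, as I read the library docs:

# Force the full GEV instead of letting the library select by AIC.
model.fit_model(distribution="genextreme")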
model.plot_diagnostic(alpha=0.95)
sns.despine(left=True)
plt.show()
BM diagnostic plots.
model.plot_return_values(
    return_period=np.logspace(0.01, 2, 100),
    return_period_size="365.2425D",
    alpha=0.95,
)
sns.despine(left=True)
plt.show()
BM return values.
summary_bm_mle = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=1000,
)
summary_bm_mle.style.background_gradient(cmap=cmap, vmin=vmin, vmax=vmax)
return period return value lower ci upper ci
2.0 5.541965 5.327484 5.777972
5.0 6.165502 5.836295 6.472050
10.0 6.578338 6.165001 6.960001
25.0 7.099957 6.576074 7.581792
50.0 7.486924 6.863262 8.037220
100.0 7.871033 7.156755 8.490038
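
We can sanity-check these numbers by hand: with the fitted Gumbel parameters (loc=5.340, scale=0.550), the \(m\)-year return level is simply the \(1 - 1/m\) quantile of the fitted distribution.

from scipy.stats import gumbel_r

# m-year return level = (1 - 1/m) quantile of the fitted annual-max law.
for m in [2, 5, 10, 25, 50, 100]:
    print(m, round(gumbel_r.ppf(1 - 1/m, loc=5.340, scale=0.550), 3))
# e.g. m=100 gives ~7.871, matching the table above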

MCMC

The library also supports fitting the distribution using a Bayesian approach (MCMC via emcee). I won’t go over it in detail here; check the tutorials that come with the library.
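
For completeness, a minimal sketch of what that looks like – this is what plot_trace and plot_corner were imported for above; the walker/sample counts are illustrative, not tuned:

# Minimal sketch (not run here): Bayesian fit of the extremes via emcee.
model.fit_model(model="Emcee", n_walkers=100, n_samples=500)
# plot_trace / plot_corner can then be used to inspect the chains
# and the posterior of the fitted parameters.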

POT

POT requires selecting a threshold. One way to do that is to fit the GPD for a range of thresholds and look at the stability of its parameters. This is what the following does:

pyextremes.plot_parameter_stability(ts=data, thresholds=np.linspace(4, 6, 50))
sns.despine(left=True)
plt.show()
Parameter stability plot.
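
Another classic check the library provides is the mean residual life plot; the idea is to pick a threshold above which the plot is roughly linear:

# Mean excess over each candidate threshold; it should be roughly
# linear above any threshold for which the GPD is a good model.
pyextremes.plot_mean_residual_life(ts=data)
sns.despine(left=True)
plt.show()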

We then pick a threshold in the range where the scale and shape parameters of the GPD are stable.

threshold = 4.25
model = EVA(data=data)
model
                           Univariate Extreme Value Analysis
===========================================================================
                                      Source Data
---------------------------------------------------------------------------
Data label:                         value      Size:                    777
Start:                         April 1992      End:            January 2021
===========================================================================
                                     Extreme Values
---------------------------------------------------------------------------
Extreme values have not been extracted
===========================================================================
                                         Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
sns.histplot(data=data[data>threshold], bins=20)
sns.despine(left=True)
plt.show()
Distribution of the earthquakes above the threshold (still weekly maxima, Mw scale, 1992-2021).
model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold=threshold)
model
                           Univariate Extreme Value Analysis
===========================================================================
                                  Source Data
---------------------------------------------------------------------------
Data label:                     value      Size:                       777
Start:                     April 1992      End:               January 2021
===========================================================================
                                 Extreme Values
---------------------------------------------------------------------------
Count:                            237      Extraction method:          POT
Type:                            high      Threshold:                 4.25
===========================================================================
                                     Model
---------------------------------------------------------------------------
Model has not been fit to the extremes
===========================================================================
model.plot_extremes()
sns.despine(left=True)
plt.show()
Selected values.
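
Side note: if I read the pyextremes docs correctly, get_extremes with method="POT" also accepts an r parameter (a declustering window, default "24H"), which could replace the manual weekly-max resampling done earlier – a sketch, not used below:

# Alternative extraction: decluster at extraction time; exceedances
# within r of each other are merged into a single extreme event.
model.get_extremes(
    method="POT",
    extremes_type="high",
    threshold=threshold,
    r="7D",  # mirrors the 1-week windows used above; value is a choice
)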
model.fit_model()
model
                           Univariate Extreme Value Analysis
===========================================================================
                                  Source Data
---------------------------------------------------------------------------
Data label:                     value      Size:                        777
Start:                     April 1992      End:                January 2021
===========================================================================
                                 Extreme Values
---------------------------------------------------------------------------
Count:                            237      Extraction method:           POT
Type:                            high      Threshold:                  4.25
===========================================================================
                                     Model
---------------------------------------------------------------------------
Model:                            MLE      Distribution:              expon
Log-likelihood:               -77.084      AIC:                     156.185
---------------------------------------------------------------------------
Free parameters:          scale=0.509      Fixed parameters:     floc=4.250
===========================================================================
model.plot_diagnostic(alpha=0.95)
sns.despine(left=True)
plt.show()
POT diagnostic plots.
model.plot_return_values(
    return_period=np.logspace(0.01, 2, 100),
    return_period_size="365.2425D",
    alpha=0.95,
)
sns.despine(left=True)
plt.show()
POT return values.
summary_pot_mle = model.get_summary(
    return_period=[2, 5, 10, 25, 50, 100],
    alpha=0.95,
    n_samples=1000,
)
summary_pot_mle.style.background_gradient(cmap=cmap, vmin=vmin, vmax=vmax)
return period return value lower ci upper ci
2.0 5.678355 5.508168 5.870555
5.0 6.145006 5.919218 6.399999
10.0 6.498014 6.230166 6.800508
25.0 6.964665 6.641216 7.329952
50.0 7.317673 6.952163 7.730461
100.0 7.670680 7.263111 8.130970
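
As with BM, a sanity check by hand: the fitted model is exponential exceedances over \(u = 4.25\) (a GPD with zero shape), for which the \(m\)-year return level has a closed form, \(z_m = u + \sigma \ln(\lambda m)\), where \(\lambda\) is the yearly rate of threshold exceedances.

# Closed-form return levels for exponential exceedances:
# z_m = u + scale * ln(lambda * m), lambda = exceedances per year.
u, scale = 4.25, 0.509
years = (data.index[-1] - data.index[0]).days / 365.2425  # ~28.7 years
lam = 237 / years                                         # ~8.3 per year
for m in [2, 5, 10, 25, 50, 100]:
    print(m, round(u + scale * np.log(lam * m), 3))
# e.g. m=100 gives ~7.67, matching the table above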

Comparison

Let’s compare what both methods give.

df_bm = summary_bm_mle
df_pot = summary_pot_mle
# bm
plt.errorbar(df_bm.index,
             df_bm['return value'],
             yerr=[df_bm['return value'] - df_bm['lower ci'],
                   df_bm['upper ci'] - df_bm['return value']],
             fmt='-ob', label='return value (bm)');
plt.fill_between(df_bm.index,
                 df_bm['lower ci'],
                 df_bm['upper ci'],
                 color='b',
                 alpha=0.2)
# pot
plt.errorbar(df_pot.index,
             df_pot['return value'],
             yerr=[df_pot['return value'] - df_pot['lower ci'],
                   df_pot['upper ci'] - df_pot['return value']],
             fmt='-or',
             label='return value (pot)');
plt.fill_between(df_pot.index,
                 df_pot['lower ci'],
                 df_pot['upper ci'],
                 color='r',
                 alpha=0.2)
plt.grid(True, which="both", linestyle='--', linewidth=1)
plt.gca().set_xscale('log')
plt.title('CA earthquakes - return values - bm and pot 95% CIs')
plt.legend()
sns.despine(left=True)
plt.show()
Comparison of the confidence intervals for the return values with BM and POT.
pd.concat([summary_bm_mle, summary_pot_mle],
          axis=1,
          keys=['bm', 'pot']).style.background_gradient(cmap=cmap,
                                                        vmin=vmin,
                                                        vmax=vmax)
bm pot
return period return value lower ci upper ci return value lower ci upper ci
2.0 5.541965 5.327484 5.777972 5.678355 5.508168 5.870555
5.0 6.165502 5.836295 6.472050 6.145006 5.919218 6.399999
10.0 6.578338 6.165001 6.960001 6.498014 6.230166 6.800508
25.0 7.099957 6.576074 7.581792 6.964665 6.641216 7.329952
50.0 7.486924 6.863262 8.037220 7.317673 6.952163 7.730461
100.0 7.871033 7.156755 8.490038 7.670680 7.263111 8.130970

Let’s see what we had between 1992 and 2021:

df_all.sort_values('value', ascending=False) \
      .nlargest(columns=['value'], n=10) \
      .style.background_gradient(subset=['value'],
                                 cmap=cmap,
                                 vmin=vmin,
                                 vmax=vmax)
obs latitude longitude value
2005-06-15 02:50:51.330000 41.232670 -125.986340 6.980000
2014-03-10 05:18:13.430000 40.828670 -125.133830 6.800000
1995-02-19 04:03:14.940000 40.591840 -125.756670 6.600000
1992-04-26 11:18:25.980000 40.382830 -124.555000 6.570000
2003-12-22 19:15:56.240000 35.700500 -121.100500 6.500000
2010-01-10 00:27:39.320000 40.652000 -124.692500 6.500000
1992-04-26 07:41:40.090000 40.432500 -124.566000 6.450000
2020-05-15 11:03:26.400000 38.216170 -117.844330 6.440000
2014-08-24 10:20:44.070000 38.215170 -122.312330 6.020000
2004-09-28 17:15:24.250000 35.818160 -120.366000 5.970000

Quakes between 6.5 and 7.2 (Mw) have definitely been observed between 1992 and 2021 – which is roughly the CI for the return value at a 20-year horizon. Remember though that, by the very definition of the return value, they did not have to be.
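
For reference, six events in the top-10 table above are at or above Mw 6.5; a one-liner to pull them out:

# Observed events at or above Mw 6.5 (cf. the top-10 table above).
df_all[df_all['value'] >= 6.5]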

Conclusion

Cool piece of theory to look at, and a nice lib to play with!
