SARIMAX: モデル選択、欠損データ¶

この例は、Durbin と Koopman (2012) の第 8.4 章で Box-Jenkins 手法を ARMA モデルに適用した例を反映しています。新しい機能は、欠損値のあるデータセットでモデルが機能する機能です。

[1]:

%matplotlib inline

[2]:

import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt

[3]:

import requests
from io import BytesIO
from zipfile import ZipFile

# Download the dataset
df = pd.read_table(
    "https://raw.githubusercontent.com/jrnold/ssmodels-in-stan/master/StanStateSpace/data-raw/DK-data/internet.dat",
    skiprows=1, header=None, sep='\s+', engine='python',
    names=['internet','dinternet']
)

モデルの選択¶

Durbin と Koopman と同様に、いくつかの値を欠損値として強制します。

[4]:

# Get the basic series
dta_full = df.dinternet[1:].values
dta_miss = dta_full.copy()

# Remove datapoints
missing = np.r_[6,16,26,36,46,56,66,72,73,74,75,76,86,96]-1
dta_miss[missing] = np.nan

次に、赤池情報量規準（AIC）を用いてモデル選択を検討できますが、各変種でモデルを実行し、最もAIC値が低いモデルを選択します。

ここで注意すべき点がいくつかあります：

特に自己回帰（AR）および移動平均（MA）の次数が大きくなると、大量のモデルを実行する際に最大尤度推定の収束が不良になる可能性があります。この例は説明用であるため、警告は無視しています。
enforce_invertibility=False オプションを使用して、移動平均多項式が非可逆であってもよいようにし、より多くのモデルが推定可能になるようにしています。
いくつかのモデルは良好な結果を出さず、その　AIC　値は　NaN　に設定されます。これは驚くべきことではなく、DurbinとKoopmanは高次モデルに関する数値的な問題を指摘しています。

[5]:

import warnings

aic_full = pd.DataFrame(np.zeros((6,6), dtype=float))
aic_miss = pd.DataFrame(np.zeros((6,6), dtype=float))

warnings.simplefilter('ignore')

# Iterate over all ARMA(p,q) models with p,q in [0,6]
for p in range(6):
    for q in range(6):
        if p == 0 and q == 0:
            continue

        # Estimate the model with no missing datapoints
        mod = sm.tsa.statespace.SARIMAX(dta_full, order=(p,0,q), enforce_invertibility=False)
        try:
            res = mod.fit(disp=False)
            aic_full.iloc[p,q] = res.aic
        except:
            aic_full.iloc[p,q] = np.nan

        # Estimate the model with missing datapoints
        mod = sm.tsa.statespace.SARIMAX(dta_miss, order=(p,0,q), enforce_invertibility=False)
        try:
            res = mod.fit(disp=False)
            aic_miss.iloc[p,q] = res.aic
        except:
            aic_miss.iloc[p,q] = np.nan

完全な（欠損のない）データセットで推定されたモデルにおいて、AICはARMA(1,1)またはARMA(3,0)を選択します。DurbinとKoopmanは、ARMA(1,1)の仕様が簡潔さのためにより優れていると示唆しています。

\[\begin{split}\text\{再現:\}\\ \textbf\{表 8.1\} ~~ \text\{異なるARMAモデルに対するAIC.\}\\ \renewcommand{\r}[1]{\color{red}{#1}} \begin{array}{lrrrrrr} \hline q & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline p & {} & {} & {} & {} & {} & {} \\ 0 & 0.00 & 549.81 & 519.87 & 520.27 & 519.38 & 518.86 \\ 1 & 529.24 & \r{514.30} & 516.25 & 514.58 & 515.10 & 516.28 \\ 2 & 522.18 & 516.29 & 517.16 & 515.77 & 513.24 & 514.73 \\ 3 & \r{511.99} & 513.94 & 515.92 & 512.06 & 513.72 & 514.50 \\ 4 & 513.93 & 512.89 & nan & nan & 514.81 & 516.08 \\ 5 & 515.86 & 517.64 & nan & nan & nan & nan \\ \hline \end{array}\end{split}\]

欠損データセットで推定されたモデルにおいて、AICはARMA(1,1)を選択します。

\[\begin{split}\text{再現:}\\ \textbf{表 8.2} ~~ \text{欠損観測値を含む異なるARMAモデルに対するAIC.}\\ \begin{array}{lrrrrrr} \hline q & 0 & 1 & 2 & 3 & 4 & 5 \\ \hline p & {} & {} & {} & {} & {} & {} \\ 0 & 0.00 & 488.93 & 464.01 & 463.86 & 462.63 & 463.62 \\ 1 & 468.01 & \r{457.54} & 459.35 & 458.66 & 459.15 & 461.01 \\ 2 & 469.68 & nan & 460.48 & 459.43 & 459.23 & 460.47 \\ 3 & 467.10 & 458.44 & 459.64 & 456.66 & 459.54 & 460.05 \\ 4 & 469.00 & 459.52 & nan & 463.04 & 459.35 & 460.96 \\ 5 & 471.32 & 461.26 & nan & nan & 461.00 & 462.97 \\ \hline \end{array}\end{split}\]

注: AICの値はDurbinとKoopmanとは異なる方法で計算されていますが、全体的に似たような傾向を示しています。

事後分析¶

上記で選択したARMA(1,1)仕様を使用して、標本内予測と標本外予測を実施します。

[6]:

# Statespace
mod = sm.tsa.statespace.SARIMAX(dta_miss, order=(1,0,1))
res = mod.fit(disp=False)
print(res.summary())

                               SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   99
Model:               SARIMAX(1, 0, 1)   Log Likelihood                -225.770
Date:                Wed, 28 Aug 2024   AIC                            457.541
Time:                        14:58:41   BIC                            465.326
Sample:                             0   HQIC                           460.691
                                 - 99
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6562      0.092      7.125      0.000       0.476       0.837
ma.L1          0.4878      0.111      4.390      0.000       0.270       0.706
sigma2        10.3402      1.569      6.591      0.000       7.265      13.415
===================================================================================
Ljung-Box (L1) (Q):                   0.00   Jarque-Bera (JB):                 1.87
Prob(Q):                              0.96   Prob(JB):                         0.39
Heteroskedasticity (H):               0.59   Skew:                            -0.10
Prob(H) (two-sided):                  0.13   Kurtosis:                         3.64
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

[7]:

# In-sample one-step-ahead predictions, and out-of-sample forecasts
nforecast = 20
predict = res.get_prediction(end=mod.nobs + nforecast)
idx = np.arange(len(predict.predicted_mean))
predict_ci = predict.conf_int(alpha=0.5)

# Graph
fig, ax = plt.subplots(figsize=(12,6))
ax.xaxis.grid()
ax.plot(dta_miss, 'k.')

# Plot
ax.plot(idx[:-nforecast], predict.predicted_mean[:-nforecast], 'gray')
ax.plot(idx[-nforecast:], predict.predicted_mean[-nforecast:], 'k--', linestyle='--', linewidth=2)
ax.fill_between(idx, predict_ci[:, 0], predict_ci[:, 1], alpha=0.15)

ax.set(title='Figure 8.9 - Internet series');

../../../_images/examples_notebooks_generated_statespace_sarimax_internet_12_0.png

最終更新日: 2025年01月28日