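The Datasets Package

statsmodels provides datasets (i.e. data and meta-data) for use in examples, tutorials, model testing, and so on.

Using Datasets from Stata

`webuse`(data[, baseurl, as_df])	Download and return an example dataset from Stata.

Using Datasets from R

The Rdatasets project gives access to the datasets available in R's core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible through the data attribute. For example: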

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "carData")

In [3]: print(duncan_prestige.__doc__)
.. container::

   .. container::

      ====== ===============
      Duncan R Documentation
      ====== ===============

      .. rubric:: Duncan's Occupational Prestige Data
         :name: duncans-occupational-prestige-data

      .. rubric:: Description
         :name: description

      The ``Duncan`` data frame has 45 rows and 4 columns. Data on the
      prestige and other characteristics of 45 U. S. occupations in
      1950.

      .. rubric:: Usage
         :name: usage

      .. code:: R

         Duncan

      .. rubric:: Format
         :name: format

      This data frame contains the following columns:

      type
         Type of occupation. A factor with the following levels:
         ``prof``, professional and managerial; ``wc``, white-collar;
         ``bc``, blue-collar.

      income
         Percentage of occupational incumbents in the 1950 US Census who
         earned $3,500 or more per year (about $36,000 in 2017 US
         dollars).

      education
         Percentage of occupational incumbents in 1950 who were high
         school graduates (which, were we cynical, we would say is
         roughly equivalent to a PhD in 2017)

      prestige
         Percentage of respondents in a social survey who rated the
         occupation as “good” or better in prestige

      .. rubric:: Source
         :name: source

      Duncan, O. D. (1961) A socioeconomic index for all occupations. In
      Reiss, A. J., Jr. (Ed.) *Occupations and Social Status.* Free
      Press [Table VI-1].

      .. rubric:: References
         :name: references

      Fox, J. (2016) *Applied Regression Analysis and Generalized Linear
      Models*, Third Edition. Sage.

      Fox, J. and Weisberg, S. (2019) *An R Companion to Applied
      Regression*, Third Edition, Sage.


In [4]: duncan_prestige.data.head(5)
Out[4]: 
            type  income  education  prestige
rownames                                     
accountant  prof      62         86        82
pilot       prof      72         76        83
architect   prof      75         92        90
author      prof      55         90        76
chemist     prof      64         86        90

R Datasets Function Reference

`get_rdataset`(dataname[, package, cache])	Download and return an R dataset.
`get_data_home`([data_home])	Return the path of the statsmodels data directory.
`clear_data_home`([data_home])	Delete all the content of the data home cache.

Available Datasets

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load_pandas()

The Dataset object follows the bunch pattern. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog.iloc[:5]
Out[8]: 
0    60323.0
1    61122.0
2    60171.0
3    61187.0
4    63221.0
Name: TOTEMP, dtype: float64

In [9]: data.exog.iloc[:5,:]
Out[9]: 
   GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4     96.2  328975.0  2099.0  3099.0  112075.0  1951.0

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

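If the dataset does not have a clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains the full dataset, and the names attribute lists its column names. In the session shown here, which loads the data with load_pandas, both data and raw_data are pandas DataFrames: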

In [12]: type(data.data)
Out[12]: pandas.core.frame.DataFrame

In [13]: type(data.raw_data)
Out[13]: pandas.core.frame.DataFrame

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

Loading data as pandas objects

For many users it may be preferable to get the datasets as pandas DataFrame or Series objects. Each dataset module provides a load_pandas method, which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata is attached to the model results:

In [19]: y, x = data.endog, data.exog

In [20]: res = sm.OLS(y, x).fit()

In [21]: res.params
Out[21]: 
GNPDEFL   -52.993570
GNP         0.071073
UNEMP      -0.423466
ARMED      -0.572569
POP        -0.414204
YEAR       48.417866
dtype: float64

In [22]: res.summary()
Out[22]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 TOTEMP   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          5.052e+04
Date:                Tue, 28 Jan 2025   Prob (F-statistic):                    8.20e-22
Time:                        00:01:28   Log-Likelihood:                         -117.56
No. Observations:                  16   AIC:                                      247.1
Df Residuals:                      10   BIC:                                      251.8
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GNPDEFL      -52.9936    129.545     -0.409      0.691    -341.638     235.650
GNP            0.0711      0.030      2.356      0.040       0.004       0.138
UNEMP         -0.4235      0.418     -1.014      0.335      -1.354       0.507
ARMED         -0.5726      0.279     -2.052      0.067      -1.194       0.049
POP           -0.4142      0.321     -1.289      0.226      -1.130       0.302
YEAR          48.4179     17.689      2.737      0.021       9.003      87.832
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   1.277
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                0.605
Skew:                           0.476   Prob(JB):                        0.739
Kurtosis:                       3.031   Cond. No.                     4.56e+05
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Extra Information

If you want to know more about a dataset itself, you can access the following attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional information

The idea for a datasets package was originally proposed by David Cournapeau.
To add datasets, see the notes on adding a dataset.

Last update: Jan 28, 2025
