<Point>
- One of the variable-selection methods: it fits every possible combination of the specified predictors and returns the best model
--- Also known as the exhaustive (all-subsets) method
--- The computational cost is high, so it is infeasible for large datasets with many predictors (see the illustration below)
- Stepwise selection and related methods are alternatives that keep the computational cost lower
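As a rough illustration of the cost (a sketch, not part of the original workflow): with p predictors the exhaustive search has to consider 2^p candidate subsets, so the search space explodes quickly.
# Illustration: number of candidate subsets the exhaustive search must consider
p <- c(5, 10, 20, 30)
data.frame(predictors = p, candidate_models = 2^p)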
<Point>
tidyverse --- for easy data manipulation and visualization
caret --- to streamline the machine-learning workflow
leaps --- to compute best subsets regression
# Load libraries
library(tidyverse)
library(caret)
library(leaps)
<Point>
- Predict the fertility score from the swiss socio-economic indicators
# Load data
data("swiss")
# Inspect the data
swiss %>% as_tibble()
## # A tibble: 47 x 6
## Fertility Agriculture Examination Education Catholic Infant.Mortality
## * <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 80.2 17 15 12 9.96 22.2
## 2 83.1 45.1 6 9 84.8 22.2
## 3 92.5 39.7 5 5 93.4 20.2
## 4 85.8 36.5 12 7 33.8 20.3
## 5 76.9 43.5 17 15 5.16 20.6
## 6 76.1 35.3 9 7 90.6 26.6
## 7 83.8 70.2 16 7 92.8 23.6
## 8 92.4 67.8 14 8 97.2 24.9
## 9 82.4 53.3 12 7 97.7 21
## 10 82.9 45.2 16 13 91.4 24.4
## # ... with 37 more rows
# Data overview
swiss %>% glimpse()
## Observations: 47
## Variables: 6
## $ Fertility <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92....
## $ Agriculture <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67....
## $ Examination <int> 15, 6, 5, 12, 17, 9, 16, 14, 12, 16, 14, 21, ...
## $ Education <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13, 6, 12, 7, 12...
## $ Catholic <dbl> 9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85...
## $ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24....
<Point>
- The maximum number of predictors is passed as an argument (nvmax)
--- Here we specify 5, so the best model for each size from 1 to 5 predictors is returned
# Build the model
models <- regsubsets(Fertility~., data = swiss, nvmax = 5)
# Object structure
models %>% attributes()
## $names
## [1] "np" "nrbar" "d" "rbar" "thetab"
## [6] "first" "last" "vorder" "tol" "rss"
## [11] "bound" "nvmax" "ress" "ir" "nbest"
## [16] "lopt" "il" "ier" "xnames" "method"
## [21] "force.in" "force.out" "sserr" "intercept" "lindep"
## [26] "nullrss" "nn" "call"
##
## $class
## [1] "regsubsets"
<Point>
- The best model is reported for each number of predictors
--- An asterisk indicates that the variable is included in the corresponding model
# Summary
res.sum <- models %>% summary()
res.sum %>% print()
## Subset selection object
## Call: regsubsets.formula(Fertility ~ ., data = swiss, nvmax = 5)
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## Agriculture Examination Education Catholic Infant.Mortality
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" "*" " "
## 3 ( 1 ) " " " " "*" "*" "*"
## 4 ( 1 ) "*" " " "*" "*" "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
# Object structure
models %>% summary() %>% attributes()
## $names
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
##
## $class
## [1] "summary.regsubsets"
<Point>
- summary() reports statistics such as R2, adjusted R2, Cp, and BIC
--- The optimal model maximizes R2 and minimizes the prediction-error criteria (RSS, Cp, and BIC)
<Note>
- Note that these statistics are computed on the model fit to the training data
--- Accuracy should also be checked on held-out data with k-fold cross-validation
# Statistics for each model
data.frame(
rsq = res.sum$rsq,
rss = res.sum$rss,
adjr2 = res.sum$adjr2,
cp = res.sum$cp,
bic = res.sum$bic
) %>%
magrittr::set_rownames(paste("model", 1:5, sep = "")) %>%
round(2)
## rsq rss adjr2 cp bic
## model1 0.44 4015.24 0.43 35.20 -19.60
## model2 0.57 3054.17 0.56 18.49 -28.61
## model3 0.66 2422.25 0.64 8.18 -35.66
## model4 0.70 2158.07 0.67 5.03 -37.23
## model5 0.71 2105.04 0.67 6.00 -34.55
# Select the best model by each criterion
data.frame(
Adj.R2 = which.max(res.sum$adjr2),
CP = which.min(res.sum$cp),
BIC = which.min(res.sum$bic)
)
## Adj.R2 CP BIC
## 1 5 4 4
<Point>
- Define a function that builds a linear-regression formula from a regsubsets() result
<Arguments>
- id : number of predictors
- object : regsubsets object
- outcome : name of the response variable
# Define the function
get_model_formula <- function(id, object, outcome){
# Extract the inclusion flags for the model with id predictors (drop the intercept column)
models <- summary(object)$which[id,-1]
# Concatenate the selected predictors with "+"
predictors <- names(which(models == TRUE)) %>% paste(collapse = "+")
# Combine with the outcome and convert to a formula
as.formula(paste0(outcome, "~", predictors))
}
# Build the initial model
models <- regsubsets(Fertility~., data = swiss, nvmax = 5)
models$xnames
## [1] "(Intercept)" "Agriculture" "Examination"
## [4] "Education" "Catholic" "Infant.Mortality"
# Best model with 3 predictors
get_model_formula(3, models, "Fertility")
## Fertility ~ Education + Catholic + Infant.Mortality
## <environment: 0x00000000186079d0>
<Point>
- Get the cross-validation error for a given model
--- Wrap the cross-validation procedure in a function
<Arguments>
- model.formula : formula to cross-validate
- data : the original dataset
# Define the function
get_cv_error <- function(model.formula, data){
# Set random seed
set.seed(1)
# Training control: 5-fold cross-validation
train.control <- trainControl(method = "cv", number = 5)
# Cross-validation with a linear model
cv <- train(model.formula, data = data, method = "lm", trControl = train.control)
# Return the model accuracy (RMSE)
cv$results$RMSE
}
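As a quick sanity check (an illustrative call; output not shown), the helper can be applied to a single formula before mapping over all model sizes:
# Illustrative single call: 5-fold CV RMSE for the 3-predictor model
get_cv_error(Fertility ~ Education + Catholic + Infant.Mortality, swiss)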
<Point>
- Run cross-validation on the best model for each number of predictors and collect the RMSE
--- This determines how many predictors the best model should have
# Set the number of predictors to 1-5
model.ids <- 1:5
model.ids
## [1] 1 2 3 4 5
# Get the RMSE of the best model for each number of predictors
cv.errors <-
model.ids %>%
map(get_model_formula, models, "Fertility") %>%
map(get_cv_error, data = swiss) %>%
unlist()
# Print the RMSE values
cv.errors %>% set_names(model.ids) %>% round(2)
## 1 2 3 4 5
## 9.42 8.45 7.93 7.68 7.92
# Number of predictors in the best model
cv.errors %>% which.min()
## [1] 4
# Coefficients of the best model
models %>% coef(4)
## (Intercept) Agriculture Education Catholic
## 62.1013116 -0.1546175 -0.9802638 0.1246664
## Infant.Mortality
## 1.0784422
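To see the usual coefficient table and p-values for the selected model, it can be refit directly with lm() (a quick check; the object name best.model is only for illustration, output omitted):
# Refit the selected 4-predictor model with lm() for the full summary
best.model <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
best.model %>% summary()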
<Point>
- Stepwise selection adds and/or removes predictors one at a time, so it is much cheaper than the exhaustive search above
<Point>
tidyverse --- for easy data manipulation and visualization
caret --- to streamline the machine-learning workflow
leaps --- to compute best subsets regression
# Load libraries
library(tidyverse)
library(caret)
library(leaps)
<Point>
- Predict the fertility score from the swiss socio-economic indicators
# Load data
data("swiss")
# Inspect the data
swiss %>% as_tibble()
## # A tibble: 47 x 6
## Fertility Agriculture Examination Education Catholic Infant.Mortality
## * <dbl> <dbl> <int> <int> <dbl> <dbl>
## 1 80.2 17 15 12 9.96 22.2
## 2 83.1 45.1 6 9 84.8 22.2
## 3 92.5 39.7 5 5 93.4 20.2
## 4 85.8 36.5 12 7 33.8 20.3
## 5 76.9 43.5 17 15 5.16 20.6
## 6 76.1 35.3 9 7 90.6 26.6
## 7 83.8 70.2 16 7 92.8 23.6
## 8 92.4 67.8 14 8 97.2 24.9
## 9 82.4 53.3 12 7 97.7 21
## 10 82.9 45.2 16 13 91.4 24.4
## # ... with 37 more rows
# Data overview
swiss %>% glimpse()
## Observations: 47
## Variables: 6
## $ Fertility <dbl> 80.2, 83.1, 92.5, 85.8, 76.9, 76.1, 83.8, 92....
## $ Agriculture <dbl> 17.0, 45.1, 39.7, 36.5, 43.5, 35.3, 70.2, 67....
## $ Examination <int> 15, 6, 5, 12, 17, 9, 16, 14, 12, 16, 14, 21, ...
## $ Education <int> 12, 9, 5, 7, 15, 7, 7, 8, 7, 13, 6, 12, 7, 12...
## $ Catholic <dbl> 9.96, 84.84, 93.40, 33.77, 5.16, 90.57, 92.85...
## $ Infant.Mortality <dbl> 22.2, 22.2, 20.2, 20.3, 20.6, 26.6, 23.6, 24....
<Point>
- Build a model with all predictors as the initial model
- There are 5 predictors
# Model with all predictors
full.model <- lm(Fertility ~., data = swiss)
# Print
full.model %>% print()
##
## Call:
## lm(formula = Fertility ~ ., data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Examination Education
## 66.9152 -0.1721 -0.2580 -0.8709
## Catholic Infant.Mortality
## 0.1041 1.0770
<Point>
- MASS::stepAIC() selects the best model by AIC
- The lm object for the full model (all predictors) is supplied as the starting model
- The predictors were narrowed down to 4
--- Examination was dropped
# Stepwise regression, adding and dropping variables in both directions
step.model <- full.model %>% MASS::stepAIC(direction = "both", trace = TRUE)
## Start: AIC=190.69
## Fertility ~ Agriculture + Examination + Education + Catholic +
## Infant.Mortality
##
## Df Sum of Sq RSS AIC
## - Examination 1 53.03 2158.1 189.86
## <none> 2105.0 190.69
## - Agriculture 1 307.72 2412.8 195.10
## - Infant.Mortality 1 408.75 2513.8 197.03
## - Catholic 1 447.71 2552.8 197.75
## - Education 1 1162.56 3267.6 209.36
##
## Step: AIC=189.86
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## + Examination 1 53.03 2105.0 190.69
## - Agriculture 1 264.18 2422.2 193.29
## - Infant.Mortality 1 409.81 2567.9 196.03
## - Catholic 1 956.57 3114.6 205.10
## - Education 1 2249.97 4408.0 221.43
# Print
step.model %>% print()
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
step.model %>% attributes()
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
## [13] "anova"
##
## $class
## [1] "lm"
<Point>
- Compute the best model for each number of predictors
--- Best subsets regression
--- Which number of predictors is best is then assessed by cross-validation
- The method argument selects the stepwise search strategy (other options are shown in the example below)
- There is no need to prepare an initial (full-variable) model in advance
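Besides "seqrep", the method argument of regsubsets() also accepts "forward" and "backward"; for example (illustration only, output omitted):
# Other search strategies: forward and backward selection
regsubsets(Fertility~., data = swiss, nvmax = 5, method = "forward")
regsubsets(Fertility~., data = swiss, nvmax = 5, method = "backward")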
# Build the model
models <- regsubsets(Fertility~., data = swiss, nvmax = 5, method = "seqrep")
# Summary
models %>% summary()
## Subset selection object
## Call: regsubsets.formula(Fertility ~ ., data = swiss, nvmax = 5, method = "seqrep")
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: 'sequential replacement'
## Agriculture Examination Education Catholic Infant.Mortality
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" "*" " "
## 3 ( 1 ) " " " " "*" "*" "*"
## 4 ( 1 ) "*" "*" "*" "*" " "
## 5 ( 1 ) "*" "*" "*" "*" "*"
# Set random seed
set.seed(123)
# Set up k-fold cross-validation
train.control <- trainControl(method = "cv", number = 10)
# Train the model
step.model <- train(Fertility ~., data = swiss,
method = "leapBackward",
tuneGrid = data.frame(nvmax = 1:5),
trControl = train.control
)
<Point>
- Compare the accuracy statistics of the best model for each model size
- Here the model with 4 predictors is best
# Model accuracy
step.model$results %>% round(2)
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 9.30 0.41 7.91 1.53 0.39 1.65
## 2 2 9.08 0.51 7.75 1.66 0.25 1.40
## 3 3 8.07 0.66 6.55 1.84 0.22 1.57
## 4 4 7.27 0.73 5.93 2.14 0.24 1.67
## 5 5 7.38 0.75 6.03 2.23 0.24 1.64
# Best model
step.model$bestTune
## nvmax
## 4 4
# Summary
step.model$finalModel %>% summary()
## Subset selection object
## 5 Variables (and intercept)
## Forced in Forced out
## Agriculture FALSE FALSE
## Examination FALSE FALSE
## Education FALSE FALSE
## Catholic FALSE FALSE
## Infant.Mortality FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: backward
## Agriculture Examination Education Catholic Infant.Mortality
## 1 ( 1 ) " " " " "*" " " " "
## 2 ( 1 ) " " " " "*" "*" " "
## 3 ( 1 ) " " " " "*" "*" "*"
## 4 ( 1 ) "*" " " "*" "*" "*"
# Regression coefficients
step.model$finalModel %>% coef(4)
## (Intercept) Agriculture Education Catholic
## 62.1013116 -0.1546175 -0.9802638 0.1246664
## Infant.Mortality
## 1.0784422
# Verify: refit the selected model (the %$% exposition pipe comes from magrittr, which needs to be attached)
swiss %$% lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality)
##
## Coefficients:
## (Intercept) Agriculture Education Catholic
## 62.1013 -0.1546 -0.9803 0.1247
## Infant.Mortality
## 1.0784
<Point>
- Penalized regression (ridge and lasso) adds a penalty that shrinks the coefficients toward zero; the lasso can set some of them exactly to zero
<Point>
tidyverse --- for easy data manipulation and visualization
caret --- to streamline the machine-learning workflow
glmnet --- to compute penalized regression
# Load libraries
library(tidyverse)
library(caret)
library(glmnet)
<Point>
- Predict the median house value (medv) from the Boston housing data
# Load data
data("Boston", package = "MASS")
# Inspect the data
Boston %>% as_tibble()
## # A tibble: 506 x 14
## crim zn indus chas nox rm age dis rad tax ptratio
## * <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8
## 4 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7
## 5 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7
## 6 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7
## 7 0.0883 12.5 7.87 0 0.524 6.01 66.6 5.56 5 311 15.2
## 8 0.145 12.5 7.87 0 0.524 6.17 96.1 5.95 5 311 15.2
## 9 0.211 12.5 7.87 0 0.524 5.63 100 6.08 5 311 15.2
## 10 0.170 12.5 7.87 0 0.524 6.00 85.9 6.59 5 311 15.2
## # ... with 496 more rows, and 3 more variables: black <dbl>, lstat <dbl>,
## # medv <dbl>
# Data overview
Boston %>% glimpse()
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
<Point>
- Split the data into training and test sets
# Set random seed
set.seed(123)
# Split the data
training.samples <-
Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
# Row indices used for the training data
training.samples %>% glimpse()
## int [1:407, 1] 1 2 4 6 7 8 9 10 13 17 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
# Training data
train.data <- Boston[training.samples, ]
train.data %>% dim()
## [1] 407 14
# Test data
test.data <- Boston[-training.samples, ]
test.data %>% dim()
## [1] 99 14
<Point>
- Store the outcome y as a vector
- Build the predictor matrix x with model.matrix() (the intercept column is dropped with [,-1])
# Predictor matrix (explanatory variables)
x <- model.matrix(medv~., train.data)[,-1]
# Outcome variable (response)
y <- train.data$medv
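# Fit the lasso path (alpha = 1); lambda = NULL lets glmnet compute its own lambda sequence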
model <- glmnet(x, y, alpha = 1, lambda = NULL)
model %>% print()
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = NULL)
##
## Df %Dev Lambda
## [1,] 0 0.00000 6.908000
## [2,] 1 0.09343 6.294000
## [3,] 2 0.18670 5.735000
## [4,] 2 0.26620 5.225000
## [5,] 2 0.33220 4.761000
## [6,] 2 0.38700 4.338000
## [7,] 2 0.43240 3.953000
## [8,] 2 0.47020 3.602000
## [9,] 2 0.50150 3.282000
## [10,] 3 0.53390 2.990000
## [11,] 3 0.56100 2.725000
## [12,] 3 0.58350 2.482000
## [13,] 3 0.60210 2.262000
## [14,] 3 0.61770 2.061000
## [15,] 3 0.63050 1.878000
## [16,] 3 0.64120 1.711000
## [17,] 3 0.65010 1.559000
## [18,] 3 0.65740 1.421000
## [19,] 3 0.66360 1.294000
## [20,] 4 0.66990 1.179000
## [21,] 4 0.67560 1.075000
## [22,] 4 0.68030 0.979100
## [23,] 4 0.68420 0.892200
## [24,] 4 0.68750 0.812900
## [25,] 5 0.69080 0.740700
## [26,] 5 0.69380 0.674900
## [27,] 6 0.69730 0.614900
## [28,] 7 0.70200 0.560300
## [29,] 7 0.70610 0.510500
## [30,] 8 0.70960 0.465200
## [31,] 8 0.71430 0.423800
## [32,] 8 0.71820 0.386200
## [33,] 8 0.72140 0.351900
## [34,] 10 0.72520 0.320600
## [35,] 11 0.72850 0.292100
## [36,] 11 0.73120 0.266200
## [37,] 11 0.73350 0.242500
## [38,] 11 0.73540 0.221000
## [39,] 11 0.73690 0.201400
## [40,] 11 0.73820 0.183500
## [41,] 12 0.73950 0.167200
## [42,] 12 0.74220 0.152300
## [43,] 12 0.74440 0.138800
## [44,] 12 0.74620 0.126500
## [45,] 12 0.74760 0.115200
## [46,] 12 0.74880 0.105000
## [47,] 12 0.74990 0.095660
## [48,] 12 0.75070 0.087160
## [49,] 12 0.75140 0.079420
## [50,] 12 0.75200 0.072370
## [51,] 12 0.75250 0.065940
## [52,] 12 0.75290 0.060080
## [53,] 12 0.75320 0.054740
## [54,] 12 0.75350 0.049880
## [55,] 12 0.75370 0.045450
## [56,] 12 0.75390 0.041410
## [57,] 12 0.75410 0.037730
## [58,] 12 0.75420 0.034380
## [59,] 12 0.75430 0.031330
## [60,] 12 0.75440 0.028540
## [61,] 12 0.75450 0.026010
## [62,] 12 0.75460 0.023700
## [63,] 12 0.75460 0.021590
## [64,] 12 0.75460 0.019670
## [65,] 12 0.75470 0.017930
## [66,] 12 0.75470 0.016330
## [67,] 12 0.75470 0.014880
## [68,] 12 0.75480 0.013560
## [69,] 12 0.75480 0.012360
## [70,] 12 0.75480 0.011260
## [71,] 12 0.75480 0.010260
## [72,] 12 0.75480 0.009346
## [73,] 12 0.75480 0.008516
## [74,] 12 0.75480 0.007760
<Point>
- Find the best lambda by cross-validation
--- Ridge regression is specified with alpha = 0
# Set random seed
set.seed(123)
# Cross-validation to select lambda (ridge: alpha = 0)
cv <- cv.glmnet(x, y, alpha = 0)
# Object attributes
cv %>% attributes()
## $names
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se"
##
## $class
## [1] "cv.glmnet"
# Best lambda value
cv$lambda.min
## [1] 0.7581162
# Build the model
model <- glmnet(x, y, alpha = 0, lambda = cv$lambda.min)
model %>% print()
##
## Call: glmnet(x = x, y = y, alpha = 0, lambda = cv$lambda.min)
##
## Df %Dev Lambda
## [1,] 13 0.7478 0.7581
# Stored components
model %>% attributes()
## $names
## [1] "a0" "beta" "df" "dim" "lambda"
## [6] "dev.ratio" "nulldev" "npasses" "jerr" "offset"
## [11] "call" "nobs"
##
## $class
## [1] "elnet" "glmnet"
# Display regression coefficients
model %>% coef()
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 28.696325584
## crim -0.072846256
## zn 0.034166905
## indus -0.057447836
## chas 2.491226902
## nox -11.092319910
## rm 3.981320545
## age -0.003137072
## dis -1.192961125
## rad 0.140678481
## tax -0.006099569
## ptratio -0.863997246
## black 0.009365355
## lstat -0.479141585
# Predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Check model accuracy
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
)
## RMSE Rsquare
## 1 4.983643 0.671074
<Point>
- Find the best lambda by cross-validation
- For lasso regression, specify alpha = 1
# Set random seed
set.seed(123)
# Cross-validation to select lambda (lasso: alpha = 1)
cv <- cv.glmnet(x, y, alpha = 1)
# Object attributes
cv %>% attributes()
## $names
## [1] "lambda" "cvm" "cvsd" "cvup" "cvlo"
## [6] "nzero" "name" "glmnet.fit" "lambda.min" "lambda.1se"
##
## $class
## [1] "cv.glmnet"
# Best lambda value
cv$lambda.min
## [1] 0.008516101
# Build the model
model <- glmnet(x, y, alpha = 1, lambda = cv$lambda.min)
model %>% print()
##
## Call: glmnet(x = x, y = y, alpha = 1, lambda = cv$lambda.min)
##
## Df %Dev Lambda
## [1,] 12 0.7548 0.008516
# Stored components
model %>% attributes()
## $names
## [1] "a0" "beta" "df" "dim" "lambda"
## [6] "dev.ratio" "nulldev" "npasses" "jerr" "offset"
## [11] "call" "nobs"
##
## $class
## [1] "elnet" "glmnet"
# Display regression coefficients
model %>% coef()
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 36.905385208
## crim -0.092224281
## zn 0.048424768
## indus -0.008413932
## chas 2.286235925
## nox -16.796507504
## rm 3.811860949
## age .
## dis -1.596031128
## rad 0.285461530
## tax -0.012401417
## ptratio -0.950408780
## black 0.009646870
## lstat -0.528799085
# Predictions on the test data
x.test <- model.matrix(medv ~., test.data)[,-1]
predictions <- model %>% predict(x.test) %>% as.vector()
# Check model accuracy
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv)
)
## RMSE
## 1 4.989791
<Point>
- Principal component regression (PCR) and partial least squares (PLS) are very useful for multivariate data with many predictors
- Because the components are orthogonal, multicollinearity is not a problem
- As with principal component analysis, the components themselves can be hard to interpret (a minimal sketch with the pls package follows)
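A minimal sketch of PCR using the pls package directly (the workflow below uses caret's wrapper instead; this self-contained illustration uses the same Boston data loaded later in this section):
# Minimal PCR sketch with pls::pcr() on the Boston data (illustration only)
library(pls)
data("Boston", package = "MASS")
pcr.fit <- pcr(medv ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)   # % variance explained and CV error for each number of components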
<Point>
tidyverse --- for easy data manipulation and visualization
caret --- to streamline the machine-learning workflow
pls --- to compute PCR and PLS
# Load libraries
library(tidyverse)
library(caret)
library(pls)
<Point>
- Predict the median house value (medv) from the Boston housing data
# Load data
data("Boston", package = "MASS")
# Inspect the data
Boston %>% as_tibble()
## # A tibble: 506 x 14
## crim zn indus chas nox rm age dis rad tax ptratio
## * <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3
## 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8
## 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8
## 4 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7
## 5 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7
## 6 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7
## 7 0.0883 12.5 7.87 0 0.524 6.01 66.6 5.56 5 311 15.2
## 8 0.145 12.5 7.87 0 0.524 6.17 96.1 5.95 5 311 15.2
## 9 0.211 12.5 7.87 0 0.524 5.63 100 6.08 5 311 15.2
## 10 0.170 12.5 7.87 0 0.524 6.00 85.9 6.59 5 311 15.2
## # ... with 496 more rows, and 3 more variables: black <dbl>, lstat <dbl>,
## # medv <dbl>
# Data overview
Boston %>% glimpse()
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
<Point>
- Split the data into training and test sets
# Set random seed
set.seed(123)
# Split the data
training.samples <-
Boston$medv %>%
createDataPartition(p = 0.8, list = FALSE)
# Row indices used for the training data
training.samples %>% glimpse()
## int [1:407, 1] 1 2 4 6 7 8 9 10 13 17 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Resample1"
# Training data
train.data <- Boston[training.samples, ]
train.data %>% dim()
## [1] 407 14
# Test data
test.data <- Boston[-training.samples, ]
test.data %>% dim()
## [1] 99 14
# Set random seed
set.seed(123)
# Build the model (PCR)
model <- train(
medv~., data = train.data, method = "pcr",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Check the results
model %>% print()
## Principal Component Analysis
##
## 407 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 367, 367, 367, 366, 366, 366, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 7.227334 0.4057420 5.125100
## 2 6.613104 0.5082340 4.852206
## 3 5.471926 0.6520889 3.897771
## 4 5.386228 0.6632146 3.846863
## 5 4.915491 0.7164960 3.404668
## 6 4.922791 0.7159471 3.409811
## 7 4.938164 0.7142156 3.418723
## 8 4.922146 0.7163155 3.427136
## 9 4.936817 0.7144186 3.416268
## 10 5.029506 0.7053292 3.496918
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
# Object attributes
model %>% attributes()
## $names
## [1] "method" "modelInfo" "modelType" "results"
## [5] "pred" "bestTune" "call" "dots"
## [9] "metric" "control" "finalModel" "preProcess"
## [13] "trainingData" "resample" "resampledCM" "perfNames"
## [17] "maximize" "yLimits" "times" "levels"
## [21] "terms" "coefnames" "xlevels"
##
## $class
## [1] "train" "train.formula"
# RMSE by number of components
model %>% plot()
# Number of components in the best model
model$bestTune
## ncomp
## 5 5
# Model summary
model$finalModel %>% summary()
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 47.48 58.40 68.00 74.75 80.94
## .outcome 38.10 51.02 64.43 65.24 71.17
# Predictions on the test data
predictions <- model %>% predict(test.data)
# Check prediction accuracy
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
) %>% round(2)
## RMSE Rsquare
## 1 5.18 0.65
# Set random seed
set.seed(123)
# Build the model (PLS)
model <- train(
medv~., data = train.data, method = "pls",
scale = TRUE,
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
# Check the results
model %>% print()
## Partial Least Squares
##
## 407 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 367, 367, 367, 366, 366, 366, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 6.452697 0.5277704 4.561556
## 2 4.886852 0.7183182 3.401903
## 3 4.819579 0.7264369 3.333601
## 4 4.840861 0.7264274 3.399306
## 5 4.786733 0.7326372 3.369500
## 6 4.745189 0.7370422 3.346671
## 7 4.747050 0.7376106 3.353168
## 8 4.733393 0.7395382 3.343834
## 9 4.731628 0.7398160 3.340095
## 10 4.736906 0.7392631 3.343730
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 9.
# Object attributes
model %>% attributes()
## $names
## [1] "method" "modelInfo" "modelType" "results"
## [5] "pred" "bestTune" "call" "dots"
## [9] "metric" "control" "finalModel" "preProcess"
## [13] "trainingData" "resample" "resampledCM" "perfNames"
## [17] "maximize" "yLimits" "times" "levels"
## [21] "terms" "coefnames" "xlevels"
##
## $class
## [1] "train" "train.formula"
# RMSE by number of components
model %>% plot()
# Number of components in the best model
model$bestTune
## ncomp
## 9 9
# Model summary
model$finalModel %>% summary()
## Data: X dimension: 407 13
## Y dimension: 407 1
## Fit method: oscorespls
## Number of components considered: 9
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 46.19 57.32 64.15 69.76 75.63 78.66 82.85
## .outcome 50.90 71.84 73.71 74.71 75.18 75.35 75.42
## 8 comps 9 comps
## X 85.92 90.36
## .outcome 75.48 75.49
# Predictions on the test data
predictions <- model %>% predict(test.data)
# Check prediction accuracy
data.frame(
RMSE = caret::RMSE(predictions, test.data$medv),
Rsquare = caret::R2(predictions, test.data$medv)
) %>% round(2)
## RMSE Rsquare
## 1 4.99 0.67