Masser’s Blog

A blog about data science, Kaggle, Python, ADHD, travel, Sanfrecce Hiroshima, and more.

Predicting This Season's Sanfrecce Hiroshima Attendance with Machine Learning, Part 2 (Including XGBoost Tuning)

The J.League season kicks off this weekend, so I used this week's weekly weather forecast to predict this season's attendance again with XGBoost. I picked up how to use XGBoost while working through Kaggle's House Prices competition, so this post also covers how I tuned it.

This Weekend's Weather

This weekend is forecast to be sunny with a temperature of 11°C. f:id:masser199:20180220042026j:plain

What Is XGBoost?

XGBoost is an algorithm with a strong track record in data-analysis competitions like Kaggle. For the details of how it actually works, articles like the following are good references. I can follow the objective function now, but before I took the Machine Learning course I probably wouldn't even have known what it was.

パッケージユーザーのための機械学習(12):Xgboost (eXtreme Gradient Boosting) - 六本木で働くデータサイエンティストのブログ XGBoostやパラメータチューニングの仕方に関する調査 | かものはしの分析ブログ
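The core idea of gradient boosting, which XGBoost implements with extra regularization and engineering, can be sketched in a few lines of plain Python. This is a hypothetical toy (one 1-D feature, depth-1 "stump" trees, squared error), not XGBoost itself: each round fits a stump to the residuals of the current ensemble and adds it scaled by a learning rate.

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split that minimizes squared error of the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]  # (threshold, left leaf value, right leaf value)


def boost(xs, ys, n_rounds=50, lr=0.1):
    """Additively fit stumps to residuals, as gradient boosting does for squared error."""
    pred = [sum(ys) / len(ys)] * len(ys)  # start from the global mean
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, lv, rv = fit_stump(xs, residuals)
        pred = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, pred)]
    return pred


# Toy data with two clusters; the boosted predictions approach the targets.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 12, 11, 30, 33, 31]
pred = boost(xs, ys)
```

Because each stump removes a fraction of the remaining residual, the training error shrinks round by round; XGBoost does the same with real trees, second-order gradients, and regularization.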

Tuning XGBoost

The tuning procedure is based on this notebook:

Analytics_Vidhya/XGBoost models.ipynb at master · aarshayj/Analytics_Vidhya · GitHub

The actual tuning results follow. First, load the libraries and the data tables. The tables are the ones created in this earlier post:

Predicting This Season's Sanfrecce Hiroshima Attendance with Machine Learning - Masser’s Blog

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
%matplotlib inline

train = pd.read_csv('sanftrain.csv')
test = pd.read_csv('sanftest.csv')
train_ID = train['Key']
test_ID = test['Key']
y_train = train['Audience']
X_train = train.drop(['Key','Audience'], axis=1)
X_test = test.drop('Key', axis=1)

First, tune max_depth and n_estimators.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=5, min_child_weight=1, gamma=0,
                             subsample=0.2, colsample_bytree=0.2, learning_rate=0.1)
reg = GridSearchCV(xgb_model,
                   {'max_depth': [2,3,4,5,6],
                    'n_estimators': [50,75,100,125,150,175,200,225,250,275,300]}, verbose=1)
reg.fit(X_train, y_train)

The results can be read off as follows: max_depth comes out as 4 and n_estimators as 300. (Note that `grid_scores_` is an attribute of older scikit-learn versions; it has since been removed in favor of `cv_results_`.)

reg.grid_scores_,reg.best_params_, reg.best_score_

   ([mean: 0.30059, std: 0.01853, params: {'max_depth': 2, 'n_estimators': 50},
      mean: 0.39571, std: 0.00777, params: {'max_depth': 2, 'n_estimators': 75},
      mean: 0.44242, std: 0.01296, params: {'max_depth': 2, 'n_estimators': 100},
      mean: 0.47352, std: 0.02664, params: {'max_depth': 2, 'n_estimators': 125},
      mean: 0.49547, std: 0.01577, params: {'max_depth': 2, 'n_estimators': 150},
      mean: 0.54260, std: 0.02425, params: {'max_depth': 2, 'n_estimators': 175},
      mean: 0.54957, std: 0.03926, params: {'max_depth': 2, 'n_estimators': 200},
      mean: 0.52969, std: 0.04004, params: {'max_depth': 2, 'n_estimators': 225},
      mean: 0.54029, std: 0.05254, params: {'max_depth': 2, 'n_estimators': 250},
      mean: 0.54485, std: 0.04447, params: {'max_depth': 2, 'n_estimators': 275},
      mean: 0.55701, std: 0.04541, params: {'max_depth': 2, 'n_estimators': 300},
      mean: 0.29411, std: 0.02525, params: {'max_depth': 3, 'n_estimators': 50},
      mean: 0.39210, std: 0.00699, params: {'max_depth': 3, 'n_estimators': 75},
      mean: 0.44110, std: 0.01439, params: {'max_depth': 3, 'n_estimators': 100},
      mean: 0.47381, std: 0.02580, params: {'max_depth': 3, 'n_estimators': 125},
      mean: 0.49691, std: 0.01668, params: {'max_depth': 3, 'n_estimators': 150},
      mean: 0.54583, std: 0.02631, params: {'max_depth': 3, 'n_estimators': 175},
      mean: 0.55439, std: 0.04097, params: {'max_depth': 3, 'n_estimators': 200},
      mean: 0.53532, std: 0.04101, params: {'max_depth': 3, 'n_estimators': 225},
      mean: 0.54449, std: 0.05164, params: {'max_depth': 3, 'n_estimators': 250},
      mean: 0.54941, std: 0.04269, params: {'max_depth': 3, 'n_estimators': 275},
      mean: 0.56011, std: 0.04418, params: {'max_depth': 3, 'n_estimators': 300},
      mean: 0.29457, std: 0.02471, params: {'max_depth': 4, 'n_estimators': 50},
      mean: 0.39260, std: 0.00701, params: {'max_depth': 4, 'n_estimators': 75},
      mean: 0.44148, std: 0.01402, params: {'max_depth': 4, 'n_estimators': 100},
      mean: 0.47413, std: 0.02567, params: {'max_depth': 4, 'n_estimators': 125},
      mean: 0.49719, std: 0.01653, params: {'max_depth': 4, 'n_estimators': 150},
      mean: 0.54507, std: 0.02510, params: {'max_depth': 4, 'n_estimators': 175},
      mean: 0.55360, std: 0.03993, params: {'max_depth': 4, 'n_estimators': 200},
      mean: 0.53499, std: 0.04068, params: {'max_depth': 4, 'n_estimators': 225},
      mean: 0.54440, std: 0.05068, params: {'max_depth': 4, 'n_estimators': 250},
      mean: 0.54975, std: 0.04196, params: {'max_depth': 4, 'n_estimators': 275},
      mean: 0.56034, std: 0.04323, params: {'max_depth': 4, 'n_estimators': 300},
      mean: 0.29457, std: 0.02471, params: {'max_depth': 5, 'n_estimators': 50},
      mean: 0.39260, std: 0.00701, params: {'max_depth': 5, 'n_estimators': 75},
      mean: 0.44148, std: 0.01402, params: {'max_depth': 5, 'n_estimators': 100},
      mean: 0.47413, std: 0.02567, params: {'max_depth': 5, 'n_estimators': 125},
      mean: 0.49719, std: 0.01653, params: {'max_depth': 5, 'n_estimators': 150},
      mean: 0.54507, std: 0.02510, params: {'max_depth': 5, 'n_estimators': 175},
      mean: 0.55360, std: 0.03993, params: {'max_depth': 5, 'n_estimators': 200},
      mean: 0.53499, std: 0.04068, params: {'max_depth': 5, 'n_estimators': 225},
      mean: 0.54440, std: 0.05068, params: {'max_depth': 5, 'n_estimators': 250},
      mean: 0.54975, std: 0.04196, params: {'max_depth': 5, 'n_estimators': 275},
      mean: 0.56034, std: 0.04323, params: {'max_depth': 5, 'n_estimators': 300},
      mean: 0.29457, std: 0.02471, params: {'max_depth': 6, 'n_estimators': 50},
      mean: 0.39260, std: 0.00701, params: {'max_depth': 6, 'n_estimators': 75},
      mean: 0.44148, std: 0.01402, params: {'max_depth': 6, 'n_estimators': 100},
      mean: 0.47413, std: 0.02567, params: {'max_depth': 6, 'n_estimators': 125},
      mean: 0.49719, std: 0.01653, params: {'max_depth': 6, 'n_estimators': 150},
      mean: 0.54507, std: 0.02510, params: {'max_depth': 6, 'n_estimators': 175},
      mean: 0.55360, std: 0.03993, params: {'max_depth': 6, 'n_estimators': 200},
      mean: 0.53499, std: 0.04068, params: {'max_depth': 6, 'n_estimators': 225},
      mean: 0.54440, std: 0.05068, params: {'max_depth': 6, 'n_estimators': 250},
      mean: 0.54975, std: 0.04196, params: {'max_depth': 6, 'n_estimators': 275},
      mean: 0.56034, std: 0.04323, params: {'max_depth': 6, 'n_estimators': 300}],
     {'max_depth': 4, 'n_estimators': 300},
     0.5603390632246239)
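For anyone unfamiliar with what GridSearchCV is doing above: it simply evaluates every combination in the parameter grid and keeps the best-scoring one. Here is a stdlib-only sketch of that loop; `toy_eval` is a made-up scoring function standing in for cross-validated model fitting, deliberately peaked at the values the real search found.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively score every parameter combination and return the best."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # in GridSearchCV this is a mean CV score
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Made-up scoring function peaking at max_depth=4, n_estimators=300:
def toy_eval(p):
    return -abs(p["max_depth"] - 4) - abs(p["n_estimators"] - 300) / 100

best, score = grid_search(
    {"max_depth": [2, 3, 4, 5, 6],
     "n_estimators": [50, 100, 200, 300]},
    toy_eval,
)
print(best)  # {'max_depth': 4, 'n_estimators': 300}
```

GridSearchCV adds k-fold cross-validation and parallelism on top of this, but the search itself is just this exhaustive loop, which is why large grids get expensive quickly.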

Next, tune gamma.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=4, n_estimators=300, min_child_weight=1, gamma=0,
                             subsample=0.2, colsample_bytree=0.2, learning_rate=0.1)
reg = GridSearchCV(xgb_model,
                   {'gamma':[i/10.0 for i in range(0,5)]}, verbose=1)
reg.fit(X_train, y_train)

The results are below: every gamma value tried gives exactly the same score, so gamma stays at 0.0.

reg.grid_scores_,reg.best_params_, reg.best_score_

    ([mean: 0.56034, std: 0.04323, params: {'gamma': 0.0},
      mean: 0.56034, std: 0.04323, params: {'gamma': 0.1},
      mean: 0.56034, std: 0.04323, params: {'gamma': 0.2},
      mean: 0.56034, std: 0.04323, params: {'gamma': 0.3},
      mean: 0.56034, std: 0.04323, params: {'gamma': 0.4}],
     {'gamma': 0.0},
     0.5603390632246239)
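Why did gamma make no difference? In XGBoost's regularized objective, gamma is the minimum loss reduction required to make a split (a per-leaf penalty), and reg_alpha, tuned later, is the L1 penalty on leaf weights:

    Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k),    Ω(f) = γT + (1/2)·λ‖w‖² + α‖w‖₁

where T is the number of leaves of a tree f and w its leaf weights. On a dataset this small, every candidate split apparently clears even the largest gamma tried (0.4), so the trees, and hence the scores, come out identical.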

Next, tune subsample and colsample_bytree.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=100, min_child_weight=1, gamma=0,
                             subsample=0.2, colsample_bytree=0.2, learning_rate=0.1)
reg = GridSearchCV(xgb_model,
                   {'subsample': [i/10.0 for i in range(4, 10)],
                    'colsample_bytree': [i/10.0 for i in range(4, 10)]}, verbose=1)
reg.fit(X_train, y_train)

The results show that subsample and colsample_bytree are both best at 0.7.

reg.grid_scores_,reg.best_params_, reg.best_score_

    ([mean: 0.55839, std: 0.05922, params: {'colsample_bytree': 0.4, 'subsample': 0.4},
      mean: 0.55703, std: 0.08538, params: {'colsample_bytree': 0.4, 'subsample': 0.5},
      mean: 0.55295, std: 0.09469, params: {'colsample_bytree': 0.4, 'subsample': 0.6},
      mean: 0.54332, std: 0.07073, params: {'colsample_bytree': 0.4, 'subsample': 0.7},
      mean: 0.54041, std: 0.06807, params: {'colsample_bytree': 0.4, 'subsample': 0.8},
      mean: 0.55073, std: 0.06947, params: {'colsample_bytree': 0.4, 'subsample': 0.9},
      mean: 0.56410, std: 0.06286, params: {'colsample_bytree': 0.5, 'subsample': 0.4},
      mean: 0.54650, std: 0.08238, params: {'colsample_bytree': 0.5, 'subsample': 0.5},
      mean: 0.55583, std: 0.09343, params: {'colsample_bytree': 0.5, 'subsample': 0.6},
      mean: 0.55248, std: 0.09086, params: {'colsample_bytree': 0.5, 'subsample': 0.7},
      mean: 0.54473, std: 0.08220, params: {'colsample_bytree': 0.5, 'subsample': 0.8},
      mean: 0.55531, std: 0.08295, params: {'colsample_bytree': 0.5, 'subsample': 0.9},
      mean: 0.56410, std: 0.06286, params: {'colsample_bytree': 0.6, 'subsample': 0.4},
      mean: 0.54650, std: 0.08238, params: {'colsample_bytree': 0.6, 'subsample': 0.5},
      mean: 0.55583, std: 0.09343, params: {'colsample_bytree': 0.6, 'subsample': 0.6},
      mean: 0.55248, std: 0.09086, params: {'colsample_bytree': 0.6, 'subsample': 0.7},
      mean: 0.54473, std: 0.08220, params: {'colsample_bytree': 0.6, 'subsample': 0.8},
      mean: 0.55531, std: 0.08295, params: {'colsample_bytree': 0.6, 'subsample': 0.9},
      mean: 0.55025, std: 0.07755, params: {'colsample_bytree': 0.7, 'subsample': 0.4},
      mean: 0.54751, std: 0.09015, params: {'colsample_bytree': 0.7, 'subsample': 0.5},
      mean: 0.55105, std: 0.09073, params: {'colsample_bytree': 0.7, 'subsample': 0.6},
      mean: 0.56454, std: 0.09483, params: {'colsample_bytree': 0.7, 'subsample': 0.7},
      mean: 0.54205, std: 0.10489, params: {'colsample_bytree': 0.7, 'subsample': 0.8},
      mean: 0.53911, std: 0.09839, params: {'colsample_bytree': 0.7, 'subsample': 0.9},
      mean: 0.55933, std: 0.07821, params: {'colsample_bytree': 0.8, 'subsample': 0.4},
      mean: 0.54618, std: 0.11937, params: {'colsample_bytree': 0.8, 'subsample': 0.5},
      mean: 0.55731, std: 0.11517, params: {'colsample_bytree': 0.8, 'subsample': 0.6},
      mean: 0.55182, std: 0.11408, params: {'colsample_bytree': 0.8, 'subsample': 0.7},
      mean: 0.55087, std: 0.09362, params: {'colsample_bytree': 0.8, 'subsample': 0.8},
      mean: 0.53608, std: 0.10538, params: {'colsample_bytree': 0.8, 'subsample': 0.9},
      mean: 0.55311, std: 0.09649, params: {'colsample_bytree': 0.9, 'subsample': 0.4},
      mean: 0.53675, std: 0.12155, params: {'colsample_bytree': 0.9, 'subsample': 0.5},
      mean: 0.56001, std: 0.10962, params: {'colsample_bytree': 0.9, 'subsample': 0.6},
      mean: 0.55277, std: 0.10686, params: {'colsample_bytree': 0.9, 'subsample': 0.7},
      mean: 0.55744, std: 0.09592, params: {'colsample_bytree': 0.9, 'subsample': 0.8},
      mean: 0.52927, std: 0.09682, params: {'colsample_bytree': 0.9, 'subsample': 0.9}],
     {'colsample_bytree': 0.7, 'subsample': 0.7},
     0.56453621270808296)

Finally, tune reg_alpha.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

xgb_model = xgb.XGBRegressor(max_depth=2, n_estimators=100, min_child_weight=1, gamma=0,
                             subsample=0.7, colsample_bytree=0.7, learning_rate=0.1)
reg = GridSearchCV(xgb_model,
                    {'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]}, verbose=1)
reg.fit(X_train, y_train)

The result: reg_alpha comes out as 1.

reg.grid_scores_,reg.best_params_, reg.best_score_

    ([mean: 0.56454, std: 0.09483, params: {'reg_alpha': 1e-05},
      mean: 0.56454, std: 0.09483, params: {'reg_alpha': 0.01},
      mean: 0.56454, std: 0.09483, params: {'reg_alpha': 0.1},
      mean: 0.56522, std: 0.09479, params: {'reg_alpha': 1},
      mean: 0.56443, std: 0.10188, params: {'reg_alpha': 100}],
     {'reg_alpha': 1},
     0.56521892471311341)

Now train a model with these tuned parameters and make the prediction.

reg = xgb.XGBRegressor(max_depth=4, n_estimators=300, min_child_weight=1, gamma=0,
                       subsample=0.7, colsample_bytree=0.7, learning_rate=0.1, reg_alpha=1)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)  # predicted attendance for the upcoming fixtures

The predicted attendance came out to 17,645. Maybe a little on the high side? I'm looking forward to seeing the actual turnout.