[혼공머신] Tree Ensembles - Gradient Boosting

Gradient Boosting

Gradient boosting builds an ensemble out of shallow decision trees, adding each new tree to correct the errors left by the trees before it.

  • scikit-learn's GradientBoostingClassifier uses 100 decision trees of depth 3 by default. Because the trees are shallow, the model resists overfitting and can generally be expected to generalize well.
  • As the word "gradient" suggests, the method uses gradient descent to add trees to the ensemble. Classification uses the logistic loss function, and regression uses the mean squared error.
  • Ordinary gradient descent moves toward the minimum of the loss function by adjusting a model's weights and intercept a little at a time. Gradient boosting moves toward that minimum by repeatedly adding shallow trees instead.
  • The learning rate (the learning_rate parameter) controls the step size, so that the model approaches the minimum of the loss function slowly.
  • Now let's use scikit-learn's GradientBoostingClassifier to check the cross-validation score on the dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Create the GradientBoostingClassifier; random_state makes the result reproducible.
gb = GradientBoostingClassifier(random_state=42)

# Run cross-validation with cross_validate.
# The model is evaluated on train_input and train_target (the training data prepared earlier);
# return_train_score=True returns the training scores alongside the validation scores.
scores = cross_validate(gb, train_input, train_target, return_train_score=True, n_jobs=-1)

# Print the mean training score and the mean validation score
print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# 0.8881086892152563 0.8720430147331015
  • The training and validation scores are close, so there is little overfitting.
  • Gradient boosting tends to stay resistant to overfitting even as the number of decision trees grows.
  • Raising the learning rate and the number of trees can improve performance further.
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)
scores = cross_validate(gb, train_input, train_target, return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# 0.9464595437171814 0.8780082549788999
  • With 500 trees, five times the default, the training score rises to about 0.946 and the validation score edges up to about 0.878, so overfitting is still kept reasonably in check.
  • The default learning rate (learning_rate) of this model is 0.1.
  • Gradient boosting also provides feature importances.
  • The result shows that gradient boosting concentrates on one feature (here, sugar content) more than the random forest does.
gb.fit(train_input, train_target)
print(gb.feature_importances_)

# [0.15872278 0.68010884 0.16116839]
  • There is also a parameter called subsample.
  • It sets the fraction of the training set used to train each tree. The default is 1.0, which uses the whole training set.
  • If subsample is less than 1, each tree is trained on only part of the training set.
  • This resembles stochastic or mini-batch gradient descent, which learn from a random subset of the samples at each step.
  • In general, gradient boosting achieves slightly better performance than a random forest.
  • However, because trees are added sequentially, training is slow.
  • For the same reason, GradientBoostingClassifier has no n_jobs parameter. The regression version of gradient boosting is GradientBoostingRegressor.
  • Histogram-based gradient boosting is the model that improves on both the speed and the performance of gradient boosting.
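To make the mechanism in this section concrete, here is a minimal from-scratch sketch; it is not scikit-learn's implementation. It assumes a one-feature regression problem with squared-error loss and uses depth-1 trees (stumps). The helper names fit_stump, predict_stump, and gradient_boost are made up for illustration. Each round fits a stump to the current residuals, which for squared error are proportional to the negative gradient, and adds it scaled by learning_rate. A subsample value below 1.0 trains each stump on a random part of the rows.

```python
import numpy as np


def fit_stump(x, residual):
    """Fit a depth-1 regression tree: one threshold, one constant per side."""
    candidates = np.unique(x)[:-1]  # splitting at the largest value would leave the right side empty
    if candidates.size == 0:  # all x identical: fall back to a constant prediction
        return x[0], residual.mean(), residual.mean()
    best = None
    for threshold in candidates:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def gradient_boost(x, y, n_estimators=100, learning_rate=0.1, subsample=1.0, seed=42):
    """Return the initial prediction, the fitted stumps, and the training MSE per round."""
    rng = np.random.default_rng(seed)
    base = y.mean()  # start from a constant model
    prediction = np.full(len(y), base, dtype=float)
    n_rows = max(2, int(subsample * len(x)))
    stumps, history = [], []
    for _ in range(n_estimators):
        rows = rng.choice(len(x), size=n_rows, replace=False)
        residual = y[rows] - prediction[rows]  # what the ensemble still gets wrong
        stump = fit_stump(x[rows], residual)
        prediction += learning_rate * predict_stump(stump, x)  # small step toward the minimum
        stumps.append(stump)
        history.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, history


# Toy data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, history = gradient_boost(x, y)
print(f"training MSE after 1 tree: {history[0]:.4f}, after {len(stumps)} trees: {history[-1]:.4f}")
```

Calling gradient_boost(x, y, subsample=0.5) runs the same loop with each stump seeing a random half of the rows, which mirrors the subsample parameter described above.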

Histogram-Based Gradient Boosting

Histogram-based gradient boosting is one of the most popular machine learning algorithms for structured (tabular) data.
  • It first splits each input feature into 256 bins, which lets it find the best split for a node very quickly.
  • It also sets aside one of the 256 bins for missing values, so input data with missing values needs no separate preprocessing.
  • In scikit-learn, the algorithm is available through the HistGradientBoostingClassifier class.
  • The default parameters give stable performance. To improve it further, raise the max_iter parameter to increase the number of boosting iterations.
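The binning step described above can be sketched in a few lines of NumPy. This is an illustration, not scikit-learn's code: the quantile-based bin edges and the helper names make_bin_edges and to_bins are assumptions of this sketch. It maps each value of one feature to one of up to 255 ordinary bins and sends missing values (NaN) to a reserved 256th bin, so every binned value fits in a single byte.

```python
import numpy as np

MAX_BINS = 255      # ordinary bins, indexed 0..254
MISSING_BIN = 255   # reserved index for missing values


def make_bin_edges(column, max_bins=MAX_BINS):
    """Choose thresholds so that non-missing values fall into roughly equal-sized bins."""
    values = column[~np.isnan(column)]
    interior = np.linspace(0.0, 1.0, max_bins + 1)[1:-1]  # max_bins - 1 cut points
    return np.unique(np.quantile(values, interior))


def to_bins(column, edges):
    """Replace each value with its bin index; NaN goes to MISSING_BIN."""
    bins = np.searchsorted(edges, column, side="right").astype(np.uint8)
    bins[np.isnan(column)] = MISSING_BIN
    return bins


# Toy feature (an assumption for illustration) with a few missing values
rng = np.random.default_rng(42)
feature = rng.normal(size=1000)
feature[::100] = np.nan  # inject 10 missing values

edges = make_bin_edges(feature)
binned = to_bins(feature, edges)
print(binned.dtype, binned[~np.isnan(feature)].max(), binned[np.isnan(feature)][:3])
```

Once a feature is binned this way, a split search only has to try the bin boundaries rather than every distinct value, which is where the speedup comes from.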
# On scikit-learn versions below 1.0, using HistGradientBoostingClassifier
# requires uncommenting and running the following line.
# from sklearn.experimental import enable_hist_gradient_boosting

from sklearn.ensemble import HistGradientBoostingClassifier  # import HistGradientBoostingClassifier

# Create the HistGradientBoostingClassifier; random_state makes the result reproducible.
hgb = HistGradientBoostingClassifier(random_state=42)

# Run cross-validation with cross_validate.
# The model is evaluated on train_input and train_target, returning training and validation scores.
scores = cross_validate(hgb, train_input, train_target, return_train_score=True, n_jobs=-1)

# Print the mean training score and the mean validation score
print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# 0.9321723946453317 0.8801241948619236
  • n_jobs=-1 tells cross_validate to use all CPU cores.
  • The model keeps overfitting in check while scoring slightly higher than plain gradient boosting.
  • Next, let's look at feature importance.
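# Fit the HistGradientBoostingClassifier model on the training data
hgb.fit(train_input, train_target)

# Print the feature importances of the random forest model (rf).
# Note: these are rf's values, not hgb's. HistGradientBoostingClassifier
# does not provide a feature_importances_ attribute.
print(rf.feature_importances_)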

# [0.23167441 0.50039841 0.26792718]
  • Gradient boosting concentrated on sugar, whereas histogram-based gradient boosting is said to spread its attention more evenly across the features, much like the random forest.
  • The values printed above are the random forest's own importances (rf.feature_importances_), so they do not verify this for hgb. HistGradientBoostingClassifier has no feature_importances_ attribute; permutation importance (sklearn.inspection.permutation_importance) is one way to measure it.
  • Finally, let's check the performance of HistGradientBoostingClassifier on the test set.
hgb.score(test_input, test_target)

# 0.8723076923076923
  • The test set accuracy is about 87%.
  • In real-world use, performance may be slightly lower than this. Still, the ensemble model does better than a single decision tree. For reference, the earlier random search reached a test accuracy of 86%.
  • The regression version is implemented as the HistGradientBoostingRegressor class. Histogram-based gradient boosting is a relatively new addition to scikit-learn.
  • Several libraries besides scikit-learn implement histogram-based gradient boosting. The best known is XGBoost.

XGBoost

XGBoost supports several boosting algorithms; setting the tree_method parameter to 'hist' selects histogram-based gradient boosting.
from xgboost import XGBClassifier  # import the XGBoost classifier

# Create the XGBClassifier; tree_method='hist' selects the histogram-based tree method.
# random_state makes the result reproducible.
xgb = XGBClassifier(tree_method='hist', random_state=42)

# Run cross-validation with cross_validate.
# The model is evaluated on train_input and train_target, returning training and validation scores.
scores = cross_validate(xgb, train_input, train_target, return_train_score=True, n_jobs=-1)

# Print the mean training score and the mean validation score
print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# 0.9558403027491312 0.8782000074035686

LightGBM

LightGBM is another histogram-based gradient boosting library, developed by Microsoft. It has become popular for its speed and efficiency, and it applies recent techniques to deliver fast training and high performance on large datasets.
  • LightGBM comes preinstalled on Colab, so it can be used right away, and it also works with scikit-learn's cross_validate() function.
from lightgbm import LGBMClassifier  # import the LightGBM classifier

# Create the LGBMClassifier; random_state makes the result reproducible.
lgb = LGBMClassifier(random_state=42)

# Run cross-validation with cross_validate.
# The model is evaluated on train_input and train_target, returning training and validation scores.
scores = cross_validate(lgb, train_input, train_target, return_train_score=True, n_jobs=-1)

# Print the mean training score and the mean validation score
print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# 0.935828414851749 0.8801251203079884
  • scikit-learn's histogram-based gradient boosting was itself heavily influenced by LightGBM.

Summary

Gradient Boosting

  • Gradient boosting: builds an ensemble by adding decision trees one after another to minimize a loss function. Training can be somewhat slow, but performance is strong.
  • Key parameters:
    • loss: the loss function (log_loss, called deviance before scikit-learn 1.1, or exponential)
    • learning_rate: how much each tree contributes to the ensemble (default: 0.1)
    • n_estimators: the number of trees, one per boosting stage (default: 100)
    • subsample: the fraction of the training set sampled for each tree (default: 1.0)
    • max_depth: the maximum depth of each tree (default: 3)

Histogram-based Gradient Boosting

  • Histogram-based gradient boosting: bins the input features to find splits quickly, and needs no separate handling of missing values.
  • Key parameters:
    • learning_rate: the learning rate (default: 0.1)
    • max_iter: the number of boosting iterations, one tree each (default: 100)
    • max_bins: the number of bins for the input data (default: 255; one additional bin is reserved for missing values)

Libraries

  • XGBoost: supports several boosting algorithms; setting the tree_method parameter to 'hist' selects histogram-based gradient boosting.
  • LightGBM: a fast histogram-based gradient boosting library that applies recent techniques and performs well on large datasets.