[ML] Ensemble Methods

In this post, let's take a look at ensemble methods.

Ensemble methods combine multiple predictive models to achieve better performance than any single model.

This improves prediction accuracy, makes the model more stable, and reduces overfitting.

Emsemble κΈ°λ²•μ˜ λͺ©μ 

So why do we use ensemble methods in the first place?

 

https://ohdsi.github.io/PatientLevelPrediction/articles/BuildingEnsembleModels.html

  • Better predictive performance: combining multiple models achieves higher prediction accuracy than any individual model.
  • Reduced overfitting: aggregating the outputs of diverse models keeps any single model from overfitting the training data.
  • Improved stability: reducing model variance makes predictions more consistent.

Emsemble κΈ°λ²•μ˜ μ’…λ₯˜

There are three main types of ensemble methods, each described in detail below.
  • Bagging
  • Boosting
  • Stacking

Bagging

Bagging (Bootstrap Aggregating) trains multiple models in parallel
and produces the final prediction by averaging their outputs or taking a majority vote.

https://www.datacamp.com/tutorial/what-bagging-in-machine-learning-a-guide-with-examples

How Bagging Works

Bootstrap sampling

  • Draw multiple random samples from the original dataset with replacement. Because each model trains on a different sample, the ensemble gains diversity.

κ°œλ³„ λͺ¨λΈ ν•™μŠ΅

  • Train a separate model on each bootstrap sample. Every model uses the same algorithm but learns from different data.

Combining predictions

  • Average the predictions of all models, or take a majority vote, to produce the final prediction. Errors of individual models tend to cancel out here, which improves predictive performance. A small hand-rolled sketch of these three steps follows below.
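
To make the three steps concrete, here is a minimal NumPy/scikit-learn sketch of bagging by hand; the loop count, the use of DecisionTreeClassifier, and the variable names are my own illustrative choices, not a fixed recipe.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Steps 1-2: train each tree on its own bootstrap sample (drawn with replacement)
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: combine the 10 predictions by majority vote
preds = np.stack([m.predict(X) for m in models])  # shape (10, n_samples)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print((votes == y).mean())  # agreement on the training set, for illustration only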

Representative algorithm

  • Random Forest: an algorithm that combines many decision trees in a bagging fashion. The generic form of this idea is sketched below.
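
Bagging is not tied to decision trees. scikit-learn's generic BaggingClassifier wraps any base estimator; a minimal sketch follows (the k-NN base model and parameter values are arbitrary choices, and the estimator= argument is named base_estimator= in scikit-learn versions before 1.2):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 20 k-NN classifiers, each trained on its own bootstrap sample
bag = BaggingClassifier(estimator=KNeighborsClassifier(),
                        n_estimators=20, random_state=42)
print(cross_val_score(bag, X, y, cv=5).mean())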

Boosting

Boosting trains models sequentially,
giving larger weights to the samples the previous model misclassified so that later models correct its errors.

https://medium.com/@brijesh_soni/understanding-boosting-in-machine-learning-a-comprehensive-guide-bdeaa1167a6

 

λΆ€μŠ€νŒ…μ˜ 원리

초기 λͺ¨λΈ ν•™μŠ΅

  • Train the first model on the full training set.

Increasing the weights of misclassified samples

  • Increase the weights of the samples the first model got wrong, helping subsequent models learn from those errors.

Sequential training

  • Train the next model on the reweighted samples, and repeat so that the models are learned one after another.

Combining predictions

  • Produce the final prediction as a weighted average of all models' predictions, with each model weighted according to its performance. This scheme is sketched below.
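
The procedure described above is essentially AdaBoost. A minimal scikit-learn sketch, assuming decision stumps as the weak learner (the stump depth and round count are arbitrary, and estimator= is named base_estimator= in scikit-learn versions before 1.2):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each round reweights the samples the previous stump misclassified
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=42)
print(cross_val_score(ada, X, y, cv=5).mean())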

Representative algorithms

  • AdaBoost
  • Gradient Boosting
  • XGBoost (see the sketch below)
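
XGBoost lives in the third-party xgboost package rather than scikit-learn, but it exposes a compatible API. A minimal sketch, assuming xgboost is installed (pip install xgboost) and with arbitrary hyperparameter values:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gradient-boosted trees; score() reports test-set accuracy
xgb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))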

Stacking

Stacking uses the predictions of several base models as inputs to train a meta-model, which produces the final prediction.

https://medium.com/@brijesh_soni/stacking-to-improve-model-performance-a-comprehensive-guide-on-ensemble-learning-in-python-9ed53c93ce28

μŠ€νƒœν‚Ήμ˜ 원리

κΈ°λ³Έ λͺ¨λΈ ν•™μŠ΅

  • Train several base models. Each is trained independently and may use a different algorithm.

메타 λͺ¨λΈ ν•™μŠ΅

  • Train a meta-model that takes the base models' predictions as input. The meta-model learns how to combine those outputs into a final prediction.

Final prediction

  • Use the meta-model's output as the final prediction. This lets the ensemble exploit the strengths of each base model. A sketch of how the meta-features are typically built follows below.
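
In practice, the meta-model is usually trained on out-of-fold predictions of the base models, so it never sees predictions a base model made on its own training data. A minimal sketch using cross_val_predict; the choice of base models and of 5 folds is arbitrary:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
base_models = [RandomForestClassifier(random_state=42),
               SVC(probability=True, random_state=42)]

# Out-of-fold class probabilities from each base model become the meta-features
meta_features = np.hstack([cross_val_predict(m, X, y, cv=5, method='predict_proba')
                           for m in base_models])
meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_model.score(meta_features, y))  # in-sample score, for illustration only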

Representative approach

  • Stacking with a multilayer perceptron (MLP) as the meta-model is a representative approach; see the sketch below.
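
A minimal sketch of an MLP meta-model with scikit-learn's StackingClassifier; the base models, hidden-layer size, and iteration cap are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Base models feed their predictions to an MLP meta-model
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
                ('svc', SVC(probability=True, random_state=42))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42),
)
print(cross_val_score(stack, X, y, cv=5).mean())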

Emsemble κΈ°λ²•μ˜ μž₯, 단점

앙상블 κΈ°λ²•μ˜ μž₯점

  1. Better predictive performance: combining the predictions of multiple models yields higher accuracy.
  2. Reduced overfitting: aggregating the outputs of diverse models keeps individual models from overfitting.
  3. Improved stability: reducing model variance makes predictions more consistent.

앙상블 κΈ°λ²•μ˜ 단점

  1. Added complexity: training and combining several models can be involved, and designing and tuning the ensemble can be tricky.
  2. Harder to interpret: results are more difficult to explain than a single model's; complex schemes such as stacking are especially opaque.
  3. Computational cost: training many models takes substantial time and compute, particularly on large datasets.

Ensemble Method Example Code

Bagging Example

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Print the classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Boosting Example - Gradient Boosting

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Gradient Boosting model
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the model
gb.fit(X_train, y_train)

# Make predictions
y_pred = gb.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Print the classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Stacking Example

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42))
]

# Create the stacking model with logistic regression as the meta-model
stacking = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# Train the model
stacking.fit(X_train, y_train)

# Make predictions
y_pred = stacking.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Print the classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45