[ML] Ensemble Methods

In this post, let's take a look at ensemble methods.

Ensemble methods combine multiple predictive models to achieve better performance than any single model.

This improves prediction accuracy, makes the model more stable, and reduces overfitting.

Emsemble κΈ°λ²•μ˜ λͺ©μ 

So why do we use ensemble methods in the first place?

 

https://ohdsi.github.io/PatientLevelPrediction/articles/BuildingEnsembleModels.html

  • Better predictive performance: combining multiple models achieves higher prediction accuracy than any individual model.
  • Reduced overfitting: aggregating the outputs of diverse models keeps any single model from overfitting the training data.
  • Improved stability: reducing model variance makes predictions more consistent.

Emsemble κΈ°λ²•μ˜ μ’…λ₯˜

There are three main types of ensemble methods, each described in detail below.
  • Bagging
  • Boosting
  • Stacking

Bagging

Bagging (Bootstrap Aggregating) trains multiple models in parallel
and produces the final prediction by averaging their outputs or taking a majority vote.

https://www.datacamp.com/tutorial/what-bagging-in-machine-learning-a-guide-with-examples

How Bagging Works

Bootstrap sampling

  • Draw multiple random samples from the original dataset with replacement. Because each model trains on a different sample, the ensemble gains diversity.

κ°œλ³„ λͺ¨λΈ ν•™μŠ΅

  • Train a separate model on each bootstrap sample. Every model uses the same algorithm but learns from different data.

Combining predictions

  • Average the predictions of all models, or take a majority vote, to produce the final prediction. Errors of individual models tend to cancel out here, which improves predictive performance. A small hand-rolled sketch of these three steps follows below.
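
To make the three steps concrete, here is a minimal NumPy/scikit-learn sketch of bagging by hand; the loop count, the use of DecisionTreeClassifier, and the variable names are my own illustrative choices, not a fixed recipe.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Steps 1-2: train each tree on its own bootstrap sample (drawn with replacement)
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: combine the 10 predictions by majority vote
preds = np.stack([m.predict(X) for m in models])  # shape (10, n_samples)
votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print((votes == y).mean())  # agreement on the training set, for illustration only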

Representative algorithm

  • Random Forest: an algorithm that combines many decision trees in a bagging fashion. The generic form of this idea is sketched below.
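
Bagging is not tied to decision trees. scikit-learn's generic BaggingClassifier wraps any base estimator; a minimal sketch follows (the k-NN base model and parameter values are arbitrary choices, and the estimator= argument is named base_estimator= in scikit-learn versions before 1.2):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 20 k-NN classifiers, each trained on its own bootstrap sample
bag = BaggingClassifier(estimator=KNeighborsClassifier(),
                        n_estimators=20, random_state=42)
print(cross_val_score(bag, X, y, cv=5).mean())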

Boosting

Boosting trains models sequentially,
giving larger weights to the samples the previous model misclassified so that later models correct its errors.

https://medium.com/@brijesh_soni/understanding-boosting-in-machine-learning-a-comprehensive-guide-bdeaa1167a6

 

λΆ€μŠ€νŒ…μ˜ 원리

초기 λͺ¨λΈ ν•™μŠ΅

  • Train the first model on the full training set.

Increasing the weights of misclassified samples

  • Increase the weights of the samples the first model got wrong, helping subsequent models learn from those errors.

Sequential training

  • Train the next model on the reweighted samples, and repeat so that the models are learned one after another.

Combining predictions

  • Produce the final prediction as a weighted average of all models' predictions, with each model weighted according to its performance. This scheme is sketched below.
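
The procedure described above is essentially AdaBoost. A minimal scikit-learn sketch, assuming decision stumps as the weak learner (the stump depth and round count are arbitrary, and estimator= is named base_estimator= in scikit-learn versions before 1.2):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each round reweights the samples the previous stump misclassified
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=42)
print(cross_val_score(ada, X, y, cv=5).mean())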

Representative algorithms

  • AdaBoost
  • Gradient Boosting
  • XGBoost (see the sketch below)
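
XGBoost lives in the third-party xgboost package rather than scikit-learn, but it exposes a compatible API. A minimal sketch, assuming xgboost is installed (pip install xgboost) and with arbitrary hyperparameter values:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gradient-boosted trees; score() reports test-set accuracy
xgb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))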

Stacking

Stacking uses the predictions of several base models as inputs to train a meta-model, which produces the final prediction.

https://medium.com/@brijesh_soni/stacking-to-improve-model-performance-a-comprehensive-guide-on-ensemble-learning-in-python-9ed53c93ce28

μŠ€νƒœν‚Ήμ˜ 원리

κΈ°λ³Έ λͺ¨λΈ ν•™μŠ΅

  • Train several base models. Each is trained independently and may use a different algorithm.

메타 λͺ¨λΈ ν•™μŠ΅

  • Train a meta-model that takes the base models' predictions as input. The meta-model learns how to combine those outputs into a final prediction.

Final prediction

  • Use the meta-model's output as the final prediction. This lets the ensemble exploit the strengths of each base model. A sketch of how the meta-features are typically built follows below.
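
In practice, the meta-model is usually trained on out-of-fold predictions of the base models, so it never sees predictions a base model made on its own training data. A minimal sketch using cross_val_predict; the choice of base models and of 5 folds is arbitrary:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)
base_models = [RandomForestClassifier(random_state=42),
               SVC(probability=True, random_state=42)]

# Out-of-fold class probabilities from each base model become the meta-features
meta_features = np.hstack([cross_val_predict(m, X, y, cv=5, method='predict_proba')
                           for m in base_models])
meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_model.score(meta_features, y))  # in-sample score, for illustration only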

Representative approach

  • Stacking with a multilayer perceptron (MLP) as the meta-model is a representative approach; see the sketch below.
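
A minimal sketch of an MLP meta-model with scikit-learn's StackingClassifier; the base models, hidden-layer size, and iteration cap are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Base models feed their predictions to an MLP meta-model
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
                ('svc', SVC(probability=True, random_state=42))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42),
)
print(cross_val_score(stack, X, y, cv=5).mean())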

Emsemble κΈ°λ²•μ˜ μž₯, 단점

앙상블 κΈ°λ²•μ˜ μž₯점

  1. Better predictive performance: combining the predictions of multiple models yields higher accuracy.
  2. Reduced overfitting: aggregating the outputs of diverse models keeps individual models from overfitting.
  3. Improved stability: reducing model variance makes predictions more consistent.

앙상블 κΈ°λ²•μ˜ 단점

  1. Added complexity: training and combining several models can be involved, and designing and tuning the ensemble can be tricky.
  2. Harder to interpret: results are more difficult to explain than a single model's; complex schemes such as stacking are especially opaque.
  3. Computational cost: training many models takes substantial time and compute, particularly on large datasets.

Ensemble Method Example Code

Bagging Example

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Print the classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Boosting Example - Gradient Boosting

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Gradient Boosting model
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the model
gb.fit(X_train, y_train)

# Make predictions
y_pred = gb.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Print the classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Stacking Example

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the base models
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svc', SVC(kernel='rbf', probability=True, random_state=42))
]

# Create the stacking model with logistic regression as the meta-model
stacking = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# Train the model
stacking.fit(X_train, y_train)

# Make predictions
y_pred = stacking.predict(X_test)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

# Print the classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45