[Data Analysis] Time Series Data & Multivariate Analysis

Time Series Data

Time series data is a sequence of data points ordered in time.


Characteristics of Time Series Data

 

  • Trend: the long-term tendency of the data to increase or decrease.
    • Example: a company's revenue growing year after year.
  • Seasonality: a pattern that repeats over a fixed period, producing regular, periodic fluctuations.
    • Example: ice cream sales rising every summer.
  • Cyclicality: fluctuations that recur at irregular intervals.
    • Example: economic boom and recession cycles.
  • Noise: irregular variation in the data that interferes with forecasting.
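
To make these components concrete, here is a minimal sketch that builds a synthetic series from a trend, a seasonal pattern, and noise. The slope, amplitude, period, and noise scale are all illustrative assumptions, and cyclicality is left out to keep the example short.

```python
import numpy as np

# Hypothetical monthly series covering 4 years; every parameter is illustrative.
rng = np.random.default_rng(0)
t = np.arange(48)

trend = 0.5 * t                                 # steady long-term increase
seasonality = 10 * np.sin(2 * np.pi * t / 12)   # pattern repeating every 12 steps
noise = rng.normal(scale=1.0, size=t.size)      # irregular fluctuation

series = trend + seasonality + noise

# The seasonal component repeats after one 12-step period.
print(np.allclose(seasonality[:12], seasonality[12:24]))
```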

Time Series Analysis Methods

Time series decomposition separates a series into its components (trend, seasonality, cyclicality, and noise) so that each can be analyzed on its own.

  • Additive Model: treats the effects of the individual components as separate and models the series as their sum.
    • Example: series = trend + seasonality + cyclicality + noise
  • Multiplicative Model: assumes that the seasonal pattern grows as the level of the data grows.
    • Example: series = trend * seasonality * cyclicality * noise
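import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Generate a synthetic monthly time series (month-end dates, 24 observations).
# It is stored in a DataFrame with a 'value' column because later examples add columns to it.
# Note: the 'M' alias is spelled 'ME' in pandas 2.2 and later.
date_rng = pd.date_range(start='1/1/2020', end='1/1/2022', freq='M')
data = pd.DataFrame({'value': np.random.randn(len(date_rng))}, index=date_rng)

# Decompose the series (period=12: monthly data with a yearly cycle)
result = seasonal_decompose(data['value'], model='additive', period=12)
result.plot()
plt.show()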
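
The difference between the two models can be sketched with synthetic components. The numbers here are assumptions chosen for illustration. In the additive series the seasonal swing keeps a fixed size, while in the multiplicative series it grows with the trend. Taking logarithms turns the multiplicative form into an additive one.

```python
import numpy as np

t = np.arange(36)
trend = 100 + 2.0 * t                                    # strictly positive trend
seasonal_factor = 1 + 0.2 * np.sin(2 * np.pi * t / 12)   # multiplicative seasonal factor

additive = trend + 20 * np.sin(2 * np.pi * t / 12)       # seasonal swing has a fixed size
multiplicative = trend * seasonal_factor                 # seasonal swing grows with the trend

# log(T * S) = log(T) + log(S): the product becomes a sum on the log scale
log_series = np.log(multiplicative)
print(np.allclose(log_series, np.log(trend) + np.log(seasonal_factor)))

# Peak seasonal deviation in the first and the last year of the multiplicative series
first_year_peak = (multiplicative - trend)[:12].max()
last_year_peak = (multiplicative - trend)[24:].max()
print(first_year_peak < last_year_peak)
```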

 


Statistical Methods

1. Moving Average: smooths short-term fluctuations in the data to reveal the trend.

# 3-period trailing mean of the 'value' column
data['moving_avg'] = data['value'].rolling(window=3).mean()
data[['value', 'moving_avg']].plot()
plt.show()
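
As a small check of what a rolling mean computes, the sketch below compares `rolling(window=3).mean()` with the same trailing average written by hand. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])
rolled = values.rolling(window=3).mean()

# The same trailing average computed by hand with a uniform kernel
manual = np.convolve(values.to_numpy(), np.ones(3) / 3, mode="valid")

print(rolled.tolist())   # the first two entries are NaN: fewer than 3 observations
print(manual.tolist())
```

The leading NaN values are why a plot of the moving average starts later than the raw series.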

 

2. Exponential Smoothing: a method that gives greater weight to more recent observations.

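from statsmodels.tsa.holtwinters import SimpleExpSmoothing

# Fit on the 'value' column only; the smoothing level is estimated from the data
model = SimpleExpSmoothing(data['value']).fit()
data['exp_smoothing'] = model.fittedvalues
data[['value', 'exp_smoothing']].plot()
plt.show()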
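
The weighting idea can be written out as the textbook recursion s[t] = alpha * x[t] + (1 - alpha) * s[t-1]. The helper `simple_exp_smoothing` below is a hypothetical illustration, not the statsmodels implementation. statsmodels may initialize the series differently and reports one-step-ahead fitted values, so its numbers need not match this sketch.

```python
import numpy as np

def simple_exp_smoothing(x, alpha):
    """Return s with s[0] = x[0] and s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    x = np.asarray(x, dtype=float)
    s = np.empty_like(x)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

smoothed = simple_exp_smoothing([10, 12, 11, 15], alpha=0.5)
print(smoothed)  # [10. 11. 11. 13.]
```

A larger alpha follows recent observations more closely, and a smaller alpha gives a smoother curve.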

Time Series Forecasting

  • ARIMA (Autoregressive Integrated Moving Average): a model widely used for forecasting time series data.
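from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(data['value'], order=(5, 1, 0))
model_fit = model.fit()

# Predict the periods after the last observation. The forecast index lies beyond
# the observed dates, so it is kept as a separate Series instead of a column of `data`.
forecast = model_fit.predict(start=len(data), end=len(data) + 12)
ax = data['value'].plot(label='value')
forecast.plot(ax=ax, label='forecast')
ax.legend()
plt.show()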

Common Problems When Working with Time Series Data and Their Solutions

Problems

  • Two problems come up frequently when working with time series data: missing values and outliers.

 

  • Missing Values: data for certain time points is absent.
  • Outliers: data points that fall outside the expected range and are inconsistent with the rest of the series.
  • The solutions that follow address each of these problems.

Solutions

1. Handling Missing Values

  • Interpolation: fills each missing value using the surrounding data points.
data.interpolate(method='linear', inplace=True)
  • Mean imputation: alternatively, replaces missing values with the mean of the column.
data.fillna(data.mean(), inplace=True)
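
The two approaches give different results on the same gaps. The sketch below uses a small made-up series: linear interpolation follows the neighbouring values, while mean imputation puts the same number into every gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 9.0])

linear = s.interpolate(method="linear")   # fills each gap from its neighbours
mean_filled = s.fillna(s.mean())          # fills every gap with the same value

print(linear.tolist())        # [1.0, 2.0, 3.0, 6.0, 9.0]
print(mean_filled.tolist())   # both gaps receive the mean of 1, 3 and 9
```

For a series with a trend, interpolation usually preserves the local shape better, because the overall mean ignores where in time the gap occurs.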

 

2. Detecting and Handling Outliers

Outliers can be detected with the Z-score and IQR methods and then handled.
  • Z-score
from scipy import stats

data['z_score'] = np.abs(stats.zscore(data['value']))
outliers = data[data['z_score'] > 3]
  • IQR (Interquartile Range) method
Q1 = data['value'].quantile(0.25)
Q3 = data['value'].quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data['value'] < (Q1 - 1.5 * IQR)) | (data['value'] > (Q3 + 1.5 * IQR))]
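
The two rules do not always agree. In the sketch below (made-up values, NumPy only), a single extreme value inflates the standard deviation so much that its Z-score stays below 3, while the IQR rule still flags it. This is an illustration for a very small sample and not a general statement about either method.

```python
import numpy as np

values = np.array([10, 11, 9, 10, 12, 11, 10, 50], dtype=float)

# Z-score rule: distance from the mean in units of (population) standard deviation
z = np.abs((values - values.mean()) / values.std())
z_outliers = values[z > 3]

# IQR rule: points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers)     # empty: the value 50 has |z| of about 2.64
print(iqr_outliers)   # [50.]
```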

 

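3. Differencing

  • Removes trend and seasonality to make the series stationary. A first difference (diff()) removes a linear trend; removing seasonality requires differencing at the seasonal lag, for example diff(12) for monthly data.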
data['differenced'] = data['value'].diff()
data['differenced'].dropna().plot()
plt.show()
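
A short sketch of both kinds of differencing on synthetic data (the slope and the 12-step period are assumptions for illustration): the first difference of a linear trend is constant, and a lag-12 difference cancels a pattern that repeats every 12 steps.

```python
import numpy as np
import pandas as pd

t = np.arange(10)
trended = pd.Series(3.0 * t + 5.0)       # pure linear trend with slope 3

first_diff = trended.diff().dropna()     # x[t] - x[t-1]
print(first_diff.unique())               # a single constant value: the trend is gone

# Seasonal differencing: x[t] - x[t-12] removes a pattern with period 12
months = np.arange(36)
seasonal = pd.Series(np.sin(2 * np.pi * months / 12))
seasonal_diff = seasonal.diff(12).dropna()
print(np.allclose(seasonal_diff, 0.0))
```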

 


Multivariate Analysis

Multivariate analysis is a family of statistical techniques that analyze the relationships among several variables at the same time.

  • Rather than analyzing the measurements of several phenomena or events one by one, it analyzes them together.
  • This is useful for understanding how variables interact in complex data sets and for extracting important patterns and insights.

 

Understanding Multivariate Data

Multivariate analysis is used to understand and predict the interactions and complex relationships among variables.

  • Identifying interactions between variables
    • Multivariate analysis can reveal interactions between variables that univariate analysis may overlook. This matters most when variables act interdependently and not independently.
    • Example: in consumer data, age and income level can jointly affect purchasing patterns.
  • Capturing complex relationships
    • Multivariate data can capture complex relationships among variables, providing deeper insight than a simple pairwise analysis.
    • Example: customer segmentation divides customers into groups based on a range of characteristics so that the behavior of each group can be analyzed.
  • Predicting patterns
    • Multivariate techniques can be used to find patterns in the data and to predict future behavior or outcomes.
    • Example: predicting consumer purchasing behavior from several variables and building a marketing strategy on that prediction.

Correlation Analysis

This section covers two correlation coefficients: Pearson and Spearman.

Pearson Correlation Coefficient (Pearson Correlation)

  • Definition: a statistical measure of the strength and direction of the linear relationship between two variables.
  • Characteristics:
    • Measures the linear relationship between continuous numeric variables.
    • Appropriate when the data are continuous and approximately normally distributed.
  • Calculation:
    • The Pearson coefficient takes values between -1 and 1: 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation.
import numpy as np
from scipy.stats import pearsonr

# Example data (two independent random samples, so the correlation should be near 0)
x = np.random.rand(100)
y = np.random.rand(100)

# Compute the Pearson correlation coefficient
corr, _ = pearsonr(x, y)
print(f'Pearson correlation coefficient: {corr}')

# Example output (varies between runs):
# Pearson correlation coefficient: -0.09606113577342046
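
For reference, the coefficient is r = cov(x, y) / (std(x) * std(y)). The sketch below writes that out with centered vectors and compares it with `np.corrcoef`. The data are synthetic, and the linear relationship y = 2x + noise is an assumption made so that the coefficient comes out clearly positive.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)   # linearly related to x, plus noise

# r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)) with centered vectors xc, yc
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

r_numpy = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_manual, r_numpy))
```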

 

 

Spearman Correlation Coefficient (Spearman Correlation)

  • Definition: a nonparametric measure of the relationship between two variables based on their ranks.
  • Characteristics:
    • Measures rank-based, monotonic relationships, including nonlinear ones.
    • Useful when the data are not normally distributed or are ordinal.
  • Calculation:
    • The Spearman coefficient takes values between -1 and 1: 1 indicates a perfect positive rank correlation, -1 a perfect negative rank correlation, and 0 no rank correlation.
from scipy.stats import spearmanr

# Example data (two independent random samples)
x = np.random.rand(100)
y = np.random.rand(100)

# Compute the Spearman correlation coefficient
corr, _ = spearmanr(x, y)
print(f'Spearman correlation coefficient: {corr}')

# Example output (varies between runs):
# Spearman correlation coefficient: 0.07144314431443144
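
One way to see the difference between the two coefficients: Spearman is the Pearson coefficient applied to ranks. The sketch below uses an assumed monotonic but strongly nonlinear relationship, y = exp(x). The rank correlation is 1, while the Pearson coefficient stays noticeably below 1.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0.0, 5.0, 50))
y = pd.Series(np.exp(x.to_numpy()))      # monotonic, but far from linear

pearson = np.corrcoef(x, y)[0, 1]

# Spearman: replace each value by its rank, then apply the Pearson formula
spearman = np.corrcoef(x.rank(), y.rank())[0, 1]

print(pearson < spearman)
```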

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a statistical technique that reduces the dimensionality of high-dimensional data while extracting its most important features.
  • PCA finds the directions along which the variance of the data is greatest and reduces the dimensionality while retaining the important information.
  • The steps of PCA are as follows.

1. Standardizing the Data

  • Scale each variable to mean 0 and standard deviation 1 so that all variables carry equal weight.
  • This is needed so that variables measured in different units do not distort the result of the analysis.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Generate example data: 100 samples, 5 variables
data = np.random.rand(100, 5)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

# Standardize the data (mean 0, standard deviation 1 per column)
scaler = StandardScaler()
data_std = scaler.fit_transform(df)

2. Computing the Covariance Matrix

  • Compute the covariance matrix of the standardized data, which captures the linear relationships between the variables.
# Compute the covariance matrix
cov_matrix = np.cov(data_std.T)
print("Covariance Matrix:\n", cov_matrix)

 

3. Eigendecomposition

  • Compute the eigenvalues and eigenvectors of the covariance matrix.
  • The eigenvectors give the directions of maximum variance in the data, and the eigenvalues give the amount of variance along each direction.
# Compute eigenvalues and eigenvectors
# (eigh is intended for symmetric matrices such as a covariance matrix and returns real values)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
print("Eigenvalues:\n", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

4. Selecting Principal Components

  • Select eigenvectors as principal components in order, starting with the one that has the largest eigenvalue.
  • In general, only the few components that explain most of the total variance are kept.
# Sort eigenvalues in descending order and reorder the eigenvectors to match
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

# Select principal components (here: 2)
n_components = 2
selected_eigenvectors = eigenvectors[:, :n_components]
print("Selected Eigenvectors:\n", selected_eigenvectors)
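
One common way to decide how many components to keep is the explained variance ratio, which is each eigenvalue divided by the sum of all eigenvalues. The sketch below is self-contained and uses its own synthetic data. The 90% threshold is an assumption for illustration, not a rule from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # make two columns strongly correlated

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(X_std.T))[::-1]   # eigenvalues in descending order

explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 90% (assumed threshold)
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print(explained_ratio.round(3), n_keep)
```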

 

5. Projecting the Data onto the New Feature Space

  • Project the standardized data onto the selected principal components to reduce its dimensionality.
  • This reduces the number of dimensions while retaining the main information in the data.
# Project the data
principal_components = data_std.dot(selected_eigenvectors)
print("Principal Components:\n", principal_components)
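
A useful sanity check on the projection is that the resulting components are uncorrelated and that their variances equal the selected eigenvalues. The sketch below repeats the steps with NumPy only on its own synthetic data, so it is independent of the variables defined in the walkthrough.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigvals)[::-1]                 # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_std @ eigvecs[:, :2]                   # project onto the top two components
score_cov = np.cov(scores.T)

# Off-diagonal entries are ~0; diagonal entries match the two largest eigenvalues
print(np.allclose(score_cov, np.diag(eigvals[:2])))
```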

Factor Analysis

Factor analysis is a statistical method that analyzes the relationships among variables and summarizes them in a small number of latent factors.
  • The method focuses on discovering the latent factors hidden behind the observed variables.
  • Factor analysis models the latent structure of the data and finds common factors that explain the variability of the observed variables, which helps reduce the number of variables and understand the structure of the data.

Characteristics of Factor Analysis

Factor analysis has three main characteristics.

  • Discovering latent factors: the analysis assumes that the variables are influenced by one or more unobserved latent variables (factors).
  • Difference from PCA: PCA focuses on finding the directions that maximize the variance of the data, whereas factor analysis focuses on modeling the latent structure within the data.
  • Applications: mainly used in psychology, the social sciences, and marketing to analyze the structure of survey data.
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt

# Apply factor analysis with two factors
fa = FactorAnalysis(n_components=2)
factors = fa.fit_transform(data_std)

# Visualize the result
plt.scatter(factors[:, 0], factors[:, 1])
plt.xlabel('Factor 1')
plt.ylabel('Factor 2')
plt.title('Factor Analysis Result')
plt.show()
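
The latent-factor assumption can be illustrated by simulation. In the sketch below, three observed variables are generated from one unobserved factor plus variable-specific noise, x_j = loading_j * f + e_j. The loadings and noise scale are hypothetical values. The observed variables end up strongly correlated even though the factor itself is never observed, and that correlation structure is what factor analysis tries to explain.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

factor = rng.normal(size=n)                        # unobserved latent factor
loadings = np.array([0.9, 0.8, 0.7])               # hypothetical loadings
unique_noise = rng.normal(scale=0.3, size=(n, 3))  # variable-specific noise

# Each observed variable is its loading times the factor, plus its own noise
observed = factor[:, None] * loadings + unique_noise

corr = np.corrcoef(observed.T)
print(corr.round(2))
```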

Purposes of Factor Analysis

Factor analysis serves the following purposes.
  1. Variable reduction: groups several variables into a single factor, reducing the dimensionality of the data.
  2. Removing unnecessary variables: variables that do not belong to any factor or have low importance can be identified and removed.
  3. Understanding variable characteristics: grouping related variables makes it possible to identify the mutually independent characteristics of the factors.