[ML] ํŠน์„ฑ ๊ณตํ•™๊ณผ ๊ทœ์ œ

๋‹ค์ค‘ ํšŒ๊ท€(Characteristic Engineering and Regulation)

๋‹ค์ค‘ ํšŒ๊ท€

์—ฌ๋Ÿฌ๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ์„ ํ˜• ํšŒ๊ท€(Linear Regression)๋ฅผ ๋‹ค์ค‘ ํšŒ๊ท€(Multiple Regression)์ด๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.
  • 1๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ–ˆ์„๋•Œ, ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์ด ํ•™์Šต ํ•˜๋Š”๊ฒƒ์€ ์ง์„ ์ž…๋‹ˆ๋‹ค. 2๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•˜๋ฉด ์„ ํ˜• ํšŒ๊ท€๋Š” ํ‰๋ฉด์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • ์™ผ์ชฝ ๊ทธ๋ฆผ์ด 1๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์ด ํ•™์Šต ํ•˜๋Š” ๋ชจ๋ธ, ์˜ค๋ฅธ์ชฝ ๊ทธ๋ฆผ์ด 2๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

  • ์˜ค๋ฅธ์ชฝ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ํŠน์„ฑ์ด 2๊ฐœ๋ฉด Target๊ฐ’๊ณผ ํ•จ๊ป˜ 3์ฐจ์› ๊ณต๊ฐ„์„ ํ˜•์„ฑํ•˜๊ณ  ์„ ํ˜• ํšŒ๊ท€ ๋ฐฉ์ •์‹์€ ํ‰๋ฉด์ด ๋ฉ๋‹ˆ๋‹ค.
Target = a x ํŠน์„ฑ1 + b x ํŠน์„ฑ2 + ์ ˆํŽธ
  • ๊ทธ๋Ÿฌ๋ฉด ํŠน์„ฑ์ด 3๊ฐœ์ผ ๊ฒฝ์šฐ์—๋Š”? ์šฐ๋ฆฌ๋Š” 3์ฐจ์› ๊ณต๊ฐ„์„ ๊ทธ๋ฆฌ๊ฑฐ๋‚˜ ์ƒ์ƒํ• ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ๋ ‡์ง€๋งŒ, ํŠน์„ฑ์ด ๋งŽ์€ ๊ณ ์ฐจ์›์—์„œ๋Š” ์„ ํ˜•ํšŒ๊ท€๊ฐ€ ๋งค์šฐ ๋ณต์žกํ•œ ๋ชจ๋ธ์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ํ•œ๋ฒˆ 3๊ฐœ์˜ ํŠน์„ฑ์„ ๊ฐ๊ฐ ์ œ๊ณฑํ•ด์„œ ์ถ”๊ฐ€ํ•˜๊ณ , ๊ฐ ํŠน์„ฑ์„ ๊ณฑํ•ด์„œ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋งŒ๋“ค๊ฒ ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ๋†์–ด ๊ธธ์ด x ๋†์–ด ๊ธธ์ด ๋ฅผ ์ƒˆ๋กœ์šด ํŠน์„ฑ์œผ๋กœ ๋งŒ๋“œ๋Š” ๊ฒ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ธฐ์กด์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•ด์„œ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋ฝ‘์•„๋‚ด๋Š” ์ž‘์—…์„ ํŠน์„ฑ๊ณตํ•™(Feature Engineering)์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์ค€๋น„ 

  • ์ด์ „๊ณผ ๋‹ฌ๋ฆฌ ๋†์–ด์˜ ํŠน์„ฑ์ด 3๊ฐœ๊ฐ€ ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ผ์ผ์ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์‚ฌํ•ด์„œ ๋ถ™์ด๋Š”๊ฑด ๋ฒˆ๊ฑฐ๋กญ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, Pandas๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค.
  • Pandas๋Š” ์ž˜ ์•Œ๋ ค์ง„ ๋ฐ์ดํ„ฐ ๋ถ„์„ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ค‘ ํ•˜๋‚˜์ด๋ฉฐ, ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ํŒ๋‹ค์Šค์˜ ํ•ต์‹ฌ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.
  • Numpy ๋ฐฐ์—ด๊ณผ ๋‹ค์ฐจ์› ๋ฐฐ์—ด์„ ๋‹ค๋ฃฐ์ˆ˜ ์žˆ์ง€๋งŒ, ๋” ๋งŽ์€ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ Numpy ๋ฐฐ์—ด๋กœ ์‰ฝ๊ฒŒ ๋ด๊ฟ€์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์™€ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
# Pandas๋กœ ๋ฐ์ดํ„ฐ ์ค€๋น„. csv ํŒŒ์ผ๋กœ ๋ฐ›์•„์„œ pandas dataframe -> numpy ๋ฐฐ์—ด๋กœ ๋ณ€ํ™˜

import pandas as pd

df = pd.read_csv('https://bit.ly/perch_csv_data')
perch_full = df.to_numpy()

print(perch_full[:5]) # perch_full is long, so only the first 5 rows are printed; print(perch_full) shows everything.
[[ 8.4 2.11 1.41]
 [13.7 3.53 2. ]
 [15. 3.82 2.43]
 [16.2 4.59 2.63]
 [17.4 4.59 2.94]]
  • Target ๋ฐ์ดํ„ฐ๋„ ์ด์ „๊ณผ ๋™์ผํ•œ ๋ฐฉ์‹์œผ๋กœ ๊ฐ€์ ธ์˜ค๊ณ , perch_full, perch_weight๋ฅผ Training_set์™€ Test_set๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.
  • ์ด ๋ฐ์ดํ„ฐ๋“ค์„ ์‚ฌ์šฉํ•ด์„œ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
import numpy as np

perch_weight = np.array([5.9, 32.0, 40.0, 51.5, 70.0, 100.0, 78.0, 80.0, 85.0, 85.0, 110.0,
       115.0, 125.0, 130.0, 120.0, 120.0, 130.0, 135.0, 110.0, 130.0,
       150.0, 145.0, 150.0, 170.0, 225.0, 145.0, 188.0, 180.0, 197.0,
       218.0, 300.0, 260.0, 265.0, 250.0, 250.0, 300.0, 320.0, 514.0,
       556.0, 840.0, 685.0, 700.0, 700.0, 690.0, 900.0, 650.0, 820.0,
       850.0, 900.0, 1015.0, 820.0, 1100.0, 1000.0, 1100.0, 1000.0,
       1000.0])
# Scikit-learn ํ›ˆ๋ จ์„ธํŠธ๋Š” 2์ฐจ์› ๋ฐฐ์—ด์ด์—ฌ์•ผ ํ•จ์œผ๋กœ, Numpy์˜ reshape method๋ฅผ ์‚ฌ์šฉํ•ด์„œ 2์ฐจ์›์œผ๋กœ ๋ด๊ฟ”์คŒ
# perch_full & perch_weight๋ฅผ ํ›ˆ๋ จ & ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋‚˜๋ˆˆ๋‹ค.
from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = train_test_split(perch_full, perch_weight, random_state=42)

Scikit-learn์˜ ๋ณ€ํ™˜๊ธฐ

Scikit-learn์€ ํŠน์„ฑ์„ ๋งŒ๋“ค๊ฑฐ๋‚˜ ์ „์ฒ˜๋ฆฌ ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํด๋ž˜์Šค๋ฅผ ์ œ๊ณตํ•˜๋Š”๋ฐ, ์ด๋Ÿฐ ํด๋ž˜์Šค๋ฅผ ๋ณ€ํ™˜๊ธฐ (Transformer)๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.
๋ณ€ํ™˜๊ธฐ Class๋Š” fit(), transform() Method๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ์ถ”๊ฐ€๋กœ LinearRegression ์—์„œ๋Š” ์ถ”์ •๊ธฐ, Transformer์—์„œ๋Š” ๋ณ€ํ™˜๊ธฐ ๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.
  • ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•  ๋ณ€ํ™˜๊ธฐ๋Š” PolynomialFeature Class์ž…๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ์‚ฌ์šฉํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ํด๋ž˜์Šค๋Š” sklearn.preprocessing ํŒจํ‚ค์ง€์— ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  PolynomialFeatures Class๊ฐ€ ํ•˜๋Š”๊ฑด? ๋ณ„๊ฑฐ ์—†์–ด์š”. ํ•˜๋Š”๊ฒŒ ํŠน์„ฑ ๋ช‡๊ฐœ์ธ์ง€, 2 x 3์„ ํ•ด์„œ ์–ด๋–ค ์กฐํ•ฉ์œผ๋กœ ๋งŒ๋“œ๋Š”์ง€ ํŒŒ์•…ํ•˜๋Š” ์ •๋„์ž…๋‹ˆ๋‹ค.
from sklearn.preprocessing import PolynomialFeatures
# PolynomialFeatures(๋ณ€ํ™˜๊ธฐ) - degree๋ผ๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜: ๊ธฐ๋ณธ๊ฐ’์ด 2(์ œ๊ณฑํ•ญ์„ ๋งŒ๋“ค์–ด์ฃผ๋Š” ํ‘œ์‹œ)

# degree = 2 (3์ด๋ฉด 3์ œ๊ณฑ(์ œ๊ณฑํ•ญ์ด 3)์œผ๋กœ ํ•œ๋‹ค.)
poly = PolynomialFeatures()
poly.fit([[2,3]])

# 1 (bias), 2, 3, 2**2, 2*3, 3**2
print(poly.transform([[2,3]])) # [2, 3] is a made-up sample
# 2, 3: the original features, kept as-is. 4 = 2 squared, 9 = 3 squared, 6 = 2*3. The 1 is the feature for the intercept.
[[1. 2. 3. 4. 6. 9.]]
  • fit() Method๋Š” ์ƒˆ๋กญ๊ฒŒ ๋งŒ๋“ค ํŠน์„ฑ ์กฐํ•ฉ์„ ์ฐพ๊ณ , Transform() Method๋Š” ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. 
  • ์—ฌ๊ธฐ์„œ๋Š” 2๊ฐœ์˜ ํŠน์„ฑ(์›์†Œ)๋ฅผ ๊ฐ€์ง„ ์ƒ˜ํ”Œ [2,3]์ด ํŠน์„ฑ์„ ๊ฐ€์ง„ ์ƒ˜ํ”Œ [1, 2, 3, 4, 6, 9]๋กœ ๋ด๋€Œ์—ˆ์Šต๋‹ˆ๋‹ค.
  • PolynomialFeature ํด๋ž˜์Šค๊ฐ€ ๊ธฐ๋ณธ์ ์œผ๋กœ ๊ฐ ํŠน์„ฑ์„ ์ œ๊ณฑํ•œ ํ•ญ์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๊ด€๋ จ ์‹์„ ์•„๋ž˜์— ์ฒจ๋ถ€ ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.
๋ฌด๊ฒŒ = a x ๊ธธ์ด + b x ๋†’์ด + c x ๋‘๊ป˜ + d x 1
  • ๊ด€๋ จ ์‹์„ ๋ณด๋ฉด, ํŠน์„ฑ์€ (๊ธธ์ด, ๋†’์ด, ๋‘๊ป˜, 1)์ด ๋ฉ๋‹ˆ๋‹ค. ๊ทผ๋ฐ 1์€ ๋ฌด์—‡์ผ๊นŒ์š”? 1์€, ์„ ํ˜• ๋ฐฉ์ •์‹์˜ ์ ˆํŽธ์„ ํ•ญ์ƒ ๊ฐ’์ด 1์ธ ํŠน์„ฑ๊ณผ ๊ณฑํ•ด์ง€๋Š” ๊ณ„์ˆ˜ ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทผ๋ฐ, Scikit-learn์˜ ์„ ํ˜•๋ชจ๋ธ์„ ์ž๋™์œผ๋กœ ์ ˆํŽธ์„ ์ถ”๊ฐ€ํ•จ์œผ๋กœ ๊ตณ์ด ์ด๋ ‡๊ฒŒ ํŠน์„ฑ์„ ๋งŒ๋“ค ํ•„์š”๊ฐ€ ์—†์œผ๋ฏ€๋กœ, include_bias=False๋กœ ์ง€์ •ํ•˜์—ฌ ๋‹ค์‹œ ํŠน์„ฑ์„ ๋ฐ˜ํ™˜ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” ์ ˆํŽธ์„ ์œ„ํ•œ ํ•ญ์ด ์ œ๊ฑฐ๋˜๊ณ , ํŠน์„ฑ์˜ ์ œ๊ณฑ๊ณผ ํŠน์„ฑ๋ผ๋ฆฌ ๊ณฑํ•œ ํ•ญ๋งŒ ์ถ”๊ฐ€๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
# include_bias=False๋กœ ์ง€์ •ํ•˜์—ฌ ๋‹ค์‹œ ํŠน์„ฑ์„ ๋ณ€ํ™˜
poly = PolynomialFeatures(include_bias=False)
poly.fit([[2, 3]])
print(poly.transform([[2, 3]]))
[[2. 3. 4. 6. 9.]]
  • ํ•œ๋ฒˆ ์ด ๋ฐฉ์‹์œผ๋กœ train_input์— ์ ์šฉ์‹œ์ผœ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. train_input์„ ๋ณ€ํ™˜ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ train_polu์— ์ €์žฅํ•˜๊ณ  ๋ฐฐ์—ด์˜ ํฌ๊ธฐ๋ฅผ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
poly = PolynomialFeatures(include_bias=False)

poly.fit(train_input) # fit (learn the feature combinations)
train_poly = poly.transform(train_input) # train_poly (a NumPy array)

print(train_poly.shape)
(42, 9)
  • PolynomialFeatures ํด๋ž˜์Šค๋Š” 9๊ฐœ์˜ ํŠน์„ฑ์ด ์–ด๋–ป๊ฒŒ ๋งŒ๋“ค์–ด์กŒ๋Š”์ง€ ํ™•์ธํ• ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค.
  • get_feature_names_out() Method ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด 9๊ฐœ์˜ ํŠน์„ฑ์ด ๊ฐ๊ฐ ์–ด๋–ค ์กฐํ•ฉ์œผ๋กœ ๋งŒ๋“ค์–ด ์กŒ๋Š”์ง€ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
poly.get_feature_names_out()
array(['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2'], dtype=object)
  • โ€˜x0โ€™์€ ์ฒซ๋ฒˆ์งธ ํŠน์„ฑ์„ ์˜๋ฏธํ•˜๊ณ 
  • โ€˜x0^2โ€™๋Š” ์ฒซ๋ฒˆ์งธ ํŠน์„ฑ์˜ ์ œ๊ณฑ,
  • โ€˜x0 xlโ€™์€ ์ฒซ๋ฒˆ์งธ ํŠน์„ฑ๊ณผ ๋‘๋ฒˆ์งธ ํŠน์„ฑ์˜ ๊ณฑ์„ ๋‚˜ํƒ€๋‚ด๋Š” ์‹์ž…๋‹ˆ๋‹ค.
  • ์ด์ œ Test_set๋ฅผ ๋ฐ˜ํ™˜ํ•˜๊ณ  ๋ณ€ํ™˜๋œ ํŠน์„ฑ์„ ์ด์šฉํ•˜์—ฌ ๋‹ค์ค‘ ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ค๊ฒ ์Šต๋‹ˆ๋‹ค.
# training_set์— ์‚ฌ์šฉํ•œ๊ฑธ test_set์— ์‚ฌ์šฉํ•œ๋‹ค.
test_poly = poly.transform(test_input)

๋‹ค์ค‘ ํšŒ๊ท€ ๋ชจ๋ธ ํ›ˆ๋ จํ•˜๊ธฐ

๋‹ค์ค‘ ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋Š”๊ฒƒ์€ ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๋Š” ๋ฐฉ์‹๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์—ฌ๋ ค๊ฐœ์˜ ํŠน์„ฑ์„ ์ด์šฉํ•ด์„œ ์„ ํ˜•ํšŒ๊ท€๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š”๊ฒƒ ๋ฟ์ž…๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ํ›ˆ๋ จ์‹œ์ผœ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(train_poly, train_target)

print(lr.score(train_poly, train_target))
0.9903183436982125
  • 1๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ๋ณด๋‹ค ๋†’์€ ์ ์ˆ˜๊ฐ€ ๋‚˜์™”์Šต๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ํŠน์„ฑ์ด ๋Š˜์–ด๋‚˜๋ฉด ์„ ํ˜•ํšŒ๊ท€๋Š” ์ข‹์€ ์ •ํ™•๋„๋ฅผ ๊ฐ€์ ธ์˜จ๋‹ค๋Š”๊ฒƒ์„ ๋ณผ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Test_set ์ ์ˆ˜๋„ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
print(lr.score(test_poly, test_target)) # ๊ณผ์†Œ์ ํ•ฉ ๋ฌธ์ œ ํ•ด๊ฒฐ
0.9714559911594159
  • Test_set์— ๋Œ€ํ•œ ์ ์ˆ˜๋Š” 1๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ์„ ํ˜•ํšŒ๊ท€ ๋ณด๋‹ค ์ ์ˆ˜๊ฐ€ ๋†’์•„์ง€์ง€๋Š” ์•Š์•˜์ง€๋งŒ, ๋†์–ด์˜ ๊ธธ์ด๋งŒ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ์žˆ๋˜ ๊ณผ์†Œ์ ํ•ฉ ๋ฌธ์ œ๋Š” ํ•ด๊ฒฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • ๋งŒ์•ฝ ์—ฌ๊ธฐ์„œ ํŠน์„ฑ์„ ๋” ๋งŽ์ด ์ถ”๊ฐ€ํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ์š”? 3์ œ๊ณฑ, 4์ œ๊ณฑ, 5์ œ๊ณฑ ํ•ญ์„ ๋„ฃ๋Š”๊ฑฐ์ฃ . 
PolynomialFeature ํด๋ž˜์Šค์˜ degree ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ๊ฐ’์„ ๋ณ€๊ฒฝํ•˜์—ฌ ํ•„์š”ํ•œ ์ตœ๋Œ€ ์ฐจ์ˆ˜๋ฅผ ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 5์ œ๊ณฑ ๊นŒ์ง€ ํŠน์„ฑ์„ ์ถ”๊ฐ€ํ•ด์„œ ๋งŒ๋“ค์–ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
poly = PolynomialFeatures(degree=5, include_bias=False) # 5์ œ๊ณฑ๊นŒ์ง€ ํŠน์„ฑ์„ ๋งŒ๋“ค์–ด์„œ ์ถœ๋ ฅ

poly.fit(train_input)
train_poly = poly.transform(train_input)
test_poly = poly.transform(test_input)

print(train_poly.shape)
(42, 55) # ๋ฐ์ดํ„ฐ์…‹์€ 42๊ฐœ, ๋งŒ๋“ค์–ด์ง„ ํŠน์„ฑ์˜ ๊ฐœ์ˆ˜๊ฐ€ 55๊ฐœ
  • train_poly ๋ฐฐ์—ด์˜ ์—ด์˜ ๊ฐœ์ˆ˜๊ฐ€ ํŠน์„ฑ์˜ ๊ฐœ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์„ ๋‹ค์‹œ ํ›ˆ๋ จ์‹œ์ผœ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
lr.fit(train_poly, train_target)
print(lr.score(train_poly, train_target))
0.999999999999769
  • ์ •ํ™•๋„๊ฐ€ 99%์ด์ƒ์ž…๋‹ˆ๋‹ค. ์™„๋ฒฝํ•œ ์ ์ˆ˜์ธ๋ฐ, Test_set ์ ์ˆ˜๋„ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
# ์Œ์ˆ˜๊ฐ€ ๋œจ๋Š” ์ด์œ : training_set์— ๋„ˆ๋ฌด ๊ณผ๋Œ€ ์ ํ•ฉ ๋˜์–ด์„œ.
print(lr.score(test_poly, test_target))
-144.40490595353674
  • ํ ... ๋งค์šฐ ํฐ ์Œ์ˆ˜๊ฐ’์ด ๋‚˜์™”์Šต๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ Training_set์— ๋„ˆ๋ฌด ๊ณผ๋Œ€์ ํ•ฉ์ด ๋˜์–ด์„œ Test_set์ ์ˆ˜์—๋Š” ๋งค์šฐ ๋‚ฎ์€ ๊ฐ’์„ ๋งŒ๋“ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๊ทœ์ œ & ํ‘œ์ค€ํ™”

  • ์œ„์˜ ๊ธ€์—์„œ ๋ณด์ด๋“ฏ์ด, ๊ณผ๋Œ€์ ํ•ฉ์ด ๋œ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์ด ๊ณผ๋„ํ•˜๊ฒŒ ํ•™์Šตํ•˜์ง€ ๋ชปํ•˜๋„๋ก ๋ง‰๋Š”๊ฒƒ์„ ๊ทœ์ œ ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. 
์ฆ‰, ๋ชจ๋ธ์ด ํ›ˆ๋ จ ์„ธํŠธ์— ๊ณผ๋Œ€์ ํ•ฉ๋˜์ง€ ์•Š๋„๋ก ๋งŒ๋“œ๋Š” ๊ฒƒ ์ž…๋‹ˆ๋‹ค. ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์˜ ๊ฒฝ์šฐ ํŠน์„ฑ์— ๊ณฑํ•ด์ง€๋Š” ๊ณ„์ˆ˜(๊ธฐ์šธ๊ธฐ)์˜ ํฌ๊ธฐ๋ฅผ ์ž‘๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. 

  • ๊ทธ๋ฆผ์„ ๋ณด์‹œ๋ฉด, ์™ผ์ชฝ์€ ๊ทœ์ œ๋ฅผ ์„ค์ •ํ•˜๊ธฐ ์ „์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ•œ ๋ชจ๋ธ์ด๊ณ , ์˜ค๋ฅธ์ชฝ์€ ๊ทœ์ œ๋ฅผ ์ ์šฉํ•ด์„œ ํ•™์Šต์‹œํ‚จ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด, 55๊ฐœ์˜ ํŠน์„ฑ์œผ๋กœ ํ›ˆ๋ จํ•œ ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ์˜ ๊ณ„์ˆ˜๋ฅผ ๊ทœ์ œํ•˜์—ฌ ํ›ˆ๋ จ ์„ธํŠธ์˜ ์ ์ˆ˜๋ฅผ ๋‚ฎ์ถ”๊ณ  ๋Œ€์‹  ํ…Œ์ŠคํŠธ ์ ์ˆ˜๋ฅผ ๋†’์—ฌ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ๊ทผ๋ฐ, ๊ทœ์ œ๋ฅผ ์ ์šฉํ•˜๊ธฐ ์ „์— ์ •๊ทœํ™”๋ฅผ ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค. ํŠน์„ฑ์˜ ์Šค์ผ€์ผ์ด ์ •๊ทœํ™”๊ฐ€ ๋˜์ง€ ์•Š์œผ๋ฉด? ๊ณฑํ•ด์ง€๋Š” ๊ณ„์ˆ˜๊ฐ’๋„ ์ฐจ์ด๊ฐ€ ๋‚˜์ด ๋•Œ๋ฌธ์—, ๊ทœ์ œ๋ฅผ ์ ์šฉํ•˜๋ฉด, ๋˜‘๊ฐ™์ด ์ œ์–ด๊ฐ€ ๋˜์ง€ ์•Š์„์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • ์ด๋ฒˆ์—๋Š” Scikit-learn์—์„œ ์ œ๊ณตํ•˜๋Š” StandardScaler ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋ณ€ํ™˜ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
from sklearn.preprocessing import StandardScaler

ss = StandardScaler() # initialize the object
ss.fit(train_poly) # learn each feature's mean and standard deviation

train_scaled = ss.transform(train_poly)
test_scaled = ss.transform(test_poly)
  • ์ด ์ฝ”๋“œ๋Š” StandardScaler ํด๋ž˜์Šค์˜ ๊ฐ์ฒด ss๋ฅผ ์ดˆ๊ธฐํ™” ํ•œ ํ›„, PolynomialFeature ํด๋ž˜์Šค๋กœ ๋งŒ๋“  train_poly ๊ฐ์ฒด๋ฅผ ์‚ฌ์šฉํ•ด์„œ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.
  • Training_set๋กœ ํ•™์Šตํ•œ ๋ณ€ํ™˜๊ธฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ Test_set๋„ ๋ณ€ํ™˜ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ ์ด์ œ ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ์— ๊ทœ์ œ๋ฅผ ์ถ”๊ฐ€ํ•œ ๋ชจ๋ธ์— ๋Œ€ํ•˜์—ฌ ์„ค๋ช…์„ ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๋ฆฟ์ง€(ridge) & ๋ผ์˜(Lasso)2๊ฐœ๊ฐ€ ์žˆ๋Š”๋ฐ, ๋ฆฟ์ง€(ridge) ๋ถ€ํ„ฐ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

Ridge(๋ฆฟ์ง€) ํšŒ๊ท€

Ridge(๋ฆฟ์ง€) ํšŒ๊ท€๋Š” ๊ณ„์ˆ˜๋ฅผ ์ œ๊ณฑํ•œ ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ๊ทœ์ œ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • L2 ๊ทœ์ œ, ๋‹ค๋ฅธ class์— L2๊ทœ์ œ๊ฐ€ ์ ์šฉ๋ฌ์„๋•? ์„ ํ˜•ํšŒ๊ท€์—์„œ๋Š” ๋ฆฟ์ง€ ํšŒ๊ท€ ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฆฟ์ง€๋Š” sklearn.linear_model ํŒจํ‚ค์ง€ ์•ˆ์— ์žˆ์œผ๋ฉฐ, ํŽธ๋ฆฌํ•œ๊ฒƒ์€ ํ›ˆ๋ จ & ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ๊ฐ™๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ ๊ฐ์ฒด์—์„œ fit() method๋กœ ํ›ˆ๋ จํ•˜๊ณ , score() method๋กœ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
  • ์•ž์„œ ์ค€๋น„ํ•œ train_scaled ๋ฐ์ดํ„ฐ๋กœ ๋ฆฟ์ง€ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
from sklearn.linear_model import Ridge

#alpha=1(1์ด๋ฉด ๊ฐ•๋„ ์Ž”, 0์ด๋ฉด ๊ฐ•๋„ ์•ฝํ•จ), ์‚ฌ์ „์— ์šฐ๋ฆฌ๊ฐ€ ์ง€์ •ํ•ด์•ผ ๋˜๋Š” ๊ฐ’์ž„ - ์ด๋Ÿฌํ•œ ๊ฐ’์„ hyperparameter๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.
ridge = Ridge() 
ridge.fit(train_scaled, train_target)

print(ridge.score(train_scaled, train_target))
0.9896101671037343
  • ์„ ํ˜•ํšŒ๊ท€ ์—์„œ ๊ฑฐ์ด 99% ์ •ํ™•๋„๊ฐ€ ๋‚˜์˜จ๋ฐ˜๋ฉด, ์ ์ˆ˜๊ฐ€ ์กฐ๊ธˆ ๋‚ฎ์•„์กŒ์Šต๋‹ˆ๋‹ค. ํ•œ๋ฒˆ Test_set ์ ์ˆ˜๋ฅผ ํ™•์ธํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.
print(ridge.score(test_scaled, test_target))
0.9790693977615388
  • ์ „์˜ ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ ์—์„œ ์ ์ˆ˜๊ฐ€ ์Œ์ˆ˜๊ฐ€ ๋‚˜์™”์ง€๋งŒ, ์ง€๊ธˆ์€ ์ •์ƒ์œผ๋กœ ๋Œ์•„์™”์Šต๋‹ˆ๋‹ค. 
  • ๋ฆฟ์ง€(Ridge)๋„ ๊ทธ๋ ‡๊ณ  ๋ผ์˜(Lasso)๋„ ๊ทธ๋ ‡์ง€๋งŒ, ๊ทœ์ œ์˜ ์–‘์„ ์ž„์˜๋กœ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์˜ ๊ฐ์ฒด๋ฅผ ๋งŒ๋“ค๋•Œ, alpha ๋งค๊ฐœ ๋ณ€์ˆ˜๋กœ ๊ทœ์ œ์˜ ๊ฐ•๋„๋ฅผ ์กฐ์ ˆํ•ฉ๋‹ˆ๋‹ค. 
  • ๋งŒ์•ฝ alpha ๊ฐ’์ด ํฌ๋ฉด ๊ทœ์ œ ๊ฐ•๋„๊ฐ€ ์„ธ์ง€๋ฏ€๋กœ ๊ณ„์ˆ˜ ๊ฐ’์„ ๋” ์ค„์ด๊ณ  ๊ณผ์†Œ์ ํ•ฉ ๋˜๋„๋ก ์œ ๋„ํ•ฉ๋‹ˆ๋‹ค.
  • alpha ๊ฐ’์ด ์ž‘์œผ๋ฉด ๊ณ„์ˆ˜๋ฅผ ์ค„์ด๋Š” ์—ญํ• ์ด ์ค„์–ด๋“ค๊ณ  ์„ ํ˜•ํšŒ๊ท€ ๋ชจ๋ธ๊ณผ ์œ ์‚ฌํ•ด์ง€๋ฏ€๋กœ, ๊ณผ๋Œ€์ ํ•ฉ(Overfitting)๋  ๊ฐ€๋Šฅ์„ฑ์ด ํฝ๋‹ˆ๋‹ค. 
  • ๊ทธ๋Ÿฌ๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ์ ์ž˜ํ•œ alpha ๊ฐ’์„ ์ฐพ์„ ์ˆ˜ ์žˆ์„๊นŒ์š”?

์ ์ ˆํ•œ ๊ทœ์ œ ๊ฐ•๋„ ์ฐพ๊ธฐ

์ ์ ˆํ•œ alpha ๊ฐ’์„ ์ฐพ๋Š” ํ•œ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์€, alpha ๊ฐ’์— ๋Œ€ํ•œ R^2์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋งŒ๋“ค์–ด๋ณด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • Matplotlib ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ import ํ•˜๊ณ , alpha ๊ฐ’์„ ๋ด๊ฟ€๋•Œ ๋งˆ๋‹ค, score() Method์˜ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•  list๋ฅผ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
import matplotlib.pyplot as plt
train_score = []
test_score = []
  • alpha ๊ฐ’์„ 0.001 ๋ถ€ํ„ฐ 100๊นŒ์ง€ 10๋ฐฐ์”ฉ ๋Š˜๋ ค๊ฐ€๋ฉฐ ๋ฆฟ์ง€ ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜๊ณ , Training_set & Test_set ์ ์ˆ˜๋ฅผ list์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  train_score ์™€ test_score ๋ฆฌ์ŠคํŠธ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค.
  • ์ด ๊ทธ๋ž˜ํ”„๋„ x์ถ•์€ log scale๋กœ ๋ด๊ฟ”์„œ ๊ทธ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค.
alpha_list = [0.001, 0.01, 0.1, 1, 10, 100] #๋ณดํ†ต์€ 7์˜ ๋ฐฐ์ˆ˜๋กœ hyperparameter ๋ฒ”์œ„ ์ง€์ • ๋ฐ ํ›ˆ๋ จ
for alpha in alpha_list:
    # loop over the values in alpha_list, training a model for each one
    ridge = Ridge(alpha=alpha)
    # train the ridge model
    ridge.fit(train_scaled, train_target)
    # store the training and test scores
    train_score.append(ridge.score(train_scaled, train_target))
    test_score.append(ridge.score(test_scaled, test_target))
plt.plot(np.log10(alpha_list), train_score) # log10 - convert to a log scale
plt.plot(np.log10(alpha_list), test_score)
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.show()

# alpha ๊ฐ’์ด ์Ž„์ง€๋ฉด ๊ทœ์ œ๊ฐ€ ์Ž„์ ธ์„œ, training_set score๊ฐ€ ๋‚ฎ์•„์ง, alpha๊ฐ’์ด ์•ฝํ•ด์ง€๋ฉด ๋ฐ˜๋Œ€
# ์™ผ์ชฝ์€ ๊ณผ๋Œ€์ ํ•ฉ, ์˜ค๋ฅธ์ชฝ์€ ๊ณผ์†Œ์ ํ•ฉ
  • ๊ทธ๋ž˜ํ”„์˜ alpha ๊ฐ’์„ 0.001๋ถœ 100๊นŒ์ง€ 10๋ฐฐ์”ฉ ๋Š˜๋ ธ๊ธฐ ๋•Œ๋ฌธ์—, ๊ทธ๋ž˜ํ”„๋ฅผ ๋ฐ”๋กœ ๊ทธ๋ ค๋ฒ„๋ฆฌ๋ฉด ๊ทธ๋ž˜ํ”„์˜ ์™ผ์ชฝ์ด ๋„ˆ๋ฌด ์ด˜์ด˜ํ•ด ์ง€๋ฏ€๋กœ alpha_list์— ์žˆ๋Š” 6๊ฐœ์˜ ๊ฐ’์„ ๋™์ผํ•œ ๊ฐ„๊ฒฉ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๊ธฐ ๋•Œ๋ฌธ์—, log ํ•จ์ˆ˜๋ฅผ ์ง€์ˆ˜๋กœ ํ‘œํ˜„ํ•ด์„œ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ค๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • 0.001์€ -3, 0.01์€ -2, 100์€ 2. ์ด๋ ‡๊ฒŒ ๋˜๋Š” ํ˜•์‹์ž…๋‹ˆ๋‹ค.

  • ์œ„์˜ ํŒŒ๋ž€์ƒ‰ ๊ทธ๋ž˜ํ”„๊ฐ€ Training_set ๊ทธ๋ž˜ํ”„, ์•„๋ž˜ ๋…ธ๋ž€์ƒ‰ ๊ทธ๋ž˜ํ”„๊ฐ€ Test_set ๊ทธ๋ž˜ํ”„ ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜ํ”„์˜ ์™ผ์ชฝ์„ ๋ณด๋ฉด Training_set, Test_set์˜ ์ ์ˆ˜ ์ฐจ์ด๊ฐ€ ํฝ๋‹ˆ๋‹ค. ์ด ๋ชจ์Šต์€ ๊ณผ๋Œ€์ ํ•ฉ์˜ ์ „ํ˜•์ ์ธ ๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค.
  • ์˜ค๋ฅธ์ชฝ์€ ๋‘˜๋‹ค ์ ์ˆ˜๊ฐ€ ๋‚ฎ์•„์ง€๋Š” ๊ณผ์†Œ์ ํ•ฉ์˜ ๋ชจ์Šต์„ ๋ณด์ž…๋‹ˆ๋‹ค. 
  • ๋‘ ๊ทธ๋ž˜ํ”„๊ฐ€ ๊ฐ€์žฅ ๊ฐ€๊น๊ณ , ํ…Œ์ŠคํŠธ ์ ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋†’์€ -1, ์ฆ‰ 10์˜ -1์Šน. 0.1 ์ž…๋‹ˆ๋‹ค. alpha ๊ฐ’์„ 0.1๋กœ ํ•ด์„œ ์ตœ์ข… ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œ์ผœ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
ridge = Ridge(alpha=0.1)
ridge.fit(train_scaled, train_target)

print(ridge.score(train_scaled, train_target))
print(ridge.score(test_scaled, test_target))
0.9903815817570366
0.9827976465386955
  • ์ด ๋ชจ๋ธ์€ Training_set ์ ์ˆ˜์™€ Test_set ์ ์ˆ˜๊ฐ€ ๋น„์Šทํ•˜๊ฒŒ ๋†’๊ณ  ๊ณผ๋Œ€์ ํ•ฉ, ๊ณผ์†Œ์ ํ•ฉ ์‚ฌ์ด์—์„œ ๊ท ํ˜•์„ ๋งž์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด๋ฒˆ์—” ๋ผ์˜(Lasso) ๋ชจ๋ธ์„ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

๋ผ์˜(Lasso) ํšŒ๊ท€

๋ผ์˜(Lasso) ํšŒ๊ท€๋Š” ํ›ˆ๋ จํ•˜๋Š”๊ฒƒ์€ ๋ฆฟ์ง€(Ridge)์™€ ๋ฐฉ์‹์ด ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค. Ridge๋ฅผ Lasso๋กœ ๋ด๊พธ๋Š”๊ฒƒ์ด ๋‹ค์ž…๋‹ˆ๋‹ค.
๋‹ค๋งŒ, ๋ผ์˜ ํšŒ๊ท€์˜ ํŠน์ง•์€ ๊ฐ€์ค‘์ฐจ์˜ ์ ˆ๋Œ€๊ฐ’์— ์ œ๊ณฑ์„ ์ฃผ์–ด์„œ ๊ทœ์ œ๋ฅผ ์ฃผ๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.
from sklearn.linear_model import Lasso

lasso = Lasso()
lasso.fit(train_scaled, train_target)
print(lasso.score(train_scaled, train_target))
print(lasso.score(test_scaled, test_target))
0.989789897208096
0.9800593698421884
  • Train, Test_set์˜ ์ ์ˆ˜๋„ ๋ฆฟ์ง€ ํšŒ๊ท€๋งŒํผ ์ข‹์Šต๋‹ˆ๋‹ค. ๋ผ์˜๋„ ๋™์ผํ•˜๊ฒŒ alpha ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๊ทœ์ œ์˜ ๊ฐ•๋„๋ฅผ ์กฐ์ ˆํ• ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
import matplotlib.pyplot as plt
train_score = []
test_score = []

alpha_list = [0.001, 0.01, 0.1, 1, 10, 100] #๋ณดํ†ต์€ 7์˜ ๋ฐฐ์ˆ˜๋กœ hyperparameter ๋ฒ”์œ„ ์ง€์ • ๋ฐ ํ›ˆ๋ จ
for alpha in alpha_list:
    # loop over the values in alpha_list, training a model for each one
    lasso = Lasso(alpha=alpha, max_iter=10000)
    # train the lasso model
    lasso.fit(train_scaled, train_target)
    # store the training and test scores
    train_score.append(lasso.score(train_scaled, train_target))
    test_score.append(lasso.score(test_scaled, test_target))
plt.plot(np.log10(alpha_list), train_score) # log10 - convert to a log scale
plt.plot(np.log10(alpha_list), test_score)
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.show()

  • ์ด ๊ทธ๋ž˜ํ”„์˜ ํŠน์ง•๋„ ์™ผ์ชฝ์€ ๊ณผ๋Œ€์ ํ•ฉ, ์˜ค๋ฅธ์ชฝ์€ ๊ณผ์†Œ์ ํ•ฉ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. 
  • ๊ทธ๋ฆฌ๊ณ , ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๊ฐˆ์ˆ˜๋ก ํ›ˆ๋ จ ์„ธํŠธ, ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ์ ์ˆ˜๊ฐ€ ์ขํ˜€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์ง€์ ์ด ์•„๋งˆ ๊ณผ์†Œ์ ํ•ฉ๋˜๋Š” ๋ชจ๋ธ์ธ๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  • ๋ผ์˜ ๋ชจ๋ธ์—์„œ ์ตœ์ ์˜ alpha ๊ฐ’์€ 1, ์ฆ‰ 10์˜ 1์Šน=10์ž…๋‹ˆ๋‹ค. ์ด ๊ฐ’์œผ๋กœ ๋‹ค์‹œ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
lasso = Lasso(alpha=10)
lasso.fit(train_scaled, train_target)

print(lasso.score(train_scaled, train_target))
print(lasso.score(test_scaled, test_target))
0.9888067471131867
0.9824470598706695
  • ๋ผ์˜ ๋ชจ๋ธ๋„ ๊ณผ๋Œ€์ ํ•ฉ์„ ์ž˜ ์–ต์ œํ•˜๊ณ  ํ…Œ์ŠคํŠธ ์„ธํŠธ์˜ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ๋†’์ธ๊ฒƒ์„ ์•Œ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทผ๋ฐ, ๋ผ์˜ ๋ชจ๋ธ์˜ ๊ณ„์ˆ˜๊ฐ’์„ ์•„์— 0์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋Š”๊ฒƒ์„ ์•Œ๊ณ  ๊ณ„์‹ ๊ฐ€์š”? ๋ผ์˜ ๋ชจ๋ธ์˜ ๊ณ„์ˆ˜๋Š” coef_ ์†์„ฑ์— ์ €์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ํ•œ๋ฒˆ ๋งŒ๋“ค์–ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
print(np.sum(lasso.coef_ == 0))
40
  • ์ด๊ฒƒ์„ ๋ณด๋ฉด์„œ ์•Œ ์ˆ˜ ์žˆ๋Š”๊ฒƒ์€, 55๊ฐœ์˜ ํŠน์„ฑ์„ ๋ชจ๋ธ์— ์ฃผ์ž…ํ–ˆ์ง€๋งŒ ๋ผ์˜ ๋ชจ๋ธ์ด ์‚ฌ์šฉํ•œ ํŠน์„ฑ์€ 15๊ฐœ ๋ฟ์ด๋ผ๋Š”๊ฒƒ์„ ์•Œ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ํŠน์ง• ๋•Œ๋ฌธ์— ๋ผ์˜(Lasso) ๋ชจ๋ธ์„ ์œ ์šฉํ•œ ํŠน์„ฑ์„ ๊ณจ๋ผ๋‚ด๋Š” ์šฉ๋„๋กœ๋„ ์‚ฌ์šฉ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

Keywords

  • ๋‹ค์ค‘ ํšŒ๊ท€(Multiple Regression)๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํŠน์„ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ํšŒ๊ท€ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ํŠน์„ฑ์ด ๋งŽ์œผ๋ฉด ์„ ํ˜• ๋ชจ๋ธ์€ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•ฉ๋‹ˆ๋‹ค.
  • ํŠน์„ฑ ๊ณตํ•™ ์€ ์ฃผ์–ด์ง„ ํŠน์„ฑ์„ ์กฐํ•ฉํ•˜์—ฌ ์ƒˆ๋กœ์šดํŠน์„ฑ์„ ๋งŒ๋“œ๋Š” ์ผ๋ จ์˜ ์ž‘์—… ๊ณผ์ •์ž…๋‹ˆ๋‹ค.
  • ๋ฆฟ์ง€(Ridge)๋Š” ๊ทœ์ œ๊ฐ€ ์žˆ๋Š” ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜์ด๋ฉฐ ์„ ํ˜• ๋ชจ๋Œˆ์˜ ๊ณ„์ˆ˜๋ฅผ ์ž‘๊ฒŒ ๋งŒ๋“ค์–ด ๊ณผ๋Œ€์ ํ•ฉ์„ ์™„ํ™”์‹œ๊ฒ๋‹ˆ๋‹ค. ๋ฆฟ์ง€๋Š”๋น„๊ต์  ํšจ๊ณผ๊ฐ€์ข‹์•„๋„๋ฆฌ ์‚ฌ์šฉํ•˜๋Š”๊ทœ์ œ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ๋ผ์˜(Lasso)๋Š” ๋˜ ๋‹ค๋ฅธ ๊ทœ์ œ๊ฐ€ ์žˆ๋Š” ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ๋ฆฟ์ง€์™€ ๋‹ฌ๋ฆฌ ๊ณ„์ˆ˜ ๊ฐ’์„ ์•„์˜ˆ 0์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ๋„์žˆ์Šต๋‹ˆ๋‹ค.
  • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ(Hyper-Parameter)๋Š” ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํ•™์Šตํ•˜์ง€ ์•Š๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ์‚ฌ๋žŒ ์ด ์‚ฌ์ „์— ์ง€์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋Œ€ํ‘œ์ ์œผ๋กœ ๋ฆฟ์ง€์™€ ๋ผ์˜์˜ ๊ทœ์ œ ๊ฐ•๋„ alpha ํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํŒจํ‚ค์ง€์™€ ํ•จ์ˆ˜

pandas

  • read_csv()๋Š” csv ํŒŒ์ผ์„ ๋กœ์ปฌ ์ปดํ“จํ„ฐ๋‚˜ ์ธํ„ฐ๋„ท์—์„œ ์ฝ์–ด ํŒ๋‹ค์Šค ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์œผ๋กœ ๋ณ€ํ™˜ํ•˜ ๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ๋งค์šฐ ๋งŽ์€ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ทธ์ค‘์— ์ง€์ฃผ ์‚ฌ์šฉํ•˜๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  • sep๋Š” csv ํŒŒ์ผ์˜ ๊ตฌ๋ถ„์ž๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ โ€˜์ฝค๋งˆ(,)โ€™์ž…๋‹ˆ๋‹ค.
  • header์— ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ์—ด ์ด๋ฆ„์œผ๋กœ ์‚ฌ์šฉํ•  csv ํŒŒ์ผ์˜ ํ–‰ ๋ฒˆํ˜ธ๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ์ ์œผ๋กœ ์ฒซ ๋ฒˆ์งธ ํ–‰์„ ์—ด ์ด๋ฆ„์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • skiprows๋Š” ํŒŒ์ผ์—์„œ ์ฝ๊ธฐ ์ „์— ๊ฑด๋„ˆ๋  ํ–‰์˜ ๊ฐœ์ˆ˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.
  • nrows๋Š” ํŒŒ์ผ์—์„œ ์ฝ์„ ํ–‰์˜ ๊ฐœ์ˆ˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค.

scikit-learn

  • PolynomialFeatures๋Š” ์ฃผ์–ด์ง„ ํŠน์„ฑ์„ ์กฐํ•ฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค. degree๋Š” ์ตœ๊ณ  ์ฐจ์ˆ˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ 2์ž…๋‹ˆ๋‹ค.
  • interaction_only๊ฐ€ True์ด๋ฉด ๊ฑฐ๋“ญ์ œ๊ณฑ ํ•ญ์€ ์ œ์™ธ๋˜๊ณ  ํŠน์„ฑ ๊ฐ„์˜ ๊ณฑ์…ˆ ํ•ญ๋งŒ ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ False์ž…๋‹ˆ๋‹ค.
  • include_bias๊ฐ€ False์ด๋ฉด ์ ˆํŽธ์„ ์œ„ํ•œ ํŠน์„ฑ์„ ์ถ”๊ฐ€ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ True์ž…๋‹ˆ๋‹ค.
  • Ridge๋Š” ๊ทœ์ œ๊ฐ€ ์žˆ๋Š” ํšŒ๊ท€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ ๋ฆฟ์ง€ ํšŒ๊ท€ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค.
  • alpha ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ๊ทœ์ œ์˜ ๊ฐ•๋„๋ฅผ์กฐ์ ˆํ•ฉ๋‹ˆ๋‹ค. alpha ๊ฐ’์ด ํด์ˆ˜๋ก ๊ทœ์ œ๊ฐ€ ์„ธ์ง‘๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ 1์ž…๋‹ˆ๋‹ค.
  • solver ๋งค๊ฐœ๋ณ€์ˆ˜์— ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์„ ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ โ€˜autoโ€™์ด๋ฉฐ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ์ž๋™์œผ๋กœ ์„ ํƒ๋ฉ๋‹ˆ๋‹ค.
    • scikit-learn 0.17 ๋ฒ„์ „์— ์ถ”๊ธฐ๋œ โ€˜sagโ€™๋Š” ํš๋ฅ ์  ํ‰๊ท  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ํŠน์„ฑ๊ณผ ์ƒ˜ํ”Œ ์ˆ˜๊ฐ€ ๋งŽ์„ ๋•Œ์— ์„ฑ๋Šฅ์ด ๋น ๋ฅด๊ณ  ์ข‹์Šต๋‹ˆ๋‹ค.
    • scikit-learn 0.19 ๋ฒ„์ „์—๋Š” โ€˜sagโ€™์˜ ๊ฐœ์„  ๋ฒ„์ „์ธ โ€˜sagaโ€™๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • random_state๋Š” solver๊ฐ€ โ€˜sagโ€™๋‚˜ โ€˜sagaโ€™์ผ ๋•Œ ๋…ํŒŒ์ด ๋‚œ์ˆ˜ ์‹œ๋“œ๊ฐ’์„ ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Lasso๋Š” ๊ทœ์ œ๊ฐ€ ์žˆ๋Š” ํšŒ๊ท€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ธ ๋ผ์˜ ํšŒ๊ท€ ๋ชจ๋Œˆ์„ ํ›ˆ๋ จํ•ฉ๋‹ˆ๋‹ค. ์ด ํด๋ž˜์Šค๋Š” ์ตœ์ ์˜ ๋ชจ๋ธ์„ ์ฐพ๊ธฐ ์œ„ํ•ด ์ขŒํ‘œ์ถ•์„ ๋”ฐ๋ผ ์ตœ์ ํšŒ๋ฅผ ์ˆ˜ํ–‰ํ•ด๊ฐ€๋Š” ์ขŒํ‘œ ํ•˜๊ฐ•๋ฒ• coordinate descent์„ ์‹œ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • alpha์™€ random_state ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” Ridge ํด๋ž˜์Šค์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.
  • max_iter๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ˆ˜ํ–‰ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ๊ฐ’์€ 1000์ž…๋‹ˆ๋‹ค.