[ML] Random Forest
In this post, we take a look at the Random Forest technique.

 

๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ(Random Forest)๋Š” ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ์•™์ƒ๋ธ” ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋กœ, ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๊ณ  ๊ทธ ์˜ˆ์ธก์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋”์šฑ ๊ฐ•๋ ฅํ•˜๊ณ  ์•ˆ์ •์ ์ธ ๋ชจ๋ธ์„ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ํŠนํžˆ ๋ถ„๋ฅ˜์™€ ํšŒ๊ท€ ๋ฌธ์ œ์— ํšจ๊ณผ์ ์ด๋ฉฐ, ๊ฐœ๋ณ„ ๊ฒฐ์ • ํŠธ๋ฆฌ์˜ ๊ณผ์ ํ•ฉ ๋ฌธ์ œ๋ฅผ ๊ทน๋ณตํ•˜๊ณ , ์ „์ฒด์ ์ธ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐ ๋„์›€์„ ์ค๋‹ˆ๋‹ค.

 

Key Characteristics of Random Forest

Diversity

  • In a random forest, each decision tree is trained on a different subset of the data and of the features. Because each tree independently learns different patterns, the model as a whole becomes more diverse.
  • This approach reduces the correlation between trees, so each tree contributes its own perspective to the overall model.

Ensemble Method

  • Random forest is a type of ensemble learning: it combines many decision trees into a single model.
  • The final decision is made by averaging the trees' predictions or by taking a majority vote. This lets the errors of individual trees cancel out, which improves the accuracy and reliability of the overall model.

Prevention of Overfitting

  • A conventional decision tree overfits the training data more easily the deeper it grows. In a random forest, each tree is trained on a bootstrap sample rather than on the exact training set, and the trees' predictions are combined, so the risk of overfitting drops considerably.
  • In addition, randomly selecting features while the trees are grown makes the trees differ from one another, which lowers the variance of the combined model and reduces overfitting.
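
To make the overfitting point concrete, here is a minimal sketch that compares one fully grown decision tree with a forest on the breast cancer dataset. The split and the `random_state` values are assumptions chosen for repeatability; the exact scores depend on your scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A single tree with no depth limit can keep splitting until it fits the training set.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A forest of 100 trees, each trained on its own bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_train_acc = forest.score(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)

print(f"single tree: train={tree_train_acc:.3f}  test={tree_test_acc:.3f}")
print(f"forest     : train={forest_train_acc:.3f}  test={forest_test_acc:.3f}")
```

A gap between the single tree's training and test accuracy is the usual sign of overfitting; the forest's test score is the number to compare against it.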

Basic Principles of Random Forest

So what principles does Random Forest rest on?

 

Bagging (Bootstrap Aggregating)

  • Each tree is built from samples drawn at random from the original dataset with replacement. A sample produced this way is called a 'bootstrap sample', and each tree is trained on a different one.
  • As a result, each tree learns somewhat different patterns in the data, which increases the model's diversity and reduces overfitting.
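
The sampling step can be sketched with the standard library alone. The helper name `bootstrap_sample`, the seed, and the 1,000-row stand-in dataset are assumptions for illustration, not part of any library.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(42)    # fixed seed, only so the sketch is repeatable
rows = list(range(1000))   # stand-in for the indices of 1,000 training rows

sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)
left_out = set(rows) - set(sample)

print(f"rows in the sample (with repeats): {len(sample)}")
print(f"distinct rows in the sample:       {unique_fraction:.1%}")
print(f"rows this tree never sees:         {len(left_out)}")
```

Because rows are drawn with replacement, a bootstrap sample of size n contains on average about 63% of the distinct original rows (1 - 1/e for large n); the rest are duplicates, which is why different trees end up seeing different data.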

๋žœ๋ค ํŠน์„ฑ ์„ ํƒ

  • ๊ฒฐ์ • ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•  ๋•Œ ๋ชจ๋“  ํŠน์„ฑ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ , ๋ฌด์ž‘์œ„๋กœ ์„ ํƒ๋œ ๋ถ€๋ถ„์ง‘ํ•ฉ์˜ ํŠน์„ฑ๋งŒ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ„ํ• ์„ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋Š” ํŠธ๋ฆฌ ๊ฐ„์˜ ์ƒ๊ด€์„ฑ์„ ์ค„์ด๋ฉฐ, ๊ฐ ํŠธ๋ฆฌ๊ฐ€ ๋ฐ์ดํ„ฐ์˜ ๋‹ค๋ฅธ ์ธก๋ฉด์„ ํ•™์Šตํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.
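
As a rough sketch of this idea, the snippet below draws a fresh set of candidate features for each split. The helper `candidate_features` is hypothetical, and the square-root rule for the subset size is an assumed, commonly used choice; in scikit-learn the subset size is controlled by the `max_features` parameter.

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick the feature indices that one split is allowed to consider."""
    k = max(1, int(math.sqrt(n_features)))  # square-root rule (an assumption)
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
n_features = 30  # the breast cancer dataset used in the example code has 30 features

# Every split draws its own subset, so different splits see different candidates.
for split in range(3):
    print(f"split {split}: candidates = {candidate_features(n_features, rng)}")
```

The best split is then searched only among those candidates, which keeps a single dominant feature from appearing at the top of every tree.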

Training the Trees

  • Each tree is trained independently, and its splits are chosen using a criterion such as information gain or Gini impurity.
  • More trees generally improve the accuracy of the overall model, although the gains level off, while computation cost and memory use keep growing.
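
The two split criteria mentioned above can be written out in a few lines. This is a minimal sketch using the textbook definitions; the function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]  # a split that separates the classes perfectly

print("Gini of parent:     ", gini(parent))
print("Gini of left child: ", gini(left))
print("Information gain:   ", information_gain(parent, left, right))
```

A split is better the more it lowers impurity, so a tree picks, among its candidate features and thresholds, the one with the lowest weighted child impurity (or, equivalently, the highest gain).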

Prediction by Combining Trees

  • For classification, the predictions (class labels) of all the trees are collected and the final prediction is decided by majority vote.
  • For regression, the final prediction is the average of all the trees' predicted values.
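
Both combination rules fit in a short sketch. The per-tree outputs below are made-up numbers, and the helper names are assumptions for illustration.

```python
from collections import Counter
from statistics import mean

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Return the arithmetic mean of the per-tree predictions."""
    return mean(values)

tree_labels = [1, 0, 1, 1, 0]            # hypothetical class votes from 5 trees
tree_values = [2.0, 3.0, 2.5, 3.5, 4.0]  # hypothetical regression outputs from 5 trees

print("classification:", majority_vote(tree_labels))
print("regression:    ", average_prediction(tree_values))
```

Implementations can differ in the details: scikit-learn's `RandomForestClassifier`, for instance, averages the trees' predicted class probabilities rather than counting hard votes.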

Advantages and Disadvantages of Random Forest

Advantages

  1. High predictive performance: combining the results of many decision trees generally gives higher accuracy and stability than a single decision tree.
  2. Overfitting prevention: because each tree is trained on a different subset of the data, the overfitting that can occur in an individual tree is greatly reduced.
  3. Feature importance: a random forest can measure the importance of each feature, so you can see which variables influence the prediction most.
  4. Built-in feature selection: each split picks the most useful feature among its candidates, so uninformative features tend to be used less, which simplifies manual feature selection.
  5. Handles varied data types: the method can work with both numerical and categorical data and needs relatively little preprocessing compared with many other machine learning algorithms (no feature scaling, for example). Note that some implementations, including scikit-learn's, expect categorical features to be encoded as numbers first.
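
As an illustration of point 3, scikit-learn exposes impurity-based importances through the fitted model's `feature_importances_` attribute. The sketch below fits on the full breast cancer dataset purely for demonstration; that choice and the `random_state` are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(cancer.data, cancer.target)

# One impurity-based score per feature; the scores are normalized to sum to 1.
ranked = sorted(
    zip(cancer.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

# Show the five features the forest relied on most.
for name, score in ranked[:5]:
    print(f"{name:<25} {score:.3f}")
```

These scores describe how much each feature reduced impurity inside this particular forest, so they are best read as a rough ranking rather than as a causal statement.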

Disadvantages

  1. Hard to interpret: an individual decision tree is fairly easy to interpret, but interpreting an entire random forest made of hundreds or thousands of trees can be very complex.
  2. Computation cost: many trees have to be trained, so training can take a long time and demand a lot of compute, especially on large datasets.
  3. Memory use: a large random forest can consume a considerable amount of memory, which can be a problem on resource-constrained systems.
  4. Real-time prediction: training happens only once, but every prediction has to pass through all the trees, so the model can be relatively slow in real-time response scenarios.
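
One way to get a feel for points 2 to 4 is to count how many tree nodes a fitted forest has to keep in memory. This is a rough sketch: the helper `total_nodes` is hypothetical, and the tree counts of 10 and 100 are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

def total_nodes(n_trees):
    """Fit a forest with n_trees trees and count the nodes across all of them."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42).fit(X, y)
    return sum(est.tree_.node_count for est in forest.estimators_)

node_counts = {n: total_nodes(n) for n in (10, 100)}
for n, nodes in node_counts.items():
    print(f"{n:>3} trees -> {nodes} nodes stored")
```

Stored nodes grow roughly in proportion to the number of trees, and each prediction follows one root-to-leaf path in every tree, so both memory use and prediction time scale with `n_estimators`.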

Random Forest Example Code

# ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ์˜ˆ์ œ

# ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž„ํฌํŠธ
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

 

 

# Load the breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# ๋ฐ์ดํ„ฐ์…‹์„ ํ•™์Šต ์„ธํŠธ์™€ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ๋ถ„ํ• 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ๋ชจ๋ธ ํ•™์Šต
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114
# ํ˜ผ๋™ ํ–‰๋ ฌ ์‹œ๊ฐํ™”
ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test)
plt.title("Random Forest Confusion Matrix")
plt.show()