[DL] Gradient Vanishing and Exploding

1. ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต ๊ณผ์ •

์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต ๊ณผ์ •์€ ํฌ๊ฒŒ 2๊ฐ€์ง€๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ˆœ์ „ํŒŒ(Forward Pass), ์—ญ์ „ํŒŒ(Backward Pass)๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
๋จผ์ € ์ด ํ•™์Šต ๊ณผ์ •์— ๋ฐํ•˜์—ฌ ์„ค๋ช…์„ ํ•ด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

Forward Pass

The Forward Pass is the process in which the input data passes through each layer of the neural network in order until it reaches the final output.

Forward Pass, source: https://wikidocs.net/150781

  • ์ด ๊ณผ์ •์€ input layer(์ž…๋ ฅ์ธต)์—์„œ output layer(์ถœ๋ ฅ์ธต)๊นŒ์ง€ ์ˆœ์ฐจ์ ์œผ๋กœ ์ด๋ฃจ์–ด์ง€๋ฉฐ, ์ตœ์ข…์ ์œผ๋กœ ์†์‹คํ•จ์ˆ˜ (loss function)์„ ํ†ตํ•ด ์˜ˆ์ธก๊ฐ’๊ณผ ์‹ค์ œ๊ฐ’์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • ์ด ์ฐจ์ด๋ฅผ ์†์‹ค(loss) or ์˜ค์ฐจ(Error)๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์ด ์ฐจ์ด๋Š” ์‹ ๊ฒฝ๋ง์˜ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๋Š” ์ง€ํ‘œ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
์ •๋ฆฌํ•˜๋ฉด, input๊ฐ’์€ input layer(์ž…๋ ฅ์ธต), hidden layer(์€๋‹‰์ธต)์„ ์ง€๋‚˜๋ฉด์„œ ๊ฐ ์ธต์—์„œ์˜ ๊ฐ€์ค‘์น˜์™€ ํ•จ๊ป˜ ๊ณ„์‚ฐ๋˜๋ฉฐ ๋‚˜์ค‘์—๋Š” output layer(์ถœ๋ ฅ์ธต)์œผ๋กœ ๋ชจ๋“  ์—ฐ์‚ฐ์ด ๋งˆ์นœ ์˜ˆ์ธก๊ฐ’์ด ๋‚˜์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. 
  • ์ด๋ ‡๊ฒŒ input layer(์ž…๋ ฅ์ธต)์—์„œ output layer(์ถœ๋ ฅ์ธต) ๋ฐฉํ–ฅ์œผ๋กœ ์˜ˆ์ธก๊ฐ’์˜ ์—ฐ์‚ฐ์ด ์ง„ํ–‰๋˜๋Š” ๊ณผ์ •์„ Forward Pass (์ˆœ์ „ํŒŒ) ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Backward Pass

The Backward Pass is the process of updating the network's weights and biases using the loss computed by the loss function.
  • In this step, the gradient of the loss function is computed, which tells us how sensitive the loss is to each weight.
  • The gradient is propagated back toward the input layer by the backpropagation algorithm, which relies on the chain rule.
  • Each layer's weights are updated based on that layer's input values, the gradient of the loss function with respect to the weights, and the learning rate.
  • If a gradient is positive (+), the corresponding weight is decreased; if it is negative (-), the weight is increased.
  • Adjusting the weights this way makes the loss decrease on the next Forward Pass.
* The Chain Rule in the Backward Pass
  • Backpropagation with the chain rule is one of the core learning algorithms of artificial neural networks. It is a way of updating the weights, usually combined with Gradient Descent, so that the network learns by minimizing the loss function. A small code sketch of this forward/backward loop follows below.

2. An Artificial Neural Network Example (Code & Equations)

ํ•œ๋ฒˆ ์ธ๊ณต์‹ ๊ฒฝ๋ง Model์„ ํ•œ๋ฒˆ ๊ตฌํ˜„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • input Dimension์ด 3, Output Dimension์ด 2์ธ ์ธ๊ณต์‹ ๊ฒฝ๋ง์„ ํ•œ๋ฒˆ ๊ตฌํ˜„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

# One Dense layer: 3 inputs -> 2 outputs, with a softmax activation.
model.add(Dense(2, input_dim=3, activation='softmax'))
  • Here the softmax function turns the output vector into the class probabilities used for classification.
  • With the output dimension, i.e. the dimension of the output vector, set to 2, the model performs binary classification.
    • Note that softmax can also be used for binary classification.
In the Keras library, calling summary() shows the number of all parameters in the model, that is, the total count of weights and biases.
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 2)                 8         
                                                                 
=================================================================
Total params: 8 (32.00 Byte)
Trainable params: 8 (32.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
  • There are 8 parameters: this network has a total of 8 trainable parameters, W and b combined.

์‹ ๊ฒฝ๋ง์„ ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ ๊ด€์ ์—์„œ ๋ณธ ๊ทธ๋ฆผ, ์ถœ์ฒ˜: https://wikidocs.net/150781

  • In this network the input layer has 3 neurons and the output layer has 2 neurons, and each arrow represents a weight w.
  • Between the 3 input neurons (x1, x2, x3) and the 2 output neurons there are 6 arrows in total, which means this network has 6 weights w.

Forward Pass Calculation Example

  • In matrix terms, turning a 3-dimensional vector into a 2-dimensional vector means multiplying by a 3 × 2 matrix.
  • Each element of this matrix is one of the weights w. In the figure above, the arrows w1, w2, w3 connected to y1 are drawn in orange, and the arrows w4, w5, w6 connected to y2 are drawn in green.

 

  • ์ผ๋ฐ˜์ ์œผ๋กœ ๋‰ด๋Ÿฐ๊ณผ ํ™”์‚ดํ‘œ๋กœ ํ‘œํ˜„ํ•˜๋Š” ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ๊ทธ๋ฆผ์—์„œ๋Š” ํŽธํ–ฅ b๋Š” ์ƒ๋žต๋˜์—ˆ์ง€๋งŒ ํŽธํ–ฅ b ์˜ ์—ฐ์‚ฐ ๋˜ํ•œ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.
  • ์•ž์—์„œ ์„ค๋ช…ํ–ˆ์ง€๋งŒ ์ด ๊ทธ๋ฆผ์—์„œ๋Š” ํŽธํ–ฅ์„ ํ‘œํ˜„ํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค ,๊ทธ๋ ‡์ง€๋งŒ ํ–‰๋ ฌ ์—ฐ์‚ฐ์‹์—์„œ๋Š” b1, b2๋ฅผ ํ‘œํ˜„ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
  • ํŽธํ–ฅ b ์˜ ๊ฐœ์ˆ˜๋Š” ํ•ญ์ƒ Output Dimension(์ถœ๋ ฅ ์ฐจ์›)์„ ๊ธฐ์ค€์œผ๋กœ ๊ฐœ์ˆ˜๋ฅผ ํ™•์ธํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.
  • ์ด ์ธ๊ณต ์‹ ๊ฒฝ๋ง์˜ ๊ฒฝ์šฐ์—๋Š” Output Dimension(์ถœ๋ ฅ ์ฐจ์›)์ด 2์ด๋ฏ€๋กœ ํŽธํ–ฅ ๋˜ํ•œ b1, b2๋กœ ๋‘ ๊ฐœ์ž…๋‹ˆ๋‹ค.

 

  • ๊ฐ€์ค‘์น˜ w์˜ ๊ฐœ์ˆ˜๊ฐ€ w1 ~ w6 ๋กœ ์ด 6๊ฐœ์ด๋ฉฐ ํŽธํ–ฅ b์˜ ๊ฐœ์ˆ˜๊ฐ€ b1, b2 ๋กœ ๋‘ ๊ฐœ์ด๋ฏ€๋กœ ์ด ํ•™์Šต๊ฐ€๋Šฅํ•œ parameter(๋งค๊ฐœ๋ณ€์ˆ˜)์˜ ์ˆ˜๋Š” 8๊ฐœ์ž…๋‹ˆ๋‹ค. model.summary()๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ์ˆ˜ 8๊ฐœ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๋‰ด๋Ÿฐ y1, y2๋ฅผ ๊ตฌํ•˜๋Š” ๊ณผ์ •์„ ์ˆ˜์‹์œผ๋กœ ํ‘œํ˜„ํ•œ๋‹ค๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

The equations for computing y1 and y2, source: https://wikidocs.net/150781

  • ์ด๋ฒˆ์—๋Š” ์ž…๋ ฅ x1, x2, x3 ์„ Input Vector(์ž…๋ ฅ ๋ฒกํ„ฐ) X๋กœ ํ•˜๊ณ  ๊ณ„์‚ฐํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 

The inputs x1, x2, x3 written as the vector X, source: https://wikidocs.net/150781

 

  • The 3 × 2 matrix whose elements are w1 to w6 is called the weight matrix W.
  • Let B be the vector whose elements are the biases b1 and b2, and let Y be the output vector whose elements are y1 and y2.
  • With this notation, the network can be written as the expression below.

Source: https://wikidocs.net/150781

๋‹ค์‹œ ๋งํ•ด ์ˆ˜์‹์€ Y(๋‰ด๋Ÿฐ) = X(์ž…๋ ฅ๋ฒกํ„ฐ) * W(๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ) + B(ํŽธํ–ฅ) ์ž…๋‹ˆ๋‹ค.

Backward Pass Example

ํ•œ๋ฒˆ ์—ญ์ „ํŒŒ์˜ ์ง„ํ–‰๊ณผ์ •์„ ์„ค๋ช…ํ•ด ๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

์—ญ์ „ํŒŒ๋ฅผ ์‹œ๊ฐํ™” ํ•˜๋Š” ๊ทธ๋ฆผ ์˜ˆ์‹œ. ์ถœ์ฒ˜: ์ถœ์ฒ˜: https://wikidocs.net/150781

  1. ์‹œ์ž‘์€ ์ถœ๋ ฅ์ธต(Output Layer)์—์„œ, ๋‰ด๋Ÿฐ(Neuron)๋‘ ๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  2. ๊ฐ Output Neuron(์ถœ๋ ฅ ๋‰ด๋Ÿฐ)์€ ์†์‹ค ํ•จ์ˆ˜(Loss Function)๋ฅผ ํ†ตํ•ด ๊ณ„์‚ฐ๋œ ์˜ค์ฐจ์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์˜ค์ฐจ ์‹ ํ˜ธ๋ฅผ ๋ฐ›์Šต๋‹ˆ๋‹ค.
  3. ์ด ์˜ค์ฐจ ์‹ ํ˜ธ๋Š” ๊ฐ€์ค‘์น˜(Weight)๋ฅผ ํ†ตํ•ด ์ž…๋ ฅ์ธต(Input Layer)์œผ๋กœ ์ „ํŒŒ๋˜๋ฉฐ, ์ž…๋ ฅ์ธต(Input layer)๊ณผ ์ถœ๋ ฅ์ธต(Output layer) ์‚ฌ์ด์˜ Weight(๊ฐ€์ค‘์น˜)์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  4. ์ž…๋ ฅ์ธต(Input Layer) ์—๋Š” ์„ธ ๊ฐœ์˜ ๋‰ด๋Ÿฐ(Neuron)์ด ์žˆ์œผ๋ฉฐ, ๊ฐ ๋‰ด๋Ÿฐ(Neuron)์€ ์ถœ๋ ฅ์ธต(Output layer) ์œผ๋กœ๋ถ€ํ„ฐ ์ „ํŒŒ๋œGradient(๊ธฐ์šธ๊ธฐ)์— ๋”ฐ๋ผ ์ž์‹ ์˜ ๊ฐ€์ค‘์น˜(Weight)๋ฅผ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
  • Backward pass(์—ญ์ „ํŒŒ) ๊ณผ์ •์—์„œ๋Š” ์ถœ๋ ฅ์ธต(Output layer)์˜ ์˜ค์ฐจ๋กœ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด์„œ ์—ฐ์‡„ ๋ฒ•์น™(Chain Rule)์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ layer(์ธต)์˜ Weight(๊ฐ€์ค‘์น˜)์— ๋Œ€ํ•œ Loss function(์†์‹ค ํ•จ์ˆ˜)์˜ ํŽธ๋ฏธ๋ถ„์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ฅผ ํ†ตํ•ด Weight(๊ฐ€์ค‘์น˜)์˜ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ๊ณ„์‚ฐ๋˜๊ณ , ์ด Gradient(๊ธฐ์šธ๊ธฐ)๋Š” Optimizer(ex: SGD, Adam ๋“ฑ)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌWeight(๊ฐ€์ค‘์น˜)๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
* Optimizer:
Optimizer(์˜ตํ‹ฐ๋งˆ์ด์ €)๋Š” ์‹ ๊ฒฝ๋ง์„ ํ›ˆ๋ จํ•  ๋•Œ ์‚ฌ์šฉ๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ, ์ •์˜๋œ loss function(์†์‹ค ํ•จ์ˆ˜)์˜ ๊ฐ’์„ ์ตœ์†Œํ™”ํ•˜๊ฑฐ๋‚˜ ์ตœ์ ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋ธ์˜ Weight(๊ฐ€์ค‘์น˜)์™€ bias(ํŽธํ–ฅ)์„ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

Forward Pass / Backward Pass Summary

  • The Forward Pass is the flow of data from input to output, computing the final output and the loss value.
  • The Backward Pass is the flow of the error signal from output back to input, adjusting the weights based on the loss.
The network performs these steps repeatedly, and one full pass of this cycle over the training data is called an epoch.
As the epochs go by, the loss keeps decreasing and the network finds, on its own, the weights and biases that model the data well.
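Putting the pieces together, here is a minimal sketch (the toy dataset is made up for illustration) of the full loop in Keras: compile() picks the optimizer and loss function, and each epoch in fit() runs a Forward Pass, a Backward Pass, and a weight update over the whole dataset.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Toy dataset: 100 random 3-dimensional inputs with binary labels (illustrative only).
X = np.random.rand(100, 3)
y = (X.sum(axis=1) > 1.5).astype(int)

model = Sequential()
model.add(Dense(2, input_dim=3, activation='softmax'))

# The optimizer (here SGD) decides how the weights and biases are updated
# from the gradients computed during the Backward Pass.
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Each epoch = one full Forward Pass + Backward Pass over the training data.
model.fit(X, y, epochs=10, batch_size=16, verbose=0)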

3. Gradient Exploding

Gradient Exploding(๊ธฐ์šธ๊ธฐ ํญํŒ”) ๋ฌธ์ œ๋Š” ์‹ ๊ฒฝ๋ง์˜ Backward Pass(์—ญ์ „ํŒŒ) ๊ณผ์ • ์ค‘์— Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ๋„ˆ๋ฌด ์ปค์ ธ์„œ ์ˆ˜์น˜์ ์œผ๋กœ ๋ถˆ์•ˆ์ •ํ•ด์ง€๋Š” ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค.
  • ๊ธด Sequence(์‹œํ€€์Šค)๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ RNN์—์„œ ์ž์ฃผ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.
  • ์ด์œ ๋Š” Time-Step์˜ Weight(๊ฐ€์ค‘์น˜)๊ฐ€ Backward Pass(์—ญ์ „ํŒŒ) ๋  ๋•Œ ์—ฐ์†์ ์œผ๋กœ ๊ณฑํ•ด์ง€๋ฉด์„œ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ์ฆ๊ฐ€ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
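The effect of that repeated multiplication is easy to see numerically; in this sketch the factors 1.5 and 0.5 and the 50 time steps are arbitrary, but they show how a per-step gradient factor slightly larger than 1 blows up while a factor smaller than 1 shrinks toward zero.

# Repeatedly multiplying a per-step gradient factor over 50 time steps.
steps = 50

exploding = 1.0
vanishing = 1.0
for _ in range(steps):
    exploding *= 1.5   # factor > 1: the gradient grows without bound
    vanishing *= 0.5   # factor < 1: the gradient shrinks toward zero

print(exploding)  # about 6.4e+08 -> exploding gradient
print(vanishing)  # about 8.9e-16 -> vanishing gradient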

Gradient Exploding

So what can be done about the Gradient Exploding problem?
  • To explain why it happens, with the figure below in mind: during the Backward Pass an excessively large gradient appears and the parameters overshoot.
    • * Overshooting: here it means exceeding a threshold, essentially the same phenomenon as Gradient Exploding.

Gradient Clipping

  • ๊ธฐ์šธ๊ธฐ ํด๋ฆฌํ•‘(Gradient Clipping): ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ํ•ด๊ฒฐ์ฑ… ์ค‘ ํ•˜๋‚˜๋กœ, Backward Pass(์—ญ์ „ํŒŒ) ๊ณผ์ •์—์„œ Weight(๊ฐ€์ค‘์น˜)์˜ ํฌ๊ธฐ๊ฐ€ ํŠน์ • ์ž„๊ณ„๊ฐ’์„ ์ดˆ๊ณผํ•˜๋Š” ๊ฒฝ์šฐ Weight(๊ฐ€์ค‘์น˜)์˜ ํฌ๊ธฐ๋ฅผ ์ž„๊ณ„๊ฐ’์œผ๋กœ ์ œํ•œํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” Gradient(๊ธฐ์šธ๊ธฐ)์˜ ๋ฐฉํ–ฅ์€ ์œ ์ง€ํ•˜๋ฉด์„œ ํฌ๊ธฐ๋งŒ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
    • ๋งŒ์•ฝ Gradient(๊ธฐ์šธ๊ธฐ)์˜ ํฌ๊ธฐ๊ฐ€ ์ž„๊ณ„๊ฐ’ ์ด์ƒ์œผ๋กœ ์ปค์ง€๋ฉด? ๊ฐ’์ด ์ž„๊ณ„๊ฐ’ ๋ฏธ๋งŒ์ด ๋˜๋กœ๋ก Scaling ํ•ด์ค๋‹ˆ๋‹ค.
  • ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™”(Weight Initialization): Weight(๊ฐ€์ค‘์น˜)๋ฅผ ์ ์ ˆํ•˜๊ฒŒ ์ดˆ๊ธฐํ™”ํ•˜์—ฌ Backward Pass(์—ญ์ „ํŒŒ)์‹œ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ๊ณผ๋„ํ•˜๊ฒŒ ์ฆ๊ฐ€ํ•˜์ง€ ์•Š๋„๋ก ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
    • ๋‹ค๋งŒ, Weight(๊ฐ€์ค‘์น˜)๋ฅผ ์ค„์ด๋ฉด Vanishing Problem(์†์‹ค ๋ฌธ์ œ)๊ฐ€ ๋ฐœ์ƒํ• ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ ์ ˆํžˆ ์กฐ์ •ํ•ด์ค˜์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์ž‘์€ ํ•™์Šต๋ฅ (Learning Rate): ํ•™์Šต๋ฅ ์„ ๋‚ฎ์ถ”์–ด Weight(๊ฐ€์ค‘์น˜) ์—…๋ฐ์ดํŠธ ํฌ๊ธฐ๋ฅผ ์ž‘๊ฒŒ ํ•˜์—ฌ Gradient Exploding(๊ธฐ์šธ๊ธฐ ํญํŒ”)์˜ ์˜ํ–ฅ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ฐฐ์น˜ ์ •๊ทœํ™”(Batch Normalization): ๊ฐ layer(์ธต)์˜ input(์ž…๋ ฅ)์„ Normalization(์ •๊ทœํ™”)ํ•จ์œผ๋กœ์จ Gradient(๊ธฐ์šธ๊ธฐ)์˜ ํฌ๊ธฐ๋ฅผ ์กฐ์ ˆํ•˜๊ณ  ํ•™์Šต ๊ณผ์ •์„ ์•ˆ์ •ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋‹จ์ˆœํ™”๋œ ๋„คํŠธ์›Œํฌ(Simplified Network): ์‹ ๊ฒฝ๋ง์˜ ๋ณต์žก๋„๋ฅผ ์ค„์—ฌ์„œ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ์ฆํญ๋˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

4. Gradient Vanishing

Gradient Vanishing (๊ธฐ์šธ๊ธฐ ์†์‹ค) ๋ฌธ์ œ๋Š” ์‹ ๊ฒฝ๋ง์˜ Backward Pass(์—ญ์ „ํŒŒ) ๊ณผ์ • ์ค‘์— ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

Gradient Vanishing

  • ๋„คํŠธ์›Œํฌ์˜ ๊นŠ์€ ๋ถ€๋ถ„์œผ๋กœ ์‹œ์ž‘ํ•˜์—ฌ ์•ž์ชฝ layer(์ธต)์œผ๋กœ ์ด๋™ํ•˜๋ฉด์„œ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ์ ์  ์ž‘์•„์ง€๋Š” ํ˜„์ƒ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด Input layer(์ž…๋ ฅ์ธต)์— ๊ฐ€๊นŒ์šด Weight(๊ฐ€์ค‘์น˜)๋Š” ๊ฑฐ์ด update๊ฐ€ ์•ˆ๋ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด ํšจ๊ณผ์ ์ธ ํ•™์Šต์ด ์–ด๋ ค์›Œ์งˆ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
Gradient Vanishing (๊ธฐ์šธ๊ธฐ ์†์‹ค)์˜ ์ฃผ์š” ์›์ธ์— ๋Œ€ํ•˜์—ฌ ๋งํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  1. Activation Function(ํ™œ์„ฑํ™” ํ•จ์ˆ˜): ํ™œ์„ฑํ™” ํ•จ์ˆ˜์ธ "Sigmoid" ํ•จ์ˆ˜, "tanh(ํ•˜์ดํผ๋ณผ๋ฆญ ํƒ„์  ํŠธ)" ํ•จ์ˆ˜๋Š” ์ถœ๋ ฅ๊ฐ’์˜ ๋ฒ”์œ„๊ฐ€ ์ œํ•œ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
    • input(์ž…๋ ฅ) ๊ฐ’์ด ์ปค์ง€๊ฑฐ๋‚˜ ์ž‘์•„์ง€๋ฉด ํ•จ์ˆ˜์˜ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ๋งค์šฐ ์ž‘์•„์ง‘๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์ž‘์€ Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ Network๋ฅผ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๋ฉด์„œ ๊ณฑํ•ด์ง€๋ฉด, Gradient(๊ธฐ์šธ๊ธฐ)๋Š” ์ ์  ๋” ์ž‘์•„์ง‘๋‹ˆ๋‹ค.
  2. ์ดˆ๊ธฐ Weight(๊ฐ€์ค‘์น˜) ์„ค์ •: Weight(๊ฐ€์ค‘์น˜)๊ฐ€ ์ž‘๊ฒŒ ์ดˆ๊ธฐํ™” ๋˜๋ฉด, Activation Function(ํ™œ์„ฑํ™” ํ•จ์ˆ˜)์—์„œ์˜ Gradient(๊ธฐ์šธ๊ธฐ)๋„ ์ž‘์•„์ ธ์„œ Gradient Vanishing (๊ธฐ์šธ๊ธฐ ์†์‹ค)์ด ๋ฐœ์ƒํ• ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ๊นŠ์€ ๋„คํŠธ์›Œํฌ์˜ ๊ตฌ์กฐ: ๋งŒ์•ฝ ๋„คํŠธ์›Œํฌ๊ฐ€ ๋งค์šฐ ๊นŠ์„ ๊ฒฝ์šฐ, Gradient(๊ธฐ์šธ๊ธฐ)๋Š” ๋” ๋งŽ์€ layer(์ธต)๋ฅผ ๊ฑฐ์น˜๊ฒŒ ๋˜๊ณ , ๊ฐ layer(์ธต)๋งˆ๋‹ค Gradient(๊ธฐ์šธ๊ธฐ)๊ฐ€ ์ ์  ๋” ์ž‘์•„์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
So what can be done about the Gradient Vanishing problem?
Sometimes the RNN architecture itself is changed, but several other techniques are listed below as well.
  • Use the ReLU activation function: ReLU (Rectified Linear Unit) and its variants do not shrink the gradient on the positive side, which alleviates the vanishing gradient problem.
  • Proper weight initialization: strategies such as He initialization or Xavier (Glorot) initialization set an appropriate weight scale and reduce the Gradient Vanishing problem.
  • Batch Normalization: normalizing the input of each layer stabilizes training and alleviates Gradient Vanishing (a combined Keras sketch of ReLU, He initialization, and Batch Normalization appears after this list).
  • Residual Connections: the input skips a few layers and is added directly to the output, so information can reach the deeper layers without being lost. This is used in architectures such as ResNet.
  • Gated recurrent architectures: recurrent networks such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) use gates to control the flow of information and thereby cope with the Gradient Vanishing problem.
P.S. For the next post I'll bring an article explaining the LSTM and GRU models. Finally time to write about LSTM...