[DL] Training-Related Skills - SGD, Momentum, AdaGrad, Adam

Updating Parameters

The goal of neural-network training is to find the parameters that make the value of the loss function as low as possible. This is the problem of finding the optimal parameter values, and solving it is called optimization.
  • As the clue for finding the optimal parameter values, we have used the gradient (derivative) of the parameters.
  • By computing the gradient of the parameters and repeatedly updating the parameter values in the direction of the gradient, we gradually approach the optimal values. This is called Stochastic Gradient Descent (SGD).
  • SGD's strategy is simply to move in the direction in which the surface slopes most steeply.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) can be written as the following equation.

 

  • W ← W − η ∗ ∂L/∂W
  • Here, 'W' is the weight parameter to be updated.
  • '∂L/∂W' is the gradient of the loss function with respect to 'W'.
  • 'η' is the learning rate, a constant fixed in advance (typically a value such as 0.01 or 0.001).
  • '←' means that the value on the left is updated with the value on the right. In other words, SGD is the simple method of moving a fixed distance in the direction of the gradient.
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # lr is the learning rate

    # update() is called repeatedly during the SGD procedure
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
  • The argument lr received at initialization is the learning rate.
  • The learning rate is kept as an instance variable.
  • The update(params, grads) method is called repeatedly during SGD; params and grads are dictionary variables.
    • They store the weight parameters and gradients under keys such as params['W1'] and grads['W1'].
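A quick usage sketch of the class above (the parameter names 'W1', 'b1' and the numbers are made up purely for illustration):

import numpy as np

# hypothetical parameters and gradients, just to show how the class is called
params = {'W1': np.array([[0.5, -0.2], [0.1, 0.3]]), 'b1': np.array([0.0, 0.0])}
grads = {'W1': np.array([[0.1, 0.2], [0.0, -0.1]]), 'b1': np.array([0.01, 0.02])}

optimizer = SGD(lr=0.01)
optimizer.update(params, grads)  # params is updated in place: W <- W - lr * dL/dW
print(params['W1'])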

Drawbacks of SGD (Stochastic Gradient Descent)

Stochastic Gradient Descent (SGD) is simple and easy to implement, but depending on the problem it can be inefficient.
Let's look at SGD's weaknesses.

  • Let's compute the gradient of the function f(x, y) = x²/20 + y². Gradients are obtained with partial derivatives.
  • The partial derivative of f with respect to x differentiates with respect to x while treating y as a constant, so only the first term survives.
  • The partial derivative of f with respect to y differentiates with respect to y while treating x as a constant, so only the second term survives.
  • The gradient is therefore (the partial derivative with respect to x, the partial derivative with respect to y). Let's work them out.
Differentiating f(x, y) with respect to x: ∂f/∂x = (1/20) · 2x = x/10
Differentiating f(x, y) with respect to y: ∂f/∂y = 2y
  • The gradient is therefore (∂f/∂x, ∂f/∂y) = (x/10, 2y).
  • Looking at this gradient, it is steep (large) in the y direction but gentle (small) in the x direction.
  • Also, although the minimum is at (x, y) = (0, 0), most of the gradients do not point toward (0, 0).

  • As an example, let's apply SGD to this function starting from the initial value (-7.0, 2.0), as sketched below.
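A minimal sketch of that experiment (the learning rate 0.95 and the step count are assumed values for illustration):

import numpy as np

# f(x, y) = x**2 / 20 + y**2 and its gradient (x/10, 2y)
def df(x, y):
    return x / 10.0, 2.0 * y

x, y = -7.0, 2.0  # initial position
lr = 0.95         # assumed learning rate
path = [(x, y)]
for _ in range(30):
    dx, dy = df(x, y)
    x -= lr * dx  # SGD: step straight down the current gradient
    y -= lr * dy
    path.append((x, y))
# y overshoots back and forth across 0 (zigzag) while x only creeps toward 0.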

  • SGD traces a severely zigzagging path, which we can regard as an inefficient movement.
  • In other words, SGD's weakness is that its search path is inefficient on anisotropic functions (functions whose properties, i.e. gradients, differ depending on direction).
  • It is also worth remembering that blindly stepping in the gradient direction, as SGD does, can point away from the true minimum.
  • Let's now look at three methods that improve on these weaknesses of SGD: Momentum, AdaGrad, and Adam.

Alternatives to Stochastic Gradient Descent (SGD)

Momentum

In physics, momentum means the quantity of motion of a moving object. The update equations are:
v ← αv − η ∗ ∂L/∂W
W ← W + v
  • Here, 'W' is the weight parameter to be updated.
  • '∂L/∂W' is the gradient of the loss function with respect to 'W'.
  • 'η' is the learning rate, a constant fixed in advance (typically a value such as 0.01 or 0.001).
  • 'v' is the velocity; it represents the object picking up speed as it receives force in the direction of the gradient.
  • 'α' corresponds to ground friction or air resistance in physics (it is set to a value such as 0.9).
  • The code below implements Momentum.
import numpy as np

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr  # learning rate
        self.momentum = momentum  # momentum parameter
        self.v = None  # dictionary holding the velocity of each parameter

    def update(self, params, grads):
        if self.v is None:
            self.v = {}  # initialize v
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)  # start each parameter's velocity at 0

        for key in params.keys():
            # update every parameter using its velocity
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]  # apply the new parameter value
  • Let's use Momentum to solve the same optimization problem, f(x, y) = x²/20 + y², as before.

  • As the figure shows, Momentum's update path moves like a ball rolling around the bottom of a bowl. The 'degree of zigzag' is reduced.
  • This is because the force in the x direction is very small but never changes direction, so the point accelerates steadily in one direction.
  • Conversely, the force in the y direction is large, but it is received alternately upward and downward, so the contributions cancel and the y-direction velocity does not stabilize.
  • Overall, the path approaches the minimum faster along the x axis than SGD's, and the zigzag movement is reduced.
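For comparison, a sketch of running the Momentum class above on the same f(x, y) = x²/20 + y², again starting from (-7.0, 2.0) (the learning rate 0.1 and momentum 0.9 are assumed values):

import numpy as np

def df(x, y):
    return x / 10.0, 2.0 * y

params = {'x': -7.0, 'y': 2.0}
optimizer = Momentum(lr=0.1, momentum=0.9)  # assumed hyperparameters
for _ in range(30):
    dx, dy = df(params['x'], params['y'])
    optimizer.update(params, {'x': dx, 'y': dy})
# The accumulated velocity keeps pushing in the x direction, so the path curves
# toward the minimum with much less vertical zigzag than plain SGD.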

AdaGrad

  • If the learning rate (η) is too small, training takes too long; if it is too large, training diverges and does not converge properly.
  • Learning rate decay is the technique of gradually lowering the learning rate as training progresses.
  • The simplest way to lower the learning rate gradually is to lower it uniformly for all parameters at once; AdaGrad takes this idea further.
  • AdaGrad adjusts the learning rate adaptively, creating a 'custom-fit' value for each individual parameter as training proceeds. The equations are given below.

AdaGrad Equations

h ← h + ∂L/∂W ⊙ ∂L/∂W
W ← W − η ∗ (1/√h) ∗ ∂L/∂W
  • Here, 'W' is the weight parameter to be updated.
  • '∂L/∂W' is the gradient of the loss function with respect to 'W'.
  • 'η' is the learning rate, a constant fixed in advance (typically a value such as 0.01 or 0.001).
  • 'h' accumulates the sum of the squared gradients seen so far. '⊙' denotes element-wise multiplication of matrices.
  • When updating the parameters, multiplying by 1/√h rescales the learning rate.
  • Parameter elements that have moved a lot (been updated by large amounts) receive a lower learning rate, so the learning rate decay is applied differently to each element of the parameters. Why does this matter?
Because AdaGrad keeps adding the squares of past gradients, the update strength weakens the longer training continues.
RMSProp is a technique that improves on this problem: it gradually forgets gradients from the distant past and reflects new gradient information more strongly.
This is called an exponential moving average; it shrinks the contribution of past gradients exponentially.
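A minimal sketch of the RMSProp idea described above (this is not code from the text; the decay_rate of 0.99 is an assumed value):

import numpy as np

class RMSProp:
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate  # how strongly old squared gradients are forgotten
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            # exponential moving average of the squared gradients
            self.h[key] = self.decay_rate * self.h[key] + (1 - self.decay_rate) * grads[key] ** 2
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)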
  • Now let's look at an implementation of AdaGrad.
import numpy as np

class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate
        self.h = None  # dictionary accumulating the squared gradients of each parameter

    def update(self, params, grads):
        if self.h is None:
            self.h = {}  # initialize h
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)  # start each parameter's h at 0
        for key in params.keys():
            # update every parameter
            self.h[key] += grads[key] * grads[key]  # accumulate the squared gradients
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
            # parameter update: the learning rate is rescaled adaptively;
            # the 1e-7 added to np.sqrt(self.h[key]) prevents division by zero
  • Note the small value 1e-7 added in the last update line. It is there to prevent division by zero,
  • so that even if self.h[key] contains 0, no division-by-zero error occurs.
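As a rough illustration of this shrinking update strength, a sketch of applying the AdaGrad class above to the same f(x, y) = x²/20 + y² (the learning rate 1.5 is an assumed value):

import numpy as np

def df(x, y):
    return x / 10.0, 2.0 * y

params = {'x': -7.0, 'y': 2.0}
optimizer = AdaGrad(lr=1.5)  # assumed learning rate
for _ in range(30):
    dx, dy = df(params['x'], params['y'])
    optimizer.update(params, {'x': dx, 'y': dy})
# Because h accumulates the squared gradients, the initially large steps along y
# shrink quickly, and the path heads almost straight toward the minimum (0, 0).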


Adam

Adam is a method that, roughly speaking, fuses Momentum and AdaGrad.
  • A distinguishing feature of Adam is that it performs 'bias correction' during its updates.
  • Let's look at the Adam class code; solving the same optimization problem with it gives the result shown in the figure below.
import numpy as np

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr  # learning rate
        self.beta1 = beta1  # exponential decay rate for the first moment (momentum-like term)
        self.beta2 = beta2  # exponential decay rate for the second moment (RMSProp-like term)
        self.iter = 0  # iteration count
        self.m = None  # first moment
        self.v = None  # second moment

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)  # start each parameter's first moment at 0
                self.v[key] = np.zeros_like(val)  # start each parameter's second moment at 0

        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        # bias-corrected learning rate

        for key in params.keys():
            self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])  # update the first moment
            self.v[key] += (1 - self.beta2) * (grads[key]**2 - self.v[key])  # update the second moment
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)  # update the parameter

  • Adam's update path also moves like a ball rolling around the bottom of a bowl. The pattern resembles Momentum's, but the ball wobbles left and right less than with Momentum. This is the benefit of adjusting the update strength adaptively.

Which Update Method Should We Use?

So far we have looked at four optimization techniques: SGD, Momentum, AdaGrad, and Adam.
  • Be aware that the results can differ depending on the problem. The results also change depending on how the hyperparameters (the learning rate, etc.) are set.
  • In short, there is no single technique that is best for every problem. You should consider the situation and try several of them; since all four classes share the same interface, swapping them is easy, as sketched below.
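A sketch of how the optimizers above can be swapped in a training loop. All four classes expose the same update(params, grads) method; the commented lines follow the MultiLayerNet interface used in the MNIST experiment later in this section:

optimizers = {'SGD': SGD(lr=0.01), 'Momentum': Momentum(), 'AdaGrad': AdaGrad(), 'Adam': Adam()}
optimizer = optimizers['Adam']  # pick one and compare results per problem
# inside the training loop (params and grads come from the network being trained):
#     grads = network.gradient(x_batch, t_batch)
#     optimizer.update(network.params, grads)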

Initial Weight Values

What If the Initial Values Are All 0?

Weight decay, a technique that suppresses overfitting and improves generalization, works by learning parameters whose values stay small. Keeping the weight values small is what prevents overfitting from happening.
  • So what happens if we set the initial weight values to 0?
  • Training does not proceed properly. The reason is that in backpropagation all of the weight values are then updated identically.
  • For example, in a 2-layer neural network, if the weights of the first and second layers are 0, then during forward propagation the weights leaving the input layer are 0, so the same value is passed to every neuron in the second layer.
    • That in turn means that during backpropagation the weights of the second layer are all updated by exactly the same amount.
  • To prevent this situation, the initial values must be set randomly; a tiny numerical sketch of the symmetry problem follows below.
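A tiny numerical sketch of that symmetry problem (the network shape, input values, and the dummy loss gradient are made up for illustration):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([[1.0, 2.0]])           # one input sample with 2 features (made up)
W1 = np.zeros((2, 3))                # all-zero weights, layer 1
W2 = np.zeros((3, 1))                # all-zero weights, layer 2

h = sigmoid(x.dot(W1))               # every hidden neuron outputs 0.5: identical values
y = h.dot(W2)                        # output is 0

dy = np.array([[1.0]])               # dummy dL/dy, just to propagate something backward
dW2 = h.T.dot(dy)                    # all rows equal (0.5): W2 gets identical updates
dh = dy.dot(W2.T)                    # all zeros, because W2 is zero
dW1 = x.T.dot(dh * h * (1 - h))      # all zeros: W1 does not move, symmetry is never broken
print(dW2.ravel(), dW1)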

Activation Value Distributions in the Hidden Layers

Observing the distribution of the activation values in the hidden layers gives us important information.
  • Let's feed randomly generated input data into a 5-layer neural network that uses the sigmoid function as its activation function.
  • We then plot each layer's activation-value distribution as a histogram and see how the hidden-layer activations change depending on the initial weight values.
# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt

# sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# generate input data (1000 samples, each with 100 features)
input_data = np.random.randn(1000, 100)

# number of nodes (neurons) in each hidden layer
node_num = 100

# number of hidden layers
hidden_layer_size = 5

# dictionary that stores each hidden layer's activation values
activations = {}

# start the forward pass from the input data
x = input_data

# forward pass through each hidden layer
for i in range(hidden_layer_size):
    # the first hidden layer uses the input data;
    # every later layer uses the previous layer's activation values
    if i != 0:
        x = activations[i-1]

    # weight initialization (normal distribution with mean 0 and standard deviation 1)
    w = np.random.randn(node_num, node_num) * 1

    # weighted sum
    a = np.dot(x, w)

    # apply the activation function (sigmoid)
    z = sigmoid(a)

    # store the activation result in the dictionary
    activations[i] = z
  • There are 5 layers, each with 100 neurons.
  • As input data, 1,000 samples are generated randomly from a normal distribution and fed through this 5-layer network.
  • The activation results are stored in the activations dictionary.
  • The loop then repeats the following steps:
    • For the first hidden layer, the input x is initialized with input_data; every later hidden layer uses the previous layer's activation values.
    • The weights w are initialized with random values drawn from a normal distribution with mean 0 and standard deviation 1.
    • The inputs are multiplied by the weights (a matrix product), and the result is passed through the sigmoid function to obtain the activation values.
    • These are stored in the activations dictionary.
# plot the histograms
for i, a in activations.items():
    plt.subplot(1, len(activations), i+1)
    plt.title(str(i+1) + "-layer")
    plt.hist(a.flatten(), 30, range=(0,1))
plt.show()

Distribution of each layer's activation values when the weights are initialized from a normal distribution with standard deviation 1

  • Looking at the plot, each layer's activation values are concentrated near 0 and 1.
  • When the sigmoid function's output approaches 0 or 1, its derivative approaches 0.
  • So when the data are concentrated at 0 and 1, the gradient values in backpropagation keep shrinking until they disappear.
  • This phenomenon is called gradient vanishing; a quick numerical check follows below.
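A quick numerical check of why saturation kills the gradient: the derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)), which peaks at 0.25 and approaches 0 as the output approaches 0 or 1.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    y = sigmoid(x)
    print(x, y, y * (1 - y))  # derivatives: 0.25, ~0.105, ~0.0066, ~0.000045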

This time, let's change the standard deviation of the weights to 0.01 and run the experiment again. Only the weight-initialization line needs to change.
# weight initialization (random values from a normal distribution with mean 0 and standard deviation 0.01)
w = np.random.randn(node_num, node_num) * 0.01

  • This time the activations are concentrated around 0.5. They are not pushed toward 0 and 1, so the gradient-vanishing problem does not occur.
  • However, activation values that are concentrated like this are a serious problem from the standpoint of representational power.
  • If many neurons output almost the same value, there is little point in having multiple neurons.
  • So activations concentrated in one place are problematic in that they limit the network's representational power.

Xavier Initialization

Xavier initialization is used as the standard by common deep-learning frameworks.
  • Xavier initialization: set the standard deviation of the initial weights to 1/√n.
  • n is the number of nodes in the preceding layer.


  • With Xavier initialization, the more nodes the preceding layer has, the more narrowly spread the weights set as the target node's initial values become.
  • Let's run the experiment using Xavier initialization.
node_num = 100  # number of nodes in the preceding layer
w = np.random.randn(node_num, node_num) / np.sqrt(node_num)

Distribution of each layer's activation values when the weights are initialized with the 'Xavier initialization'

  • The result with Xavier initialization looks like the figure above. The shape becomes somewhat distorted as the layers get deeper.
  • Still, the values are clearly spread more widely than with the previous methods. Because the data stay reasonably spread out, the sigmoid function's representational power is not restricted, and training is expected to proceed efficiently.
The distortion that appears in deeper layers is improved by using tanh (the hyperbolic tangent) instead of the sigmoid function.
tanh is also an S-shaped curve, but unlike the sigmoid, which is symmetric about (0, 0.5), tanh is symmetric about the origin.
Activation functions that are symmetric about the origin are known to be preferable; a sketch of the change follows below.
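A minimal sketch of that change in the experiment above: keep the Xavier initialization and swap the activation (np.tanh is NumPy's hyperbolic tangent):

w = np.random.randn(node_num, node_num) / np.sqrt(node_num)  # Xavier initialization
a = np.dot(x, w)
z = np.tanh(a)  # tanh instead of sigmoid; outputs are centered around 0
# (when plotting, the histogram range should then be (-1, 1) rather than (0, 1))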

Weight Initialization for ReLU - He Initialization

  • When ReLU is used as the activation function, it is generally recommended to use an initialization specialized for ReLU.
  • This specialized initialization is called He initialization.
  • He initialization uses a normal distribution whose standard deviation is √(2/n), where n is the number of nodes in the preceding layer. Why? (Xavier initialization uses √(1/n).)
  • Because ReLU zeros out the negative region, a coefficient twice as large is needed to keep the activations just as widely spread.
  • Now let's look at the activation-value distributions; the weight-initialization line used for the He case is sketched right below.
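The one line that changes in the earlier experiment for the He case (node_num is the number of nodes in the preceding layer, as before):

w = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)  # He initialization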

  • When a normal distribution with standard deviation 0.01 is used for the initial weights, the activation values of each layer are extremely small.
    • When small values flow forward, the gradients of the weights during backpropagation become small as well.
  • With Xavier initialization, the distributions become slightly more skewed as the layers get deeper, which means the gradient-vanishing problem can occur.
  • With He initialization, the values are distributed uniformly in every layer, so appropriate values also flow during backpropagation.
Summary: use He initialization when the activation function is ReLU, and Xavier initialization when it is an S-shaped curve (sigmoid, tanh).

Ex. Comparing Weight Initializations on the MNIST Dataset

Let's use the MNIST dataset to see how much the way the initial weight values are set affects neural-network training.

 

# coding: utf-8
import os
import sys

sys.path.append(os.pardir)  # make files in the parent directory importable
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.util import smooth_curve
from common.multi_layer_net import MultiLayerNet
from common.optimizer import SGD


# 0. Read the MNIST data ==========
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2000


# 1. Experiment settings ==========
weight_init_types = {'std=0.01': 0.01, 'Xavier': 'sigmoid', 'He': 'relu'}
optimizer = SGD(lr=0.01)

networks = {}
train_loss = {}
for key, weight_type in weight_init_types.items():
    networks[key] = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100],
                                  output_size=10, weight_init_std=weight_type)
    train_loss[key] = []


# 2. Start training ==========
for i in range(max_iterations):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    
    for key in weight_init_types.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizer.update(networks[key].params, grads)
    
        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)
    
    if i % 100 == 0:
        print("===========" + "iteration:" + str(i) + "===========")
        for key in weight_init_types.keys():
            loss = networks[key].loss(x_batch, t_batch)
            print(key + ":" + str(loss))


# 3. Plot the graph ==========
markers = {'std=0.01': 'o', 'Xavier': 's', 'He': 'D'}
x = np.arange(max_iterations)
for key in weight_init_types.keys():
    plt.plot(x, smooth_curve(train_loss[key]), marker=markers[key], markevery=100, label=key)
plt.xlabel("iterations")
plt.ylabel("loss")
plt.ylim(0, 2.5)
plt.legend()
plt.show()

  • A 5-layer network with 100 neurons per layer was used, with ReLU as the activation function.
  • With std=0.01 no learning happens at all, while with the Xavier and He initializations training proceeds smoothly.
  • Training also progresses faster with the He initialization.
  • Once again, this shows that the initial weight values matter.