[DL] Gradient and the Training Algorithm

Gradient

๋งŒ์•ฝ์— x0, x1์˜ ํŽธ๋ฏธ๋ถ„์„ ๋™์‹œ์— ๊ณ„์‚ฐํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด ์–ด๋–ป๊ฒŒ ํ• ๊นŒ์š”?
  • ๊ทธ๋Ÿฌ๋ฉด ๋ชจ๋“  ํŽธ๋ฏธ๋ถ„์„ ๋ฒกํ„ฐ๋กœ ์ •๋ฆฌ๋ฅผ ํ•ด์•ผ ํ•˜๋Š”๋ฐ, ๊ทธ ์ •๋ฆฌํ•œ๊ฒƒ์„ Grdient(๊ธฐ์šธ๊ธฐ)๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด์„œ ์•„๋ž˜์˜ ์ฝ”๋“œ์™€ ๊ฐ™์ด ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
import numpy as np

def numerical_gradient(f, x):
    h = 1e-4
    grad = np.zeros_like(x)  # create an array with the same shape as x, filled with 0

    for idx in range(x.size):
        tmp_val = x[idx]

        # compute f(x+h)
        x[idx] = tmp_val + h
        fxh1 = f(x)

        # compute f(x-h)
        x[idx] = tmp_val - h
        fxh2 = f(x)

        grad[idx] = (fxh1 - fxh2) / (2*h)
        x[idx] = tmp_val  # restore the original value

    return grad
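For example, for f(x0, x1) = x0² + x1² the gradient at a few points can be checked like this (a minimal sketch; function_2 and the sample points are my own illustration, not from the text):

def function_2(x):
    return x[0]**2 + x[1]**2  # f(x0, x1) = x0^2 + x1^2

print(numerical_gradient(function_2, np.array([3.0, 4.0])))  # roughly [6. 8.]
print(numerical_gradient(function_2, np.array([0.0, 2.0])))  # roughly [0. 4.]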
  • The implementation of numerical_gradient(f, x) looks a bit complicated, but it behaves almost exactly like numerical differentiation of a single variable.
  • For reference, np.zeros_like(x) creates an array with the same shape as x whose elements are all 0.
  • The argument f of numerical_gradient(f, x) is a function and x is a NumPy array, so the numerical derivative is computed for each element of the NumPy array x.
  • So what does the gradient actually mean? Let's look at it with a figure.

์œ„์˜ ์ˆ˜์‹์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋‚˜ํƒ€๋‚ธ ๊ทธ๋ฆผ.

  • ์ด ๊ทธ๋ฆผ์€ Gradient(๊ธฐ์šธ๊ธฐ)์˜ ๊ฒฐ๊ณผ์— ๋งˆ์ด๋„ˆ์Šค๋ฅผ ๋ถ™์ธ ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค.
  • Gradient(๊ธฐ์šธ๊ธฐ)๊ทธ๋ฆผ์€ '๊ฐ€์žฅ ๋‚ฎ์€ ์žฅ์†Œ(์ตœ์†Œ๊ฐ’)'์„ ๊ฐ€๋ฆฌํ‚ค๋Š”๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋งˆ์น˜ ๋‚˜์นจ๋ฐ˜์ฒ˜๋Ÿผ ํ™”์‚ดํ‘œ๋“ค์€ ํ•œ์ ์„ ํ–ฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  '๊ฐ€์žฅ ๋‚ฎ์€๊ณณ'์—์„œ ๋ฉ€์–ด์งˆ์ˆ˜๋ก ํ™”์‚ดํ‘œ์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์ง์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๊ธฐ์šธ๊ธฐ๋Š” ๊ฐ ์ง€์ ์—์„œ ๋‚ฎ์•„์ง€๋Š” ๋ฐฉํ–ฅ์„ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค.
  • ์ •ํ™•ํžˆ ๋งํ•˜๋ฉด, ๊ธฐ์šธ๊ธฐ๊ฐ€ ๊ฐ€๋ฆฌํ‚ค๋Š” ์ชฝ์€ ๊ฐ ์žฅ์†Œ์—์„œ ํ•จ์ˆ˜์˜ ์ถœ๋ ฅ ๊ฐ’์„ ๊ฐ€์žฅ ํฌ๊ฒŒ ์ค„์ด๋Š” ๋ฐฉํ–ฅ์ž…๋‹ˆ๋‹ค. ์ด๊ฑด ์ค‘์š”ํ•œ ์ ์ž…๋‹ˆ๋‹ค!
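A quick numerical check of this point (my own illustration, reusing function_2 and numerical_gradient from the sketch above): taking a small step against the gradient lowers the function value.

x = np.array([3.0, 4.0])
grad = numerical_gradient(function_2, x)   # roughly [6. 8.]
x_new = x - 0.1 * grad                     # small step in the -gradient direction
print(function_2(x), function_2(x_new))    # 25.0 -> roughly 16.0 (the value decreased)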

Gradient Descent (Gradient Method)

  • We need to find the parameter values at which the loss function reaches its minimum, and the gradient can be used to search for that minimum.
  • Repeatedly moving in the direction indicated by the gradient to gradually reduce the function's value is called the gradient method.
  • However, there is no guarantee that the place the gradient points to really contains the function's minimum, or even that it is the right direction to keep heading in.
  • In fact, for complicated functions the gradient direction usually does not point straight at a minimum. Even so, moving along the gradient is the best way to reduce the function's value at each step, which is why the gradient method follows it.
  • Let's write the gradient method as a formula.

The gradient method written as a formula (for two variables x0, x1):

  x0 ← x0 − η ∂f/∂x0
  x1 ← x1 − η ∂f/∂x1

  • ์œ„์˜ ์‹์—์„œ n๊ธฐํ˜ธ(์—ํƒ€)๋Š” ๊ฐฑ์‹ ํ•˜๋Š” ์–‘์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด๋ฅผ ์‹ ๊ฒฝ๋ง์—์„œ๋Š” Learning Rate(ํ•™์Šต๋ฅ )์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • 1๋ฒˆ์˜ ํ•™์Šต์œผ๋กœ ์–ผ๋งˆ๋งŒํผ ํ•™์Šตํ•ด์•ผ ํ• ์ง€, ์ฆ‰, ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ฐ’์„ ์–ผ๋งˆ๋‚˜ ๊ฐฑ์‹ ํ•˜๋Š๋ƒ๋ฅผ ์ •ํ•˜๋Š” ๊ฒƒ์ด Learning Rate(ํ•™์Šต๋ฅ )์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ์œ„์˜ ์ˆ˜์‹์€ 1ํšŒ์— ํ•ด๋‹นํ•˜๋Š” ๊ฐฑ์‹ ์ด๊ณ , ์ด ๋‹จ๊ณ„๋ฅผ ๋ฐ˜๋ณตํ•ด์„œ ์„œ์„œ์ด ํ•จ์ˆ˜์˜ ๊ฐ’์„ ์ค„์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ๋˜ํ•œ Learning Rate(ํ•™์Šต๋ฅ )์˜ ๊ฐ’์€ ๋ฏธ๋ฆฌ ํŠน์ • ๊ฐ’์œผ๋กœ ์ •ํ•ด๋†”์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ’์ด ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ์ž‘์œผ๋ฉด '์ข‹์€ ์žฅ์†Œ' ๋ฅผ ์ฐพ์•„๊ฐˆ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.
  • Gradient Descent(๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•)์€ ์•„๋ž˜์˜ ์ฝ”๋“œ๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x

    for i in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad  # one update: move against the gradient

    return x
  • Let me walk through the code.
  • The argument f is the function to optimize, init_x is the initial value, lr is the learning rate, and step_num is the number of gradient-method iterations.
  • numerical_gradient(f, x) computes the gradient of the function.
  • The update of x by the gradient multiplied by the learning rate is then repeated step_num times.
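For example, searching for the minimum of f(x0, x1) = x0² + x1² with gradient_descent looks like this (a sketch reusing function_2 from above; the learning-rate values are just for illustration):

init_x = np.array([-3.0, 4.0])

# a reasonable learning rate converges close to the true minimum (0, 0)
print(gradient_descent(function_2, init_x=init_x.copy(), lr=0.1, step_num=100))
# -> roughly [-6.1e-10  8.1e-10]

# too large a learning rate diverges; too small a one barely moves from the initial value
print(gradient_descent(function_2, init_x=init_x.copy(), lr=10.0, step_num=100))
print(gradient_descent(function_2, init_x=init_x.copy(), lr=1e-10, step_num=100))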

Gradient in a Neural Network

  • ์‹ ๊ฒฝ๋ง ํ•™์Šต์—์„œ๋„ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ตฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋งํ•˜๋Š” Gradient(๊ธฐ์šธ๊ธฐ)๋Š” Weight Parameter(๊ฐ€์ค‘์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜)์— ๋Œ€ํ•œ Loss Function(์†์‹คํ•จ์ˆ˜)์˜ Gradient(๊ธฐ์šธ๊ธฐ)์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด ๊ฐ„๋‹จํ•œ ์‹ ๊ฒธ๋ง์„ ์˜ˆ๋ฅผ ๋“ค์–ด์„œ ์‹ค์ œ๋กœ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์ฝ”๋“œ๋ฅผ ๊ตฌํ˜„ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. (by Python)
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # make files in the parent directory importable
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient


class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2,3) # initialize with a standard normal distribution

    def predict(self, x):
        return np.dot(x, self.W)

    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)

        return loss

x = np.array([0.6, 0.9])
t = np.array([0, 0, 1])

net = simpleNet()

f = lambda w: net.loss(x, t)
dW = numerical_gradient(f, net.W)

print(dW)
  • ์—ฌ๊ธฐ์„œ ๋ด์•ผํ•˜๋Š”๊ฑด simpleNet ํด๋ž˜์Šค ์ž…๋‹ˆ๋‹ค.
  • simpleNet ํด๋ž˜์Šค๋Š” ํ˜•์ƒ์ด 2 * 3์ธ Weight(๊ฐ€์ค‘์น˜) ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํ•˜๋‚˜์˜ Instance ๋ณ€์ˆ˜๋กœ ๊ฐ€์ง‘๋‹ˆ๋‹ค.
  • Method๋Š” 2๊ฐœ์ธ๋ฐ, ํ•˜๋‚˜๋Š” ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋Š” predict(x)์ด๊ณ , ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” Loss Function์˜ ๊ฐ’์„ ๊ตฌํ•˜๋Š” loss(x, t)์ž…๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ ์ธ์ˆ˜๋Š” x๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ, t๋Š” ์ •๋‹ต ๋ ˆ์ด๋ธ” ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด simpleNet์„ ์‚ฌ์šฉํ•ด ๋ช‡๊ฐ€์ง€ ์‹œํ—˜์„ ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
>>> net = simpleNet()
>>> print(net.W) # the weight parameters
[[ 0.47355232  0.9977393   0.84668094]
 [ 0.85557411  0.03563661  0.69422093]]
>>> x = np.array([0.6, 0.9])
>>> p = net.predict(x)
>>> print(p)
[1.05414809 0.63071653 1.1328074]
>>> np.argmax(p) # index of the maximum value
2
>>> t = np.array([0, 0, 1]) # the correct label
>>> net.loss(x, t)
0.9280685366
  • Next, let's compute the gradient. As before, numerical_gradient(f, x) does the job.
  • The argument W of the function f(W) defined here is a dummy.
  • numerical_gradient(f, x) calls f(x) internally, and f(W) is defined that way for consistency with it.
>>> def f(W):
...     return net.loss(x, t)
...
>>> dW = numerical_gradient(f, net.W)
>>> print(dW) # a 2x3 two-dimensional array
[[ 0.21924763  0.14356247 -0.36281009]
 [ 0.32887144  0.2153437  -0.54421514]]
  • The argument f of numerical_gradient(f, x) is a function, and x is the argument passed to that function f.
  • So here we defined a new function f that takes W as its argument and computes the loss function.
  • This newly defined function is then passed to numerical_gradient(f, x).
  • dW, the result of numerical_gradient(f, net.W), is a 2×3 two-dimensional array.
  • Once the gradient of the neural network is obtained like this, all that remains is to update the weight parameters with the gradient method.
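For example, a single gradient-descent update of this network's weights could look like the sketch below (the learning rate 0.1 is just an illustrative choice):

learning_rate = 0.1
net.W -= learning_rate * dW   # move the weights slightly against the gradient
print(net.loss(x, t))         # the loss should now be a bit smaller than before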

Training Algorithm

Let's go over the procedure of neural network training.

Premise

  • ์‹ ๊ฒฝ๋ง์—๋Š” ์ ์‘ ๊ฐ€๋Šฅํ•œ Weight(๊ฐ€์ค‘์น˜)์™€ Bias(ํŽธํ–ฅ)์ด ์žˆ๊ณ , ์ด Weight(๊ฐ€์ค‘์น˜)์™€ Bias(ํŽธํ–ฅ)์„ Training Data(ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ)์— ์ ์‘ํ•˜๋„๋ก ์กฐ์ •ํ•˜๋Š” ๊ณผ์ •์„ Training(ํ•™์Šต)์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  Neural Network Training(์‹ ๊ฒฝ๋ง ํ•™์Šต)์€ 4๋‹จ๊ณ„๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Step 1 - Mini-Batch

  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ์ค‘ ์ผ๋ถ€๋ฅผ ๋ฌด์ž‘์œ„๋กœ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์„ ๋ณ„ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ Mini-Batch(๋ฏธ๋‹ˆ๋ฐฐ์น˜) ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๊ทธ Mini-Batch(๋ฏธ๋‹ˆ๋ฐฐ์น˜)์˜ Loss Function Value(์†์‹ค ํ•จ์ˆ˜ ๊ฐ’)์„ ์ค„์ด๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค.

Step 2 - Computing the Gradient

  • Mini-Batch์˜ Loss Function ๊ฐ’์„ ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ ๊ฐ Weight Paraemter(๊ฐ€์ค‘์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜)์˜ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.
  • Gradient(๊ธฐ์šธ๊ธฐ)๋Š” Loss Function Value(์†์‹ค ํ•จ์ˆ˜ ๊ฐ’)์„ ๊ฐ€์žฅ ์ž‘๊ฒŒ ํ•˜๋Š” ๋ฐฉํ–ฅ์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

Step 3 - Updating the Parameters

  • Update the weight parameters by a tiny amount in the gradient direction.

Step 4 - Repeat

  • Repeat Steps 1 through 3.

  • ์ด๊ฒƒ์ด Neural Network Training(์‹ ๊ฒฝ๋ง ํ•™์Šต)์ด ์ด๋ฃจ์–ด์ง€๋Š” ์ˆœ์„œ์ž…๋‹ˆ๋‹ค.
  • ์ด๋Š” Gradient Descent(๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•)์œผ๋กœ Paraemter(๋งค๊ฐœ๋ณ€์ˆ˜)๋ฅผ ๊ฐฑ์‹ ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ์ด๋•Œ, Data๋ฅผ Mini-Batch๋กœ ๋ฌด์ž‘์œ„๋กœ ์„ ์ •ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ• (Stochastic Gradient Descent, SGD)๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

Implementing a 2-Layer Neural Network

  • ์ฒ˜์Œ์—๋Š” 2์ธต ์‹ ๊ฒฝ๋ง์„ ํ•˜๋‚˜์˜ ํด๋ž˜์Šค๋กœ ๊ตฌํ˜„ํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # make files in the parent directory importable
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient


class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
    
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        return y
        
    # x : input data, t : correct labels
    def loss(self, x, t):
        y = self.predict(x)
        
        return cross_entropy_error(y, t)
    
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
        
    # x : input data, t : correct labels
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
        return grads
        
    def gradient(self, x, t):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        grads = {}
        
        batch_num = x.shape[0]
        
        # forward
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        # backward
        dy = (y - t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis=0)
        
        da1 = np.dot(dy, W2.T)
        dz1 = sigmoid_grad(a1) * da1
        grads['W1'] = np.dot(x.T, dz1)
        grads['b1'] = np.sum(dz1, axis=0)

        return grads

 

TwoLayerNet ํด๋ž˜์Šค๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ๋ณ€์ˆ˜

TwoLayerNet ํด๋ž˜์Šค์˜ Method


Implementing Mini-Batch Training

  • In mini-batch training, a portion of the training data is drawn at random (the mini-batch), and the parameters are updated with the gradient method for that mini-batch.
  • Now let's train using the TwoLayerNet class and the MNIST dataset.
import numpy as np
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

train_loss_list = []

# hyperparameters
iters_num = 10000      # number of iterations
train_size = x_train.shape[0]
batch_size = 100       # mini-batch size
learning_rate = 0.1
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

for i in range(iters_num):
    # get a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # compute the gradient
    grad = network.gradient(x_batch, t_batch)

    # update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # record the training progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)
  • Below is the graph of how the loss function value changes over training (the plot itself is not reproduced here).

  • Looking at the graph, you can see that the value of the loss function decreases as the number of training iterations grows.
  • This means training is going well: the neural network's weight parameters are gradually adapting to the data, i.e., the network is learning.

Evaluating with Test Data

  • The loss function value discussed so far is the loss on a mini-batch of the training data.
  • In neural network training we must also check that overfitting is not occurring.
  • Overfitting means the model only classifies the images contained in the training data correctly and cannot properly recognize images outside it.
import numpy as np
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

# hyperparameters
iters_num = 10000      # number of iterations
train_size = x_train.shape[0]
batch_size = 100       # mini-batch size
learning_rate = 0.1

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

train_loss_list = []
train_acc_list = []
test_acc_list = []

# number of iterations per epoch
iter_per_epoch = max(train_size / batch_size, 1)

for i in range(iters_num):
    # get a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # compute the gradient
    grad = network.gradient(x_batch, t_batch)

    # update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # record the training progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # compute accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("train acc, test acc | " + str(train_acc) + ", " + str(test_acc))
  • Once every epoch, the accuracy on all of the training data and all of the test data is computed and recorded.

  • What we can see here is that the accuracy on both the training data and the test data improves with each epoch of training, and the two stay close together.
  • In other words, overfitting is not occurring.
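The two accuracy curves can be plotted the same way to compare them directly (a sketch, same matplotlib assumption as above):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, label='train acc')
plt.plot(x, test_acc_list, linestyle='--', label='test acc')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()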

Summary

- The dataset used in machine learning is split into training data and test data.
- The generalization ability of a model trained on the training data is evaluated with the test data.
- Neural network training uses the loss function as its indicator and updates the weight parameters so that the value of the loss function decreases.
- When updating the weight parameters, their gradient is used, and the process of updating the weight values in the gradient direction is repeated.
- Differentiation based on the difference obtained with a tiny value is called numerical differentiation.
- Numerical differentiation can be used to obtain the gradients of the weight parameters.
- Numerical differentiation takes time to compute, but its implementation is simple. On the other hand, the backpropagation method implemented in the next chapter computes gradients much faster.