[DL] For Proper Training - Overfitting, Dropout, Hyperparameters

For Proper Training

Machine Learning์—์„œ Overfitting์ด ๋˜๋Š” ์ผ์ด ๋งŽ์Šต๋‹ˆ๋‹ค. Overiftting(์˜ค๋ฒ„ํ”ผํŒ…)์€ ์‹ ๊ฒฝ๋ง์ด Training data(ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ)์—๋งŒ ์ง€๋‚˜์น˜๊ฒŒ ์ ์šฉ๋˜์–ด์„œ ๊ทธ ์™ธ์˜ ๋ฐ์ดํ„ฐ์—๋Š” ์ œ๋Œ€๋กœ ๋Œ€์‘ํ•˜์ง€ ๋ชปํ•˜๋Š” ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

Overfitting

  • ์˜ค๋ฒ„ํ”ผํŒ…์€ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ๋งŽ๊ณ  ํ‘œํ˜„๋ ฅ์ด ๋†’์€ ๋ชจ๋ธ์ธ ๊ฒฝ์šฐ, ํ›ˆ๋ จ๋ฐ์ดํ„ฐ๊ฐ€ ์ ์€ ๊ฒฝ์šฐ์— ์ฃผ๋กœ ์ผ์–ด๋‚ฉ๋‹ˆ๋‹ค.
  • ์ด ๋‘ ์š”๊ฑด์„ ์ถฉ์กฑํ•˜์—ฌ Overiftting(์˜ค๋ฒ„ํ”ผํŒ…)์„ ์ผ์œผ์ผœ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • MNIST Dataset์˜ ํ›ˆ๋ จ๋ฐ์ดํ„ฐ์ค‘ 300๊ฐœ๋งŒ ์‚ฌ์šฉํ•˜๊ณ , 7-Layer Network๋ฅผ ์‚ฌ์šฉํ•ด์„œ Network์˜ ๋ณต์žก์„ฑ์„ ๋†’ํ˜€๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ๊ฐ Layer์˜ Neuron์€ 100๊ฐœ, Activation Function(ํ™œ์„ฑํ™” ํ•จ์ˆ˜)๋Š” ReLU ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
# ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๋Š” ์ฝ”๋“œ (Data Loader)
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# ์˜ค๋ฒ„ํ”ผํŒ…์„ ์žฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ๋ฐ์ดํ„ฐ ์ˆ˜๋ฅผ ์ค„์ž„
x_train = x_train[:300]
t_train = t_train[:300]
  • Below is the code that performs the training.
weight_decay_lambda = 0  # no weight decay in this first experiment

network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100], output_size=10,
                        weight_decay_lambda=weight_decay_lambda)
optimizer = SGD(lr=0.01)  # update parameters with SGD using learning rate 0.01

max_epochs = 201
train_size = x_train.shape[0]
batch_size = 100

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size / batch_size, 1)
epoch_cnt = 0

for i in range(1000000000):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    grads = network.gradient(x_batch, t_batch)
    optimizer.update(network.params, grads)

    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        print("epoch:" + str(epoch_cnt) + ", train acc:" + str(train_acc) + ", test acc:" + str(test_acc))

        epoch_cnt += 1
        if epoch_cnt >= max_epochs:
            break
  • train_acc_list and test_acc_list store the accuracy for each epoch. Plotting them produces the graph below.
# Plot the results
markers = {'train': 'o', 'test': 's'}
x = np.arange(max_epochs)
plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

Accuracy of training and test data over epochs

  • The accuracy measured on the training data is almost 100% from around epoch 100 onward.
  • On the test data, however, there is a large gap. This is the result of the network having adapted (fitted) only to the training data.
  • The graph shows that the model cannot properly handle the test data that was not used during training.

Weight Decay

Overiftting์„ ์–ต์ œ ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์‚ฌ์šฉ๋˜๋˜ ๋ฐฉ๋ฒ•์œผ๋กœ Weight Decay(๊ฐ€์ค‘์น˜ ๊ฐ์†Œ)๋ผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • It suppresses overfitting by imposing a penalty on large weights during training.
  • This works because overfitting often arises when the weight parameters take on large values.
  • Recall that the goal of neural network training is to reduce the value of the loss function.
  • For example, we can add the squared norm of the weights (the L2 norm) to the loss function.
  • Doing so discourages the weights from growing large.
  • Here λ (lambda) is a hyperparameter that controls the strength of the regularization; the larger it is set, the stronger the penalty on large weights.
  • The leading 1/2 in 1/2λW**2 is a constant chosen so that the derivative of 1/2λW**2 comes out as λW.
  • Let's apply weight decay with λ = 0.1.
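
Written out, the loss with L2 weight decay is as follows; this corresponds to the 0.5 * weight_decay_lambda * np.sum(W ** 2) term in the MultiLayerNet code below:

$E_{decay}(W) = E(W) + \frac{\lambda}{2}\sum_{l}\lVert W_l \rVert_2^2, \qquad \frac{\partial E_{decay}}{\partial W_l} = \frac{\partial E}{\partial W_l} + \lambda W_l$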

Weight Decay (๊ฐ€์ค‘์น˜ ๊ฐ์†Œ) Network code (by python)

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
from collections import OrderedDict
from common.layers import *
from common.gradient import numerical_gradient


class MultiLayerNet:
    """์™„์ „์—ฐ๊ฒฐ ๋‹ค์ธต ์‹ ๊ฒฝ๋ง

    Parameters
    ----------
    input_size : ์ž…๋ ฅ ํฌ๊ธฐ๏ผˆMNIST์˜ ๊ฒฝ์šฐ์—” 784๏ผ‰
    hidden_size_list : ๊ฐ ์€๋‹‰์ธต์˜ ๋‰ด๋Ÿฐ ์ˆ˜๋ฅผ ๋‹ด์€ ๋ฆฌ์ŠคํŠธ๏ผˆe.g. [100, 100, 100]๏ผ‰
    output_size : ์ถœ๋ ฅ ํฌ๊ธฐ๏ผˆMNIST์˜ ๊ฒฝ์šฐ์—” 10๏ผ‰
    activation : ํ™œ์„ฑํ™” ํ•จ์ˆ˜ - 'relu' ํ˜น์€ 'sigmoid'
    weight_init_std : ๊ฐ€์ค‘์น˜์˜ ํ‘œ์ค€ํŽธ์ฐจ ์ง€์ •๏ผˆe.g. 0.01๏ผ‰
        'relu'๋‚˜ 'he'๋กœ ์ง€์ •ํ•˜๋ฉด 'He ์ดˆ๊นƒ๊ฐ’'์œผ๋กœ ์„ค์ •
        'sigmoid'๋‚˜ 'xavier'๋กœ ์ง€์ •ํ•˜๋ฉด 'Xavier ์ดˆ๊นƒ๊ฐ’'์œผ๋กœ ์„ค์ •
    weight_decay_lambda : ๊ฐ€์ค‘์น˜ ๊ฐ์†Œ(L2 ๋ฒ•์น™)์˜ ์„ธ๊ธฐ
    """
    def __init__(self, input_size, hidden_size_list, output_size,
                 activation='relu', weight_init_std='relu', weight_decay_lambda=0):
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size_list = hidden_size_list
        self.hidden_layer_num = len(hidden_size_list)
        self.weight_decay_lambda = weight_decay_lambda
        self.params = {}

        # Initialize the weights
        self.__init_weight(weight_init_std)

        # ๊ณ„์ธต ์ƒ์„ฑ
        activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
        self.layers = OrderedDict()
        for idx in range(1, self.hidden_layer_num+1):
            self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                                      self.params['b' + str(idx)])
            self.layers['Activation_function' + str(idx)] = activation_layer[activation]()

        idx = self.hidden_layer_num + 1
        self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
            self.params['b' + str(idx)])

        self.last_layer = SoftmaxWithLoss()

    def __init_weight(self, weight_init_std):
        """๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™”
        
        Parameters
        ----------
        weight_init_std : ๊ฐ€์ค‘์น˜์˜ ํ‘œ์ค€ํŽธ์ฐจ ์ง€์ •๏ผˆe.g. 0.01๏ผ‰
            'relu'๋‚˜ 'he'๋กœ ์ง€์ •ํ•˜๋ฉด 'He ์ดˆ๊นƒ๊ฐ’'์œผ๋กœ ์„ค์ •
            'sigmoid'๋‚˜ 'xavier'๋กœ ์ง€์ •ํ•˜๋ฉด 'Xavier ์ดˆ๊นƒ๊ฐ’'์œผ๋กœ ์„ค์ •
        """
        all_size_list = [self.input_size] + self.hidden_size_list + [self.output_size]
        for idx in range(1, len(all_size_list)):
            scale = weight_init_std
            if str(weight_init_std).lower() in ('relu', 'he'):
                scale = np.sqrt(2.0 / all_size_list[idx - 1])  # recommended initial value when using ReLU
            elif str(weight_init_std).lower() in ('sigmoid', 'xavier'):
                scale = np.sqrt(1.0 / all_size_list[idx - 1])  # recommended initial value when using sigmoid
            self.params['W' + str(idx)] = scale * np.random.randn(all_size_list[idx-1], all_size_list[idx])
            self.params['b' + str(idx)] = np.zeros(all_size_list[idx])

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        """์†์‹ค ํ•จ์ˆ˜๋ฅผ ๊ตฌํ•œ๋‹ค.
        
        Parameters
        ----------
        x : ์ž…๋ ฅ ๋ฐ์ดํ„ฐ
        t : ์ •๋‹ต ๋ ˆ์ด๋ธ” 
        
        Returns
        -------
        ์†์‹ค ํ•จ์ˆ˜์˜ ๊ฐ’
        """
        y = self.predict(x)

        weight_decay = 0
        for idx in range(1, self.hidden_layer_num + 2):
            W = self.params['W' + str(idx)]
            weight_decay += 0.5 * self.weight_decay_lambda * np.sum(W ** 2)

        return self.last_layer.forward(y, t) + weight_decay

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1 : t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    def numerical_gradient(self, x, t):
        """๊ธฐ์šธ๊ธฐ๋ฅผ ๊ตฌํ•œ๋‹ค(์ˆ˜์น˜ ๋ฏธ๋ถ„).
        
        Parameters
        ----------
        x : ์ž…๋ ฅ ๋ฐ์ดํ„ฐ
        t : ์ •๋‹ต ๋ ˆ์ด๋ธ”
        
        Returns
        -------
        ๊ฐ ์ธต์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋‹ด์€ ๋”•์…”๋„ˆ๋ฆฌ(dictionary) ๋ณ€์ˆ˜
            grads['W1']ใ€grads['W2']ใ€... ๊ฐ ์ธต์˜ ๊ฐ€์ค‘์น˜
            grads['b1']ใ€grads['b2']ใ€... ๊ฐ ์ธต์˜ ํŽธํ–ฅ
        """
        loss_W = lambda W: self.loss(x, t)

        grads = {}
        for idx in range(1, self.hidden_layer_num+2):
            grads['W' + str(idx)] = numerical_gradient(loss_W, self.params['W' + str(idx)])
            grads['b' + str(idx)] = numerical_gradient(loss_W, self.params['b' + str(idx)])

        return grads

    def gradient(self, x, t):
        """๊ธฐ์šธ๊ธฐ๋ฅผ ๊ตฌํ•œ๋‹ค(์˜ค์ฐจ์—ญ์ „ํŒŒ๋ฒ•).

        Parameters
        ----------
        x : ์ž…๋ ฅ ๋ฐ์ดํ„ฐ
        t : ์ •๋‹ต ๋ ˆ์ด๋ธ”
        
        Returns
        -------
        ๊ฐ ์ธต์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ๋‹ด์€ ๋”•์…”๋„ˆ๋ฆฌ(dictionary) ๋ณ€์ˆ˜
            grads['W1']ใ€grads['W2']ใ€... ๊ฐ ์ธต์˜ ๊ฐ€์ค‘์น˜
            grads['b1']ใ€grads['b2']ใ€... ๊ฐ ์ธต์˜ ํŽธํ–ฅ
        """
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Store the results
        grads = {}
        for idx in range(1, self.hidden_layer_num+2):
            grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
            grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db

        return grads

Test Code

# coding: utf-8
import os
import sys

sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.multi_layer_net import MultiLayerNet
from common.optimizer import SGD

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# ์˜ค๋ฒ„ํ”ผํŒ…์„ ์žฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ๋ฐ์ดํ„ฐ ์ˆ˜๋ฅผ ์ค„์ž„
x_train = x_train[:300]
t_train = t_train[:300]

# Weight decay setting =======================
# weight_decay_lambda = 0  # use this to disable weight decay
weight_decay_lambda = 0.1
# ====================================================

network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100], output_size=10,
                        weight_decay_lambda=weight_decay_lambda)
optimizer = SGD(lr=0.01)  # update parameters with SGD using learning rate 0.01

max_epochs = 201
train_size = x_train.shape[0]
batch_size = 100

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size / batch_size, 1)
epoch_cnt = 0

for i in range(1000000000):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    grads = network.gradient(x_batch, t_batch)
    optimizer.update(network.params, grads)

    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        print("epoch:" + str(epoch_cnt) + ", train acc:" + str(train_acc) + ", test acc:" + str(test_acc))

        epoch_cnt += 1
        if epoch_cnt >= max_epochs:
            break


# Plot the results ==========
markers = {'train': 'o', 'test': 's'}
x = np.arange(max_epochs)
plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

Accuracy of training and test data over epochs when weight decay is applied

  • There is still a gap between the training and test accuracy, but compared with the result without weight decay, the gap has shrunk. In other words, overfitting has been suppressed to some extent.

Dropout

Weight Decay(๊ฐ€์ค‘์น˜ ๊ฐ์†Œ)๋Š” ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๊ณ , ์–ด๋Š์ •๋„ ์ง€๋‚˜์นœ ํ•™์Šต(Overfitting)์„ ์–ต์ œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฌ๋‚˜, ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์ด ๋ณต์žกํ•ด์ง€๋ฉด Weight Decay(๊ฐ€์ค‘์น˜ ๊ฐ์†Œ)๋งŒ์œผ๋กœ๋Š” ๋Œ€์‘ํ•˜๊ธฐ ์–ด๋ ค์›Œ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿด๋•Œ๋Š” ํ”ํžˆ Dropout(๋“œ๋กญ์•„์›ƒ)์ด๋ผ๋Š” ๊ธฐ๋ฒ•์„ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.
  • Dropout์€ Neuron์„ ์ž„์˜๋กœ ์‚ญ์ œํ•˜๋ฉด์„œ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • Training๋•Œ, Hidden Layer(์€๋‹‰์ธต)์˜ ๋‰ด๋Ÿฐ์„ ๋ฌด์ž‘์œ„๋กœ ๊ณจ๋ผ์„œ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ๋ฅผ ํ˜๋ฆด ๋•Œ๋งˆ๋‹ค ์‚ญ์ œํ•  ๋‰ด๋Ÿฐ์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•˜๊ณ , ์‹œํ—˜๋•Œ๋Š” ๋ชจ๋“  ๋‰ด๋Ÿฐ์— ์‹ ํ˜ธ๋ฅผ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹จ Test๋•Œ์— ๊ฐ ๋‰ด๋Ÿฐ์˜ ์ถœ๋ ฅ์— ํ›ˆ๋ จ ๋•Œ ์‚ญ์ œ ์•ˆํ•œ ๋น„์œจ์„ ๊ณฑํ•˜์—ฌ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค.

์™ผ์ชฝ์ด ์ผ๋ฐ˜์ ์ธ ์‹ ๊ฒฝ๋ง, ์˜ค๋ฅธ์ชฝ์ด Dropout์„ ์ ์šฉํ•œ ์‹ ๊ฒฝ๋ง์ž…๋‹ˆ๋‹ค. Dropout์€ ๋‰ด๋Ÿฐ์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•ด ์‚ญ์ œํ•˜์—ฌ ์‹ ํ˜ธ ์ „๋‹ฌ์„ ์ฐจ๋‹จํ•ฉ๋‹ˆ๋‹ค.

  • ํ•œ๋ฒˆ Dropout์„ ๊ตฌํ˜„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” Dropout์„ ๊ตฌํ˜„ํ•œ ์ฝ”๋“œ ์ž…๋‹ˆ๋‹ค.
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # Mark the neurons to keep as True and the neurons to delete as False
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # At test time, scale the output by the fraction of neurons kept during training
            return x * (1.0 - self.dropout_ratio)

    def backward(self, dout):
        # Let the gradient flow only through the neurons that passed the signal forward
        return dout * self.mask
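
A minimal usage sketch of this class (the input values below are arbitrary, purely for illustration):

dropout = Dropout(dropout_ratio=0.5)
x = np.array([[1.0, 2.0, 3.0, 4.0]])

out = dropout.forward(x, train_flg=True)    # roughly half of the elements become 0
dout = dropout.backward(np.ones_like(x))    # the gradient is blocked at the same positions
print(out)
print(dout)

print(dropout.forward(x, train_flg=False))  # test time: every element scaled by 1 - dropout_ratio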
  • The important point to note in this implementation is that, during training, self.mask marks the neurons to delete as False on every forward pass.
  • self.mask is a randomly generated array with the same shape as x, in which only the elements greater than dropout_ratio are set to True.
  • The behavior during backpropagation is the same as for ReLU.
  • That is, a neuron that passed the signal through during the forward pass also passes the gradient through unchanged during backpropagation,
  • and a neuron that blocked the signal during the forward pass also blocks the gradient during backpropagation.
  • Now let's check the effect of Dropout on the MNIST dataset.
    • A 7-layer network is used, with 100 neurons per layer and ReLU as the activation function.
import os
import sys
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.multi_layer_net_extend import MultiLayerNetExtend
from common.trainer import Trainer

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# ์˜ค๋ฒ„ํ”ผํŒ…์„ ์žฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ๋ฐ์ดํ„ฐ ์ˆ˜๋ฅผ ์ค„์ž„
x_train = x_train[:300]
t_train = t_train[:300]

# Dropout flag and ratio
use_dropout = True  # set to False to disable Dropout
dropout_ratio = 0.2

network = MultiLayerNetExtend(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100],
                              output_size=10, use_dropout=use_dropout, dropout_ration=dropout_ratio)
trainer = Trainer(network, x_train, t_train, x_test, t_test,
                  epochs=301, mini_batch_size=100,
                  optimizer='sgd', optimizer_param={'lr': 0.01}, verbose=True)
trainer.train()

train_acc_list, test_acc_list = trainer.train_acc_list, trainer.test_acc_list

# Plot the results
markers = {'train': 'o', 'test': 's'}
x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, marker='o', label='train', markevery=10)
plt.plot(x, test_acc_list, marker='s', label='test', markevery=10)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

์™ผ์ชฝ์€ Dropout X, ์˜ค๋ฅธ์ชฝ์€ Dropout (0.15) ์ ์šฉํ•œ ๊ฒฐ๊ณผ

  • ์œ„์˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด Dropout์„ ์ ์šฉํ•˜๋‹ˆ๊นŒ Train Data, Test Data์— ๋Œ€ํ•œ ์ •ํ™•๋„ ์ฐจ์ด๊ฐ€ ์ค„์—ˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  Training Data์— ๋Œ€ํ•œ ์ •ํ™•๋„๊ฐ€ 100%์—๋„ ๋„๋‹ฌํ•˜์ง€ ์•Š๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • ์ด์ฒ˜๋Ÿผ Dropout์„ ์ด์šฉํ•˜๋ฉด ํ‘œํ˜„๋ ฅ์„ ๋†’์ด๋ฉด์„œ๋„, Overfitting์„ ์–ต์ œํ•˜๋Š” ํšจ๊ณผ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

Finding Appropriate Hyperparameter Values

Hyperparameters include the number of neurons in each layer, the batch size, the learning rate used when updating parameters, the weight decay strength, and so on.
  • Let's look at how to search for hyperparameter values as efficiently as possible.

Validation Data

  • Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)๋ฅผ ๋‹ค์–‘ํ•œ ๊ฐ’์œผ๋กœ ์„ค์ •ํ•˜๊ณ  ๊ฒ€์ฆํ•  ํ…๋ฐ, ์—ฌ๊ธฐ์„œ ์ฃผ์˜ํ•  ์ ์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•  ๋•Œ๋Š” ์‹œํ—˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์•ˆ ๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • Test Data(์‹œํ—˜ ๋ฐ์ดํ„ฐ)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•˜๋ฉด Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ) ๊ฐ’์ด Test Data(์‹œํ—˜ ๋ฐ์ดํ„ฐ)์— Overfitting(์˜ค๋ฒ„ํ”ผํŒ…)๋˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •์šฉ ๋ฐ์ดํ„ฐ๋ฅผ ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฒ€์ฆ ๋ฐ์ดํ„ฐ(validation data)๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค. Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)์˜ ์ ์ •์„ฑ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ์ดํ„ฐ์ธ ์…ˆ์ž…๋‹ˆ๋‹ค.
  • ๋ณดํ†ต์€ Training Data(ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ)์ค‘ 20%๋ฅผ Validation Data(๊ฒ€์ฆ ๋ฐ์ดํ„ฐ)๋กœ ๋ถ„๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
๊ฐ ๋ฐ์ดํ„ฐ์˜ ์—ญํ• : Training Data(ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ - ๋งค๊ฐœ๋ณ€์ˆ˜ ํ•™์Šต), Validation Data (๊ฒ€์ฆ ๋ฐ์ดํ„ฐ - Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ) ์„ฑ๋Šฅ ํ‰๊ฐ€), Test Data(์‹œํ—˜ ๋ฐ์ดํ„ฐ - ์‹ ๊ฒธ๋ง์˜ ๋ฒ”์šฉ ์„ฑ๋Šฅ ํ‰๊ฐ€)
# (assumes the book's dataset/ and common/ directories are on the path)
from dataset.mnist import load_mnist
from common.util import shuffle_dataset

(x_train, t_train), (x_test, t_test) = load_mnist()

# Shuffle the training data
x_train, t_train = shuffle_dataset(x_train, t_train)

# Split off 20% as validation data
validation_rate = 0.2
validation_num = int(x_train.shape[0] * validation_rate)

x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]

Hyperparameter ์ตœ์ ํ™”

Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)๋ฅผ ์ตœ์ ํ™”ํ•  ๋•Œ, ์ตœ์  ๊ฐ’์ด ์กด์žฌํ•˜๋Š” ๋ฒ”์œ„๋ฅผ ์กฐ๊ธˆ์”ฉ ์ค„์—ฌ๊ฐ„๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
๋ฒ”์œ„๋ฅผ ์กฐ๊ธˆ์”ฉ ์ค„์ด๋ ค๋ฉด ๋Œ€๋žต์ ์ธ ๋ฒ”์œ„๋ฅผ ์„ค์ •ํ•˜๊ณ  ๊ทธ ๋ฒ”์œ„์—์„œ ๋ฌด์ž‘์œ„๋กœ Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)๊ฐ’์„ ์ƒ˜ํ”Œ๋งํ•œ ํ›„, ๊ทธ ๊ฐ’์œผ๋กœ ์ •ํ™•๋„๋ฅผ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•˜์—ฌ Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)์˜ '์ตœ์  ๊ฐ’'์˜ ๋ฒ”์œ„๋ฅผ ์ขํ˜€๊ฐ€๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ๋ณดํ†ต Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)์˜ ๋ฒ”์œ„๋Š” '๋Œ€๋žต์ ์œผ๋กœ' ์ง€์ •ํ•˜๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค(0.001~1000)์‚ฌ์ด.
  • ๋˜ํ•œ Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)๋ฅผ ์ตœ์ ํ™”ํ•  ๋•Œ๋Š” ์˜ค๋žœ ์‹œ๊ฐ„(๋ฉฐ์น ~๋ช‡์ฃผ ์ด์ƒ)์ด ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ ํ•™์Šต์„ ์œ„ํ•œ epoch์„ ์ž‘๊ฒŒ ํ•˜์—ฌ, 1ํšŒ ํ‰๊ฐ€์— ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„์„ ๋‹จ์ถ•ํ•˜๋Š” ๊ฒƒ์ด ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค. ์ด ๋ง์„ ์š”์•ฝํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค. 

 

  • Step 0
    Set the range of the hyperparameter values.
  • Step 1
    Randomly sample hyperparameter values from that range (see the one-line sketch after this list).
  • Step 2
    Train with the hyperparameter values sampled in step 1 and evaluate the accuracy on the validation data (keeping the number of epochs small).
  • Step 3
    Repeat steps 1 and 2 a certain number of times (e.g. 100 times), look at the resulting accuracies, and narrow the range of the hyperparameters.
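
For instance, step 1 for a rough range of 0.001 to 1000 is a single line of log-scale sampling (a sketch, assuming numpy is imported as np):

# Draw one hyperparameter value uniformly on a log scale between 10**-3 and 10**3
value = 10 ** np.random.uniform(-3, 3)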

Hyperparameter ์ตœ์ ํ™” ๊ตฌํ˜„ํ•˜๊ธฐ

Let's optimize hyperparameters using the MNIST dataset.
The problem is to search for the learning rate and the coefficient that controls the strength of weight decay.
  • Hyperparameter(ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ)์˜ ๋ฌด์ž‘์œ„ ์ถ”์ถœ ์ฝ”๋“œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.
    • (0.001~1000)์‚ฌ์ด log scale ๋ฒ”์œ„ ๋‚ด์˜ ๋ฌด์ž‘์œ„ ์ถ”์ถœ
weight_decay = 10 ** np.random.uniform(-8, -4)
lr = 10 ** np.random.uniform(-6, -2)
  • Training is performed with the randomly sampled values.
  • After that, training is repeated with various hyperparameter values while observing where the values that seem good for the network lie.
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.multi_layer_net import MultiLayerNet
from common.util import shuffle_dataset
from common.trainer import Trainer

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# Reduce the amount of training data so that results come quickly
x_train = x_train[:500]
t_train = t_train[:500]

# Split off 20% as validation data
validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)
x_train, t_train = shuffle_dataset(x_train, t_train)
x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]


def __train(lr, weight_decay, epocs=50):
    network = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100],
                            output_size=10, weight_decay_lambda=weight_decay)
    trainer = Trainer(network, x_train, t_train, x_val, t_val,
                      epochs=epocs, mini_batch_size=100,
                      optimizer='sgd', optimizer_param={'lr': lr}, verbose=False)
    trainer.train()

    return trainer.test_acc_list, trainer.train_acc_list


# ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ฌด์ž‘์œ„ ํƒ์ƒ‰
optimization_trial = 100
results_val = {}
results_train = {}
for _ in range(optimization_trial):
    # ํƒ์ƒ‰ํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ฒ”์œ„ ์ง€์ •
    weight_decay = 10 ** np.random.uniform(-8, -4)
    lr = 10 ** np.random.uniform(-6, -2)

    val_acc_list, train_acc_list = __train(lr, weight_decay)
    print("val acc:" + str(val_acc_list[-1]) + " | lr:" + str(lr) + ", weight decay:" + str(weight_decay))
    key = "lr:" + str(lr) + ", weight decay:" + str(weight_decay)
    results_val[key] = val_acc_list
    results_train[key] = train_acc_list

# Plot the results
print("=========== Hyper-Parameter Optimization Result ===========")
graph_draw_num = 20
col_num = 5
row_num = int(np.ceil(graph_draw_num / col_num))
i = 0

for key, val_acc_list in sorted(results_val.items(), key=lambda x:x[1][-1], reverse=True):
    print("Best-" + str(i+1) + "(val acc:" + str(val_acc_list[-1]) + ") | " + key)

    plt.subplot(row_num, col_num, i+1)
    plt.title("Best-" + str(i+1))
    plt.ylim(0.0, 1.0)
    if i % 5: plt.yticks([])
    plt.xticks([])
    x = np.arange(len(val_acc_list))
    plt.plot(x, val_acc_list)
    plt.plot(x, results_train[key], "--")
    i += 1

    if i >= graph_draw_num:
        break

plt.show()

์‹ค์„ ์€ Validation Data์— ๋Œ€ํ•œ ์ •ํ™•๋„, ์ ์„ ์€ Training Data์— ๋Œ€ํ•œ ์ •ํ™•๋„

Best-1 (val acc:0.83) | lr:0.0092, weight decay:3.86e-07
Best-2 (val acc:0.78) | lr:0.00956, weight decay:6.04e-07
Best-3 (val acc:0.77) | lr:0.00571, weight decay:1.27e-06
Best-4 (val acc:0.74) | lr:0.00626, weight decay:1.43e-05
Best-5 (val acc:0.73) | lr:0.0052, weight decay:8.97e-06
  • In this way we narrow down the range where appropriate values lie and, at a certain stage, choose one final hyperparameter value.

Summary

  • Besides stochastic gradient descent (SGD), parameter update methods include Momentum, AdaGrad, Adam, and others.
  • How the initial weight values are chosen is very important for training to go well.
  • Xavier initialization and He initialization are effective choices for the initial weights.
  • Batch Normalization speeds up training and makes it less dependent on the initial weight values.
  • Weight decay and Dropout are regularization techniques for suppressing overfitting.
  • Hyperparameter search is done effectively by gradually narrowing the range where optimal values are likely to exist.