๐Ÿ–ฅ๏ธ Deep Learning

[DL] Batch Normalization

Bigbread1129 2024. 5. 1. 23:54

Batch Normalization

Batch Normalization is a technique proposed in 2015.
  • Batch Normalization has drawn attention for the following reasons:
    • Training can proceed faster; in other words, it improves training speed.
    • It makes the network far less dependent on the initial weight values.
    • It suppresses overfitting, reducing the need for techniques such as Dropout.
  • The basic idea of Batch Normalization, as mentioned above, is to adjust the activation values at each layer so that they take on a suitable distribution. Let's look at an example.

An example of a neural network that uses Batch Normalization

  • As its name suggests, Batch Normalization normalizes the data in units of mini-batches during training.
A mini-batch is a small slice of the dataset used to train the network.
Instead of feeding the entire dataset at once, the data is split into small batches and the network is trained on each batch in turn.
  • Concretely, it normalizes the values so that their mean is 0 and their variance is 1. In equation form:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$$

    • The equations above compute the mean μ_B and the variance σ_B² of the mini-batch B = {x₁, x₂, …, x_m}, a set of m input values.
    • The input data are then normalized to mean 0 and variance 1, and ε is a tiny constant that prevents division by zero.
    • In addition, each Batch Normalization layer applies its own scale and shift transformation to the normalized data.
    • In equation form:

$$y_i = \gamma \hat{x}_i + \beta$$

  • γ is responsible for the scale, and β for the shift.
  • Both values start at γ = 1 and β = 0 (scale by 1, shift by 0, i.e., the original values unchanged) and are then adjusted to suitable values during training.
  • This is the Batch Normalization algorithm. The procedure described above can be drawn as the computational graph below.

The computational graph of Batch Normalization
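
To make the algorithm concrete, here is a minimal NumPy sketch of the forward pass described above (normalize over the mini-batch, then scale and shift). The function name batch_norm_forward, the toy input, and the eps default are illustrative choices for this post, not code from any library.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-7):
    """Normalize a mini-batch x of shape (m, features), then scale and shift."""
    mu = x.mean(axis=0)                    # mean of each feature over the mini-batch
    var = x.var(axis=0)                    # variance of each feature over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # mean 0, variance 1; eps prevents division by zero
    return gamma * x_hat + beta            # learnable scale (gamma) and shift (beta)

# With the initial values gamma = 1 and beta = 0, the output is just the normalized input.
x = np.random.randn(100, 5) * 3.0 + 10.0   # a toy mini-batch far from mean 0, variance 1
out = batch_norm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature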


The Effect of Batch Normalization

Let's run an experiment on the MNIST dataset to see how training progress differs when a Batch Normalization layer is used and when it is not.
# coding: utf-8
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
sys.path.append(os.pardir)  # make files in the parent directory importable
from dataset.mnist import load_mnist  # function that loads the MNIST dataset
from common.multi_layer_net_extend import MultiLayerNetExtend  # multi-layer network class
from common.optimizer import SGD, Adam  # optimizer classes

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

# Reduce the amount of training data
x_train = x_train[:1000]
t_train = t_train[:1000]

max_epochs = 20
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.01

# Set the weight-initialization standard deviation and define the training function
def __train(weight_init_std):
    # Build one network with batch normalization and one without
    bn_network = MultiLayerNetExtend(input_size=784,
                                     hidden_size_list=[100, 100, 100, 100, 100],
                                     output_size=10,
                                     weight_init_std=weight_init_std,
                                     use_batchnorm=True)  # use batch normalization
    network = MultiLayerNetExtend(input_size=784,
                                  hidden_size_list=[100, 100, 100, 100, 100],
                                  output_size=10,
                                  weight_init_std=weight_init_std)  # no batch normalization
    optimizer = SGD(lr=learning_rate)  # stochastic gradient descent

    train_acc_list = []  # accuracy history without batch normalization
    bn_train_acc_list = []  # accuracy history with batch normalization

    iter_per_epoch = max(train_size / batch_size, 1)
    epoch_cnt = 0

    for i in range(1000000000):
        batch_mask = np.random.choice(train_size, batch_size)  # sample a random mini-batch
        x_batch = x_train[batch_mask]
        t_batch = t_train[batch_mask]

        for _network in (bn_network, network):
            grads = _network.gradient(x_batch, t_batch)  # compute gradients
            optimizer.update(_network.params, grads)  # update weights

        if i % iter_per_epoch == 0:
            train_acc = network.accuracy(x_train, t_train)  # accuracy without batch normalization
            bn_train_acc = bn_network.accuracy(x_train, t_train)  # accuracy with batch normalization
            train_acc_list.append(train_acc)
            bn_train_acc_list.append(bn_train_acc)

            print("epoch:" + str(epoch_cnt) + " | " + str(train_acc) + " - "
                  + str(bn_train_acc))

            epoch_cnt += 1
            if epoch_cnt >= max_epochs:
                break

    return train_acc_list, bn_train_acc_list


# Draw the graphs ==========
weight_scale_list = np.logspace(0, -4, num=16)
x = np.arange(max_epochs)

for i, w in enumerate(weight_scale_list):
    print("============== " + str(i+1) + "/16" + " ==============")
    train_acc_list, bn_train_acc_list = __train(w)

    plt.subplot(4, 4, i+1)
    plt.title("W:" + str(w))
    if i == 15:
        plt.plot(x, bn_train_acc_list, label='Batch Normalization', markevery=2)
        plt.plot(x, train_acc_list, linestyle="--", label='Normal (without BatchNorm)', markevery=2)
    else:
        plt.plot(x, bn_train_acc_list, markevery=2)
        plt.plot(x, train_acc_list, linestyle="--", markevery=2)

    plt.ylim(0, 1.0)
    if i % 4:
        plt.yticks([])
    else:
        plt.ylabel("accuracy")
    if i < 4:
        plt.xticks([])
    else:
        plt.xlabel("epochs")
    plt.legend(loc='lower right')

plt.show()

Effect of Batch Normalization: Batch Normalization speeds up training.

  • As the graph shows, Batch Normalization makes training progress faster.
  • The next graph observes how training proceeds while varying the standard deviation of the initial weight values.

  • ๊ฑฐ์ด ๋ชจ๋“  ๊ฒฝ์šฐ์—์„œ Batch Normalization(๋ฐฐ์น˜ ์ •๊ทœํ™”)๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ์˜ Training(ํ•™์Šต) ์†๋„๊ฐ€ ๋น ๋ฅธ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค.
  • ์‹ค์ œ๋กœ Batch Normalization(๋ฐฐ์น˜ ์ •๊ทœํ™”)๋ฅผ ์ด์šฉํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ์—๋Š” ์ดˆ๊ฐ“๊ฐ’์ด ์ž˜ ๋ถ„ํฌ๋˜์–ด ์žˆ์ง€ ์•Š์œผ๋ฉด Training(ํ•™์Šต)์ด ์ „ํ˜€ ์ง„ํ–‰๋˜์ง€ ์•Š์€ ๋ชจ์Šต๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
Summary: with Batch Normalization, training proceeds faster, and the network no longer depends heavily on the initial weight values.