[NLP] RNNLM - A Language Model Using an RNN

RNNLM (A Language Model Using an RNN)

์ด๋ฒˆ์—๋Š” RNN์„ ์‚ฌ์šฉํ•˜์—ฌ Language Model(์–ธ์–ด ๋ชจ๋ธ)์„ ๊ตฌํ˜„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

  • Before we start, let's take a look at the Neural Network that will be used.
  • On the left is the layer configuration of the RNNLM; on the right is the same network unrolled along the time axis.

RNNLM์˜ ์‹ ๊ฒฝ๋ง (์™ผ์ชฝ์ด ํŽผ์น˜๊ธฐ ์ „, ์˜ค๋ฅธ์ชฝ์€ ํŽผ์นœ ํ›„)

  • ๊ทธ๋ฆผ์˜ Embedding Layer(๊ณ„์ธต)์€ ๋‹จ์–ด ID์˜ ๋ถ„์‚ฐ ํ‘œํ˜„ (๋‹จ์–ด Vector)๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๊ทธ ๋ถ„์‚ฐ ํ‘œํ˜„์ด RNN Layer(RNN ๊ณ„์ธต)๋กœ ์ž…๋ ฅ๋ฉ๋‹ˆ๋‹ค.
  • RNN ๊ณ„์ธต์€ Hidden State(์€๋‹‰ ์ƒํƒœ)๋ฅผ ๋‹ค์Œ Layer(์ธต)์œผ๋กœ ์ถœ๋ ฅํ•จ๊ณผ ๋™์‹œ์—, ๋‹ค์Œ ์‹œ๊ฐ์˜ RNN ๊ณ„์ธต(์˜ค๋ฅธ์ชฝ)์œผ๋กœ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  RNN ๊ณ„์ธต์ด ์œ„๋กœ ์ถœ๋ ฅํ•œ Hidden State(์€๋‹‰ ์ƒํƒœ)๋Š” Affine ๊ณ„์ธต์„ ๊ฑฐ์ณ Softmax ๊ณ„์ธต์œผ๋กœ ์ „ํ•ด์ง‘๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด ํ•œ๋ฒˆ Sample Corpus(๋ง๋ญ‰์น˜)๋ฅผ ํ•œ๋ฒˆ ์ค˜์„œ ์‚ฌ์šฉํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
"You say goodbye and I say hello"

An example of the RNNLM processing the sample corpus

  • ์œ„์˜ ๊ทธ๋ฆผ์€ RNNLM(์ˆœํ™˜ ์‹ ๊ฒฝ๋ง ์–ธ์–ด๋ชจ๋ธ)์˜ ์ž‘๋™ ๋ฐฉ์‹์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋Š” ๋‹จ์–ด ID ๋ฐฐ์—ด๋กœ, ์ฒ˜์Œ์—๋Š” ๋‹จ์–ด "you"๊ฐ€ ์ž…๋ ฅ๋˜๊ณ , Softmax ๊ณ„์ธต์ด "say"๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค
  • ๋‘ ๋ฒˆ์งธ ๋‹จ์–ด "say"๋ฅผ ์ž…๋ ฅํ•˜๋ฉด, Softmax ๊ณ„์ธต์€ "goodbye"์™€ "hello" ์ค‘ ๋†’์€ ํ™•๋ฅ ๋กœ "goodbye"๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • RNN ๊ณ„์ธต์€ ์ด์ „ ๋‹จ์–ด "you say"๋ฅผ ๊ธฐ์–ตํ•˜์—ฌ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.
  • RNN์€ ๊ณผ๊ฑฐ ๋ฐ์ดํ„ฐ๋ฅผ ์€๋‹‰ ์ƒํƒœ ๋ฒกํ„ฐ๋กœ ์ €์žฅํ•˜๊ณ , ์ด๋ฅผ ํ†ตํ•ด ๊ณผ๊ฑฐ์˜ ์ •๋ณด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ํ˜„์žฌ์™€ ๋ฏธ๋ž˜์˜ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

 

Implementing the Time Layers

์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ๊บผ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณ„์ธต์„ Time RNN ์ด๋ผ๋Š” ์ด๋ฆ„์˜ ๊ณ„์ธต์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ฒˆ์—๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ๊บผ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณ„์ธต์„ Time Embedding, Time Affine ํ˜•ํƒœ์˜ ์ด๋ฆ„์œผ๋กœ ๊ตฌํ˜„ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ๊บผ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณ„์ธต์„ Time XX ๊ณ„์ธต์œผ๋กœ ๊ตฌํ˜„

T๊ฐœ๋ถ„์˜ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณ„์ธต์„ Time XX ๊ณ„์ธต ์ด๋ผ๊ณ  ๋ถ€๋ฅด๊ฒ ์Šต๋‹ˆ๋‹ค.
์ด๋Ÿฌํ•œ ๊ณ„์ธต๋“ค์ด ๊ตฌํ˜„๋˜์–ด ์žˆ๋‹ค๋ฉด ๊ทธ ๊ณ„์ธต๋“ค์„ ๋ ˆ๊ณ  ๋ธ”๋Ÿญ์ฒ˜๋Ÿผ ์กฐ๋ฆฝํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ์‹ ๊ฒฝ๋ง์„ ์™„์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Time ๊ณ„์ธต์€ ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Time Affine ๊ณ„์ธต์€ ์•„๋ž˜์˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ Affine ๊ณ„์ธต์„ T๊ฐœ ์ค€๋น„ํ•ด์„œ, ๊ฐ ์‹œ๊ฐ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐœ๋ณ„์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

Time Affine ๊ณ„์ธต์€ T๊ฐœ์˜ Affine ๊ณ„์ธต์˜ ์ง‘ํ•ฉ์œผ๋กœ ๊ตฌํ˜„๋ฉ๋‹ˆ๋‹ค.

  • The Time Embedding layer likewise prepares T Embedding layers during forward propagation, each processing the data of its own time step; a minimal sketch follows this list.
  • The Time Affine layer, on the other hand, is implemented not by literally using T separate Affine layers but in a more efficient way: a single batched matrix computation.

 

Time Affine Class Source Code (Python)

import numpy as np

class TimeAffine:
    def __init__(self, W, b):
        """
        Initialize the layer.
        
        Parameters:
        W (numpy.ndarray): weight matrix
        b (numpy.ndarray): bias vector
        """
        self.params = [W, b]  # store the weight and bias in a list
        self.grads = [np.zeros_like(W), np.zeros_like(b)]  # initialize the gradients for the weight and bias
        self.x = None  # variable that stores the input during the forward pass

    def forward(self, x):
        """
        Forward pass.
        
        Parameters:
        x (numpy.ndarray): input data, shape (batch size, sequence length, feature dim)
        
        Returns:
        out (numpy.ndarray): output data, shape (batch size, sequence length, output dim)
        """
        N, T, D = x.shape  # input shape (batch size, sequence length, feature dim)
        W, b = self.params  # weight and bias
        
        rx = x.reshape(N*T, -1)  # flatten the input to 2D: (N*T, D)
        out = np.dot(rx, W) + b  # apply the linear transformation
        self.x = x  # keep the input for use in the backward pass
        return out.reshape(N, T, -1)  # restore the output to its original 3D shape

    def backward(self, dout):
        """
        Backward pass.
        
        Parameters:
        dout (numpy.ndarray): gradient of the output, shape (batch size, sequence length, output dim)
        
        Returns:
        dx (numpy.ndarray): gradient of the input, shape (batch size, sequence length, feature dim)
        """
        x = self.x  # input saved during the forward pass
        N, T, D = x.shape  # input shape
        W, b = self.params  # weight and bias
        
        dout = dout.reshape(N*T, -1)  # flatten the output gradient to 2D: (N*T, output dim)
        rx = x.reshape(N*T, -1)  # flatten the input to 2D: (N*T, D)
        
        db = np.sum(dout, axis=0)  # gradient of the bias
        dW = np.dot(rx.T, dout)  # gradient of the weight
        dx = np.dot(dout, W.T)  # gradient of the input
        dx = dx.reshape(*x.shape)  # restore the input gradient to the original shape
        
        self.grads[0][...] = dW  # store the weight gradient
        self.grads[1][...] = db  # store the bias gradient
        
        return dx  # return the gradient of the input
  • Time Affine ํด๋ž˜์Šค์˜ ์ฝ”๋“œ๋ฅผ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • __init__ ๋ฉ”์„œ๋“œ:
    • Weight(๊ฐ€์ค‘์น˜) W์™€ Bias(ํŽธํ–ฅ) b๋ฅผ ์ธ์ž๋กœ ๋ฐ›์•„ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
    • Weight(๊ฐ€์ค‘์น˜)์™€ Bias(ํŽธํ–ฅ)์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ์ €์žฅํ•  ๋ณ€์ˆ˜๋ฅผ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
    • Forward Propagation(์ˆœ์ „ํŒŒ)์‹œ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•  ๋ณ€์ˆ˜๋ฅผ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
  • forward ๋ฉ”์„œ๋“œ:
    • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ x๋ฅผ ๋ฐ›์•„ Forward Propagation(์ˆœ์ „ํŒŒ)๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
    • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ํ˜•์ƒ (N, T, D)์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.
    • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ 2์ฐจ์› ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ (N*T, D)๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
    • Weight(๊ฐ€์ค‘์น˜) W์™€ Bias(ํŽธํ–ฅ) b๋ฅผ ์ด์šฉํ•ด ์„ ํ˜• ๋ณ€ํ™˜์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
    • ๋ณ€ํ™˜๋œ ์ถœ๋ ฅ์„ ์›๋ž˜์˜ 3์ฐจ์› ํ˜•์ƒ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
  • backward ๋ฉ”์„œ๋“œ:
    • ์ถœ๋ ฅ์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ) dout์„ ๋ฐ›์•„ Backpropagation(์—ญ์ „ํŒŒ)๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
    • ์ €์žฅ๋œ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ํ˜•์ƒ์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.
    • ์ถœ๋ ฅ์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ 2์ฐจ์› ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ 2์ฐจ์› ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    • Bias(ํŽธํ–ฅ)์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    • Weight(๊ฐ€์ค‘์น˜)์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    • ์ž…๋ ฅ์— ๋Œ€ํ•œ Gradient(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์›๋ž˜์˜ 3์ฐจ์› ํ˜•์ƒ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

 

Time Softmax with Loss Layer

Softmax ๊ณ„์ธต์„ ๊ตฌํ˜„ํ•  ๋•Œ ์†์‹ค ์˜ค์ฐจ๋ฅผ ๊ตฌํ•˜๋Š” *Cross Entropy Error ๊ณ„์ธต๋„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

Time Softmax with Loss ๊ณ„์ธต์˜ ์ „์ฒด ๊ทธ๋ฆผ

*Cross Entropy Loss Function: measures the difference between the model's predicted probability distribution and the true labels.
Given a predicted probability distribution $p$ and true labels $q$, the cross-entropy is computed as
$$\mathrm{CrossEntropy}(p, q) = -\sum_i q_i \log(p_i)$$
where $q_i$ is the $i$-th element of the true label (typically a one-hot encoded vector) and $p_i$ is the $i$-th element of the predicted probability distribution.
  • ์œ„์˜ ๊ทธ๋ฆผ์—์„œ X0, X1๋“ฑ์˜ ๋ฐ์ดํ„ฐ๋Š” ์•„๋ž˜์ธต์—์„œ ์ „ํ•ด์ง€๋Š” 'Score(์ ์ˆ˜)'๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
    • 'Score(์ ์ˆ˜)'๋Š” ํ™•๋ฅ ๋กœ ์ •๊ทœํ™”๋˜๊ธฐ ์ „์˜ ๊ฐ’์ž…๋‹ˆ๋‹ค.
  • ๋˜ํ•œ t0, t1๋“ฑ์˜ ๋ฐ์ดํ„ฐ๋Š” ์ •๋‹ต ๋ ˆ์ด๋ธ”์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆผ์—์„œ ๋ณด๋“ฏ์ด, T๊ฐœ์˜ Softmax with Loss ๊ณ„์ธต์ด ๊ฐ๊ฐ์˜ Loss๋ฅผ ์‚ฐ์ถœํ›„, ํ•ฉ์‚ฐํ•ด ํ‰๊ท ๋‚ธ ๊ฐ’์ด ์ตœ์ข… ์†์‹ค์ด ๋ฉ๋‹ˆ๋‹ค.
  • The formula computed here is:

$$L = \frac{1}{T}\,(L_0 + L_1 + \cdots + L_{T-1})$$
  • Recall that the Softmax with Loss layer computes the average loss over a mini-batch.
  • For a mini-batch of N data points, it sums the N losses and divides by N, giving the average loss per data point.
  • In the same way, the Time Softmax with Loss layer averages over the time dimension and outputs the average loss per data point as its final result; a minimal sketch of such a layer follows.
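Below is a minimal sketch of a Time Softmax with Loss layer, assuming the labels ts are integer word IDs rather than one-hot vectors (the version used in the book also supports an ignore-label mask, which is omitted here):

import numpy as np

class TimeSoftmaxWithLoss:
    def __init__(self):
        self.cache = None

    def forward(self, xs, ts):
        # xs: scores of shape (N, T, V); ts: integer labels of shape (N, T)
        N, T, V = xs.shape
        xs = xs.reshape(N * T, V)
        ts = ts.reshape(N * T)
        xs = xs - xs.max(axis=1, keepdims=True)  # for numerical stability
        ys = np.exp(xs) / np.exp(xs).sum(axis=1, keepdims=True)  # Softmax
        loss = -np.log(ys[np.arange(N * T), ts] + 1e-7).sum() / (N * T)
        self.cache = (ys, ts, (N, T, V))
        return loss  # average loss per data point, over batch and time

    def backward(self, dout=1):
        ys, ts, (N, T, V) = self.cache
        dx = ys.copy()
        dx[np.arange(N * T), ts] -= 1  # softmax-with-loss gradient: y - t
        dx *= dout / (N * T)
        return dx.reshape(N, T, V)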

Training and Evaluating the RNNLM

ํ•œ๋ฒˆ RNNLM์„ ํ•œ๋ฒˆ ๊ตฌํ˜„์„ ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ณ„์ธต ๊ตฌ์„ฑ์€ ์•„๋ž˜์˜ ์‚ฌ์ง„๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • As the figure shows, the RNNLM class is a Neural Network that stacks four Time layers. Let's look at the code.
import sys
sys.path.append('..')
import numpy as np
from common.time_layers import *


class SimpleRnnlm:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        # Initialize the weights
        embed_W = (rn(V, D) / 100).astype('f')
        rnn_Wx = (rn(D, H) / np.sqrt(D)).astype('f')
        rnn_Wh = (rn(H, H) / np.sqrt(H)).astype('f')
        rnn_b = np.zeros(H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        # ๊ณ„์ธต ์ƒ์„ฑ
        self.layers = [
            TimeEmbedding(embed_W),
            TimeRNN(rnn_Wx, rnn_Wh, rnn_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.rnn_layer = self.layers[1]

        # Collect all the weights and gradients into lists.
        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, xs, ts):
        for layer in self.layers:
            xs = layer.forward(xs)
        loss = self.loss_layer.forward(xs, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.rnn_layer.reset_state()
  • This code initializes the parameters (weights and biases) used by each layer and creates the required layers.
  • It also assumes training with *Truncated BPTT (Backpropagation Through Time), so the Time RNN layer's stateful flag is set to True.
  • As a result, the Time RNN layer carries the hidden state over from the previous time step.
  • Another notable point is that the RNN and Affine layers use the Xavier initial value.
*Xavier initial value: if the previous layer has n nodes, initialize the weights with a distribution whose standard deviation is 1/√n.


RNN์—์„œ Weight์˜ ์ดˆ๊นƒ๊ฐ’์€ ์–ด๋–ป๊ฒŒ ์„ค์ •ํ•˜๋Š๋ƒ์— ๋”ฐ๋ผ ํ•™์Šต์ด ์ง„ํ–‰๋˜๋Š” ๋ฐฉ๋ฒ• & ์ตœ์ข… ์ •ํ™•๋„๊ฐ€ ํฌ๊ฒŒ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.
  • ๊ณ„์†ํ•ด์„œ forward(), backward(), reset_state() Method์˜ ๊ตฌํ˜„์„ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
def forward(self, xs, ts):
    """
    Forward pass.
    
    Parameters:
    xs (numpy.ndarray): input data
    ts (numpy.ndarray): ground-truth labels
    
    Returns:
    loss (float): the computed loss value
    """
    for layer in self.layers:
        xs = layer.forward(xs)  # forward propagation through each layer
    loss = self.loss_layer.forward(xs, ts)  # compute the loss in the loss layer
    return loss  # return the computed loss value

def backward(self, dout=1):
    """
    Backward pass.
    
    Parameters:
    dout (float): gradient propagated from the upper layer (default: 1)
    
    Returns:
    dout (numpy.ndarray): gradient with respect to the input data
    """
    dout = self.loss_layer.backward(dout)  # backpropagate through the loss layer
    for layer in reversed(self.layers):
        dout = layer.backward(dout)  # backpropagate through each layer
    return dout  # return the gradient with respect to the input data

def reset_state(self):
    """
    Reset the state.
    
    Resets the hidden state of the RNN layer.
    """
    self.rnn_layer.reset_state()  # reset the RNN layer's state

Language Model(์–ธ์–ด๋ชจ๋ธ)์˜ ํ‰๊ฐ€ - Perplexity

Language Model(์–ธ์–ด ๋ชจ๋ธ)์€ ์ฃผ์–ด์ง„ ๊ณผ๊ฑฐ๋‹จ์–ด(์ •๋ณด)๋กœ ๋ถ€ํ„ฐ ๋‹ค์Œ์— ์ถœํ˜„ํ•œ ๋‹จ์–ด์˜ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ, Language Model(์–ธ์–ด ๋ชจ๋ธ)์˜ ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๋Š” ์ฒ™๋„๋กœ Perplexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ-ํ˜ผ๋ž€๋„)๋ฅผ ์ž์ฃผ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.
  • Perplexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ)๋Š” ๊ฐ„๋‹จํžˆ ๋งํ•˜๋ฉด 'ํ™•๋ฅ ์˜ ์—ญ์ˆ˜'์ž…๋‹ˆ๋‹ค. ์ด ๋‚ด์šฉ์˜ ํ•ด์„์€ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•˜๋‚˜์ผ๋•Œ ์ •ํ™•ํžˆ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด์„œ "you say goodbye and i say hello"๋ผ๋Š” Corpus(๋ง๋ญ‰์น˜)๋กœ ์˜ˆ๋ฅผ ๋“ค๋ฉด "you"๋ผ๋Š” ๋‹จ์–ด ๋‹ค์Œ์— ์ถœ๋ ฅํ•  ๋‹จ์–ด๊ฐ€ "say'๋ผ๊ณ  ํ•˜๋ฉด, ํ™•๋ฅ ์€ 0.8์ž…๋‹ˆ๋‹ค.
  • ์ด๋•Œ Perplexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ-ํ˜ผ๋ž€๋„)๋Š” ํ™•๋ฅ ์˜ ์—ญ์ˆ˜, ์ฆ‰ 1/0.8 = 1.25 ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ชจ๋ธ 2์—์„œ "์ •๋‹ต์ธ "say"์˜ ํ™•๋ฅ ์ด 0.2๋ผ๊ณ  ํ•˜๋ฉด, 1/0.2 = 5 ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  Perplexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ-ํ˜ผ๋ž€๋„)๋Š” ์ž‘์„์ˆ˜๋ก ์ข‹๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • So how can values like 1.25 or 5.0 be interpreted intuitively?
  • They can be read as a 'number of branches'.
  • The number of branches is the number of options available at the next step (concretely, the number of candidate words that could appear next).
  • In the example above, the good model's branching number of 1.25 means it has narrowed the next-word candidates down to roughly one, whereas the bad model still has about five candidates left.
As in this example, perplexity lets us evaluate a model's predictive performance. A good model predicts the correct word with high probability,
so its perplexity is small (the minimum is 1.0). A bad model predicts the correct word with low probability, so its perplexity is large.
  • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•˜๋‚˜์ผ ๋•Œ์˜ Perplexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ)๋ฅผ ์ด์•ผ๊ธฐํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ ‡๋‹ค๋ฉด ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์ผ ๋•Œ๋Š” ์–ด๋–ป๊ฒŒ ๋ ๊นŒ์š”? ์ด๋Ÿด ๋•Œ๋Š” ์•„๋ž˜์˜ ๊ณต์‹์— ๋”ฐ๋ผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
L = −1/N โ€‹∑nโ€‹∑k โ€‹tnk โ€‹log y nkโ€‹, perplexity = ๐‘’๐ฟ
  • ์€ ๋ฐ์ดํ„ฐ์˜ ์ด๊ฐœ์ˆ˜์ž…๋‹ˆ๋‹ค. ๐‘ก๐‘›์€ One-Hot Vector ๋กœ ๋‚˜ํƒ€๋‚ธ ์ •๋‹ต ๋ ˆ์ด๋ธ”์ด๋ฉฐ, ๐‘ก๐‘›๐‘˜๋Š” n๊ฐœ์งธ ๋ฐ์ดํ„ฐ์˜ k๋ฒˆ์งธ ๊ฐ’์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ๐‘ฆ๐‘›๐‘˜๋Š” ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.(Neural Network-์‹ ๊ฒฝ๋ง ์—์„œ๋Š” Softmax์˜ ์ถœ๋ ฅ). L์€ Neural Network(์‹ ๊ฒฝ๋ง)์˜ Loss์„ ๋œปํ•˜๋ฉฐ, Cross-Entropy-Error(๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ ์˜ค์ฐจ)์™€ ์™„์ „ํžˆ ๊ฐ™์€ ์‹์ž…๋‹ˆ๋‹ค. ์ด L์„ ์‚ฌ์šฉํ•ด ๐‘’−๐ฟ ๋ฅผ ๊ณ„์‚ฐํ•œ ๊ฐ’์ด ๊ณง Perplexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ์ž…๋‹ˆ๋‹ค.
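A toy check of the formula with made-up numbers (N = 1, a 3-word vocabulary, and the correct word given probability 0.8, matching the example above):

import numpy as np

t = np.array([0, 1, 0])        # one-hot ground-truth label t_nk
y = np.array([0.1, 0.8, 0.1])  # predicted distribution y_nk (Softmax output)
L = -np.sum(t * np.log(y))     # cross-entropy loss with N = 1
print(np.exp(L))               # 1.25 == 1/0.8, the perplexity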

RNNLM์˜ Code (by Python)

PTB ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•ด์„œ RNNLM ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ๋‹จ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•ด์„œ ํ•™์Šต์„ ํ•˜๊ฒŒ ๋˜๋ฉด, ์ข‹์€ ๊ฒฐ๊ณผ๊ฐ€ ์•ˆ๋‚˜์˜ฌ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, 1000๊ฐœ์˜ ๋‹จ์–ด๋งŒ ์ด์šฉํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
# coding: utf-8
import sys
sys.path.append('..')
import matplotlib.pyplot as plt
import numpy as np
from common.optimizer import SGD
from dataset import ptb
from simple_rnnlm import SimpleRnnlm


# ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •
batch_size = 10
wordvec_size = 100
hidden_size = 100 # RNN์˜ ์€๋‹‰ ์ƒํƒœ ๋ฒกํ„ฐ์˜ ์›์†Œ ์ˆ˜
time_size = 5     # Truncated BPTT๊ฐ€ ํ•œ ๋ฒˆ์— ํŽผ์น˜๋Š” ์‹œ๊ฐ„ ํฌ๊ธฐ
lr = 0.1
max_epoch = 100

# ํ•™์Šต ๋ฐ์ดํ„ฐ ์ฝ๊ธฐ(์ „์ฒด ์ค‘ 1000๊ฐœ๋งŒ)
corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)

xs = corpus[:-1]  # input
ts = corpus[1:]   # output (ground-truth labels)
data_size = len(xs)
print('Corpus size: %d, vocabulary size: %d' % (corpus_size, vocab_size))

# Variables used during training
max_iters = data_size // (batch_size * time_size)
time_idx = 0
total_loss = 0
loss_count = 0
ppl_list = []

# ๋ชจ๋ธ ์ƒ์„ฑ
model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)

# 1. ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ๊ฐ ์ƒ˜ํ”Œ์˜ ์ฝ๊ธฐ ์‹œ์ž‘ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐ
jump = (corpus_size - 1) // batch_size
offsets = [i * jump for i in range(batch_size)]

for epoch in range(max_epoch):
    for iter in range(max_iters):
        # 2. Fetch the mini-batch
        batch_x = np.empty((batch_size, time_size), dtype='i')
        batch_t = np.empty((batch_size, time_size), dtype='i')
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                batch_x[i, t] = xs[(offset + time_idx) % data_size]
                batch_t[i, t] = ts[(offset + time_idx) % data_size]
            time_idx += 1

        # Compute the gradients and update the parameters
        loss = model.forward(batch_x, batch_t)
        model.backward()
        optimizer.update(model.params, model.grads)
        total_loss += loss
        loss_count += 1

    # 3. Evaluate perplexity at each epoch
    ppl = np.exp(total_loss / loss_count)
    print('| epoch %d | perplexity %.2f'
          % (epoch+1, ppl))
    ppl_list.append(float(ppl))
    total_loss, loss_count = 0, 0
  • ์ด ์ฝ”๋“œ๋Š” ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋ณธ Neural Network(์‹ ๊ฒฝ๋ง) ํ•™์Šต๊ณผ ๊ฑฐ์ด ๋น„์Šทํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค๋งŒ ํฐ ๊ด€์ ์—์„œ '๋ฐ์ดํ„ฐ ์ œ๊ณต ๋ฐฉ๋ฒ•', 'Perplexity ๊ณ„์‚ฐ' ๋ถ€๋ถ„์„ ๋ณด๋ฉด์„œ ์ฝ”๋“œ๋ฅผ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

๋ฐ์ดํ„ฐ ์ œ๊ณต ๋ฐฉ๋ฒ•

๋ฐ์ดํ„ฐ ์ œ๊ณต ๋ฐฉ๋ฒ•์—์„œ ์—ฌ๊ธฐ์„  Truncated BPTT ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
Truncated BPTT ๋ฐฉ์‹์— ๋ฐํ•œ ๊ฐœ๋…์€ ์•„๋ž˜์— ๋งํฌ ๋‹ฌ์•„๋†“์„๊ป˜์š”!
 

[NLP] BPTT (Backpropagation Through Time) - daehyun-bigbread.tistory.com

  • With Truncated BPTT, the data must be fed sequentially, and the position at which each sample of a mini-batch starts reading the data must be shifted.
  • Part 1 of the full training code stores each mini-batch sample's read start position in offsets. For example, with corpus_size = 1000 and batch_size = 10, jump is 99 and the offsets are 0, 99, 198, ..., 891.
# 1. ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ๊ฐ ์ƒ˜ํ”Œ์˜ ์ฝ๊ธฐ ์‹œ์ž‘ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐ
jump = (corpus_size - 1) // batch_size
offsets = [i * jump for i in range(batch_size)]

 

  • Part 2 of the full training code reads the data sequentially.
  • It prepares batch_x and batch_t as containers, then increments time_idx one step at a time, fetching the data at position time_idx in the corpus.
  • Here the offsets computed in part 1 are added, so each mini-batch sample reads from its own shifted position.
  • And if the read position runs past the end of the corpus, it must wrap around to the beginning; for this, the remainder after dividing by the data size is used as the index.
        # 2. Fetch the mini-batch
        batch_x = np.empty((batch_size, time_size), dtype='i')
        batch_t = np.empty((batch_size, time_size), dtype='i')
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                batch_x[i, t] = xs[(offset + time_idx) % data_size]
                batch_t[i, t] = ts[(offset + time_idx) % data_size]
            time_idx += 1

 

  • Part 3 of the code computes the *perplexity.
  • To get the perplexity for each epoch, it averages the loss over the epoch and computes the perplexity from that average.
    # 3. Evaluate perplexity at each epoch
    ppl = np.exp(total_loss / loss_count)
    print('| epoch %d | perplexity %.2f'
          % (epoch+1, ppl))
    ppl_list.append(float(ppl))
    total_loss, loss_count = 0, 0

Perplexity (ํผํ”Œ๋ ‰์‹œํ‹ฐ) ์ถ”์ด


RNNLM์˜ Trainer Class

์ด๋ฒˆ์—๋Š” RNNLM์„ ์ˆ˜ํ–‰ํ•ด์ฃผ๋Š” Trainer ํด๋ž˜์Šค๋ฅผ ํ•œ๋ฒˆ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ์ด ๋ถ€๋ถ„์€ RNNLM์„ ์ˆ˜ํ–‰ํ•˜๋Š” ํ•™์Šต๋ถ€๋ถ„์€ ํด๋ž˜์Šค ์•ˆ์œผ๋กœ ์ˆจ๊ฒจ์ฃผ๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ์ฝ”๋“œ๋ฅผ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
import sys
sys.path.append('..')
from common.optimizer import SGD
from common.trainer import RnnlmTrainer
from dataset import ptb
from simple_rnnlm import SimpleRnnlm


# ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •
batch_size = 10
wordvec_size = 100
hidden_size = 100  # RNN์˜ ์€๋‹‰ ์ƒํƒœ ๋ฒกํ„ฐ์˜ ์›์†Œ ์ˆ˜
time_size = 5  # RNN์„ ํŽผ์น˜๋Š” ํฌ๊ธฐ
lr = 0.1
max_epoch = 100

# ํ•™์Šต ๋ฐ์ดํ„ฐ ์ฝ๊ธฐ
corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000  # ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์„ ์ž‘๊ฒŒ ์„ค์ •
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)
xs = corpus[:-1]  # input
ts = corpus[1:]  # output (ground-truth labels)

# ๋ชจ๋ธ ์ƒ์„ฑ
model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

trainer.fit(xs, ts, max_epoch, batch_size, time_size)
trainer.plot()

 

  • ์ด์™€ ๊ฐ™์ด, ๋จผ์ € RnnlmTrainer ํด๋ž˜์Šค์— model๊ณผ optimizer๋ฅผ ์ฃผ์–ด ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ fit() ๋ฉ”์„œ๋“œ๋ฅผ ํ˜ธ์ถœํ•ด ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋•Œ ๊ทธ ๋‚ด๋ถ€์—์„œ๋Š” ์•ž ์ ˆ์—์„œ ์ˆ˜ํ–‰ํ•œ ์ผ๋ จ์˜ ์ž‘์—…์ด ์ง„ํ–‰๋˜๋Š”๋ฐ, ๊ทธ ๋‚ด์šฉ์„ ์ƒ์„ธํžˆ ์ ์–ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
    • Mini-Batch๋ฅผ '์ˆœ์ฐจ์ '์œผ๋กœ ๋งŒ๋“ค์–ด
    • ๋ชจ๋ธ์˜ Forward Propagation(์ˆœ์ „ํŒŒ)์™€ Backpropagation(์—ญ์ „ํŒŒ)๋ฅผ ํ˜ธ์ถœํ•˜๊ณ 
    • Optimizer(์˜ตํ‹ฐ๋งˆ์ด์ €)๋กœ Weight(๊ฐ€์ค‘์น˜)๋ฅผ ๊ฐฑ์‹ ํ•˜๊ณ 
    • Perplexity(ํผํ”Œ๋ ‰์„œํ‹ฐ)๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.
NOTE. RnnlmTrainer ํด๋ž˜์Šค๋Š” ์•ž์—์„œ ์„ค๋ช…ํ•œ Trainer ํด๋ž˜์Šค์™€ ๋˜‘๊ฐ™์€ API๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
์‹ ๊ฒฝ๋ง์˜ ์ผ๋ฐ˜์ ์ธ ํ•™์Šต์€ Trainer ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , RNNLM ํ•™์Šต์—๋Š” RnnlmTrainer ํด๋ž˜์Šค๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

Summary

Language Model(์–ธ์–ด ๋ชจ๋ธ)์€ ๋‹จ์–ด Sequence๋ฅผ ํ™•๋ฅ ๋กœ ํ•ด์„ํ•œ๋‹ค.
RNN ๊ณ„์ธต์„ ์ด์šฉํ•œ ์กฐ๊ฑด๋ถ€ Language Model(์–ธ์–ด ๋ชจ๋ธ)์€ (์ด๋ก ์ ์œผ๋กœ๋Š”) ๊ทธ๋•Œ๊นŒ์ง€ ๋“ฑ์žฅํ•œ ๋ชจ๋“  ๋‹จ์–ด์˜ ์ •๋ณด๋ฅผ ๊ธฐ์–ตํ•  ์ˆ˜ ์žˆ๋‹ค.