[NLP] Transformer Model - Understanding the Transformer Model
In this post, we will look at the overall architecture and components of the Transformer model.

Transformer: Attention is All You Need

Transformer ๋ชจ๋ธ์˜ ๊ตฌ์กฐ (Left: ํ•œ๊ธ€ ๋ฒ„์ „, Right: ์˜์–ด ๋ฒ„์ „)

  • Transformer ๋ชจ๋ธ์€ 2017๋…„์— "Attention is All You Need"๋ผ๋Š” ๋…ผ๋ฌธ์„ ํ†ตํ•ด์„œ ์†Œ๊ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • ์ฃผ์š”ํ•œ ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” "Self-Attention" ์ด๋ผ๋Š” ๋งค์ปค๋‹ˆ์ฆ˜์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ, ๋ฌธ์žฅ ๋‚ด์˜ ๋ชจ๋“  ๋‹จ์–ด๋“ค ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ํ•œ ๋ฒˆ์— ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์— ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด์ „์˜ ์„ค๋ช…ํ–ˆ๋˜ RNN(Recurrent Neural Network), LSTM(Long Short-Term Memory)๊ณผ ๊ฐ™์€ ์ˆœ์ฐจ์ ์ธ Model์ด ๊ฐ€์ง„ ์ˆœ์ฐจ์  ์ฒ˜๋ฆฌ์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ–ˆ๋‹ค๋Š” ํŠน์ง•์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ํ˜„์žฌ Transformer ๋ชจ๋ธ์€ Sequence - Sequence๊ฐ„์˜ ์ž‘์—… ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, Language Modeling, Pre-Training ์„ค์ •์—์„œ๋„ ์‚ฌ์‹ค์ƒ์˜ ํ‘œ์ค€ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  Transformer Model์€ ์ƒˆ๋กœ์šด ๋ชจ๋ธ๋ง ํŒจ๋Ÿฌ๋‹ค์ž„์„ ๋„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค. 
Attention ๋ฐ Self-Attention ๊ฐœ๋…์— ๊ด€ํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜์˜ ๊ธ€์— ์ž์„ธํžˆ ์ž‘์„ฑํ•ด ๋‘์—ˆ์œผ๋‹ˆ ํ•œ๋ฒˆ ์ฝ์–ด๋ณด์„ธ์š”!
 

[NLP] Attention - ์–ดํ…์…˜

daehyun-bigbread.tistory.com

Differences Between the Transformer and RNN Models

Let's talk about the differences between the Transformer model and the RNN model.
"I arrived at the bank... after crossing the road? after crossing the river?"
  • In this sentence, what does "bank" mean?
  • When encoding the sentence, an RNN cannot tell what "bank" refers to until it has read the whole sentence, and for long sequences this can take a long time.
  • In contrast, the Transformer's encoder tokens all interact with one another simultaneously.
  • Intuitively, you can think of the Transformer encoder as a series of reasoning steps (layers - the units responsible for transforming the input data and processing information).
  • At each step (each layer), tokens look at one another (Attention - Self-Attention) and exchange information, trying to understand each other better in the context of the whole sentence.
  • In each decoder layer, the prefix tokens likewise interact with one another through the Self-Attention mechanism while attending to the encoder states.
  • The concepts of Attention and Self-Attention are covered in the link above, so give it a read!

Transformer: Encoder & Decoder

Transformer ๋ชจ๋ธ์€ ํฌ๊ฒŒ 2๊ฐœ์˜ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. Encoder์™€ Decoder๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
์•„๋ž˜์— Encoder & Decoder์— ๊ด€ํ•œ ์ ์€๊ธ€์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”! 
  • ์•„๋ž˜์˜ ๊ธ€์€ seq2seq์— ๋ฐํ•œ Encoder & Decoder์— ๊ด€ํ•˜์—ฌ ์“ด ๊ธ€์ด์ง€๋งŒ, ๊ธฐ๋ณธ์ ์ธ ๊ฐœ๋…๊ณผ ๊ฐœ๋…์€ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค๋งŒ ๊ตฌ์ฒด์ ์ธ ๊ตฌํ˜„๊ณผ ์ž‘๋™ ๋ฐฉ์‹์—์„œ์˜ ์ฐจ์ด๋Š” ์žˆ์Šต๋‹ˆ๋‹ค. Transformer Model์—์„œ Encoder & Decoder ๋ถ€๋ถ„์— ์ค‘์ ์„ ๋‘๊ณ  ์„ค๋ช…ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
 

[NLP] Seq2Seq, Encoder & Decoder

daehyun-bigbread.tistory.com

Transformer ๋ชจ๋ธ์˜ Encoder

Transformer ๋ชจ๋ธ์˜ Encoder ๋ถ€๋ถ„

  • Encoder: Input Sequence(์ž…๋ ฅ ์‹œํ€€์Šค)๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ตฌ์„ฑ์š”์†Œ๋กœ, ์—ฌ๋Ÿฌ๊ฐœ์˜ Encoder Layer๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์Œ“์•„์„œ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฐ Layer๋Š” Multi-Head Attention ๋งค์ปค๋‹ˆ์ฆ˜, Position-wise Feed-Forward Network๋ฅผ ํฌํ•จํ•˜๋ฉฐ, input ๋ฐ์ดํ„ฐ์˜ ์ „์ฒด์ ์ธ Context๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํฌ์ฐฉํ•˜์—ฌ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
  • Transformer Model์˜ Encoder๋Š” ์ˆœ์ฐจ์  ์ฒ˜๋ฆฌ๊ฐ€ ์•„๋‹Œ, ์ „์ฒด Input Sequence๋ฅผ ํ•œ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋Š” ์ ์—์„œ RNN ๊ธฐ๋ฐ˜์˜ Encoder์™€ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

Transformer ๋ชจ๋ธ์˜ Encoder Example Code

์ด ์ฝ”๋“œ๋Š” ์˜ˆ์ œ ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž์— ๋งž์ถฐ์„œ ์ˆ˜์ •์„ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • # seq_len: ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์•„๋„ ๊ทธ Sequence length๊นŒ์ง€๋งŒ ๋ฐ›๊ฒ ๋‹ค.
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(Encoder, self).__init__()

        self.d_model = kargs['d_model']
        self.num_layers = kargs['num_layers']

        self.embedding = tf.keras.layers.Embedding(kargs['input_vocab_size'], self.d_model)
        self.pos_encoding = positional_encoding(kargs['maximum_position_encoding'],
                                                self.d_model)


        self.enc_layers = [EncoderLayer(**kargs)
                           for _ in range(self.num_layers)]

        self.dropout = tf.keras.layers.Dropout(kargs['rate'])

    def call(self, x, mask):
        attn = None
        seq_len = tf.shape(x)[1]

        # adding embedding and position encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :] # seq_len

        x = self.dropout(x)

        for i in range(self.num_layers):
            x, attn = self.enc_layers[i](x, mask)

        return x, attn  # (batch_size, input_seq_len, d_model)

Transformer ๋ชจ๋ธ์˜ Decoder

Transformer ๋ชจ๋ธ์˜ Decoder ๋ถ€๋ถ„

  • Decoder: Target Sequence๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ตฌ์„ฑ ์š”์†Œ๋กœ, ์—ฌ๋Ÿฌ๊ฐœ์˜ Decoder Layer๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์Œ“์•„์„œ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • Decoder Layer๋Š” Encoder์™€ ์œ ์‚ฌํ•˜๊ฒŒ Multi-Head Attention๊ณผ Feed-Forward Networks๋ฅผ ์‚ฌ์šฉํ•˜์ง€๋งŒ, ์ถ”๊ฐ€์ ์œผ๋กœ Encoder์˜ ์ถœ๋ ฅ & ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š” Encoder-Decoder Attention ๋งค์ปค๋‹ˆ์ฆ˜์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ฅผ ํ†ตํ•ด์„œ Decoder๋Š” Encoder๊ฐ€ ์ฒ˜๋ฆฌํ•œ ์ •๋ณด์™€ ์ž์‹ ์ด ์ง€๊ธˆ๊นŒ์ง€ ์ƒ์„ฑํ•œ ์ถœ๋ ฅ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‹ค์Œ ํ† ํฐ์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

Transformer ๋ชจ๋ธ์˜ Decoder Example Code

์ด ์ฝ”๋“œ๋Š” ์˜ˆ์ œ ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค. ์‚ฌ์šฉ์ž์— ๋งž์ถฐ์„œ ์ˆ˜์ •์„ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
class Decoder(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(Decoder, self).__init__()

        self.d_model = kargs['d_model']
        self.num_layers = kargs['num_layers']

        self.embedding = tf.keras.layers.Embedding(kargs['target_vocab_size'], self.d_model)
        self.pos_encoding = positional_encoding(kargs['maximum_position_encoding'], self.d_model)

        self.dec_layers = [DecoderLayer(**kargs)
                           for _ in range(self.num_layers)]
        self.dropout = tf.keras.layers.Dropout(kargs['rate'])

    def call(self, x, enc_output, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

Transformer: Layer

Let's go over what a layer is.
  • A "layer" in the Transformer model is one of the model's core components, the unit responsible for transforming the input data and processing information.
    • The layer that forms the model's input is called the Input Layer, and the one that forms the output is the Output Layer.
  • A Transformer model is typically built by stacking several layers sequentially; each layer receives input data, processes it, and passes the result to the next layer.
  • This structure lets the model learn complex properties and patterns in the data.
  • A Transformer layer has two main components: Multi-Head Attention and the Position-wise Feed-Forward Network.
  • Each layer also uses techniques such as normalization and residual connections to improve training stability and efficiency (see the sketch after this list).
    • Normalization normalizes the layer's input and output to stabilize the training process.
    • A residual connection adds the input directly to the layer's output, which mitigates training problems that can occur in deep networks.

Transformer: Input Layer

Input Layer

  • The encoder input is built by adding positional information to the input embeddings of the source sequence.
  • The encoder input is then the token index sequence of the source-language sentence. Let's look at an example.

Self-Attention์˜ Input Layer์˜ ๋™์ž‘๋ฐฉ์‹

Source ์–ธ์–ด์˜ Token Sequence

  • ์˜ˆ๋ฅผ ๋“ค์–ด ์†Œ์Šค ์–ธ์–ด์˜ Token Sequence๊ฐ€ "์–ด์ œ, ์นดํŽ˜, ๊ฐ”์—ˆ์–ด"๋ผ๋ฉด Encoder Input Layer(์ธ์ฝ”๋” ์ž…๋ ฅ์ธต)์˜ ์ง์ ‘์ ์ธ ์ž…๋ ฅ๊ฐ’์€ ์ด๋“ค Token๋“ค์— ๋Œ€์‘ํ•˜๋Š” ์ธ๋ฑ์Šค ์‹œํ€€์Šค๊ฐ€ ๋˜๋ฉฐ Encoder Input(์ธ์ฝ”๋” ์ž…๋ ฅ)์€ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  Input Embedding์— ๋”ํ•˜๋Š” ์œ„์น˜ ์ •๋ณด๋Š” ํ•ด๋‹น Token์ด ๋ฌธ์žฅ ๋‚ด์—์„œ ๋ช‡๋ฒˆ์งธ์˜ ์œ„์น˜์ธ์ง€ ์ •๋ณด๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
"์–ด์ œ" ๊ฐ€ ์ฒซ๋ฒˆ์งธ, "์นดํŽ˜" ๊ฐ€ ๋‘๋ฒˆ์งธ, "๊ฐ”์—ˆ์–ด" ๊ฐ€ ์„ธ๋ฒˆ์งธ๋ฉด → Transformer Model์€ ์ด๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ์†Œ์Šค ์–ธ์–ด์˜ ํ† ํฐ ์‹œํ€€์Šค๋ฅผ ์ด์— ๋Œ€์‘ํ•˜๋Š” Vector Sequence(๋ฒกํ„ฐ ์‹œํ€€์Šค)๋กœ ๋ณ€ํ™˜ํ•ด Incoder input์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด Encoder Input Layer์—์„œ ๋งŒ๋“ค์–ด์ง„ Vector Sequence๊ฐ€ ์ตœ์ดˆ์˜ Encoder Block์˜ Input์ด ๋ฉ๋‹ˆ๋‹ค.
  • ๊ทธ ๋‹ค์Œ์—๋Š” Output Vector Sequence๊ฐ€ ๋‘๋ฒˆ์งธ Encoder Block์˜ Input์ด ๋ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด ๋‹ค์Œ Encoder ๋ธ”๋Ÿญ์˜ Input(์ž…๋ ฅ)์€ ์ด์ „ Block์˜ Output(์ถœ๋ ฅ)์ž…๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์„ N๋ฒˆ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.

 


Transformer: Encoder Layer Example Code

This is example code; adjust it to fit your own setup.
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(**kargs)
        self.ffn = point_wise_feed_forward_network(**kargs)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(kargs['rate'])
        self.dropout2 = tf.keras.layers.Dropout(kargs['rate'])

    def call(self, x, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2, attn_output

Transformer: Output Layer

The Output Layer is the output layer of the Transformer model.

Output Layer

  • The output of the Output Layer is a probability vector whose dimension equals the vocabulary size of the target language.
  • Multi-Head Attention was covered in the Attention post, so let's instead explain the Position-wise Feed-Forward Network, which we haven't seen before.

Transformer: Decoder Layer Example Code

This is example code; adjust it to fit your own setup.
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(**kargs)
        self.mha2 = MultiHeadAttention(**kargs)

        self.ffn = point_wise_feed_forward_network(**kargs)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(kargs['rate'])
        self.dropout2 = tf.keras.layers.Dropout(kargs['rate'])
        self.dropout3 = tf.keras.layers.Dropout(kargs['rate'])


    def call(self, x, enc_output, look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)
        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1)
        out1 = self.layernorm1(attn1 + x)

        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2

Feed-Forward Networks

I was going to explain the Position-wise Feed-Forward Network right away, but it is an FFN (Feed-Forward Network) used only in a specific context of the Transformer model. So let's first go over Feed-Forward Networks in general and then move on to the Position-wise Feed-Forward Network.

Feed-Forward Networks (FFN)

  • The Feed-Forward Network is one of the most basic artificial neural network architectures, in which data flows in one direction from the input layer to the output layer.
  • The data is transformed by weights as it passes through each layer and is handed to the next layer through an activation function.
  • Because such a network has no recurrent connections or complex feedback loops, computation is relatively simple and it can be applied to a wide range of problems.
  • To sum up, data flows through the network in only one direction. The input data starts at the input layer, passes through the hidden layers to the output layer, and is processed by an activation function at each layer. There are no loops or feedback connections; each layer uses only the previous layer's output as the input to the next layer.

Feed-Forward Networks (FFN)์˜ ๊ธฐ๋ณธ์ ์ธ ํ˜•ํƒœ

  • FNN(Feed-Forward Networks)์€ ์ธ๊ณต์‹ ๊ฒฝ๋ง์˜ ๊ธฐ๋ณธ์ ์ธ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค.
  • ๋‹ค์ˆ˜์˜ Input(์ž…๋ ฅ) Node, Weight(๊ฐ€์ค‘์น˜), Activation Function(ํ™œ์„ฑํ™” ํ•จ์ˆ˜)๋ฅผ ํ†ตํ•ด ์ถœ๋ ฅ ๋…ธ๋“œ๋กœ ์ •๋ณด๋ฅผ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋•Œ Weight(๊ฐ€์ค‘์น˜)๋Š” ํ•™์Šต ๊ณผ์ •์—์„œ ์—…๋ฐ์ดํŠธ ๋˜๋ฉฐ, ์ดˆ๊ธฐ Weight(๊ฐ€์ค‘์น˜)๋Š” ๋ณดํ†ต ๋ฌด์ž‘์œ„๋กœ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.
  • FNN(Feed-Forward Networks)๋Š” MLP, Multi-Layer Perceptron(๋‹ค์ค‘ ํผ์…‰ํŠธ๋ก )์ด๋ผ๊ณ ๋„ ๋ถˆ๋ฆฌ๋ฉฐ, Hidden Layer(์€๋‹‰์ธต)์ด ํ•˜๋‚˜ ์ด์ƒ์ธ ์ธ๊ณต์‹ ๊ฒธ๋ง์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
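A minimal sketch of such a feed-forward network (MLP) in Keras; the layer sizes below are made up purely for illustration:

import tensorflow as tf

# A small feed-forward network: input -> hidden (ReLU) -> output.
# Data flows strictly forward; there are no recurrent or feedback connections.
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(16,)),  # hidden layer
    tf.keras.layers.Dense(10)                                         # output layer
])

mlp.summary()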

Feed-Forward Network Structure

Network structure of Feed-Forward Networks (FFN)

Design of a Feed-Forward Network

  • Since this post focuses on the Transformer model, a detailed explanation will come later in the Deep Learning series.
Summary: in the broad sense, a "Feed-Forward Network" is a neural network architecture in which data is passed in the forward direction, and it can be used in many different contexts and architectures.

Position-wise Feed-Forward Networks (PFFN)

Let's look at the definition of Position-wise Feed-Forward Networks.

Definition of the Position-wise Feed-Forward Network within the Transformer

Feed-Forward Network structure. Source: https://pozalabs.github.io/transformer/

  • Within the Transformer model, the "Position-wise Feed-Forward Network" is a particular kind of FFN (Feed-Forward Network), a *fully connected feed-forward network.
  • This network exists inside every encoder and decoder of the Transformer, and the same network is applied independently to the word vector at each position of the sequence.
  • In other words, every word (or token) in the sequence passes through the FFN at its own position; it usually consists of two linear transformations and one non-linear function (e.g. the ReLU function).
  • The term "position-wise" emphasizes that the network is applied independently at each position of the sequence.
* Fully Connected Feed-Forward Network: every neuron in one layer of the network is connected to every neuron in the previous layer. In other words, a fully connected structure is one in which each layer of the network is connected to the next in its entirety.

Position-wise Feed-Forward Networks

Feed-Forward Network formula
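The formula image is not reproduced here; for reference, the position-wise feed-forward network defined in the original paper is:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$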

 

Flow

  • A linear transformation is applied to x, it passes through the non-linear ReLU function (max(0, z)), and then a second linear transformation is applied.
  • The same parameters W and b are used at every position, but different layers use different parameters.

Point-Wise Feed Forward Network Example Code

This is example code; adjust it to fit your own setup.
def point_wise_feed_forward_network(**kargs):
    return tf.keras.Sequential([
      tf.keras.layers.Dense(kargs['dff'], activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(kargs['d_model'])  # (batch_size, seq_len, d_model)
    ])

Self-Attention

Self-Attention is explained in the Attention post, so here I will only cover the concept. For example code, please refer to the post below!

 

 

[NLP] Attention - ์–ดํ…์…˜

daehyun-bigbread.tistory.com

Self-Attention์˜ ์ „๋ฐ˜์ ์ธ ๊ตฌ์กฐ. ์ถœ์ฒ˜: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro

Self-Attention ๊ธฐ๋ฒ•์€ Attention ๊ธฐ๋ฒ•์„ ๋ง ๊ทธ๋Œ€๋กœ ์ž๊ธฐ ์ž์‹ ์—๊ฒŒ ์ˆ˜ํ–‰ํ•˜๋Š” Attention ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • Input Sequence ๋‚ด์˜ ๊ฐ ์š”์†Œ๊ฐ„์˜ ์ƒ๋Œ€์ ์ธ ์ค‘์š”๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋งค์ปค๋‹ˆ์ฆ˜์ด๋ฉฐ, Sequence์˜ ๋‹ค์–‘ํ•œ ์œ„์น˜๊ฐ„์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Sequence ์š”์†Œ ๊ฐ€์šด๋ฐ Task ์ˆ˜ํ–‰์— ์ค‘์š”ํ•œ Element(์š”์†Œ)์— ์ง‘์ค‘ํ•˜๊ณ  ๊ทธ๋ ‡์ง€ ์•Š์€ Element(์š”์†Œ)๋Š” ๋ฌด์‹œํ•ฉ๋‹ˆ๋‹ค.
    • ์ด๋Ÿฌ๋ฉด Task๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ์„ฑ๋Šฅ์ด ์ƒ์Šนํ•  ๋ฟ๋”๋Ÿฌ Decoding ํ• ๋•Œ Source Sequence ๊ฐ€์šด๋ฐ ์ค‘์š”ํ•œ Element(์š”์†Œ)๋“ค๋งŒ ์ถ”๋ฆฝ๋‹ˆ๋‹ค.
  • ๋ฌธ๋งฅ์— ๋”ฐ๋ผ ์ง‘์ค‘ํ•  ๋‹จ์–ด๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐฉ์‹์„ ์˜๋ฏธ -> ์ค‘์š”ํ•œ ๋‹จ์–ด์—๋งŒ ์ง‘์ค‘์„ ํ•˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ๊ทธ๋ƒฅ ์ฝ์Šต๋‹ˆ๋‹ค.
    • ์ด ๋ฐฉ๋ฒ•์ด ๋ฌธ๋งฅ์„ ํŒŒ์•…ํ•˜๋Š” ํ•ต์‹ฌ์ด๋ฉฐ, ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์„ Deep Learning ๋ชจ๋ธ์— ์ ์šฉํ•œ๊ฒƒ์ด 'Attention' ๋งค์ปค๋‹ˆ์ฆ˜์ด๋ฉฐ, ์ด ๋งค์ปค๋‹ˆ์ฆ˜์„ ์ž๊ธฐ ์ž์‹ ์—๊ฒŒ ์ ์šฉํ•œ๊ฒƒ์ด 'Self-Attention' ์ž…๋‹ˆ๋‹ค. ํ•œ๋ฒˆ ๋งค์ปค๋‹ˆ์ฆ˜์„ ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด์„œ ์„ค๋ช…ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Self-Attention์˜ ๊ณ„์‚ฐ ์˜ˆ์‹œ

Self-Attention์€ Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜) 3๊ฐ€์ง€ ์š”์†Œ๊ฐ€ ์„œ๋กœ ์˜ํ–ฅ์„ ์ฃผ๊ณ  ๋ฐ›๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.
  • ๋ฌธ์žฅ๋‚ด ๊ฐ ๋‹จ์–ด๊ฐ€ Vector(๋ฒกํ„ฐ) ํ˜•ํƒœ๋กœ input(์ž…๋ ฅ)์„ ๋ฐ›์Šต๋‹ˆ๋‹ค.

* Vector: ์ˆซ์ž์˜ ๋‚˜์—ด ์ •๋„

  • ๊ฐ ๋‹จ์–ด์˜ Vector๋Š” 3๊ฐ€์ง€ ๊ณผ์ •์„ ๊ฑฐ์ณ์„œ ๋ฐ˜ํ™˜์ด ๋ฉ๋‹ˆ๋‹ค.
  • Query(์ฟผ๋ฆฌ) - ๋‚ด๊ฐ€ ์ฐพ๊ณ  ์ •๋ณด๋ฅผ ์š”์ฒญํ•˜๋Š”๊ฒƒ ์ž…๋‹ˆ๋‹ค. 
  • Key(ํ‚ค) - ๋‚ด๊ฐ€ ์ฐพ๋Š” ์ •๋ณด๊ฐ€ ์žˆ๋Š” ์ฐพ์•„๋ณด๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค.
  • Value(๋ฐธ๋ฅ˜) - ์ฐพ์•„์„œ ์ œ๊ณต๋œ ์ •๋ณด๊ฐ€ ๊ฐ€์น˜ ์žˆ๋Š”์ง€ ํŒ๋‹จํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค.

  • ์œ„์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์ž…๋ ฅ๋˜๋Š” ๋ฌธ์žฅ "์–ด์ œ ์นดํŽ˜ ๊ฐ”์—ˆ์–ด ๊ฑฐ๊ธฐ ์‚ฌ๋žŒ ๋งŽ๋”๋ผ" ์ด 6๊ฐœ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค๋ฉด?
  • ์—ฌ๊ธฐ์„œ์˜ Self-Attention ๊ณ„์‚ฐ ๋Œ€์ƒ์€ Query(์ฟผ๋ฆฌ) Vector 6๊ฐœ, Key(ํ‚ค) Vector 6๊ฐœ, Value(๋ฐธ๋ฅ˜) Vector 6๊ฐœ๋“ฑ ๋ชจ๋‘ 18๊ฐœ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

  • ์œ„์˜ ํ‘œ๋Š” ๋” ์„ธ๋ถ€์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ธ๊ฒƒ์ž…๋‹ˆ๋‹ค. Self-Attention์€ Query ๋‹จ์–ด ๊ฐ๊ฐ์— ๋Œ€ํ•ด ๋ชจ๋“  Key ๋‹จ์–ด์™€ ์–ผ๋งˆ๋‚˜ ์œ ๊ธฐ์ ์ธ ๊ด€๊ณ„๋ฅผ ๋งบ๊ณ  ์žˆ๋Š”์ง€์˜ ํ™•๋ฅ ๊ฐ’๋“ค์˜ ํ•ฉ์ด 1์ธ ํ™•๋ฅ ๊ฐ’์œผ๋กœ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ์ด๊ฒƒ์„ ๋ณด๋ฉด Self-Attention ๋ชจ๋“ˆ์€ Value(๋ฐธ๋ฅ˜) Vector๋“ค์„ Weighted Sum(๊ฐ€์ค‘ํ•ฉ)ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๊ณ„์‚ฐ์„ ๋งˆ๋ฌด๋ฆฌ ํ•ฉ๋‹ˆ๋‹ค.
  • ํ™•๋ฅ ๊ฐ’์ด ๊ฐ€์žฅ ๋†’์€ ํ‚ค ๋‹จ์–ด๊ฐ€ ์ฟผ๋ฆฌ ๋‹จ์–ด์™€ ๊ฐ€์žฅ ๊ด€๋ จ์ด ๋†’์€ ๋‹จ์–ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์—ฌ๊ธฐ์„œ๋Š” '์นดํŽ˜'์— ๋Œ€ํ•ด์„œ๋งŒ ๊ณ„์‚ฐ ์˜ˆ์‹œ๋ฅผ ๋“ค์—ˆ์ง€๋งŒ, ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์œผ๋กœ ๋‚˜๋จธ์ง€ ๋‹จ์–ด๋“ค์˜ค Self-Attention์„ ๊ฐ๊ฐ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

Self-Attention์˜ ๋™์ž‘ ๋ฐฉ์‹

Self-Attention์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฐœ๋…์€ Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜)์˜ ์‹œ์ž‘๊ฐ’์ด ๋™์ผํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ ‡๋‹ค๊ณ  Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜)๊ฐ€ ๋™์ผ ํ•˜๋‹ค๋Š” ๋ง์ด ์ด๋‹™๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ๊ฐ€์ค‘์น˜ Weight W๊ฐ’์— ์˜ํ•ด์„œ ์ตœ์ข…์ ์ธ Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜)๊ฐ’์€ ์„œ๋กœ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

The Attention Formula

Let's look at the formula for computing attention (written out below).
  • First, take the dot product of the Query and the Key. We take this dot product to measure how related the two are.
  • This dot-product value is called the "Attention Score". It was covered in detail in the Dot-Product Attention section, so here is the short version.
  • If the dimension of the Query and Key grows, the dot-product attention scores grow as well, which makes the model harder to train.
  • To solve this, we scale by dividing by the square root of the dimension d_k. This process is called "Scaled Dot-Product Attention".
  • The scaled dot-product values are then normalized through a Softmax function, and the resulting score matrix is multiplied with the value matrix, which finally gives the attention matrix. Let's go through an example sentence.
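Written out, the scaled dot-product attention described above is (as in the original paper):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$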

 

"I am a student"๋ผ๋Š” ๋ฌธ์žฅ์œผ๋กœ ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

  • Self-Attention์€ Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜) 3๊ฐœ ์š”์†Œ ์‚ฌ์ด์˜ ๋ฌธ๋งฅ์  ๊ด€๊ณ„์„ฑ์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
Q = X * Wq, K = X * Wk, W = X * Wv
  • ์œ„์˜ ์ˆ˜์‹์ฒ˜๋Ÿผ Input Vector Sequence(์ž…๋ ฅ ๋ฒกํ„ฐ ์‹œํ€€์Šค) X์— Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜)๋ฅผ ๋งŒ๋“ค์–ด์ฃผ๋Š” ํ–‰๋ ฌ(W)๋ฅผ ๊ฐ๊ฐ ๊ณฑํ•ด์ค๋‹ˆ๋‹ค.
  • Input Vector Sequence(์ž…๋ ฅ ๋ฒกํ„ฐ ์‹œํ€€์Šค)๊ฐ€ 4๊ฐœ์ด๋ฉด ์™ผ์ชฝ์— ์žˆ๋Š” ํ–‰๋ ฌ์„ ์ ์šฉํ•˜๋ฉด Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜) ๊ฐ๊ฐ 4๊ฐœ์”ฉ, ์ด 12๊ฐœ๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค.
* Word Embedding: ๋‹จ์–ด๋ฅผ Vector๋กœ ๋ณ€ํ™˜ํ•ด์„œ Dense(๋ฐ€์ง‘)ํ•œ Vector๊ณต๊ฐ„์— Mapping ํ•˜์—ฌ ์‹ค์ˆ˜ Vector๋กœ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  1. ๊ฐ ๋‹จ์–ด์— ๋ฐํ•˜์—ฌ Word Embedding(๋‹จ์–ด ์ž„๋ฒ ๋”ฉ)์„ ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์–ด 'i'์˜ Embedding์ด [1,1,1,1]์ด๋ผ๊ณ  ํ–ˆ์„ ๋•Œ, ์ฒ˜์Œ 'i'์˜ ์ฒ˜์Œ Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜)๋ฅผ ๊ฐ๊ฐ 'Q_i, original', 'K_i, original', 'V_i, original', ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
    • Embedding์˜ ๊ฐ’์ด ๋‹ค [1,1,1,1]๋กœ ๊ฐ™์€ ์ด์œ ๋Š” Self-Attention ๋งค์ปค๋‹ˆ์ฆ˜์—์„œ๋Š” ๊ฐ™์•„์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋‘ [1,1,1,1]์ด๋ผ๊ณ  ๋™์ผํ•ฉ๋‹ˆ๋‹ค.
  2. ํ•™์Šต๋œ Weight(๊ฐ€์ค‘์น˜)๊ฐ’์ด 'WQ', 'WK', 'WV'๋ผ๊ณ  ํ• ๋•Œ Original ๊ฐ’๋“ค๊ณผ ์ ๊ณฑ์„ ํ•ด์ฃผ๋ฉด ์ตœ์ข…์ ์œผ๋กœ 'Q', 'K', 'V'๊ฐ’์ด ๋„์ถœ๋ฉ๋‹ˆ๋‹ค.
    • 'Q', 'K', 'V'๊ฐ’์„ ์ด์šฉํ•ด์„œ ์œ„์—์„œ ์„ค์ •ํ•œ ๋ณด์ •๋œ 'Attention Score'๋ฅผ ๊ณฑํ•ด์ฃผ๋ฉด ์•„๋ž˜์˜ ์™ผ์ชฝ์‹๊ณผ ๊ฐ™์ด 1.5๋ผ๋Š” ๊ฐ’์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.
    • ํ–‰๋ ฌ 'Q', 'K'๋Š” ์„œ๋กœ ์ ๊ณฑ ๊ณ„์‚ฐํ•ด์ฃผ๊ณ , ์—ฌ๊ธฐ์„œ ํ–‰๋ ฌ 'Q', 'K', 'V'์˜ Dimension(์ฐจ์›)์€ 4์ด๋ฏ€๋กœ ๋ฃจํŠธ 4๋กœ ๋‚˜๋ˆ„์–ด์ค๋‹ˆ๋‹ค.

  • 'i' ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋ชจ๋“  ๋‹จ์–ด๊ฐ„์˜ 'Self-Attention'์„ ํ•ด์ฃผ๋ฉด ์œ„์˜ ์˜ค๋ฅธ์ชฝ ํ–‰๋ ฌ๊ณผ ๊ฐ™์€ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค.
  • ๊ฐ€์šด๋ฐ ๋…ธ๋ฝ์ƒ‰ ๋ถ€๋ถ„์€ ์ž๊ธฐ ์ž์‹ ์— ๋Œ€ํ•œ 'Attention'์ด๋ฏ€๋กœ ๋‹น์—ฐํžˆ ๊ฐ’์ด ์ œ์ผ ํฌ๊ณ , ์–‘์ชฝ ์ดˆ๋ก์ƒ‰ ๋ถ€๋ถ„์„ ๋ณด๋ฉด ์ ์ˆ˜๊ฐ€ ๋†’์Šต๋‹ˆ๋‹ค.

Diagram of the Attention Computation

  • The figure above diagrams how 'Attention' is computed for each individual word.
  • In practice, multiple words are processed in parallel as in the figure.
  • Parallel processing has the advantage of faster computation. Here is a brief summary of the 'Self-Attention' process, with a small code sketch after it.
Self-Attention Process Summary
1. Embed the sentence and, through training, obtain the weights for Query, Key, and Value.
2. Take the dot product of each word's embedding (where Query = Key = Value at the start) with the weights to obtain the final Q, K, V.
3. Use the attention score formula to compute each word's Self-Attention value.
4. Compare the Self-Attention values to find the words that are most strongly related.
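The MultiHeadAttention example code later in this post calls a scaled_dot_product_attention function; a minimal sketch of it, following the formula above and assuming TensorFlow, might look like this:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention scores: dot product of the queries and keys.
    matmul_qk = tf.matmul(q, k, transpose_b=True)      # (..., seq_len_q, seq_len_k)

    # Scale by sqrt(d_k) so large dimensions don't blow up the scores.
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(dk)

    # Optionally mask out positions (padding or future tokens).
    if mask is not None:
        scaled_logits += (mask * -1e9)

    # Softmax over the key axis gives weights that sum to 1 for each query.
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)

    # Weighted sum of the value vectors.
    output = tf.matmul(attention_weights, v)           # (..., seq_len_q, depth_v)
    return output, attention_weights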

Multi-Head Attention

For Multi-Head Attention I will also only explain the concept; example code is in the Attention post linked above, so check it out if you need it!

Overall structure of Multi-Head Attention. Source: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro

Multi-Head Attention has several 'attention heads', and the values coming out of each head are concatenated and used together.
  • Rather than training with a single 'Attention', it runs several 'Attentions' in parallel.

Multi-Head Attention์˜ ์ˆ˜์‹

  1. ์ˆœ์„œ๋Š” ์›๋ž˜์˜ Query(์ฟผ๋ฆฌ), Key(ํ‚ค) Value(๋ฐธ๋ฅ˜) ํ–‰๋ ฌ ๊ฐ’์„ Head์ˆ˜ ๋งŒํผ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.
  2. ๋ถ„ํ• ํ•œ ํ–‰๋ ฌ ๊ฐ’์„ ํ†ตํ•ด, ๊ฐ 'Attention' Value๊ฐ’๋“ค์„ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค.
  3. ๋„์ถœ๋œ 'Attention value'๊ฐ’๋“ค์„ concatenate(์Œ“์•„ ํ•ฉ์น˜๊ธฐ)ํ•˜์—ฌ์„œ ์ตœ์ข… 'Attention Value'๋ฅผ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค.

  • [4x4] ํฌ๊ธฐ์˜ ๋ฌธ์žฅ Embedding Vector์™€ [4x8]์˜ Query(์ฟผ๋ฆฌ), Key(ํ‚ค) Value(๋ฐธ๋ฅ˜)๊ฐ€ ์žˆ์„ ๋•Œ, ์ผ๋ฐ˜์ ์ธ ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐํ•˜๋Š” Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ [4x4]*[4x8]=[4x8]์˜ 'Attention Value'๊ฐ€ ํ•œ ๋ฒˆ์— ๋„์ถœ๋ฉ๋‹ˆ๋‹ค.

  • 'Multi-Head Attention' ๋งค์ปค๋‹ˆ์ฆ˜์œผ๋กœ ๋ณด๋ฉด ์—ฌ๊ธฐ์„œ Head๋Š” 4๊ฐœ ์ž…๋‹ˆ๋‹ค. 'I, am, a, student'
  • Head๊ฐ€ 4๊ฐœ ์ด๋ฏ€๋กœ ๊ฐ ์—ฐ์‚ฐ๊ณผ์ •์ด 1/4๋งŒํผ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
  • ์œ„์˜ ๊ทธ๋ฆผ์œผ๋กœ ๋ณด๋ฉด ํฌ๊ธฐ๊ฐ€ [4x8]์ด์—ˆ๋˜, Query(์ฟผ๋ฆฌ), Key(ํ‚ค) Value(๋ฐธ๋ฅ˜)๋ฅผ 4๋“ฑ๋ถ„ ํ•˜์—ฌ [4x2]๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ์—ฌ๊ธฐ์„œ์˜ 'Attention Value'๋Š” [4x2]๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  • ์ด 'Attention Value'๋“ค์„ ๋งˆ์ง€๋ง‰์œผ๋กœ Concatenate(ํ•ฉ์ณ์ค€๋‹ค)ํ•ฉ์ณ์ฃผ๋ฉด, ํฌ๊ธฐ๊ฐ€ [4x8]๊ฐ€ ๋˜์–ด ์ผ๋ฐ˜์ ์ธ Attention ๋งค์ปค๋‹ˆ์ฆ˜์˜ ๊ฒฐ๊ณผ๊ฐ’๊ณผ ๋™์ผํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์˜ˆ์‹œ๋ฅผ ํ•œ๋ฒˆ ๋“ค์–ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
Summary: Query(์ฟผ๋ฆฌ), Key(ํ‚ค), Value(๋ฐธ๋ฅ˜)๊ฐ’์„ ํ•œ ๋ฒˆ์— ๊ณ„์‚ฐํ•˜์ง€ ์•Š๊ณ  head ์ˆ˜๋งŒํผ ๋‚˜๋ˆ  ๊ณ„์‚ฐ ํ›„ ๋‚˜์ค‘์— Attention Value๋“ค์„ ํ•ฉ์น˜๋Š” ๋ฉ”์ปค๋‹ˆ์ฆ˜. ํ•œ๋งˆ๋””๋กœ ๋ถ„ํ•  ๊ณ„์‚ฐ ํ›„ ํ•ฉ์‚ฐํ•˜๋Š” ๋ฐฉ์‹.

Multi-Head Attention Example

  • The figure shows multi-head attention with 2 input words, a value dimension of 3, and 8 heads.
  • Each head's self-attention result is a matrix of size 'number of input words × value dimension', i.e., 2×3.
  • Concatenating the 8 heads' self-attention results as in โ‘  of the next figure gives a 2×24 matrix.
  • The final result of Multi-Head Attention has size 'number of input words' x 'target dimension', and it is applied in both the encoder and decoder blocks.
Multi-Head Attention finishes by matrix-multiplying the concatenation (โ‘ ) of the individual heads' self-attention results with W0.
→ W0 has size (number of columns of the concatenated self-attention result matrix) × (target dimension); a quick shape check is sketched below.

More on Multi-Head Attention

Detailed structure of Multi-Head Attention. Source: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro

  • In the picture above, the Query, Key, and Value computed for a single attention head are split into several parts.
  • This way, a model with one attention head and a model with several attention heads have the same size.
  • To sum up, Multi-Head Attention does not enlarge the model - it keeps the same size - but splits the computed Query, Key, and Value into several parts.

Multi-Head Attention Example Code

This is example code; adjust it to fit your own setup.
  • Multi-Head Attention input size = output size
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, **kargs):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = kargs['num_heads']
        self.d_model = kargs['d_model']

        assert self.d_model % self.num_heads == 0

        self.depth = self.d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(kargs['d_model']) # Multi-Head Attention input = output
        self.wk = tf.keras.layers.Dense(kargs['d_model'])
        self.wv = tf.keras.layers.Dense(kargs['d_model'])

        self.dense = tf.keras.layers.Dense(kargs['d_model'])

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights

Positional Encoding

In the Transformer model, positional encoding is used to inject information about the relative positions of the words within the sequence.

The Positional Encoding part - a piece of the Transformer model

  • In other words, it helps the model understand word order and process it effectively.
  • The positional encoding method mainly used in the Transformer model is based on the sine and cosine functions.

Positional Encoding Formula

  • Looking at the formula (written out below), PE is an element of the positional encoding matrix.
    • That is, it refers to each element of the matrix used to provide unique positional information for each position (or word) of the input sequence.
    • Because the Transformer model cannot inherently handle the order of a sequence, positional information has to be injected into the model.
  • pos is the position (index) of the word or token in the sequence.
  • i is the index of a dimension within the positional encoding vector, i.e., an index within the encoding dimension (d_model).
  • d_model is the model's embedding dimension, the size of the vector into which every input token is converted.

Positional Encoding Matrix Example

Let's take "I am a Robot" as an example.

  • 'pos' is the position of the object in the input sequence.
    • 0 <= pos < L, where L is the length of the input (token) sequence.
  • 'd_model' is the dimension of the output embedding space.
  • 'PE(pos, 2i)' is the function that maps a position; it indicates which position of the input sequence is being indexed.
  • '10000' is a user-defined scalar chosen in the positional encoding formula of the Transformer paper. Other write-ups sometimes call it 'n' instead of 10000.
  • 'i' is used to map to column indices, with 0 <= i < d_model / 2.
Looking at the matrix above, you can see that the even positions correspond to the sine function and the odd positions to the cosine function.
  • If we lower '10000' to '100' and set 'd_model' (the dimension of the output embedding space) to 4, we get an encoding matrix like the one below; a small script that computes it follows.
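A small script to reproduce that matrix for the 4-token sentence "I am a Robot" with n = 100 and d_model = 4 (a sketch; exact values depend on rounding):

import numpy as np

def toy_positional_encoding(seq_len, d_model, n=100):
    P = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model // 2):
            denominator = np.power(n, 2 * i / d_model)
            P[pos, 2 * i] = np.sin(pos / denominator)      # even dimensions: sine
            P[pos, 2 * i + 1] = np.cos(pos / denominator)  # odd dimensions: cosine
    return P

# 4 tokens ("I", "am", "a", "Robot"), d_model = 4, n = 100.
print(np.round(toy_positional_encoding(4, 4, n=100), 3))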

Positional Encoding Matrix Example

Positional Encoding produces periodic patterns.

  • The expressions inside the sine and cosine functions of the positional encoding formula produce periodic patterns.
  • They generate periodic values while producing a unique value for each different position.
  • These unique values for different positions reflect each word's relative position in the embedding space and help the Transformer model recognize the order of the sentence.

Scaling

Scaling Formula

  • The scaling term scales the position 'pos' in the positional encoding and produces values for the various positions.
'pos' is the position of the object in the input sequence.
0 <= pos < L, where L is the length of the input (token) sequence.

ํ™€์ˆ˜/์ง์ˆ˜ Dimension ๊ตฌ๋ถ„

  • ์œ„์—์„œ Positional Encoding Matrix ํ‘œ๋ฅผ ๋ณด์‹œ๋ฉด ์™ผ์ชฝ์—์„œ 2๋ฒˆ์งธ ํ‘œ๋ฅผ ๋ณด๋ฉด "0, 1, 2, 3"์ด๋ผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด๊ฑด ์ฐจ์›์„ ๋‚˜ํƒ€๋‚ด๋Š” ์ˆซ์ž์ธ๋ฐ, ์—ฌ๊ธฐ์„œ ํ™€์ˆ˜ ์ˆซ์ž๋Š” Cos(์ฝ”์‚ฌ์ธ)ํ•จ์ˆ˜๊ฐ€ ํ™€์ˆ˜ Dimension(์ฐจ์›), ์ง์ˆ˜ ์ˆซ์ž๋Š” SIn(์‚ฌ์ธ)ํ•จ์ˆ˜๋Š” ์ง์ˆ˜ Dimension(์ฐจ์›)์„ ์ฒ˜๋ฆฌํ•˜์—ฌ Dimension(์ฐจ์›)๊ฐ„ ๋‹ค์–‘์„ฑ์„ ํ™•๋ณดํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ฐ Dimension(์ฐจ์›)์ด ์„œ๋กœ ๋‹ค๋ฅธ ์ •๋ณด๋ฅผ ๋‹ด๋‹นํ•˜๊ฒŒ ๋˜๋ฉฐ, Model์ด ๋‹จ์–ด์˜ ์ƒ๋Œ€์ ์ธ ์œ„์น˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Positional Encoding Example Code

This is example code; adjust it to fit your own setup.
  • Build the sine and cosine functions and make sure the values do not overlap.
  • "Embedding" is a technique that maps data into a lower-dimensional space so that information can be represented and processed effectively; "mapping" generally means associating one set of values with another.
  • Even if the embedding dimension changes, the positional encoding comes in as fixed values.
  • In a neural network model, it transforms the input data into a lower-dimensional space.
import numpy as np
import tensorflow as tf

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * i//2) / np.float32(d_model))
    return pos * angle_rates
def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

The positional encoding is used together with the embedding layer like this:

import matplotlib.pyplot as plt

pos_encoding = positional_encoding(50, 512)  # 512 - embedding dimension
print(pos_encoding.shape)

plt.pcolormesh(pos_encoding[0], cmap='RdBu')
plt.xlabel('Depth')
plt.xlim((0, 512))
plt.ylabel('Position')
plt.colorbar()
plt.show()
  • Result: (1, 50, 512)


Transformer: Model Architecture

Let's look at the overall structure of the Transformer model.

Transformer Model Architecture. Source: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro

  • At the top level it consists of an Encoder and a Decoder; in detail, it is made up of Feed-Forward blocks, Residual Connections & Layer Normalization, Positional Encoding, and Multi-Head Attention.
  • These were explained in detail above, so here we will only give a rough overview.

Feed-Forward Block

  • Each layer has a Feed-Forward Network block: two linear layers with a ReLU between them.

Inside the Feed-Forward Block

  • After looking at the other tokens through the attention mechanism, the model uses the Feed-Forward block to process the new information.

Residual Connection

Inside the Residual Connection block

  • Residual connections are very simple and very useful.
  • The block's input is added to its output.
  • They ease the gradient flow through the network and make it possible to stack many layers.
  • In the Transformer, a residual connection is used after every attention and feed-forward block.

Layer Normalization (๋ ˆ์ด์–ด ์ •๊ทœํ™”)

Layer Normalization

  • "Add & Norm" Layer ์˜ "Norm" ๋ถ€๋ถ„์€ Layer Normalization(๋ ˆ์ด์–ด ์ •๊ทœํ™”)๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ๊ฐ ์˜ˆ์ œ์˜ Vector ํ‘œํ˜„์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™” ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋Š” ๋‹ค์Œ Layer๋กœ "Flow"๋ฅผ ์ œ์–ดํ•˜๊ธฐ ์œ„ํ•ด ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.
  • Layer Normalization(๋ ˆ์ด์–ด ์ •๊ทœํ™”)๋Š” ์•ˆ์ •์„ฑ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ณ  ํ’ˆ์งˆ๊นŒ์ง€ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  • Transformer์—์„œ๋Š” ๊ฐ Token์˜ Vector ํ‘œํ˜„์„ Normalization(์ •๊ทœํ™”)ํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ Layer Norm์—๋Š” ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜์ธ 'scale', 'bias' ๊ทธ๋ฆฌ๊ณ  Normalization(์ •๊ทœํ™”)ํ›„์— Layer์˜ ์ถœ๋ ฅ(or Next Layer์˜ Input)์˜ ํฌ๊ธฐ๋ฅผ ์กฐ์ •ํ•˜๋Š”๋ฐ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค.
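A short example of those trainable scale and bias parameters in Keras layer normalization (in tf.keras they are named gamma and beta):

import tensorflow as tf

ln = tf.keras.layers.LayerNormalization(epsilon=1e-6)
x = tf.random.normal((2, 3, 8))   # (batch, tokens, d_model)
y = ln(x)                         # each token's vector is normalized independently

# The trainable parameters: gamma (scale) and beta (bias), one value per feature.
print([w.name for w in ln.weights])   # e.g. ['.../gamma:0', '.../beta:0']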

Positional Encoding

The Positional Encoding part. Source: https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html#transformer_intro

  • The Transformer model contains no recurrence and no convolution.
  • That means it cannot know the order of the incoming tokens.
  • So we have to tell the model the position of each token.
  • For this there are two sets of embeddings: token embeddings and position embeddings.
  • A token's input representation is then the sum of its token embedding and its position embedding (see the sketch below).

Transformer Model Example Code

This is example code; adjust it to fit your own setup.

Parameters

char2idx = prepro_configs['char2idx']
end_index = prepro_configs['end_symbol']
model_name = 'transformer'
vocab_size = prepro_configs['vocab_size']
BATCH_SIZE = 2
MAX_SEQUENCE = 25
EPOCHS = 30
VALID_SPLIT = 0.1

kargs = {'model_name': model_name,
         'num_layers': 2,
         'd_model': 512,
         'num_heads': 8,
         'dff': 2048,
         'input_vocab_size': vocab_size,
         'target_vocab_size': vocab_size,
         'maximum_position_encoding': MAX_SEQUENCE,
         'end_token_idx': char2idx[end_index],
         'rate': 0.1
        }

Transformer Model Code

class Transformer(tf.keras.Model):
    def __init__(self, **kargs):
        super(Transformer, self).__init__(name=kargs['model_name'])
        self.end_token_idx = kargs['end_token_idx']

        self.encoder = Encoder(**kargs)
        self.decoder = Decoder(**kargs)

        self.final_layer = tf.keras.layers.Dense(kargs['target_vocab_size'])

    def call(self, x):
        inp, tar = x

        enc_padding_mask, look_ahead_mask, dec_padding_mask = create_masks(inp, tar)
        enc_output, attn = self.encoder(inp, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attn = self.decoder(
            tar, enc_output, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attn

    def inference(self, x):
        inp = x
        tar = tf.expand_dims([STD_INDEX], 0)

        enc_padding_mask, look_ahead_mask, dec_padding_mask = create_masks(inp, tar)
        enc_output, _ = self.encoder(inp, enc_padding_mask)

        predict_tokens = list()
        for t in range(0, MAX_SEQUENCE):
            dec_output, _ = self.decoder(tar, enc_output, look_ahead_mask, dec_padding_mask)
            final_output = self.final_layer(dec_output)
            outputs = tf.argmax(final_output, -1).numpy()
            pred_token = outputs[0][-1]
            if pred_token == self.end_token_idx:
                break
            predict_tokens.append(pred_token)
            tar = tf.expand_dims([STD_INDEX] + predict_tokens, 0)
            _, look_ahead_mask, dec_padding_mask = create_masks(inp, tar)

        return predict_tokens