[LLM] Improving Language Understanding by Generative Pre-Training (GPT-1 Paper Review)
In this post I review what I studied while reading the GPT-1 paper.
  • This paper builds on the Transformer model, so you will need some familiarity with the Transformer architecture to follow along. Please read the post below first!
 

[NLP] Transformer Model - An Overview of the Transformer Architecture

daehyun-bigbread.tistory.com


Abstract

์ž์—ฐ์–ด ์ดํ•ด๋Š” ํ…์ŠคํŠธ ํ•จ์˜, ์งˆ๋ฌธ ์‘๋‹ต, ์˜๋ฏธ ์œ ์‚ฌ์„ฑ ํ‰๊ฐ€, ๋ฌธ์„œ ๋ถ„๋ฅ˜์™€ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ์ž‘์—…์„ ์œ„ํ•œ ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ๋Š” ๋ถ€์กฑํ•œ ๋ฐ˜๋ฉด, ๋Œ€๊ทœ๋ชจ์˜ ๋น„์ง€๋„ ํ…์ŠคํŠธ ์ฝ”ํผ์Šค๋Š” ํ’๋ถ€ํ•˜๊ฒŒ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด ๊ธฐ์กด์˜ ํŒ๋ณ„์ (discriminative) ๋ชจ๋ธ๋“ค์€ ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์€ ์ƒํ™ฉ์—์„œ ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

 

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด generative pre-training๊ณผ discriminative fine-tuning์„ ๊ฒฐํ•ฉํ•œ ์ ‘๊ทผ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค:

  1. Generative Pre-Training:
    • ๋น„์ง€๋„ ํ•™์Šต์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํ…์ŠคํŠธ ์ฝ”ํผ์Šค์—์„œ ์–ธ์–ด ๋ชจ๋ธ์„ ์‚ฌ์ „ ํ•™์Šตํ•˜์—ฌ, ์ž์—ฐ์Šค๋Ÿฌ์šด ์–ธ์–ด ์ƒ์„ฑ ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ”๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • ์ด ๊ณผ์ •์—์„œ ๋ชจ๋ธ์€ ์ผ๋ฐ˜์ ์ธ ์–ธ์–ด ํŒจํ„ด๊ณผ ๊ตฌ์กฐ๋ฅผ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  2. Discriminative Fine-Tuning:
    • ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ๊ฐ ํŠน์ • ์ž‘์—…์— ๋งž๊ฒŒ ๋ฏธ์„ธ ์กฐ์ •(fine-tuning) ํ•˜์—ฌ, ํƒœ์Šคํฌ ์ธ์‹ ์ž…๋ ฅ ๋ณ€ํ™˜์„ ํ†ตํ•ด ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์˜ ์ตœ์†Œํ•œ์˜ ๋ณ€๊ฒฝ์œผ๋กœ๋„ ํšจ์œจ์ ์ธ ์ „์ด๋ฅผ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

The paper demonstrates the effectiveness of this approach on a variety of natural language understanding benchmarks, outperforming discriminatively trained models built for each task:

  • Commonsense Reasoning (Stories Cloze Test): 8.9% absolute improvement
  • Question Answering (RACE): 5.7% absolute improvement
  • Textual Entailment (MultiNLI): 1.5% absolute improvement

Introduction

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP) ๋ถ„์•ผ์—์„œ ์›์‹œ ํ…์ŠคํŠธ๋กœ๋ถ€ํ„ฐ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๋Šฅ๋ ฅ์€ ์ง€๋„ ํ•™์Šต(Supervised Learning)์— ๋Œ€ํ•œ ์˜์กด์„ ์ค„์ด๋Š” ๋ฐ ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ ๋Œ€๋ถ€๋ถ„์˜ ๋”ฅ ๋Ÿฌ๋‹(Deep Learning) ๋ฐฉ๋ฒ•์€ ์ˆ˜์ž‘์—…์œผ๋กœ ๋ผ๋ฒจ๋ง๋œ ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•„์š”๋กœ ํ•˜์ง€๋งŒ, ์ด๋Š” ์ฃผ์„(Annotation)์ด ๋ถ€์กฑํ•œ ์—ฌ๋Ÿฌ ๋„๋ฉ”์ธ์—์„œ ์ ์šฉ ๊ฐ€๋Šฅ์„ฑ์„ ์ œํ•œํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ƒํ™ฉ์—์„œ ๋น„์ง€๋„(Unsupervised) ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์€ ๋ผ๋ฒจ๋ง์— ๋Œ€ํ•œ ์‹œ๊ฐ„๊ณผ ๋น„์šฉ์„ ์ค„์ด๋Š” ์œ ์šฉํ•œ ๋Œ€์•ˆ์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋น„์ง€๋„ ๋ฐฉ์‹์œผ๋กœ ์ข‹์€ ํ‘œํ˜„(Representations)์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ์ง€๋„ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•œ ๊ฒฝ์šฐ์—๋„ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

Pre-trained word embeddings have also played an important role in improving performance across a range of NLP tasks. However, learning more than word-level information from unlabeled text poses two main challenges:

  1. It is unclear which optimization objective is most effective for learning textual representations useful for transfer learning.
    • Recent research has explored objectives such as language modeling, machine translation, and discourse coherence, with each method performing differently depending on the task.
  2. There is no consensus on the most effective way to transfer the learned representations to a target task.
    • Existing techniques require a combination of task-specific architecture modifications, intricate training schemes, and auxiliary objectives, which makes it difficult to develop effective semi-supervised learning approaches for language processing.

 

๊ทธ๋ž˜์„œ, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋น„์ง€๋„ ์‚ฌ์ „ ํ•™์Šต(Unsupervised Pre-Training)๊ณผ ์ง€๋„ ๋ฏธ์„ธ ์กฐ์ •(Supervised Fine-Tuning)์„ ๊ฒฐํ•ฉํ•œ ๋ฐ˜์ง€๋„ ํ•™์Šต ์ ‘๊ทผ๋ฒ•์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ๋ฒ•์˜ ๋ชฉํ‘œ๋Š” ์ตœ์†Œํ•œ์˜ ์ ์‘(Minimal Adaptation)์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ์ž‘์—…์— ์ „์ด๋  ์ˆ˜ ์žˆ๋Š” ๋ณดํŽธ์ ์ธ ํ‘œํ˜„(Universal Representations)์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์—ฐ๊ตฌ์˜ ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. Generative Pre-Training:
    • ๋Œ€๊ทœ๋ชจ ๋น„์ง€๋„ ํ…์ŠคํŠธ ์ฝ”ํผ์Šค(Unsupervised Text Corpora)์—์„œ Language Modeling Objective๋ฅผ ์‚ฌ์šฉํ•ด ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์˜ ์ดˆ๊ธฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
    • ์ด ๊ณผ์ •์—์„œ ๋ชจ๋ธ์€ ์ผ๋ฐ˜์ ์ธ ์–ธ์–ด ํŒจํ„ด๊ณผ ๊ตฌ์กฐ๋ฅผ ํ•™์Šตํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์— ์ ์šฉ ๊ฐ€๋Šฅํ•œ ํ‘œํ˜„์„ ์–ป์Šต๋‹ˆ๋‹ค.
  2. Discriminative Fine-Tuning:
    • ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ง€๋„ ๋ชฉํ‘œ(Supervised Objectives)์— ๋งž๊ฒŒ ๋ฏธ์„ธ ์กฐ์ •(Fine-Tuning) ํ•ฉ๋‹ˆ๋‹ค.
    • ํƒœ์Šคํฌ ํŠน์ • ์ž…๋ ฅ ๋ณ€ํ™˜(Task-Specific Input Adaptations)์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์˜ ์ตœ์†Œํ•œ์˜ ๋ณ€๊ฒฝ์œผ๋กœ๋„ ํšจ๊ณผ์ ์ธ ์ „์ด๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜ ๋ฐ ์ „์ด ํ•™์Šต ๊ธฐ๋ฒ•

Transformer Architecture

์ด ์—ฐ๊ตฌ์—์„œ๋Š” Transformer ์•„ํ‚คํ…์ฒ˜๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. Transformer๋Š” Machine Translation, Document Generation, Syntactic Parsing ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, Recurrent Networks์™€ ๊ฐ™์€ ๋Œ€์•ˆ๋“ค๋ณด๋‹ค ์žฅ๊ธฐ ์˜์กด์„ฑ(Long-term Dependencies)์„ ๋” ์ž˜ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๊ตฌ์กฐํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ(Structured Memory)๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํŠน์„ฑ์€ ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ์ „์ด ์„ฑ๋Šฅ์„ ๋ณด์žฅํ•˜๋Š” ๋ฐ ๋งค์šฐ ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

 

๋˜ํ•œ, ์ „์ด ํ•™์Šต(Transfer Learning) ์ค‘์—๋Š” ํƒœ์Šคํฌ ํŠน์ • ์ž…๋ ฅ ๋ณ€ํ™˜(Task-Specific Input Adaptations)์„ ํ†ตํ•ด ๊ตฌ์กฐํ™”๋œ ํ…์ŠคํŠธ ์ž…๋ ฅ(Structured Text Input)์„ ๋‹จ์ผ ์—ฐ์† ์‹œํ€€์Šค(Single Continuous Sequence)๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ˆœํšŒ ์Šคํƒ€์ผ ์ ‘๊ทผ๋ฒ•(Traversal-Style Approaches)์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฉฐ, ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ตœ์†Œํ•œ์œผ๋กœ ๋ณ€๊ฒฝ(Minimal Changes)ํ•˜๋ฉด์„œ๋„ ํšจ๊ณผ์ ์ธ Fine-Tuning์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

 

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ž์—ฐ์–ด ์ถ”๋ก (Natural Language Inference), ์งˆ๋ฌธ ์‘๋‹ต(Question Answering), ์˜๋ฏธ ์œ ์‚ฌ์„ฑ(Semantic Similarity), ํ…์ŠคํŠธ ๋ถ„๋ฅ˜(Text Classification)์˜ ๋„ค ๊ฐ€์ง€ ์–ธ์–ด ์ดํ•ด ์ž‘์—…์—์„œ ์ด ์ ‘๊ทผ๋ฒ•์„ ํ‰๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. General Task-Agnostic Model์€ ๊ฐ ์ž‘์—…์— ๋งž์ถฐ ์„ค๊ณ„๋œ ํŒ๋ณ„์  ํ•™์Šต ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ–ˆ์œผ๋ฉฐ, ์—ฐ๊ตฌ๋œ 12๊ฐœ์˜ ์ž‘์—… ์ค‘ 9๊ฐœ์—์„œ ์ตœ์ฒจ๋‹จ ๊ธฐ์ˆ (State-of-the-Art)์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ฃผ์š” ์„ฑ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • Commonsense Reasoning (Stories Cloze Test)์—์„œ 8.9%์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • Question Answering (RACE)์—์„œ 5.7%์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • Textual Entailment (MultiNLI)์—์„œ 1.5%์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • GLUE Benchmark์—์„œ 5.5%์˜ ์ ˆ๋Œ€์ ์ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ

๋˜ํ•œ, ๋„ค ๊ฐ€์ง€ ๋‹ค๋ฅธ ์„ค์ •์—์„œ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ์ œ๋กœ์ƒท ํ–‰๋™(Zero-Shot Behavior)์„ ๋ถ„์„ํ•˜์—ฌ, ๋ชจ๋ธ์ด ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…(Downstream Tasks)์„ ์œ„ํ•œ ์œ ์šฉํ•œ ์–ธ์–ด์  ์ง€์‹(Linguistic Knowledge)์„ ์Šต๋“ํ•˜๊ณ  ์žˆ์Œ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.


Related Work

Semi-Supervised Learning

Semi-supervised learning์€ ๋ผ๋ฒจ๋ง(Labeling)๊ณผ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜(Text Classification)์™€ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ๋งŽ์€ ๊ด€์‹ฌ์„ ๋ฐ›์•„์˜จ ์—ฐ๊ตฌ ๋ถ„์•ผ์ž…๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์ด ๋ฐœํ‘œ๋˜๊ธฐ ์ „๊นŒ์ง€์˜ ์ตœ์‹  ์—ฐ๊ตฌ๋“ค์€ unlabeled data๋ฅผ ELMo์™€ ๊ฐ™์€ ๋ชจ๋ธ์—์„œ ๋‹จ์–ด ์ˆ˜์ค€์˜ ์ •๋ณด๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐ ํ™œ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ณธ ์—ฐ๊ตฌ์˜ ๋ชฉํ‘œ๋Š” ์ด๋Ÿฌํ•œ ๋‹จ์–ด ์ˆ˜์ค€์— ๋จธ๋ฌด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ๋น„์ง€๋„ ๋ฐ์ดํ„ฐ(Unlabeled Data)๋ฅผ ๊ตฌ(Phrase)๋‚˜ ๋ฌธ์žฅ ์ˆ˜์ค€(Sentence Level)์—์„œ ๋” ๋†’์€ ์ˆ˜์ค€์œผ๋กœ ํ™œ์šฉํ•˜์—ฌ ์˜๋ฏธ๋ฅผ ํฌ์ฐฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Unsupervised Pre-Training

Unsupervised pre-training์˜ ๋ชฉ์ ์€ ์ดˆ๊ธฐ ์ข‹์€ ํ‘œํ˜„(Representation)์„ ์ฐพ๋Š” ๋ฐ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•™์Šต๋œ ํ‘œํ˜„์€ supervised learning ๋‹จ๊ณ„์—์„œ ๋” ์ž˜ ๋™์ž‘ํ•˜๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ตœ๊ทผ ์—ฐ๊ตฌ๋“ค์€ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜(Image Classification), ์Œ์„ฑ ์ธ์‹(Speech Recognition), ๊ธฐ๊ณ„ ๋ฒˆ์—ญ(Machine Translation) ๋“ฑ ์—ฌ๋Ÿฌ ๋ถ„์•ผ์—์„œ pre-training์ด ๋งค์šฐ ์œ ์šฉํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ๊ธฐ์กด์˜ LSTM๊ณผ ๊ฐ™์€ ๋ชจ๋ธ๋“ค์€ ๊ธด ๋ฌธ์žฅ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Šฅ๋ ฅ์ด ๋ถ€์กฑํ•˜์—ฌ, ์–ธ์–ด์˜ ์ •๋ณด๋ฅผ ์ถฉ๋ถ„ํžˆ ์ˆ˜์šฉํ•  ์ˆ˜ ์—†๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

 

In contrast, this paper demonstrates experimentally that the Transformer architecture can effectively capture long-range sentence structure. The model also performs strongly on a variety of language understanding tasks, including natural language inference, paraphrase detection, and story completion.

Auxiliary Training Objectives

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” unsupervised pre-training์˜ ๋ชฉ์  ํ•จ์ˆ˜(Objective Function)๋ฅผ supervised fine-tuning ๋‹จ๊ณ„์—์„œ auxiliary objective๋กœ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ๋‹จ์ˆœํžˆ supervised learning์˜ ๋ชฉ์  ํ•จ์ˆ˜์— unsupervised learning ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ๋”ํ•ด์ฃผ์—ˆ์œผ๋ฉฐ, ์ด๋ฅผ auxiliary objective๋ผ๊ณ  ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•๋ก ์€ ์ด๋ฏธ ์—ฌ๋Ÿฌ ์—ฐ๊ตฌ์—์„œ ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋‚ด๋ฉฐ ๊ทธ ํšจ๊ณผ๋ฅผ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋„ Generative Pre-Training์„ ํ†ตํ•ด ํ•™์Šต๋œ ํ‘œํ˜„์ด fine-tuning ๊ณผ์ •์—์„œ ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ, ์ „์ด ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๋ฐ ์„ฑ๊ณตํ–ˆ์Šต๋‹ˆ๋‹ค.


Framework

๋ณธ ์—ฐ๊ตฌ์˜ ํ•™์Šต ์ ˆ์ฐจ๋Š” ๋‘ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

  1. ๋น„์ง€๋„ ์‚ฌ์ „ ํ•™์Šต(Unsupervised Pre-Training): ๋Œ€๊ทœ๋ชจ ํ…์ŠคํŠธ ์ฝ”ํผ์Šค๋ฅผ ์ด์šฉํ•ด ๊ณ ์šฉ๋Ÿ‰ ์–ธ์–ด ๋ชจ๋ธ ํ•ฉ๋‹ˆ๋‹ค.
  2. ์ง€๋„ ๋ฏธ์„ธ ์กฐ์ •(Supervised Fine-Tuning): ์ดํ›„ ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํŒ๋ณ„์ (discriminative) ์ž‘์—…์— ๋งž์ถ”์–ด ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

Unsupervised Pre-Training

๋น„์ง€๋„ ํ† ํฐ ์ฝ”ํผ์Šค U={u1,...,un}๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์šฐ๋ฆฌ๋Š” ํ‘œ์ค€ ์–ธ์–ด ๋ชจ๋ธ๋ง ๋ชฉํ‘œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ ๊ฐ€๋Šฅ๋„๋ฅผ ์ตœ๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค.

Here k is the size of the context window, and the conditional probability P is modeled by a neural network with parameters Θ.

์ด๋Ÿฌํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๋“ค์€ ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(stochastic gradient descent)์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šต๋ฉ๋‹ˆ๋‹ค.

 

์šฐ๋ฆฌ์˜ ์‹คํ—˜์—์„œ๋Š” ์–ธ์–ด ๋ชจ๋ธ๋กœ์„œ ๋‹ค์ธต Transformer ๋””์ฝ”๋”๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” Transformer์˜ ๋ณ€ํ˜•์œผ๋กœ, ์ž…๋ ฅ ์ปจํ…์ŠคํŠธ ํ† ํฐ์— ๋Œ€ํ•ด ๋‹ค์ค‘ ํ—ค๋“œ self-attention ์ž‘์—…์„ ์ ์šฉํ•˜๊ณ , ์œ„์น˜๋ณ„ ํ”ผ๋“œํฌ์›Œ๋“œ ๋ ˆ์ด์–ด๋ฅผ ํ†ตํ•ด ๋ชฉํ‘œ ํ† ํฐ์— ๋Œ€ํ•œ ์ถœ๋ ฅ ๋ถ„ํฌ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Here U = (u−k, ..., u−1) is the context vector of tokens, n is the number of layers, We is the token embedding matrix, and Wp is the position embedding matrix.
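Written out (the original equations were images), the forward pass of the decoder stack from the paper is:

```latex
h_0 = U W_e + W_p \\
h_l = \mathrm{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) = \mathrm{softmax}\left(h_n W_e^{\top}\right)
```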

https://wikidocs.net/162096

Supervised Fine-Tuning

After training the model with the unsupervised pre-training objective, the parameters are adapted to the supervised target task. We assume a labeled dataset C, where each instance consists of a sequence of input tokens x1, ..., xm along with a label y.

์ž…๋ ฅ์€ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํ†ต๊ณผํ•˜์—ฌ ์ตœ์ข… Transformer ๋ธ”๋ก์˜ ํ™œ์„ฑํ™”๊ฐ’ h^m_l ์„ ์–ป์œผ๋ฉฐ, ์ด๋Š” y๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ๋งค๊ฐœ๋ณ€์ˆ˜ Wy๋ฅผ ๊ฐ€์ง„ ์ถ”๊ฐ€ ์„ ํ˜• ์ถœ๋ ฅ ๋ ˆ์ด์–ด์— ์ž…๋ ฅ๋ฉ๋‹ˆ๋‹ค.

์ด๋กœ์จ ๋‹ค์Œ ๋ชฉํ‘œ๋ฅผ ์ตœ๋Œ€ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๋Š” ๋˜ํ•œ ๋ฏธ์„ธ ์กฐ์ • ์‹œ ์–ธ์–ด ๋ชจ๋ธ๋ง์„ ๋ณด์กฐ ๋ชฉํ‘œ๋กœ ํฌํ•จํ•˜๋Š” ๊ฒƒ์ด (a) ์ง€๋„ ๋ชจ๋ธ์˜ ์ผ๋ฐ˜ํ™”๋ฅผ ๊ฐœ์„ ํ•˜๊ณ  (b) ์ˆ˜๋ ด ์†๋„๋ฅผ ๊ฐ€์†ํ™”ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์ด๋Ÿฌํ•œ ๋ณด์กฐ ๋ชฉํ‘œ๋ฅผ ํฌํ•จํ–ˆ์„ ๋•Œ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋œ๋‹ค๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•œ ์ด์ „ ์—ฐ๊ตฌ ๊ฒฐ๊ณผ์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์ฒด์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ชฉํ‘œ๋ฅผ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค (๊ฐ€์ค‘์น˜ λ์™€ ํ•จ๊ป˜).

๋ฐ˜์ ์œผ๋กœ, ๋ฏธ์„ธ ์กฐ์ • ๋™์•ˆ ํ•„์š”ํ•œ ์ถ”๊ฐ€ ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” Wy์™€ ๊ตฌ๋ถ„ ํ† ํฐ์˜ ์ž„๋ฒ ๋”ฉ ์ž…๋‹ˆ๋‹ค.

(Left) The Transformer architecture used in this work. (Right) Input transformations for fine-tuning on different tasks. All structured inputs are converted into token sequences that the pre-trained model can process, followed by a linear+softmax layer.

 

Supplementary illustration of the Transformer decoder.

Task-Specific Input Transformations

  • For tasks like text classification, the pre-trained model can be fine-tuned directly. However, tasks involving structured input, such as question answering or textual entailment, require some modifications.
  • Previous work proposed adding task-specific architectures on top of the transferred representations, but this demands additional customization and increases the training burden accordingly.

This work instead uses a traversal-style approach, converting structured inputs into a continuous sequence that the pre-trained model can process. This minimizes architecture changes and enables consistent transfer learning across diverse tasks.


Experiments

Setup - Unsupervised Learning

The language model is trained on the BooksCorpus dataset, which consists of over 7,000 unpublished books spanning genres such as adventure, fantasy, and romance, and contains long stretches of contiguous text. This lets the model learn long-range context and maintain contextual continuity.

 

By contrast, the 1B Word Benchmark used by ELMo is shuffled at the sentence level, which makes it hard to learn long-range structure. BooksCorpus avoids this limitation and provides a favorable environment for learning to condition on long-range information.

The model achieves a low token-level perplexity of 18.4 on BooksCorpus, enabling accurate next-token prediction. This reflects how well the dataset supports learning effective language representations.

Perpexity(ํผํ”Œ๋ ‰์‹œํ‹ฐ)?

LM์ด ์–ผ๋งˆ๋‚˜ ๋†’์€ ํ™•๋ฅ ๋กœ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ–ˆ๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ์ง€ํ‘œ์ž…๋‹ˆ๋‹ค. Perplexity ๋ฌธ์žฅ ์ƒ์„ฑํ™•๋ฅ ์˜ ์—ญ์ˆ˜๋ฅผ ์ทจํ•จ์œผ๋กœ ๋‚ฎ์„์ˆ˜๋ก ์ข‹์€๊ฐ’ ์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ '๋ฌธ์žฅ ์ƒ์„ฑ ํ™•๋ฅ ' ์ด๋ผ๋Š” ๊ฒƒ์€ ๋ฐ์ดํ„ฐ๋งˆ๋‹ค ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (ex. domain๋งˆ๋‹ค ์“ฐ์ด๋Š” ๋‹จ์–ด๋‚˜ ๊ตฌ๊ฐ€ ๋‹ค๋ฅผ ๊ฒƒ์œผ๋ฏ€๋กœ). ๋”ฐ๋ผ์„œ ์ผ๋ฐ˜์ ์ธ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ์—๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ๋‹ค๊ณ  ์•Œ๋ ค์ ธ ์žˆ์ง€๋งŒ ๋งŽ์ด ์“ฐ์ด๋Š” ์ง€ํ‘œ์ž…๋‹ˆ๋‹ค.

๋ชจ๋ธ ์‚ฌ์–‘

๋ณธ ์—ฐ๊ตฌ์—์„œ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜๋Š” Transformer์˜ ๊ตฌ์กฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฉฐ, ๋””์ฝ”๋” ์ „์šฉ Transformer๋กœ 12๊ฐœ์˜ ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ฃผ์š” ์„ค์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • Self-Attention: ๋งˆ์Šคํฌ๋“œ self-attention ํ—ค๋“œ๋ฅผ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, 768 ์ฐจ์›์˜ ์ƒํƒœ(hidden state)์™€ 12๊ฐœ์˜ attention ํ—ค๋“œ๋ฅผ ์ ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
  • Feedforward ๋„คํŠธ์›Œํฌ: 3072 ์ฐจ์›์˜ ๋‚ด๋ถ€ ์ƒํƒœ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ๋ ˆ์ด์–ด์—์„œ ์œ„์น˜๋ณ„ ํ”ผ๋“œํฌ์›Œ๋“œ ๋„คํŠธ์›Œํฌ๋ฅผ ๊ตฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ์ตœ์ ํ™” ๋ฐ ํ•™์Šต๋ฅ  ์Šค์ผ€์ค„๋ง: Adam ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, ์ตœ๋Œ€ ํ•™์Šต๋ฅ ์„ 2.5e-4๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•™์Šต๋ฅ ์€ ์ดˆ๊ธฐ 2000๋ฒˆ์˜ ์—…๋ฐ์ดํŠธ ๋™์•ˆ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๊ณ , ์ดํ›„ ์ฝ”์‚ฌ์ธ ์Šค์ผ€์ค„๋กœ ๊ฐ์†Œ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.
  • ํ•™์Šต ์„ค์ •:
    • ๋ฏธ๋‹ˆ๋ฐฐ์น˜: 64๊ฐœ์˜ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ 512๊ฐœ์˜ ํ† ํฐ ์‹œํ€€์Šค๋ฅผ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ 100 ์—ํฌํฌ ๋™์•ˆ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ๋“œ๋กญ์•„์›ƒ: ์ž”์—ฌ(residual), ์ž„๋ฒ ๋”ฉ(embedding), attention ๋ ˆ์ด์–ด์— 0.1์˜ ๋“œ๋กญ์•„์›ƒ ๋น„์œจ์„ ์ ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ์ •๊ทœํ™”: ์ˆ˜์ •๋œ L2 ์ •๊ทœํ™”๋ฅผ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, ๋น„ํŽธํ–ฅ(non-bias) ๊ฐ€์ค‘์น˜์— ๋Œ€ํ•ด w=0.01๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ํ™œ์„ฑํ™” ํ•จ์ˆ˜: Gaussian Error Linear Unit (GELU)์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ์œ„์น˜ ์ž„๋ฒ ๋”ฉ: ์›๋ž˜์˜ ์‚ฌ์ธํŒŒ ๋ฐฉ์‹ ๋Œ€์‹  ํ•™์Šต๋œ ์œ„์น˜ ์ž„๋ฒ ๋”ฉ(learned positional embeddings)์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ํ† ํฌ๋‚˜์ด์ง•: BooksCorpus์˜ ํ…์ŠคํŠธ๋Š” ftfy ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ ์ •๋ฆฌํ•˜๊ณ  spaCy ํ† ํฌ๋‚˜์ด์ €๋กœ ์ฒ˜๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค. Byte Pair Encoding (BPE)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 40,000๊ฐœ์˜ ๋ณ‘ํ•ฉ ์–ดํœ˜๋ฅผ ์ƒ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.
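As a quick illustration of the activation choice above, here is the exact GELU in plain Python. (GPT-style implementations often use a tanh-based approximation for speed; this sketch uses the exact erf form.)

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    Unlike ReLU, it is smooth and allows small negative outputs near 0."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For large positive inputs GELU approaches the identity, and for large negative inputs it approaches zero, like a smoothed ReLU.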

Fine-Tuning Details

During fine-tuning, most of the hyperparameters from unsupervised pre-training were reused. The main details are:

  • Dropout: a dropout rate of 0.1 was added to the classifier.
  • Learning rate and batch size:
    • Learning rate: 6.25e-5
    • Batch size: 32
  • Training efficiency: 3 epochs of fine-tuning were sufficient for most tasks.
  • Learning rate scheduling: a linear learning rate decay schedule with warm-up over 0.2% of training.
  • Auxiliary objective weight (λ): set to 0.5.
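The fine-tuning schedule above (0.2% warm-up, then linear decay) can be sketched as a small function. The exact step counting and endpoints here are assumptions for illustration, not the paper's reference implementation:

```python
def lr_at(step, total_steps, max_lr=6.25e-5, warmup_frac=0.002):
    """Linear warm-up over the first warmup_frac of training,
    then linear decay toward zero (a sketch of the schedule above)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # ramp from max_lr/warmup_steps up to max_lr
        return max_lr * (step + 1) / warmup_steps
    # linear decay from max_lr to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * (1.0 - progress)
```

Pre-training used the same warm-up idea (2,000 linear warm-up steps) but with cosine rather than linear decay.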

Supervised Fine-Tuning

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ์ž์—ฐ์–ด ์ถ”๋ก (Natural Language Inference), ์งˆ๋ฌธ ์‘๋‹ต(Question Answering), ์˜๋ฏธ ์œ ์‚ฌ์„ฑ(Semantic Similarity), ํ…์ŠคํŠธ ๋ถ„๋ฅ˜(Text Classification)์™€ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์ง€๋„ ํ•™์Šต(Supervised Learning) ์ž‘์—…์—์„œ ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ, GLUE ๋‹ค์ค‘ ์ž‘์—… ๋ฒค์น˜๋งˆํฌ(GLUE Multi-Task Benchmark)์— ํฌํ•จ๋œ ์ž‘์—…์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

์ž์—ฐ์–ด ์ถ”๋ก  (Natural Language Inference, NLI)

์ž์—ฐ์–ด ์ถ”๋ก (NLI)๋Š” ํ…์ŠคํŠธ ํ•จ์˜ ์ธ์‹(Textual Entailment Recognition)์œผ๋กœ๋„ ์•Œ๋ ค์ ธ ์žˆ์œผ๋ฉฐ, ๋‘ ๊ฐœ์˜ ๋ฌธ์žฅ์„ ๋น„๊ตํ•˜์—ฌ ํ•จ์˜(Entailment), ๋ชจ์ˆœ(Contradiction), ์ค‘๋ฆฝ(Neutral) ์ค‘ ํ•˜๋‚˜๋กœ ๊ด€๊ณ„(Relationship)๋ฅผ ํŒ๋‹จํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ์ด ์ž‘์—…์€ ์–ดํœ˜์  ํ•จ์˜(Lexical Entailment), ๊ณต๋™ ์ฐธ์กฐ(Coreference), ์–ดํœ˜์  ๋ฐ ๊ตฌ๋ฌธ์  ๋ชจํ˜ธ์„ฑ(Lexical and Syntactic Ambiguity) ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ์–ธ์–ด ํ˜„์ƒ์œผ๋กœ ์ธํ•ด ์—ฌ์ „ํžˆ ์–ด๋ ค์šด ๊ณผ์ œ๋กœ ๊ฐ„์ฃผ๋ฉ๋‹ˆ๋‹ค.

 

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ์ถœ์ฒ˜์—์„œ ์ˆ˜์ง‘๋œ ๋‹ค์„ฏ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•ด NLI ์ž‘์—…์„ ํ‰๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค:

  • SNLI (Stanford Natural Language Inference): ์ด๋ฏธ์ง€ ์บก์…˜์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ํ…์ŠคํŠธ
  • MNLI (Multi-Genre Natural Language Inference): ์Œ์„ฑ ์ „์‚ฌ, ๋Œ€์ค‘ ์†Œ์„ค, ์ •๋ถ€ ๋ณด๊ณ ์„œ ๋“ฑ ๋‹ค์–‘ํ•œ ์ถœ์ฒ˜์˜ ํ…์ŠคํŠธ
  • QNLI (Question Natural Language Inference): ์œ„ํ‚คํ”ผ๋””์•„ ๋ฌธ์„œ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ํ…์ŠคํŠธ
  • SciTail: ๊ณผํ•™ ์‹œํ—˜ ๋ฌธ์ œ
  • RTE (Recognizing Textual Entailment): ๋‰ด์Šค ๊ธฐ์‚ฌ ๋ฐ ์ •๋ถ€ ๋ณด๊ณ ์„œ ๊ธฐ๋ฐ˜์˜ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹

Experimental Results

Table 2 : Experimental results on natural language inference tasks, comparing our model with current state-of-the-art methods. "5x" indicates an ensemble of 5 models. All datasets use accuracy as the evaluation metric.

The results in Table 2 show that the model performs strongly on NLI tasks compared with existing state-of-the-art approaches. Highlights:

  • MNLI: 1.5% absolute improvement
  • SciTail: 5% absolute improvement
  • QNLI: 5.8% absolute improvement
  • SNLI: 0.6% absolute improvement

These results suggest the model is better at reasoning over multiple sentences and handles linguistic ambiguity particularly well.

On RTE, a comparatively small dataset of 2,490 examples, the model achieves 56% accuracy, below the 61.7% reported by a multi-task biLSTM model. Given its strong performance on the larger NLI datasets, however, the model would likely gain additional benefit from multi-task training, though this possibility was not explored further here.

 

์งˆ๋ฌธ ์‘๋‹ต ๋ฐ ์ƒ์‹ ์ถ”๋ก  (Question Answering and Commonsense Reasoning)

Table 3 : ์งˆ๋ฌธ ์‘๋‹ต ๋ฐ ์ƒ์‹ ์ถ”๋ก ์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ์šฐ๋ฆฌ ๋ชจ๋ธ๊ณผ ํ˜„์žฌ ์ตœ์ฒจ๋‹จ ๋ฐฉ๋ฒ•๋“ค์„ ๋น„๊ตํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ "9x"๋Š” 9๊ฐœ์˜ ๋ชจ๋ธ๋กœ ๊ตฌ์„ฑ๋œ ์•™์ƒ๋ธ”์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

์งˆ๋ฌธ ์‘๋‹ต(Question Answering)์€ ๋‹จ์ผ ๋ฌธ์žฅ ๋ฐ ๋‹ค์ค‘ ๋ฌธ์žฅ ์ถ”๋ก (Single and Multi-Sentence Reasoning)์„ ์š”๊ตฌํ•˜๋Š” ์ž‘์—…์œผ๋กœ, ๊ธด ๋ฒ”์œ„์˜ ์ปจํ…์ŠคํŠธ(Context)๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” RACE ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. RACE๋Š” ์ค‘ํ•™๊ต์™€ ๊ณ ๋“ฑํ•™๊ต ์‹œํ—˜์˜ ์˜์–ด ์ง€๋ฌธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์งˆ๋ฌธ ์‘๋‹ต ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ, CNN์ด๋‚˜ SQuAD์™€ ๊ฐ™์€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹๋ณด๋‹ค ๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ ์ถ”๋ก ์„ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ธด ๋ฌธ๋งฅ(Long Context)์„ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ, Story Cloze Test์—์„œ๋„ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ–ˆ๋Š”๋ฐ, ์ด ์ž‘์—…์€ ๋‘ ๊ฐ€์ง€ ์„ ํƒ์ง€ ์ค‘ ์˜ฌ๋ฐ”๋ฅธ ์ด์•ผ๊ธฐ ๊ฒฐ๋ง์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‘ ์ž‘์—…์—์„œ ๋ณธ ์—ฐ๊ตฌ์˜ ๋ชจ๋ธ์€ ์ด์ „ ์ตœ๊ณ  ์„ฑ๊ณผ๋ฅผ ์ƒ๋‹นํžˆ ์ดˆ๊ณผํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค:

  • Story Cloze Test: 8.9%์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • RACE: ์ „์ฒด์ ์œผ๋กœ 5.7%์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ

์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋Š” ๋ณธ ์—ฐ๊ตฌ์˜ Generative Pre-Training ๋ชจ๋ธ์ด ๊ธด ๋ฒ”์œ„์˜ ์ปจํ…์ŠคํŠธ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

 

Semantic Similarity

Semantic similarity tasks involve predicting whether two sentences are semantically equivalent or not.

This requires recognizing paraphrases, understanding negation, and handling syntactic ambiguity. Three datasets are used for evaluation:

  • Microsoft Paraphrase Corpus (MRPC)
  • Quora Question Pairs (QQP)
  • Semantic Textual Similarity Benchmark (STS-B)

The model achieves state-of-the-art results on two of the three semantic similarity tasks, with a 1-point absolute gain on STS-B. On QQP, it improves 4.2% over a single-task BiLSTM + ELMo + Attn model, a notably strong result on this task.

Classification

Finally, two text classification tasks are evaluated.

Table 4 : Semantic similarity and classification results, comparing our model with current state-of-the-art methods. All task evaluations in this table use the GLUE benchmark.

  • Corpus of Linguistic Acceptability (CoLA): contains expert judgments of whether a sentence is grammatical, and tests the linguistic bias a model has internalized.
  • Stanford Sentiment Treebank (SST-2): a standard binary classification task evaluating positive/negative sentiment analysis.

The model scores 45.4 on CoLA, a large jump over the previous best of 35.0, showing that it has learned linguistic bias well. On SST-2 it achieves 91.3% accuracy, competitive with state-of-the-art results.

The model also scores 72.8 on the GLUE benchmark, a significant improvement over the previous best of 68.9.

 

์ „๋ฐ˜์ ์ธ ์„ฑ๊ณผ

๋ณธ ์—ฐ๊ตฌ์˜ Generative Pre-Training ์ ‘๊ทผ๋ฒ•์€ ํ‰๊ฐ€ํ•œ 12๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹ ์ค‘ 9๊ฐœ์—์„œ ์ƒˆ๋กœ์šด ์ตœ์ฒจ๋‹จ(State-of-the-Art) ๊ฒฐ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ์•™์ƒ๋ธ” ๋ชจ๋ธ(Ensemble Models)๊ณผ ๋น„๊ตํ•ด๋„ ๋‹จ์ผ ๋ชจ๋ธ๋กœ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ, ์ ‘๊ทผ๋ฒ•์˜ ํšจ์œจ์„ฑ๊ณผ ๊ฐ•๋ ฅํ•จ์„ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ, ๋ณธ ์ ‘๊ทผ๋ฒ•์€ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ ์ผ๊ด€๋˜๊ฒŒ ์šฐ์ˆ˜ํ•œ ์„ฑ๊ณผ๋ฅผ ๋ฐœํœ˜ํ–ˆ์Šต๋‹ˆ๋‹ค.

  • ์˜ˆ๋ฅผ ๋“ค์–ด Embedding(์ž„๋ฒ ๋”ฉ)์€ ๋ฐ์ดํ„ฐ์…‹ (STS-B, ์•ฝ 5,700๊ฐœ์˜ ํ•™์Šต ์˜ˆ์ œ)๋ถ€ํ„ฐ ๊ฐ€์žฅ ํฐ ๋ฐ์ดํ„ฐ์…‹ (SNLI, ์•ฝ 55๋งŒ ๊ฐœ์˜ ํ•™์Šต ์˜ˆ์ œ)๊นŒ์ง€ ๋ชจ๋‘์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š”  ์ „์ด ํ•™์Šต(Transfer Learning) ์ ‘๊ทผ๋ฒ•์ด ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹ ๊ทœ๋ชจ์™€ ์œ ํ˜•์— ์ ์‘ํ•  ์ˆ˜ ์žˆ๋Š” ์œ ์—ฐ์„ฑ๊ณผ ํ™•์žฅ์„ฑ์„ ๊ฐ€์กŒ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.


Analysis

์ „์ด๋œ ๋ ˆ์ด์–ด ์ˆ˜์˜ ์˜ํ–ฅ (Impact of Number of Layers Transferred)

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋น„์ง€๋„ ์‚ฌ์ „ ํ•™์Šต(Unsupervised Pre-Training)์—์„œ ํ•™์Šต๋œ ๋ ˆ์ด์–ด ์ˆ˜๋ฅผ ์ „์ด(Transferred)ํ•  ๋•Œ, ์ง€๋„ ๋Œ€์ƒ ์ž‘์—…(Supervised Target Task)์—์„œ์˜ ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค. Figure 2(์™ผ์ชฝ)๋Š” ์ „์ด๋œ ๋ ˆ์ด์–ด ์ˆ˜์— ๋”ฐ๋ฅธ MultiNLI์™€ RACE์—์„œ์˜ ์„ฑ๋Šฅ ๋ณ€ํ™”๋ฅผ ์‹œ๊ฐ์ ์œผ๋กœ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

์ฃผ์š” ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Figure 2 : (์™ผ์ชฝ) ์‚ฌ์ „ ํ•™์Šต๋œ ์–ธ์–ด ๋ชจ๋ธ์—์„œ ๋” ๋งŽ์€ ๋ ˆ์ด์–ด๋ฅผ ์ „์ดํ•  ๋•Œ RACE์™€ MultiNLI์—์„œ์˜ ํšจ๊ณผ. (์˜ค๋ฅธ์ชฝ) ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ์–ธ์–ด ๋ชจ๋ธ ์‚ฌ์ „ ํ•™์Šต ์—…๋ฐ์ดํŠธ์— ๋”ฐ๋ฅธ ์ œ๋กœ์ƒท ์„ฑ๋Šฅ์˜ ๋ณ€ํ™” ์–‘์ƒ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ทธ๋ž˜ํ”„. ๊ฐ ์ž‘์—…์˜ ์„ฑ๋Šฅ์€ ๋ฌด์ž‘์œ„ ์ถ”์ธก ๋ฒ ์ด์Šค๋ผ์ธ๊ณผ ๋‹จ์ผ ๋ชจ๋ธ๋กœ ๋‹ฌ์„ฑํ•œ ํ˜„์žฌ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ ์‚ฌ์ด์—์„œ ์ •๊ทœํ™”๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

  • Transferring the embeddings alone already improves performance.
  • Each additional Transformer layer transferred provides further benefit, up to 9% on MultiNLI for full transfer.
  • This suggests that each layer of the pre-trained model contains functionality useful for solving the target task.

These results show that the generative pre-training approach exploits the model's hierarchical learning structure to transfer linguistic patterns and features at multiple levels. The more layers transferred, the more useful functionality is preserved, yielding consistent performance gains on the target task.

Table 5 : Analysis of various model ablations on different tasks. The average score (Avg. score) is an unweighted average across all results.

์ œ๋กœ์ƒท ํ–‰๋™ (Zero-shot Behaviors)

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” Transformer์˜ ์–ธ์–ด ๋ชจ๋ธ ์‚ฌ์ „ ํ•™์Šต์ด ์™œ ํšจ๊ณผ์ ์ธ์ง€์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๋†’์ด๊ณ ์ž ์ œ๋กœ์ƒท(Zero-Shot) ํ–‰๋™์„ ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ œ๋กœ์ƒท ํ•™์Šต์ด๋ž€, ์ง€๋„ ๋ฏธ์„ธ ์กฐ์ •(Supervised Fine-Tuning) ์—†์ด ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

 

ํ•˜๋‚˜์˜ ๊ฐ€์„ค์€ ๊ธฐ๋ณธ ์ƒ์„ฑ ๋ชจ๋ธ(Generative Model)์ด ์–ธ์–ด ๋ชจ๋ธ๋ง ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ํ•™์Šต๋˜๋Š” ๊ณผ์ •์—์„œ ์šฐ๋ฆฌ๊ฐ€ ํ‰๊ฐ€ํ•˜๋Š” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•™์Šต๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ, Transformer์˜ ๊ตฌ์กฐํ™”๋œ ์ฃผ์˜ ๋ฉ”๋ชจ๋ฆฌ(Attentional Memory)๊ฐ€ LSTM๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ ์ „์ด(Transfer)์— ๋„์›€์ด ๋œ๋‹ค๋Š” ์ ๋„ ์ฃผ๋ชฉํ–ˆ์Šต๋‹ˆ๋‹ค.

 

๋˜ํ•œ ์ƒ์„ฑ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ง€๋„ ๋ฏธ์„ธ ์กฐ์ • ์—†์ด ์—ฌ๋Ÿฌ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ํœด๋ฆฌ์Šคํ‹ฑ ์†”๋ฃจ์…˜(Heuristic Solutions)์„ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค. Figure 2(์˜ค๋ฅธ์ชฝ)์—์„œ๋Š” ์ƒ์„ฑ ์‚ฌ์ „ ํ•™์Šต ๋™์•ˆ ์ด๋Ÿฌํ•œ ํœด๋ฆฌ์Šคํ‹ฑ์˜ ์„ฑ๋Šฅ์„ ์‹œ๊ฐ์ ์œผ๋กœ ๋‚˜ํƒ€๋ƒˆ์Šต๋‹ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, ํœด๋ฆฌ์Šคํ‹ฑ ์„ฑ๋Šฅ์ด ํ•™์Šต ๊ณผ์ • ๋™์•ˆ ์•ˆ์ •์ ์ด๊ณ  ๊พธ์ค€ํžˆ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ์ด๋Š” ์ƒ์„ฑ ์‚ฌ์ „ ํ•™์Šต์ด ๋‹ค์–‘ํ•œ ์ž‘์—…๊ณผ ๊ด€๋ จ๋œ ๊ธฐ๋Šฅ ํ•™์Šต(Feature Learning)์„ ์ง€์›ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

 

ํŠนํžˆ, LSTM์€ ์ œ๋กœ์ƒท ์„ฑ๋Šฅ์—์„œ ๋†’์€ ๋ณ€๋™์„ฑ์„ ๋ณด์˜€๋Š”๋ฐ, ์ด๋Š” Transformer ์•„ํ‚คํ…์ฒ˜์˜ ๊ท€๋‚ฉ์  ํŽธํ–ฅ(Inductive Bias)์ด ์ „์ด ํ•™์Šต์— ๋” ์œ ๋ฆฌํ•˜๋‹ค๋Š” ์ ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

Task-Specific Heuristics

  • CoLA (linguistic acceptability): examples are scored by the average token log-probability the generative model assigns, and predictions are made by thresholding.
  • SST-2 (sentiment analysis): the token "very" is appended to each example, the output distribution is restricted to the words "positive" and "negative", and the token assigned higher probability is taken as the prediction.
  • RACE (question answering): the answer the generative model assigns the highest average token log-probability when conditioned on the document and question is selected.
  • DPRD (Winograd schemas): the definite pronoun is replaced with the two possible referents, and the resolution the generative model assigns a higher average token log-probability is predicted.
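The SST-2 heuristic above can be sketched in a few lines. Here `token_logprob` is a placeholder for the language model's scoring interface (an assumption for illustration; any autoregressive LM that returns the log-probability of a next token would do):

```python
def zero_shot_sentiment(sentence_tokens, token_logprob):
    """Append 'very' and compare the model's log-probability of
    'positive' vs. 'negative' as the next token (the SST-2 heuristic).
    token_logprob(context, token) -> float is a hypothetical LM interface."""
    context = sentence_tokens + ["very"]
    scores = {label: token_logprob(context, label)
              for label in ("positive", "negative")}
    return max(scores, key=scores.get)  # label with higher probability
```

The other heuristics follow the same pattern: recast the task as comparing sequence log-probabilities under the generative model, with no task-specific parameters at all.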

Ablation Studies

Three ablation studies were performed to analyze the factors affecting model performance (see Table 5):

  1. Effect of the auxiliary language modeling objective:
    • Examining performance when fine-tuning without the auxiliary objective shows that the auxiliary objective helps on the NLI tasks and QQP.
    • The larger datasets benefit from the auxiliary objective while the smaller ones do not, suggesting it is most useful with large amounts of training data.
  2. Comparison with an LSTM:
    • Comparing a single-layer 2048-unit LSTM against the Transformer in the same framework, using the LSTM drops the average score by 5.6 points.
    • The LSTM outperforms the Transformer on only one dataset (MRPC), reaffirming the Transformer's architectural advantage.
  3. Comparison with a model without pre-training:
    • Compared with a Transformer architecture trained directly on the supervised target tasks, the lack of pre-training hurts performance on all tasks.
    • Performance drops 14.8% relative to the full pre-trained model, clearly demonstrating the importance of pre-training.

Conclusion

์ƒ์„ฑ ์‚ฌ์ „ ํ•™์Šต๊ณผ ํŒ๋ณ„์  ๋ฏธ์„ธ ์กฐ์ •์„ ํ†ตํ•ด ๋‹จ์ผ ํƒœ์Šคํฌ ๋น„์˜์กด ๋ชจ๋ธ๋กœ ๊ฐ•๋ ฅํ•œ ์ž์—ฐ์–ด ์ดํ•ด๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์†Œ๊ฐœํ–ˆ์Šต๋‹ˆ๋‹ค. ์—ฐ์†๋œ ํ…์ŠคํŠธ๋กœ ๊ตฌ์„ฑ๋œ ๋‹ค์–‘ํ•œ ์ฝ”ํผ์Šค์—์„œ ์‚ฌ์ „ ํ•™์Šตํ•จ์œผ๋กœ์จ, ์šฐ๋ฆฌ ๋ชจ๋ธ์€ ์ƒ๋‹นํ•œ ์„ธ๊ณ„ ์ง€์‹๊ณผ ์žฅ๊ธฐ ์˜์กด์„ฑ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Šฅ๋ ฅ์„ ์Šต๋“ํ•˜๊ฒŒ ๋˜์—ˆ์œผ๋ฉฐ, ์ด๋Š” ์งˆ๋ฌธ ์‘๋‹ต, ์˜๋ฏธ ์œ ์‚ฌ์„ฑ ํ‰๊ฐ€, ํ•จ์˜ ๊ฒฐ์ •, ํ…์ŠคํŠธ ๋ถ„๋ฅ˜์™€ ๊ฐ™์€ ํŒ๋ณ„์  ์ž‘์—…์„ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐ ์„ฑ๊ณต์ ์œผ๋กœ ์ „์ด๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์šฐ๋ฆฌ๊ฐ€ ์—ฐ๊ตฌํ•œ 12๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹ ์ค‘ 9๊ฐœ์—์„œ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

ํŒ๋ณ„์  ์ž‘์—…์—์„œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋น„์ง€๋„ (์‚ฌ์ „) ํ•™์Šต์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ์˜ค๋žซ๋™์•ˆ ๊ธฐ๊ณ„ ํ•™์Šต ์—ฐ๊ตฌ์˜ ์ค‘์š”ํ•œ ๋ชฉํ‘œ์˜€์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ์—ฐ๊ตฌ๋Š” ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•˜๋Š” ๊ฒƒ์ด ์‹ค์ œ๋กœ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์‹œ์‚ฌํ•˜๋ฉฐ, ์–ด๋–ค ๋ชจ๋ธ(Transformer)๊ณผ ๋ฐ์ดํ„ฐ์…‹(์žฅ๊ธฐ ์˜์กด์„ฑ์„ ๊ฐ€์ง„ ํ…์ŠคํŠธ)์ด ์ด ์ ‘๊ทผ ๋ฐฉ์‹๊ณผ ๊ฐ€์žฅ ์ž˜ ๋งž๋Š”์ง€์— ๋Œ€ํ•œ ๋‹จ์„œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์ด๊ฒƒ์ด ์ž์—ฐ์–ด ์ดํ•ด์™€ ๋‹ค๋ฅธ ๋„๋ฉ”์ธ์—์„œ ๋น„์ง€๋„ ํ•™์Šต์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด ์—ฐ๊ตฌ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜์—ฌ, ๋น„์ง€๋„ ํ•™์Šต์ด ์–ด๋–ป๊ฒŒ, ์–ธ์ œ ํšจ๊ณผ์ ์œผ๋กœ ์ž‘๋™ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๋”์šฑ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ๋ฅผ ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค.