[NLP] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
This time, I'll summarize what I studied about the BART model.

What is BART?


BART (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence model introduced by Facebook AI (now Meta AI) in 2019. BART combines the strengths of BERT and GPT.

 

BERT ๋ชจ๋ธ์˜ Bidrectional(์–‘๋ฐฉํ–ฅ)์œผ๋กœ ์–ธ์–ด Sequence์˜ Token๋“ค์„ Attention ๋งค์ปค๋‹ˆ์ฆ˜์— ๋ฐ˜์˜ํ•˜์—ฌ ๋ฌธ์ž๋ฅผ Encoding ํ•˜๋Š” ๋‚ด์šฉ, 

GPT์˜ Generative Decoder๋ฅผ ํ™œ์šฉํ•œ, ์ด๋•Œ๊นŒ์ง€์˜ ์ž…๋ ฅ์„ ๋ฐ”ํƒ•์œผ๋กœ ์ƒˆ๋กœ์šด ์ถœ๋ ฅ์„ ๋งŒ๋“œ๋Š” Generative model ์ž…๋‹ˆ๋‹ค.

์ •๋ฆฌํ•˜๋ฉด, ๊ธฐ๋ณธ์˜ Sequence-to-Sequence Transformer Model์„ ์ƒˆ๋กœ์€ Pre-Training Objective๋ฅผ ํ†ตํ•ด Train ์‹œ์ผœ ํ•˜๋‚˜๋กœ ํ•ฉ์นœ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.


Abstract

BART adopts a denoising task as its pre-training objective and is built on a sequence-to-sequence (Seq2Seq) architecture. It is trained by corrupting text with arbitrary noise and learning to restore it to the original text.

 

Seq2Seq์˜ Encoder๋Š” BERT์™€ ์œ ์‚ฌํ•œ ์–‘๋ฐฉํ–ฅ Encoder์˜ ํŠน์„ฑ์„ ๊ฐ€์ง€๋ฉฐ, Decoder๋Š” GPT์™€ ๊ฐ™์€ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ž‘๋™ํ•˜๋Š” Auto-Regressive ํŠน์„ฑ์„ ์ง€๋‹ˆ๊ณ  ์žˆ์–ด, BART๋Š” BERT์™€ GPT์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•œ ๋ชจ๋ธ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์–‘ํ•œ ๋…ธ์ด์ง• ๊ธฐ๋ฒ•์„ ํ‰๊ฐ€ํ•˜์—ฌ ๋ฌธ์žฅ์˜ ์ˆœ์„œ๋ฅผ ์ž„์˜๋กœ ์„ž๋Š” ๊ฒƒ๊ณผ Text Infilling ์Šคํ‚ด(์ŠคํŒฌ ๋‹จ์œ„์˜ ํ…์ŠคํŠธ๊ฐ€ ํ•˜๋‚˜์˜ ๋งˆ์Šคํฌ ํ† ํฐ์œผ๋กœ ์น˜ํ™˜๋จ)์„ ์‚ฌ์šฉํ•  ๋•Œ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์Œ์„ ๋ฐœ๊ฒฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. BART๋Š” ํŠนํžˆ ํ…์ŠคํŠธ ์ƒ์„ฑ์— ๋Œ€ํ•ด fine-tuned ๋˜์—ˆ์„ ๋•Œ ํšจ์œจ์ ์ด๋ฉฐ, ์ดํ•ด๋ ฅ ํ…Œ์ŠคํŠธ์—์„œ๋„ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

 

BART matches RoBERTa's performance on GLUE and SQuAD, improves ROUGE scores by up to 6 points, and achieves state-of-the-art (SOTA) results on abstractive dialogue, question answering, and summarization tasks. Ablation experiments further validate which parts of the model drive its performance.


Introduction

์ž๊ธฐ ์ง€๋„ ํ•™์Šต(Self-supervised learning)์€ ๋‹ค์–‘ํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP) ํƒœ์Šคํฌ์—์„œ ๊ด„๋ชฉํ•  ๋งŒํ•œ ์„ฑ๊ณผ๋ฅผ ๋ณด์—ฌ์™”์Šต๋‹ˆ๋‹ค.

๋Œ€ํ‘œ์ ์ธ ๋ชจ๋ธ๋กœ๋Š” Word2Vec, ELMo, BERT, SpanBERT, XLNet, RoBERTa ๋“ฑ์ด ์žˆ์œผ๋ฉฐ, ์ด๋“ค ์ค‘ ๊ฐ€์žฅ ์„ฑ๊ณต์ ์ธ ์ ‘๊ทผ๋ฒ•์€ Denoising Autoencoder๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ Masked Language Model(MLM) ๋ณ€ํ˜•๋“ค์ž…๋‹ˆ๋‹ค.

However, existing MLM-based models have the drawback of being optimized for specific tasks, which limits how broadly they can be applied.

 

The main limitations of MLM-based models are:

  1. Optimization for specific tasks: MLM is optimized primarily for text understanding, so performance can be limited on other categories of tasks such as text generation.
  2. Structural limitation: because models like BERT are encoder-only, they are hard to apply directly to natural language generation (NLG) tasks.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด, Seq2Seq ๋ชจ๋ธ๋กœ ๊ตฌํ˜„๋œ denoising autoencoder์ธ Bidirectional and Auto-Regressive Transformers, BART๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค

 

BART is a Seq2Seq (sequence-to-sequence) model that combines a bidirectional encoder with an auto-regressive decoder, integrating the strengths of BERT and GPT for a more flexible and expressive model.

๊ตฌ์ฒด์ ์œผ๋กœ, BART๋Š” ๋‹ค์–‘ํ•œ noising functions์„ ์ ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ๋ฅผ ์ž„์˜๋กœ ๋ณ€ํ˜•์‹œํ‚ค๊ณ , ์ด๋ฅผ ์›๋ณธ ํ…์ŠคํŠธ๋กœ ๋ณต์›ํ•˜๋Š” Denoising Autoencoder ๋ฐฉ์‹์œผ๋กœ ์‚ฌ์ „ ํ•™์Šต๋ฉ๋‹ˆ๋‹ค.

Figure 1: A schematic comparison of BART with BERT (Devlin et al., 2019) and GPT (Radford et al., 2018).

(a) BERT: random tokens are replaced with masks, and the document is encoded bidirectionally. Because the missing tokens are predicted independently, BERT cannot easily be used for generation.

(b) GPT: tokens are predicted auto-regressively, so GPT can be used for generation. However, each word can only condition on leftward context, so the model cannot learn bidirectional interactions.

(c) BART: inputs to the encoder need not be aligned with decoder outputs, allowing arbitrary noise transformations. Here, a document has been corrupted by replacing spans of text with mask symbols. The corrupted document (left) is encoded with a bidirectional model, and the likelihood of the original document (right) is then computed with an auto-regressive decoder. For fine-tuning, an uncorrupted document is input to both the encoder and decoder, and representations from the decoder's final hidden state are used.

 

This means arbitrary transformations can be used to noise the original text: any transformation can be applied directly to the existing text, including ones that change its length. The best-performing setup presented in the paper randomly shuffles the order of the original sentences and then replaces arbitrary-length spans of text with a single [MASK] token.

์ด ๋ง์€ ์ฆ‰, BERT์˜ ๊ธฐ์กด ๋ฐฉ๋ฒ•๋ก ์ธ ๋‹จ์–ด Masking & Next Sentence Prediction ๊ธฐ๋ฒ•์„ ์ผ๋ฐ˜ํ™” ํ–ˆ์œผ๋ฉฐ, ๋ชจ๋ธ์ด ์ „์ฒด์ ์œผ๋กœ ๋ฌธ์žฅ์˜ ๊ธธ์ด์— ๋ฐํ•˜์—ฌ ํ•™์Šต ๋ฐ ๋ณ€ํ˜•๋œ Input์— ๋” ๋งŽ์ด Attention ํ•˜๋Š” ํšจ๊ณผ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

 

BART is especially effective when fine-tuned for text generation, but it also performs well on comprehension tests. With comparable training, it matched RoBERTa's performance on SQuAD and GLUE, and it achieved SOTA performance on abstractive dialogue, question answering, and summarization tasks. Notably, it improved on the previous SOTA for the XSum dataset by about 6 ROUGE points. The paper also presents a new approach to machine translation, in which a few additional transformer layers are stacked onto the model during fine-tuning.


Model

BART Model์€ ์ž„์˜๋กœ ๋ณ€ํ˜•๋œ ๋ฌธ์„œ์˜ ๋‚ด์šฉ์„ ์›๋ž˜๋Œ€๋กœ ๋˜๋Œ๋ฆฌ๋Š” Denoising Autoencoder ์ž…๋‹ˆ๋‹ค.

๊ตฌํ˜„์€ Corrupted Text์— ๋Œ€ํ•œ Bidrectional Encoder & Left-to-Right Autogressive Decoder๋กœ ๊ตฌ์„ฑ๋œ Sequence-to-Sequence ๋ชจ๋ธ๋กœ ์ด๋ฃจ์–ด ์กŒ์Šต๋‹ˆ๋‹ค. Pre-train์„ ์œ„ํ•ด์„œ๋Š” ์›๋ณธ ๋ฌธ์„œ์— ๋Œ€ํ•œ Negative Log-Likelihood(NLL) Loss๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

Architecture

BART is based on the standard sequence-to-sequence Transformer architecture, but, like GPT, it uses GeLU as the activation function.

 

ํŒŒ๋ผ๋ฏธํ„ฐ ์ดˆ๊ธฐํ™”๋Š” ํ‰๊ท ์ด 0์ด๊ณ  ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ 0.02์ธ ์ •๊ทœ๋ถ„ํฌ(N(0, 0.02))๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.

Base ๋ชจ๋ธ์€ Encoder์™€ Decoder ๊ฐ๊ฐ์— 6๊ฐœ์˜ ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , Large ๋ชจ๋ธ์€ 12๊ฐœ์˜ ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

 

The architecture is similar to BERT's, but with a few differences:

  • Cross-attention in the decoder: each decoder layer performs cross-attention over the encoder's final hidden layer.
  • Feed-forward network usage: BERT uses an additional feed-forward network before word prediction, whereas BART does not.
  • Parameter count: overall, BART has roughly 10% more parameters than BERT.

Pre-Training BART

BART Model์€ Corrupted Document(๋ณ€ํ˜•๋œ ๋ฌธ์„œ)๋ฅผ ์›๋ณตํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ Pre-Training(์‚ฌ์ „ ํ•™์Šต)์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

The reconstruction loss is the cross-entropy between the decoder's output and the original document.
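
A minimal PyTorch sketch of this loss, with stand-in tensors in place of real model outputs:

```python
# Sketch: the reconstruction loss is token-level cross-entropy between the
# decoder's logits and the original (uncorrupted) token ids. Stand-in
# tensors replace real model outputs here.
import torch
import torch.nn.functional as F

vocab_size = 50265                                    # BART's vocabulary size
logits = torch.randn(4, 32, vocab_size)               # decoder outputs (batch, seq, vocab)
original_ids = torch.randint(0, vocab_size, (4, 32))  # original document tokens

loss = F.cross_entropy(logits.view(-1, vocab_size), original_ids.view(-1))
print(loss.item())  # negative log-likelihood of the original document
```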

Five denoising methods were used for pre-training, listed below (a toy implementation is sketched after the list).

Figure 2: Transformations for noising the input that are experimented with. These transformations can be composed.

  • Token Masking: as in BERT's MLM, random tokens are replaced with [MASK].
  • Token Deletion: random tokens are deleted. Unlike MLM's token masking, the model must also figure out at which positions tokens were deleted.
  • Text Infilling: text spans with lengths sampled from a Poisson distribution (λ = 3) are each replaced with a single [MASK] token. This is similar to the scheme proposed in SpanBERT, except that SpanBERT samples from a different distribution and replaces each span with exactly as many [MASK] tokens as its length. Text infilling forces the model to predict how many tokens are missing from a span.
For example, given the two sentences 'ABC.DE.': in the first sentence a span of length 2 ('BC') is sampled and replaced with a single [MASK] token, while in the second sentence a span of length 0 (empty text) is sampled, so a [MASK] token is simply inserted.
A span here can be thought of as a run of text tokens. Since span lengths are drawn from a Poisson distribution with λ = 3, lengths roughly between 0 and 6 are typically sampled.
  • Sentence Permutation: the document is split into several parts, which are shuffled into a random order; the model must recover the original order.
In the example, the original 'ABC.DE.' becomes 'DE.ABC.'.
The paper delimits sentences using full stops.
  • Document Rotation: a token is chosen uniformly at random, and the document is rotated so that it begins at that token; the model must identify the true start of the document.
In the example, the token 'C' is picked from the original text 'ABC.DE.' and placed at the start, with the tokens that preceded 'C' naturally wrapping around to the end, giving 'C.DE.AB'.
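
To make the five schemes concrete, here is a toy Python sketch of each transform. It is my simplified illustration (operating on single characters rather than subword tokens), not the paper's implementation:

```python
# Toy sketch of the five noising transforms; my simplified illustration,
# not the paper's implementation (which operates on subword tokens).
import random
import numpy as np

def token_masking(toks, p=0.15):
    # Replace random tokens with [MASK], as in BERT's MLM.
    return ["[MASK]" if random.random() < p else t for t in toks]

def token_deletion(toks, p=0.15):
    # Delete random tokens; the model must also infer where they were.
    return [t for t in toks if random.random() >= p]

def text_infilling(toks, lam=3):
    # Replace one span, with length drawn from Poisson(lambda=3)
    # (possibly length 0), by a single [MASK] token.
    span = min(int(np.random.poisson(lam)), len(toks))
    start = random.randint(0, len(toks) - span)
    return toks[:start] + ["[MASK]"] + toks[start + span:]

def sentence_permutation(text):
    # Split on full stops and shuffle the sentence order.
    sents = [s for s in text.split(".") if s]
    random.shuffle(sents)
    return ".".join(sents) + "."

def document_rotation(toks):
    # Pick a token uniformly at random and rotate the document to start there.
    i = random.randrange(len(toks))
    return toks[i:] + toks[:i]

doc = list("ABC.DE.")
print(token_masking(doc), token_deletion(doc))
print(text_infilling(doc), sentence_permutation("ABC.DE."), document_rotation(doc))
```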

Fine-Tuning BART

BART Model์ด ์ƒ์„ฑํ•œ Representatoin์€ ๋‹ค์–‘ํ•œ Downstream Application์—์„œ ์—ฌ๋Ÿฌ๊ฐ€์ง€์˜ ๋ฐฉ์‹์œผ๋กœ ์‚ฌ์šฉ๋ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•œ๋ฒˆ ์•„๋ž˜์˜ Task๋“ค์— ๋Œ€ํ•œ ์ž์„ธํ•œ ์„ค๋ช…์„ ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Sequence Classification Tasks

When BART is used for classification problems, the input is fed into both the encoder and the decoder, and the representation from the final output is used.

 

A sequence classification task classifies a given sequence.

For example, CoLA in GLUE classifies whether a given sentence is grammatically acceptable. For this task, the same input is fed into both the encoder and the decoder, and the final hidden state of the decoder is fed into a new multi-class linear classifier.

This is similar to BERT's [CLS] token, except that BART appends the extra token to the end of the input so that the decoder's representation of it can attend to the entire input.

Token Classification Tasks

A token classification task performs classification at the level of individual tokens.

A representative example is answer endpoint classification for SQuAD, where the model must find the start and end tokens of the answer span within a given document. In BART, the entire document is fed into both the encoder and the decoder, and the top decoder hidden state of each token is used as that token's representation in classifiers that predict the start and end tokens.

Sequence Generation Tasks

Because BART has an autoregressive decoder, it can be applied directly to generation tasks such as abstractive question answering and summarization. These tasks transform an input sequence into an output sequence, which closely matches the denoising pre-training objective. The encoder receives the input sequence, and the decoder generates the output autoregressively.
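
For example, summarization reduces to feeding the document to the encoder and decoding the summary. The sketch below assumes the publicly released facebook/bart-large-cnn checkpoint (a BART fine-tuned on CNN/DailyMail):

```python
# Sketch: abstractive summarization with a fine-tuned BART checkpoint.
# facebook/bart-large-cnn is the public CNN/DailyMail fine-tune.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = ("PG&E said it planned the outages in response to forecasts of "
           "high winds and dry conditions, with nearly 800,000 customers "
           "expected to be affected.")
inputs = tokenizer(article, return_tensors="pt", truncation=True)

# The encoder reads the article; the decoder generates the summary
# autoregressively (beam search here).
summary_ids = model.generate(**inputs, num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```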

Machine Translation

๊ธฐ๊ณ„ ๋ฒˆ์—ญ์˜ ๊ฒฝ์šฐ, BART์˜ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ๋Œ€์ฒดํ•˜๋Š” ์ž‘์€ ์ถ”๊ฐ€ Encoder๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ƒˆ๋กœ์šด Encoder๋Š” ๋ถ„๋ฆฌ๋œ ์–ดํœ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

BART can be applied to machine translation by treating the entire model as a single pre-trained unit, with a new encoder stacked in front.

Specifically, BART's encoder embedding layer is replaced with a new, randomly initialized encoder. The model is trained end-to-end, which trains the new encoder to map foreign-language words into inputs that BART can de-noise into English. The new encoder can use a vocabulary different from the original BART model's.

 

The source encoder is trained in two stages (a rough parameter-freezing sketch follows the list):

  1. BART์˜ ๋Œ€๋ถ€๋ถ„์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ Freezeํ•˜๊ณ , ๋žœ๋ค์œผ๋กœ ์ดˆ๊ธฐํ™”๋œ Source Encoder, BART Positional Embeddings, ์ฒซ ๋ฒˆ์งธ Encoder Layer์˜ Self-Attention Input Projection Matrix๋งŒ ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค.
  2. ์ ์€ Iteration ์ˆ˜๋กœ ๋ชจ๋ธ์˜ ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค.

Comparison of Pre-Training Objectives

BART Model์€ Base-size ๊ธฐ์ค€ (6๊ฐœ์˜ Encoder, 6๊ฐœ์˜ Decoder, hidden size: 768๊ฐœ)์œผ๋กœ ๋‹ค์–‘ํ•œ Pre-Training Objective๋ฅผ ์—ฌ๋Ÿฌ task์— ๋Œ€ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๋น„๊ตํ–ˆ์Šต๋‹ˆ๋‹ค.

Comparison Objectives

  1. Language Model (LM): a left-to-right LM similar to GPT; equivalent to BART's decoder but without cross-attention.
  2. Permuted Language Model (PLM): based on XLNet; 1/6 of the tokens are sampled and generated autoregressively (AR). For comparability with the other models, relative positional embeddings and attention across segments were not used.
  3. Masked Language Model (MLM): as in BERT, 15% of tokens are replaced with [MASK], and the original tokens are predicted independently.
  4. Multitask Masked Language Model (MMLM): the approach proposed in UniLM, trained with additional self-attention masks.
  5. Masked Seq-to-Seq: inspired by MASS; 50% of the tokens are masked and predicted in a seq2seq fashion.

Tasks

๋น„๊ต ์‹คํ—˜์—์„œ ์‚ฌ์šฉ๋œ Task๋“ค์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  • SQuAD: Wikipedia ๋ฌธ๋‹จ์„ ์‚ฌ์šฉํ•˜๋Š” Extractive Question Answering ํƒœ์Šคํฌ๋กœ, Input์œผ๋กœ ์งˆ๋ฌธ๊ณผ ๋ฌธ๋งฅ์„ ๊ฒฐํ•ฉํ•˜์—ฌ Encoder์— ๋„ฃ๊ณ  Decoder๋กœ ๋‚˜์˜จ ๊ฒฐ๊ณผ๋กœ ์ •๋‹ต ์ŠคํŒฌ์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • MNLI: ๋‘ ๋ฌธ์žฅ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” Bitext Classification ํƒœ์Šคํฌ๋กœ, ํ•œ ๋ฌธ์žฅ์ด ๋‹ค๋ฅธ ๋ฌธ์žฅ์„ ์ˆ˜๋ฐ˜ํ•˜๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•ฉ๋‹ˆ๋‹ค.
[EOS] Token์„ ๋‘ ๋ฌธ์žฅ๊ณผ ์—ฐ๊ฒฐํ•˜์—ฌ, BART Model์˜ Encoder์— ๋„ฃ๊ณ  Decoder๋กœ ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.
์—ฌ๊ธฐ์„œ [EOS] Token์˜ ์—ญํ• ์€ ๋ฌธ์žฅ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š”๋ฐ ์ฃผ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
  • ELI5: Long-form Abstractive Question Answering ํƒœ์Šคํฌ๋กœ, ๊ธด ํ˜•์‹์˜ ์งˆ๋ฌธ์— ๋Œ€ํ•ด ์ž์œ  ํ˜•์‹์˜ ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • XSum: ๋งค์šฐ ํ•จ์ถ•์ ์ธ ๋‰ด์Šค ์š”์•ฝ ํƒœ์Šคํฌ๋กœ, ์ถ”์ƒ์ ์ธ ์š”์•ฝ๋ฌธ์„ ์ƒ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ConvAI2: Context์™€ Persona ์กฐ๊ฑด์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ Dialogue Response Generation ํƒœ์Šคํฌ์ž…๋‹ˆ๋‹ค.
  • CNN/DM: News Summarization ํƒœ์Šคํฌ๋กœ, ์ž…๋ ฅ ๋ฌธ์„œ์™€ ๋ฐ€์ ‘ํ•˜๊ฒŒ ์—ฐ๊ด€๋œ ์š”์•ฝ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Results

The experimental results are shown in the table below.

Pre-Training ๋น„๊ต. ๋ชจ๋“  Model์€ ๋น„์Šทํ•œ ํฌ๊ธฐ์ด๋ฉฐ ์ฑ…๊ณผ ์œ„ํ‚คํ”ผ๋””์•„ ๋ฐ์ดํ„ฐ์„ ํ•ฉ์นœ 1M Step๋งŒํผ ๋ฐ์ดํ„ฐ๋กœ Train ๋ฉ๋‹ˆ๋‹ค.
๋‘๋ฒˆ์งธ, ๋งˆ์ง€๋ง‰ ๋ธ”๋ก๋“ค์€ ๋™์ผํ•œ Code-Base๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
์ฒซ๋ฒˆ์งธ ์นธ์˜ ํ•ญ๋ชฉ์€ BERT-Base Model์˜ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
๋‘๋ฒˆ์งธ ์นธ์€ Pre-Training Objective์— ๋”ฐ๋ฅธ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋˜ํ•œ ํ‰๊ฐ€ ๋ชฉํ‘œ์— ์ง‘์ค‘ํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์ˆœํ™” ๋˜์—ˆ๋‹ค๋Š” ํŠน์ง•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
๋งˆ์ง€๋ง‰ ์นธ์€ BART ๋ชจ๋ธ์—์„œ document corruption์„ ์œ„์—์„œ ์„ค๋ช…ํ•œ 5๊ฐ€์ง€ ๋ฐฉ๋ฒ•๊ณผ ์กฐํ•ฉ์— ์žˆ์–ด ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
BART with Text Infilling์ด ๋Œ€์ฒด์ ์œผ๋กœ ์„ฑ๋Šฅ์ด ์ œ์ผ ์ข‹์€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๋˜ํ•œ ์„ฑ๋Šฅ์€ Task๋งˆ๋‹ค ๋‹ค๋ฅด์ง€๋งŒ text infilling์ด ํฌํ•จ๋œ BART ๋ชจ๋ธ์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • Performance of pre-training methods varies significantly across tasks (second block, row 3)
    • How well a pre-training method works depends heavily on the task.
    • The LM model performed best on ELI5 but worst on SQuAD.
  • Token masking is crucial (last block, rows 4 and 5)
    • Pre-training based only on rotating documents or permuting sentences performed poorly.
    • In contrast, methods using token deletion, token masking, or self-attention masks performed well.
  • Left-to-right pre-training improves generation (second block, rows 1 and 4; the objectives without an autoregressive (AR) component)
    • The MLM and PLM models performed poorly on generation tasks, whereas BART, combining text infilling and sentence shuffling, performed well.
  • Bidirectional encoders are crucial for SQuAD (second block, row 3)
    • With only half as many bidirectional layers, BART achieved performance similar to RoBERTa on SQuAD.
    • Plain left-to-right decoders perform poorly on SQuAD, but this was not the case for BART.
  • The pre-training objective is not the only important factor (second block, row 4)
    • The Permuted LM shares its pre-training objective with XLNet but performed worse.
    • This is likely because it lacked XLNet's additional architectural improvements, such as relative-position embeddings and segment-level recurrence.
  • Pure language models perform best on ELI5
    • ELI5 was the only task where LM-style pre-training objectives outperformed BART.
  • Consistent performance of BART
    • Across most tasks other than ELI5, the BART models using text infilling performed best.

Large-scale Pre-training Experiments

์ด๋ฒˆ ์‹คํ—˜์—์„œ๋Š” BART ๋ชจ๋ธ์„ RoBERTa์™€ ๋™์ผํ•œ ๊ทœ๋ชจ๋กœ Pre-Training ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

์œ„์˜ ์‹คํ—˜์—์„œ๋Š” ๋‹จ์ˆœํžˆ Text Infilling๋งŒ์„ ์‚ฌ์šฉํ•œ ๊ฒƒ์ด ์„ฑ๋Šฅ์ด ์ข‹์•˜์ง€๋งŒ, ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” large scale ๋‹จ์œ„๋กœ ๊ฐ€๊ฒŒ ๋˜๋ฉด sentence shuffling์ด ์ž˜ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฐ€์„ค์„ ๋‘์—ˆ๊ณ  ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Experimental Setup

  • ๋ชจ๋ธ ๊ตฌ์„ฑ: 12 Layer Encoder/Decoder, Hidden Size 1024
  • ํ•™์Šต ์„ค์ •:
    • Batch Size: 8000
    • Training Steps: 500,000
    • Tokenization: Byte-Pair Encoding (BPE)
    • Noising Scheme: Text Infilling๊ณผ Sentence Permutation์˜ ์กฐํ•ฉ
    • Dropout: ํ•™์Šต ๋‹จ๊ณ„์˜ ๋งˆ์ง€๋ง‰ 10%์—์„œ๋Š” Dropout์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ, Overfitting ๋ฐฉ์ง€ ๋ชฉ์ 
  • ํ•™์Šต ๋ฐ์ดํ„ฐ: News, Books, Stories, Web Text ๋“ฑ 160GB์˜ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ

Discriminative Tasks

On the SQuAD and GLUE tasks, BART performed comparably to RoBERTa and XLNet.

 

๋Œ€๋ถ€๋ถ„์˜ Discriminative Task์—์„œ BART๋Š” RoBERTa์™€ ํฐ ์ฐจ์ด๋Š” ๋ณด์ด์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

์ด๋Š” BART์˜ uni-directional decoder layers๊ฐ€ discriminative tasks์—์„œ ์„ฑ๋Šฅ์„ ์ €ํ•˜์‹œํ‚ค์ง€ ์•Š์Œ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

๊ฐ€์žฅ ์ง์ ‘์ ์œผ๋กœ ๋น„๊ต ๊ฐ€๋Šฅํ•œ Baseline์€ ๋™์ผํ•œ ์ž์›์œผ๋กœ Pre-Training(์‚ฌ์ „ ํ•™์Šต)๋˜์—ˆ์œผ๋‚˜, ๋‹ค๋ฅธ ๋ชฉํ‘œ๋ฅผ ๊ฐ€์ง„ RoBERTa์ž…๋‹ˆ๋‹ค.

์ „๋ฐ˜์ ์œผ๋กœ BART๋Š” ๋Œ€๋ถ€๋ถ„์˜ ํƒœ์Šคํฌ์—์„œ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋ฉฐ, ๋ชจ๋ธ ๊ฐ„์˜ ์ฐจ์ด๋Š” ๋ฏธ๋ฏธํ–ˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š” BART์˜ Generation Task์—์„œ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด Classification ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š์Œ์„ ์˜๋ฏธํ•˜๋ฉฐ, ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒˆ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

Generation Tasks

2๊ฐœ์˜ ํ‘œ์ค€ Summerization Dataset์— ๊ด€ํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. BART ๋ชจ๋ธ์€ ๋‘๊ฐ€์ง€ Task & ๋ชจ๋“  ์ง€ํ‘œ์— ๋Œ€ํ•œ Summerization์— ๋Œ€ํ•œ ์ด์ „์˜ Task๋ฅผ ๋Šฅ๊ฐ€ํ•˜๋ฉฐ, ์ถ”์ƒ์ ์ธ Dataset์—์„œ 6์  ์ด์ƒ์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

๋ณธ ๋…ผ๋ฌธ์—์„œ ์†Œ๊ฐœ๋˜๋Š” BART๋Š” ๋ชจ๋“  Task์—์„œ ๋ชจ๋“  ํ‰๊ฐ€์ง€ํ‘œ, Rough์˜ R1, R2, RL์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š”๊ฑธ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŠนํžˆ ๋งจ์•ž 3๊ฐœ์˜ ๋ฌธ์žฅ์„ ์ถ”์ถœํ•˜๋Š” Lead-3๋ฅผ ๋ณด๋ฉด, ์ „์ž์˜ Task๋ณด๋‹ค ํ›„์ž์˜ Task์—์„œ ์„ฑ๋Šฅ์ด ๋งค์šฐ ๋‚ฎ์Œ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

Summarization

  • CNN/DailyMail (the easier dataset): summaries resemble the source sentences, and BART recorded better ROUGE scores than previous models.
  • XSum (the harder dataset): highly abstractive summaries are required, and BART improved on the previous SOTA model by about 6 ROUGE points.

Dialogue

  • ConvAI2: on this dialogue response generation task conditioned on context and a persona, BART showed a large improvement in perplexity compared with other models.

Abstractive QA

  • ELI5: on this long free-form answer generation task, BART achieved SOTA performance.

Translation

On the WMT16 Romanian-English dataset, BART improved over a baseline Transformer model by 1.1 BLEU.

This result uses back-translation data augmentation; performance was improved by additionally training BART's source encoder to map foreign-language text into English.

 

However, since the improvement is not dramatic, BART does not appear to be particularly strong for machine translation.


Qualitative Analysis

Source document (abridged) / BART model summary

Source: Researchers examined three types of coral on reefs off the coast of Fiji... They found that when fish were plentiful, the fish would eat algae and seaweed off the coral, making the coral more resistant to rising temperatures and less vulnerable to Vibrio coralliilyticus, a bacterium associated with bleaching. The researchers suggested that seaweed, like rising temperatures, can weaken coral's chemical defenses, and that fish may play a protective role.
BART: Fish off the coast of Fiji are protecting coral reefs from the effects of global warming, according to a study published in the journal Science.

Source: Sacoolas, who has immunity as a diplomat's wife, was involved in a traffic collision... Prime Minister Johnson, speaking to reporters at a hospital in Watford, said of the case: "I hope Anne Sacoolas will come back. If we can't resolve it, then of course I will raise it personally with the White House."
BART: Boris Johnson has said he will raise the issue of US diplomat Anne Sacoolas's diplomatic immunity with the White House.

Source: According to Syrian state media, government forces yesterday began deploying into areas previously controlled by the SDF... On October 6, US President Donald Trump and Turkish President Recep Tayyip Erdogan spoke by phone, and both countries issued statements that a Turkish incursion into northeastern Syria was imminent. On Wednesday, Turkey began a ground offensive following air strikes.
BART: Syrian government forces have entered territory held by the US-backed Syrian Democratic Forces (SDF) in response to Turkey's incursion.

Source: For the first time ever, a person has completed a full 42.195-kilometer (about 26-mile) marathon in under two hours, although the race was not an official IAAF-sanctioned record attempt. Kipchoge's time was 1 hour 59 minutes 40.2 seconds. The event, held in Vienna, Austria, was specially designed to help Kipchoge break the two-hour barrier.
BART: Kenya's Eliud Kipchoge has run a marathon in under two hours.

Source: PG&E announced that it had planned the power shutoffs in response to forecasts of strong winds and dry weather, in order to reduce wildfire risk. Nearly 800,000 customers were expected to be affected, with the outages expected to last until at least tomorrow afternoon.
BART: Planned power shutoffs have been carried out in California, affecting millions of customers.

BART์˜ ์ƒ์„ฑ ๊ฒฐ๊ณผ๋Š” ๋งค์šฐ abstractiveํ•˜๋ฉฐ, ์ž…๋ ฅ์—์„œ ๋ณต์‚ฌ๋œ ๊ตฌ๊ฐ€ ๊ฑฐ์˜ ์—†์Šต๋‹ˆ๋‹ค. ์ถœ๋ ฅ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์‹ค์ ์œผ๋กœ ์ •ํ™•ํ•˜๋ฉฐ, ์ž…๋ ฅ ๋ฌธ์„œ์˜ ์ „๋ฐ˜์ ์ธ ์ฆ๊ฑฐ์™€ ๋ฐฐ๊ฒฝ ์ง€์‹์„ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค. (์˜ˆ๋ฅผ ๋“ค์–ด, ์ด๋ฆ„์„ ์ •ํ™•ํ•˜๊ฒŒ ์™„์„ฑํ•˜๊ฑฐ๋‚˜, PG&E๊ฐ€ ์บ˜๋ฆฌํฌ๋‹ˆ์•„์—์„œ ์šด์˜๋œ๋‹ค๋Š” ์‚ฌ์‹ค์„ ์ถ”๋ก ํ•˜๋Š” ๊ฒƒ). ์ฒซ ๋ฒˆ์งธ ์˜ˆ์‹œ์—์„œ, ๋ฌผ๊ณ ๊ธฐ๊ฐ€ ์ง€๊ตฌ ์˜จ๋‚œํ™”๋กœ๋ถ€ํ„ฐ ์‚ฐํ˜ธ์ดˆ๋ฅผ ๋ณดํ˜ธํ•˜๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์ถ”๋ก ํ•˜๋Š” ๊ฒƒ์€ ํ…์ŠคํŠธ์—์„œ ๋น„์ง๊ด€์ ์ธ ์ถ”๋ก ์„ ํ•„์š”๋กœ ํ•ฉ๋‹ˆ๋‹ค.


Conclusion

The paper proposes BART, which is pre-trained to restore corrupted documents to their original form.

BART matched RoBERTa's performance on discriminative tasks and achieved new state-of-the-art (SOTA) results on several text generation tasks.

 

In particular, pre-training with text infilling proved the most effective, and the experiments confirm BART's flexibility and strong performance across a wide range of NLP tasks. For future work, exploring new methods for corrupting documents for pre-training, tailored to specific end tasks, looks promising, along with model compression, multilingual support, and more efficient training.

 


Word Explanation (Glossary)

์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ž‘๋™ํ•˜๋Š” Auto-Regressive ํŠน์„ฑ (Left-to-Right Auto-Regressive)

  • ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•  ๋•Œ ์ด์ „ ๋‹จ์–ด๋“ค์„ ๋ฐ”ํƒ•์œผ๋กœ ๋‹ค์Œ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. GPT์™€ BART์˜ Decoder๋Š” ์ด๋Ÿฌํ•œ ํŠน์„ฑ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

Text Infilling Scheme

  • A noising technique that replaces a contiguous span of words with a single [MASK] token. The model must restore the [MASK] token to the original text.

Comprehension Tests

  • Tasks that evaluate how well a model understands text. BART performs well not only at text generation but also on comprehension tests.

GLUE (General Language Understanding Evaluation)

  • A benchmark suite of diverse natural language understanding tasks, used to evaluate a model's overall language understanding ability.

SQuAD (Stanford Question Answering Dataset)

  • ์ฃผ์–ด์ง„ ๋ฌธ์„œ์—์„œ ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต๋ณ€์„ ์ถ”์ถœํ•˜๋Š” ์งˆ๋ฌธ ์‘๋‹ต ํƒœ์Šคํฌ๋ฅผ ํฌํ•จํ•œ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์˜ ์ดํ•ด๋ ฅ๊ณผ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

RoBERTa (Robustly optimized BERT approach)

  • BERT์˜ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต ๋ฐฉ๋ฒ•๊ณผ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ตœ์ ํ™”ํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ๋‹ค์–‘ํ•œ NLP ํƒœ์Šคํฌ์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.

ROUGE ์ ์ˆ˜

  • ํ…์ŠคํŠธ ์š”์•ฝ์˜ ํ’ˆ์งˆ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ์ง€ํ‘œ๋กœ, ์ƒ์„ฑ๋œ ์š”์•ฝ๊ณผ ์ฐธ์กฐ ์š”์•ฝ ๊ฐ„์˜ n-๊ทธ๋žจ, ๋‹จ์–ด ์ˆœ์„œ, ๊ตฌ๋ฌธ ๊ตฌ์กฐ ๋“ฑ์„ ๋น„๊ตํ•˜์—ฌ ์œ ์‚ฌ์„ฑ์„ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค.

Abstractive Dialogue

  • A dialogue generation task that produces a new response based on the given conversational context, generating novel sentences rather than extracting them from the input.

State-of-the-Art (SOTA)

  • ํŠน์ • ํƒœ์Šคํฌ์—์„œ ํ˜„์žฌ๊นŒ์ง€ ๋‹ฌ์„ฑ๋œ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. BART๋Š” ์—ฌ๋Ÿฌ ํ…์ŠคํŠธ ์ƒ์„ฑ ํƒœ์Šคํฌ์—์„œ SOTA ์„ฑ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Ablation Experiments

  • Experiments that remove or modify a specific component of a model to evaluate its effect on overall performance, identifying which parts of the model matter most.