[LLM] Parameter-Efficient Transfer Learning for NLP Review
์ด๋ฒˆ์—๋Š” "Parameter-Efficient Transfer Learning for NLP" ๋…ผ๋ฌธ์„ ํ•œ๋ฒˆ ๋ฆฌ๋ทฐํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ๋…ผ๋ฌธ ๋งํฌ
 

Parameter-Efficient Transfer Learning for NLP

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer

arxiv.org

Abstract

Fine-tuning large pre-trained models is an effective transfer method in NLP, but with many downstream tasks it is parameter-inefficient, because an entire new model has to be trained for every task. To address this, the authors propose adapter modules.

Adapter Modules์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์žฅ์ ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค:

  • ์ปดํŒฉํŠธํ•˜๊ณ  ํ™•์žฅ ๊ฐ€๋Šฅ: ๊ฐ ์ž‘์—…๋งˆ๋‹ค ๊ทน์†Œ๋Ÿ‰์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋งŒ ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค.
  • ํšจ์œจ์ ์ธ ์ „์ด: ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์ถ”๊ฐ€ํ•  ๋•Œ ๊ธฐ์กด ์ž‘์—…์„ ๋‹ค์‹œ ํ•™์Šตํ•  ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค.
  • ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ : ์›๋ž˜ ๋„คํŠธ์›Œํฌ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๊ณ ์ •ํ•˜์—ฌ ๋†’์€ ์ˆ˜์ค€์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

Adapter Modules์„ BERT Transformer ๋ชจ๋ธ์— ์ ์šฉํ•˜์—ฌ GLUE Benchmark๋ฅผ ํฌํ•จํ•œ 26๊ฐœ์˜ ๋‹ค์–‘ํ•œ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ ์ž‘์—…์—์„œ ํ…Œ์ŠคํŠธํ•œ ๊ฒฐ๊ณผ, ๊ฑฐ์˜ ์ตœ์‹  ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ์ž‘์—…๋‹น ๋งค๊ฐœ๋ณ€์ˆ˜ ์ถ”๊ฐ€๊ฐ€ ๋งค์šฐ ์ ์—ˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ GLUE์—์„œ๋Š” ์ „์ฒด Fine-tuning ์„ฑ๋Šฅ ๋Œ€๋น„ 0.4% ์ด๋‚ด์˜ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ž‘์—…๋‹น 3.6%์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋งŒ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, ์ „ํ†ต์ ์ธ Fine-tuning ๋ฐฉ๋ฒ•์€ ์ž‘์—…๋‹น 100%์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํ•™์Šต์‹œ์ผœ์•ผ ํ–ˆ์Šต๋‹ˆ๋‹ค.

Conclusion: the paper proposes a transfer learning technique based on adapter modules.
  • I wrote up the basics of transfer learning in the post below, so please refer to it!
 

[DL] Transfer Learning (daehyun-bigbread.tistory.com)


Introduction

์ด ๋…ผ๋ฌธ์—์„œ ์ž‘์—…๋“ค์ด ์ŠคํŠธ๋ฆผ์œผ๋กœ ๋„์ฐฉํ•˜๋Š” Online Setting์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

"In this paper we address the online setting, where tasks arrive in sequence"

์—ฌ๊ธฐ์„œ์˜ Online Setting์€ ๊ณ ๊ฐ์˜ ์—ฐ์†์ ์ธ ์ž‘์—…์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋งŽ์€ ์ž‘์—…์„ ํ•™์Šตํ•ด์•ผ ํ•˜๋Š” ํด๋ผ์šฐ๋“œ ์„œ๋น„์Šค์™€ ๊ฐ™์€ ํ™˜๊ฒฝ์ด๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • The goal is to build a system that performs well on all of these tasks without training a new model for every task.

To this end, the authors propose a transfer learning strategy that yields a compact and extensible downstream model. Two properties come up here:

  • Compact means a model that can solve many tasks with only a small number of additional parameters per task.
  • Extensible means a model that can be trained incrementally on new tasks without forgetting previous ones.

Figure 1 ์€ Adapter Tuning ๊ณผ Fine-Tuning ๊ฐ„์˜ ์ •ํ™•๋„์™€ ์ž‘์—…๋ณ„๋กœ ํ•™์Šต๋œ Task-Specific Parameters ์˜ ์ˆ˜ ์‚ฌ์ด์˜ Trade-Off ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. y-์ถ• ์€ Full Fine-Tuning ์˜ ์„ฑ๋Šฅ์œผ๋กœ ์ •๊ทœํ™”๋˜์–ด ์žˆ์œผ๋ฉฐ, ์ž์„ธํ•œ ๋‚ด์šฉ์€ Section 3 ์— ์žˆ์Šต๋‹ˆ๋‹ค. ๊ณก์„ ์€ GLUE Benchmark ์˜ 9๊ฐœ ์ž‘์—…์—์„œ 20๋ฒˆ์งธ, 50๋ฒˆ์งธ, 80๋ฒˆ์งธ ์„ฑ๋Šฅ ๋ฐฑ๋ถ„์œ„๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. Adapter-Based Tuning ์€ ๋‘ ์ž๋ฆฟ์ˆ˜์˜ ํฌ๊ธฐ๊ฐ€ ๋” ์ ์€ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ Full Fine-Tuning ๊ณผ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

 

Two transfer learning techniques are commonly used here: feature-based transfer and fine-tuning.

  • Feature-based transfer pre-trains real-valued embedding vectors, which may be at the word, sentence, or paragraph level; these embeddings are then fed into a custom downstream model.
  • Fine-tuning copies the weights of the pre-trained network and tunes them on the downstream task. Recent work shows that fine-tuning often outperforms feature-based transfer.
  • However, both feature-based transfer and fine-tuning require a new set of weights for every task, so they do not adapt well when new tasks keep arriving.

Instead of these transfer learning techniques, the paper proposes an alternative transfer method based on adapter modules.

  • Fine-Tuning์€ ๋„คํŠธ์›Œํฌ์˜ ํ•˜์œ„ ๋ ˆ์ด์–ด๋ฅผ ์ž‘์—… ๊ฐ„์— ๊ณต์œ ํ•  ๊ฒฝ์šฐ ๋” ํšจ์œจ์ ์ด์ง€๋งŒ, ์šฐ๋ฆฌ๊ฐ€ ์ œ์•ˆํ•˜๋Š” Adapter Tuning ๋ฐฉ๋ฒ•์€ ํ›จ์”ฌ ๋” Parameter Efficientํ•ฉ๋‹ˆ๋‹ค. Adapter-Based Tuning์€ Fine-Tuning์— ๋น„ํ•ด ์ž‘์—…๋‹น ํ•„์š”ํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ํ›จ์”ฌ ์ ์œผ๋ฉด์„œ๋„ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Adapters are new modules added between the layers of a pre-trained network.

 

Adapter-Based Tuning์€ Feature-Based Transfer์™€ Fine-Tuning๊ณผ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฐจ์ด์ ์— ๋ฐํ•˜์—ฌ ์„ค๋ช…ํ•ด๋ณด๋ฉด. ํŒŒ๋ผ๋ฏธํ„ฐ w๋ฅผ ๊ฐ€์ง„ ํ•จ์ˆ˜ ฯ•w(x) ๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•  ๋•Œ

  • Feature-based transfer composes φ_w with a new function χ_v to form χ_v(φ_w(x)), and trains only the task-specific parameters v.
  • Fine-tuning adjusts the original parameters w for every task, which limits compactness.

The idea of adapter-based tuning is as follows:

  • ์‚ฌ์ „ ํ•™์Šต๋œ ํŒŒ๋ผ๋ฏธํ„ฐ w๋ฅผ ๊ณ ์ •ํ•˜๊ณ , ์ƒˆ๋กœ์šด ์ž‘์—…๋ณ„ ์ถ”๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ v๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • ์ƒˆ๋กœ์šด ํ•จ์ˆ˜ ψw,v(x)๋ฅผ ์ •์˜ํ•˜์—ฌ, ์ดˆ๊ธฐ ํŒŒ๋ผ๋ฏธํ„ฐ v0๊ฐ€ ψw,v0(x) ≈ ฯ•w(x)๊ฐ€ ๋˜๋„๋ก ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ํ›ˆ๋ จ ์ค‘์—๋Š” ์ž‘์—…๋ณ„ ์ถ”๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ v๋งŒ ์กฐ์ •, |v| โ‰ช |w|๋ฅผ ๋งŒ์กฑํ•ด ํšจ์œจ์ ์ด๊ณ  Compactํ•œ ๋ชจ๋ธ ์„ค๊ณ„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • ๊ธฐ์กด ์ž‘์—…์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๊ณ  ์ƒˆ๋กœ์šด ์ž‘์—…์— ํ™•์žฅ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
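The near-identity condition ψ_{w,v₀}(x) ≈ φ_w(x) can be sketched numerically. This is a toy NumPy sketch, not the paper's code: the specific shapes and the zero-initialized up-projection are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4                         # feature dim d, bottleneck dim m (toy sizes)

# Frozen pre-trained map phi_w: here just a fixed linear layer.
W = rng.standard_normal((d, d))
phi_w = lambda x: x @ W

# Adapter parameters v: a down-projection and an up-projection.
W_down = rng.standard_normal((d, m)) * 0.01
W_up   = np.zeros((m, d))            # zero init -> adapter starts as the identity
relu   = lambda z: np.maximum(z, 0)

def psi_wv(x):
    h = phi_w(x)                         # frozen backbone
    return h + relu(h @ W_down) @ W_up   # adapter with internal skip connection

x = rng.standard_normal((2, d))
# At initialization, psi_{w,v0}(x) equals phi_w(x): the adapter is a no-op.
assert np.allclose(psi_wv(x), phi_w(x))
```

During training only W_down and W_up (the parameters v) would receive gradients, while W (the parameters w) stays frozen.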

๊ธฐ์กด์˜ ํ•™์Šต ๋ฐฉ์‹๊ณผ ๋น„๊ตํ•ด๋ณด์ž๋ฉด

Multi-Task Learning๋„ Compact ๋ชจ๋ธ์€ ๋ชจ๋“  ์ž‘์—…์— ๋Œ€ํ•œ ๋™์‹œ ์ ‘๊ทผ์„ ํ•„์š”๋กœ ํ•˜์ง€๋งŒ Adapter-Based Tuning์€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
Continual Learning ์‹œ์Šคํ…œ์€ ์ž‘์—… ์ŠคํŠธ๋ฆผ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜๋‚˜, ์ด์ „ ์ž‘์—…์„ ์žŠ๋Š” Forgetting Problem์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

 

Adapter-Based Tuning์€ 

  • ์ž‘์—… ๊ฐ„ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต์œ ๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๋…๋ฆฝ์ ์œผ๋กœ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • ์†Œ์ˆ˜์˜ ์ž‘์—…๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ด์ „ ์ž‘์—…์„ ์™„๋ฒฝํžˆ ๊ธฐ์–ตํ•ฉ๋‹ˆ๋‹ค.

The key to parameter-efficient tuning with adapters in NLP is designing an effective adapter module and integrating it with the base model; for this, the paper proposes a simple yet effective bottleneck architecture.

GLUE Benchmark์—์„œ ์šฐ๋ฆฌ์˜ ์ „๋žต์€ ์ „์ฒด Fine-Tuning๋œ BERT์™€ ๊ฑฐ์˜ ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ๊ธฐ๋กํ•˜๋ฉด์„œ๋„ ์ž‘์—…๋‹น 3%์˜ ์ž‘์—…๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ์„ ์‚ฌ์šฉํ•˜๋ฉฐ, Fine-Tuning์€ ์ž‘์—…๋‹น 100%์˜ ์ž‘์—…๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ถ”๊ฐ€๋กœ 17๊ฐœ์˜ ๊ณต๊ฐœ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹๊ณผ SQuAD ์ถ”์ถœํ˜• ์งˆ๋ฌธ ์‘๋‹ต์—์„œ๋„ ์œ ์‚ฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ด€์ฐฐํ–ˆ์Šต๋‹ˆ๋‹ค. 

In summary, adapter-based tuning yields a single, extensible model that reaches near state-of-the-art performance in text classification.

Adapter tuning for NLP

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ Downstream Task์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ Text Model์„ ์กฐ์ •ํ• ์ˆ˜ ์žˆ๋Š” Adapter Tuning์ด๋ผ๋Š” ์ „๋žต์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

Its three key properties are:

  1. ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•จ
  2. ์ž‘์—…์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Œ - ์ฆ‰, ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์— ๋™์‹œ์— ์ ‘๊ทผํ•  ํ•„์š”๊ฐ€ ์—†์Œ
  3. ์ž‘์—…๋ณ„๋กœ ์†Œ๋Ÿ‰์˜ ์ถ”๊ฐ€ ๋งค๊ฐœ๋ณ€์ˆ˜๋งŒ ์ถ”๊ฐ€๋จ

์ด๋Ÿฌํ•œ ํŠน์„ฑ์€ ํŠนํžˆ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ์ผ๋ จ์˜ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์— ๋Œ€ํ•ด ํ•™์Šตํ•ด์•ผ ํ•˜๋Š” ํด๋ผ์šฐ๋“œ ์„œ๋น„์Šค ํ™˜๊ฒฝ์—์„œ ์œ ์šฉํ•˜๋ฉฐ, ๋†’์€ ๊ณต์œ ๋„๋ฅผ ์ œ๊ณตํ•˜์—ฌ ํšจ์œจ์„ฑ์„ ๋†’์ž…๋‹ˆ๋‹ค.

 

To achieve these goals, the paper proposes a new bottleneck adapter module.

Adapter Tuning์€ ๋ชจ๋ธ์— ์†Œ์ˆ˜์˜ ์ƒˆ๋กœ์šด ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  ์ด๋ฅผ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์— ๋Œ€ํ•ด ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค

์ „ํ†ต์ ์ธ Fine-Tuning์—์„œ๋Š” ๋„คํŠธ์›Œํฌ์˜ ์ตœ์ƒ์ธต์„ ์ˆ˜์ •ํ•˜๋Š”๋ฐ, ์ด๋Š” ์ƒ์œ„ ์ž‘์—…๊ณผ ํ•˜์œ„ ์ž‘์—… ๊ฐ„์˜ ๋ ˆ์ด๋ธ” ๊ณต๊ฐ„ ๋ฐ ์†์‹ค์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ํ•„์š”ํ•˜๊ณ , ์ƒˆ๋กœ์šด ๋ ˆ์ด์–ด๋ฅผ ์›๋ž˜ ๋„คํŠธ์›Œํฌ์— ์‚ฝ์ž…ํ•ฉ๋‹ˆ๋‹ค. ์›๋ž˜ ๋„คํŠธ์›Œํฌ์˜ ๊ฐ€์ค‘์น˜๋Š” ๊ทธ๋Œ€๋กœ ๋‘๊ณ , ์ƒˆ๋กœ์šด ์–ด๋Œ‘ํ„ฐ ๋ ˆ์ด์–ด๋งŒ ๋ฌด์ž‘์œ„๋กœ ์ดˆ๊ธฐํ™”ํ•˜์—ฌ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

 

Adapter modules have two key properties:

  • ์†Œ์ˆ˜์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์‚ฌ์šฉ: ์–ด๋Œ‘ํ„ฐ ๋ชจ๋“ˆ์€ ์›๋ž˜ ๋„คํŠธ์›Œํฌ์˜ ๋ ˆ์ด์–ด๋ณด๋‹ค ์ž‘์•„์•ผ ํ•˜๋ฉฐ, ์ž‘์—…์ด ์ถ”๊ฐ€๋  ๋•Œ ์ „์ฒด ๋ชจ๋ธ ํฌ๊ธฐ๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ์ฒœ์ฒœํžˆ ์ฆ๊ฐ€ํ•˜๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
  • ๊ฑฐ์˜ ๋™์ผํ•œ ์ดˆ๊ธฐํ™”: ํ•™์Šต์ด ์•ˆ์ •์ ์ด๊ธฐ ์œ„ํ•ด ์–ด๋Œ‘ํ„ฐ ๋ชจ๋“ˆ์„ ๊ฑฐ์˜ ๋™์ผ ํ•จ์ˆ˜๋กœ ์ดˆ๊ธฐํ™”ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

With this initialization, the original network is unaffected when training starts; during training, the adapters can then change the distribution of activations throughout the network. Adapter modules can also effectively be ignored when they are not needed.


Instantiation for Transformer Networks

Transformer์— ์–ด๋Œ‘ํ„ฐ ๊ธฐ๋ฐ˜ ํŠœ๋‹์„ ์ ์šฉํ•˜์—ฌ ์ตœ์‹  ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์–ด๋Œ‘ํ„ฐ ๋ชจ๋“ˆ์—๋Š” ๋‹ค์–‘ํ•œ ์„ค๊ณ„ ์˜ต์…˜์ด ์žˆ์ง€๋งŒ, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹จ์ˆœํ•œ ์„ค๊ณ„๊ฐ€ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.

 

Transformer์˜ ๊ฐ ๋ ˆ์ด์–ด๋Š” ๋‘ ๊ฐœ์˜ ์ฃผ์š” ํ•˜์œ„ ๋ ˆ์ด์–ด๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค: Attention Layer์™€ Feedforward Layer. ๊ฐ ํ•˜์œ„ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ์€ ์ž…๋ ฅ ํฌ๊ธฐ๋กœ ๋‹ค์‹œ ํˆฌ์˜๋˜๋ฉฐ, ์ดํ›„ Skip Connection์ด ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ฐ ํ•˜์œ„ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ์€ Layer Normalization์— ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ ๊ฐ ํ•˜์œ„ ๋ ˆ์ด์–ด ๋’ค์— ๋‘ ๊ฐœ์˜ ์ง๋ ฌ ์–ด๋Œ‘ํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ์–ด๋Œ‘ํ„ฐ๋Š” ํ•˜์œ„ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ์— ์ง์ ‘ ์ ์šฉ๋˜๋ฉฐ, ์ž…๋ ฅ ํฌ๊ธฐ๋กœ ํˆฌ์˜ํ•œ ํ›„ Skip Connection์„ ์ ์šฉํ•˜๊ธฐ ์ „ ๋‹จ๊ณ„์— ์ถ”๊ฐ€๋ฉ๋‹ˆ๋‹ค. ์ด ์–ด๋Œ‘ํ„ฐ์˜ ์ถœ๋ ฅ์€ ์ดํ›„ Layer Normalization์œผ๋กœ ๋ฐ”๋กœ ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค.

Figure 2 shows the architecture with adapter modules integrated into the Transformer. Left: the adapter module is added twice to each Transformer layer, first after the projection layer that follows multi-headed attention, and second after the feed-forward layer. Right: the adapter has a bottleneck structure, so it uses few parameters relative to the original model's attention and feed-forward layers, and it contains its own skip connection. During adapter tuning, the green layers are trained on the downstream data; these comprise the adapters, the layer normalization parameters, and the final classification layer (not shown in the figure).
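The placement described above can be sketched as a minimal NumPy toy (the layer sizes, the linear stand-ins for the two sublayers, and the near-zero adapter initialization are assumptions for illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 2
relu = lambda z: np.maximum(z, 0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def make_adapter():
    # Bottleneck adapter with its own internal skip connection.
    Wd = rng.standard_normal((d, m)) * 1e-3
    Wu = rng.standard_normal((m, d)) * 1e-3   # near-zero init -> near identity
    return lambda h: h + relu(h @ Wd) @ Wu

def sublayer_block(x, sublayer, adapter):
    h = sublayer(x)           # attention or feed-forward output, projected to size d
    h = adapter(h)            # adapter applied before the skip connection is added back
    return layer_norm(x + h)  # residual add, then layer normalization

# Stand-ins for the two sublayers of one Transformer layer.
W_attn = rng.standard_normal((d, d)) * 0.1
W_ffn  = rng.standard_normal((d, d)) * 0.1
x = rng.standard_normal((3, d))
y = sublayer_block(x, lambda t: t @ W_attn, make_adapter())
y = sublayer_block(y, lambda t: t @ W_ffn,  make_adapter())
assert y.shape == x.shape
```

Each Transformer layer thus gets two adapters, one per sublayer, exactly at the point between the sublayer output and the residual addition.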

๋˜ํ•œ, Adapter Module์˜ Parameter์˜ ์ˆ˜๋ฅผ ์ œํ•œํ•˜๊ธฐ ์œ„ํ•ด์„œ ์•ž์—์„œ ์„ค๋ช…ํ–ˆ๋“ฏ์ด, ๋ณ‘๋ชฉ Architecuter๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

w: parameters of the pretrained model (vector)
v: newly trained task-specific parameters (vector)
φ_w: the pretrained model (a neural network)
x: the input data

Feature-based learning: χ_v(φ_w(x))

χ_v can be thought of as a final layer that simply adapts the output; that is, the output of the pre-trained network φ_w is transformed to match the new task's output space by χ_v.

Fine-tuning: φ'_{w'}(x)

The pre-trained parameters themselves are modified, i.e., the model function itself changes: φ_w(x) → φ'_{w'}(x)

Adapter: ψ_{w,v}(x)

w is kept frozen, and only the weights v for the new task are updated.

d ์ฐจ์›์˜ ํŠน์ง•์„ ๋” ์ž‘์€ ์ฐจ์› m์œผ๋กœ ํˆฌ์˜ํ•œ ํ›„, ๋น„์„ ํ˜•์„ฑ์„ ์ ์šฉํ•˜๊ณ  ๋‹ค์‹œ d ์ฐจ์›์œผ๋กœ ํˆฌ์˜ํ•ฉ๋‹ˆ๋‹ค.
๊ฐ ๋ ˆ์ด์–ด๋‹น ์ถ”๊ฐ€๋˜๋Š” ์ด ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” 2md + d + m์ž…๋‹ˆ๋‹ค. m < d๋กœ ์„ค์ •ํ•จ์œผ๋กœ์จ ์ž‘์—…๋‹น ์ถ”๊ฐ€๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์‹ค์ œ๋กœ ์›๋ž˜ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์•ฝ 0.5-8%๋งŒ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
๋ณ‘๋ชฉ ์ฐจ์› m์„ ์กฐ์ ˆํ•จ์œผ๋กœ์จ ์„ฑ๋Šฅ๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ์„ ์‰ฝ๊ฒŒ ์กฐ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ด๋Œ‘ํ„ฐ ๋ชจ๋“ˆ ์ž์ฒด์—๋Š” ๋‚ด๋ถ€์ ์œผ๋กœ Skip Connection์ด ์žˆ์–ด, ํˆฌ์˜ ๋ ˆ์ด์–ด์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๊ฑฐ์˜ 0์œผ๋กœ ์ดˆ๊ธฐํ™”๋  ๊ฒฝ์šฐ ์–ด๋Œ‘ํ„ฐ ๋ชจ๋“ˆ์€ ๋Œ€๋žต์ ์ธ ๋™์ผ ํ•จ์ˆ˜ ๋กœ ์ดˆ๊ธฐํ™”๋ฉ๋‹ˆ๋‹ค.

In addition, new layer normalization parameters are trained for each task. This is similar to conditional batch normalization, FiLM, and self-modulation, and adapts the network efficiently with only 2d parameters per layer. However, the authors note that training only the layer normalization parameters is not sufficient for good performance.


Experiments

The experiments show that adapter-based tuning achieves parameter-efficient transfer on text tasks. On the GLUE benchmark, adapter tuning comes within 0.4% of fully fine-tuned BERT while adding only about 3% as many parameters. The result is confirmed on a further 17 public classification tasks and on SQuAD question answering. Analysis shows that adapter-based tuning automatically focuses on the higher layers of the network.

Experimental Setting

  • ๊ธฐ๋ณธ ๋ชจ๋ธ: ์‚ฌ์ „ ํ•™์Šต๋œ BERT Transformer ๋„คํŠธ์›Œํฌ.
  • ๋ถ„๋ฅ˜ ์ž‘์—…: Devlin et al. (2018)์˜ ๋ฐฉ์‹ ์ ์šฉ. ํŠน๋ณ„ํ•œ "[CLS]" ํ† ํฐ๊ณผ ์„ ํ˜• ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํด๋ž˜์Šค ์˜ˆ์ธก ์ˆ˜ํ–‰.
  • ํ›ˆ๋ จ ๊ณผ์ •: Adam ์˜ตํ‹ฐ๋งˆ์ด์ €์™€ ์›Œ๋ฐ์—… ํ•™์Šต๋ฅ  ์Šค์ผ€์ค„ ์‚ฌ์šฉ, ๋ฐฐ์น˜ ํฌ๊ธฐ๋Š” 32. Google Cloud TPU 4๋Œ€๋ฅผ ํ™œ์šฉํ•ด ํ›ˆ๋ จ.
  • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹: ๊ฐ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด ๊ฒ€์ฆ ์„ธํŠธ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ตœ์  ๋ชจ๋ธ์„ ์„ ํƒ.

Main goal: to add as few parameters as possible (keeping the total, ideally, near 1× the original model) while matching fine-tuning performance.

 

GLUE Benchmark

  • ์‚ฌ์šฉ ๋ชจ๋ธ: BERTLARGE (24๊ฐœ ๋ ˆ์ด์–ด, 3์–ต 3์ฒœ๋งŒ ๊ฐœ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ).
  • ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹: ์–ด๋Œ‘ํ„ฐ ๋ ˆ์ด์–ด ์ถ”๊ฐ€ ํ›„ ์ผ๋ถ€ ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ํ•™์Šต:
    • ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ: ํ•™์Šต๋ฅ  3×10−5,3×10−4,3×10−3, ์—ํฌํฌ ์ˆ˜ 3,20, ์–ด๋Œ‘ํ„ฐ ํฌ๊ธฐ 8,64,256.3,20{3, 20}
    • 8,64,256{8, 64, 256}
    • 3×10−5,3×10−4,3×10−3{3 × 10โปโต, 3 × 10โปโด, 3 × 10โป³}
    • ์•ˆ์ •์„ฑ์„ ์œ„ํ•ด ๋ฌด์ž‘์œ„ ์‹œ๋“œ๋กœ 5ํšŒ ๋ฐ˜๋ณต ํ›ˆ๋ จ.
  • ์„ฑ๋Šฅ:
    • ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹: GLUE ํ‰๊ท  ์ ์ˆ˜ 80.0.
    • ์ „์ฒด ํŒŒ์ธํŠœ๋‹: ํ‰๊ท  ์ ์ˆ˜ 80.4 (0.4% ๋” ๋†’์Œ).
    • ์–ด๋Œ‘ํ„ฐ ํฌ๊ธฐ๋ฅผ 64๋กœ ๊ณ ์ •ํ–ˆ์„ ๋•Œ ํ‰๊ท  ์ ์ˆ˜๋Š” 79.6์œผ๋กœ ์•ฝ๊ฐ„ ๊ฐ์†Œ.
    • ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ:
      • ์ „์ฒด ํŒŒ์ธํŠœ๋‹: BERT ํŒŒ๋ผ๋ฏธํ„ฐ์˜ 9๋ฐฐ ํ•„์š”.
      • ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹: 1.3๋ฐฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ์š”๊ตฌ.

 

Additional Classification Tasks

  • ๋ฐ์ดํ„ฐ์…‹: 900~33๋งŒ๊ฐœ์˜ ํ•™์Šต ์˜ˆ์ œ, 2157 ํด๋ž˜์Šค, ํ…์ŠคํŠธ ๊ธธ์ด 57~1,900์ž.
  • ํ‰๊ฐ€ ๋ฐฉ๋ฒ•:
    • ์ „์ฒด ํŒŒ์ธํŠœ๋‹.
    • ๊ฐ€๋ณ€ ํŒŒ์ธํŠœ๋‹(์ƒ์œ„ n๊ฐœ ๋ ˆ์ด์–ด๋งŒ ํŠœ๋‹).
    • ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹.
  • ๊ฒฐ๊ณผ:
    • ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹์€ ์ „์ฒด ํŒŒ์ธํŠœ๋‹๊ณผ ๊ฑฐ์˜ ๋™์ผํ•œ ์„ฑ๋Šฅ(0.4% ์ฐจ์ด)์œผ๋กœ, ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ์ด ํ›จ์”ฌ ๋›ฐ์–ด๋‚จ.
    • ํŒŒ๋ผ๋ฏธํ„ฐ ๋น„๊ต:
      • ์ „์ฒด ํŒŒ์ธํŠœ๋‹: BERTBASE ํŒŒ๋ผ๋ฏธํ„ฐ์˜ 17๋ฐฐ.
      • ๊ฐ€๋ณ€ ํŒŒ์ธํŠœ๋‹: ํ‰๊ท  9.9๋ฐฐ.
      • ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹: ๋ชจ๋“  ์ž‘์—…์—์„œ 1.19๋ฐฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ์‚ฌ์šฉ.

Parameter/Performance Trade-off

์–ด๋Œ‘ํ„ฐ ํฌ๊ธฐ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ์„ ์กฐ์ ˆํ•˜๋ฉฐ, ์ž‘์€ ์–ด๋Œ‘ํ„ฐ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋ฅผ ์ค„์ด์ง€๋งŒ ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํƒ์ƒ‰ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์–ด๋Œ‘ํ„ฐ ํฌ๊ธฐ๋ฅผ ์‹คํ—˜ํ•˜๊ณ  ๋‘ ๊ฐ€์ง€ ๊ธฐ์ค€๊ณผ ๋น„๊ตํ–ˆ์Šต๋‹ˆ๋‹ค.

  • (i) BERTBASE์˜ ์ƒ์œ„ k ๋ ˆ์ด์–ด๋งŒ ํŒŒ์ธํŠœ๋‹.
  • (ii) ๋ ˆ์ด์–ด ์ •๊ทœํ™” ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ํŠœ๋‹.

Table 1. GLUE test-set results, scored by the GLUE evaluation server. MRPC and QQP are evaluated with F1, STS-B with Spearman's correlation, and CoLA with Matthews correlation; the remaining tasks are evaluated with accuracy. Adapter tuning reaches an overall score comparable to full fine-tuning (80.0 vs. 80.4) while using 1.3× the total parameters versus 9×. Fixing the adapter size to 64 reduces the overall score slightly to 79.6 and makes the model slightly smaller.
Table 2. Test accuracy on the additional classification tasks. In these experiments, transfer is performed from a BERT-Base model. For each task and algorithm, the model with the highest validation-set accuracy is selected. The mean test accuracy and the standard error of the mean (s.e.m.) across runs with different random seeds are reported.
Figure 3. Accuracy versus the number of trained parameters, aggregated across tasks. Adapters of various sizes (orange) are compared with fine-tuning the top n layers (blue). Lines and shading show the 20th, 50th, and 80th percentiles across tasks. For each task and algorithm, the best model is selected at each point along the curve. Validation-set accuracy is reported for GLUE, test-set accuracy for the additional tasks. To remove inter-task score variation, each score is normalized by subtracting the full fine-tuning performance on that task.

Figure 3์—์„œ๋Š” GLUE์™€ ์ถ”๊ฐ€ ๋ถ„๋ฅ˜ ์ž‘์—… ์ „์ฒด์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์˜ Trade-off๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. GLUE์—์„œ๋Š” ์ ์€ ๋ ˆ์ด์–ด๋ฅผ ํŒŒ์ธํŠœ๋‹ํ•  ๋•Œ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๊ฐ์†Œํ•˜๋Š” ๋ฐ˜๋ฉด, ์ผ๋ถ€ ์ถ”๊ฐ€ ์ž‘์—…์—์„œ๋Š” ์ ์€ ๋ ˆ์ด์–ด ํ•™์Šต์ด ์œ ๋ฆฌํ•˜์—ฌ ์„ฑ๋Šฅ ๊ฐ์†Œ๊ฐ€ ์ ์Šต๋‹ˆ๋‹ค. ๋‘ ๊ฒฝ์šฐ ๋ชจ๋‘, ์–ด๋Œ‘ํ„ฐ๋Š” ํŒŒ์ธํŠœ๋‹๋ณด๋‹ค ํ›จ์”ฌ ์ ์€ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ๋„ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ–ˆ์Šต๋‹ˆ๋‹ค.

Figure 4. Validation-set accuracy versus the number of trained parameters for three methods: (i) adapter tuning with adapter sizes 2ⁿ for n = 0, …, 9 (orange); (ii) fine-tuning the top k layers for k = 1, …, 12 (blue); (iii) training only the layer normalization parameters (green). Error bars show ±1 s.e.m. across three random seeds.

Figure 4์—์„œ๋Š” ๋‘ GLUE ์ž‘์—…(MNLIm๊ณผ CoLA)์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ƒ์œ„ ๋ ˆ์ด์–ด๋ฅผ ํŠœ๋‹ํ•˜๋ฉด ๋ชจ๋“  k > 2์— ๋Œ€ํ•ด ๋” ๋งŽ์€ ์ž‘์—…๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ํ•™์Šต๋ฉ๋‹ˆ๋‹ค. ์œ ์‚ฌํ•œ ์ˆ˜์˜ ์ž‘์—…๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํŒŒ์ธํŠœ๋‹ํ•  ๋•Œ ์–ด๋Œ‘ํ„ฐ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

 

For example, fine-tuning only the top layer trains about 9 million parameters and reaches 77.8% ± 0.1% validation accuracy on MNLIm. In contrast, adapter tuning with size 64 trains about 2 million parameters and reaches 83.7% ± 0.1%. Full fine-tuning reaches 84.4% ± 0.02% on MNLIm. A similar trend appears on CoLA.

 

๋˜ํ•œ, ๋ ˆ์ด์–ด ์ •๊ทœํ™” ํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ ํŠœ๋‹ํ•˜์—ฌ ๋น„๊ตํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ ˆ์ด์–ด๋Š” ์ ๋ณ„ ์ถ”๊ฐ€ ๋ฐ ๊ณฑ์…ˆ๋งŒ ํฌํ•จํ•˜์—ฌ 4๋งŒ ๊ฐœ์˜ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„์ž…ํ•˜์ง€๋งŒ, ์„ฑ๋Šฅ์ด CoLA์—์„œ ์•ฝ 3.5%, MNLIm์—์„œ ์•ฝ 4% ๊ฐ์†Œํ•˜์—ฌ ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

๊ฒฐ๋ก ์ ์œผ๋กœ, ์–ด๋Œ‘ํ„ฐ ํŠœ๋‹์€ ๋งค์šฐ ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์ ์ด๋ฉฐ, 0.5-5%์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ๋„ ์›๋ณธ ๋ชจ๋ธ์˜ ํฌ๊ธฐ์— ๋น„ํ•ด ์„ฑ๋Šฅ ์ €ํ•˜๊ฐ€ ๊ฑฐ์˜ ์—†๊ณ , BERTLARGE์˜ ์„ฑ๋Šฅ์— ๊ทผ์ ‘ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค.

SQuAD Extractive Question Answering

Figure 5. Validation accuracy on SQuAD v1.1 versus the number of trained parameters. Error bars show the s.e.m. across three random seeds, using the best hyperparameters. The plot shows, for various adapter sizes, how validation accuracy varies with the number of trained parameters, and that adapter tuning can reach high performance with far fewer parameters than fine-tuning.

Finally, to confirm that adapters work on tasks beyond classification, experiments were run on the SQuAD v1.1 dataset. Given a question and a Wikipedia paragraph, the task is to select the answer span for the question from the paragraph.

 

Figure 5 shows the parameter/performance trade-off between fine-tuning and adapters on the SQuAD validation set. For fine-tuning, the number of trained layers, the learning rate {3·10⁻⁵, 5·10⁻⁵, 1·10⁻⁴}, and the number of epochs {2, 3, 5} were swept; for adapters, the adapter size, the learning rate {3·10⁻⁵, 1·10⁻⁴, 3·10⁻⁴, 1·10⁻³}, and the number of epochs {3, 10, 20} were swept.

 

As in the classification tasks, adapters achieve performance comparable to fine-tuning while training far fewer parameters. Adapters of size 64 (2% of parameters) reach a best F1 of 90.4%, versus 90.7% for fine-tuning, and even tiny adapters of size 2 (0.1% of parameters) still reach an F1 of 89.9%.


Analysis and Discussion

1. Adapter์˜ ์ค‘์š”์„ฑ๊ณผ ์—ญํ• 

  • Removing individual adapters
    • Removing trained adapters and re-evaluating the model without retraining shows that removing the adapters of any single layer has only a minor effect on performance, with a maximum drop of 2%.
    • Removing all adapters causes drops of 37% on MNLI and 69% on CoLA. So while each adapter has a small individual effect, collectively they are essential to the network.
  • Importance of the upper layers
    • Removing the adapters of the lower layers (layers 0-4) barely affects performance.
    • Adapters in the upper layers have a larger effect. This suggests that the upper layers learn task-specific features and that adapters act preferentially there, much like common fine-tuning strategies.

2. Robustness to adapter initialization and size

  • ์ดˆ๊ธฐํ™” ํฌ๊ธฐ ์‹คํ—˜
    • Adapter ๋ชจ๋“ˆ์˜ ๊ฐ€์ค‘์น˜๋Š” ํ‘œ์ค€ํŽธ์ฐจ 10โป² ์ดํ•˜์—์„œ ์„ฑ๋Šฅ์ด ์•ˆ์ •์ ์ด์—ˆ์Œ.
    • ์ดˆ๊ธฐํ™” ํฌ๊ธฐ๊ฐ€ ๋„ˆ๋ฌด ํฌ๋ฉด(CoLA์—์„œ ๋” ๋šœ๋ ทํ•˜๊ฒŒ) ์„ฑ๋Šฅ์ด ์ €ํ•˜๋จ.
    • ์ดˆ๊ธฐํ™” ํ‘œ์ค€ํŽธ์ฐจ ๋ฒ”์œ„ [10โปโท, 1] ๋‚ด์—์„œ 10โป² ์ดํ•˜๋ฅผ ๊ถŒ์žฅ.
  • Adapter ํฌ๊ธฐ๋ณ„ ์„ฑ๋Šฅ
    • ๋‹ค์–‘ํ•œ ํฌ๊ธฐ(8, 64, 256)์˜ Adapter๋กœ ์‹คํ—˜ํ•œ ๊ฒฐ๊ณผ, ํฌ๊ธฐ 8~256 ์‚ฌ์ด์—์„œ ์„ฑ๋Šฅ ์ฐจ์ด๋Š” ๊ฑฐ์˜ ์—†์Œ.
    • MNLI ํ‰๊ท  ๊ฒ€์ฆ ์ •ํ™•๋„:
      • ํฌ๊ธฐ 8: 86.2%
      • ํฌ๊ธฐ 64: 85.8%
      • ํฌ๊ธฐ 256: 85.7%

3. Attempted extensions to the adapter architecture

  • Extension experiments: various architectural variants were tried, but none yielded a meaningful performance gain.
    • Extensions tried:
      1. Adding batch/layer normalization to the adapter.
      2. Increasing the number of layers per adapter.
      3. Using other activation functions, such as tanh.
      4. Inserting adapters only inside the attention layers.
      5. Adding adapters in parallel to the main layers, with a multiplicative interaction.
    • Result: performance similar to the proposed basic adapter structure.

Related Work

Figure 6: Left, center: ablation of trained adapters from contiguous spans of layers. The heatmap shows the relative decrease in validation accuracy compared to the fully trained adapter model. The y- and x-axes indicate the first and last layers ablated, respectively; the diagonal cells, highlighted in green, correspond to ablating the adapters of a single layer. The cell in the top-right corresponds to ablating all adapters. Cells in the lower triangle are meaningless and are set to 0%, the best possible relative performance. Right: performance of BERT-Base when adapters are initialized with different weight scales. The x-axis is the standard deviation of the initialization distribution.

์‚ฌ์ „ ํ•™์Šต๋œ ํ…์ŠคํŠธ ํ‘œํ˜„

  • ์‚ฌ์ „ ํ•™์Šต๋œ ํ…์ŠคํŠธ ํ‘œํ˜„์€ NLP ์ž‘์—… ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋ฉฐ, ์ฃผ๋กœ ๋Œ€๊ทœ๋ชจ ๋น„์ง€๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ์ดํ›„ ์ž‘์—…์—์„œ fine-tuning์„ ํ†ตํ•ด ์ตœ์ ํ™”๋ฉ๋‹ˆ๋‹ค.
  • ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๋ฐœ์ „: Brown ํด๋Ÿฌ์Šคํ„ฐ์™€ ๊ฐ™์€ ์ดˆ๊ธฐ ๊ธฐ๋ฒ•์—์„œ ์‹œ์ž‘ํ•˜์—ฌ Word2Vec, GloVe, FastText ๋“ฑ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์œผ๋กœ ๋ฐœ์ „(Mikolov et al., 2013; Pennington et al., 2014). ๊ธด ํ…์ŠคํŠธ ์ž„๋ฒ ๋”ฉ ๊ธฐ์ˆ ๋„ Le & Mikolov(2014) ๋“ฑ์˜ ์—ฐ๊ตฌ๋กœ ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • ๋ฌธ๋งฅ ํฌํ•จ: ELMo, BiLSTM ๋“ฑ์€ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜๋ฉฐ, ์–ด๋Œ‘ํ„ฐ๋Š” ์ด๋Ÿฌํ•œ ๋ชจ๋ธ์ฒ˜๋Ÿผ ๋‚ด๋ถ€ ๊ณ„์ธต์„ ํ™œ์šฉํ•˜์ง€๋งŒ, ๋„คํŠธ์›Œํฌ ์ „์ฒด์—์„œ ํ”ผ์ฒ˜๋ฅผ ์žฌ๊ตฌ์„ฑํ•˜๋Š” ๊ฒƒ์ด ํŠน์ง•.

Fine-tuning pre-trained models

  • ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ ์ „์ฒด๋ฅผ ์ž‘์—…์— ๋งž๊ฒŒ fine-tuningํ•˜๋ฉฐ, ์ƒˆ๋กœ์šด ์ž‘์—…๋งˆ๋‹ค ๋„คํŠธ์›Œํฌ ๊ฐ€์ค‘์น˜ ์„ธํŠธ๊ฐ€ ํ•„์š”.
  • ์žฅ์ : task๋ณ„ ๋ชจ๋ธ ์„ค๊ณ„๊ฐ€ ํ•„์š” ์—†์œผ๋ฉฐ, Masked Language Model(MLM)์„ ํ™œ์šฉํ•œ Transformer ๊ธฐ๋ฐ˜ ๋„คํŠธ์›Œํฌ(Vaswani et al., 2017)๊ฐ€ ์งˆ๋ฌธ ๋‹ต๋ณ€, ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ ๋“ฑ์˜ ์ž‘์—…์—์„œ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ ๋‹ฌ์„ฑ(Devlin et al., 2018).

Multi-task Learning (MTL)

  • ํ•˜์œ„ ๊ณ„์ธต์€ ๊ณต์œ , ์ƒ์œ„ ๊ณ„์ธต์€ ์ž‘์—…๋ณ„ ํŠนํ™” ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉ.
  • ์—ฌ๋Ÿฌ ์ž‘์—…์„ ๋™์‹œ์— ํ•™์Šตํ•˜๋ฉฐ, ์ž‘์—… ๊ฐ„ ๊ทœ์น™์„ฑ์„ ํ™œ์šฉํ•ด ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ(Caruana, 1997).
  • ํ™œ์šฉ ์‚ฌ๋ก€: ํ’ˆ์‚ฌ ํƒœ๊น…, ๊ฐœ์ฒด๋ช… ์ธ์‹, ๊ธฐ๊ณ„ ๋ฒˆ์—ญ(Johnson et al., 2017), ์งˆ๋ฌธ ๋‹ต๋ณ€(Choi et al., 2017) ๋“ฑ.
  • ์ œํ•œ: ํ›ˆ๋ จ ์ค‘ ์ž‘์—…๋“ค์— ๋™์‹œ ์ ‘๊ทผ์ด ํ•„์š”ํ•˜๋ฉฐ, ์ด๋Š” Adapter์™€ ์ฐจ๋ณ„์ .

Continual Learning

  • Approaches that learn from a sequence of tasks and try to overcome "catastrophic forgetting", where learning a new task destroys performance on previous ones.
  • Methods: Progressive Networks prevent forgetting by instantiating a new network column for each task (Rusu et al., 2016), but this grows inefficient as the number of tasks increases; adapters extend more efficiently.

Transfer Learning in Vision

  • ImageNet pre-trained models: fine-tuning achieves state-of-the-art performance on vision tasks such as classification, detection, and segmentation (Kornblith et al., 2018).
  • Convolutional adapters: small convolutional layers are added to perform task-specific learning (Rebuffi et al., 2017); performance is maintained even when the adapter size is reduced, with model size growing by only about 11% per task.

BERT์˜ Adapter ์—ฐ๊ตฌ์™€ ๋น„๊ต

  • Stickland & Murray (2019): Projected Attention Layers (PALs) are similar to adapters but differ in architecture and approach.
  • They introduce PALs to perform multi-task learning with BERT across all GLUE tasks.
  • Conclusion: adapter-style methods show strong performance and memory efficiency in multi-task and continual learning as well.
    • Adapters provide efficient extension at a small size, complementing the limitations of pre-trained models and fine-tuning.