[LLM] LoRA: Low-Rank Adaptation of Large Language Models (Paper Review)
In this post, I review the paper "LoRA: Low-Rank Adaptation of Large Language Models."
  • ๋…ผ๋ฌธ ๋งํฌ
 

LoRA: Low-Rank Adaptation of Large Language Models

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes le

arxiv.org

Abstract

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์ค‘์š”ํ•œ ํŒจ๋Ÿฌ๋‹ค์ž„์€ ์ผ๋ฐ˜์ ์ธ ๋„๋ฉ”์ธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋Œ€๊ทœ๋ชจ ์‚ฌ์ „ ํ•™์Šต๊ณผ ํŠน์ • ์ž‘์—… ๋˜๋Š” ๋„๋ฉ”์ธ์—์˜ ์ ์‘์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ชจ๋ธ ํฌ๊ธฐ๊ฐ€ ์ปค์ง€๋ฉด์„œ ๋ชจ๋“  ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์žฌํ•™์Šตํ•˜๋Š” ์™„์ „ ๋ฏธ์„ธ ์กฐ์ •์€ ์ ์  ๋น„ํ˜„์‹ค์ ์ด ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด GPT-3 175B์˜ ๊ฒฝ์šฐ, ๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด 175B ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํฌํ•จํ•œ ๋…๋ฆฝ์ ์ธ ๋ชจ๋ธ ์ธ์Šคํ„ด์Šค๋ฅผ ๋ฐฐํฌํ•˜๋Š” ๊ฒƒ์€ ๋งค์šฐ ๋น„์šฉ์ด ๋งŽ์ด ๋“ญ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ์ •ํ•˜๊ณ  ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์˜ ๊ฐ ๊ณ„์ธต์— ์ €๋žญํฌ(rank decomposition) ํ–‰๋ ฌ์„ ์‚ฝ์ž…ํ•˜๋Š” LoRA(Low-Rank Adaptation)๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ํ•˜์œ„ ์ž‘์—…์—์„œ ํ•™์Šตํ•ด์•ผ ํ•  ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜๋ฅผ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

LoRA๋Š” GPT-3 175B์—์„œ Adam์„ ์‚ฌ์šฉํ•œ ์™„์ „ ๋ฏธ์„ธ ์กฐ์ • ๋Œ€๋น„ ํ•™์Šต ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ 10,000๋ฐฐ ์ค„์ด๊ณ  GPU ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ๋Ÿ‰์„ 3๋ฐฐ ์ค„์ž…๋‹ˆ๋‹ค. RoBERTa, DeBERTa, GPT-2, GPT-3์—์„œ LoRA๋Š” ๋” ์ ์€ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ๋„ ๋ฏธ์„ธ ์กฐ์ •๋ณด๋‹ค ๋™๋“ฑํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ๋ชจ๋ธ ํ’ˆ์งˆ์„ ๋‹ฌ์„ฑํ•˜๋ฉฐ, ํ•™์Šต ์†๋„๊ฐ€ ๋” ๋น ๋ฅด๊ณ  ์ถ”๊ฐ€์ ์ธ ์ถ”๋ก  ์ง€์—ฐ์ด ๋ฐœ์ƒํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ์–ธ์–ด ๋ชจ๋ธ ์ ์‘์—์„œ์˜ ๋žญํฌ ๊ฒฐํ•(rank-deficiency)์„ ์‹ค์ฆ์ ์œผ๋กœ ์กฐ์‚ฌํ•˜๋ฉฐ LoRA์˜ ํšจ๋Šฅ์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” PyTorch ๋ชจ๋ธ๊ณผ์˜ ํ†ตํ•ฉ์„ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๋Š” ํŒจํ‚ค์ง€๋ฅผ ์ œ๊ณตํ•˜๊ณ , RoBERTa, DeBERTa, GPT-2์— ๋Œ€ํ•œ ๊ตฌํ˜„๊ณผ ๋ชจ๋ธ ์ฒดํฌํฌ์ธํŠธ๋ฅผ ๊ณต๊ฐœํ•ฉ๋‹ˆ๋‹ค.

 

  • Code: GitHub - microsoft/LoRA: loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models" (github.com)
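
As a quick orientation before the review, here is a minimal sketch of the intended loralib workflow, following the repository README; the layer sizes below are made-up placeholders, not values from the paper.

```python
import torch
import torch.nn as nn
import loralib as lora

# A toy model where one projection is LoRA-augmented with rank r = 8.
model = nn.Sequential(
    lora.Linear(768, 768, r=8),  # pre-trained-style W stays frozen; A, B train
    nn.ReLU(),
    nn.Linear(768, 2),
)

# Freeze everything except the LoRA matrices A and B
# (the final nn.Linear head above gets frozen too by this call).
lora.mark_only_lora_as_trainable(model)

# ... training loop as usual ...

# Checkpoints only need the (tiny) LoRA parameters.
torch.save(lora.lora_state_dict(model), "lora_ckpt.pt")
```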


Introduction

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP)์˜ ๋งŽ์€ ์‘์šฉ์€ ๋Œ€๊ทœ๋ชจ ์‚ฌ์ „ ํ•™์Šต๋œ ์–ธ์–ด ๋ชจ๋ธ์„ ๋‹ค์–‘ํ•œ ํ•˜์œ„ ์‘์šฉ์— ๋งž๊ฒŒ ์ ์‘์‹œํ‚ค๋Š” ๋ฐ ์˜์กดํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ ์‘์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์˜ ๋ชจ๋“  ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฏธ์„ธ ์กฐ์ •(fine-tuning)์„ ํ†ตํ•ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฏธ์„ธ ์กฐ์ •(fine-tuning)์˜ ์ฃผ์š” ๋‹จ์ ์€ ์ƒˆ๋กœ์šด ๋ชจ๋ธ์ด ์›๋ž˜ ๋ชจ๋ธ๊ณผ ๋™์ผํ•œ ์ˆ˜์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํฌํ•จํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

 

๋ชจ๋ธ ํฌ๊ธฐ๊ฐ€ ๊ณ„์† ์ปค์ง€๋ฉด์„œ ์ด๋Š” ๋‹จ์ˆœํžˆ "๋ถˆํŽธํ•จ"์„ ๋„˜์–ด์„œ GPT-2(Radford et al., b)๋‚˜ RoBERTa large(Liu et al., 2019)์™€ ๊ฐ™์€ ๋ชจ๋ธ์—์„œ ์‹œ์ž‘๋œ ๋ฌธ์ œ๊ฐ€, GPT-3(1750์–ต ๊ฐœ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜)์—์„œ๋Š” ์‹ฌ๊ฐํ•œ ๋ฐฐํฌ ๋ฌธ์ œ๋กœ ์ด์–ด์กŒ์Šต๋‹ˆ๋‹ค.

๋งŽ์€ ์—ฐ๊ตฌ๋“ค์€ ์ผ๋ถ€ ๋งค๊ฐœ๋ณ€์ˆ˜๋งŒ ์ ์‘ํ•˜๊ฑฐ๋‚˜ ์ƒˆ๋กœ์šด ์ž‘์—…์„ ์œ„ํ•œ ์™ธ๋ถ€ ๋ชจ๋“ˆ์„ ํ•™์Šตํ•จ์œผ๋กœ์จ ์ด ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๋ ค ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด ์†Œ์ˆ˜์˜ ์ž‘์—…๋ณ„ ๋งค๊ฐœ๋ณ€์ˆ˜๋งŒ ์ €์žฅํ•˜๊ณ  ๋กœ๋“œํ•˜๋ฉด ๋˜๋ฏ€๋กœ ์šด์˜ ํšจ์œจ์„ฑ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋‚˜ ๊ธฐ์กด ๊ธฐ์ˆ ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  1. ๋ชจ๋ธ์˜ ๊นŠ์ด๋ฅผ ํ™•์žฅํ•ด ์ถ”๋ก  ์ง€์—ฐ(inference latency)์„ ๋„์ž…ํ•˜๊ฑฐ๋‚˜(Houlsby et al., 2019; Rebuffi et al., 2017),
  2. ๋ชจ๋ธ์˜ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ์ค„์ด๋Š” ๋ฐฉ์‹(Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021) ๋“ฑ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.
  3. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ํšจ์œจ์„ฑ๊ณผ ๋ชจ๋ธ ํ’ˆ์งˆ ๊ฐ„์˜ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„(trade-off)๋ฅผ ์ดˆ๋ž˜ํ•˜๋ฉฐ, ๋ฏธ์„ธ ์กฐ์ • ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ์„ ์ž์ฃผ ๋”ฐ๋ผ๊ฐ€์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.

 

LoRA์˜ ํ•ต์‹ฌ ๊ฐ€์„ค: Li et al. (2018a) ๋ฐ Aghajanyan et al. (2020)์˜ ์—ฐ๊ตฌ์— ์˜๊ฐ์„ ๋ฐ›์•„, ์‚ฌ์ „ ํ•™์Šต๋œ ๊ณผ์ ํ•ฉ(over-parametrized) ๋ชจ๋ธ์ด ๋ณธ์งˆ์ ์œผ๋กœ ๋‚ฎ์€ ์ฐจ์›์„ ๊ฐ€์ง„๋‹ค๋Š” ์ ์„ ๊ด€์ฐฐํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ, ์ €์ž๋Š” ๋ชจ๋ธ ์ ์‘ ์ค‘ ๊ฐ€์ค‘์น˜ ๋ณ€ํ™” ์—ญ์‹œ ๋‚ฎ์€ "๋‚ด์žฌ์  ๋žญํฌ(intrinsic rank)"๋ฅผ ๊ฐ€์ง„๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค.

 

LoRA๋Š” ๋ชจ๋ธ์˜ ์‚ฌ์ „ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ์ •ํ•œ ์ƒํƒœ์—์„œ, ์ ์‘ ์ค‘ ๊ฐ€์ค‘์น˜ ๋ณ€ํ™” ๋ถ€๋ถ„์„ ์ €๋žญํฌ ํ–‰๋ ฌ(rank decomposition matrices)๋กœ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด GPT-3(1750์–ต ๋งค๊ฐœ๋ณ€์ˆ˜)์˜ ๊ฒฝ์šฐ์—๋„ ๋งค์šฐ ๋‚ฎ์€ ๋žญํฌ(r)๋กœ ํšจ์œจ์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค(์˜ˆ: ๋žญํฌ r=1 ๋˜๋Š” 2๋กœ๋„ ์ถฉ๋ถ„).

 

LoRA์˜ ์ฃผ์š” ์žฅ์ 

  1. ํšจ์œจ์  ์ €์žฅ ๋ฐ ์ž‘์—… ์ „ํ™˜
    • ํ•˜๋‚˜์˜ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ๊ณต์œ ํ•˜๋ฉฐ, ์ž‘์—…๋ณ„๋กœ ์†Œํ˜• LoRA ๋ชจ๋“ˆ(์ €๋žญํฌ ํ–‰๋ ฌ A์™€ B)๋งŒ ๊ต์ฒดํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ €์žฅ ์š”๊ตฌ ์‚ฌํ•ญ๊ณผ ์ž‘์—… ์ „ํ™˜ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ํฌ๊ฒŒ ์ค„์ž…๋‹ˆ๋‹ค.
  2. ํšจ์œจ์  ํ•™์Šต๊ณผ ํ•˜๋“œ์›จ์–ด ์š”๊ตฌ์‚ฌํ•ญ ๊ฐ์†Œ
    • ๋Œ€๋ถ€๋ถ„์˜ ๋งค๊ฐœ๋ณ€์ˆ˜์— ๋Œ€ํ•ด ๊ฒฝ์‚ฌ ๊ณ„์‚ฐ์ด๋‚˜ ์˜ตํ‹ฐ๋งˆ์ด์ € ์ƒํƒœ๋ฅผ ์œ ์ง€ํ•  ํ•„์š”๊ฐ€ ์—†์œผ๋ฏ€๋กœ ํ•™์Šต ํšจ์œจ์„ฑ์ด ์ตœ๋Œ€ 3๋ฐฐ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
  3. ์ถ”๋ก  ์ง€์—ฐ ์—†์Œ
    • ํ•™์Šต๋œ ํ–‰๋ ฌ์„ ๊ณ ์ •๋œ ๊ฐ€์ค‘์น˜์— ํ†ตํ•ฉํ•˜์—ฌ ์ €์žฅํ•˜๋ฉด ์ถ”๋ก  ์‹œ ์ถ”๊ฐ€ ์ง€์—ฐ์ด ์—†์Šต๋‹ˆ๋‹ค.
  4. ๊ธฐ์กด ๋ฐฉ์‹๊ณผ์˜ ๊ฒฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ
    • LoRA๋Š” ํ”„๋ฆฌํ”ฝ์Šค ํŠœ๋‹(prefix-tuning) ๊ฐ™์€ ๊ธฐ์กด ๋ฐฉ์‹๊ณผ ๋ณ‘ํ–‰ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
ํŠธ๋žœ์Šคํฌ๋จธ ๊ณ„์ธต์˜ ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ์ฐจ์›: dmodel
์…€ํ”„ ์–ดํ…์…˜ ๋ชจ๋“ˆ
- Wq: ์ฟผ๋ฆฌ ํ”„๋กœ์ ์…˜ ํ–‰๋ ฌ
- Wk: ํ‚ค ํ”„๋กœ์ ์…˜ ํ–‰๋ ฌ
- Wv: ๊ฐ’ ํ”„๋กœ์ ์…˜ ํ–‰๋ ฌ
- Wo: ์ถœ๋ ฅ ํ”„๋กœ์ ์…˜ ํ–‰๋ ฌ
์‚ฌ์ „ ํ•™์Šต ๊ฐ€์ค‘์น˜: W0
๋žญํฌ: r (LoRA ๋ชจ๋“ˆ์˜ ๋žญํฌ)
์ตœ์ ํ™”: Adam ์˜ตํ‹ฐ๋งˆ์ด์ € ์‚ฌ์šฉ(Loshchilov & Hutter, 2019)
MLP ํ”ผ๋“œํฌ์›Œ๋“œ ์ฐจ์›: dffn = 4 × dmodel

Problem Statement

์ด ๋…ผ๋ฌธ์—์„œ์˜ ์ œ์•ˆ์€ ํŠน์ • training objective์— ์ข…์†๋˜์ง€ ์•Š์ง€๋งŒ, language modeling์„ ์ฃผ์š” ์‚ฌ๋ก€๋กœ ์„ค์ •ํ•˜์—ฌ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” language modeling ๋ฌธ์ œ์— ๋Œ€ํ•œ ๊ฐ„๋žตํ•œ ์„ค๋ช…๊ณผ, ํŠน์ • ์ž‘์—…(task)-๊ธฐ๋ฐ˜ ํ”„๋กฌํ”„ํŠธ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ์ž‘์—…์— ๋Œ€ํ•œ ๊ฐœ์š”์ž…๋‹ˆ๋‹ค.

 

์‚ฌ์ „ ํ•™์Šต๋œ autoregressive language model PΦ(yโˆฃx)์ด ์ฃผ์–ด์กŒ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ Φ๋Š” ๋ชจ๋ธ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, PΦ(yโˆฃx))๋Š” Transformer architecture(Vaswani et al., 2017)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ GPT(Radford et al., b; Brown et al., 2020)์™€ ๊ฐ™์€ ์ผ๋ฐ˜์ ์ธ multi-task learner์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์€ downstream conditional text generation tasks์— ์ ์‘์‹œํ‚ค๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค:

  • summarization
  • machine reading comprehension (MRC)
  • natural language to SQL (NL2SQL)

๊ฐ downstream task๋Š” context-target ์Œ์˜ training dataset Z={(xi,yi)}i=1,..,N์œผ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ xi์™€ yi๋Š” ๋ชจ๋‘ ํ† ํฐ์˜ ์‹œํ€€์Šค์ž…๋‹ˆ๋‹ค.

 

์˜ˆ๋ฅผ ๋“ค์–ด:

  • NL2SQL์—์„œ๋Š” xi๊ฐ€ ์ž์—ฐ์–ด ์ฟผ๋ฆฌ์ด๊ณ  yi๋Š” ํ•ด๋‹น SQL ๋ช…๋ น์ž…๋‹ˆ๋‹ค.
  • summarization์—์„œ๋Š” xi๊ฐ€ ๊ธฐ์‚ฌ ๋‚ด์šฉ์ด๊ณ  yi๋Š” ๊ทธ ์š”์•ฝ์ž…๋‹ˆ๋‹ค.

 

Full Fine-Tuning

full fine-tuning ๊ณผ์ •์—์„œ๋Š” ๋ชจ๋ธ์ด ์‚ฌ์ „ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜ Φ0๋กœ ์ดˆ๊ธฐํ™”๋˜๊ณ , ์กฐ๊ฑด๋ถ€ ์–ธ์–ด ๋ชจ๋ธ๋ง ๋ชฉํ‘œ๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๊ฒฝ์‚ฌ๋ฅผ ๋ฐ˜๋ณต์ ์œผ๋กœ ๋”ฐ๋ฅด๋ฉฐ ์—…๋ฐ์ดํŠธ๋ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ ์‹์ด ์ด๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ full fine-tuning์˜ ์ฃผ์š” ๋‹จ์ ์€, ๊ฐ downstream task๋งˆ๋‹ค โˆฃΔΦโˆฃ=โˆฃΦ0โˆฃ์ธ ๋ณ„๋„์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์„ธํŠธ ΔΦ๋ฅผ ํ•™์Šตํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์ด ๋งค์šฐ ํด ๊ฒฝ์šฐ(์˜ˆ: GPT-3์—์„œ โˆฃΦ0โˆฃ≈175 Billion), ์—ฌ๋Ÿฌ fine-tuned ๋ชจ๋ธ ์ธ์Šคํ„ด์Šค๋ฅผ ์ €์žฅํ•˜๊ณ  ๋ฐฐํฌํ•˜๋Š” ๊ฒƒ์€ ๋งค์šฐ ์–ด๋ ต๊ฑฐ๋‚˜ ๋ถˆ๊ฐ€๋Šฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Parameter-Efficient Approach

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋” ํšจ์œจ์ ์ธ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ task-specific parameter increment ΔΦ=ΔΦ(Θ)๋Š” ํ›จ์”ฌ ๋” ์ž‘์€ ํฌ๊ธฐ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ง‘ํ•ฉ Θ๋กœ ์ธ์ฝ”๋”ฉ๋ฉ๋‹ˆ๋‹ค (โˆฃΘโˆฃโ‰ชโˆฃΦ0โˆฃ). ๊ฒฐ๊ณผ์ ์œผ๋กœ, ΔΦ๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ๋Š” Θ๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฌธ์ œ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.

Low-Rank Representation ์ œ์•ˆ

์ดํ›„ ์„น์…˜์—์„œ๋Š” ΔΦ๋ฅผ low-rank representation์œผ๋กœ ์ธ์ฝ”๋”ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊ณ„์‚ฐ ๋ฐ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์ด GPT-3 175B์ผ ๊ฒฝ์šฐ, ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ โˆฃΘโˆฃ๋Š” โˆฃΦ0โˆฃ์˜ 0.01%๋งŒํผ ์ž‘์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


Aren’t Existing Solutions Good Enough?

์šฐ๋ฆฌ๊ฐ€ ํ•ด๊ฒฐํ•˜๋ ค๋Š” ๋ฌธ์ œ๋Š” ์ƒˆ๋กœ์šด ๊ฒƒ์ด ์•„๋‹™๋‹ˆ๋‹ค. Transfer learning์ด ๋“ฑ์žฅํ•œ ์ดํ›„, ์ˆ˜๋งŽ์€ ์—ฐ๊ตฌ๋“ค์ด ๋ชจ๋ธ ์ ์‘์„ ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ฐ ๊ณ„์‚ฐ ํšจ์œจ์ ์œผ๋กœ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์„ ๋ชจ์ƒ‰ํ•ด ์™”์Šต๋‹ˆ๋‹ค. ์–ธ์–ด ๋ชจ๋ธ๋ง์„ ์˜ˆ๋กœ ๋“ค๋ฉด, ํšจ์œจ์ ์ธ ์ ์‘์„ ์œ„ํ•ด ๋‹ค์Œ ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ์ „๋žต์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค:

  1. Adapter Layers ์ถ”๊ฐ€(Houlsby et al., 2019; Rebuffi et al., 2017; Pfeiffer et al., 2021; Rücklé et al., 2020)
  2. ์ž…๋ ฅ ๊ณ„์ธต ํ™œ์„ฑํ™”(activations) ์ตœ์ ํ™”(Li & Liang, 2021; Lester et al., 2021; Hambardzumyan et al., 2020; Liu et al., 2021)

๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ๋‘ ๊ฐ€์ง€ ์ „๋žต์€ ํŠนํžˆ ๋Œ€๊ทœ๋ชจ ๋ฐ ์ง€์—ฐ(latency)์— ๋ฏผ๊ฐํ•œ ํ”„๋กœ๋•์…˜ ํ™˜๊ฒฝ์—์„œ ํ•œ๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

Adapter Layers๋Š” ์ถ”๋ก  ์ง€์—ฐ(Inference Latency)์„ ์ดˆ๋ž˜

Adapter์—๋Š” ์—ฌ๋Ÿฌ ๋ณ€ํ˜•์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” Houlsby et al. (2019)์˜ ๋‘ ๊ฐœ์˜ ์–ด๋Œ‘ํ„ฐ ๋ ˆ์ด์–ด๋ฅผ Transformer ๋ธ”๋ก๋‹น ์ถ”๊ฐ€ํ•˜๋Š” ์›๋ž˜ ์„ค๊ณ„์™€ Lin et al. (2020)์˜ ํ•˜๋‚˜์˜ ์–ด๋Œ‘ํ„ฐ ๋ ˆ์ด์–ด์™€ ์ถ”๊ฐ€ LayerNorm(Ba et al., 2016)์„ ์‚ฌ์šฉํ•˜๋Š” ์„ค๊ณ„๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ๋…ผ์˜ํ•ฉ๋‹ˆ๋‹ค.

  • ์žฅ์ ๊ณผ ํ•œ๊ณ„:๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๊ทœ๋ชจ ์‹ ๊ฒฝ๋ง์—์„œ๋Š” ํ•˜๋“œ์›จ์–ด ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ํ†ตํ•ด ์ง€์—ฐ์„ ์ค„์ด์ง€๋งŒ, adapter layers๋Š” ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋กœ ์ธํ•ด ์˜จ๋ผ์ธ ์ถ”๋ก  ํ™˜๊ฒฝ์—์„œ ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ 1์ผ ๊ฒฝ์šฐ ๋ˆˆ์— ๋„๋Š” ์ง€์—ฐ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Adapter layers๋Š” bottleneck dimension(์ข์€ ์ฐจ์›)์„ ์‚ฌ์šฉํ•ด ๋ชจ๋ธ ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ์–‘์„ ์ค„์ž…๋‹ˆ๋‹ค(์ข…์ข… ์›๋ž˜ ๋ชจ๋ธ์˜ 1% ๋ฏธ๋งŒ).
  • GPT-2 ์˜ˆ์‹œ:
    • ๋ชจ๋ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ, GPT-2 medium์—์„œ adapter layers๋Š” ์ž‘์€ bottleneck dimension์„ ์‚ฌ์šฉํ•ด๋„ ์ถ”๋ก  ์ง€์—ฐ์„ ์œ ๋ฐœํ•ฉ๋‹ˆ๋‹ค(Table 1 ์ฐธ์กฐ).
  • ๋ฌธ์ œ ์‹ฌํ™”:
    • ๋ชจ๋ธ์„ shard(๋ถ„ํ• )ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ(Shoeybi et al., 2020; Lepikhin et al., 2020), ์ถ”๊ฐ€๋œ ๊นŠ์ด๋Š” ๋” ๋งŽ์€ ๋™๊ธฐ GPU ์—ฐ์‚ฐ(AllReduce ๋ฐ Broadcast)์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค. Adapter ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์—ฌ๋Ÿฌ ๋ฒˆ ์ค‘๋ณต ์ €์žฅํ•˜์ง€ ์•Š๋Š” ํ•œ, ์ด ๋ฌธ์ œ๋Š” ๋”์šฑ ์•…ํ™”๋ฉ๋‹ˆ๋‹ค.

 

ํ”„๋กฌํ”„ํŠธ ์ตœ์ ํ™”๋Š” ์–ด๋ ต๋‹ค (Directly Optimizing the Prompt is Hard)

ํ”„๋ฆฌํ”ฝ์Šค ํŠœ๋‹(prefix tuning; Li & Liang, 2021)์˜ ์‚ฌ๋ก€์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ, ํ”„๋กฌํ”„ํŠธ ์ตœ์ ํ™”๋Š” ๋˜ ๋‹ค๋ฅธ ์–ด๋ ค์›€์„ ๊ฒช์Šต๋‹ˆ๋‹ค:

  1. ์ตœ์ ํ™”์˜ ์–ด๋ ค์›€: Prefix tuning์˜ ์„ฑ๋Šฅ์€ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ์ˆ˜์— ๋”ฐ๋ผ ๋น„์„ ํ˜•์ ์œผ๋กœ ๋ณ€ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊ธฐ์กด ๋…ผ๋ฌธ์—์„œ๋„ ์œ ์‚ฌํ•œ ๊ด€์ฐฐ์ด ๋ณด๊ณ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  2. ์‹œํ€€์Šค ๊ธธ์ด์˜ ์ œ์•ฝ: ์ ์‘์„ ์œ„ํ•ด ์‹œํ€€์Šค ๊ธธ์ด์˜ ์ผ๋ถ€๋ฅผ ์˜ˆ์•ฝํ•ด์•ผ ํ•˜๋ฏ€๋กœ, downstream task์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค. ์ด๋Š” ํ”„๋กฌํ”„ํŠธ ํŠœ๋‹์ด ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์— ๋น„ํ•ด ์„ฑ๋Šฅ์ด ๋‚ฎ์€ ์›์ธ์œผ๋กœ ์ถ”์ •๋ฉ๋‹ˆ๋‹ค.

๊ฒฐ๋ก : Adapter layers์™€ prefix tuning์€ ๊ฐ๊ฐ ๊ณ ์œ ํ•œ ํ•œ๊ณ„๋ฅผ ๊ฐ€์ง€๋ฉฐ, ํŠนํžˆ ๋Œ€๊ทœ๋ชจ ํ”„๋กœ๋•์…˜ ํ™˜๊ฒฝ์—์„œ๋Š” ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ ๊ฐ„์˜ ๊ท ํ˜•์„ ๋งž์ถ”๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ์€ LoRA์™€ ๊ฐ™์€ ์ƒˆ๋กœ์šด ์ ‘๊ทผ๋ฒ•์ด ํ•„์š”ํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Table 1: Inference latency of a single forward pass in GPT-2 medium, measured in milliseconds (ms) on an NVIDIA Quadro RTX8000. All values are averaged over 100 trials. "|Θ|" denotes the number of trainable parameters in the adapter layers. AdapterL and AdapterH are two variants of adapter tuning; the inference latency introduced by adapter layers can be significant in an online, short-sequence-length scenario.


Our Method

This section describes LoRA's simple design and its practical benefits. The principles outlined here apply to any dense layers in deep learning models, though the paper only focuses on, and experiments with, certain weights in Transformer language models.

Low-Rank-Parameterized Update Matrices

A deep learning model contains many dense layers whose weight matrices typically have full rank. Aghajanyan et al. (2020) showed that pre-trained language models have a low "intrinsic dimension" and can still learn efficiently despite a random projection to a smaller subspace.

Inspired by this, the authors hypothesize that the updates to the weights also have a low "intrinsic rank" during adaptation.

 

LoRA์˜ ํ•ต์‹ฌ:

์‚ฌ์ „ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ W0∈Rd×kW_0 \in \mathbb{R}^{d \times k}W0∈Rd×k์˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์ €๋žญํฌ ํ–‰๋ ฌ๋กœ ์ œํ•œํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

  • ํ•™์Šต ์ค‘์—๋Š” W0๋Š” ๊ณ ์ •๋˜๊ณ , A์™€ B๋งŒ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ˆ˜์ •๋œ ์ˆœ์ „ํŒŒ(forward pass) ๊ณผ์ •์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Figure 1์€ ์ด ์žฌ๊ตฌ์„ฑ ๊ณผ์ •์„ ์‹œ๊ฐ์ ์œผ๋กœ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ดˆ๊ธฐํ™”์—์„œ A๋Š” ๊ฐ€์šฐ์‹œ์•ˆ ๋ถ„ํฌ๋กœ, B๋Š” 0์œผ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต ์ดˆ๊ธฐ์— ΔW=0์ด๋ฏ€๋กœ ๋ชจ๋ธ์€ ์‚ฌ์ „ ํ•™์Šต๋œ ์„ฑ๋Šฅ๊ณผ ๋™์ผํ•˜๊ฒŒ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

Figure 1: Our reparametriza-tion. We only train Aand B.

์Šค์ผ€์ผ๋ง ๋ฐ ์ตœ์ ํ™”

ΔWx๋ฅผ α\r๋กœ ์Šค์ผ€์ผ๋งํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ α๋Š” ์ƒ์ˆ˜์ด๊ณ , r์€ ๋žญํฌ์ž…๋‹ˆ๋‹ค. Adam ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ α ํŠœ๋‹์€ ํ•™์Šต๋ฅ  ์กฐ์ •๊ณผ ์œ ์‚ฌํ•œ ์—ญํ• ์„ ํ•˜๋ฏ€๋กœ, ์ดˆ๊ธฐ α๋ฅผ ์„ค์ •ํ•œ ํ›„ ์ถ”๊ฐ€ ํŠœ๋‹ ์—†์ด ํ•™์Šต์„ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

Full Fine-Tuning์˜ ์ผ๋ฐ˜ํ™”

LoRA๋Š” ์‚ฌ์ „ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— ๋Œ€ํ•œ ์—…๋ฐ์ดํŠธ๊ฐ€ ์ ์‘(adaptation) ๋™์•ˆ ํ’€๋žญํฌ์ผ ํ•„์š”๊ฐ€ ์—†์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. LoRA๋ฅผ ๋ชจ๋“  ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— ์ ์šฉํ•˜๊ณ  bias๋ฅผ ํ•™์Šตํ•˜๋ฉด, LoRA์˜ ๋žญํฌ r์„ ์‚ฌ์ „ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์˜ ๋žญํฌ๋กœ ์„ค์ •ํ•˜์—ฌ full fine-tuning์˜ ํ‘œํ˜„๋ ฅ์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฒฐ๊ณผ์ ์œผ๋กœ, ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๋ฉด LoRA๋Š” ์›๋ž˜ ๋ชจ๋ธ์˜ ํ•™์Šต๊ณผ ๋™์ผํ•œ ์ˆ˜์ค€์œผ๋กœ ์ˆ˜๋ ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, adapter-based ๋ฐฉ๋ฒ•์€ MLP๋กœ ์ˆ˜๋ ดํ•˜๋ฉฐ, prefix-based ๋ฐฉ๋ฒ•์€ ๊ธด ์ž…๋ ฅ ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

์ถ”๋ก  ์‹œ ์ถ”๊ฐ€ ์ง€์—ฐ ์—†์Œ (No Additional Inference Latency)

ํ”„๋กœ๋•์…˜์—์„œ LoRA๋ฅผ ๋ฐฐํฌํ•  ๋•Œ, W = W0+BA ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ณ  ์ €์žฅํ•˜์—ฌ ์ผ๋ฐ˜์ ์ธ ์ถ”๋ก ์ฒ˜๋Ÿผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋‹ค๋ฅธ downstream task๋กœ ์ „ํ™˜ํ•˜๋ ค๋ฉด BA๋ฅผ ๋นผ๊ณ  B′A′๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋น ๋ฅด๊ฒŒ ์ „ํ™˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์ถ”๋ก  ์ง€์—ฐ ์—†์ด ์ž‘์—… ๊ฐ„ ์ „ํ™˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

 

Applying LoRA to Transformer

In principle, LoRA can be applied to any subset of a neural network's weight matrices to reduce the number of trainable parameters. The Transformer architecture contains the following weight matrices:

  • Self-attention module: W_q, W_k, W_v, W_o
  • MLP module: two weight matrices

This study adapts only the self-attention weights for downstream tasks and freezes the MLP modules, both for simplicity and for parameter efficiency. (Section 7.1 further studies the effect of adapting different attention weight matrices.)

 

Practical Benefits and Limitations

  1. Memory and storage savings
    • For GPT-3 175B, VRAM consumption during training drops from 1.2TB to 350GB.
    • The checkpoint size shrinks roughly 10,000x (from 350GB to 35MB).
  2. Lower GPU requirements and faster training
    • Training is 25% faster than full fine-tuning (measured on GPT-3 175B),
    • because gradients do not have to be computed for the vast majority of the parameters.
  3. Easy task switching
    • Tasks can be switched by swapping only the LoRA weights.
    • Many customized models can be swapped in and out on the fly while the pre-trained weights stay in VRAM.

Limitations

  1. It is not straightforward to batch inputs from different tasks, each with its own A and B, in a single forward pass once the LoRA weights are merged into W.
  2. In scenarios where latency is not critical, however, one can leave the weights unmerged and dynamically choose the LoRA module to use for each sample in a batch.

Empirical Experiments

LoRA is evaluated on RoBERTa (Liu et al., 2019), DeBERTa (He et al., 2021), and GPT-2 (Radford et al., b) before scaling up to GPT-3 175B (Brown et al., 2020). The experiments cover a wide range of tasks, from natural language understanding (NLU) to generation (NLG).

Datasets and Tasks

  1. RoBERTa and DeBERTa:
    • the GLUE benchmark (Wang et al., 2019)
  2. GPT-2:
    • follows the setup of Li & Liang (2021).
  3. GPT-3 (large-scale experiments):
    • WikiSQL (Zhong et al., 2017): natural language to SQL queries.
    • SAMSum (Gliwa et al., 2019): conversation summarization.

Additional Details

See Appendix C for more details on the datasets. All experiments were run on NVIDIA Tesla V100 GPUs.

 

Baselines

To compare with a broad range of baselines, the setups used in prior work are replicated and their reported numbers reused whenever possible. As a result, some baselines only appear in certain experiments.

Fine-Tuning (FT)

  • Description: the common approach to adaptation; the model is initialized to the pre-trained weights and biases, and all parameters undergo gradient updates.
  • Variants: for example, FTTop2, reported on GPT-2 by Li & Liang (2021), adapts only the last two layers.
    • Simple variants that update some layers while freezing the rest are also included.

Table 2: RoBERTa (base and large) and DeBERTa (XXL) with different adaptation methods, evaluated on the GLUE benchmark. The evaluation metric per task is: MNLI: overall (matched and mismatched) accuracy; CoLA: Matthews correlation; STS-B: Pearson correlation; all other tasks: accuracy. Higher is better for all metrics. The table contains results reported in prior work together with runs performed in the same setup (notation: * indicates numbers published in prior work; † indicates runs configured in a setup similar to Houlsby et al. (2019)).

1. Bias-only or BitFit

  • ์„ค๋ช…: ์ด ๋ฐฉ๋ฒ•์—์„œ๋Š” ๋ชจ๋“  ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๊ณ ์ •ํ•œ ์ƒํƒœ๋กœ bias ๋ฒกํ„ฐ๋งŒ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • ๊ด€๋ จ ์—ฐ๊ตฌ: BitFit (Zaken et al., 2021)์—์„œ ์œ ์‚ฌํ•œ ์ ‘๊ทผ๋ฒ•์„ ์—ฐ๊ตฌํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ์žฅ์ : ๋งค์šฐ ์ ์€ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๊ฐ„๋‹จํžˆ ์ ์šฉ ๊ฐ€๋Šฅ.
  • ๋‹จ์ : ์„ฑ๋Šฅ์ด ์ œํ•œ์ ์ผ ์ˆ˜ ์žˆ์Œ.

2. Prefix-embedding Tuning (PreEmbed)

  • ์„ค๋ช…: ์ž…๋ ฅ ํ† ํฐ ์‚ฌ์ด์— ํŠน๋ณ„ํ•œ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํ† ํฐ์„ ์‚ฝ์ž…ํ•ฉ๋‹ˆ๋‹ค.
    • Prefixing: ํ”„๋กฌํ”„ํŠธ ์•ž์— ์‚ฝ์ž….
    • Infixing: ํ”„๋กฌํ”„ํŠธ ๋’ค์— ์‚ฝ์ž….
  • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜: โˆฃΘโˆฃ=dmodel×(lp+li),
  • lp: prefix ํ† ํฐ ์ˆ˜, li: infix ํ† ํฐ ์ˆ˜.
  • ์„ฑ๋Šฅ ์˜ํ–ฅ: ํ† ํฐ ๋ฐฐ์น˜ ์œ„์น˜์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ(Li & Liang, 2021).

3. Prefix-layer Tuning (PreLayer)

  • ์„ค๋ช…: Prefix-embedding Tuning์˜ ํ™•์žฅํŒ์œผ๋กœ, ํŠน๋ณ„ํ•œ ํ† ํฐ์˜ ์ž„๋ฒ ๋”ฉ๋งŒ ํ•™์Šตํ•˜๋Š” ๋Œ€์‹  ๊ฐ Transformer ๋ ˆ์ด์–ด์—์„œ ํ™œ์„ฑ๊ฐ’(activations)์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜: โˆฃΘโˆฃ=L×dmodel×(lp+li),L: Transformer ๋ ˆ์ด์–ด ์ˆ˜.
  • ํŠน์ง•: ๋” ๋งŽ์€ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ํ•™์Šตํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค์ง€๋งŒ, ๊ณ„์‚ฐ ๋น„์šฉ์ด ์ฆ๊ฐ€.

4. Adapter Tuning

  • ์„ค๋ช…: Adapter Layers๋ฅผ Transformer ๋ชจ๋“ˆ(self-attention ๋ฐ MLP) ์‚ฌ์ด์— ์‚ฝ์ž…ํ•˜์—ฌ ์ ์‘ํ•ฉ๋‹ˆ๋‹ค.
    • AdapterH (Houlsby et al., 2019): ๊ธฐ๋ณธ ์„ค๊ณ„. ๋‘ ๊ฐœ์˜ fully connected layers์™€ ๋น„์„ ํ˜•์„ฑ์„ ํฌํ•จ.
    • AdapterL (Lin et al., 2020): MLP ๋ชจ๋“ˆ ๋’ค์™€ LayerNorm ์ดํ›„์—๋งŒ ์–ด๋Œ‘ํ„ฐ๋ฅผ ์ ์šฉํ•˜์—ฌ ํšจ์œจ์„ฑ์„ ๋†’์ž„.
    • AdapterP (Pfeiffer et al., 2021): AdapterL๊ณผ ์œ ์‚ฌํ•œ ์„ค๊ณ„.
    • AdapterD (Rücklé et al., 2020): ์ผ๋ถ€ ์–ด๋Œ‘ํ„ฐ ๋ ˆ์ด์–ด๋ฅผ ์‚ญ์ œํ•˜์—ฌ ํšจ์œจ์„ฑ์„ ๋†’์ž„.
  • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜: โˆฃΘโˆฃ=LAdpt × (2 × dmodel × r + r + dmodel) + 2 × LLN × dmodel,
  • LAdpt: ์–ด๋Œ‘ํ„ฐ ๋ ˆ์ด์–ด ์ˆ˜
  • LLN: ํ•™์Šต ๊ฐ€๋Šฅํ•œ LayerNorm ์ˆ˜.

5. LoRA (Low-Rank Adaptation)

  • ์„ค๋ช…: ๊ธฐ์กด์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ๊ณผ ๋ณ‘๋ ฌ๋กœ ์ €๋žญํฌ(rank decomposition) ํ–‰๋ ฌ ์Œ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
    • ๋Œ€๋ถ€๋ถ„์˜ ์‹คํ—˜์—์„œ Wq์™€ Wv์—๋งŒ ์ ์šฉ(Section 4.2).
  • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜: โˆฃΘโˆฃ=2×LLoRA×dmodel×r
  • LLoRA: LoRA๋ฅผ ์ ์šฉํ•œ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์˜ ์ˆ˜.
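
To make these budget formulas tangible, the small sketch below evaluates them in Python with GPT-3-scale constants (d_model = 12288, 96 layers); the helper names are mine, not from the paper or loralib.

```python
# Trainable-parameter counts for the baselines above, at GPT-3 scale.
D_MODEL, L = 12288, 96

def pre_embed(lp: int, li: int) -> int:
    return D_MODEL * (lp + li)                        # PreEmbed

def pre_layer(lp: int, li: int) -> int:
    return L * D_MODEL * (lp + li)                    # PreLayer

def adapter(n_adpt: int, r: int, n_ln: int) -> int:   # Adapter tuning
    return n_adpt * (2 * D_MODEL * r + r + D_MODEL) + 2 * n_ln * D_MODEL

def lora(n_matrices: int, r: int) -> int:             # LoRA
    return 2 * n_matrices * D_MODEL * r

# LoRA on W_q and W_v of every layer (2 matrices x 96 layers) with r = 4:
print(lora(n_matrices=2 * L, r=4))  # 18,874,368 -> the ~18M budget used later
```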

 

RoBERTa Base/Large

RoBERTa (Liu et al., 2019) optimized the pre-training recipe originally proposed in BERT (Devlin et al., 2019a) and boosted task performance without introducing many more trainable parameters.

While larger models have since overtaken RoBERTa on NLP leaderboards such as GLUE (Wang et al., 2019), it remains a competitive and popular pre-trained model for its size among practitioners.

  • Setup:
    • The pre-trained RoBERTa base (125M) and RoBERTa large (355M) from the HuggingFace Transformers library (Wolf et al., 2020) are used.
    • Different efficient adaptation approaches are evaluated on tasks from the GLUE benchmark.
  • Comparison:
    • The setups of Houlsby et al. (2019) and Pfeiffer et al. (2021) are replicated.
    • Two crucial changes are made for a fair comparison:
      1. The same batch size and a sequence length of 128 are used for all tasks, to match the adapter baselines.
      2. For MRPC, RTE, and STS-B, the model is initialized from the pre-trained model rather than from a model already adapted to MNLI (unlike the fine-tuning baseline).
  • Results:
    • Runs following this restricted setup of Houlsby et al. (2019) are reported in the top three sections of Table 2.
    • See Section D.1 for details on the hyperparameters used.

 

DeBERTa XXL

DeBERTa (He et al., 2021) is a more recent variant of BERT trained on a much larger scale that performs very competitively on benchmarks such as GLUE (Wang et al., 2019) and SuperGLUE (Wang et al., 2020).

  • Goal: evaluate whether LoRA can match the performance of a fully fine-tuned DeBERTa XXL (1.5B) on the GLUE benchmark.
  • Results:
    • Results are shown in the bottom section of Table 2.
    • See Section D.2 for details on the hyperparameters used.

 

GPT-2 Medium/Large

LoRA์˜ NLU์—์„œ์˜ ๊ฒฝ์Ÿ๋ ฅ์„ ์ž…์ฆํ•œ ํ›„, ์šฐ๋ฆฌ๋Š” LoRA๊ฐ€ NLG ๋ชจ๋ธ์—์„œ๋„ ์—ฌ์ „ํžˆ ์šฐ์œ„๋ฅผ ์ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ํ‰๊ฐ€ํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด GPT-2 medium ๋ฐ large ๋ชจ๋ธ(Radford et al., b)์„ ์‹คํ—˜ ๋Œ€์ƒ์œผ๋กœ ์‚ผ์•˜์Šต๋‹ˆ๋‹ค.

  • ์„ค์ •:
    • Li & Liang (2021)์™€ ์ง์ ‘ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค์ •์„ ์ตœ๋Œ€ํ•œ ๋™์ผํ•˜๊ฒŒ ์œ ์ง€ํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ๊ณต๊ฐ„ ์ œ์•ฝ์œผ๋กœ ์ธํ•ด ์ด ์„น์…˜์—์„œ๋Š” E2E NLG ์ฑŒ๋ฆฐ์ง€์˜ ๊ฒฐ๊ณผ๋งŒ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค(Table 3).
  • ์ถ”๊ฐ€ ๊ฒฐ๊ณผ:
    • WebNLG (Gardent et al., 2017) ๋ฐ DART (Nan et al., 2020)์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋Š” Section F.1์„ ์ฐธ์กฐํ•˜์‹ญ์‹œ์˜ค.
    • ์‚ฌ์šฉ๋œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ชฉ๋ก์€ Section D.3์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Table 4: Performance of different adaptation methods on GPT-3 175B. The logical form validation accuracy is reported for WikiSQL, validation accuracy for MultiNLI-matched, and Rouge-1/2/L for SAMSum. LoRA performs better than prior approaches, including full fine-tuning. The results on WikiSQL fluctuate by about ±0.5%, on MNLI-m by ±0.1%, and on the three SAMSum metrics by ±0.2/±0.2/±0.1.

Scaling Up to GPT-3 175B

As the final stress test for LoRA, it is scaled up to GPT-3 with 175 billion parameters.

Due to the high training cost, only the typical standard deviation across random seeds for a given task is reported, as opposed to one standard deviation per entry.

As Table 4 shows, LoRA matches or exceeds the fine-tuning baseline on all three datasets.

 

ํŠน์ง• ๋ฐ ๊ด€์ฐฐ

  1. ๋งค๊ฐœ๋ณ€์ˆ˜ ์ฆ๊ฐ€์˜ ๋น„์„ ํ˜•์  ์„ฑ๋Šฅ ๋ณ€ํ™”:
    • Figure 2์— ๋‚˜ํƒ€๋‚œ ๋ฐ”์™€ ๊ฐ™์ด, ๋ชจ๋“  ๋ฐฉ๋ฒ•์ด ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ• ์ˆ˜๋ก ์ผ๊ด€๋˜๊ฒŒ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜๋Š” ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค.
    • Prefix-embedding tuning: ํŠน๋ณ„ ํ† ํฐ์ด 256๊ฐœ๋ฅผ ์ดˆ๊ณผํ•˜๋ฉด ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ๊ฐ์†Œ.
    • Prefix-layer tuning: ํŠน๋ณ„ ํ† ํฐ์ด 32๊ฐœ๋ฅผ ์ดˆ๊ณผํ•˜๋ฉด ์„ฑ๋Šฅ์ด ํ•˜๋ฝ.
  2. ์›์ธ์— ๋Œ€ํ•œ ๊ฐ€์„ค:
    • ํŠน๋ณ„ ํ† ํฐ์ด ๋งŽ์•„์งˆ์ˆ˜๋ก ์ž…๋ ฅ ๋ถ„ํฌ๊ฐ€ ์‚ฌ์ „ ํ•™์Šต ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์—์„œ ๋” ๋ฉ€์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์œผ๋กœ ์ถ”์ •๋ฉ๋‹ˆ๋‹ค.
    • ์ด ํ˜„์ƒ์€ Li & Liang (2021)์—์„œ๋„ ์œ ์‚ฌํ•˜๊ฒŒ ๊ด€์ฐฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

LoRA๋Š” ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ(GPT-3 175B)์—์„œ๋„ ์•ˆ์ •์ ์œผ๋กœ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜๋ฉฐ, ํŠนํžˆ ์ ์€ ๋ฐ์ดํ„ฐ์™€ ๋‹ค์–‘ํ•œ ์ž‘์—… ํ™˜๊ฒฝ์—์„œ์˜ ๊ฐ•๋ ฅํ•œ ์ ์‘๋ ฅ์„ ์ž…์ฆํ–ˆ์Šต๋‹ˆ๋‹ค.

Figure 2: GPT-3 175B์˜ WikiSQL ๋ฐ MNLI-matched ์ž‘์—…์—์„œ ์ ์‘ ๋ฐฉ๋ฒ•๋ณ„ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜์— ๋”ฐ๋ฅธ ๊ฒ€์ฆ ์ •ํ™•๋„ ๋น„๊ต. LoRA๋Š” ๋‹ค๋ฅธ ์ ์‘ ๋ฐฉ๋ฒ•๋“ค์— ๋น„ํ•ด ๋” ๋‚˜์€ ํ™•์žฅ์„ฑ๊ณผ ์ž‘์—… ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ํ”Œ๋กฏ๋œ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•œ ์ž์„ธํ•œ ์ •๋ณด๋Š” Section F.2๋ฅผ ์ฐธ์กฐํ•˜์‹ญ์‹œ์˜ค.


Understanding the Low-Rank Updates

Given LoRA's empirical performance, this section describes the properties of the low-rank adaptation learned on downstream tasks. The low-rank structure not only lowers the hardware barrier to entry, allowing multiple experiments to run in parallel, but also gives better interpretability of how the update weights (ΔW) correlate with the pre-trained weights (W). The study focuses on GPT-3 175B, where the trainable parameters were reduced by up to 10,000x without hurting task performance.

 

๋˜ํ•œ LoRA์˜ ์„ฑ๋Šฅ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์Œ ์งˆ๋ฌธ์— ๋Œ€ํ•ด ์‹คํ—˜์  ์—ฐ๊ตฌ๋ฅผ ์ˆ˜ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

  1. Parameter budget constraint๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ, pre-trained Transformer์˜ ์–ด๋–ค weight matrix์— LoRA๋ฅผ ์ ์šฉํ•ด์•ผ downstream performance๊ฐ€ ์ตœ๋Œ€ํ™”๋˜๋Š”๊ฐ€?
  2. Optimal adaptation matrix โˆ†W๋Š” ์‹ค์ œ๋กœ rank-deficientํ•œ๊ฐ€? ๊ทธ๋ ‡๋‹ค๋ฉด, ์‹ค์šฉ์ ์œผ๋กœ ์ ํ•ฉํ•œ rank๋Š” ๋ฌด์—‡์ธ๊ฐ€?
  3. โˆ†W์™€ W ๊ฐ„์˜ ๊ด€๊ณ„๋Š” ๋ฌด์—‡์ธ๊ฐ€?
    • โˆ†W๋Š” W์™€ ์–ผ๋งˆ๋‚˜ ๋†’์€ correlation์„ ๊ฐ€์ง€๋Š”๊ฐ€?
    • โˆ†W๋Š” W์— ๋น„ํ•ด ์–ผ๋งˆ๋‚˜ ํฐ๊ฐ€?
  • Question (2)์™€ Question (3)์— ๋Œ€ํ•œ ๋‹ต๋ณ€์€ pre-trained language model์„ downstream task์— ์‚ฌ์šฉํ•˜๋Š” ๊ทผ๋ณธ ์›์น™์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ insight๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

 

Transformer์˜ ์–ด๋–ค ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— LoRA๋ฅผ ์ ์šฉํ•ด์•ผ ํ•˜๋Š”๊ฐ€?

Parameter budget๊ฐ€ ์ œํ•œ๋  ๊ฒฝ์šฐ, LoRA๋ฅผ ์–ด๋–ค weight type์— ์ ์šฉํ•ด์•ผ downstream task์—์„œ ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์„๊นŒ์š”?

์šฐ๋ฆฌ๋Š” self-attention module์˜ weight matrices๋งŒ ๊ณ ๋ คํ–ˆ์Šต๋‹ˆ๋‹ค. GPT-3 175B์—์„œ ์•ฝ 18M trainable parameters(FP16 ๊ธฐ์ค€ ์•ฝ 35MB)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์‹คํ—˜์„ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

  • r = 8: ํ•˜๋‚˜์˜ attention weight type์— LoRA๋ฅผ ์ ์šฉ.
  • r = 4: ๋‘ ๊ฐ€์ง€ attention weight type์— LoRA๋ฅผ ์ ์šฉ.(์ด 96๊ฐœ์˜ layers์—์„œ ์‹คํ—˜.)
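
As a quick sanity check on why these two settings share the same budget, plugging GPT-3's d_model = 12288 into |Θ| = 2 × L_LoRA × d_model × r gives the same count for both:

2 × 96 × 12288 × 8 = 2 × 192 × 12288 × 4 = 18,874,368 ≈ 18.9M trainable parameters.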

Table 5: Validation accuracy on WikiSQL and MultiNLI after applying LoRA to different types of attention weights in GPT-3. Given the same number of trainable parameters, adapting both W_q and W_v gives the best performance overall. The standard deviation across random seeds was consistent for a given dataset, and is reported in the first column.

 

๋ชจ๋“  parameters๋ฅผ โˆ†Wq ๋˜๋Š” โˆ†Wk์— ์ง‘์ค‘์‹œํ‚ค๋Š” ๊ฒฝ์šฐ, ์„ฑ๋Šฅ์ด ์ƒ๋‹นํžˆ ๋‚ฎ์•„์กŒ์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, Wq์™€ Wv๋ฅผ ๋™์‹œ์— ์ ์‘์‹œํ‚ค๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ณ์•˜์Šต๋‹ˆ๋‹ค. ์ด๋Š” rank๊ฐ€ 4์™€ ๊ฐ™์ด ์ž‘์€ ๊ฐ’์œผ๋กœ๋„ โˆ†W์—์„œ ์ถฉ๋ถ„ํ•œ ์ •๋ณด๋ฅผ ์บก์ฒ˜ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ๋‹จ์ผ weight ์œ ํ˜•์— ๋” ํฐ rank๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์—ฌ๋Ÿฌ weight matrices๋ฅผ ์ ์‘์‹œํ‚ค๋Š” ๊ฒƒ์ด ๋” ๋ฐ”๋žŒ์งํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

 

What is the Optimal Rank r for LoRA?

The effect of the rank r on model performance is investigated by comparing the following settings:

  • adapting {W_q, W_v},
  • adapting {W_q, W_k, W_v, W_o}, and
  • adapting {W_q} alone.

Table 6: Validation accuracy on WikiSQL and MultiNLI with different rank r. Surprisingly, on these datasets a rank as small as r = 1 suffices for adapting both {W_q, W_v}, while training {W_q} alone needs a larger r.

 

Table 6์˜ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • LoRA๋Š” ๋งค์šฐ ์ž‘์€ rank r ๊ฐ’์œผ๋กœ๋„ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•ฉ๋‹ˆ๋‹ค.
  • ํŠนํžˆ {Wq,Wv}๋ฅผ ์ ์‘ํ•  ๋•Œ, ๋‹จ์ผ {Wq}๋งŒ ์ ์‘ํ•˜๋Š” ๊ฒฝ์šฐ๋ณด๋‹ค ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค.
  • ์ด๋Š” update matrix ΔW๊ฐ€ ๋งค์šฐ ์ž‘์€ "intrinsic rank"๋ฅผ ๊ฐ€์งˆ ๊ฐ€๋Šฅ์„ฑ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

Subspace Similarity Between Different r

์ž‘์€ r๊ฐ’์œผ๋กœ LoRA๊ฐ€ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•˜๋Š” ์ด์œ ๋ฅผ ๋” ์ž˜ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด, ์„œ๋กœ ๋‹ค๋ฅธ r๊ฐ’์— ๋”ฐ๋ฅธ subspace ๊ฐ„์˜ ์œ ์‚ฌ์„ฑ์„ ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, rank r=8์—์„œ ํ•™์Šต๋œ ํ–‰๋ ฌ Ar=8 ๊ณผ rank r=64์—์„œ ํ•™์Šต๋œ ํ–‰๋ ฌ Ar=64๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.

์šฐ๋ฆฌ๋Š” Singular Value Decomposition(SVD)์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฐ ํ–‰๋ ฌ์˜ right-singular unitary matrix UAr=8์™€ Ar=64๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.

 

Q. UAr=8์—์„œ ์ƒ์œ„ i๊ฐœ์˜ singular vector๊ฐ€ ์ƒ์„ฑํ•˜๋Š” subspace๊ฐ€, UAr=64์˜ ์ƒ์œ„ j๊ฐœ์˜ singular vector๊ฐ€ ์ƒ์„ฑํ•˜๋Š” subspace์— ์–ผ๋งˆ๋‚˜ ํฌํ•จ๋˜๋Š”๊ฐ€?

Normalized Subspace Similarity (์ •๊ทœํ™”๋œ Subspace ์œ ์‚ฌ์„ฑ): ์ด ์œ ์‚ฌ์„ฑ์„ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด Grassmann distance ๊ธฐ๋ฐ˜์˜ ์ •๊ทœํ™”๋œ subspace similarity๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

  • Ui: UAr=8์—์„œ ์ƒ์œ„ i๊ฐœ์˜ singular vector์˜ column์œผ๋กœ ๊ตฌ์„ฑ๋œ matrix.
  • Uj: UAr=64์—์„œ ์ƒ์œ„ j๊ฐœ์˜ singular vector์˜ column์œผ๋กœ ๊ตฌ์„ฑ๋œ matrix.
  • ฯ• ๊ฐ’์˜ ๋ฒ”์œ„๋Š” [0,1]:
    • 1: ๋‘ subspace๊ฐ€ ์™„์ „ํžˆ ๊ฒน์นจ.
    • 0: ๋‘ subspace๊ฐ€ ์™„์ „ํžˆ ๋ถ„๋ฆฌ๋จ.

Figure 3: Subspace similarity between the column vectors of A_{r=8} and A_{r=64} for both ΔW_q and ΔW_v. The third and fourth plots zoom in on the lower-left triangle of the first two for a more detailed comparison. The top directions at rank r = 8 are contained in the subspace for r = 64, and vice versa, showing that even a small r can effectively capture the important subspace information.

์ค‘์š”ํ•œ ๊ด€์ฐฐ์  (Figure 3 ๋ถ„์„):

  1. Top Singular Vector์˜ ์ค‘๋ณต์„ฑ:
    • Ar=8๊ณผ Ar=64์—์„œ top singular vector ๋ฐฉํ–ฅ์€ ํฌ๊ฒŒ ์ค‘๋ณต๋˜๋ฉฐ, ๋‚˜๋จธ์ง€ ๋ฐฉํ–ฅ์€ ๊ทธ๋ ‡์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
    • ํŠนํžˆ, Ar=8์˜ ΔWv์™€ Ar=64์˜ ΔWv (๋˜๋Š” ΔWq)๋Š” ์ฐจ์› 1์˜ subspace๋ฅผ ๊ณต์œ ํ•˜๋ฉฐ, ์ •๊ทœํ™”๋œ ์œ ์‚ฌ์„ฑ ๊ฐ’์ด 0.5 ์ด์ƒ์ž…๋‹ˆ๋‹ค.
    • ์ด๋Š” rank r=1์ด GPT-3์˜ downstream task์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ์ด์œ ๋ฅผ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.
  2. Noise์™€ ์œ ์šฉํ•œ ๋ฐฉํ–ฅ:
    • Ar=8๊ณผ Ar=64 ๋ชจ๋‘ ๋™์ผํ•œ pre-trained ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•™์Šต๋˜์—ˆ์œผ๋ฏ€๋กœ, Figure 3์€ Ar=8๊ณผ Ar=64์˜ top singular vector ๋ฐฉํ–ฅ์ด ๊ฐ€์žฅ ์œ ์šฉํ•จ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
    • ๋ฐ˜๋ฉด, ๋‹ค๋ฅธ ๋ฐฉํ–ฅ์€ ํ•™์Šต ์ค‘ ์ถ•์ ๋œ random noise๊ฐ€ ํฌํ•จ๋  ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค.
    • ๋”ฐ๋ผ์„œ adaptation matrix๋Š” ์‹ค์ œ๋กœ ๋งค์šฐ ๋‚ฎ์€ rank๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Rank r=64๋กœ ํ•™์Šต๋œ ๋‘ ๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ random seed ์‹คํ–‰์—์„œ ΔWq์™€ ΔWv์˜ ์ •๊ทœํ™”๋œ subspace similarity๋ฅผ ๋ถ„์„ํ–ˆ์Šต๋‹ˆ๋‹ค.

  • ΔWq:
    • ๋” ๋†’์€ "intrinsic rank"๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ๋‘ ์‹คํ–‰ ๊ฐ„ ๋” ๋งŽ์€ ๊ณตํ†ต singular value ๋ฐฉํ–ฅ์„ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ์ด๋Š” Table 6์—์„œ ๊ด€์ฐฐ๋œ ์‹คํ—˜ ๊ฒฐ๊ณผ์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.
  • ΔWv:
    • ์ƒ๋Œ€์ ์œผ๋กœ ๋” ๋‚ฎ์€ intrinsic rank๋ฅผ ๋ณด์ด๋ฉฐ, ๊ณตํ†ต singular value ๋ฐฉํ–ฅ์˜ ์ˆ˜๊ฐ€ ์ ์—ˆ์Šต๋‹ˆ๋‹ค.
  • ๋น„๊ต:
    • ๋‘ ๊ฐœ์˜ random Gaussian matrices์—์„œ๋Š” ๊ณตํ†ต singular value ๋ฐฉํ–ฅ์ด ์ „ํ˜€ ๊ณต์œ ๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

 

How Does the Adaptation Matrix ΔW Compare to W?

The relationship between ΔW and W is investigated further.

Key questions:

  1. Does ΔW highly correlate with W? (That is, is ΔW mostly contained in the top singular directions of W?)
  2. How "large" is ΔW compared to its corresponding directions in W?

Answering these questions provides important clues for understanding the adaptation mechanism of pre-trained language models.

Figure 4: Subspace similarity across random seeds and between random Gaussian matrices. The left and middle plots show the normalized subspace similarity between the A_{r=64} column vectors of ΔW_q and ΔW_v from two runs with different random seeds, in the 48th layer. The right plot is the heat-map of the subspace similarity between the column vectors of two random Gaussian matrices.

  • Projection:
    • W is projected onto the r-dimensional subspace of ΔW
    • by computing U^⊤ W V^⊤, where U and V are the left and right singular-vector matrices of ΔW.
  • Frobenius norm comparison:
    • ‖U^⊤ W V^⊤‖_F is compared with ‖W‖_F.
    • As a comparison, the same computation is performed with U and V replaced by the top r singular vectors of W or by a random matrix.

Table 7: The Frobenius norm of U^⊤ W_q V^⊤, where U and V are the left/right top r singular-vector directions of one of: 1. ΔW_q, 2. W_q, 3. a random matrix. The weight matrices are taken from the 48th layer of GPT-3.

Table 7 ๋ถ„์„ ๊ฒฐ๊ณผ

  1. ΔW์™€ W์˜ ์ƒ๊ด€์„ฑ: ΔW๋Š” random matrix๋ณด๋‹ค W์™€ ๋” ๋†’์€ ์ƒ๊ด€์„ฑ์„ ๊ฐ€์ง€๋ฉฐ, ์ด๋Š” ΔW๊ฐ€ W์— ์ด๋ฏธ ์กด์žฌํ•˜๋Š” ์ผ๋ถ€ ํŠน์ง•์„ ์ฆํญ(amplify)ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  2. ΔW\์˜ ๋…ํŠนํ•œ ๋ฐฉํ–ฅ: ΔW๋Š” WW ์ƒ์œ„ singular directions๋ฅผ ๋ฐ˜๋ณตํ•˜์ง€ ์•Š๊ณ , W์—์„œ ๋œ ๊ฐ•์กฐ๋œ ๋ฐฉํ–ฅ์„ ์ฆํญํ•ฉ๋‹ˆ๋‹ค.
  3. ์ฆํญ ๊ณ„์ˆ˜: r=4์—์„œ ์ฆํญ ๊ณ„์ˆ˜๋Š” ์•ฝ 21.5≈6.91/0.32๋กœ ๋งค์šฐ ํฝ๋‹ˆ๋‹ค. r=64์—์„œ ์ฆํญ ๊ณ„์ˆ˜๊ฐ€ ๋” ์ž‘์Šต๋‹ˆ๋‹ค.
  4. ์ถ”๊ฐ€ ์‹œ๊ฐํ™”: Wq์˜ ์ƒ์œ„ singular directions๋ฅผ ๋” ํฌํ•จํ• ์ˆ˜๋ก ์ƒ๊ด€์„ฑ์ด ์–ด๋–ป๊ฒŒ ๋ณ€ํ™”ํ•˜๋Š”์ง€ ์‹œ๊ฐํ™”ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Low-rank adaptation matrix๋Š” ์ผ๋ฐ˜์ ์ธ ์‚ฌ์ „ ํ•™์Šต ๋ชจ๋ธ์—์„œ ํ•™์Šต๋˜์—ˆ์ง€๋งŒ ๊ฐ•์กฐ๋˜์ง€ ์•Š์€ ์ค‘์š”ํ•œ ํŠน์ง•์„ ํŠน์ • downstream task์— ๋งž๊ฒŒ ์ฆํญํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.


Conclusion and Future Work

Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switching cost of hosting independent instances for different tasks. The paper proposes LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality. Importantly, it allows quick task switching when deployed as a service, because the vast majority of the model parameters are shared. While the paper focuses on Transformer language models, the proposed principles are generally applicable to any neural network with dense layers.

 

๋ฏธ๋ž˜ ์—ฐ๊ตฌ๋ฅผ ์œ„ํ•œ ์—ฌ๋Ÿฌ ๋ฐฉํ–ฅ์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค:

  1. LoRA๋Š” ๋‹ค๋ฅธ ํšจ์œจ์ ์ธ ์ ์‘ ๋ฐฉ๋ฒ•๊ณผ ๊ฒฐํ•ฉ๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ ํ†ตํ•ด ์ƒํ˜ธ ๋ณด์™„์ ์ธ ๊ฐœ์„ ์„ ์ œ๊ณตํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  2. Fine-tuning ๋˜๋Š” LoRA์˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ์—ฌ์ „ํžˆ ๋ช…ํ™•ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์‚ฌ์ „ ํ•™์Šต ์ค‘ ํ•™์Šต๋œ ํŠน์ง•์€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜๋˜์–ด downstream task์—์„œ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š”๊ฐ€? LoRA๋Š” full fine-tuning๋ณด๋‹ค ์ด๋ฅผ ๋” ๋ช…ํ™•ํžˆ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐํšŒ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  3. ์šฐ๋ฆฌ๋Š” ๋Œ€๋ถ€๋ถ„ ํœด๋ฆฌ์Šคํ‹ฑ์— ์˜์กดํ•ด LoRA๋ฅผ ์ ์šฉํ•  weight matrices๋ฅผ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ๋” ์ฒด๊ณ„์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ์„๊นŒ์š”?
  4. ๋งˆ์ง€๋ง‰์œผ๋กœ, ΔW์˜ rank-deficiency๋Š” W ์—ญ์‹œ rank-deficientํ•  ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌํ•˜๋ฉฐ, ์ด๋Š” ๋ฏธ๋ž˜ ์—ฐ๊ตฌ์˜ ์˜๊ฐ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.