[Paper Review] When MOE meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications
LLM์—์„œ MOE ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ Medical Domain์—์„œ Task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” Reference๋ฅผ ์ฐพ์•„๋ณด๋ฉด์„œ ๋…ผ๋ฌธ์„ ์ฝ์€ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•ด ๋ณด๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
  • ๋…ผ๋ฌธ ์›๋ฌธ ์‚ฌ์ดํŠธ
 

When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications

The recent surge in Large Language Models (LLMs) has garnered significant attention across numerous fields. Fine-tuning is often required to fit general LLMs for a specific domain, like the web-based healthcare system. However, two problems arise during fi

arxiv.org


Abstract

์ตœ๊ทผ Large Language Models (LLMs)์˜ ๊ธ‰๊ฒฉํ•œ ์ฆ๊ฐ€๊ฐ€ ์—ฌ๋Ÿฌ ๋ถ„์•ผ์—์„œ ํฐ ์ฃผ๋ชฉ์„ ๋ฐ›๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Fine-tuning์€ ์›น ๊ธฐ๋ฐ˜ healthcare system๊ณผ ๊ฐ™์€ ํŠน์ • ๋„๋ฉ”์ธ์— ์ผ๋ฐ˜ LLMs์„ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ข…์ข… ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์˜๋ฃŒ ์‘์šฉ ๋ถ„์•ผ์—์„œ LLMs๋ฅผ fine-tuningํ•˜๋Š” ๊ณผ์ •์—์„œ ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

์ฒซ ๋ฒˆ์งธ๋Š” task variety ๋ฌธ์ œ๋กœ, ์ด๋Š” ์‹ค์ œ ์˜๋ฃŒ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ ๋‹ค์–‘ํ•œ ์ž‘์—…์ด ํฌํ•จ๋œ๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

  • ์ด๋Ÿฌํ•œ ๋‹ค์–‘์„ฑ์€ data imbalance ๋ฐ seesaw ๋ฌธ์ œ๋กœ ์ธํ•ด sub-optimal fine-tuning์„ ์ดˆ๋ž˜ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

๋‘ ๋ฒˆ์งธ๋กœ, LLMs์˜ ๋Œ€๊ทœ๋ชจ ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” fine-tuning์— ๋งŽ์€ ์‹œ๊ฐ„๊ณผ computation ์ž์›์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค.

  • ์ด๋Ÿฌํ•œ ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” MOELoRA๋กœ ๋ช…๋ช…๋œ multi-task ์˜๋ฃŒ ์‘์šฉ์„ ์œ„ํ•œ ์ƒˆ๋กœ์šด parameter efficient fine-tuning framework๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

์„ค๊ณ„๋œ framework๋Š” multi-task learning์„ ์œ„ํ•œ mixture-of-expert (MOE)์˜ ์ด์ ๊ณผ low-rank adaptation (LoRA) parameter efficient fine-tuning์˜ ์ด์ ์„ ๋ชจ๋‘ ํก์ˆ˜ํ•˜๋„๋ก ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

 

MOE์™€ LoRA๋ฅผ ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” trainable parameter๋กœ์„œ ์—ฌ๋Ÿฌ expert๋ฅผ ๊ณ ์•ˆํ–ˆ์œผ๋ฉฐ, ๊ฐ expert๋Š” trainable parameter์˜ ์†Œํ˜• ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด low-rank matrix์˜ ์ผ๋ถ€๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ ๋‹ค์Œ, ๋ชจ๋“  MOELoRA layer์— ๋Œ€ํ•ด task-motivated gate function์„ ์ œ์•ˆํ•˜์—ฌ ๊ฐ expert์˜ ๊ธฐ์—ฌ๋„๋ฅผ ์ œ์–ดํ•˜๊ณ  ๋‹ค์–‘ํ•œ task๋ฅผ ์œ„ํ•œ distinct parameter๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


Introduction

๋”ฐ๋ผ์„œ, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” open-source LLMs์™€ ์˜๋ฃŒ ์ง€์‹ ๋ฐ ์ž„์ƒ ์ž‘์—…์— ๋Œ€ํ•œ fine-tuning์— ์ดˆ์ ์„ ๋งž์ถฅ๋‹ˆ๋‹ค.

์˜๋ฃŒ ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ LLMs์˜ fine-tuning์€ ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ๋„์ „ ๊ณผ์ œ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

  1. Task Variety Problem: ์‹ค์ œ ํด๋ฆฌ๋‹‰์—์„œ LLMs๋Š” doctor recommendation, diagnosis prediction, medicine recommendation, medical named entity recognition , clinical report generation๋“ฑ์˜ ๋‹ค์–‘ํ•œ ์ž‘์—…์— ์ ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  2. High Tuning Cost: fine-tuning์ด ํ‘œ์ค€ ์ ‘๊ทผ ๋ฐฉ์‹์ด์—ˆ๋˜ BERT ์‹œ๋Œ€ ๋™์•ˆ์—๋„, LLMs์˜ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์•„์„œ ๋„์ „ ๊ณผ์ œ๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Task variety ๋ฌธ์ œ์™€ ๊ด€๋ จํ•˜์—ฌ, ์—ฌ๋Ÿฌ multi-task ํ•™์Šต ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ์ œ์•ˆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๊ทธ ์ค‘์—์„œ๋„ Mixture-of-Experts (MOE)๋Š” standout ์ค‘ ํ•˜๋‚˜๋กœ, ์ด๋Š” task-shared์™€ task-specific ์ง€์‹์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์ „๋ฌธ๊ฐ€๋ฅผ ๋ถ„๋ฆฌํ•˜์—ฌ ์‚ฌ์šฉํ•˜๊ณ , ์ „๋ฌธ๊ฐ€ ๊ธฐ์—ฌ๋„๋ฅผ ์กฐ์ •ํ•˜๋Š” gate function์„ ํ†ตํ•ฉํ•˜์—ฌ task ๊ฐ„์˜ ๊ท ํ˜•์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

 

์ตœ๊ทผ parameter efficient fine-tuning (PEFT) ๋ฐฉ๋ฒ•๋ก ์ด ์ด๋Ÿฌํ•œ ๋†’์€ fine-tuning ๋น„์šฉ ๋ฌธ์ œ์— ๋Œ€ํ•œ ์ž ์žฌ์ ์ธ ํ•ด๊ฒฐ์ฑ…์„ ์ œ๊ณตํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ์‹œ๊ฐ„ ์†Œ๋ชจ์ ์ด๊ณ  ์ง€์‹ ๊ณต์œ ์˜ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, fine-tuning์ด ๊ฐ€๋Šฅํ•œ ์ž‘์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์…‹์ด๋ผ ํ• ์ง€๋ผ๋„, ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜•๊ณผ seesaw ๋ฌธ์ œ๋กœ ์ธํ•ด ์„ฑ๋Šฅ์ด ์ €ํ•˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Figure 1: ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ์— ๋Œ€ํ•œ ์„ค๋ช….

์ด๋ฅผ ์„ค๋ช…ํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ์ค‘๊ตญ์˜ ์˜๋ฃŒ ๋ฐ์ดํ„ฐ์…‹๊ณผ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์˜ ์ƒ˜ํ”Œ ๋ถ„ํฌ๋ฅผ Figure 1์— ๋ถ„์„ํ•˜์˜€์Šต๋‹ˆ๋‹ค

๋”ฐ๋ผ์„œ, unique training process๋ฅผ ํ†ตํ•ด separate parameters๋ฅผ ์‚ฌ์šฉํ•˜๋Š” multi-task parameter efficient fine-tuning์ด ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ๋™์‹œ์— ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

Task variety์™€ high tuning costs์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ์—ฌ๋Ÿฌ ์ž‘์—…์— ๋Œ€ํ•ด ๋ณ„๋„์˜ ํŒŒ๋ผ๋ฏธํ„ฐ efficient fine-tuning ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด ํ”„๋ ˆ์ž„์›Œํฌ๋Š” LoRA์˜ ๊ธฐ๋ณธ์ ์ธ parameter efficiency scheme์„ ์ฑ„ํƒํ•˜์˜€์œผ๋ฉฐ, ์ฆ‰, dense layers์˜ parallelํ•œ ์ž‘์€ parameter ์…‹๋งŒ fine-tuningํ•ฉ๋‹ˆ๋‹ค.

 

์ด๋•Œ, ์šฐ๋ฆฌ๋Š” MOELoRA๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” MOE์™€ LoRA์˜ ๊ฐ•์ ์„ ๊ฒฐํ•ฉํ•œ ์ƒˆ๋กœ์šด multi-task PEFT framework์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ, ์šฐ๋ฆฌ๋Š” ๊ฐ task์— ๋Œ€ํ•ด distinct parameters๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด task-motivated gate function์„ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.


PRELIMINARY

LLMs for Medical Applications

์ง€๋Šฅํ˜• ์˜๋ฃŒ ์‹œ์Šคํ…œ์€ ํ˜„๋Œ€์˜ ์›น ๊ธฐ๋ฐ˜ ์˜๋ฃŒ ํ™˜๊ฒฝ์—์„œ ์ ์  ๋” ๋ณดํŽธํ™”๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค

๋งŽ์€ ์—ฐ๊ตฌ๋“ค์€ ์ผ๊ด€๋œ ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ํŒจํ„ด์„ ์ •์˜ํ•˜์—ฌ ์˜๋ฃŒ ์ž‘์—…์„ ํ‘œ์ค€ํ™”ํ•˜๋ ค๊ณ  ๋…ธ๋ ฅํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ ์„ค๊ณ„ ๊ณผ์ •์„ ๊ฐ„์†Œํ™”ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

์˜๋ฃŒ ํ…์ŠคํŠธ๊ฐ€ ํฌํ•จ๋œ medical tasks๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด LLMs๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์˜ˆ์‹œ

์œ„์— ์ œ์‹œ๋œ ์˜๋ฃŒ named entity recognition (NER)์˜ ์˜ˆ์‹œ์™€ ๊ฐ™์ด, ์ „ํ†ต์ ์ธ ๋ชจ๋ธ๋“ค์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์˜๋ฃŒ ํ…์ŠคํŠธ, ์ฆ‰ IMI์„ ์ฒ˜๋ฆฌํ•˜์—ฌ head entities Ohead์™€ tail entities Otail๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

์˜๋ฃŒ ์ž‘์—…์„ LLMs์— ์ ์‘์‹œํ‚ค๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ํŒจํ„ด ๋ชจ๋‘๋ฅผ ์žฌ๊ตฌ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  • Input Modification: ์šฐ๋ฆฌ๋Š” LLMs๊ฐ€ ์˜๋ฃŒ ํ…์ŠคํŠธ๋ฅผ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ์›๋ž˜ ์˜๋ฃŒ ํ…์ŠคํŠธ์— instruction templates๋ฅผ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค. Figure 2์—์„œ ์˜ˆ์‹œ๋œ ๋ฐ”์™€ ๊ฐ™์ด, template๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋งŒ๋“ญ๋‹ˆ๋‹ค:
  • "Please recognize the medical entity in this sentence: [Medical Text]". ์—ฌ๊ธฐ์„œ "[Medical Text]"๋Š” ์ƒˆ๋กœ์šด ์˜๋ฃŒ ์ž‘์—… IM์„ ์œ„ํ•œ ์ž๋ฆฌํ‘œ์‹œ์ž๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
  • Output Modification: ๊ธฐ์กด target ๋Œ€์‹ , ์šฐ๋ฆฌ๋Š” ์ธ์‹๋œ head entity Ohead์™€ tail entity Otail๋ฅผ template๋กœ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.
  • "The medical text has the following pairs of entities: [head entity] is [head entity] and tail entity is [tail entity]". LLMs๊ฐ€ NER task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
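The following is a minimal sketch of this input/output modification (our own illustration, not the paper's code); the function name and data layout are assumptions made for the example.

```python
# Minimal sketch: wrapping a raw medical NER sample into the instruction-style
# input/output templates quoted above (illustrative, not the authors' code).

def build_instruction_sample(medical_text: str, entity_pairs: list[tuple[str, str]]) -> dict:
    """Convert a raw NER sample into the prompt/target texts used for LLM fine-tuning."""
    # Input modification: insert the raw medical text into an instruction template.
    prompt = f"Please recognize the medical entity in this sentence: {medical_text}"

    # Output modification: serialize the recognized head/tail entities into a text template.
    pairs = " and ".join(
        f"head entity is {head} and tail entity is {tail}" for head, tail in entity_pairs
    )
    target = f"The medical text has the following pairs of entities: {pairs}"
    return {"input": prompt, "output": target}


if __name__ == "__main__":
    sample = build_instruction_sample(
        "The patient was prescribed metformin for type 2 diabetes.",
        [("type 2 diabetes", "metformin")],
    )
    print(sample["input"])
    print(sample["output"])
```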


Multi-task Fine-tuning

์•ž์„œ ์–ธ๊ธ‰ํ•œ ๋ฐ”์™€ ๊ฐ™์ด, ์˜๋ฃŒ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ name entity recognition, medical inquiry ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ํฌํ•จํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ์˜ ๋ชฉํ‘œ๋Š” ์ด๋Ÿฌํ•œ task์— ๋Œ€ํ•ด LLMs๋ฅผ fine-tuneํ•˜์—ฌ ๊ฐ task์˜ ์„ฑ๋Šฅ์„ ๋†’์ด๊ณ , ๋™์‹œ์— ์ „์ฒด healthcare system์—๋„ ํ˜œํƒ์„ ์ค„ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

multi-task fine-tuning์„ ์œ„ํ•ด, ์ฃผ์–ด์ง„ structured data Dj๋ฅผ ๊ณ ๋ คํ•ด๋ณด๋ฉด, ๊ฐ task Tj์˜ ๋ชฉํ‘œ๋Š” LLMs๋ฅผ fine-tuneํ•˜์—ฌ ์ ํ•ฉํ•œ ์ถœ๋ ฅ ๋ฐ ์ž…๋ ฅ ํŒจํ„ด์„ template ํ˜•ํƒœ๋กœ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

fine-tuning ๋„์ค‘ ๊ฐ task์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ D๊ฐ€ ์ฃผ์–ด์ง„๋‹ค๋ฉด, multi-task fine-tuning ๋ฌธ์ œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ช…ํ™•ํžˆ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


Method

์ œ์•ˆ๋œ ํ”„๋ ˆ์ž„์›Œํฌ์— ๋Œ€ํ•œ ์ข…ํ•ฉ์ ์ธ ์„ค๋ช…์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์˜ ๊ฐœ์š”๋กœ ์‹œ์ž‘ํ•˜์—ฌ MOELoRA์™€ task-motivated gate์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

๋งˆ์ง€๋ง‰์œผ๋กœ fine-tuning ๋ฐ inference ํ”„๋กœ์„ธ์Šค๋ฅผ ์ž์„ธํžˆ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

Overview

MOELoRA๋ฅผ ์‚ฌ์šฉํ•˜๋Š” LLMs์˜ parameter efficient fine-tuning ๋ฐ inference ํ”„๋กœ์„ธ์Šค๋ฅผ ์‹œ๊ฐ์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ธ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

Within the parameter-efficient fine-tuning framework, LoRA introduces the idea of low-rank matrices that stand in for directly updating the dense layers.

Building on this, we integrate MOELoRA layers into each layer to support the learning of keys, queries, and values, and we use the feed-forward network (FFN) as an example for illustration.

In addition, each MOELoRA layer incorporates multiple experts to capture the diversity across different tasks.

Next, a task-motivated gate function is introduced so that the experts in each MOELoRA layer can learn parameters suited to the corresponding task.

This gate function determines the contribution of each expert within the MOELoRA layers. As a result, MOELoRA can produce distinct fine-tuned weights for each task.


MOELoRA

Low-Rank Adaptation (LoRA) has proven both effective and efficient for parameter-efficient fine-tuning.

LoRA is inspired by the intrinsic dimensionality phenomenon and reformulates the parameter fine-tuning of LLMs as a low-rank decomposition.

In this decomposition, the update is expressed through low-rank, trainable matrices.

In this setting, the forward process of a linear layer combined with a LoRA layer can be expressed as follows:

Equation (4).

Here, x denotes the input vector of dimension d_in, and h denotes the output vector of dimension d_out.

The rank of the trainable low-rank matrices is r, which determines the number of trainable parameters.

However, in the original LoRA, the parameters are fine-tuned uniformly across all tasks, which makes it difficult to learn the different facets of medical knowledge.

We now discuss the number of trainable parameters for LoRA and MOELoRA.

In LoRA, the two low-rank matrices B ∈ R^(d_in×r) and A ∈ R^(r×d_out) contain all of the trainable parameters.


Task-Motivated Gate Function

์ด ์„น์…˜์—์„œ๋Š” task-motivated gate function์˜ ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ๊ฐ expert์˜ ๊ธฐ์—ฌ๋„๊ฐ€ ํŠน์ • ์ž‘์—…์— ๋งž์ถฐ์ ธ์•ผ ํ•œ๋‹ค๋Š” ์ ์„ ๊ฐ•์กฐํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์—ฌ๋„๋ฅผ ์กฐ์ ˆํ•˜๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” gate function์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.

์ด๋“ค์€ inherently task-specific์ด๋ฏ€๋กœ, ๊ฐ gate function์€ ์ž‘์—…์˜ ์ •์ฒด์„ฑ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์ด ์„ค๊ณ„๋Š” ์ž‘์—…๋ณ„๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ณต๊ตฌํ•  ์ˆ˜ ์—†๊ฒŒ ๋งŒ๋“ค๋ฉฐ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  1. Task๋ณ„ ๋งž์ถคํ™”: ๊ฐ ์ž‘์—…์€ ๋ณ„๋„์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์…‹์œผ๋กœ fine-tuning๋˜๋ฉฐ, ์ด๋Š” ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•ฉ๋‹ˆ๋‹ค.
  2. Inference ์‹œ ํšจ์œจ์„ฑ: ๋ณต๊ตฌ๋œ fine-tuned LLMs๋Š” ๊ฐ์†Œ๋œ inference latency๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค. ์ด๋Š” MOELoRA layer์™€ ์—ฐ๊ด€๋œ ์ถ”๊ฐ€ forward ๊ณ„์‚ฐ์ด ํ•„์š”ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
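To make this concrete, here is a minimal PyTorch sketch of how one such MOELoRA layer could look, based on the description above. The class and attribute names, the task-embedding size, and the initialization are our assumptions, not the authors' released code.

```python
# Illustrative MOELoRA-style layer: N low-rank expert pairs {B_i, A_i} mixed by a task gate.
import torch
import torch.nn as nn


class MOELoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, num_tasks: int,
                 num_experts: int = 8, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear                       # frozen pre-trained dense layer W0
        for p in self.base.parameters():
            p.requires_grad_(False)

        d_in, d_out = base_linear.in_features, base_linear.out_features
        expert_rank = rank // num_experts             # e.g. 16 / 8 = 2, as in the paper's setting
        self.scaling = alpha / rank

        # N experts, each a pair of low-rank matrices (B_i: d_in x r', A_i: r' x d_out).
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.randn(d_in, expert_rank) * 0.01) for _ in range(num_experts)])
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.zeros(expert_rank, d_out)) for _ in range(num_experts)])

        # Task-motivated gate: a task embedding E and a linear map W_T producing expert weights.
        self.task_embedding = nn.Embedding(num_tasks, 64)    # E (64 is an assumed size)
        self.gate = nn.Linear(64, num_experts)               # W_T

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in); task_id: (batch,) holding the task index of each sample.
        omega = torch.softmax(self.gate(self.task_embedding(task_id)), dim=-1)  # (batch, N)

        out = self.base(x)                                    # frozen forward pass
        for i, (B, A) in enumerate(zip(self.lora_B, self.lora_A)):
            delta = (x @ B) @ A                               # low-rank update of expert i
            out = out + self.scaling * omega[:, i].view(-1, 1, 1) * delta
        return out


if __name__ == "__main__":
    layer = MOELoRALinear(nn.Linear(32, 32), num_tasks=8)
    x = torch.randn(4, 10, 32)                                # (batch, seq_len, d_in)
    task_id = torch.tensor([0, 1, 2, 3])
    print(layer(x, task_id).shape)                            # torch.Size([4, 10, 32])
```

Note how the gate input is the task identity rather than the token embedding; this is what makes the per-task weight recovery described in the next subsection possible.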

Fine-tune and Inference

MOELoRA์˜ fine-tuning ๋ฐ inference ๊ณผ์ •์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ€๋…์„ฑ์„ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” Algorithm 1์—์„œ ๊ฒฐ๋ก ์„ ๋‚ด๋ฆฝ๋‹ˆ๋‹ค.

  • Fine-tuning: ์šฐ๋ฆฌ๋Š” LLMs์—์„œ ์ง€์ •๋œ layer์— ๋Œ€ํ•ด MOELoRA๋ฅผ ์„ค์ •ํ•˜๊ณ  ์—ฌ๋Ÿฌ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค (๋ผ์ธ 1-3). ์ดํ›„, fine-tuning ์ „ ํŒŒ๋ผ๋ฏธํ„ฐ efficient fine-tuning์„ ์œ„ํ•ด ๋ชจ๋“  ์‚ฌ์ „ ํ•™์Šต๋œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋™๊ฒฐํ•˜๊ณ , ๊ฐ ์ƒ˜ํ”Œ์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•œ ๋ฐฐ์น˜์—์„œ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค (๋ผ์ธ 4-7).
Algorithm 1: Fine-tuning and inference processes of MOELoRA

1. Specify the LLM and the layers that require fine-tuning.
2. Specify the rank value r and the scale value α.
3. Specify the number of experts N in MOELoRA.

Fine-tuning process
4. Freeze all parameters of the pre-trained LLM, e.g., Wq, Wk, Wv.
5. For each sampled batch B from dataset D:
6.   Run the forward pass of the LLM with MOELoRA (see Equation (4) above).
7.   Compute the loss L (see Equation (2)).
8.   Update the MOELoRA parameters {Ai, Bi}, i = 1..N, and the gate-function parameters {E, WT}.
9. End.

Inference process
10. For every task Tj:
11.   Compute the contribution weight ωj of each expert (see Equation (5)).
12.   Recover the fine-tuned parameters of MOELoRA for each task using Equation (8).
13. End.
14. For a given task Tj, apply the parameters required for that task to the LLM and run prediction.
  • LLMs์˜ ํŒŒ๋ผ๋ฏธํ„ฐ (๋ผ์ธ 4)๊ฐ€ ๋™๊ฒฐ๋ฉ๋‹ˆ๋‹ค. Fine-tuning ๋™์•ˆ, ์šฐ๋ฆฌ๋Š” ๋™์ผํ•œ ์ž‘์—…์˜ ์ƒ˜ํ”Œ์„ ํ•˜๋‚˜์˜ ๋ฐฐ์น˜๋กœ ๊ทธ๋ฃนํ™”ํ•˜๋Š” ๋Œ€์‹ , ๋ชจ๋“  ์ž‘์—…์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„๋กœ ์ƒ˜ํ”Œ๋งํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋Š” ์ผ๋ถ€ multi-task ์—ฐ๊ตฌ์—์„œ ์ˆ˜ํ–‰๋˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ์‹คํ—˜์—์„œ ์„ฑ๋Šฅ ๋น„๊ต๋ฅผ ์œ„ํ•ด ๋ฐฐ์น˜์— ๋Œ€ํ•ด ๋ฌด์ž‘์œ„ ์ƒ˜ํ”Œ๋ง์„ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ์ด ๋ฐ์ดํ„ฐ ๋ฐฐ์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ forward ๊ณผ์ •์„ ์ˆ˜ํ–‰ํ•˜๊ณ  fine-tuning์„ ์œ„ํ•œ ์†์‹ค์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (๋ผ์ธ 6-7). ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ๋ฅผ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” MOELoRA์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์™€ task-motivated gate function, ์ฆ‰ {Ai,Bi}i=1N ๋ฐ {E,WT}๋งŒ fine-tuningํ•ฉ๋‹ˆ๋‹ค.

Inference: ์•ž์„œ ์„ค๋ช…ํ•œ ๋ฐ”์™€ ๊ฐ™์ด, MOELoRA๋Š” ๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด fine-tuned ํŒŒ๋ผ๋ฏธํ„ฐ ํ–‰๋ ฌ์„ ๋ณต๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (๋ผ์ธ 8).

  • ๊ทธ๋Ÿฐ ๋‹ค์Œ, ๊ฐ ์ž‘์—…์— ๋Œ€ํ•ด LLMs ํŒŒ๋ผ๋ฏธํ„ฐ์™€ ํ•จ๊ป˜ ํ•ด๋‹น ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Experiment

์ด ์„น์…˜์—์„œ๋Š” ๋‹ค์Œ ์—ฐ๊ตฌ ์งˆ๋ฌธ(RQ)์— ๋Œ€ํ•ด ๋‹ค๋ฃจ๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.
  • RQ1: MOELoRA๊ฐ€ ๋‹ค๋ฅธ parameter-efficient fine-tuning ์ „๋žต ๋ฐ cross-task generalization ๋ฐฉ๋ฒ•๊ณผ ๋น„๊ตํ•˜์—ฌ ์„ฑ๋Šฅ ์ธก๋ฉด์—์„œ ์–ด๋–ค ์ฐจ์ด๊ฐ€ ์žˆ๋Š”๊ฐ€?
  • RQ2: MOE ์•„ํ‚คํ…์ฒ˜์™€ gate function์ด fine-tuning ๊ณผ์ •์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์€ ๋ฌด์—‡์ธ๊ฐ€? ๋‹ค์–‘ํ•œ ํ›ˆ๋ จ ์ „๋žต์ด MOELoRA์˜ ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”๊ฐ€?
  • RQ3: MOELoRA์˜ expert ์ˆ˜์™€ rank๊ฐ€ ์„ฑ๋Šฅ ๊ฒฐ๊ณผ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”๊ฐ€?
  • RQ4: ์ œ์•ˆ๋œ MOELoRA๊ฐ€ fine-tuning ๋ฐ inference ๊ณผ์ •์—์„œ ํšจ์œจ์ ์ธ๊ฐ€?

Table 1: PromptCBLUE ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฐ„๋‹จํ•œ ์„ค๋ช… ๋ฐ ํ†ต๊ณ„

Task Description # Train # Validation  # Test
CMeIE Name Entity Recognition 2,828 600 600
CHIP-CDN Normalization 2,381 600 600
CHIP-CDEE Attribute Extraction 1,562 600 600
CHIP-MDCFNPC Clinic Entity Discovery 4,935 600 600
CHIP-CTC Medical Text Classification 3,622 1,100 1,100
KUAKE-QIC Query Intention 3,279 660 660
IMCS-V2-MRG Report Generation 1,799 600 600
MedDG Doctor Dialogue 4,964 600 600
  • RQ5: ์ „๋ฌธ๊ฐ€๋“ค์ด ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ์ง€์‹์„ ํฌ์ฐฉํ•˜๋Š” ๋ฐ ์žˆ์–ด ํŠนํ™”๋˜์–ด ์žˆ๋Š”๊ฐ€?

Experimental Settings

Dataset

์šฐ๋ฆฌ์˜ ์‹คํ—˜์€ multi-task Chinese medical dataset์ธ PromptCBLUE์—์„œ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.

์ด ๋ฐ์ดํ„ฐ์…‹์€ ๋‹ค์–‘ํ•œ ์˜๋ฃŒ ์ž‘์—…์„ ํฌํ•จํ•˜๋ฉฐ, ์ด๋Š” LLMs์™€์˜ ํ˜ธํ™˜์„ฑ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ํ…์ŠคํŠธ ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ์˜ ์ง€์‹์— ๋”ฐ๋ฅด๋ฉด, PromptCBLUE๋Š” LLMs์— ๋งž์ถฐ์ง„ ์œ ์ผํ•œ multi-task ์˜๋ฃŒ ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” computational constraints๋กœ ์ธํ•ด 8๊ฐœ์˜ ์ž‘์—…์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•˜์—ฌ ์‹คํ—˜์— ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

์›๋ณธ ๋ฐ์ดํ„ฐ์…‹์˜ ์ค‘๋ณต๋œ ์ƒ˜ํ”Œ์„ ์ œ๊ฑฐํ•œ ํ›„, ์šฐ๋ฆฌ๋Š” ํ•™์Šต ์„ธํŠธ๋ฅผ ํ…Œ์ŠคํŠธ ์„ธํŠธ๋กœ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋„๋ก ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์˜ ํ†ต๊ณ„ ์ •๋ณด๋Š” Table 1์— ์š”์•ฝ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

 

Baselines

์šฐ๋ฆฌ์˜ ์‹คํ—˜์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋„ค ๊ฐ€์ง€ ์ข…๋ฅ˜์˜ baselines์™€ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.
    • LLMs without Fine-tuning: LLMs๊ฐ€ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ConText Learning์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
      • ChatGPT: ChatGPT๋Š” ๊ฐ€์žฅ ์ธ๊ธฐ ์žˆ๋Š” LLMs ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” task-relevant ability๋ฅผ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด, ํ•™์Šต ๋ฐ์ดํ„ฐ์—์„œ 3์—์„œ 10๊ฐœ์˜ ์ž…๋ ฅ-์ถœ๋ ฅ ์Œ์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•˜์—ฌ, ์ž…๋ ฅ๊ณผ ๋™์ผํ•œ task๋กœ ๋ฐ๋ชจ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
      • Hautuo: Hautuo๋Š” ์ค‘๊ตญ ์˜๋ฃŒ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ˆ˜์ง‘๋œ instruct dataset์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” in-context learning ๋ฐฉ์‹์œผ๋กœ ChatGPT baseline๊ณผ ๊ณต์ •ํ•˜๊ฒŒ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด ChatGLM-6B์˜ version์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • LLMs with Fine-tuning: ์ด ๊ทธ๋ฃน์€ fine-tuning ์ „๋žต์˜ ๋ณ€์ข…์„ ์‚ฌ์šฉํ•˜์—ฌ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
    • P-Tuning: P-Tuning์€ ํ”„๋กฌํ”„ํŠธ ๋ฒกํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ํ”„๋กฌํ”„ํŠธ ์ธ์ฝ”๋”๋ฅผ fine-tuningํ•˜์—ฌ ์ž…๋ ฅ ์‹œํ€€์Šค์— ์‚ฝ์ž…ํ•ฉ๋‹ˆ๋‹ค.
    • LoRA (Full): LoRA๋Š” dense layers์˜ low-rank matrices๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋“  ์‚ฌ์ „ ํ•™์Šต๋œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋™๊ฒฐํ•ฉ๋‹ˆ๋‹ค.
    • LoRA (Single): ์šฐ๋ฆฌ๋Š” LoRA (Single)๋ฅผ task๋ณ„๋กœ ๋”ฐ๋กœ LoRA๋ฅผ ํ›ˆ๋ จํ•˜์—ฌ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
    • LoRA (Full+TP): ์šฐ๋ฆฌ๋Š” ์ž…๋ ฅ ํ…์ŠคํŠธ์— ๊ฐ„๋‹จํ•œ task demonstration์„ ์ถ”๊ฐ€ํ•˜์—ฌ LLMs๊ฐ€ ์ž‘์—… ๊ฐ„์˜ ๊ตฌ๋ถ„์„ ์ธ์‹ํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„ ์ธก๋ฉด์—์„œ๋Š” LoRA (Full)์™€ ๋™์ผํ•œ ํ›ˆ๋ จ ๊ณผ์ •์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

Multi-task์— ๋Œ€ํ•œ ์ถ”๊ฐ€ ์ž‘์—…์— ๋”ฐ๋ผ, ์šฐ๋ฆฌ๋Š” ๋ชจ๋“  task ๋ฒกํ„ฐ๋ฅผ ํ•จ๊ป˜ ๋”ํ•˜๊ณ  validation ์„ธํŠธ์—์„œ scale factor๋ฅผ ์กฐ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

Cross-task Generalization: To assess the applicability of cross-task generalization to multi-task fine-tuning, we evaluate two recent approaches, LoRAHub and MoLoRA.

  • LoRAHub: LoRAHub proposes an assembly method that combines LoRA parameters fine-tuned on source tasks to explore generalization to unseen target tasks.
  • In our experiments, we fine-tune a LoRA for each task, learn the combination weights using the validation set of a designated task, and test performance on that task.
  • MoLoRA: MoLoRA is a relatively recent work that adopts the MOE structure for LoRA.

However, the gate of MoLoRA derives the expert weights from the intermediate token embeddings.

In our experiments, we adapt it to the multi-task setting, i.e., training and testing on the same set of tasks.

Implementation Details

์šฐ๋ฆฌ์˜ ์‹คํ—˜์€ PyTorch 1.12.0 ๋ฐ Python 3.9.5๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Tesla V100 32G GPUs์—์„œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

LLM ChatGLM-6B๋Š” Chinese language processing์— ๋Šฅ์ˆ™ํ•œ ๊ฒƒ์œผ๋กœ ์ธ์ •๋ฐ›์•„ fine-tuning์„ ์œ„ํ•œ ๊ธฐ๋ณธ ๋ชจ๋ธ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

๋ชจ๋“  LoRA fine-tuning baselines ๋ฐ ์ œ์•ˆ๋œ MOELoRA์— ๋Œ€ํ•ด, ์šฐ๋ฆฌ๋Š” ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ ๋ ˆ์ด์–ด๋ฅผ "query_key_value", "dense", "dense_h_to_4h", ๋ฐ "dense_4h_to_h"๋กœ ์ง€์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

์ตœ๋Œ€ ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ๊ธธ์ด๋Š” ๊ฐ๊ฐ 1,024 ๋ฐ 196์œผ๋กœ ์„ค์ •๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ๋Š” ๋ฐฐ์น˜ ํฌ๊ธฐ๋ฅผ 64๋กœ ์„ค์ •ํ•˜๊ณ  ์ตœ๋Œ€ 8,000 training steps๊นŒ์ง€ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค. LoRA rank rrr๋Š” 16์œผ๋กœ ๊ณ ์ •๋˜์—ˆ์œผ๋ฉฐ, LoRA dropout α=0.1 α=0.1๋กœ ์„ค์ •๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

MOELoRA์˜ ๊ฒฝ์šฐ, experts์˜ ์ˆ˜๋Š” 8๋กœ ์„ค์ •๋˜์—ˆ์œผ๋ฉฐ, sparse gate ๋ฒ„์ „์˜ MOELoRA์— ๋Œ€ํ•ด ์ตœ์ ์˜ ๊ฐ’์„ ์ฐพ๊ธฐ ์œ„ํ•ด KKK ๊ฐ’์„ 1์—์„œ 7๊นŒ์ง€ ๊ฒ€์ƒ‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

ํ…Œ์ŠคํŠธ ์ค‘์—๋Š”, ์šฐ๋ฆฌ๋Š” generation์„ ์œ„ํ•œ ์˜จ๋„๋ฅผ 0.95๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ MOELoRA ๊ตฌํ˜„์€ PEFT ํŒจํ‚ค์ง€์™€ ํ˜ธํ™˜๋˜์–ด, ์ œ์•ˆ๋œ MOELoRA์˜ ๋” ์‰ฌ์šด ์ฑ„ํƒ๊ณผ ํ™œ์šฉ์„ ์ด‰์ง„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Evaluation Metrics

ํ‰๊ฐ€๋ฅผ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ๊ฐ ์ž‘์—…์˜ ํŠน์„ฑ์— ๋งž๋Š” ๋‹ค์–‘ํ•œ ๋ฉ”ํŠธ๋ฆญ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, CMeIE๋Š” named entity recognition (NER) ์ž‘์—…์œผ๋กœ, ๋งŽ์€ entity ํด๋ž˜์Šค๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค (CMeIE์—๋Š” 1,262๊ฐœ์˜ ํด๋ž˜์Šค๊ฐ€ ์žˆ์Œ).

Table 2: PromptCBLUE์—์„œ ๊ฒฝ์Ÿ์ ์ธ ๋ฒ ์ด์Šค๋ผ์ธ ๋ฐ MOELoRA์˜ ์ „๋ฐ˜์ ์ธ ๊ฒฐ๊ณผ.

๊ตต์€ ๊ธ€์”จ๋Š” ์ตœ๊ณ  ์ ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ๋ฐ‘์ค„์€ ํ•ด๋‹น ๋ฐฉ๋ฒ•์˜ ์ตœ์ € ์ ์ˆ˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

"**"๋Š” ํ†ต๊ณ„์ ์œผ๋กœ ์œ ์˜๋ฏธํ•œ ๊ฐœ์„ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค (์ฆ‰, ์–‘์ธก t-test์—์„œ (p < 0.05)๋กœ ์ตœ๊ณ  ๋ฒ ์ด์Šค๋ผ์ธ๊ณผ ๋น„๊ต).

 

๋”ฐ๋ผ์„œ ์šฐ๋ฆฌ๋Š” ์ด ์ž‘์—…์— ๋Œ€ํ•ด ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” Micro-F1์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. CHIP-CDN (579 ํด๋ž˜์Šค), CHIP-CDEE (998 ํด๋ž˜์Šค), CHIP-MDCFNPC (2,065 ํด๋ž˜์Šค)๋Š” ๋ชจ๋“  task๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์€ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฏ€๋กœ, Micro-F1์ด ํ‰๊ฐ€์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

๋น„๊ต์ ์œผ๋กœ, CHIP-CTC (44๊ฐœ ํด๋ž˜์Šค)์™€ QUAKE-QIC (7๊ฐœ ํด๋ž˜์Šค) ์ž‘์—…์€ ๋” ์ ์€ ์ˆ˜์˜ ํด๋ž˜์Šค๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๊ฐ ํด๋ž˜์Šค์˜ ๋™์ผํ•œ ์ค‘์š”์„ฑ์„ ๊ณ ๋ คํ•ด์•ผ ํ•˜๋ฏ€๋กœ Macro-F1์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ํ…์ŠคํŠธ ์ƒ์„ฑ ์ž‘์—…, ์˜ˆ๋ฅผ ๋“ค์–ด IMCS-V2-MRG ๋ฐ MedDG์˜ ๊ฒฝ์šฐ, Rouge-L ์ด ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ „์ฒด ์ž‘์—…์— ๊ฑธ์นœ ํ‰๊ท  ์ ์ˆ˜๋Š” ์ „์ฒด ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

๊ฒฐ๊ณผ์˜ ๊ฒฌ๊ณ ์„ฑ๊ณผ ์žฌํ˜„์„ฑ์„ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด, ํ…Œ์ŠคํŠธ๋Š” ๋ฌด์ž‘์œ„ ์‹œ๋“œ {42, 43, 44}๋กœ ์„ธ ๋ฒˆ ์‹คํ–‰๋˜๋ฉฐ, ํ‰๊ท  ์ ์ˆ˜๊ฐ€ ๋‹ค์Œ์˜ ์‹คํ—˜ ๊ฒฐ๊ณผ์— ๋ณด๊ณ ๋ฉ๋‹ˆ๋‹ค.


Overall Performance (RQ1)

MOELoRA์™€ ๊ฒฝ์Ÿ์ ์ธ ๋ฒ ์ด์Šค๋ผ์ธ๋“ค์˜ ์ข…ํ•ฉ์ ์ธ ์‹คํ—˜ ๊ฒฐ๊ณผ๋Š” Table 2์— ๋‚˜์™€ ์žˆ์Šต๋‹ˆ๋‹ค.

MOELoRA(D)์™€ MOELoRA(S)๋Š” ๊ฐ๊ฐ MOELoRA์˜ dense ๋ฐ sparse gate ๋””์ž์ธ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

์ „์ฒด ํ‰๊ท  ์ ์ˆ˜๋ฅผ ๋ถ„์„ํ•œ ๊ฒฐ๊ณผ, MOELoRA(D)๋Š” ๋‹ค๋ฅธ ๋ชจ๋“  ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ผ๊ด€๋˜๊ฒŒ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

RQ1์— ๋Œ€ํ•œ ์‘๋‹ต์„ ์œ„ํ•ด, ์„ธ๋ถ€ ๋ถ„์„์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • LLMs without Fine-tuning: Fine-tuning์ด ์—†๋Š” LLMs ๊ทธ๋ฃน์€ task-specific medical knowledge๋ฅผ ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•ด fine-tuning์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉด์„œ, ์„ฑ๋Šฅ์ด ์ƒ๋‹นํžˆ ๋’ค์ณ์ง‘๋‹ˆ๋‹ค.
  • Parameter Efficient Fine-tuning Strategies: ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์ ์ธ fine-tuning ์ „๋žต ์ค‘, LoRA ๊ธฐ๋ฐ˜ ๋ฉ”์„œ๋“œ๊ฐ€ ๋ช…ํ™•ํ•˜๊ฒŒ P-Tuning์„ ๋Šฅ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.
    • LoRA (Full) ๋ฐ LoRA (Full+TP)๋Š” ๋ชจ๋“  ์ž‘์—…์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ LoRA (Single)๋ณด๋‹ค ์šฐ์ˆ˜ํ•˜์ง€๋งŒ, ์ผ๋ถ€ task์—์„œ๋Š” underperformํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” task ๊ฐ„์˜ ์ง€์‹ ๊ณต์œ ์˜ ์ค‘์š”์„ฑ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.
  • Model Editing: Task-Arithmetic์€ task ๋ฒกํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ, ํŒŒ๋ผ๋ฏธํ„ฐ ํšจ์œจ์ ์ธ fine-tuning์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
  • Cross-task Generalization: Cross-task generalization ํ™˜๊ฒฝ์—์„œ ๋‘ ๊ฐ€์ง€ ์ตœ๊ทผ์˜ ์ ‘๊ทผ๋ฒ•์„ ํ‰๊ฐ€ํ•˜์˜€์œผ๋ฉฐ, ์ด๋“ค์€ multi-task ์„ค์ •์—์„œ ์–ด๋ ค์›€์„ ๊ฒช์—ˆ์Šต๋‹ˆ๋‹ค.
  • Dense Gate vs. Sparse Gate: Table 2์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, sparse gate๊ฐ€ ๋‘ ๊ฐ€์ง€ task์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€์œผ๋‚˜, multi-task ์˜๋ฃŒ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์—์„œ๋Š” shared medical knowledge๊ฐ€ ๋” ์ค‘์š”ํ–ˆ์Šต๋‹ˆ๋‹ค.
    • dense gate๋Š” ๋ชจ๋“  expert๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ•™์Šต๋œ ์ง€์‹์„ ๊ณต์œ ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋ฏ€๋กœ, ๋Œ€๋ถ€๋ถ„์˜ ์ž‘์—…์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • Task-specific Observations: ์„ฑ๋Šฅ์˜ ๋ณ€๋™์€ ์ž‘์—… ๊ฐ„์— ๋ช…ํ™•ํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค.
    • ์˜ˆ๋ฅผ ๋“ค์–ด, LoRA (Full) ๋ฐ LoRA (Full+TP)๋Š” ๋ฐ์ดํ„ฐ์…‹์ด ํฐ ์ž‘์—…์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋ฉฐ, LoRA (Single)๋Š” ๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ๊ฐ•์กฐํ•˜์—ฌ ์ƒ˜ํ”Œ์ด ์ ์€ ์ž‘์—…์—์„œ ๋น›์„ ๋ฐœํ•ฉ๋‹ˆ๋‹ค.
    • MOELoRA๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์ž‘์—…์—์„œ ์ผ๊ด€๋˜๊ฒŒ ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜๋ฉฐ, ์ด๋Ÿฌํ•œ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. MedDG ์ž‘์—…์˜ ๊ฒฝ์šฐ, ChatGPT์™€ Hautuo์˜ ๊ณ ์œ ํ•œ ๋Œ€ํ™” ๊ธฐ๋Šฅ์ด ๋‹ค๋ฅธ ์ ‘๊ทผ๋ฒ•์— ๋น„ํ•ด ์šฐ์œ„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Ablation Study (RQ2)

RQ2๋ฅผ ๋” ๊นŠ์ด ์—ฐ๊ตฌํ•˜๊ณ  ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ์˜ ๊ธฐ์—ฌ๋„๋ฅผ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” Table 3์—์„œ ์ œ์‹œ๋œ ablation ์—ฐ๊ตฌ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

w/o MOE (๋ณธ์งˆ์ ์œผ๋กœ LoRA (Full)๋กœ ๋˜๋Œ์•„๊ฐ) ๋ณ€ํ˜•์€ MOE ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์™ธํ•ฉ๋‹ˆ๋‹ค.

์ด ๋ณ€ํ˜•์€ ์™„์ „ํ•œ MOELoRA์™€ ๋น„๊ตํ•˜์—ฌ ์„ฑ๋Šฅ์ด ์ €ํ•˜๋จ์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, MOE ์•„ํ‚คํ…์ฒ˜์˜ ์ค‘์š”์„ฑ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, gate function์„ ์šฐํšŒํ•˜์—ฌ ๊ท ์ผํ•œ expert weights๋ฅผ ์‚ฌ์šฉํ•˜๋Š” w/o gate ๋ณ€ํ˜•๋„ MOELoRA๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋’ค์ฒ˜์ง€๋ฉฐ, gate function์˜ ํšจ๊ณผ๋ฅผ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

 

The w/ multiple gate variant uses a separate gate function for each MOELoRA layer.

While this variant achieves comparable performance on some tasks, it falls slightly behind the single-gate design due to over-parameterization.

Moreover, multiple gate functions introduce more trainable parameters, which reduces efficiency.

 

์ถ”๊ฐ€์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ํ›ˆ๋ จ ์ „๋žต์ด ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.

ํŠนํžˆ, w BT ๋ฐฉ๋ฒ• [36]์€ ๋™์ผํ•œ ์ž‘์—…์—์„œ ์ƒ˜ํ”Œ์„ ํ•˜๋‚˜์˜ ๋ฐฐ์น˜๋กœ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.

๋ฐ˜๋ฉด์—, w RBT ์ ‘๊ทผ๋ฒ• [39]์€ ๋ฐ์ดํ„ฐ ๋ฐฐ์น˜๋งˆ๋‹ค ๋ฌด์ž‘์œ„๋กœ ์ž‘์—…์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์ด๋“ค ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ• ๋ชจ๋‘ MOELoRA์— ๋œ ์œ ๋ฆฌํ•œ ๊ฒƒ์œผ๋กœ ๋‚˜ํƒ€๋‚˜๋ฉฐ, ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค.

์ด ์„ฑ๋Šฅ ๋น„๊ต๋Š” ํŠน์ • ํ›ˆ๋ จ ํŒจํ„ด์˜ ์˜ํ–ฅ๋ ฅ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

 

์ œ์•ˆ๋œ MOELoRA์˜ ๊ฒฌ๊ณ ์„ฑ์„ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” attention ๋ ˆ์ด์–ด์— LoRA ๋ ˆ์ด์–ด๋งŒ์„ ๋ถ€๊ณผํ•˜๋Š” ์‹คํ—˜์„ ์ˆ˜ํ–‰ํ–ˆ์œผ๋ฉฐ, ์ด๋ฅผ LoRA (Full)-QKV ๋ฐ MOELoRA(D)-QKV๋กœ ๋ช…๋ช…ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๊ฒฐ๊ณผ์—์„œ, ์šฐ๋ฆฌ๋Š” MOELoRA(D)-QKV๊ฐ€ ๋Œ€๋ถ€๋ถ„์˜ ์ž‘์—…์—์„œ LoRA (Full)-QKV๋ฅผ ๋Šฅ๊ฐ€ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ฐœ๊ฒฌํ–ˆ์œผ๋ฉฐ, ์ด๋Š” Table 2์—์„œ MOELoRA(D)์™€ LoRA (Full)์˜ ์„ฑ๋Šฅ ๋น„๊ต์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.

๋˜ํ•œ, MOELoRA(D)๋Š” MOELoRA(D)-QKV๋ณด๋‹ค ์šฐ์ˆ˜ํ•˜๋ฏ€๋กœ, ๋” ๋งŽ์€ MOELoRA ๋ ˆ์ด์–ด๊ฐ€ ์ง€์†์ ์œผ๋กœ fine-tuning ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.


Hyper-parameter Analysis (RQ3)

Figure 4.

RQ3์— ๋‹ตํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” MOELoRA(D)์˜ ์„ฑ๋Šฅ์— ๋Œ€ํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์˜ํ–ฅ์„ ๋” ๊นŠ์ด ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ, expert ์ˆ˜ N๊ณผ LoRA rank r์˜ ๋ณ€๋™์ด ๊ฒฐ๊ณผ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.

Figure 4์— ๋‚˜ํƒ€๋‚œ ๋ฐ”์™€ ๊ฐ™์ด, ์šฐ๋ฆฌ์˜ ๊ด€์ฐฐ์€ N์ด 0์—์„œ 8๋กœ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ MOELoRA์˜ ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ด ํ–ฅ์ƒ์€ ๋” ๋งŽ์€ ์ˆ˜์˜ experts๊ฐ€ ๋” ๊ด‘๋ฒ”์œ„ํ•œ ์ง€์‹ ์ŠคํŽ™ํŠธ๋Ÿผ์˜ ํ•™์Šต์„ ์ด‰์ง„ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์— ๊ธฐ์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ N์ด 16์œผ๋กœ ์„ค์ •๋˜๋ฉด, ์„ฑ๋Šฅ์ด ์•ฝ๊ฐ„ ํ•˜๋ฝํ•˜๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š” ๊ฐ expert์— ๋Œ€ํ•ด ์ž‘์€ LoRA rank๊ฐ€ ์„ค์ •๋˜์–ด, low-rank ํ–‰๋ ฌ์˜ ํ•™์Šต ๋Šฅ๋ ฅ์„ ์ €ํ•˜์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ, ๊ฐ expert์˜ rank๋ฅผ 2๋กœ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

Figure 4b์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, r์˜ ์ฆ๊ฐ€๊ฐ€ ์ผ๊ด€๋˜๊ฒŒ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค์ง€๋งŒ, ๋™์‹œ์— ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ๋น„๋ก€ํ•˜์—ฌ ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ๊ท ํ˜•์„ ๊ณ ๋ คํ•  ๋•Œ, r์˜ ์‹ค์šฉ์ ์ธ ์„ ํƒ์€ 16์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.


Efficiency Analysis (RQ4)

Figure 5.

ํ›ˆ๋ จ ๋ฐ inference ํšจ์œจ์„ฑ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” Figure 5์—์„œ ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋น„์œจ๊ณผ inference latency๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.

Inference latency๋Š” inference ์ƒ˜ํ”Œ์˜ ์ˆ˜์— ๋Œ€ํ•œ inference ์‹œ๊ฐ„์˜ ํ‰๊ท ์„ ํ†ตํ•ด ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

 

MOELoRA(M)๋Š” task-motivated gate๊ฐ€ ๋™๋ฐ˜๋œ MOELoRA์˜ ๋ณ€ํ˜•์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

๊ฒฐ๊ณผ๋Š” MOELoRA๊ฐ€ LoRA (Full)์™€ ๋™์ผํ•œ ์ˆ˜์ค€์˜ ๋†’์€ ํ›ˆ๋ จ ๋ฐ inference ํšจ์œจ์„ฑ์„ ๋‹ฌ์„ฑํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด๋Š” LLMs ํŒŒ๋ผ๋ฏธํ„ฐ์˜ 0.48% ์ด์ƒ์„ fine-tuningํ•  ํ•„์š” ์—†์ด ๋ฆฌ์†Œ์Šค๋ฅผ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

 

MoLoRA ๋ฐ MOELoRA(M)๋Š” ๊ฐ ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ low-rank ๋ ˆ์ด์–ด์— ๋Œ€ํ•œ ์ถ”๊ฐ€ gate๋ฅผ ์„ค์ •ํ•˜๋ฏ€๋กœ ๋” ๋งŽ์€ ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

Inference์— ์žˆ์–ด, ๋ชจ๋“  ๋ชจ๋ธ์€ ๋™์ผํ•œ inference latency๋ฅผ ํ•„์š”๋กœ ํ•˜์ง€๋งŒ, MoLoRA๋Š” fine-tuned ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ Equation (8)๊ณผ ๊ฐ™์ด ๋ณต๊ตฌํ•  ์ˆ˜ ์—†์œผ๋ฏ€๋กœ, ์ƒ˜ํ”Œ์—์„œ expert weights๋ฅผ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

๋”ฐ๋ผ์„œ, MoLoRA๋Š” inference ์‹œ ์ถ”๊ฐ€์ ์ธ forward ๊ณ„์‚ฐ์ด ํ•„์š”ํ•˜์—ฌ, ๋” ๋งŽ์€ inference latency๋ฅผ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค.

์ด ๋น„๊ต๋Š” task-motivated gate ์„ค๊ณ„์˜ ์ด์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. RQ4์— ๋Œ€ํ•œ ์‘๋‹ต์œผ๋กœ, ์„ค๊ณ„๋œ MOELoRA๋Š” ๋†’์€ ํ›ˆ๋ จ ๋ฐ inference ํšจ์œจ์„ฑ์„ ๋‹ฌ์„ฑํ•˜๋ฉฐ, task-motivated gate์— ์˜ํ•œ ํšจ์œจ์„ฑ ์ €ํ•˜๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.


Case Study (RQ5)

RQ4์— ๋Œ€ํ•ด, ์šฐ๋ฆฌ๋Š” Figure 6์— ์žˆ๋Š” ๋„ค ๊ฐ€์ง€ ์ž‘์—…์— ๋Œ€ํ•œ expert weights์˜ ์‹œ๊ฐํ™”๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.
๊ฐ ์ž‘์—…์—์„œ, ๋‹ค๋ฅธ ์ƒ‰์ƒ์˜ ๋ง‰๋Œ€ ๊ธธ์ด๋Š” ํ•ด๋‹น expert์˜ weights๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

Figure 6

Expert weights๋Š” 1๋กœ ์ •๊ทœํ™”๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐ ์ž‘์—…์˜ ๋ง‰๋Œ€ ๊ธธ์ด๋Š” ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

๋งคํฌ๋กœ ์ˆ˜์ค€์—์„œ ๋ณผ ๋•Œ, ๊ฐ expert์˜ ๊ธฐ์—ฌ๋„๊ฐ€ ์ƒ๋‹นํžˆ ๋‹ค๋ฅด๋ฉฐ, ์ด๋Š” ์„œ๋กœ ๋‹ค๋ฅธ expert๊ฐ€ ์˜๋ฃŒ ์ง€์‹์˜ ๋‹ค์–‘ํ•œ ์ธก๋ฉด์—์„œ ํŠนํ™”๋œ๋‹ค๋Š” ๊ฐœ๋…์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

 

๋˜ํ•œ, ์ž‘์—… ๊ฐ„ weight์˜ ํ˜„์ €ํ•œ ์ฐจ์ด๋Š” ์˜๋ฃŒ ์‘์šฉ์˜ ๋‹ค์–‘ํ•œ ํŠน์„ฑ์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

CHIP-CDN ๋ฐ KUAKE-QIC ์ž‘์—…์„ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๋ฉด, ํ•ด๋‹น ์ž‘์—…์—์„œ expert weights๊ฐ€ ๋Œ€๋ถ€๋ถ„ ์ผ์น˜ํ•˜์ง€๋งŒ, experts 3๊ณผ 4์˜ ๊ฒฝ์šฐ๋Š” ์˜ˆ์™ธ์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

 

์ง„๋‹จ์šฉ ๋‹จ์–ด ์ •๊ทœํ™”๊ฐ€ ์งˆ์˜ ๋ถ„๋ฅ˜๋ฅผ ๊ฐ•ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๊ณ ๋ คํ•  ๋•Œ, expert weights์˜ ์œ ์‚ฌ์„ฑ์€ MOELoRA๊ฐ€ ๊ด€๋ จ ์ž‘์—…์— ๋„์›€์ด ๋˜๋Š” ๊ณต์œ ๋œ ์ง€์‹์„ ์ž˜ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค.


Related Work

LLM for Medical Applications

์ตœ๊ทผ, LLMs์˜ ๊ฐ•๋ ฅํ•œ ๊ธฐ๋Šฅ์ด ๋งŽ์€ ๋ถ„์•ผ์—์„œ ์ž…์ฆ๋˜์—ˆ์œผ๋ฉฐ, ์˜๋ฃŒ ๋„๋ฉ”์ธ์„ ํฌํ•จํ•˜์—ฌ ํฐ ์ฃผ๋ชฉ์„ ๋ฐ›๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, Med-PaLM์€ ์ƒˆ๋กœ์šด ๋ฒค์น˜๋งˆํฌ์ธ MultiMedQA๋ฅผ ์ œ์•ˆํ–ˆ์œผ๋ฉฐ, ์ด๋Š” ์ž„์ƒ ์ง€์‹ ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ์ƒˆ๋กœ์šด ์ง„๋‹จ ์งˆ์˜ ์‘๋‹ต ๊ณผ์ œ๋ฅผ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

Med-PaLM2๋Š” ์ƒˆ๋กœ์šด prompting ์ „๋žต๊ณผ ensemble ๊ฐ•ํ™”๋กœ Med-PaLM์„ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค. ์ด ์ „๋žต์€ MedQA ๋ฐ ์ž๊ธฐ ์ผ๊ด€์„ฑ์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ, MedQA์—์„œ ์ƒ๋‹นํ•œ ์„ฑ๊ณผ๋ฅผ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ChatDoctor๋Š” 100,000๊ฐœ์˜ ํ™˜์ž-์˜์‚ฌ ๋Œ€ํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ LLMs์„ fine-tuningํ•˜์˜€์œผ๋ฉฐ, ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ์˜๋ฃŒ ์ƒ๋‹ด ํ”Œ๋žซํผ์—์„œ ํŒŒ์ƒ๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๋˜ํ•œ, HuaTuo๋Š” CMeKG๋กœ ์ฒ˜์Œ์œผ๋กœ ์ค‘๊ตญ์–ด ์˜๋ฃŒ ์ง€์‹์„ ํ•™์Šตํ•˜๊ณ , ์ค‘๊ตญ์–ด ์˜๋ฃŒ ํ…์ŠคํŠธ์—์„œ LLMs๋ฅผ fine-tuningํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๋ณด๋‹ค ๊ตฌ์ฒด์ ์ธ ์˜๋ฃŒ ์‘์šฉ์„ ์œ„ํ•ด, Liu et al.๋Š” ์˜๋ฃŒ LLM ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์˜ ํ™˜๊ฐ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ ํŽธ์ง‘ ๋ฐฉ๋ฒ•์„ ์„ค๊ณ„ํ•˜์˜€์œผ๋ฉฐ, Xu et al. ๋Š” ์˜๋ฃŒ LLMs์—์„œ ๋ฐœ์ƒํ•˜๋Š” ํ™˜๊ฐ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ ํŽธ์ง‘ ๋ฐฉ๋ฒ•์„ ์„ค๊ณ„ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ์ด์ „ ์ž‘์—…์€ ํŠน์ • ์˜๋ฃŒ ์ž‘์—…์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์—ฌ๋Ÿฌ ์ค‘์š”ํ•œ ์ž‘์—…์„ ๋™์‹œ์— ๋‹ค๋ฃจ๋Š” ๊ฒƒ์„ ๊ฐ„๊ณผํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์ด๋Ÿฌํ•œ ์ ‘๊ทผ๋ฒ•์€ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ๋‹ฌ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ์ƒ๋‹นํ•œ fine-tuning ๋น„์šฉ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

Parameter Efficient Fine-tuning

Parameter-Efficient Fine-Tuning (PEFT) methods aim to improve the performance of LLMs on new tasks while minimizing the number of fine-tuned parameters and the computational complexity. Adapter Tuning first introduced lightweight adapter modules, which contain only a small number of trainable parameters.

Prefix-Tuning and P-Tuning construct task-specific virtual tokens by prepending trainable continuous prompts or embeddings to the original sequence. However, using prompts can cause difficulties for long inputs because of sequence length limits.

LoRA introduces two trainable low-rank matrices for each dense layer and has been shown to achieve performance close to full fine-tuning without extra computation during inference.

However, LoRA fine-tuning still needs to be optimized for multi-task medical applications. LoRA-based PEFT methods are still evolving, and we are taking a first step in this direction.


Conclusion

์ด ๋…ผ๋ฌธ์—์„œ๋Š” LLM-driven ์˜๋ฃŒ ์‘์šฉ์„ ์œ„ํ•œ multi-task parameter efficient fine-tuning์˜ ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„๋ฅผ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ํšจ์œจ์„ฑ๊ณผ ์„ฑ๋Šฅ์„ ๋งŒ์กฑ์‹œํ‚ค๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” MOELoRA๋ผ๋Š” ์ƒˆ๋กœ์šด multi-task fine-tuning ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

 

๊ตฌ์ฒด์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ์—ฌ๋Ÿฌ low-rank ํ–‰๋ ฌ๋กœ ๊ตฌ์„ฑ๋œ MOELoRA ์•„ํ‚คํ…์ฒ˜๋ฅผ ์„ค๊ณ„ํ•˜์—ฌ, trainable ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ task-specific ์ง€์‹ ๋ฐ ๋†’์€ ํšจ์œจ์„ฑ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ฐ ์ž‘์—…์— ๋Œ€ํ•œ ๋…ํŠนํ•œ fine-tuned ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๋Š” task-motivated gate function์„ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

์ค‘๊ตญ ์˜๋ฃŒ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด, ์ œ์•ˆ๋œ MOELoRA์˜ ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ–ˆ์Šต๋‹ˆ๋‹ค. ํ–ฅํ›„์—๋Š” ์ง€์‹ ๊ทธ๋ž˜ํ”„์™€ ๊ฐ™์€ ๋ณต์žกํ•œ ์˜๋ฃŒ ์ง€์‹์„ LLMs์™€ ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํƒ๊ตฌํ•  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.