[kakaotech] Pitching Dev Log - PLLaVA Paper Review

While studying VLMs (Vision Language Models), I read the PLLaVA paper. This post shares my notes on it.
  • Paper Link
 

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the pr

arxiv.org


Key Concepts of PLLaVA

PLLaVA studies how to adapt an existing image-language pre-trained model (LLaVA) to video data efficiently. Pre-training for video tasks demands enormous compute and data resources; PLLaVA proposes a way to overcome that constraint and improve video understanding in a more efficient manner.

 

Figure 1: Overview of PLLaVA's performance.

 

(a) An example caption generated by PLLaVA 34B. (b) Performance comparison between PLLaVA and recent strong baselines on various video benchmarks. (c) Scaling curves of PLLaVA and recent SOTA methods.

 


Core Problems and Solutions

The authors identified two main problems that arise when an existing image-language model is applied to video data.

 

  1. Prompt sensitivity: the fine-tuned model reacts very strongly to changes in the prompt pattern
  2. Limits of scaling the language model: increasing the language model size does not improve video understanding performance

 

Analyzing the root cause of these problems, they found that certain visual feature tokens acquire markedly larger norms than the other tokens during fine-tuning.

Histograms of generated text length for 4-Frame and PLLaVA. The x-axis is text length and the y-axis is the number of generated texts of that length. 4-Frame generates shorter text with more training steps and under out-of-distribution prompts, while PLLaVA stays consistent in both situations.

 

As the figure shows, the baseline 4-Frame approach produces shorter text both after more training steps and under out-of-distribution prompts, whereas PLLaVA remains consistent in both cases.

 

Figure 3: Norm distributions and generated text for n-frame and PLLaVA

 

The figure above compares the norm distributions and generated text of n-frame and PLLaVA. In the n-frame setting, the more data samples the model is trained on, the more dominant tokens (tokens with high norms) appear and the lower the quality of the generated text. PLLaVA's norm distribution, by contrast, stays consistent across different amounts of training data and different prompts, and its generated text is of consistent quality as well.
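
To make "dominant tokens" concrete, here is a minimal sketch of how one might flag them from a list of visual token vectors. The median-based rule, the `ratio` threshold, and the function names are my own illustrative assumptions, not the criterion used in the paper, and plain Python lists stand in for real feature tensors.

```python
import math
import statistics


def token_norms(tokens):
    """Return the L2 norm of each token vector (a list of equal-length float lists)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def dominant_token_indices(tokens, ratio=3.0):
    """Return indices of tokens whose norm exceeds `ratio` times the median norm.

    Assumption: "dominant" is defined relative to the median norm.
    The rule and the default ratio are hypothetical, chosen for illustration.
    """
    norms = token_norms(tokens)
    median = statistics.median(norms)
    return [i for i, n in enumerate(norms) if n > ratio * median]


# Toy example: three ordinary tokens and one with a much larger norm.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [30.0, 40.0]]
toy_norms = token_norms(toy_tokens)
toy_dominant = dominant_token_indices(toy_tokens)
```

Tracking how many indices such a check returns over the course of fine-tuning is one simple way to watch for the norm blow-up described above.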

 

 


PLLaVA's Core Solution: A Pooling Strategy

To address these problems, PLLaVA proposes a simple but effective pooling strategy that smooths the feature distribution along the temporal dimension, reducing the dominant influence of extreme features.

 

 

The PLLaVA framework works as follows:

  1. Process the user-provided video with ViT-L and the MM projector
  2. Extract visual features of shape (T, w, h, d)
  3. Reduce the temporal and spatial dimensions with average pooling
  4. Flatten the pooled features and combine them with the question embeddings
  5. Feed the result to the image Large Language Model (LLM) to generate a response
  6. Combine the image LLM's weights with LoRA weights learned from video samples
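
Steps 2 to 4 can be sketched roughly as below. This is a dependency-free illustration, not the authors' code: nested Python lists stand in for the (T, w, h, d) feature tensor, the bin boundaries follow a floor/ceil adaptive-pooling convention that I am assuming, and placing the visual tokens before the question embeddings is also an assumption. In practice a tensor library's adaptive 3D average pooling would do this work.

```python
import math


def adaptive_bins(size_in, size_out):
    """Split `size_in` positions into `size_out` (start, end) index ranges."""
    return [
        (math.floor(i * size_in / size_out), math.ceil((i + 1) * size_in / size_out))
        for i in range(size_out)
    ]


def pool_video_features(feats, out_t, out_w, out_h):
    """Average-pool feats[T][w][h][d] to out_t x out_w x out_h, then flatten to a token list."""
    n_t, n_w, n_h = len(feats), len(feats[0]), len(feats[0][0])
    dim = len(feats[0][0][0])
    tokens = []
    for t0, t1 in adaptive_bins(n_t, out_t):
        for w0, w1 in adaptive_bins(n_w, out_w):
            for h0, h1 in adaptive_bins(n_h, out_h):
                count = (t1 - t0) * (w1 - w0) * (h1 - h0)
                # Mean over the (time, width, height) window, per channel.
                token = [
                    sum(
                        feats[t][w][h][c]
                        for t in range(t0, t1)
                        for w in range(w0, w1)
                        for h in range(h0, h1)
                    ) / count
                    for c in range(dim)
                ]
                tokens.append(token)
    return tokens


def build_llm_input(visual_tokens, question_embeddings):
    """Concatenate pooled visual tokens with question embeddings (order is an assumption)."""
    return visual_tokens + question_embeddings


# Toy features: T=2, w=4, h=4, d=2, where each vector is [frame index, w + h].
toy_feats = [[[[t, w + h] for h in range(4)] for w in range(4)] for t in range(2)]
pooled = pool_video_features(toy_feats, out_t=2, out_w=2, out_h=2)
llm_input = build_llm_input(pooled, [[0.5, 0.5]])
```

Because every output token is a mean over a window, a single extreme feature is diluted by its neighbors, which is the smoothing effect the pooling strategy relies on.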

Effect of Pooling and the Optimal Design

The paper analyzes how pooling affects the temporal and spatial dimensions.

 

 

As the figure above shows, reducing the spatial dimensions by 50% does not degrade model performance, whereas pooling along the temporal dimension always degrades it. PLLaVA therefore sets the spatial dimensions to 12×12, balancing computational overhead against performance.
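
The 12×12 choice is easier to appreciate as token arithmetic. The sketch below assumes 16 sampled frames and a 24×24 patch grid per frame before pooling. Neither number is stated in this post, so treat both as illustrative assumptions.

```python
def visual_token_count(frames, width, height):
    """Number of visual tokens passed to the LLM for a (frames, width, height) grid."""
    return frames * width * height


# Assumed for illustration only: 16 frames, 24x24 grid before spatial pooling.
FRAMES = 16
tokens_before = visual_token_count(FRAMES, 24, 24)
tokens_after = visual_token_count(FRAMES, 12, 12)
reduction_factor = tokens_before / tokens_after

print(tokens_before, tokens_after, reduction_factor)
```

Halving each spatial side cuts the visual token count by a factor of four while keeping every frame, which fits the observation that spatial pooling is cheap and temporal pooling is not.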


Post-Training Optimization

To address the performance degradation that comes with scaling up the model, PLLaVA proposes a Post-Training Optimization approach.
This method combines the language model (LLM) trained on video data with the original LLM of the base image MLLM.
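
One way to picture this combination, assuming the video-trained part takes the form of a LoRA update as in step 6 of the framework list, is a weighted sum of the original weight matrix and the low-rank update. The helper names and the `scale` parameter below are hypothetical, and plain lists stand in for weight tensors. This is a sketch of the general idea, not the paper's exact procedure.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(base, lora_b, lora_a, scale):
    """Return base + scale * (lora_b @ lora_a).

    `scale` (hypothetical name) controls how strongly the video-trained
    update is blended into the original image-LLM weights.
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# Toy 2x2 weight with a rank-1 update.
base_w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.5]]

unchanged = merge_lora(base_w, lora_b, lora_a, scale=0.0)
half = merge_lora(base_w, lora_b, lora_a, scale=0.5)
full = merge_lora(base_w, lora_b, lora_a, scale=1.0)
```

With `scale=0.0` the model is the original image LLM, with `scale=1.0` the video update is applied in full, and intermediate values blend the two.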

 

Graph showing that Video MLLMs do not improve when the model size is scaled up

 

  • As the figure above shows, the Post-Training Optimization approach effectively resolves the performance degradation that occurs when scaling up the model.

Experimental Results

PLLaVA showed strong performance across a range of video understanding benchmarks.

 

Figure 9: Effect of the downsampling rate on MVBench and VCG Score performance

 

 

This figure shows how the downsampling rate affects MVBench and VCG Score performance. The two benchmarks behave clearly differently, and the ideal blending ratio depends on the task. There are also several case studies that show PLLaVA's captioning ability:

 

Figure 10: Case Studies

 

As the figure above shows, compared with IG-VLM, PLLaVA 34B recognizes more details in the video and understands the video content more accurately.

 

Figure 11: Recaption comparison between PLLaVA 34B and Open-Sora

 

 

PLLaVA's recaptioning ability is also impressive. Compared with the Open-Sora GPT-4 pipeline, PLLaVA captures finer caption details and highlights motion information in the video more effectively.


Key Results

PLLaVA achieved the following results:

 

  1. On the Video ChatGPT benchmark, an average score of 3.48 across the five evaluated dimensions, a 9% improvement over the previous state-of-the-art result
  2. On MVBench, an average accuracy of 58.1% across 20 sub-tasks, 14.5% higher than the previous best
  3. On VideoQA, better accuracy and score metrics than all prior methods on MSVD, MSRVTT, ActivityNet, and TGIF

Conclusion

PLLaVA presents a simple yet highly effective way to extend image-language models to video. It makes it easier to scale training to more data and larger language models, and it offers a strategy for better controlling over-training and performance saturation.

I think PLLaVA's detailed captioning ability could also be useful for our presenter-analysis service.
In particular, the ability to analyze and describe a presenter's actions, gestures, and facial expressions in a presentation video in detail is appealing.