[kakaotech] Pitching Dev Log - Video-LLaMA Paper Review

Q. If we apply a VLM to detect and analyze a presenter's behavior, what do we need in order to get those features out through a prompt?
1. We need to find cases where a VLM has been used for object detection.
2. What is the longest video a VLM can analyze? (max 1 hour)
3. Is there a model we can use? How do we fine-tune it and run it?
While studying VLMs (Vision Language Models), I read the Video-LLaMA paper, and this post shares my notes on it.

 

  • Paper Link
 

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

We present Video-LLaMA a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and

arxiv.org


What is Video-LLaMA?

Video-LLaMA is a multi-modal framework built on BLIP-2 and MiniGPT-4,
designed so that a large language model (LLM) can understand both the visual and the audio content of a video.
Its main features are:

 

  • A unified architecture with a vision-language (VL) branch and an audio-language (AL) branch
  • The ability for the LLM to understand audio-visual content as a whole
  • Processing of video frames and the audio stream at the same time

 

Key Research Challenges and Solutions

Video-LLaMA focuses on solving the following two challenges:

 

  1. Capturing temporal changes in visual scenes
    • Proposes a Video Q-former that assembles a pre-trained image encoder into a video encoder
    • Introduces a video-to-text generation task to learn the correspondence between video and language
  2. Integrating audio-visual signals
    • Uses ImageBind, a universal embedding model that aligns multiple modalities, as the pre-trained audio encoder
    • Introduces an Audio Q-former to learn audio query embeddings suited to the language model module

 

Table 1. Comparison of Video-LLaMA with existing models

 


Architecture

Model Repo

 

GitHub - DAMO-NLP-SG/Video-LLaMA: [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Unde

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding - DAMO-NLP-SG/Video-LLaMA

github.com

 

1. Vision-Language Branch

The vision-language branch is designed so that the LLM can understand visual input. It consists of the following components.

 

  • Frozen pre-trained image encoder: extracts features from the video frames
  • Position Embedding Layer: injects temporal information into the video frames
  • Video Q-former: aggregates the frame-level representations
  • Linear Layer: projects the resulting video representation to the same dimension as the LLM's text embeddings
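
The four components above can be sketched as a tiny forward pass. This is a minimal pure-Python illustration, not the repository's code. It assumes one feature vector per frame (a real image encoder emits many patch tokens per frame), uses a single-head cross-attention as a stand-in for the Video Q-former, and fills every weight with random numbers. All names and sizes (`vision_language_branch`, 8 frames, 4 queries, and so on) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or encoder output."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Each learned query returns an attention-weighted average of the tokens."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)])
    return outputs


def vision_language_branch(frame_features, pos_emb, queries, proj):
    # Step 1 happens outside this function: a frozen image encoder
    # is assumed to have produced frame_features (one vector per frame).
    # Step 2: add a position embedding per frame to inject temporal order.
    tokens = [[f + p for f, p in zip(frame, pos)]
              for frame, pos in zip(frame_features, pos_emb)]
    # Step 3: a fixed number of queries aggregates all frames (Q-former stand-in).
    video_tokens = cross_attend(queries, tokens)
    # Step 4: linear projection to the LLM's embedding width.
    return matmul(video_tokens, proj)


rng = random.Random(0)
num_frames, feat_dim, num_queries, llm_dim = 8, 16, 4, 32  # illustrative sizes only
frame_features = rand_matrix(num_frames, feat_dim, rng)
pos_emb = rand_matrix(num_frames, feat_dim, rng)
queries = rand_matrix(num_queries, feat_dim, rng)
proj = rand_matrix(feat_dim, llm_dim, rng)

video_embeddings = vision_language_branch(frame_features, pos_emb, queries, proj)
print(len(video_embeddings), len(video_embeddings[0]))  # 4 32
```

In this sketch the output length depends only on the number of queries, not on the number of frames, so the LLM always receives a fixed-size video representation.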

 

2. Audio-Language Branch

The audio-language branch is designed to process the audio content of the video. It consists of the following components.

 

  • Pre-trained audio encoder (ImageBind): takes audio segments as input and computes their features
  • Position Embedding Layer: injects temporal information into the audio segments
  • Audio Q-Former: fuses the features of the audio segments
  • Linear Layer: maps the audio representation into the LLM's embedding space
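
Since the audio encoder works on segments, the audio track has to be cut into clips before encoding. The sketch below shows one way that could be done. The uniform spacing, the 2-second default clip length, and the function name `sample_audio_segments` are assumptions for illustration, not details taken from this post.

```python
def sample_audio_segments(duration_s, num_segments, clip_len_s=2.0):
    """Return (start, end) times in seconds of evenly spaced clips over the track."""
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    if duration_s <= clip_len_s:
        # Audio shorter than one clip: use it whole.
        return [(0.0, float(duration_s))]
    last_start = duration_s - clip_len_s
    if num_segments == 1:
        starts = [last_start / 2]  # a single clip is taken from the middle
    else:
        step = last_start / (num_segments - 1)
        starts = [i * step for i in range(num_segments)]
    return [(s, s + clip_len_s) for s in starts]


segments = sample_audio_segments(duration_s=10.0, num_segments=5)
print(segments)
# [(0.0, 2.0), (2.0, 4.0), (4.0, 6.0), (6.0, 8.0), (8.0, 10.0)]
```

The position of each clip in the returned list is the temporal index that a position embedding would encode.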

Training Method

Multi-branch cross-modal training

Video-LLaMA trains the Vision-Language Branch and the Audio-Language Branch separately.

 

  1. Training the Vision-Language Branch
    • Uses the Webvid-2M dataset (short videos with text descriptions from a stock footage site)
    • Also uses CC595k (an image caption dataset)
    • Applies a video-to-text generation task so that the LLM generates text descriptions of videos
    • Fine-tunes instruction-following ability on the datasets from MiniGPT-4, LLaVA, and Video-Chat
  2. Training the Audio-Language Branch
    • Uses a workaround to deal with the scarcity of audio-text data
    • Trains on visual-text data by exploiting the shared embedding space that ImageBind provides
    • This gives the model audio understanding without ever training explicitly on audio data
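
The training recipe above can be written down as a small data structure, which makes it easy to see which data each branch touches. This is only a note-taking sketch: the `Stage` class, the stage names, and the `datasets_for` helper are hypothetical and do not correspond to configuration files in the Video-LLaMA repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    branch: str      # which branch is being trained
    name: str        # informal label for the stage
    datasets: tuple  # data sources named in the summary above
    goal: str        # what the stage is meant to teach


TRAINING_PLAN = (
    Stage("vision-language", "pre-training",
          ("Webvid-2M", "CC595k"),
          "video-to-text generation"),
    Stage("vision-language", "instruction tuning",
          ("MiniGPT-4", "LLaVA", "Video-Chat"),
          "instruction following"),
    Stage("audio-language", "training through ImageBind's shared space",
          ("visual-text data",),
          "audio understanding without audio-text pairs"),
)


def datasets_for(branch, plan=TRAINING_PLAN):
    """List every data source used by one branch, in stage order."""
    return [d for stage in plan if stage.branch == branch for d in stage.datasets]


for stage in TRAINING_PLAN:
    print(f"{stage.branch:16} | {stage.name:42} | {', '.join(stage.datasets)}")
```

Calling `datasets_for("audio-language")` returns only visual-text data, which is the point of the workaround: the audio branch never sees audio-text pairs during training.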

 


Key Capabilities of Video-LLaMA

Across a range of experiments, Video-LLaMA demonstrated the following capabilities.

 

 

  1. Integrated audio-visual perception
    • Understands the visual and audio content of a video at the same time
    • Answers both vision-related and audio-related questions accurately
  2. Capturing temporal dynamics in videos
    • Recognizes and describes actions and movements that change over time
  3. Perceiving and understanding static images
    • Accurately describes the main content of an image
    • Understands and applies abstract concepts such as "unusual"
  4. Recognizing common-knowledge concepts
    • Recognizes famous landmarks and people
    • Responds appropriately to common-sense questions

 


Limitations

Video-LLaMA is a big step forward for multi-modal understanding, but it has several limitations.

 

  1. Limited perception capacity: constrained by the quality and scale of the training datasets
  2. Limited ability to handle long videos: struggles with long videos such as movies and TV shows
  3. Hallucination: inherits the hallucination problem of the underlying LLM
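
The long-video limitation matters for the 1-hour target raised at the top of this post. A model that only looks at a fixed budget of frames sees a long video very sparsely. The sketch below makes that concrete, assuming uniform sampling and a budget of 8 frames. Both are assumptions for illustration, not figures from this post, and the function names are hypothetical.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick the middle frame of each of num_samples equal-length chunks."""
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)  # cannot sample more than exist
    chunk = total_frames / num_samples
    return [int((i + 0.5) * chunk) for i in range(num_samples)]


def seconds_between_samples(duration_s, num_samples):
    """Average gap in seconds between two sampled frames."""
    return duration_s / num_samples


print(uniform_frame_indices(total_frames=80, num_samples=8))
# [5, 15, 25, 35, 45, 55, 65, 75]
print(seconds_between_samples(duration_s=3600, num_samples=8))
# 450.0
```

Under these assumptions, a 1-hour presentation would be represented by one frame every 7.5 minutes, so short gestures would almost certainly be missed. Splitting the video into shorter clips and analyzing each clip separately is one possible way around this.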

Takeaways

Here is what I think the Video-LLaMA model means for our product development.

 

  1. Feasibility of multi-modal analysis: shows that a presenter's visual cues (facial expression, gestures, posture) and audio cues (voice, tone, speed) could be analyzed together
  2. Capturing temporal changes: offers a methodology for detecting changes over the course of a presentation (e.g., changes in gestures or in tone of voice)
  3. Modular architecture: designing and training the vision-language and audio-language branches separately and then combining them is an approach we can apply to building a presenter analysis system module by module
  4. Strategy for data scarcity: the workaround used for the scarcity of audio-text data is worth referencing when presentation feedback data is scarce
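
As a first step toward getting presenter features out through a prompt, the features listed in item 1 could be requested explicitly and parsed from the model's reply. The sketch below is only a rough idea. The prompt wording, the JSON reply format, and the function names are all assumptions, and no real model API is called. Given the hallucination limitation, a real reply may not follow the requested format, so the parser tolerates missing keys.

```python
import json

VISUAL_FEATURES = ("facial expression", "gesture", "posture")
AUDIO_FEATURES = ("voice", "tone", "speaking speed")


def build_presenter_prompt(visual=VISUAL_FEATURES, audio=AUDIO_FEATURES):
    """Ask the model to describe each feature and answer as one JSON object."""
    keys = list(visual) + list(audio)
    lines = [
        "You are reviewing a video of a presentation.",
        "Describe the presenter for each feature below.",
        "Answer with a single JSON object using exactly these keys:",
    ]
    lines += [f'- "{key}"' for key in keys]
    return "\n".join(lines)


def parse_presenter_features(response_text, expected_keys):
    """Parse the model's reply; keys the model left out become None."""
    data = json.loads(response_text)
    return {key: data.get(key) for key in expected_keys}


prompt = build_presenter_prompt()
print(prompt)

# Hypothetical reply, written by hand for this example.
reply = '{"gesture": "open palms", "tone": "steady"}'
print(parse_presenter_features(reply, ("gesture", "tone", "posture")))
# {'gesture': 'open palms', 'tone': 'steady', 'posture': None}
```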