[kakaotech] Pitching Development Log - Vision Language Model Research

It has been a few months since I worked on the Pitching project as PM and AI tech lead, but I want to write up what we built as a way of organizing it all once more.
  • Pitching Github Organization

Pitching-kakaotech (github.com)

Pitching is a platform born from the idea: "Let's build a world where everyone communicates comfortably, speaks with confidence, and stays connected."


The idea came up in our team to build a product that uses AI to give real-time feedback on presentations.
Once we had defined the topic, the MVP, and the software requirements specification (SRS), I wrote up a summary of the requirements.

Feature prioritization: priorities are classified as "P0", "P1", and "P2", and are decided by weighing user value, technical difficulty, development time, and so on.

Priority | Feature | Description
P0 - AI | Data collection and AI model development | Secure the required data and train the AI models
P0 - AI | Gaze and facial expression (online) | Presentation feedback based on gaze and facial expression
P0 - AI | Voice | Feedback based on pronunciation accuracy, intonation, speed, whether the script was memorized, etc.
P1 - AI | Gesture (offline) | Presentation feedback based on gestures
P0 - FS | Basic login | Basic login and sign-up
P1 - FS | Simple login | OAuth2, JWT
P1 - FS | My page | Change profile picture and personal information
P0 - FS | Video conferencing | Presentation practice through real-time video conferencing
P1 - FS | Screen sharing | Share the screen during a real-time video conference
P2 - FS | Community for sharing presentation know-how | A community where users can share presentation tips and know-how
P0 - FS | Screen and audio recording | Record screen and audio for presentation feedback
P0 - FS | AI feedback request | Send the recorded video (audio) to the AI server to request feedback
P0 - FS | Chat | 1:1 and group chat, chat room creation
P1 - FS | Chat load testing | Test with JMeter or similar tools and improve
P2 - FS | Notification service | Chat notification service
P3 - FS | Community | Community service features
P4 - FS | Chat extras | Polls, emoji, replies, and other convenience features in chat
P0 - Cloud | Deployment, server setup, and CI/CD | AWS, Docker, k8s, Jenkins, ArgoCD, Ansible, Terraform
P1 - Cloud | Zero-downtime deployment | BlueGreen
P1 - Cloud | Monitoring | Grafana, Prometheus
P2 - Cloud | Terraform modularization |
P2 - Cloud | MSA | Distributed service architecture (only if time allows)

 

After we defined the requirements, the first feedback round raised two problems. First, the project as a whole was expected to make use of generative AI. Second, we learned from the Cloud team that the GPUs needed to train a CV model would not be available. So I started looking for a way to serve an AI model while using GPUs as little as possible and keeping resource usage to a minimum.

That is when I learned about VLMs (Vision-Language Models) and decided to do some research on them.


Building a Presenter Analysis Service with VLMs (Vision-Language Models)

VLM Landscape and Characteristics

At first I had no idea which model to use, so here are the characteristics of the main VLMs I came across during my research.

1. Video-LLaMA

 

GitHub - DAMO-NLP-SG/Video-LLaMA: [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (github.com)

The first model I looked at was Video-LLaMA. It is a multimodal framework built on top of BLIP-2 and MiniGPT-4, and its key characteristics are:
  • It has a unified architecture with a vision-language (VL) branch and an audio-language (AL) branch
  • It is designed so that a large language model can understand audio-visual content as a whole
  • It can process video frames and the audio stream together

2. Video-LLaVA

 

GitHub - PKU-YuanGroup/Video-LLaVA: 【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (github.com)

The second model, Video-LLaVA, handles images and video together particularly well. These are the characteristics that impressed me most:
  • It effectively extends an image-language pretrained model to video data
  • It uses video and image data so that they complement each other, which I found interesting
  • It aims to substantially improve performance on video-related tasks
  • It understands the temporal relationships between video frames well
I thought this model would work well for analyzing temporal elements such as a presenter's gestures and changes in facial expression.
However, using it would have meant recasting our feedback as question answering, and taking that on as well felt too ambitious given our situation.
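
To make the question-answering concern concrete, here is a minimal sketch of what recasting feedback criteria as questions might look like. The criteria, the question wording, and the prompt template are all hypothetical and not taken from the Pitching codebase or from Video-LLaVA itself; no model is called, the function only builds prompt strings.

```python
# Hypothetical feedback criteria mapped to questions a video QA model could answer.
FEEDBACK_QUESTIONS = {
    "gaze": "Does the presenter look at the camera for most of the clip?",
    "expression": "How does the presenter's facial expression change during the clip?",
    "gesture": "Which hand gestures does the presenter use, and how often?",
}


def build_prompts(criteria):
    """Return one (criterion, prompt) pair per requested criterion."""
    prompts = []
    for name in criteria:
        if name not in FEEDBACK_QUESTIONS:
            raise KeyError(f"unknown criterion: {name}")
        # Generic placeholder template; a real model defines its own format.
        prompt = f"Question: {FEEDBACK_QUESTIONS[name]}\nAnswer:"
        prompts.append((name, prompt))
    return prompts
```

Each feedback item needs its own question, and the free-text answers would still have to be turned back into structured feedback, which is the extra work mentioned above.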

3. Video-ChatGPT

 

GitHub - mbzuai-oryx/Video-ChatGPT: [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted fo... (github.com)

 

The third model, Video-ChatGPT, is a video conversation model that can generate meaningful conversation about videos. Its characteristics are:
  • It stood out as a recent model, published at ACL 2024
  • It combines the capabilities of a large language model with a pretrained visual encoder
  • It is reported to be trained on 100,000 video-instruction pairs
  • It includes a visual encoder for spatiotemporal video representation
  • It introduces a quantitative evaluation benchmark for video-based conversation models

4. PLLaVA

 

PLLaVA, a project that extends the vision-language model LLaVA to video (discuss.pytorch.kr)

Vision-language learning for video has been attracting a lot of attention recently. In particular, the PLLaVA (Pooling LLaVA) project...

The last model I looked at, PLLaVA, extends an existing image-language pretrained model to video data. The characteristics I focused on are:
  • It uses a simple pooling strategy to smooth features along the temporal dimension, which I found interesting
  • It is reported to improve performance by reducing the influence of dominant tokens in video frames
  • It blurs the boundary between computer vision and natural language processing, enabling finer-grained understanding of video content
  • It offers an efficient approach to processing long videos
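
As a rough illustration of what pooling along the temporal dimension means, the sketch below averages consecutive per-frame feature vectors. It is a simplified toy under my own assumptions, not PLLaVA's actual implementation: the function name, the plain-list feature format, and the fixed window are all made up for the example.

```python
def temporal_average_pool(frame_features, window):
    """Average every `window` consecutive frame feature vectors into one."""
    if window <= 0:
        raise ValueError("window must be positive")
    pooled = []
    for start in range(0, len(frame_features), window):
        group = frame_features[start:start + window]
        dim = len(group[0])
        # Element-wise mean over the frames in this group.
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
print(temporal_average_pool(frames, 2))  # [[2.0, 1.0], [6.0, 5.0]]
```

Four frames become two pooled vectors, so the language model sees fewer visual tokens and any single outlier frame has less weight.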

Service Development Direction After the Research

The research also helped me make the direction of the service more concrete.
  • Goal: use a vision-language model to analyze the presenter's video in real time and give feedback on both the content of the presentation and how it is delivered
  • Features: convert the spoken content to text with speech recognition, and evaluate non-verbal elements (facial expression, gestures, etc.) through video analysis

Technical Implementation Approach

The planned procedure for developing the service is as follows:

  1. Requirements analysis: first decide the kinds and depth of feedback the service will provide
  2. Data collection and preprocessing:
    • We are collecting presentation videos together with the corresponding feedback data
    • Measures to protect personal information are being applied at the same time
  3. Model selection and development:
    • Speech recognition model: we are reviewing Whisper, DeepSpeech, and others, and testing which one fits best
    • Video analysis model: we are choosing the best fit for our service from the VLMs introduced above
    • Text analysis: we are also studying how to analyze the transcribed text with NLP techniques
  4. System integration:
    • A streaming pipeline is needed for real-time processing
    • Optimizing the data flow between modules is another important task
  5. Performance optimization:
    • We are considering model compression and hardware acceleration
    • Minimizing latency through parallel and asynchronous processing will likely be important
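
The idea of running the audio and video analyses concurrently can be sketched with `asyncio`. This is only an outline under stated assumptions: `transcribe_audio` and `analyze_video` are hypothetical placeholders that sleep instead of calling a real model, and the chunk IDs are made up.

```python
import asyncio


async def transcribe_audio(chunk_id):
    """Placeholder for a speech recognition call (hypothetical)."""
    await asyncio.sleep(0.01)  # stands in for model latency
    return f"transcript for chunk {chunk_id}"


async def analyze_video(chunk_id):
    """Placeholder for a VLM video analysis call (hypothetical)."""
    await asyncio.sleep(0.01)  # stands in for model latency
    return f"visual notes for chunk {chunk_id}"


async def analyze_chunk(chunk_id):
    """Run both analyses concurrently and merge their results."""
    transcript, visual = await asyncio.gather(
        transcribe_audio(chunk_id), analyze_video(chunk_id)
    )
    return {"chunk": chunk_id, "transcript": transcript, "visual": visual}


async def analyze_presentation(num_chunks):
    """Analyze all chunks concurrently; results keep chunk order."""
    return await asyncio.gather(*(analyze_chunk(i) for i in range(num_chunks)))


results = asyncio.run(analyze_presentation(3))
```

With real models the calls would be compute-bound rather than I/O-bound, so they would probably need to run in worker threads or processes (for example via `asyncio.to_thread`) or behind a separate inference server for the concurrency to pay off.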

Technical Challenges and Solutions

During the research I also took stock of the current state of the technology and the problems that still need solving.

Current State of the Technology

  • High-quality real-time speech recognition models already exist, such as OpenAI's Whisper and Google's Speech-to-Text
  • Models that process images and text together, such as CLIP and ViLT, keep improving
  • Real-time posture and gesture recognition is possible with tools such as OpenPose
  • NLP techniques that analyze transcribed text for language usage patterns, sentiment, and tone are fairly mature
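
As a small example of the transcript analysis mentioned in the last point, the sketch below computes speaking rate and a filler-word count from timestamped transcript segments. The segment format `(start_sec, end_sec, text)` and the English filler list are assumptions for illustration; a Korean service would need its own tokenization and filler vocabulary.

```python
FILLER_WORDS = {"um", "uh", "like"}  # hypothetical English filler list


def speaking_stats(segments):
    """Compute words per minute and filler count.

    segments: list of (start_sec, end_sec, text) tuples from a transcript.
    """
    total_words = 0
    fillers = 0
    spoken_seconds = 0.0
    for start, end, text in segments:
        words = text.lower().split()
        total_words += len(words)
        # Strip trailing punctuation before checking against the filler list.
        fillers += sum(1 for w in words if w.strip(".,!?") in FILLER_WORDS)
        spoken_seconds += end - start
    wpm = total_words * 60 / spoken_seconds if spoken_seconds > 0 else 0.0
    return {"words_per_minute": wpm, "filler_count": fillers}
```

For instance, two six-second segments containing ten words in total give a rate of 50 words per minute.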

Technical Challenges to Solve

The main challenges I identified are:
  1. Optimizing object detection with a VLM
    • We need a methodology for detecting and analyzing the presenter's behavior
    • Prompt-based feature extraction needs more study
  2. Feasibility of long-video analysis
    • We need to verify how many minutes of video a VLM can analyze at most
    • The goal is a system that can analyze presentation videos of up to one hour
  3. Selecting and optimizing a suitable model
    • We are looking for a model that can convert video to text
    • If needed, we will also have to settle on a fine-tuning approach and on how to use the model
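
One possible way to approach the one-hour target, assuming the chosen VLM can only handle short clips, is to split the video into fixed-length windows and analyze each window separately. The sketch below only computes the window boundaries; the 60-second window and the overlap parameter are illustrative assumptions, not measured limits of any model.

```python
def split_into_windows(duration_sec, window_sec, overlap_sec=0.0):
    """Return (start, end) windows covering a video of `duration_sec` seconds."""
    if window_sec <= 0 or not 0 <= overlap_sec < window_sec:
        raise ValueError("need window_sec > 0 and 0 <= overlap_sec < window_sec")
    windows = []
    step = window_sec - overlap_sec
    start = 0.0
    while start < duration_sec:
        # Clamp the last window to the end of the video.
        windows.append((start, min(start + window_sec, duration_sec)))
        start += step
    return windows
```

A one-hour video with 60-second windows and no overlap yields 60 windows, so per-window latency and cost multiply accordingly, which is part of what the verification above would need to measure.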

Essential Requirements and Success Metrics

I summarized the essential requirements for the service as follows:

  • Real-time processing: feedback must be delivered with minimal latency
  • High accuracy: both speech recognition and video analysis must be accurate
  • Multilingual support: a range of languages, including Korean, must be supported
  • User privacy: security measures such as data encryption and anonymization are required
  • Scalability: the system must be able to scale as the number of users grows
  • User-friendly interface: the UI/UX must be easy to access and understand
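
As one small example of the privacy requirement, user identifiers attached to recordings could be replaced with keyed hashes before analysis results are stored. This is a sketch under my own assumptions (the function name and key handling are hypothetical), and strictly speaking it is pseudonymization rather than full anonymization.

```python
import hashlib
import hmac


def pseudonymize_user_id(user_id, secret_key):
    """Return a stable, non-reversible token for `user_id` using HMAC-SHA256."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

The same user always maps to the same token as long as the key is unchanged, so feedback history can still be linked without storing the raw identifier next to the video data.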

That wraps up my technical research for building a presenter analysis service with vision-language models. A variety of VLMs have already been developed (Video-LLaMA, Video-LLaVA, Video-ChatGPT, PLLaVA, and others), and I confirmed that each has its own characteristics and strengths. As the next step, I plan to post the results of selecting a specific model, building a prototype, and running test code to verify whether the service can actually be implemented. Thank you. 😊
