A A
[CV] YOLO (You Only Look Once)

YOLO (You Only Look Once)

 

YOLO(You Only Look Once)์€ ์‹ค์‹œ๊ฐ„ ๊ฐ์ฒด ํƒ์ง€ ์‹œ์Šคํ…œ์œผ๋กœ, ์ด๋ฏธ์ง€๋‚˜ ๋น„๋””์˜ค์—์„œ ์—ฌ๋Ÿฌ ๊ฐ์ฒด๋ฅผ ๋™์‹œ์— ํƒ์ง€ํ•˜๊ณ  ๋ถ„๋ฅ˜ํ•˜๋Š” ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค.
 

YOLO: Real-Time Object Detection

YOLO: Real-Time Object Detection You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev. Comparison to Other Detectors YOLOv3 is extremel

pjreddie.com

  • YOLO์˜ ๊ฐ€์žฅ ํฐ ์žฅ์ ์€ ๋น ๋ฅธ ์†๋„์™€ ๋†’์€ ์ •ํ™•๋„์ž…๋‹ˆ๋‹ค.

 

YOLO์˜ ์›๋ฆฌ

  • ๋‹จ์ผ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•œ ์ „์ฒด ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ: YOLO๋Š” ์ด๋ฏธ์ง€๋ฅผ ์—ฌ๋Ÿฌ ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋ˆ„์ง€ ์•Š๊ณ , ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
  • Grid Cell ๋กœ ์ด๋ฏธ์ง€ ๋ถ„ํ• : ์ด๋ฏธ์ง€๋ฅผ ๊ทธ๋ฆฌ๋“œ ์…€๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์€ ํŠน์ • ๊ฐ์ฒด๋ฅผ ํฌํ•จํ•  ํ™•๋ฅ ๊ณผ ๊ทธ ๊ฐ์ฒด์˜ ๊ฒฝ๊ณ„ ์ƒ์ž(Bounding Box) ์ขŒํ‘œ๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฒฝ๊ณ„ ์ƒ์ž ์˜ˆ์ธก: ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์€ BB๊ฐœ์˜ ๊ฒฝ๊ณ„ ์ƒ์ž๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ๊ฒฝ๊ณ„ ์ƒ์ž๋Š” ์ค‘์‹ฌ ์ขŒํ‘œ (x,y, ๋„ˆ๋น„ w, ๋†’์ด , ๊ทธ๋ฆฌ๊ณ  ๊ฐ์ฒด ํด๋ž˜์Šค์— ๋Œ€ํ•œ ํ™•๋ฅ  ๊ฐ’์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  • ํด๋ž˜์Šค ํ™•๋ฅ  ์ง€๋„: ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์€ ํŠน์ • ํด๋ž˜์Šค์— ์†ํ•  ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ์ตœ์ข… ๊ฐ์ฒด ํƒ์ง€๋Š” ๊ฒฝ๊ณ„ ์ƒ์ž์˜ ํ™•๋ฅ ๊ณผ ํด๋ž˜์Šค ํ™•๋ฅ ์„ ๊ณฑํ•œ ๊ฐ’์ด ๊ฐ€์žฅ ๋†’์€ ๊ฒƒ์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.

 

YOLO์˜ ํŠน์ง•

  • ์†๋„: YOLO๋Š” ์ด๋ฏธ์ง€ ๋‹น ํ•œ ๋ฒˆ์˜ ์‹ ๊ฒฝ๋ง ํ‰๊ฐ€๋กœ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋งค์šฐ ๋น ๋ฅธ ์†๋„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. 
  • ์ „์ฒด ์ด๋ฏธ์ง€ ๋งฅ๋ฝ ์ด์šฉ: YOLO๋Š” ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋ฏ€๋กœ, ๋‹ค๋ฅธ ๊ฐ์ฒด ํƒ์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณด๋‹ค ์ ์€ ์ˆ˜์˜ false positives(์ž˜๋ชป๋œ ํƒ์ง€)๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.
  • ์ข…๋‹จ๊ฐ„ ํ›ˆ๋ จ: ์ด๋ฏธ์ง€ ๋ถ„ํ• , ๊ฐ์ฒด ํƒ์ง€, ๊ทธ๋ฆฌ๊ณ  ํด๋ž˜์Šค ์˜ˆ์ธก์˜ ๋ชจ๋“  ๋‹จ๊ณ„๋ฅผ ๋‹จ์ผ ์‹ ๊ฒฝ๋ง์ด ๋™์‹œ์— ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

 

YOLO v3 ์„ฑ๋Šฅ

Speed + Accuracy

 

  • YOLO v3 ๋ชจ๋ธ์€ ๊ธฐ์กด ๋ชจ๋ธ๋ณด๋‹ค ๋‹ค์–‘ํ•œ ๋‹ค์ง€์•ˆ ๊ฐœ์„ ์„ ํ†ตํ•ด ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • 320x320 ํ•ด์ƒ๋„์—์„œ๋Š” 22ms์˜ ์ถ”๋ก  ์‹œ๊ฐ„๊ณผ 28.2 mAP๋ฅผ ๊ธฐ๋กํ•˜๋ฉฐ, SSD์™€ ๋™์ผํ•œ ์ •ํ™•๋„๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์„ธ ๋ฐฐ ๋” ๋น ๋ฆ…๋‹ˆ๋‹ค.
  • ๋˜ํ•œ, 416x416 ํ•ด์ƒ๋„์—์„œ๋Š” 29ms์˜ ์ถ”๋ก  ์‹œ๊ฐ„๊ณผ 31.0 mAP, 608x608 ํ•ด์ƒ๋„์—์„œ๋Š” 51ms์˜ ์ถ”๋ก  ์‹œ๊ฐ„๊ณผ 33.0 mAP๋ฅผ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•  ๋•Œ, YOLOv3๋Š” ๋งค์šฐ ๋น ๋ฅธ ์†๋„์™€ ๋†’์€ ์ •ํ™•๋„๋ฅผ ์ž๋ž‘ํ•˜๋ฉฐ, RetinaNet๋ณด๋‹ค 3.8๋ฐฐ ๋น ๋ฅธ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ํŠน์„ฑ ๋•๋ถ„์— ์ž์œจ์ฃผํ–‰, ๊ฐ์‹œ ์‹œ์Šคํ…œ ๋“ฑ ๋‹ค์–‘ํ•œ ์‹ค์‹œ๊ฐ„ ์‘์šฉ ๋ถ„์•ผ์—์„œ ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

YOLO Version

  • YOLOv1: ์ฒซ ๋ฒˆ์งธ ๋ฒ„์ „์œผ๋กœ, ์ด๋ฏธ์ง€ ๋‹น ํ•œ ๋ฒˆ์˜ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‹จ์ผ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋น ๋ฅธ Detection ์†๋„๋ฅผ ๋ณด์˜€์ง€๋งŒ, ์ •ํ™•๋„๊ฐ€ ๋‚ฎ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.
  • YOLOv2 (๋˜๋Š” YOLO9000): ๊ธฐ์กด ๋ฒ„์ „๋ณด๋‹ค๋Š” ๋” ๋†’์€ ์ •ํ™•๋„์™€ ๋น ๋ฅธ ์†๋„๋ฅผ ์œ„ํ•ด ๋ช‡ ๊ฐ€์ง€ ๊ฐœ์„  ์‚ฌํ•ญ์ด ์ถ”๊ฐ€๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, anchor boxes๋ฅผ ๋„์ž…ํ•˜์—ฌ ๊ฒฝ๊ณ„ ์ƒ์ž ์˜ˆ์ธก์„ ๋” ์ž˜ํ•˜๊ฒŒ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋‹ค์ค‘ ์Šค์ผ€์ผ ํ•™์Šต์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ๋” ์ž˜ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • YOLOv3: ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ๋” ์ž˜ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ๋„๋ก multi-scale detection ๊ธฐ๋Šฅ์„ ์ถ”๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ž‘์€ ๊ฐ์ฒด์™€ ํฐ ๊ฐ์ฒด ๋ชจ๋‘์— ๋Œ€ํ•ด ๋†’์€ ํƒ์ง€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ˆ˜ํ–‰์‹œ๊ฐ„์€ ์กฐ๊ธˆ ๊ธฐ์กด ๋ฒ„์ „๋ณด๋‹ค๋Š” ๋Š๋ ค์กŒ์œผ๋‚˜, ์„ฑ๋Šฅ์ด ๋Œ€ํญ ๊ฐœ์„ ๋˜์—ˆ๋‹ค๋Š” ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • YOLOv4, YOLOv5: ์„ฑ๋Šฅ ์ตœ์ ํ™”๋ฅผ ํ†ตํ•ด ๋” ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•œ ๊ฐ์ฒด ํƒ์ง€๊ฐ€ ๊ฐ€๋Šฅํ•ด์กŒ์Šต๋‹ˆ๋‹ค. YOLOv4๋Š” CSPDarknet-53 ๋ฐฑ๋ณธ ๋„คํŠธ์›Œํฌ์™€ ์ตœ์‹  ๊ธฐ๋ฒ•๋“ค์„ ์ ์šฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ–ˆ์œผ๋ฉฐ, YOLOv5๋Š” PyTorch๋กœ ๊ตฌํ˜„๋˜์–ด ์‚ฌ์šฉ์ด ๊ฐ„ํŽธํ•˜๊ณ  ํ™•์žฅ์„ฑ์ด ๋›ฐ์–ด๋‚ฉ๋‹ˆ๋‹ค. ์ด๋“ค ๋ฒ„์ „์€ ๋†’์€ ํšจ์œจ์„ฑ๊ณผ ์ •ํ™•๋„๋กœ ์‹ค์‹œ๊ฐ„ ์‘์šฉ ๋ถ„์•ผ์—์„œ ํƒ์›”ํ•œ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

YOLO - V1

  • Yolo v1 ์€ ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ S X S Grid๋กœ ๋‚˜๋ˆ„๊ณ  ๊ฐ Grid์˜ Cell์ด ํ•˜๋‚˜์˜ Object์— ๋Œ€ํ•œ Detection์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฐ Grid Cell ์ด 2๊ฐœ์˜ Bounding Box ํ›„๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ Object์˜ Bounding Box ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

 

YOLO-V1 Detection Model

YOLO V1 ๋ชจ๋ธ์˜ Inference๋ฅผ ํ•œ๋ฒˆ ์„ค๋ช…ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Inference ๊ณผ์ •

์ž…๋ ฅ ์ด๋ฏธ์ง€ (Input image)

  • ํฌ๊ธฐ (Size): 448×448×3
  • ์„ค๋ช… (Description): ์ƒ‰์ƒ ์ฑ„๋„(RGB)์„ ํฌํ•จํ•œ 448x448 ํฌ๊ธฐ์˜ ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Convolutional Layers ๋ฐ GoogLeNet ์ˆ˜์ • (20 layers)

  • ์„ค๋ช… (Description): ์ž…๋ ฅ ์ด๋ฏธ์ง€๋Š” ๋‹ค์ˆ˜์˜ Convolutional Layers๋ฅผ ๊ฑฐ์นฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์—์„œ ์ด๋ฏธ์ง€์˜ ํŠน์ง•์ด ์ถ”์ถœ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” GoogLeNet์„ ์ˆ˜์ •ํ•˜์—ฌ 20๊ฐœ์˜ ๋ ˆ์ด์–ด๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Fully Connected Layers (FC)์™€ ReLU ํ™œ์„ฑํ™” ํ•จ์ˆ˜ (R)

  • ์„ค๋ช… (Description): Convolutional Layers ์ดํ›„์—๋Š” Fully Connected Layers๊ฐ€ ๋’ค๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ์ด ๋ ˆ์ด์–ด๋“ค์€ ์ด๋ฏธ์ง€์˜ ์ „์—ญ ์ •๋ณด๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ ์ตœ์ข… ์˜ˆ์ธก์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค.

์ตœ์ข… ์ถœ๋ ฅ (7x7x30 Tensor)

  • ์„ค๋ช… (Description): ์ตœ์ข…์ ์œผ๋กœ 7x7x30 ํฌ๊ธฐ์˜ ํ…์„œ๋กœ ์ถœ๋ ฅ์ด ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค. ์ด ํ…์„œ๋Š” ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์ด ์˜ˆ์ธกํ•œ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

ํ…์„œ ๊ฐ’ ํ•ด์„ (Tensor values interpretation)

  • ์„ค๋ช… (Description): ์ถœ๋ ฅ ํ…์„œ๋Š” S×S×(B×5+C) ํ˜•ํƒœ๋กœ ํ•ด์„๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ S๋Š” ๊ทธ๋ฆฌ๋“œ ์…€์˜ ํฌ๊ธฐ (์—ฌ๊ธฐ์„œ๋Š” 7x7), B๋Š” ๊ฐ ์…€์—์„œ ์˜ˆ์ธกํ•œ ๊ฒฝ๊ณ„ ์ƒ์ž ์ˆ˜ (์—ฌ๊ธฐ์„œ๋Š” 2), C๋Š” ํด๋ž˜์Šค ์ˆ˜ (์—ฌ๊ธฐ์„œ๋Š” 20)์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ…์„œ ํฌ๊ธฐ๋Š” 7×7×30์ด ๋ฉ๋‹ˆ๋‹ค.

์„ธ๋ถ€ ๊ตฌ์กฐ (Detailed Structure):

  • ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์€ 30๊ฐœ์˜ ๊ฐ’์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค:
    • 5๊ฐœ์˜ ๊ฐ’: ๊ฐ ๊ฒฝ๊ณ„ ์ƒ์ž๋งˆ๋‹ค (x, y, w, h, confidence)
    • 20๊ฐœ์˜ ํด๋ž˜์Šค ํ™•๋ฅ  ๊ฐ’.

 

์˜ˆ์ธก ํ•ด์„ (Prediction Interpretation)

๊ทธ๋ฆฌ๋“œ ์…€ ์˜ˆ์ธก (Grid Cell Prediction)

  • ์„ค๋ช… (Description): ์ž…๋ ฅ ์ด๋ฏธ์ง€๊ฐ€ 7x7 ๊ทธ๋ฆฌ๋“œ๋กœ ๋‚˜๋‰˜์–ด์ง€๊ณ , ๊ฐ ์…€์€ ๊ฐ์ฒด๊ฐ€ ํ•ด๋‹น ์…€ ์•ˆ์— ์žˆ๋Š”์ง€๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ๋นจ๊ฐ„์ƒ‰ ๋ฐ•์Šค๋Š” ๊ทธ๋ฆฌ๋“œ ์…€์ด ํƒ์ง€ํ•œ ๊ฐ์ฒด์˜ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

๊ฒฝ๊ณ„ ์ƒ์ž (Bounding Box)์™€ ํด๋ž˜์Šค ํ™•๋ฅ  (Class Probabilities)

  • ์„ค๋ช… (Description): ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์€ B๊ฐœ์˜ ๊ฒฝ๊ณ„ ์ƒ์ž๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ฒฝ๊ณ„ ์ƒ์ž๋Š” ์ค‘์‹ฌ ์ขŒํ‘œ(x, y), ๋„ˆ๋น„(w), ๋†’์ด(h), ๊ทธ๋ฆฌ๊ณ  confidence score(๊ฐ์ฒด๊ฐ€ ํฌํ•จ๋  ํ™•๋ฅ )๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ฒฝ๊ณ„ ์ƒ์ž๋Š” ํด๋ž˜์Šค ํ™•๋ฅ ๊ณผ ๊ณฑํ•ด์ ธ ์ตœ์ข… ํƒ์ง€ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค.

Detection Procedure

  • ์„ค๋ช… (Description): ์ตœ์ข…์ ์œผ๋กœ ๋ชจ๋“  ๊ทธ๋ฆฌ๋“œ ์…€์˜ ์˜ˆ์ธก์„ ํ†ตํ•ฉํ•˜์—ฌ ๊ฐ์ฒด์˜ ์œ„์น˜์™€ ํด๋ž˜์Šค๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์ธก๋œ ๊ฒฝ๊ณ„ ์ƒ์ž์™€ ํด๋ž˜์Šค ํ™•๋ฅ ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€์— ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
๊ฐ Grid Cell ๋ณ„๋กœ ์•„๋ž˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
A. 2๊ฐœ์˜ Bounding Box ํ›„๋ณด์˜ ์ขŒํ‘œ์™€ ํ•ด๋‹น Box๋ณ„

Confidence Score
Confidence Score = ์˜ค๋ธŒ์ ํŠธ์ผ ํ™•๋ฅ  * IOU ๊ฐ’
x, y, w, h : Ground Truth box์™€ ํ›„๋ณด Box๊ฐ„ offset ์ขŒํ‘œ
B. ํด๋ž˜์Šค ํ™•๋ฅ . Pascal VOC ๊ธฐ์ค€ 20๊ฐœ ํด๋ž˜์Šค์˜ ํ™•๋ฅ 


NMS(Non Max Suppression)์œผ๋กœ ์ตœ์ข… Bbox ์˜ˆ์ธก

๊ฐœ๋ณ„ ํด๋ž˜์Šค๋ณ„ NMS๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ์ตœ์ข… Bounding Box๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ณผ์ •์„ ํ•œ๋ฒˆ ์„ค๋ช…ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

๊ฐœ๋ณ„ Class ๋ณ„ NMS ์ˆ˜ํ–‰

  1. ํŠน์ • Confidence ๊ฐ’ ์ดํ•˜๋Š” ๋ชจ๋‘ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.
  2. ๊ฐ€์žฅ ๋†’์€ Confidence๊ฐ’์„ ๊ฐ€์ง„ ์ˆœ์œผ๋กœ Bounding Box๋ฅผ ์ •๋ ฌํ•ฉ๋‹ˆ๋‹ค.
  3. ๊ฐ€์žฅ ๋†’์€ Confidence๋ฅผ ๊ฐ€์ง„ Bounding Box์™€ IOU์™€ ๊ฒน์น˜๋Š” ๋ถ€๋ถ„์ด IOU Threshold ๋ณด๋‹ค ํฐ Bounding Box๋Š” ๋ชจ๋‘ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.
  4. ๋‚จ์•„ ์žˆ๋Š” Bounding Box์— ๋Œ€ํ•ด 3๋ฒˆ Step์„ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.

Object Confidence์™€ IOU Threshold๋กœ Filtering์„ ์กฐ์ ˆํ•ฉ๋‹ˆ๋‹ค.


YOLO-V1 Issue

Detection ์‹œ๊ฐ„์€ ๋น ๋ฅด๋‚˜ Detection ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค. ํŠนํžˆ ์ž‘์€ Object์— ๋Œ€ํ•œ ์„ฑ๋Šฅ์ด ๋‚˜์ฉ๋‹ˆ๋‹ค.

YOLO v1, v2, v3 ๋น„๊ต

  • YOLOv1์€ 446x446 ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋ฅผ Inception ๋ณ€ํ˜• ๋„คํŠธ์›Œํฌ๋กœ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์ด 2๊ฐœ์˜ ์•ต์ปค ๋ฐ•์Šค๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • YOLOv2๋Š” 416x416 ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, Darknet 19 ๋„คํŠธ์›Œํฌ๋ฅผ ์ฑ„ํƒํ•˜๊ณ , ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์ด 5๊ฐœ์˜ ์•ต์ปค ๋ฐ•์Šค๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.์•ต์ปค ๋ฐ•์Šค๋Š” K-Means ํด๋Ÿฌ์Šคํ„ฐ๋ง์œผ๋กœ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.
  • YOLOv3๋Š” 416x416 ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋ฅผ Darknet 53 ๋„คํŠธ์›Œํฌ๋กœ ์ฒ˜๋ฆฌํ•˜๊ณ , ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ Feature Map์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด 9๊ฐœ์˜ ์•ต์ปค ๋ฐ•์Šค๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ์ด ๋ฒ„์ „์—์„œ๋Š” 13x13, 26x26, 52x52 ํฌ๊ธฐ์˜ ์„ธ ๊ฐ€์ง€ Feature Map์„ ์‚ฌ์šฉํ•˜๋ฉฐ, Feature Pyramid Network(FPN) ๊ธฐ๋ฒ•์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ๋” ์ž˜ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

YOLO-v2

YOLO-v2 Detection ์‹œ๊ฐ„ ๋ฐ ์„ฑ๋Šฅ

 

  • YOLOv2๋Š” PASCAL VOC 2007 ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋†’์€ ์ •ํ™•๋„(mAP)์™€ ๋น ๋ฅธ ์ฒ˜๋ฆฌ ์†๋„(FPS)๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์— ๋น„ํ•ด ์ „๋ฐ˜์ ์œผ๋กœ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
  • MS-COCO ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ YOLOv2๋Š” ๋†’์€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉฐ, ํŠนํžˆ ํฐ ๊ฐ์ฒด์— ๋Œ€ํ•ด ์šฐ์ˆ˜ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ž‘์€ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์€ ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์ง€๋งŒ, ์—ฌ์ „ํžˆ ์ „๋ฐ˜์ ์œผ๋กœ ๊ท ํ˜• ์žกํžŒ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • YOLOv2๋Š” ์ •ํ™•๋„์™€ ์†๋„๋ฅผ ๋ชจ๋‘ ์ค‘์š”์‹œํ•˜๋Š” ์‘์šฉ ๋ถ„์•ผ์—์„œ ๋งค์šฐ ํšจ๊ณผ์ ์ธ ๋ชจ๋ธ์ž„์„ ์ž…์ฆํ•ฉ๋‹ˆ๋‹ค.

YOLO v2์˜ ํŠน์ง•

  • Batch Normalization
    • ๊ฐ ์ปจ๋ณผ๋ฃจ์…˜ ๋ ˆ์ด์–ด ๋’ค์— ๋ฐฐ์น˜ ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ•˜์—ฌ ๋ชจ๋ธ์˜ ํ•™์Šต์„ ์•ˆ์ •ํ™”ํ•˜๊ณ , ๊ณผ์ ํ•ฉ(overfitting)์„ ์ค„์ž…๋‹ˆ๋‹ค.
  • High Resolution Classifier
    • ๋„คํŠธ์›Œํฌ์˜ ๋ถ„๋ฅ˜(Classifier) ๋‹จ์„ ๋” ๋†’์€ ํ•ด์ƒ๋„(448x448)๋กœ ๋ฏธ์„ธ ์กฐ์ •(fine-tuning)ํ•ฉ๋‹ˆ๋‹ค.
  • 13 x 13 Feature Map ๊ธฐ๋ฐ˜์˜ Anchor Box
    • 13 x 13 ํฌ๊ธฐ์˜ Feature Map์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ Grid cell์€ 5๊ฐœ์˜ Anchor Box๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • Anchor Box๋ฅผ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์™€ ํ˜•ํƒœ์˜ ๊ฐ์ฒด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • Darknet-19 Classification ๋ชจ๋ธ ์ฑ„ํƒ
    • YOLOv2๋Š” Darknet-19 ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค.
    • Darknet-19๋Š” 19๊ฐœ์˜ Convolutional ๋ ˆ์ด์–ด์™€ 5๊ฐœ์˜ Max-Pooling ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋œ ๊ฒฝ๋Ÿ‰์˜ ํšจ์œจ์ ์ธ ๋ถ„๋ฅ˜ ๋„คํŠธ์›Œํฌ์ž…๋‹ˆ๋‹ค.
  • ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋กœ ๋„คํŠธ์›Œํฌ ํ•™์Šต
    • ์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ ๋„คํŠธ์›Œํฌ๋ฅผ ํ•™์Šต์‹œํ‚ต๋‹ˆ๋‹ค.
    • ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์€ ๋‹ค์–‘ํ•œ ํ•ด์ƒ๋„์˜ ์ด๋ฏธ์ง€์—์„œ ์ผ๊ด€๋˜๊ฒŒ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

YOLO-v2 Network ๊ตฌ์กฐ

 

  • Conv 3x3: 3x3 ํฌ๊ธฐ์˜ Convolution Layer(์ปจ๋ณผ๋ฃจ์…˜ ๋ ˆ์ด์–ด)๋กœ Feature(ํŠน์ง•)์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
  • Maxpool: Maxpooling Layer(์ตœ๋Œ€ ํ’€๋ง ๋ ˆ์ด์–ด)๋กœ ๊ณต๊ฐ„ ํ•ด์ƒ๋„๋ฅผ ์ค„์ด๊ณ  ์ค‘์š”ํ•œ Feature(ํŠน์ง•)์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
  • Conv 1x1: 1x1 ํฌ๊ธฐ์˜ Convolution Layer(์ปจ๋ณผ๋ฃจ์…˜ ๋ ˆ์ด์–ด)๋กœ ์ฑ„๋„ ์ˆ˜๋ฅผ ์ค„์ด๊ฑฐ๋‚˜ ๋Š˜๋ฆฝ๋‹ˆ๋‹ค.
  • Passthrough Module: ๋‚ฎ์€ ํ•ด์ƒ๋„์—์„œ ๋†’์€ ํ•ด์ƒ๋„๋กœ ํŠน์ง•์„ ์ „๋‹ฌํ•˜์—ฌ ์ž‘์€ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  • Merge Module: ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ Featuer Map(ํŠน์„ฑ ๋งต)์„ ๋ณ‘ํ•ฉํ•˜์—ฌ ๋” ํ’๋ถ€ํ•œ ํŠน์ง•์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

Yolo v2 Anchor Box๋กœ 1 Cell์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ Object Detection

  • SSD์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ 1๊ฐœ์˜ Cell์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ Anchor๋ฅผ ํ†ตํ•ด ๊ฐœ๋ณ„ Cell์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ Object Detection์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • K-Means Clustering ์„ ํ†ตํ•ด ๋ฐ์ดํ„ฐ ์„ธํŠธ์˜ ์ด๋ฏธ์ง€ ํฌ๊ธฐ์™€ Shape Ratio ๋”ฐ๋ฅธ 5๊ฐœ์˜ ๊ตฐ์ง‘ํ™” ๋ถ„๋ฅ˜๋ฅผ ํ•˜์—ฌ Anchor Box๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

YOLO v2 Output Feature Map

๊ฐ„๋‹จํ•˜๊ฒŒ ์‚ฌ์ง„์˜ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

  • Grid Cell: 13x13 ํฌ๊ธฐ์˜ ๊ทธ๋ฆฌ๋“œ๋กœ ๋‚˜๋‰˜์–ด, ๊ฐ ์…€์ด ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  • Bounding Box: ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์€ 5๊ฐœ์˜ Bounding Box๋ฅผ ์˜ˆ์ธกํ•˜๋ฉฐ, ๊ฐ Bounding Box๋Š” 25๊ฐœ์˜ ์ •๋ณด๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  • Bounding Box ์†์„ฑ: ์ค‘์‹ฌ ์ขŒํ‘œ, ๋„ˆ๋น„, ๋†’์ด, ๊ฐ์ฒด ํฌํ•จ ํ™•๋ฅ , ํด๋ž˜์Šค ํ™•๋ฅ  ๋“ฑ์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

One Stage Detector

One-stage object detectors๋Š” ๊ฐ์ฒด ํƒ์ง€์—์„œ ๋‹จ์ผ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  ๊ฐ์ฒด์˜ ์œ„์น˜์™€ ํด๋ž˜์Šค๋ฅผ ๋™์‹œ์— ์˜ˆ์ธกํ•˜๋Š” ์ ‘๊ทผ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ์†๋„๊ฐ€ ๋น ๋ฅด๊ณ  ์‹ค์‹œ๊ฐ„ ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.

YOLOv1 (You Only Look Once)

  • ์ถœ์‹œ: 2016๋…„
  • ์ฃผ์š” ํŠน์ง•:
    • ์ด๋ฏธ์ง€ ์ „์ฒด๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ๋‹จ์ผ ์ปจ๋ณผ๋ฃจ์…˜ ์‹ ๊ฒฝ๋ง์„ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ฒด ์ด๋ฏธ์ง€์—์„œ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ๋น ๋ฅธ ์†๋„์™€ ๋น„๊ต์  ๋†’์€ ์ •ํ™•๋„๋ฅผ ์ž๋ž‘ํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹จ์ : ์ž‘์€ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์ด ๋‚ฎ๊ณ , ๊ฐ์ฒด๊ฐ€ ๋ฐ€์ง‘๋œ ์ƒํ™ฉ์—์„œ ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

 

SSD (Single Shot MultiBox Detector)

  • ์ถœ์‹œ: 2016๋…„
  • ์ฃผ์š” ํŠน์ง•:
    • ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๋””ํดํŠธ ๋ฐ•์Šค(default boxes)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ์—ฌ๋Ÿฌ ํ•ด์ƒ๋„์˜ ํ”ผ์ฒ˜ ๋งต(feature map)์—์„œ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ท ํ˜• ์žกํžŒ ์ •ํ™•๋„์™€ ์†๋„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹จ์ : ๊ณ ํ•ด์ƒ๋„ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์Šต๋‹ˆ๋‹ค.

 

YOLOv2 (YOLO9000)

  • ์ถœ์‹œ: 2017๋…„
  • ์ฃผ์š” ํŠน์ง•:
    • ์•ต์ปค ๋ฐ•์Šค(anchor boxes)๋ฅผ ๋„์ž…ํ•˜์—ฌ ๊ฒฝ๊ณ„ ์ƒ์ž ์˜ˆ์ธก ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
    • ๋‹ค์ค‘ ์Šค์ผ€์ผ ํ•™์Šต์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด ํƒ์ง€ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
    • ๋ฐฐ์น˜ ์ •๊ทœํ™”(batch normalization)์™€ ๊ณ ํ•ด์ƒ๋„ ๋ถ„๋ฅ˜๊ธฐ(high resolution classifier)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ์žฅ์ : ๋†’์€ ์ •ํ™•๋„์™€ ๋น ๋ฅธ ์†๋„, ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

RetinaNet

  • ์ถœ์‹œ: 2017๋…„
  • ์ฃผ์š” ํŠน์ง•:
    • ํฌ์ปฌ ๋กœ์Šค(Focal Loss)๋ฅผ ๋„์ž…ํ•˜์—ฌ ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค.
    • ResNet๊ณผ FPN(Feature Pyramid Network)์„ ๋ฐฑ๋ณธ ๋„คํŠธ์›Œํฌ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ๋†’์€ ์ •ํ™•๋„์™€ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹จ์ : YOLO ์‹œ๋ฆฌ์ฆˆ์— ๋น„ํ•ด ๋‹ค์†Œ ๋Š๋ฆฝ๋‹ˆ๋‹ค.

 

YOLOv3

  • ์ถœ์‹œ: 2018๋…„
  • ์ฃผ์š” ํŠน์ง•:
    • Darknet-53 ๋ฐฑ๋ณธ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ๋‹ค์ค‘ ์Šค์ผ€์ผ ์˜ˆ์ธก: ์„ธ ๊ฐ€์ง€ ๋‹ค๋ฅธ ํฌ๊ธฐ์—์„œ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐœ์„ ๋œ ์•ต์ปค ๋ฐ•์Šค์™€ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ์žฅ์ : ๋†’์€ mAP(mean Average Precision)๊ณผ ๋น ๋ฅธ ์ถ”๋ก  ์†๋„๋ฅผ ์ž๋ž‘ํ•ฉ๋‹ˆ๋‹ค.

 

Feature Pyramid Network (FPN)

  • ์„ค๋ช…: ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํƒ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์ค‘ ์Šค์ผ€์ผ ํ”ผ์ฒ˜๋ฅผ ํ†ตํ•ฉํ•˜๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ํŠน์ง•:
    • ํ”ผ๋ผ๋ฏธ๋“œ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ํ•ด์ƒ๋„์˜ ํ”ผ์ฒ˜ ๋งต์—์„œ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ณ ํ•ด์ƒ๋„์™€ ์ €ํ•ด์ƒ๋„ ํ”ผ์ฒ˜ ๋งต์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด ํƒ์ง€ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
    • RetinaNet๊ณผ ๊ฐ™์€ ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉ๋˜์–ด ๋†’์€ ์ •ํ™•๋„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

YOLO-v3

YOLO v3์˜ ํŠน์ง•

  • Feature Pyramid Network (FPN) ์œ ์‚ฌํ•œ ๊ธฐ๋ฒ• ์ ์šฉ
    • ์„ค๋ช…: YOLOv3๋Š” Feature Pyramid Network(FPN)์™€ ์œ ์‚ฌํ•œ ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ๋‹ค์ค‘ ์Šค์ผ€์ผ ๊ฐ์ฒด ํƒ์ง€๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
    • ํŠน์ง•: 3๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ์™€ ์Šค์ผ€์ผ์„ ๊ฐ€์ง„ Feature Map์—์„œ ๊ฐ๊ฐ 3๊ฐœ์˜ Anchor Box๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
      • 13x13 ํฌ๊ธฐ์˜ Feature Map: ํฐ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
      • 26x26 ํฌ๊ธฐ์˜ Feature Map: ์ค‘๊ฐ„ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
      • 52x52 ํฌ๊ธฐ์˜ Feature Map: ์ž‘์€ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • ์žฅ์ : ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ž‘์€ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค.
  • ๋†’์€ ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ์„ ๊ฐ€์ง€๋Š” Darknet-53
    • ์„ค๋ช…: YOLOv3๋Š” Darknet-53์„ ๋ฐฑ๋ณธ ๋„คํŠธ์›Œํฌ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ํŠน์ง•: Darknet-53์€ 53๊ฐœ์˜ Convolutional Layer๋กœ ๊ตฌ์„ฑ๋œ ๊นŠ์€ ์‹ ๊ฒฝ๋ง์ž…๋‹ˆ๋‹ค.
    • ์žฅ์ : ๋” ๋†’์€ ํŠน์ง• ์ถ”์ถœ ๋Šฅ๋ ฅ์„ ์ œ๊ณตํ•˜์—ฌ ๊ฐ์ฒด ํƒ์ง€์˜ ์ •ํ™•๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค. ์ด์ „ ๋ฒ„์ „์˜ Darknet-19๋ณด๋‹ค ๋” ๊นŠ๊ณ  ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
  • Multi-Label ์˜ˆ์ธก
    • ์„ค๋ช…: YOLOv3๋Š” ๊ฐœ๋ณ„ ๊ฐ์ฒด์— ๋Œ€ํ•ด ๋‹ค์ค‘ ๋ ˆ์ด๋ธ”์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
    • ํŠน์ง•: Softmax ๋Œ€์‹  Sigmoid ๊ธฐ๋ฐ˜์˜ Logistic Classifier๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ๊ฐ์ฒด์— ๋Œ€ํ•ด ๋‹ค์ค‘ ๋ ˆ์ด๋ธ”์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ์žฅ์ : ํ•œ ๊ฐ์ฒด๊ฐ€ ์—ฌ๋Ÿฌ ํด๋ž˜์Šค์— ์†ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ์—๋„ ์ •ํ™•ํ•˜๊ฒŒ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ํ•œ ๊ฐ์ฒด๊ฐ€ ๋™์‹œ์— '๊ฐœ'์™€ '์• ์™„๋™๋ฌผ'์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

YOLO v3 Network ๊ตฌ์กฐ

๋จผ์ €, ์š”์•ฝ๋‚ด์šฉ์„ ์„ค๋ช…๋“œ๋ฆฌ๋ฉด, Multi-Scale Object Detection์„ ์œ„ํ•œ FPN (Feature Pyramid Network) ๊ธฐ๋ฒ•์ด ์ ์šฉ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

  • Backbone Network: Darknet-53
    • YOLOv3๋Š” 53๊ฐœ์˜ Convolution Layer๋กœ ๊ตฌ์„ฑ๋œ Darknet-53์„ Backbone Network๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ์ด Backbone Network๋Š” ์ด๋ฏธ์ง€์—์„œ ๊ณ ์ˆ˜์ค€์˜ ํŠน์ง•์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
  • Residual Blocks
    • ๋„คํŠธ์›Œํฌ๋Š” Residual Blocks๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šต ์†๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๊ณ , ๊นŠ์€ ๋„คํŠธ์›Œํฌ์—์„œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” Gradient Loss ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•ฉ๋‹ˆ๋‹ค.
    • Residual Blocks๋Š” ์ž…๋ ฅ์„ ์ถœ๋ ฅ์— ๋”ํ•˜๋Š” ์Šคํ‚ต ์—ฐ๊ฒฐ(skip connection)์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.
  • Feature Pyramid Network (FPN)
    • YOLOv3๋Š” FPN ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ๋‹ค์ค‘ ์Šค์ผ€์ผ์—์„œ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
    • 3๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ์˜ Feature Map์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
      • Scale 1: 13x13 Feature Map (Stride: 32) - ํฐ ๊ฐ์ฒด ํƒ์ง€
      • Scale 2: 26x26 Feature Map (Stride: 16) - ์ค‘๊ฐ„ ํฌ๊ธฐ ๊ฐ์ฒด ํƒ์ง€
      • Scale 3: 52x52 Feature Map (Stride: 8) - ์ž‘์€ ๊ฐ์ฒด ํƒ์ง€
  • Detection Heads
    • ๊ฐ Feature Map์€ ๊ฐ๊ฐ์˜ Detection Head๋ฅผ ํ†ตํ•ด ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ Detection Head๋Š” 3๊ฐœ์˜ Anchor Box๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด์˜ ๊ฒฝ๊ณ„ ์ƒ์ž์™€ ํด๋ž˜์Šค ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ์ด๋กœ์จ ๊ฐ Feature Map์—์„œ ์ด 9๊ฐœ์˜Anchor Box๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  • Upsampling and Merging
    • ๋†’์€ ํ•ด์ƒ๋„์˜ Feature Map์—์„œ ์˜ˆ์ธก์„ ๋” ์ž˜ํ•˜๊ธฐ ์œ„ํ•ด ๋‚ฎ์€ ํ•ด์ƒ๋„์˜ Feature Map์„ Upsamplingํ•˜์—ฌ ๋†’์€ ํ•ด์ƒ๋„์˜ Feature Map๊ณผ ๋ณ‘ํ•ฉํ•ฉ๋‹ˆ๋‹ค.

YOLO์™€ SSD์˜ ๋น„๊ต

3๊ฐ€์ง€์˜ ์ฃผ์ œ๋กœ ๋น„๊ต๋ฅผ ํ•œ๋ฒˆ ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค

 

 

  • Feature Map์˜ ํ•ด์ƒ๋„:
    • YOLOv2๋Š” ๋‹จ์ผ ํ•ด์ƒ๋„(13x13)๋งŒ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • YOLOv3๋Š” 3๊ฐœ์˜ ํ•ด์ƒ๋„(52x52, 26x26, 13x13)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • SSD๋Š” 6๊ฐœ์˜ ํ•ด์ƒ๋„(1x1, 2x2, 3x3, 5x5, 10x10, 19x19)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค์ค‘ ์Šค์ผ€์ผ ๊ฐ์ฒด ํƒ์ง€:
    • YOLOv3์™€ SSD๋Š” ๋‹ค์ค‘ ์Šค์ผ€์ผ ๊ฐ์ฒด ํƒ์ง€๋ฅผ ์ง€์›ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ๋ณด๋‹ค ํšจ๊ณผ์ ์œผ๋กœ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • YOLOv2๋Š” ๋‹จ์ผ ํ•ด์ƒ๋„๋ฅผ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์ƒ๋Œ€์ ์œผ๋กœ ๋‹ค์ค‘ ์Šค์ผ€์ผ ๊ฐ์ฒด ํƒ์ง€์— ์ œํ•œ์ ์ž…๋‹ˆ๋‹ค.
  • ์ž‘์€ ๊ฐ์ฒด ํƒ์ง€ ์„ฑ๋Šฅ:
    • YOLOv3์™€ SSD๋Š” ์ž‘์€ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚จ. ์ด๋Š” ์ž‘์€ ํฌ๊ธฐ์˜ Feature Map์„ ์‚ฌ์šฉํ•˜์—ฌ ๋” ์„ธ๋ฐ€ํ•œ ํƒ์ง€๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
    • YOLOv2๋Š” ๋‹จ์ผ ํ•ด์ƒ๋„(13x13)๋กœ ์ธํ•ด ์ž‘์€ ๊ฐ์ฒด์— ๋Œ€ํ•œ ํƒ์ง€ ์„ฑ๋Šฅ์ด ์ƒ๋Œ€์ ์œผ๋กœ ๋‚ฎ์Šต๋‹ˆ๋‹ค.

YOLO v3 Output Feature Map

 

  • YOLOv3๋Š” ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๊ธฐ ์œ„ํ•ด 13x13, 26x26, 52x52 ํฌ๊ธฐ์˜ ์„ธ ๊ฐ€์ง€ Feature Map์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฐ Feature Map์€ ํ•ด๋‹น ํ•ด์ƒ๋„์—์„œ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๋ฉฐ, ์—ฌ๋Ÿฌ ๊ฐœ์˜ Anchor Box๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฝ๊ณ„ ์ƒ์ž์™€ ํด๋ž˜์Šค ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋•Œ, Bounding Box๋“ค์˜ Attribute๋“ค์„ ํ•œ๋ฒˆ ์•Œ์•„๋ณด๋ฉด?
    • Box Coordinates (tx, ty, tw, th): Bounding Box์˜ ์ค‘์‹ฌ ์ขŒํ‘œ (tx, ty)์™€ ๋„ˆ๋น„ (tw), ๋†’์ด (th)๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. aka. ์ขŒํ‘œ๊ฐ’
    • Objectness Score (p0): ํ•ด๋‹น ์•ต์ปค ๋ฐ•์Šค๊ฐ€ ๊ฐ์ฒด๋ฅผ ํฌํ•จํ•  ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. aka. Object ํฌํ•จ ํ™•๋ฅ  x IOU
    • Class Scores (p1, p2, ..., pc): ๊ฐ ํด๋ž˜์Šค์— ๋Œ€ํ•œ ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ์ด๋•Œ class์˜ score๋“ค์„ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•˜์—ฌ ์‚ฌ์šฉํ•˜๋Š” Dataset๋“ค์€ Pretrained๋œ Coco Dataset์ž…๋‹ˆ๋‹ค.
  • ๋˜ํ•œ ๊ฐ Anchor Box๋Š” ์ด B๊ฐœ์˜ ์ •๋ณด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ B๋Š” Anchor Box์˜ ์ˆ˜์ž…๋‹ˆ๋‹ค.

Darknet-53 ํŠน์„ฑ

 

  • Top-1 ์ •ํ™•๋„: 77.2%
  • Top-5 ์ •ํ™•๋„: 93.8%
  • Bn Ops (Billion Operations): 18.7
  • BFLOP/s (Billion Floating Point Operations per Second): 1457
  • FPS (Frames Per Second): 78

 

  • ์ดˆ๊ธฐ Convolutional Layer: 32๊ฐœ์˜ 3x3 ํ•„ํ„ฐ (์ถœ๋ ฅ: 256x256).
  • ์ดํ›„ Residual Block: ๊ฐ ๋ธ”๋ก์€ 1x1๊ณผ 3x3 Convolutional Layer๋กœ ๊ตฌ์„ฑ, ๋ธ”๋ก๋งˆ๋‹ค ์ถœ๋ ฅ ํฌ๊ธฐ๋Š” ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค.
  • ์ตœ์ข… ๋ ˆ์ด์–ด: Global Average Pooling, Fully Connected Layer, Softmax Layer.
Darknet-53์€ YOLOv3์˜ ๋ฐฑ๋ณธ ๋„คํŠธ์›Œํฌ๋กœ ์‚ฌ์šฉ๋˜๋ฉฐ, ๋†’์€ ํŠน์ง• ์ถ”์ถœ ๋Šฅ๋ ฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
Residual Block์„ ์‚ฌ์šฉํ•˜์—ฌ ๊นŠ์€ ๋„คํŠธ์›Œํฌ์—์„œ๋„ ํšจ์œจ์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜๋ฉฐ, ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•ฉ๋‹ˆ๋‹ค.

Training

YOLO v3 ๋ชจ๋ธ์ด Training๋˜๋Š” Flow๋ฅผ ํ•œ๋ฒˆ ์„ค๋ช…ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

  • ๋ฐ์ดํ„ฐ ์ค€๋น„ (Data Preparation)
    • Annotation ํŒŒ์ผ: XML ํ˜•์‹์˜ ์ฃผ์„ ํŒŒ์ผ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ๊ฐ์ฒด์˜ ์œ„์น˜์™€ ํด๋ž˜์Šค ์ •๋ณด๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ์‹œ์—์„œ๋Š” PASCAL VOC ํ˜•์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.
      • <object> ํƒœ๊ทธ ๋‚ด์— ๊ฐ์ฒด์˜ ํด๋ž˜์Šค ์ด๋ฆ„๊ณผ Bounding Box ์ขŒํ‘œ(xmin, ymin, xmax, ymax)๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.
  • ๋ฐ์ดํ„ฐ ์ฆ๊ฐ• (Data Augmentation)
    • ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ๋‹ค์–‘์„ฑ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ๋ณ€ํ™˜์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ์ด๋ฏธ์ง€ ํšŒ์ „, ํฌ๊ธฐ ์กฐ์ •, ์ƒ‰์ƒ ๋ณ€ํ™” ๋“ฑ์„ ํ†ตํ•ด ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ์ƒํ™ฉ์— ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค์ค‘ ์Šค์ผ€์ผ ํ•™์Šต (Multi-Scale Training)
    • YOLOv3๋Š” ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์ด ์—ฌ๋Ÿฌ ํ•ด์ƒ๋„์˜ ๊ฐ์ฒด๋ฅผ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
    • ์ด๋ฏธ์ง€์˜ ํ•ด์ƒ๋„๋ฅผ ์ฃผ๊ธฐ์ ์œผ๋กœ ๋ณ€๊ฒฝํ•˜์—ฌ ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํ•™์Šตํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
  • ํ”ผ์ฒ˜ ๋งต ์ƒ์„ฑ (Feature Map Generation)
    • Prediction Feature Map: ์ž…๋ ฅ ์ด๋ฏธ์ง€๊ฐ€ ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ต๊ณผํ•˜๋ฉด์„œ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ Feature Map์ด ์ƒ์„ฑ๋ฉ๋‹ˆ๋‹ค.
    • 13x13, 26x26, 52x52 Feature Map: ์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ์˜ ๊ฐ์ฒด๋ฅผ ํƒ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ํ•ด์ƒ๋„์˜ Feature Map์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • ์•ต์ปค ๋ฐ•์Šค ์„ค์ • (Anchor Box Setting)
    • ์„ค๋ช…: ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์•ต์ปค ๋ฐ•์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ๊ตฌ์„ฑ ์š”์†Œ:
      • Box Coordinates (tx,ty,tw,th): Bounding Box์˜ ์ค‘์‹ฌ ์ขŒํ‘œ์™€ ํฌ๊ธฐ.
      • Objectness Score (p0p_0): ํ•ด๋‹น ๋ฐ•์Šค๊ฐ€ ๊ฐ์ฒด๋ฅผ ํฌํ•จํ•  ํ™•๋ฅ .
      • Class Scores (p1,p2,...,pc): ๊ฐ ํด๋ž˜์Šค์— ๋Œ€ํ•œ ํ™•๋ฅ .
  • ์˜ˆ์ธก (Prediction)
    • ์„ค๋ช…: ๊ฐ Feature Map์˜ Grid Cell์—์„œ Anchor Box๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด์˜ ์œ„์น˜์™€ ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ๊ณผ์ •:
      • ๊ฐ Grid Cell์€ ์—ฌ๋Ÿฌ Anchor Box๋ฅผ ํ†ตํ•ด ๊ฐ์ฒด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
      • ์˜ˆ์ธก๋œ Bounding Box์™€ ํด๋ž˜์Šค ํ™•๋ฅ ์„ ํ†ตํ•ด ์ตœ์ข… ๊ฐ์ฒด ํƒ์ง€ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค.

Location Prediction

 

 

  • ๊ทธ๋ฆฌ๋“œ ์…€ ๋‚ด์˜ Bounding Box ์ค‘์‹ฌ ๊ณ„์‚ฐ
    • ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ tx, ty ๊ฐ’์€ Sigmoid Function(์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜)๋ฅผ ํ†ตํ•ด 0์—์„œ 1 ์‚ฌ์ด๋กœ ์กฐ์ •๋ฉ๋‹ˆ๋‹ค.
    • ์ด ๊ฐ’์€ ๊ทธ๋ฆฌ๋“œ ์…€์˜ ์˜คํ”„์…‹ cx, cy์— ๋”ํ•ด์ ธ Bounding Box์˜ ์ค‘์‹ฌ ์ขŒํ‘œ bx, by๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    • ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด Bounding Box์˜ ์ค‘์‹ฌ์ด ํ•ด๋‹น Grid Cell ๋‚ด์— ์œ„์น˜ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • Bounding Box ํฌ๊ธฐ ๊ณ„์‚ฐ
    • ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ tw, th ๊ฐ’์€ ๊ฐ๊ฐ Anchor Box์˜ ๋„ˆ๋น„ pw์™€ ๋†’์ด ph์— ์ง€์ˆ˜ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์—ฌ ๊ณฑํ•ด์ง‘๋‹ˆ๋‹ค.
    • ์ด๋ฅผ ํ†ตํ•ด ์ตœ์ข… Bounding Box์˜ ๋„ˆ๋น„ bw์™€ ๋†’์ด bh๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.
  • ๊ฐ์ฒด ํ™•๋ฅ  ๊ณ„์‚ฐ
    • ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ๊ฐ์ฒด ํ™•๋ฅ  ๊ฐ’ to๋Š” Sigmoid Function(์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜)๋ฅผ ํ†ตํ•ด 0์—์„œ 1 ์‚ฌ์ด๋กœ ์กฐ์ •๋ฉ๋‹ˆ๋‹ค.
    • ์ด ๊ฐ’์€ ํ•ด๋‹น Bounding Box๊ฐ€ ๊ฐ์ฒด๋ฅผ ํฌํ•จํ•  ํ™•๋ฅ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ๊ฐ์ฒด ํ™•๋ฅ  ๊ฐ’ σ(toโ€‹)๋ฅผ ์ ์šฉ ํ•˜๋Š” ์ด์œ ๋Š” Center(์ค‘์•™) ์ขŒํ‘œ๊ฐ€ Cell ์ค‘์‹ฌ์„ ๋„ˆ๋ฌด ๋ฒ—์–ด๋‚˜์ง€ ์•Š๋„๋ก 0~1 ์‚ฌ์ด์˜ Sigmoid ๊ฐ’์œผ๋กœ ์กฐ์ ˆํ•ฉ๋‹ˆ๋‹ค. ์ด์œ ๋Š” ์ ์šฉํ•˜์ง€ ์•Š์œผ๋ฉด ์ด๋ฏธ์ง€ ๋์œผ๋กœ๊ฐ€๊ฑฐ๋‚˜ ๋ฒ—์–ด๋‚  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. 

Multi Label ์˜ˆ์ธก

YOLO v3 ๋ชจ๋ธ์€ ์ตœ์ข… Detection ๋‹จ๊ณ„์—์„œ Softmax Function(์†Œํ”„ํŠธ๋งฅ์ˆ˜ ํ•จ์ˆ˜)๋ฅผ ์•ˆ์”๋‹ˆ๋‹ค.
๋Œ€์‹ , ์—ฌ๋Ÿฌ๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ Logistic Classifier๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. (ํด๋ž˜์Šค ๋ณ„๋กœ)

 


YOLO v3 ์„ฑ๋Šฅ ๋น„๊ต

COCO AP ๊ธฐ์ค€ ์„ฑ๋Šฅ ๋น„๊ต

YOLOv3

  • mAP: YOLOv3-320: 28.2, YOLOv3-416: 31.0, YOLOv3-608: 33.0
  • ์ถ”๋ก  ์‹œ๊ฐ„: YOLOv3-320: 22ms, YOLOv3-416: 29ms, YOLOv3-608: 51ms
  • ํŠน์ง•: ๋น ๋ฅธ ์†๋„์™€ ๋†’์€ ์ •ํ™•๋„๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ž‘์€ ๋ชจ๋ธ(YOLOv3-320)์€ ํŠนํžˆ ๋น ๋ฅธ ์ถ”๋ก  ์‹œ๊ฐ„์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

RetinaNet

  • mAP: RetinaNet-50: 32.5, RetinaNet-101: 34.4
  • ์ถ”๋ก  ์‹œ๊ฐ„: RetinaNet-50: 73ms, RetinaNet-101: 90ms
  • ํŠน์ง•: ๋†’์€ ์ •ํ™•๋„ ์ œ๊ณตํ•˜์ง€๋งŒ YOLOv3๋ณด๋‹ค ์ถ”๋ก  ์‹œ๊ฐ„์ด ๋” ๊น๋‹ˆ๋‹ค.

SSD

  • mAP: SSD321: 28.0, SSD513: 31.2
  • ์ถ”๋ก  ์‹œ๊ฐ„: SSD321: 61ms, SSD513: 125ms
  • ํŠน์ง•: ์ค‘๊ฐ„ ์ •๋„์˜ ์ •ํ™•๋„์™€ ์ถ”๋ก  ์‹œ๊ฐ„. YOLOv3๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

๋‹ค๋ฅธ ๋ชจ๋ธ

  • R-FCN: mAP: 29.9, ์ถ”๋ก  ์‹œ๊ฐ„: 85ms
  • FPN FRCN: mAP: 36.2, ์ถ”๋ก  ์‹œ๊ฐ„: 172ms

COCO mAP-50 ๊ธฐ์ค€ ์„ฑ๋Šฅ ๋น„๊ต

YOLOv3

  • mAP-50: YOLOv3-320: 51.5, YOLOv3-416: 55.3, YOLOv3-608: 57.9
  • ์ถ”๋ก  ์‹œ๊ฐ„: YOLOv3-320: 22ms, YOLOv3-416: 29ms, YOLOv3-608: 51ms
  • ํŠน์ง•: ๋†’์€ mAP-50 ์ œ๊ณต. ํŠนํžˆ YOLOv3-608์€ ๊ฐ€์žฅ ๋†’์€ mAP-50๊ณผ ๋น ๋ฅธ ์ถ”๋ก  ์‹œ๊ฐ„์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

RetinaNet

  • mAP-50: RetinaNet-50: 53.1, RetinaNet-101: 57.5
  • ์ถ”๋ก  ์‹œ๊ฐ„: RetinaNet-50: 73ms, RetinaNet-101: 198ms
  • ํŠน์ง•: ๋†’์€ mAP-50 ์ œ๊ณตํ•˜์ง€๋งŒ, ์ถ”๋ก  ์‹œ๊ฐ„์ด ์ƒ๋Œ€์ ์œผ๋กœ ๊น๋‹ˆ๋‹ค.

SSD

  • mAP-50: SSD321: 45.4, SSD513: 50.3
  • ์ถ”๋ก  ์‹œ๊ฐ„: SSD321: 61ms, SSD513: 125ms
  • ํŠน์ง•: ์ค‘๊ฐ„ ์ •๋„์˜ ์„ฑ๋Šฅ ์ œ๊ณต. YOLOv3๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

๋‹ค๋ฅธ ๋ชจ๋ธ

  • R-FCN: mAP-50: 51.9, ์ถ”๋ก  ์‹œ๊ฐ„: 85ms
  • FPN FRCN: mAP-50: 59.1, ์ถ”๋ก  ์‹œ๊ฐ„: 172ms