[CV] SSD - Single Shot (Multibox) Detector

Object Detection History

We will divide the history into three broad categories.

Source: https://arxiv.org/pdf/1905.05055.pdf

1. Traditional Detection Methods

  • VJ Detector (P. Viola et al., 2001):
    • Also known as the Viola-Jones object detector, this method is widely known as a feature-based face detection algorithm.
  • HOG Detector (N. Dalal et al., 2005):
    • Histogram of Oriented Gradients (HOG) detects objects using the local gradient-orientation information of an image.
  • DPM (P. Felzenszwalb et al., 2008):
    • The Deformable Part Model (DPM) splits an object into smaller parts and models the position and shape of each part.
    • In 2010, Bounding Box Regression was added, which improved performance.

2. Deep Learning Based Detection Methods

  • After 2012:
    • The arrival of deep-learning-based detectors shifted the paradigm of object detection.
    • In particular, the introduction of AlexNet marked the point where deep learning came into serious use.

2-1. Two-Stage Detector

  • R-CNN (R. Girshick et al., 2014):
    • Region-based Convolutional Neural Networks (R-CNN) generate candidate regions from the image and apply a CNN to each region to classify the object.
  • SPPNet (K. He et al., 2014):
    • The Spatial Pyramid Pooling Network produces a fixed-size feature representation regardless of the input image size, which addresses a weakness of R-CNN.
  • Fast R-CNN (R. Girshick, 2015):
    • A faster version of R-CNN that performs classification and bounding box regression together in a single network. Region proposals still come from a separate method.
  • Faster R-CNN (S. Ren et al., 2015):
    • Introduces the Region Proposal Network (RPN) to generate region proposals very quickly, and detects objects based on those proposals.

2-2. One-Stage Detector

  • YOLO (J. Redmon et al., 2016):
    • You Only Look Once (YOLO) processes the whole image in a single stage and detects objects very quickly.
  • SSD (W. Liu et al., 2016):
    • The Single Shot MultiBox Detector (SSD) uses default boxes of several sizes to detect objects of various sizes.
  • RetinaNet (T. Y. Lin et al., 2017):
    • Introduces Focal Loss to address the class imbalance problem, which improved performance considerably.

3. Recent Detection Methods

  • Feature Pyramid Networks (T. Y. Lin et al., 2017):
    • Use feature maps at multiple resolutions to detect small objects better.

SSD Network Architecture

Let's walk through the structure of the SSD network.

1. Input Image

  • Size: 300x300x3
    • The input image is 300x300 pixels with RGB channels.

2. VGG-16 Based Feature Extractor

  • Conv4_3 layer:
    • Outputs a 38x38x512 feature map.
    • Features are extracted using the VGG-16 network up through the Conv5_3 layer.

3. Extra Feature Layers

  • Conv6 (FC6):
    • Outputs a 19x19x1024 feature map.
    • A fully connected layer converted into a Convolution layer.
  • Conv7 (FC7):
    • Outputs a 19x19x1024 feature map.
    • Another fully connected layer converted into a Convolution layer.

4. Additional Convolution Layers

  • Conv8_2:
    • Outputs a 10x10x512 feature map.
    • Consists of two Convolution layers (Conv 1x1x256, Conv 3x3x512-s2).
  • Conv9_2:
    • Outputs a 5x5x256 feature map.
    • Consists of two Convolution layers (Conv 1x1x128, Conv 3x3x256-s2).
  • Conv10_2:
    • Outputs a 3x3x256 feature map.
    • Consists of two Convolution layers (Conv 1x1x128, Conv 3x3x256-s1).
  • Conv11_2:
    • Outputs a 1x1x256 feature map.
    • Consists of two Convolution layers (Conv 1x1x128, Conv 3x3x256-s1).

5. Classifier

  • A Convolution operation on each feature map predicts the object class and location.
  • Conv4_3 layer:
    • Produces an output of size 38x38x(4x(Classes+4)).
  • Conv7 layer:
    • Produces an output of size 19x19x(6x(Classes+4)).
  • Conv8_2, Conv9_2 layers:
    • Produce 6x(Classes+4) outputs per location on the 10x10 and 5x5 feature maps, respectively.
  • Conv10_2, Conv11_2 layers:
    • Produce 4x(Classes+4) outputs per location on the 3x3 and 1x1 feature maps, respectively.

6. Detections

  • 8732 per class:
    • Using anchor boxes of various sizes and aspect ratios, the network produces 8732 predicted boxes in total.
    • Each box carries class probabilities and location coordinates.

Key Components of SSD

The SSD network is built from two main components:
the Multi Scale Feature Layer and the Default (Anchor) Box.

Multi Scale Feature Layer

  • The Multi Scale Feature Layer is a core component of SSD (Single Shot MultiBox Detector). It detects objects using feature maps of different resolutions, depending on the size and location of the object.
  • These multi-resolution feature maps make it possible to detect objects of many sizes effectively, from small to large.

Default (Anchor) Box

  • In SSD, the Default Box (Anchor Box) is used to predict objects of various sizes and aspect ratios at a given location.
  • Several anchor boxes are generated at each position of every feature map, and the location and size of the object are adjusted relative to them.
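
To make the idea concrete, here is a minimal sketch that tiles a single feature map with default boxes in normalized (cx, cy, w, h) form. The function name `default_boxes`, the scale of 0.2, and the three aspect ratios are illustrative assumptions, not values taken from this post.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h), normalized to [0, 1]."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the cell in normalized image coordinates.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Every ratio keeps the same area; only the shape changes.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

# Hypothetical 3x3 feature map with 3 boxes per cell.
boxes = default_boxes(fmap_size=3, scale=0.2, aspect_ratios=[1.0, 2.0, 0.5])
print(len(boxes))  # 3 * 3 cells * 3 ratios = 27
print(boxes[0])
```

The network then predicts, for every one of these boxes, class scores and four offsets relative to the box.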

Detecting Objects of Multiple Sizes by Rescaling the Image

In the picture, Object detection is first run on the original image using a sliding window.

  • The image is then scaled down and Object detection is run again. This repeats until an Object is detected, at which point the algorithm stops.

  • In this way the image pyramid concept is used: Object Detection is performed by shrinking the image step by step, based on the Ground Truth.

Object Detection Using Feature Maps of Different Sizes

Object Detection is performed on Feature Maps of different sizes taken from the CNN architecture.

  • The first Feature Map retains the positional information of the original image. As the size shrinks, the maps turn into condensed, essential image features.
  • In short, the smaller the Feature Map, the better large Objects are detected.

Feature Map Based Multi-Scale Feature Layer

As explained above, the smaller the Feature Map, the larger the objects it can find.


Anchor Box Based Object Detection Model: Faster R-CNN

  • Faster R-CNN improves object detection performance by using a Region Proposal Network (RPN). It consists of two stages: the Region Proposal Network and Fast R-CNN.

1. Convolutional Network

  • Base feature map generation:
    • The input image is passed through a Convolutional Network to produce the base feature map.
    • This feature map is used later by the Region Proposal Network and for object classification and bounding box adjustment.

2. Region Proposal Network (RPN)

  • Region Proposal Network (RPN):
    • The RPN takes the base feature map as input and generates regions likely to contain an object (Region Proposals).
    • At each location it uses anchor boxes of several sizes and aspect ratios to predict whether an object is present.
    • The output of the RPN is a set of candidate regions (Region Proposals) for potential object locations, which are used in the subsequent Fast R-CNN stage.

Main Components of the RPN

  • Convolution Layer:
    • Takes the input feature map and applies fixed-size filters to extract features.
  • Anchor Box:
    • Generates anchor boxes of various sizes and aspect ratios at each location to predict objects.
    • Typically 3 sizes and 3 aspect ratios are used, giving 9 anchor boxes in total.
  • Objectness Score:
    • A score that predicts the probability that each anchor box contains an object.
    • It is computed by a softmax classifier with 2 classes (object, background).
  • Bounding Box Regression:
    • Predicts bounding box offsets for each anchor box to adjust its position.
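
The "3 sizes x 3 aspect ratios = 9 anchors" rule can be sketched as follows. The sizes (128, 256, 512) and ratios (0.5, 1.0, 2.0) are the values commonly quoted for Faster R-CNN and are assumed here. The helper name `rpn_anchors` is hypothetical.

```python
import math

def rpn_anchors(cx, cy, sizes=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2), all centered at (cx, cy)."""
    anchors = []
    for size in sizes:
        for ratio in ratios:
            # ratio = height / width; the area stays close to size * size.
            w = size / math.sqrt(ratio)
            h = size * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = rpn_anchors(cx=300, cy=300)
print(len(anchors))  # 3 sizes x 3 ratios = 9
```

For each of these 9 anchors the RPN outputs 2 objectness scores and 4 box offsets.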

3. ROI Pooling

  • ROI Pooling:
    • Uses the candidate regions proposed by the RPN to extract the features of each region from the input feature map.
    • In this step every candidate region is converted into a fixed-size feature map.

4. Object Classification and Bounding Box Regression

  • Fully Connected Layer:
    • Takes the fixed-size feature map produced by ROI Pooling as input and performs object classification and bounding box adjustment.
  • Object Classification:
    • Predicts the class of the object (for example person, car) for each candidate region.
    • Class probabilities are computed with softmax.
  • Bounding Box Regression:
    • Adjusts the bounding box of each candidate region to predict a more accurate object location.
    • Regression is used because the targets are continuous values (coordinates).

Summary

- Convolutional Network: generates the base feature map.
- Region Proposal Network (RPN): uses anchor boxes to predict potential object locations (Region Proposals).
- ROI Pooling: converts each Region Proposal into a fixed-size feature map.
- Object Classification and Bounding Box Regression: classifies the object and adjusts the bounding box for each candidate region.

How the RPN Uses Anchor Boxes

Classification

  • Each Anchor Box is classified as Positive or Negative depending on whether it contains an object.
  • Taking the dog's face in the center of the image as an example, an Anchor Box that contains the dog is classified as Positive, and one that does not is classified as Negative.
  • A Positive Anchor Box has a high IoU (Intersection over Union) with the object (Ground Truth Box). Here, the Anchor Boxes covering the dog's face were classified as Positive.
  • Conversely, the Anchor Boxes that do not cover the dog's face were classified as Negative.
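
The Positive/Negative decision rests on IoU, which can be sketched as below. The thresholds 0.7 and 0.3 are the ones usually cited for Faster R-CNN and are assumed here. `label_anchor` is a hypothetical helper name, and the box coordinates are made up for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_box, pos_thresh=0.7, neg_thresh=0.3):
    """Assumed thresholds; anchors in between are ignored in training."""
    overlap = iou(anchor, gt_box)
    if overlap >= pos_thresh:
        return "positive"
    if overlap < neg_thresh:
        return "negative"
    return "ignored"

gt = (100, 100, 200, 200)                       # ground-truth box
print(label_anchor((105, 105, 205, 205), gt))   # positive
print(label_anchor((300, 300, 400, 400), gt))   # negative
```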

Bounding Box Regression

  • Anchor Boxes classified as Positive are adjusted toward the exact object location through bounding box regression.
  • The position and size of a Positive Anchor Box are adjusted by comparing it with the Ground Truth Box (the box marking the actual object location).
  • The regression model predicts offsets (Δx, Δy, Δw, Δh) that adjust the center coordinates (x, y), width (w), and height (h) of the Anchor Box, producing the Predicted Anchor Box (the predicted object location).
  • The image shows a Positive Anchor Box after it has been adjusted to fit the Ground Truth Box.
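
As a sketch of how predicted offsets turn an anchor into a predicted box: the parameterization below (center shifts scaled by anchor size, width and height scaled in log space) is the one commonly used in the R-CNN family and is an assumption here. The function name and numbers are illustrative.

```python
import math

def apply_offsets(anchor, deltas):
    """anchor: (cx, cy, w, h); deltas: (dx, dy, dw, dh)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    # Center shifts are expressed relative to the anchor size.
    pred_cx = cx + dx * w
    pred_cy = cy + dy * h
    # Width and height are scaled in log space, so they stay positive.
    pred_w = w * math.exp(dw)
    pred_h = h * math.exp(dh)
    return (pred_cx, pred_cy, pred_w, pred_h)

# Hypothetical anchor and offsets: shift right and slightly up, keep the size.
print(apply_offsets((100.0, 100.0, 50.0, 80.0), (0.1, -0.05, 0.0, 0.0)))
```

With all four offsets at zero, the predicted box is just the anchor itself.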

Object Detection with Anchor Boxes

In anchor-based Object Detection, each individual Anchor Box is trained to carry the following information.

  • The class of the Object in the Feature Map region that the Anchor Box overlaps.
  • Corrected coordinates, so that the box can predict the Ground Truth location.

  • During training, each Anchor Box therefore provides the Softmax value for the Object type to detect and the corrected coordinate values.

SSD Network Configuration

Here are the feature maps of different sizes in the SSD (Single Shot MultiBox Detector) network and the Anchor Boxes generated from each.

1. 38x38 feature map

  • Size: 38x38
  • Anchor Boxes: 4 per location
  • Total Anchor Boxes: 38 x 38 x 4 = 5776

2. 19x19 feature map

  • Size: 19x19
  • Anchor Boxes: 6 per location
  • Total Anchor Boxes: 19 x 19 x 6 = 2166

3. 10x10 feature map

  • Size: 10x10
  • Anchor Boxes: 6 per location
  • Total Anchor Boxes: 10 x 10 x 6 = 600

4. 5x5 feature map

  • Size: 5x5
  • Anchor Boxes: 6 per location
  • Total Anchor Boxes: 5 x 5 x 6 = 150

5. 3x3 feature map

  • Size: 3x3
  • Anchor Boxes: 4 per location
  • Total Anchor Boxes: 3 x 3 x 4 = 36

6. 1x1 feature map

  • Size: 1x1
  • Anchor Boxes: 4 per location
  • Total Anchor Boxes: 1 x 1 x 4 = 4

Total

  • Total Anchor Boxes: 5776 + 2166 + 600 + 150 + 36 + 4 = 8732
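
The same arithmetic in a few lines of Python, using only the feature map sizes and boxes-per-location listed above (the variable names are mine):

```python
# (feature map size, default boxes per location), as listed above
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

per_map = [size * size * boxes for size, boxes in feature_maps]
print(per_map)       # [5776, 2166, 600, 150, 36, 4]
print(sum(per_map))  # 8732
```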

Object Detection and NMS (Non-Maximum Suppression)

  • Object Detection:
    • Each of the 8732 Anchor Boxes predicts an object class and location.
    • Many of these predictions can be duplicates or overlap one another.
  • NMS (Non-Maximum Suppression):
    • Removes overlapping predictions so that only the most promising object locations remain.
    • NMS selects the box with the highest confidence first, then removes the boxes that overlap it.
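
A minimal greedy NMS sketch follows. The IoU threshold of 0.45 is an assumption, as are the example boxes and scores. In practice NMS is usually run separately for each class.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.45):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box is suppressed because it overlaps the top-scoring box almost entirely. The third survives because it does not overlap at all.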

Convolution Predictors for Detection with Anchor Boxes

This section describes how the SSD (Single Shot MultiBox Detector) network performs detection and classification on the 38x38 feature map.

  • As the figure shows, a 3x3 Convolution is applied at each cell of the Feature Map. Looking at the cells one at a time, each cell is associated with 4 boxes.
  • That is, each location has 4 Anchor Boxes assigned to it, and the class and location of the object are predicted for each Box.
  • First, the Class Probability is produced.
    • With 20 object classes and 1 background class, 21 class probabilities are predicted in total.
    • Next, the Bounding Box Offset is predicted. This value is used to adjust the position of each Anchor Box.
    • The Bounding Box Offset consists of 4 offset values (X, Y, W, H).
    • X and Y are the center coordinates of the Anchor Box, and W and H are its width and height.
  • As a result, each location produces 4 Anchor Boxes x (21 class probabilities + 4 location offsets) = 100 predicted values.
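
The per-location count can be checked with a small helper (the name `head_channels` is hypothetical). It also gives the channel depth of the prediction head for layers that use 6 boxes per location.

```python
num_classes = 21      # 20 object classes + 1 background
offsets_per_box = 4   # (x, y, w, h)

def head_channels(boxes_per_location):
    """Number of values predicted at one feature map location."""
    return boxes_per_location * (num_classes + offsets_per_box)

print(head_channels(4))            # 100
print((38, 38, head_channels(4)))  # (38, 38, 100) for the 38x38 map
print(head_channels(6))            # 150 for the maps with 6 boxes
```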

Applying Multi Scale Feature Maps and Anchor Boxes in SSD

Let's use a cat as an example.

  • The box that detects the cat is called the Matching Box. It is used for classification and is trained to get closer to the Ground Truth.
  • On the 8x8 Feature Map, the Bounding Box of the cat that matched the Ground Truth is drawn.
  • Next, the 4x4 Feature Map is extracted at a lower resolution. In return, the smaller size makes it better suited to detecting larger objects.
    • The 4x4 Feature Map also shows Location (loc) and Confidence (conf), explained below.
    • Location (loc):
      • The offset values that adjust the position of each Anchor Box.
      • Written as $(c_x, c_y, w, h)$: the center coordinates, width, and height of the Anchor Box.
    • Confidence (conf):
      • The probability that each Anchor Box belongs to a particular class.
      • Written as $(c_1, c_2, \cdots, c_p)$: one probability per class.
      • Here $p$ is the number of classes.

SSD Training

  • Two things matter here: the Matching strategy and the Loss function.
  • The Matching strategy selects the Default (Anchor) Boxes whose IoU with a Ground Truth Bounding Box is 0.5 or higher, and optimizes Classification and Bounding Box Regression for those boxes.
  • The Loss function consists of the overall loss, the Classification Loss, and the Localization Loss.
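
The IoU-threshold part of the matching strategy can be sketched as follows. The helper name `match` and the example boxes are assumptions. The sketch covers only the threshold rule described above, not any additional rule a full implementation may apply.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match(default_boxes, gt_boxes, thresh=0.5):
    """Return {default_index: gt_index} for boxes treated as positive."""
    matches = {}
    for d, dbox in enumerate(default_boxes):
        overlaps = [iou(dbox, g) for g in gt_boxes]
        best_gt = max(range(len(gt_boxes)), key=lambda j: overlaps[j])
        if overlaps[best_gt] >= thresh:
            matches[d] = best_gt
    return matches

defaults = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
gts = [(1, 1, 11, 11)]
print(match(defaults, gts))  # {0: 0}
```

Only matched boxes contribute to the localization loss. The rest are treated as background.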

Overall Loss Function - Parameters

  • $N$: the number of matched Default Boxes (Anchor Boxes)
  • $L_{conf}(x, c)$: Classification Loss
  • $L_{loc}(x, l, g)$: Localization Loss
  • $\alpha$: the weight that controls the importance of the Localization Loss
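
For reference, this is how these terms combine in the SSD paper (Liu et al., 2016), as far as I can reconstruct it from the parameter list:

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```

The sum of the two losses is normalized by the number of matched boxes. When no box is matched, the loss is set to 0.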

Classification Loss

  • The Classification Loss is the loss on the predicted probability that each Anchor Box belongs to a particular class.
  • It measures the error in the class probability prediction.
  • Typically the Softmax function is used together with the Cross-Entropy loss.
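
A minimal sketch of softmax plus cross-entropy for a single Anchor Box. The logits are made-up numbers, and index 0 is assumed to be the background class.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_class])

logits = [2.0, 0.5, 0.1]  # hypothetical scores: background, class 1, class 2
print(softmax(logits))
print(cross_entropy(logits, target_class=0))
```

The loss is small when the correct class already has the highest score, and it grows as probability mass moves to the wrong classes.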

Localization Loss

  • The Localization Loss minimizes the positional difference between the predicted bounding box and the actual bounding box (Ground Truth Box).
  • Here Pos denotes the Positive Anchor Boxes (Anchor Boxes that contain an object).
  • $x_{ij}^{k}$: the matching indicator between Anchor Box $i$ and Ground Truth Box $j$ of class $k$.
  • $\mathrm{smooth}_{L1}$: the Smooth L1 loss function, commonly used in regression problems.
  • $l_i^m$: the predicted location value (center coordinates and size of the Anchor Box)
  • $\hat{g}_j^m$: the actual location value (center coordinates and size of the Ground Truth Box)
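
Put together, the symbols above form the localization loss as given in the SSD paper (Liu et al., 2016). I am reconstructing it from the symbol list, so treat it as a reference sketch:

```latex
L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \; \sum_{m \in \{cx,\, cy,\, w,\, h\}} x_{ij}^{k} \, \mathrm{smooth}_{L1}\!\left( l_i^m - \hat{g}_j^m \right)

\mathrm{smooth}_{L1}(z) =
\begin{cases}
0.5\, z^2 & \text{if } |z| < 1 \\
|z| - 0.5 & \text{otherwise}
\end{cases}
```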

Bounding Box Regression

Bounding Box Regression adjusts location offsets so that the predicted Anchor Box lines up with the actual object location.
  • In the figure, the blue box is the Default Box (Anchor Box) and the red box is the Ground Truth Box.
  • Δ(cx,cy,w,h) is the offset in position and size from the Default Box to the Ground Truth Box.
  • The Localization Loss trains the network so that its predicted offsets approach this target offset, which makes the adjusted Anchor Box coincide with the actual object location.
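
A rough sketch of the target offset and the Smooth L1 penalty on it. The encoding (center differences divided by the default box size, log ratios for width and height) follows the form commonly given for SSD and is assumed here. The boxes are made-up normalized values.

```python
import math

def encode(gt, default):
    """Target offsets from a default box to a GT box, both (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return ((g_cx - d_cx) / d_w,
            (g_cy - d_cy) / d_h,
            math.log(g_w / d_w),
            math.log(g_h / d_h))

def smooth_l1(z):
    """Quadratic near zero, linear for large errors."""
    return 0.5 * z * z if abs(z) < 1 else abs(z) - 0.5

def loc_loss(pred_offsets, gt, default):
    target = encode(gt, default)
    return sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target))

default = (0.5, 0.5, 0.2, 0.2)   # hypothetical default box
gt = (0.55, 0.5, 0.2, 0.4)       # hypothetical ground-truth box
print(encode(gt, default))
print(loc_loss((0.0, 0.0, 0.0, 0.0), gt, default))
```

If the predicted offsets equal the encoded target exactly, the loss is zero.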

Performance by Design Choice

SSD300 refers to the SSD network with a 300x300 input.

  • Data Augmentation follows the procedure below.
  • Patches are cropped from the image so that the minimum IoU with the GT Objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
  • The cropped patches are sampled at random.
  • Each sampled patch is between 0.1 and 1 of the original image size, with an aspect ratio between 1/2 and 2.
  • Each sampled patch is then resized back to 300x300, and 50% of them are flipped horizontally.
    • A horizontal flip introduces an artificial variation into the data, which is what makes it Data Augmentation.
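
The sampling and flipping steps can be sketched as below, in normalized coordinates. The function names are hypothetical, and "size" is read as a side-length fraction, which is my interpretation. The minimum-IoU check against the GT boxes is left out for brevity.

```python
import random

def sample_patch(rng, min_scale=0.1, max_scale=1.0, min_ar=0.5, max_ar=2.0):
    """Sample a crop (x, y, w, h) in normalized image coordinates."""
    while True:
        scale = rng.uniform(min_scale, max_scale)
        ar = rng.uniform(min_ar, max_ar)   # aspect ratio = w / h
        w = scale * ar ** 0.5
        h = scale / ar ** 0.5
        if w <= 1.0 and h <= 1.0:          # the patch must fit in the image
            x = rng.uniform(0.0, 1.0 - w)
            y = rng.uniform(0.0, 1.0 - h)
            return (x, y, w, h)

def hflip_box(box):
    """Mirror a normalized (x1, y1, x2, y2) box around the vertical axis."""
    x1, y1, x2, y2 = box
    return (1.0 - x2, y1, 1.0 - x1, y2)

rng = random.Random(0)
patch = sample_patch(rng)
box = (0.1, 0.2, 0.4, 0.6)
if rng.random() < 0.5:                     # flip about half of the samples
    box = hflip_box(box)
print(patch, box)
```

When the image is flipped, the box coordinates must be flipped with it, which is what `hflip_box` does.
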
One point worth noting here is why performance drops when detecting small Objects.
The reason is that the model is an anchor-based Object Detection model operating on Feature Maps.
This is also a known weakness of One-Stage Detectors. The remedy is to use a Feature Pyramid approach.

Small-Object Detection Performance Before and After Data Augmentation

Detection performance on small Objects
Improved detection performance on small Objects after Data Augmentation

  • To summarize, before Data Augmentation:
    • Detection performance was low for small objects (XS, S) and for objects with extreme aspect ratios.
    • The SSD512 model performed better than the SSD300 model, but it was still limited on small objects.
  • After Data Augmentation:
    • Detection performance on small objects and on objects with extreme aspect ratios improved considerably.
    • Both SSD300 and SSD512 detected small objects better with data augmentation.

SSD Detection Performance and Inference Time Comparison

  • To wrap up, the SSD models achieve strong detection accuracy, with SSD512 the most accurate. Small objects remain the relatively weak point.
  • Performance also improves as the training dataset becomes more varied.
  • Faster R-CNN is highly accurate but slow.
  • YOLO is very fast but comparatively less accurate.
  • The SSD models strike a good balance between speed and accuracy. SSD300 in particular does well on both.