A A
[CV] SPPNet - Spatial Pyramid Pooling Net

SPPNet - Spatial Pyramid Pooling Net

์ด๋ฒˆ์—๋Š” SPPNet - Spatial Pyramid Pooling Net์— ๋ฐํ•˜์—ฌ ์•Œ์•„๋ณด๊ณ  ์™œ SPPNet์ด ๋“ฑ์žฅํ–ˆ๋Š”์ง€ ํ•œ๋ฒˆ ์•Œ์•„ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.


 

RCNN ์ฃผ์š” ๋ฌธ์ œ์ 

  • ๊ทธ์ „์— RCNN์˜ ์ฃผ์š” ๋ฌธ์ œ์ ์— ๋ฐํ•˜์—ฌ ํ•œ๋ฒˆ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ์ผ๋‹จ, CNN์€ 2,000๊ฐœ์˜ Region ์˜์—ญ ์ด๋ฏธ์ง€๊ฐ€ CNN์œผ๋กœ ์ž…๋ ฅ ๋˜๋ฉด์„œ Object Detection ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค.
  • ์ด์œ ๋Š” ์œ„์˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด 2,000๊ฐœ์˜ Region ์˜์—ญ์ด Proposal ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ทธ๋Ÿฌ๋ฉด Feature Map์ด 2,000๊ฐœ๊ฐ€ ๋งŒ๋“ค์–ด ์ €์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด์„œ ๋จพ์€ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•˜๋ฉฐ ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์ด ๊ธธ์–ด์ง‘๋‹ˆ๋‹ค.
  • ๋˜ํ•œ Region ์˜์—ญ ์ด๋ฏธ์ง€๊ฐ€ ๊ณ ์ •๋œ ํฌ๊ธฐ๋กœ Crop / Warp ํ•˜์—ฌ CNN์— ์ž…๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
    • ์ด ๊ณผ์ •๋„ ์ถ”๊ฐ€์ ์ธ ์—ฐ์‚ฐ ๋ฐ Memory ์‚ฌ์šฉ๋Ÿ‰์„ ํ•„์š”๋กœ ํ•ฉ๋‹ˆ๋‹ค.

 

RCNN ๊ฐœ์„  ๋ฐฉ์•ˆ

๊ทธ๋Ÿฌ๋ฉด RCNN์„ ๊ฐœ์„  ํ• ์ˆ˜ ์žˆ๋Š” ๋ฐฉ์•ˆ์ด ์žˆ์„๊นŒ์š”?

  • 2000๊ฐœ์˜ Region Proposal ์ด๋ฏธ์ง€๋ฅผ CNN ์œผ๋กœ Feature Extraction ํ•˜์ง€ ์•Š๊ณ  ์›๋ณธ ์ด๋ฏธ์ง€ ์œผ๋กœ๋งŒ CNN์œผ๋กœ Feature Map์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ ๋’ค์— ์›๋ณธ ์ด๋ฏธ์ง€์˜ Selective search ๋กœ ์ถ”์ฒœ๋œ ์˜์—ญ์˜ ์ด๋ฏธ์ง€๋งŒ Feature Map์œผ๋กœ ๋งคํ•‘ํ•˜์—ฌ ๋ณ„๋„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ ‡์ง€๋งŒ Selective Search ์—์„œ๋Š” ์ด๋ฏธ์ง€ ํฌ๊ธฐ๊ฐ€ ๋‹ฌ๋ผ์„œ ์ด๋ฏธ์ง€๊ฐ€ ์ฐŒ๊ทธ๋Ÿฌ์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ Feature Map์„ 1์ฐจ์›์œผ๋กœ Flatten(ํ‰ํƒ„ํ™”)ํ•˜๋Š” ๊ฒฝ์šฐ, ์ด๋ฏธ์ง€๊ฐ€ ์ฐŒ๊ทธ๋Ÿฌ์ง€๋Š” ๋ฌธ์ œ์˜ ์›์ธ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

  • ๋˜ํ•œ CNN ์€ ์„œ๋กœ ๋‹ค๋ฅธ ์‚ฌ์ด์ฆˆ์˜ Image๋ฅผ ์ˆ˜์šฉํ•˜์ง€ ์•Š๋Š”๋ฐ, ๊ฐ€์žฅ ํฐ ์ด์œ ๋Š” Flatten Fully Connection Input์˜ ํฌ๊ธฐ๊ฐ€ ๊ณ ์ •์ด ๋˜์–ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฌ๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

 

์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ์˜ Region Proposal ์ด๋ฏธ์ง€ ๊ฐœ์„  ๋ฐฉ์•ˆ

RCNN์˜ ๋ฌธ์ œ๋ฅผ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ SPPNet์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

  • Feature ๋งต์œผ๋กœ ํˆฌ์˜๋œ ์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง„ Region Proposal ์ด๋ฏธ์ง€๋ฅผ SPPNet์˜ ๊ณ ์ •๋œ ํฌ๊ธฐ Vector๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ FC์— 1D Flattened ๋œ input์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
์—ฌ๊ธฐ์„œ SPPNet์˜ ์—ญํ• ์€ Region Proposal ์ด๋ฏธ์ง€๋ฅผ ๊ณ ์ •๋œ Vector size๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ณ ์ • ๋ ˆ์ด์–ด์— ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

 

SPP (Spatial Pyramid Pooling)

SPP๋Š” CNN์ƒ์—์„œ Image classification์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์ด๋ฏธ์ง€์˜ ํฌ๊ธฐ๋ฅผ ๊ณ ์ •๋œ ํฌ๊ธฐ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

  • SPP๋Š” ์˜ค๋ž˜์ „๋ถ€ํ„ฐ ์ปดํ“จํ„ฐ ๋น„์ „ ์˜์—ญ์—์„œ ํ™œ์šฉ๋œ Spatial Pyramid Matching ๊ธฐ๋ฒ•์— ๊ทผ๊ฐ„์„ ๋‘์—ˆ์Šต๋‹ˆ๋‹ค.
  • Spatial Pyramid Matching์€ ์„œ๋กœ ๋‹ค๋ฅธ Feature Map์— ํˆฌ์˜ํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ Selective Search๋œ Feature Map์€ ๊ณ ์ •๋œ ์‚ฌ์ด์ฆˆ์˜ Vector Value(๋ฒกํ„ฐ๊ฐ’)์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
Bag of Visual Words -> Spatial Pyramid Matching

 

BoVW (Bag of Visual Words)

Bag of Visual Words (BoVW)๋Š” ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜์™€ ๊ฐ™์€ ์ปดํ“จํ„ฐ ๋น„์ „ ์ž‘์—…์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ํŠน์ง• ์ถ”์ถœ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

  • BoVW๋Š” ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ์—์„œ ์‚ฌ์šฉํ•˜๋Š” Bag of Words (BoW) ๋ชจ๋ธ์„ ์ด๋ฏธ์ง€ ๋ฐ์ดํ„ฐ์— ์ ์šฉํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ด ๋ฐฉ๋ฒ•์€ ์ด๋ฏธ์ง€๋ฅผ ์‹œ๊ฐ์ ์ธ ๋‹จ์–ด(visual words)์˜ ์ง‘ํ•ฉ์œผ๋กœ ํ‘œํ˜„ํ•˜์—ฌ, ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜, ๊ฒ€์ƒ‰, ์ธ์‹ ๋“ฑ์˜ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ์ฃผ์š” ๊ฐœ๋…์€ ํฌ๊ฒŒ ์•„๋ž˜ 4๊ฐœ๋กœ ์ด๋ฃจ์–ด ์ง‘๋‹ˆ๋‹ค.
  • ์‹œ๊ฐ์  ๋‹จ์–ด(Visual Words)
    • ์ด๋ฏธ์ง€์˜ Local feature๋ฅผ ์ถ”์ถœํ•˜๊ณ , ์ด๋ฅผ ํด๋Ÿฌ์Šคํ„ฐ๋งํ•˜์—ฌ ์‹œ๊ฐ์  ๋‹จ์–ด๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
    • ์‹œ๊ฐ์  ๋‹จ์–ด๋Š” ์ด๋ฏธ์ง€์˜ ํŒจ์น˜(patch)๋‚˜ ํ‚คํฌ์ธํŠธ(keypoint)์˜ ํŠน์ง•์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ํŠน์ง• ์ถ”์ถœ (Feature Extraction)
    • ์ผ๋ฐ˜์ ์œผ๋กœ SIFT, SURF, ORB์™€ ๊ฐ™์€ ํŠน์ง• ์ถ”์ถœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋ฏธ์ง€์—์„œ ํ‚คํฌ์ธํŠธ๋ฅผ ๊ฒ€์ถœํ•˜๊ณ , ๊ฐ ํ‚คํฌ์ธํŠธ์˜ ํŠน์ง• ๋ฒกํ„ฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • ํด๋Ÿฌ์Šคํ„ฐ๋ง (Clustering)
    • ์ถ”์ถœ๋œ ํŠน์ง• ๋ฒกํ„ฐ๋“ค์„ ํด๋Ÿฌ์Šคํ„ฐ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ (์˜ˆ: K-means)์„ ์‚ฌ์šฉํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ๋งํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์˜ ์ค‘์‹ฌ์ด ์‹œ๊ฐ์  ๋‹จ์–ด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  • ํžˆ์Šคํ† ๊ทธ๋žจ ์ƒ์„ฑ (Histogram Generation)
    • ๊ฐ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด, ์ด๋ฏธ์ง€ ๋‚ด์˜ ํŠน์ง• ๋ฒกํ„ฐ๋“ค์ด ์‹œ๊ฐ์  ๋‹จ์–ด๋กœ ๋งคํ•‘๋ฉ๋‹ˆ๋‹ค.
    • ๋งคํ•‘๋œ ์‹œ๊ฐ์  ๋‹จ์–ด์˜ ๋นˆ๋„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • ์ด ํžˆ์Šคํ† ๊ทธ๋žจ์€ ํ•ด๋‹น ์ด๋ฏธ์ง€์˜ ํŠน์ง• ๋ฒกํ„ฐ๋กœ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

BoVW (Bag of Visual Words)์˜ ๋ฌธ์ œ์ 

BoVW(Bag of Visual Words)์˜ ๋ฌธ์ œ์ ์— ๋ฐํ•˜์—ฌ ๋งํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

  • BoVW(Bag of Visual Word)๋Š” ๋‹ค์–‘ํ•œ Feature๋“ค์ด ์„ž์—ฌ๋ฒ„๋ฆฌ๋ฉด ์ •ํ™•ํ•œ Image Classification, Detection์— ๋ฌธ์ œ๊ฐ€ ๋ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋˜ํ•œ Local Feature๋ฅผ ์‹œ๊ฐ์  ๋‹จ์–ด๋กœ ๋งคํ•‘ํ•˜๋Š” ๊ณผ์ •์—์„œ ์ •๋ณด ์†์‹ค์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๊ณ  ํŠน์ง• ์ถ”์ถœ ๋ฐ ํด๋Ÿฌ์Šคํ„ฐ๋ง ๊ณผ์ •์—์„œ ๋†’์€ ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ๋ฌธ์ œ์ ๋“ค์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์‚ฌ์šฉํ•˜๋Š” ๊ธฐ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ”๋กœ SPM(Spatial Pyramid Matching)์ž…๋‹ˆ๋‹ค.

 

SPM (Spatial Pyramid Matching)

Spatial Pyramid Matching (SPM)์€ ์ด๋ฏธ์ง€์˜ ๊ณต๊ฐ„์  ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ํŠน์ง•์„ ์ถ”์ถœํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
SPM์€ ์ด๋ฏธ์ง€๋ฅผ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋กœ ๋ถ„ํ• ํ•˜๊ณ , ๊ฐ ๋‹จ๊ณ„์—์„œ ์ง€์—ญ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ณ„์‚ฐํ•˜์—ฌ ์ตœ์ข…์ ์œผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

  • ์ฃผ์š”ํ•œ ๊ฐœ๋…์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.
  • ์ด๋ฏธ์ง€ ๋ถ„ํ• 
    • ์ด๋ฏธ์ง€๋ฅผ ์—ฌ๋Ÿฌ ๋ ˆ๋ฒจ๋กœ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋ ˆ๋ฒจ์—์„œ ์ด๋ฏธ์ง€๋Š” ๋” ์ž‘์€ ๊ทธ๋ฆฌ๋“œ๋กœ ๋‚˜๋ˆ„์–ด์ง‘๋‹ˆ๋‹ค.
    • ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ ˆ๋ฒจ 0์—์„œ๋Š” ์ด๋ฏธ์ง€๋ฅผ ์ „์ฒด๋กœ ์‚ฌ์šฉํ•˜๊ณ , ๋ ˆ๋ฒจ 1์—์„œ๋Š” ์ด๋ฏธ์ง€๋ฅผ 2x2 ๊ทธ๋ฆฌ๋“œ๋กœ ๋‚˜๋ˆ„๊ณ , ๋ ˆ๋ฒจ 2์—์„œ๋Š” 4x4 ๊ทธ๋ฆฌ๋“œ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.
  • ์ง€์—ญ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ณ„์‚ฐ
    • ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์—์„œ ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ณ , Visual Words(์‹œ๊ฐ์  ๋‹จ์–ด)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ ํžˆ์Šคํ† ๊ทธ๋žจ์€ ํ•ด๋‹น ๊ทธ๋ฆฌ๋“œ ์…€ ๋‚ด์˜ ์‹œ๊ฐ์  ๋‹จ์–ด์˜ ๋นˆ๋„๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ํžˆ์Šคํ† ๊ทธ๋žจ ๊ฒฐํ•ฉ
    • ๋ชจ๋“  ๋ ˆ๋ฒจ์—์„œ ๊ณ„์‚ฐ๋œ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์ตœ์ข… ์ด๋ฏธ์ง€ ํ‘œํ˜„์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.
    • ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ ๋ ˆ๋ฒจ์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ Weight(๊ฐ€์ค‘์น˜)๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค.
    • ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ ˆ๋ฒจ 0์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์— ๋†’์€ Weight(๊ฐ€์ค‘์น˜)๋ฅผ, ๋ ˆ๋ฒจ 2์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์— ๋‚ฎ์€ Weight(๊ฐ€์ค‘์น˜)๋ฅผ ๋ถ€์—ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

SPM(Spatial Pyramid Matching) ๋‹จ๊ณ„๋ณ„ ์„ค๋ช…

  • Level 0
    • ์ด๋ฏธ์ง€๋ฅผ ์ „์ฒด๋กœ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ๊ฐ์  ๋‹จ์–ด์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    • ํ•˜๋‚˜์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์œผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • Level 1
    • ์ด๋ฏธ์ง€๋ฅผ 2x2 ๊ทธ๋ฆฌ๋“œ(4๋ถ„๋ฉด)๋กœ ๋‚˜๋ˆ„๊ณ , ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์—์„œ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    • ๋„ค ๊ฐœ์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์œผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.
  • Level 2
    • ์ด๋ฏธ์ง€๋ฅผ 4x4 ๊ทธ๋ฆฌ๋“œ(4 by 4)๋กœ ๋‚˜๋ˆ„๊ณ , ๊ฐ ๊ทธ๋ฆฌ๋“œ ์…€์—์„œ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
    • ์—ด์—ฌ์„ฏ ๊ฐœ์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์œผ๋กœ ์ด๋ฏธ์ง€๋ฅผ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

 

SPM์„ ์ด์šฉํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ํฌ๊ธฐ์˜ Feature Map์„ ๊ท ์ผํ•œ Vector ํฌ๊ธฐ๋กœ ํ‘œํ˜„

  • SPM(Spatial Pyramid Matching)์˜ ์ˆœ์„œ๋Œ€๋กœ (level 0,1,2)๋ฅผ ์ฐจ๋ ˆ๋Œ€๋กœ ์ ์šฉ์„ ์‹œํ‚ค๋ฉด, 3 + 12 + 48 = 63๊ฐœ ์›์†Œ์˜ vector ๊ฐ’์œผ๋กœ ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

 

SPP (Spatial Pyramid Pooling)

Spatial Pyramid Pooling (SPP)๋Š” CNN์—์„œ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์ž…๋ ฅ ์ด๋ฏธ์ง€๋ฅผ ๊ณ ์ •๋œ ํฌ๊ธฐ์˜ ์ถœ๋ ฅ์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ํ”ผ์ฒ˜ ๋งต์„ ์ƒ์„ฑํ•˜๋Š” ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค.

  • ํ”ผ์ฒ˜ ๋งต(Future Map)
    • CNN์„ ํ†ตํ•ด ์–ป์–ด์ง„ ์ด๋ฏธ์ง€์˜ Feature Map ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ Feature Map ์˜ ํฌ๊ธฐ๋Š” 8x8๊ณผ 8x12๋กœ ์ฃผ์–ด์ง‘๋‹ˆ๋‹ค.
  • ํ”ผ๋ผ๋ฏธ๋“œ ๋ ˆ๋ฒจ(Pyramid Levels)
    • ํ”ผ๋ผ๋ฏธ๋“œ์˜ ๊ฐ ๋ ˆ๋ฒจ์—์„œ ํ”ผ์ฒ˜ ๋งต์„ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ทธ๋ฆฌ๋“œ๋กœ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.
    • ๋ ˆ๋ฒจ 0: ํ”ผ์ฒ˜ ๋งต์„ ํ•˜๋‚˜์˜ ๊ทธ๋ฆฌ๋“œ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ๋ ˆ๋ฒจ 1: ํ”ผ์ฒ˜ ๋งต์„ 2x2 ๊ทธ๋ฆฌ๋“œ๋กœ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.
    • ๋ ˆ๋ฒจ 2: ํ”ผ์ฒ˜ ๋งต์„ 4x4 ๊ทธ๋ฆฌ๋“œ๋กœ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.
  • ๋งฅ์Šค ํ’€๋ง(Max Pooling)
    • ๊ฐ Grid Cell ์—์„œ ์ตœ๋Œ€ ๊ฐ’์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ณต๊ฐ„์  ํฌ๊ธฐ๊ฐ€ ๋‹ค๋ฅธ ์ž…๋ ฅ ์ด๋ฏธ์ง€์—์„œ๋„ ๋™์ผํ•œ ํฌ๊ธฐ์˜ Feature Vector ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋ฆผ์„ ๋ณด๋ฉด Pyramid Level์—์„œ Grid ๋ถ„ํ• ์„ ํ•˜๊ณ , Max Pooling์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋งŒ ํŠน์ง•์€ ์—ฌ๊ธฐ์„œ ์ œ์ผ ํฐ Feature๋งŒ ๊ฐ€์ง€๊ณ  ์˜จ๋‹ค๋Š” ํŠน์ง•์ด ์žˆ์Šต๋‹ˆ๋‹ค.
๊ทธ๋Ÿฌ๋ฉด 1 + 4 + 16 = 21๊ฐœ ๋ฒกํ„ฐ ๊ฐ’์œผ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค.

 

 

SPP-Net Image Classification

Spatial Pyramid Pooling Network (SPP-Net)๋Š” ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ์ž…๋ ฅ ์ด๋ฏธ์ง€์—์„œ ๊ณ ์ •๋œ ํฌ๊ธฐ์˜ Feature Vector๋ฅผ ์ƒ์„ฑํ•˜์—ฌ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜์™€ ๊ฐ™์€ CV(Computer Vision)์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

  • SPP-Net์€ ๊ธฐ์กด์˜ CNN ๊ตฌ์กฐ์— SPP Layer๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ Input Image๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ Input Image(์ž…๋ ฅ ์ด๋ฏธ์ง€)๋Š” Convolution Layer์„ ํ†ตํ•ด ์ฒ˜๋ฆฌ๋˜๊ณ , ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ Feature Map์ด ์ƒ์„ฑ๋ฉ๋‹ˆ๋‹ค.
  • SPP Layer๋Š” ๋‹ค์–‘ํ•œ ํ”ผ๋ผ๋ฏธ๋“œ ๋ ˆ๋ฒจ์—์„œ Feature Map์„ ๋‚˜๋ˆ„๊ณ , ๊ฐ Grid Cell์—์„œ Max Pooling์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ณ ์ •๋œ ๊ธธ์ด์˜ Feature Vector๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • Fully-Connected Layer(FC)์€ SPP Layer์˜ ์ถœ๋ ฅ์„ ๋ฐ›์•„ ์ตœ์ข… Classification(๋ถ„๋ฅ˜)๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
  • ๋˜ํ•œ SPP-Net์€ 2014 ILSVRC(ImageNet Large Scale Visual Recognition Challenge)์—์„œ GoogLeNet๊ณผ VGG์— ์ด์–ด ์„ธ ๋ฒˆ์งธ๋กœ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

 

SPP-Net Object Detection

  • SPM ๊ธฐ๋ฒ•์ด ์–ด๋–ป๊ฒŒ ์ง„ํ–‰๋˜๋Š”์ง€๋Š” ์•ž์—์„œ ์„ค๋ช…ํ–ˆ์œผ๋‹ˆ๊นŒ ์ƒ๋žตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ๋‹ค๋งŒ, Convolution Layer์˜ Window์—์„œ ์ด๋ฏธ์ง€๋ฅผ ํƒ์ƒ‰ํ•˜๋Š” ๋ฐฉ์‹์€ Region Proposal ๋ฐฉ์‹์„ ์ด์šฉํ•˜๋Š”๊ฒƒ๋งŒ ์–˜๊ธฐํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ฐธ๊ณ ๊ธ€๋กœ CNN์— ๊ด€ํ•œ ๊ธ€ ๋งํฌ ๋‹ฌ์•„๋†“์„๊ป˜์š”!
 

[CV] Object Detection ๋ฐฉ์‹ & ์„ฑ๋Šฅ ํ‰๊ฐ€

์ด๋ฒˆ ๊ธ€์—์„œ๋Š” ํ•œ๋ฒˆ Object Detection์˜ ๋ฐฉ์‹๋“ค์ด ์–ด๋–ค๊ฒƒ์ด ์žˆ๋Š”์ง€ ํ•œ๋ฒˆ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.Sliding Window ๋ฐฉ์‹Sliding Window ๋ฐฉ์‹์€ Window๋ฅผ ์™ผ์ชฝ ์ƒ๋‹จ๋ถ€ํ„ฐ ์˜ค๋ฅธ์ชฝ ํ•˜๋‹จ์œผ๋กœ ์ด๋™์‹œํ‚ค๋ฉด์„œ Object๋ฅผ Detection ํ•˜๋Š”

daehyun-bigbread.tistory.com

  • ๊ทธ๋ฆฌ๊ณ  Feature Map์—์„œ Selective Search ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ด์šฉํ•ด์„œ Image๋ฅผ Detect ํ•˜๋ฉด ๊ฐ๊ฐ Feature Map์— ํˆฌ์˜๋œ ์‚ฌ์ด์ฆˆ๊ฐ€ ๋‹ค๋ฅด๋‹ค๋Š”๊ฑธ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์—ฌ๊ธฐ์„œ ์ด์ œ SPP Layer๋ฅผ ํ†ตํ•ด Vectorํ™”๋ฅผ ํ•ด์ค€ ํ›„, ๊ณ ์ •๋œ Size๋กœ 2,000๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ด๊ฟ”์ค๋‹ˆ๋‹ค.

 

SPP-Net ๊ตฌ์กฐ

ํ•œ๋ฒˆ SPP-Net ๊ตฌ์กฐ์™€ ๊ฐ ๊ตฌ์„ฑ์š”์†Œ๊ฐ€ ํ•˜๋Š” ์—ญํ• ์— ๋ฐํ•˜์—ฌ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
SPP Layer๋ฅผ ์ด์šฉํ•ด์„œ 2,000๋ฒˆ ๋ถ€ํ•˜๋ฅผ ์ค„์ธ๊ฒŒ SPP-Net ์ž…๋‹ˆ๋‹ค.

1. ์ž…๋ ฅ ์ด๋ฏธ์ง€ (Input Image)

  • ๊ฐ์ฒด ํƒ์ง€๋ฅผ ์œ„ํ•ด ์›๋ณธ ์ด๋ฏธ์ง€๊ฐ€ ์ž…๋ ฅ๋ฉ๋‹ˆ๋‹ค.

2. ์˜์—ญ ์ œ์•ˆ (Region Proposal)

  • Selective Search: ์ž…๋ ฅ ์ด๋ฏธ์ง€์—์„œ ์ž ์žฌ์ ์ธ ๊ฐ์ฒด๊ฐ€ ์žˆ์„ ๋งŒํ•œ ์˜์—ญ์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์—์„œ ์ˆ˜์ฒœ ๊ฐœ์˜ Region Proposal์ด ์ƒ์„ฑ๋ฉ๋‹ˆ๋‹ค.

3. ํŠน์ง• ์ถ”์ถœ๊ธฐ (Feature Extractor)

  • VGG ๋„คํŠธ์›Œํฌ: VGG์™€ ๊ฐ™์€ ์‚ฌ์ „ ํ•™์Šต๋œ CNN์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ Region Proposal์—์„œ ํŠน์ง•์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค. ์ด ๋„คํŠธ์›Œํฌ๋Š” ImageNet ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์‚ฌ์ „ ํ•™์Šต๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

4. ํ”ผ์ฒ˜ ๋งต (Feature Map)

  • CNN์„ ํ†ต๊ณผํ•œ ๊ฐ Region Proposal์€ ํ”ผ์ฒ˜ ๋งต์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.

5. SPP ๋ ˆ์ด์–ด (SPP Layer)

  • Spatial Pyramid Pooling (SPP): SPP ๋ ˆ์ด์–ด๋Š” ํ”ผ๋ผ๋ฏธ๋“œ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ”ผ์ฒ˜ ๋งต์„ ๊ณ ์ •๋œ ๊ธธ์ด์˜ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
  • ์—ฌ๋Ÿฌ ๋ ˆ๋ฒจ๋กœ ๋‚˜๋‰˜์–ด ํ”ผ์ฒ˜ ๋งต์„ ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ๊ทธ๋ฆฌ๋“œ๋กœ ๋ถ„ํ• ํ•˜๊ณ , ๊ฐ ๊ทธ๋ฆฌ๋“œ์—์„œ ๋งฅ์Šค ํ’€๋ง์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ณ ์ •๋œ ๊ธธ์ด์˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

6. ์™„์ „ ์—ฐ๊ฒฐ ์ธต (Fully-Connected Layer)

  • SPP ๋ ˆ์ด์–ด์—์„œ ์ƒ์„ฑ๋œ ๊ณ ์ •๋œ ๊ธธ์ด์˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐ๋Š” ์™„์ „ ์—ฐ๊ฒฐ ์ธต์— ์ž…๋ ฅ๋ฉ๋‹ˆ๋‹ค.

7. SVM ๋ถ„๋ฅ˜๊ธฐ (SVM Classifier)

  • ์™„์ „ ์—ฐ๊ฒฐ ์ธต์˜ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ Region Proposal์˜ ๊ฐ์ฒด ํด๋ž˜์Šค๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ์˜ˆ์ธก๋œ ํด๋ž˜์Šค ํ™•๋ฅ  ์˜ˆ์‹œ:
    • Obj1: 0.8
    • Obj2: 0.1
    • Obj3: 0.1

8. ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค ํšŒ๊ท€ (Bounding Box Regression)

  • ๊ฐ Region Proposal์˜ ์ •ํ™•ํ•œ ์œ„์น˜(๋ฐ”์šด๋”ฉ ๋ฐ•์Šค ์ขŒํ‘œ)๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ๋ฐ”์šด๋”ฉ ๋ฐ•์Šค ํšŒ๊ท€๋Š” (x1, y1, x2, y2) ์ขŒํ‘œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ์ฒด์˜ ์œ„์น˜๋ฅผ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

 

R-CNN๊ณผ SPP-Net ๋น„๊ต

  • R-CNN๊ณผ SPP-Net์˜ ์ฐจ์ด์ ์— ๋ฐํ•˜์—ฌ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
R-CNN์€ ์ด๋ฏธ์ง€ ํ•œ ๊ฐœ์— 2000๋ฒˆ CNN์„ ํ†ต๊ณผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ๋งค์šฐ ๋Š˜์–ด๋‚ฉ๋‹ˆ๋‹ค.
๋‹ค๋งŒ SPP-Net์€ ์ด๋ฏธ์ง€ ํ•œ ๊ฐœ๋‹น ํ•œ๋ฒˆ๋งŒ CNN์„ ํ†ต๊ณผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ์ˆ˜ํ–‰์‹œ๊ฐ„์ด ๋งค์šฐ ๋‹จ์ถ•๋˜๋Š” ํŠน์ง•์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  • SPP-Net์€ R-CNN์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•˜์—ฌ ํšจ์œจ์„ฑ์„ ๋†’์ธ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
  • R-CNN์€ ๊ฐ Region Proposal์— ๋Œ€ํ•ด CNN์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ์ ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ๊ณ  ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค.
  • SPP-Net์€ ์ „์ฒด ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด ํ•œ ๋ฒˆ์— CNN์„ ์ ์šฉํ•˜๊ณ , SPP ๋ ˆ์ด์–ด๋ฅผ ํ†ตํ•ด ๊ณ ์ •๋œ ๊ธธ์ด์˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์—ฌ ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์„ ๋‹จ์ถ•ํ•˜๊ณ  ์—ฐ์‚ฐ ํšจ์œจ์„ฑ์„ ๋†’์˜€์Šต๋‹ˆ๋‹ค.
  • ์•„๋ž˜์˜ ํ‘œ๋Š” R-CNN๊ณผ SPP-Net๋ฅผ ๋น„๊ตํ•œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.
Region Proposal ๋ฐฉ์‹ Selective Search Selective Search
CNN ์ ์šฉ ๋ฐฉ์‹ ๊ฐ Region Proposal์— ๊ฐœ๋ณ„ ์ ์šฉ ์ „์ฒด ์ด๋ฏธ์ง€์— ํ•œ ๋ฒˆ ์ ์šฉ
ํ”ผ์ฒ˜ ๋งต ์ƒ์„ฑ ๊ฐ Region Proposal์— ๋Œ€ํ•ด ๋ณ„๋„ ํ”ผ์ฒ˜ ๋งต ์ƒ์„ฑ ๊ณตํ†ต ํ”ผ์ฒ˜ ๋งต ์ƒ์„ฑ ํ›„ SPP ์ ์šฉ
Pooling ๋ฐฉ์‹ ๊ณ ์ • ํฌ๊ธฐ๋กœ ์กฐ์ •(Crop/Warp) ํ›„ ํ’€๋ง SPP ๋ ˆ์ด์–ด๋ฅผ ํ†ตํ•ด ๋‹ค์–‘ํ•œ ํฌ๊ธฐ์˜ ํ”ผ๋ผ๋ฏธ๋“œ ํ’€๋ง
๋ถ„๋ฅ˜๊ธฐ SVM SVM
Bounding Box Regression ์žˆ์Œ ์žˆ์Œ
์—ฐ์‚ฐ ํšจ์œจ์„ฑ ๋‚ฎ์Œ ๋†’์Œ
์ฒ˜๋ฆฌ ์‹œ๊ฐ„ ์˜ค๋ž˜ ๊ฑธ๋ฆผ ๋น ๋ฆ„
์ž…๋ ฅ ํฌ๊ธฐ ์œ ์—ฐ์„ฑ ์ œํ•œ์  ๋†’์Œ