[Paper Review] VGGNet Review
I kept telling myself I should get back to reading papers, so I finally worked up the courage to read one, and here is my summary.

VGGNet Paper (2014)

VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION.
๋…ผ๋ฌธ ์‚ฌ์ดํŠธ ๋งํฌ๋Š” ์•„๋ž˜์— ๋‚จ๊ฒจ๋†“๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ํ•œ๋ฒˆ ์ฐจ๊ทผ์ฐจ๊ทผ ๋ฆฌ๋ทฐํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.
 

Very Deep Convolutional Networks for Large-Scale Image Recognition (arxiv.org)


Abstract

VGGNet์€ ILSVRC 2014 ๋Œ€ํšŒ์—์„œ 2๋“ฑ์„ ์ฐจ์ง€ํ•œ CNN ๋ชจ๋ธ๋กœ Network์˜ ๊นŠ์ด์— ๋”ฐ๋ผ ๋ชจ๋ธ์ด ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์คฌ์Šต๋‹ˆ๋‹ค. VGGNet์€ 3x3 convolution filter๋กœ 16-19๊ฐœ์˜ Weight Layer๋กœ ์ฆ๊ฐ€ ํ•จ์œผ๋กœ์„œ ์ƒ๋‹นํ•œ ๊ฐœ์„ ์„ ์ด๋ฃฐ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์คฌ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ 2014 ImageNet Challenge์—์„œ ๋กœ์ปฌ๋ผ์ด์ œ์ด์…˜ ๋ฐ ๋ถ„๋ฅ˜ ํŠธ๋ž™์—์„œ ๊ฐ๊ฐ 1์œ„์™€ 2์œ„๋ฅผ ์ฐจ์ง€ํ–ˆ์œผ๋ฉฐ, ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์—๋„ ์ผ๋ฐ˜ํ™”๋œ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ ๊ฐ„๊ฒฐํ•˜๊ณ  depth(๊นŠ์ด)๋ฅผ ๊นŠ๊ฒŒ ์Œ“์œผ๋ฏ€๋กœ์„œ, Computer Vision ์—ฐ๊ตฌ ๋ถ„์•ผ์ชฝ์—์„œ๋„ ์–ด๋Š์ •๋„ ์˜๋ฏธ๊ฐ€ ์žˆ๋Š” ๋…ผ๋ฌธ์ž…๋‹ˆ๋‹ค.

 


Introduction

Convolutional networks (ConvNets) have shown outstanding performance on large-scale image and video datasets. This was made possible by large public image repositories, GPUs, and large-scale distributed clusters and high-performance computing systems.

In particular, an important driver of this progress in deep visual recognition architectures has been the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for several generations of large-scale image classification systems, from shallow feature encodings to deep ConvNets.

 

๋˜ํ•œ ConvNet์ด Computer Vision ๋ถ„์•ผ์—์„œ ๋”์šฑ ๋ณดํŽธํ™”๋จ์— ๋”ฐ๋ผ ๊ธฐ์กด์˜ Model์˜ ๊ธฐ์กด์˜ Architecture๋ฅผ ๊ฐœ์„ ํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ์˜ฌ๋ฆฌ๊ธฐ์œ„ํ•œ ์—ฌ๋Ÿฌ ์‹œ๋„๊ฐ€ ์ด๋ฃจ์–ด ์กŒ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ILSVRC-2013์—์„œ ์ œ์ผ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ ๋ชจ๋ธ์—์„  (Zeiler & Fergus, 2013; Sermanet et al., 2014) ์ฒซ๋ฒˆ์งธ Convolution Layer์—์„œ window size, stride๋ฅผ ์ž‘๊ฒŒ ์‚ฌ์šฉํ•œ๊ฒƒ์ฒ˜๋Ÿผ ๋ง์ž…๋‹ˆ๋‹ค.

 

๊ทผ๋ฐ, ์•Œ์•„์•ผ ํ•˜๋Š”๊ฑด ๋˜๋‹ค๋ฅธ ๊ฐœ์„ ์ ์€, ์ „์ฒด ์ด๋ฏธ์ง€ & ๋‹ค๋ฅธ ์Šค์ผ€์ผ์— ๋ฐํ•˜์—ฌ ๋„คํŠธ์›Œํฌ๋ฅผ ๋ฐ€์ง‘ํ•˜๊ฒŒ ํ›ˆ๋ จ & ํ…Œ์ŠคํŠธ๋ฅผ ํ–ˆ๋‹ค๋Š”๊ฒƒ ์ด๋ž‘, ConvNet Architecture์˜ ๊นŠ์ด์— ๋‹ค๋ฃฌ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ Architecture์˜ ๋‹ค๋ฅธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐ์ •์‹œํ‚ค๊ณ , ์ ์ง„์ ์œผ๋กœ ๋„คํŠธ์›Œํฌ์˜ ๊นŠ์ด๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ณ , ๋ชจ๋“  Convolution Layer์— ์ž‘์€ 3x3 Convolution filter๋ฅผ ์ ์šฉํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

 

๊ฒฐ๊ณผ์ ์œผ๋กœ, ๋ฐœ์ „๋œ ConvNet Architecture๋ฅผ ๋งŒ๋“ค์—ˆ๊ณ , ILSVRC Classifcation & Localization task ์—์„œ, ๋†’์€ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ์ƒ๋Œ€์ ์œผ๋กœ ๊ฐ„๋‹จํ•œ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ๋„ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ์ข‹์•˜์Šต๋‹ˆ๋‹ค.

 


ConvNet Configurations

Architecture

 

Input๋œ Image๋ฅผ ๊ฐ€์ง€๊ณ  Model์„ Training์„ ํ• ๋•Œ ConvNet์˜ ์ž…๋ ฅ์€ ๊ณ ์ • ํฌ๊ธฐ 224 × 224 RGB ์ด๋ฏธ์ง€์ž…๋‹ˆ๋‹ค. ์‚ฌ์ „ ์ฒ˜๋ฆฌ๋Š” training set์—์„œ ๊ณ„์‚ฐ๋œ ํ‰๊ท  RGB ๊ฐ’์„ ๊ฐ pixel์—์„œ ๋นผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€๋Š” ์ž‘์€ receptive field(3×3)๋ฅผ ์‚ฌ์šฉํ•˜๋Š” convolutional layer stack์„ ํ†ตํ•ด ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค. ์ผ๋ถ€ configuration์—์„œ๋Š” 1×1 convolution filter๋„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. Convolution stride๋Š” 1 pixel๋กœ ๊ณ ์ •๋˜๋ฉฐ, spatial padding์€ convolution ํ›„ spatial resolution์„ ์œ ์ง€ํ•˜๋„๋ก ์„ค์ •๋ฉ๋‹ˆ๋‹ค. Spatial pooling์€ 2×2 pixel ์ฐฝ์—์„œ max-pooling์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

The stack of convolutional layers (whose depth differs between architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, and the third performs the 1000-way ILSVRC classification and therefore has 1000 channels. The final layer is the softmax layer. The configuration of the fully-connected layers is the same in all networks.

 

All hidden layers are equipped with the ReLU (Krizhevsky et al., 2012) non-linearity. The networks do not contain Local Response Normalisation (LRN) layers, because the authors found experimentally that LRN does not improve performance on the ILSVRC dataset while increasing memory consumption and computation time.
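
To make this generic design concrete, here is a minimal PyTorch sketch I put together (not the authors' code): stacks of 3×3, stride-1, padding-1 convolutions with ReLU, 2×2 max-pooling, and the 4096-4096-1000 fully-connected head. The per-stage channel counts follow configuration D (VGG-16) from Table 1.

```python
import torch
import torch.nn as nn

# Channels per stage for configuration D (VGG-16); 'M' marks a 2x2 max-pool.
CFG_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # 3x3 conv, stride 1, padding 1 preserves the spatial resolution.
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

class VGG(nn.Module):
    def __init__(self, cfg=CFG_D, num_classes=1000):
        super().__init__()
        self.features = make_features(cfg)
        # After five 2x2 max-pools a 224x224 input becomes a 7x7x512 feature map.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),  # softmax is applied by the loss / at test time
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = VGG()
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```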

Configurations

์ด paper์—์„œ ํ‰๊ฐ€๋œ ConvNet configuration์€ Table 1์— ์š”์•ฝ๋˜์–ด ์žˆ์œผ๋ฉฐ, ๊ฐ๊ฐ ํ•˜๋‚˜์˜ column์œผ๋กœ ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค.

Network์˜ depth๋Š” ์™ผ์ชฝ(A)์—์„œ ์˜ค๋ฅธ์ชฝ(E)์œผ๋กœ ์ด๋™ํ•˜๋ฉด์„œ ์ฆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ configuration์€ 2.1์ ˆ์—์„œ ์„ค๋ช…ํ•œ generic design์„ ๋”ฐ๋ฅด๋ฉฐ, depth๋งŒ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

 

Network A has 11 weight layers (8 convolutional layers and 3 FC layers), while network E has 19 weight layers (16 convolutional layers and 3 FC layers).

The width of the convolutional layers (the number of channels) is rather small, starting from 64 in the first layer and doubling after each max-pooling layer until it reaches 512.

Table 2 lists the number of parameters for each configuration. In spite of the large depth, the number of parameters is not greater than that of a shallower network with larger convolutional layer widths and receptive fields.
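
As a quick sanity check on these figures, the torchvision reference implementations of configurations A, D and E can be counted directly (my own snippet, not from the paper; the exact numbers in Table 2 are of course the authors'):

```python
from torchvision.models import vgg11, vgg16, vgg19

def count_parameters(model):
    # Total number of trainable weights and biases.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

for name, ctor in [("A (vgg11)", vgg11), ("D (vgg16)", vgg16), ("E (vgg19)", vgg19)]:
    print(name, f"{count_parameters(ctor()) / 1e6:.0f}M")
# Prints roughly 133M, 138M and 144M, in line with the magnitudes reported in Table 2.
```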

Discussion

์ด ๋…ผ๋ฌธ์—์„œ์˜ ConvNet ๊ตฌ์„ฑ์€ ILSVRC-2012 (Krizhevsky et al., 2012)์™€ ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014)์˜ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋˜ ์—”ํŠธ๋ฆฌ์™€๋Š” ์ƒ๋‹นํžˆ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

 

receptive field์˜ ์žฅ์ 

Krizhevsky et al. (2012)์—์„œ ์‚ฌ์šฉ๋œ ์ฒซ ๋ฒˆ์งธ ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด์˜ ์ˆ˜์šฉ ํ•„๋“œ๋Š” 11×11 ํฌ๊ธฐ์˜€๊ณ , Zeiler & Fergus (2013)์™€ Sermanet et al. (2014)์—์„œ๋Š” 7×7 ํฌ๊ธฐ์˜€์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์ด ๋…ผ๋ฌธ์—์„  ๋„คํŠธ์›Œํฌ ์ „์ฒด์— ๊ฑธ์ณ ๋งค์šฐ ์ž‘์€ 3×3 receptive field๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

 

๋” ๋งŽ์€ ๋น„์„ ํ˜•์„ฑ(ReLU) ์ถ”๊ฐ€: 7×7 ํฌ๊ธฐ์˜ ๋‹จ์ผ Convolution Layer ๋Œ€์‹  3×3 ํฌ๊ธฐ์˜ ์„ธ ๊ฐœ์˜ Convolution Layer๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ, ์„ธ ๊ฐœ์˜ ๋น„์„ ํ˜•์„ฑ ๋ ˆ์ด์–ด(ReLU)๋ฅผ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋„คํŠธ์›Œํฌ์˜ ํ‘œํ˜„๋ ฅ์„ ์ฆ๊ฐ€์‹œํ‚ค๊ณ , ๋” ๋ณต์žกํ•œ ํ•จ์ˆ˜ ๊ทผ์‚ฌ์— ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ์ • ํ•จ์ˆ˜์˜ ๋น„์„ ํ˜•์„ฑ์ด ๋†’์•„์ง€๋ฉด, Classification(๋ถ„๋ฅ˜) ์ž‘์—…์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ ๊ฐ์†Œ: 3×3 Convolution Layer ์„ธ ๊ฐœ๋กœ ๊ตฌ์„ฑ๋œ ์Šคํƒ์˜ ๊ฒฝ์šฐ, ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์ด C ์ฑ„๋„์ผ ๋•Œ, ์ด ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” 3 * (3 * 3 * C^2) = 27C^2์ž…๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, ๋‹จ์ผ 7×7 Convolution Layer๋Š” 7 * 7 * C^2 = 49C^2์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง€๋ฏ€๋กœ, ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๊ฐ€ 81% ๋” ๋งŽ์Šต๋‹ˆ๋‹ค.

์ด๋Š” 7×7 Convolution Filter ์— ๋น„ํ•ด 3×3 Filter๊ฐ€ ๋” ์ ์€ ์ˆ˜์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ์ด๋Š” ๋ชจ๋ธ์˜ ๊ทœ์ œ(regularization) ํšจ๊ณผ๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ž‘์€ ํฌ๊ธฐ์˜ Filter๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ๋ชจ๋ธ์˜ ๋ณต์žก์„ฑ์„ ์ค„์ด๋ฉด์„œ๋„ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
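
This arithmetic can be checked directly in PyTorch; the snippet below compares the two options (bias terms are omitted so the counts match the C^2 formulas, and C = 64 is just an example):

```python
import torch.nn as nn

C = 64  # example channel count

# Three stacked 3x3 convolutions cover the same 7x7 receptive field as one 7x7 convolution.
stack_3x3 = nn.Sequential(*[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False)
                            for _ in range(3)])
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(stack_3x3), 27 * C**2)   # 110592 110592
print(params(single_7x7), 49 * C**2)  # 200704 200704
```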

 

๋” ๊นŠ์€ ๋„คํŠธ์›Œํฌ ๊ตฌ์„ฑ ๊ฐ€๋Šฅ: ์ž‘์€ Filter๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋„คํŠธ์›Œํฌ์˜ depth(๊นŠ์ด)๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ ์šฉ์ดํ•ฉ๋‹ˆ๋‹ค.

์ด๋Š” ๊นŠ์€ Network๊ฐ€ ๋” ๋ณต์žกํ•œ ํŠน์ง•์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฏธ์ง€ ์ธ์‹ ์ž‘์—…์—์„œ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

The role of 1×1 convolution layers

Adding 1×1 convolution layers is another way to increase the non-linearity of the decision function.

A 1×1 convolution is essentially a linear projection over the input channels, but combined with the non-linear activation function (ReLU) it provides additional non-linearity.

1×1 convolution layers were also used to good effect in the "Network in Network" architecture of Lin et al. (2014).

 

 

์ž‘์€ Filter์˜ ์‚ฌ์šฉ ์‚ฌ๋ก€

์ž‘์€ ํฌ๊ธฐ์˜ Convolution Filter ์‚ฌ์šฉ์€ ์ด์ „์—๋„ ์‹œ๋„๋œ ๋ฐ” ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, Ciresan et al. (2011)์€ ์ž‘์€ Filter๋ฅผ ์‚ฌ์šฉํ•œ ๋„คํŠธ์›Œํฌ๋ฅผ ์ œ์•ˆํ–ˆ์œผ๋‚˜, ์ด ๋…ผ๋ฌธ์— ์ œ์‹œ๋œ Network ๋งŒํผ ๊นŠ์ง€๋Š” ์•Š์•˜๊ณ , ๋Œ€๊ทœ๋ชจ ILSVRC ๋ฐ์ดํ„ฐ์…‹์—์„œ ํ‰๊ฐ€๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. Goodfellow et al. (2014)๋Š” ๊นŠ์€ ConvNet(11๊ฐœ ๊ฐ€์ค‘์น˜ ๋ ˆ์ด์–ด)์„ ์ŠคํŠธ๋ฆฌํŠธ ๋ฒˆํ˜ธ ์ธ์‹ ์ž‘์—…์— ์ ์šฉํ•˜์—ฌ, depth(๊นŠ์ด)๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋จ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently, but it is similar in that it is also a very deep ConvNet (22 weight layers) built from small convolution filters: besides 3×3, it also uses 1×1 and 5×5 convolutions. Its feature maps are down-sampled more aggressively in the first layers to reduce computation. In terms of single-network classification accuracy, VGGNet outperforms GoogLeNet (Szegedy et al., 2014).


Classification Framework

Training

์ด ๋…ผ๋ฌธ์—์„œ์˜ ConvNet ํ›ˆ๋ จ ๋ฐฉ๋ฒ•๊ณผ ์„ธ๋ถ€์‚ฌํ•ญ์— ๋ฐํ•˜์—ฌ ์„ค๋ช…ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

ํ›ˆ๋ จ์€ ๋ฏธ๋‹ˆ ๋ฐฐ์น˜ ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•(Gradient Descent)์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹คํ•ญ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€(Multinomial Logistic Regression) ๋ชฉํ‘œ ํ•จ์ˆ˜๋ฅผ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์—ญ์ „ํŒŒ(Backpropagation) ์•Œ๊ณ ๋ฆฌ์ฆ˜(LeCun et al., 1989)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

  • ๋ฐฐ์น˜ ํฌ๊ธฐ(Batch Size): 256, ๋ชจ๋ฉ˜ํ…€(Momentum): 0.9

 

Regularization

The training was regularized in two ways:
  1. Weight decay: an L2 penalty is used to control the magnitude of the weights; the L2 penalty multiplier is set to 5×10^-4.
  2. Dropout regularization: dropout is applied to the first two fully-connected layers, with a dropout ratio of 0.5.

 

Learning Rate

  • The learning rate was initially set to 10^-2.
  • It was then decreased by a factor of 10 whenever the accuracy on the validation set stopped improving.
  • In total, the learning rate was decreased 3 times, and learning was stopped after 370K iterations (74 epochs). A sketch of this training recipe follows below.

 

๊นŠ์€ ๋„คํŠธ์›Œํฌ์˜ ์ˆ˜๋ ด (Convergence)

์ด ๋…ผ๋ฌธ์—์„œ์˜ Neural Network (์‹ ๊ฒฝ๋ง ๋„คํŠธ์›Œํฌ)๋Š” Krizhevsky et al. (2012)๋ณด๋‹ค ๋” ๋งŽ์€ ๋งค๊ฐœ๋ณ€์ˆ˜์™€ ๋” ๊นŠ์€ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ, ๋‹ค์Œ์˜ ์ด์œ ๋กœ ์ธํ•ด ๋” ์ ์€ Epoch ๋‚ด์— ์ˆ˜๋ ดํ•ฉ๋‹ˆ๋‹ค.

  1. ๋‚ด์žฌ์  ์ •๊ทœํ™”(Implicit Regularization): ๋” ๊นŠ์€ ๋„คํŠธ์›Œํฌ์™€ ์ž‘์€ ํ•ฉ์„ฑ๊ณฑ ํ•„ํ„ฐ ํฌ๊ธฐ๊ฐ€ ๋‚ด์žฌ์  ์ •๊ทœํ™” ํšจ๊ณผ๋ฅผ ๊ฐ€์ ธ์™”์Šต๋‹ˆ๋‹ค.
  2. ๋ ˆ์ด์–ด ์‚ฌ์ „ ์ดˆ๊ธฐํ™”(Pre-initialization of Layers): ํŠน์ • Layer(๋ ˆ์ด์–ด)๋ฅผ ๋ฏธ๋ฆฌ ์ดˆ๊ธฐํ™”ํ•จ์œผ๋กœ์จ ํ•™์Šต์ด ์ด‰์ง„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

๋„คํŠธ์›Œํฌ ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™” (Network Weights Initialization)

  • ๋„คํŠธ์›Œํฌ Weight(๊ฐ€์ค‘์น˜)์˜ ์ดˆ๊ธฐํ™”๋Š” ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ์ดˆ๊ธฐํ™”๊ฐ€ ์ž˜๋ชป๋˜๋ฉด, ๊นŠ์€ ๋„คํŠธ์›Œํฌ์—์„œ Gradient(๊ธฐ์šธ๊ธฐ)์˜ ๋ถˆ์•ˆ์ •์„ฑ์œผ๋กœ ์ธํ•ด ํ•™์Šต์ด ์ง€์—ฐ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„: ์ดˆ๊ธฐ์—๋Š” ๋น„๊ต์  ์–•์€ ๊ตฌ์„ฑ์ธ ๋„คํŠธ์›Œํฌ A(Table 1)๋ฅผ ๋ฌด์ž‘์œ„ ์ดˆ๊ธฐํ™”๋กœ ํ›ˆ๋ จํ•˜์˜€์Šต๋‹ˆ๋‹ค.
  • ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„: ๋” ๊นŠ์€ Architecture๋ฅผ ํ›ˆ๋ จํ•  ๋•Œ, ๋„คํŠธ์›Œํฌ A์˜ ์ฒซ ๋„ค Convolution Layer์™€ ๋งˆ์ง€๋ง‰ 3๊ฐœ์˜ Fully-Connected Layer๋ฅผ ์ดˆ๊ธฐํ™”ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ค‘๊ฐ„ Layer๋Š” ๋ฌด์ž‘์œ„๋กœ ์ดˆ๊ธฐํ™”๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋ฏธ๋ฆฌ ์ดˆ๊ธฐํ™”๋œ Layer์˜ Learning Rate์€ ๊ฐ์†Œ์‹œํ‚ค์ง€ ์•Š๊ณ  ํ•™์Šต ๋„์ค‘ ๋ณ€ํ™”ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

Random Initialization

  • For random initialization, the weights were sampled from a normal distribution with zero mean and 10^-2 variance; the biases were initialized to zero.
  • It was later found that the weights can be initialized without any pre-training by using the random initialization procedure of Glorot & Bengio (2010). (Both schemes are sketched below.)
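
Both initialization schemes are easy to express in PyTorch; a sketch, assuming `model` is one of the VGG configurations defined earlier:

```python
import math
import torch.nn as nn

def init_paper(module):
    # Weights ~ N(0, 1e-2), i.e. standard deviation 0.1; biases set to zero.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=math.sqrt(1e-2))
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def init_glorot(module):
    # The Glorot & Bengio (2010) procedure, available as Xavier initialisation in PyTorch.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_paper)    # train configuration A from scratch as in the paper
# model.apply(init_glorot)   # or skip the pre-training stage entirely
```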

 

๊ณ ์ • ํฌ๊ธฐ 224×224 ConvNet ์ž…๋ ฅ ์ด๋ฏธ์ง€

  • ์ด๋ฏธ์ง€๋Š” ์žฌ์กฐ์ •๋œ ํ›ˆ๋ จ ์ด๋ฏธ์ง€์—์„œ ๋ฌด์ž‘์œ„๋กœ ์ž˜๋ผ๋ƒ…๋‹ˆ๋‹ค(ํ•œ ์ด๋ฏธ์ง€๋‹น ํ•œ ๋ฒˆ ์ž˜๋ผ๋‚ด๊ธฐ).
  • Training Set(ํ›ˆ๋ จ ์„ธํŠธ)๋ฅผ ๋” ํ™•์žฅํ•˜๊ธฐ ์œ„ํ•ด, ์ž˜๋ผ๋‚ธ ์ด๋ฏธ์ง€๋ฅผ ๋ฌด์ž‘์œ„๋กœ ๊ฐ€๋กœ๋กœ ๋’ค์ง‘๊ณ , RGB ์ƒ‰์ƒ์„ ๋ฌด์ž‘์œ„๋กœ ๋ณ€๊ฒฝํ–ˆ์Šต๋‹ˆ๋‹ค.

 

ํ›ˆ๋ จ ์ด๋ฏธ์ง€ ํฌ๊ธฐ (Training Image Size)

ํ›ˆ๋ จ ์ด๋ฏธ์ง€์˜ ๊ฐ€์žฅ ์ž‘์€ ๋ณ€์˜ ํฌ๊ธฐ๋ฅผ S๋ผ๊ณ  ํ•  ๋•Œ, ConvNet ์ž…๋ ฅ์€ 224×224 ํฌ๊ธฐ๋กœ ๊ณ ์ •๋ฉ๋‹ˆ๋‹ค. S๋Š” ์ตœ์†Œ 224 ์ด์ƒ์ด์–ด์•ผ ํ•˜๋ฉฐ, ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ถœ์ฒ˜: https://medium.com/@msmapark2

  1. ๋‹จ์ผ ์Šค์ผ€์ผ ํ›ˆ๋ จ(Single-Scale Training): S๋ฅผ ๊ณ ์ •ํ•ฉ๋‹ˆ๋‹ค. S=256๊ณผ S=384์—์„œ ๋ชจ๋ธ์„ ํ‰๊ฐ€ํ–ˆ์Šต๋‹ˆ๋‹ค. S=256์œผ๋กœ ํ›ˆ๋ จ๋œ ๋„คํŠธ์›Œํฌ์˜ Weight๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ S=384 Network๋ฅผ ์ดˆ๊ธฐํ™”ํ•˜๊ณ , ์ž‘์€ ์ดˆ๊ธฐ Learning Rate๋ฅผ 10^-3์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
  2. ๋ฉ€ํ‹ฐ ์Šค์ผ€์ผ ํ›ˆ๋ จ(Multi-Scale Training): ๊ฐ ํ›ˆ๋ จ ์ด๋ฏธ์ง€๋ฅผ [Smin, Smax] ๋ฒ”์œ„์—์„œ ๋ฌด์ž‘์œ„๋กœ ์ƒ˜ํ”Œ๋ง๋œ S๋กœ ์žฌ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค. Smin=256, Smax=512๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” Scale Jittering ์„ ํ†ตํ•ด ํ›ˆ๋ จ ์„ธํŠธ๋ฅผ ํ™•์žฅํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๊ฐ์ฒด ํฌ๊ธฐ๊ฐ€ ๋‹ค์–‘ํ•œ ์ด๋ฏธ์ง€๋ฅผ ์ธ์‹ํ•˜๋Š” ๋ฐ ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
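
A torchvision approximation of this augmentation pipeline might look as follows (my own sketch; ColorJitter stands in for the PCA-based RGB colour shift of Krizhevsky et al. (2012)):

```python
import random
from torchvision import transforms

def random_rescale(img):
    # Scale jittering: rescale the shorter side to a random S in [256, 512].
    S = random.randint(256, 512)
    return transforms.functional.resize(img, S)

train_transform = transforms.Compose([
    transforms.Lambda(random_rescale),
    transforms.RandomCrop(224),            # one 224x224 crop per rescaled image
    transforms.RandomHorizontalFlip(),     # random horizontal flipping
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
])
```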

 

Testing

At test time, a trained ConvNet and an input image are used to classify the image as follows:

Rescaling the Test Image

The test image is first isotropically rescaled to a pre-defined smallest image side, denoted Q.

Q does not have to be equal to the training scale S (using several values of Q per S leads to improved performance).

 

 

๋ฐ€์ง‘ ํ‰๊ฐ€ (Dense Evaluation)

์žฌ์กฐ์ •๋œ ํ…Œ์ŠคํŠธ ์ด๋ฏธ์ง€๋ฅผ ์ „์ฒด์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋„คํŠธ์›Œํฌ๋ฅผ ๋ฐ€์ง‘ํ•˜๊ฒŒ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋ฐ€์ง‘ ํ‰๊ฐ€์˜ ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. Fully-Connected Layer(์™„์ „ ์—ฐ๊ฒฐ ๋ ˆ์ด์–ด)๋ฅผ Convoultion Layer(ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด)๋กœ ๋ณ€ํ™˜:
    • ์ฒซ ๋ฒˆ์งธ Fully-Connected Layer(์™„์ „ ์—ฐ๊ฒฐ ๋ ˆ์ด์–ด)๋Š” 7×7 Convoultion Layer(ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด)๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.
    • ๋งˆ์ง€๋ง‰ ๋‘ ๊ฐœ์˜ Fully-Connected Layer(์™„์ „ ์—ฐ๊ฒฐ ๋ ˆ์ด์–ด) 1×1 Convoultion Layer(ํ•ฉ์„ฑ๊ณฑ ๋ ˆ์ด์–ด)๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค.
  2. ๋ฐ€์ง‘ ๋„คํŠธ์›Œํฌ ์ ์šฉ:
    • ๋ณ€ํ™˜๋œ ์™„์ „ Fully-Connected Layer(์™„์ „ ์—ฐ๊ฒฐ ๋ ˆ์ด์–ด)๋ฅผ ์ „์ฒด(์ž˜๋ฆฌ์ง€ ์•Š์€) ์ด๋ฏธ์ง€์— ์ ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ์ด ๊ฒฐ๊ณผ๋Š” ํด๋ž˜์Šค ์ ์ˆ˜ ๋งต(class score map)์„ ์ƒ์„ฑํ•˜๋ฉฐ, Class(ํด๋ž˜์Šค) ์ˆ˜์™€ ๋™์ผํ•œ ์ˆ˜์˜ Channel(์ฑ„๋„)๊ณผ Input Image (์ž…๋ ฅ ์ด๋ฏธ์ง€) ํฌ๊ธฐ์— ๋”ฐ๋ผ ๋‹ค์–‘ํ•œ ๊ณต๊ฐ„ ํ•ด์ƒ๋„๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.
  3. ํด๋ž˜์Šค ์ ์ˆ˜ ๋งต์˜ ๊ณต๊ฐ„ ํ‰๊ท ํ™”:
    • ํด๋ž˜์Šค ์ ์ˆ˜ ๋งต์„ ๊ณต๊ฐ„์ ์œผ๋กœ ํ‰๊ท ํ™”(ํ•ฉ-Pooling)ํ•˜์—ฌ ์ด๋ฏธ์ง€์— ๋Œ€ํ•œ ๊ณ ์ • ํฌ๊ธฐ ๋ฒกํ„ฐ์˜ ํด๋ž˜์Šค ์ ์ˆ˜๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.
    • ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ๊ฐ€๋กœ๋กœ ๋’ค์ง‘์–ด ์›๋ณธ ์ด๋ฏธ์ง€์™€ ๋’ค์ง‘ํžŒ ์ด๋ฏธ์ง€์˜ Softmax Class ํฌ์ŠคํŠธ๋ฆฌ์–ด๋ฅผ ํ‰๊ท ํ™”ํ•˜์—ฌ ์ตœ์ข… ์ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
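
Below is a rough sketch of the FC-to-conv conversion in step 1, using torchvision's vgg16 as the base model (my illustration of the idea, not the authors' code; torchvision's adaptive average-pooling layer is simply dropped so the converted network can slide over larger inputs):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

vgg = vgg16()
fc1, fc2, fc3 = vgg.classifier[0], vgg.classifier[3], vgg.classifier[6]

# Reinterpret the FC weight matrices as convolution kernels.
conv1 = nn.Conv2d(512, 4096, kernel_size=7)    # first FC layer  -> 7x7 conv
conv2 = nn.Conv2d(4096, 4096, kernel_size=1)   # second FC layer -> 1x1 conv
conv3 = nn.Conv2d(4096, 1000, kernel_size=1)   # last FC layer   -> 1x1 conv
with torch.no_grad():
    conv1.weight.copy_(fc1.weight.view(4096, 512, 7, 7)); conv1.bias.copy_(fc1.bias)
    conv2.weight.copy_(fc2.weight.view(4096, 4096, 1, 1)); conv2.bias.copy_(fc2.bias)
    conv3.weight.copy_(fc3.weight.view(1000, 4096, 1, 1)); conv3.bias.copy_(fc3.bias)

fully_conv = nn.Sequential(vgg.features, conv1, nn.ReLU(inplace=True),
                           conv2, nn.ReLU(inplace=True), conv3)

# Applied to an uncropped image larger than 224x224, the output is a class score
# map; spatial averaging yields one fixed-size score vector per image.
scores = fully_conv(torch.randn(1, 3, 384, 384))
print(scores.shape, scores.mean(dim=(2, 3)).shape)  # -> (1, 1000, 6, 6), (1, 1000)
```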

 

๋‹ค์ค‘ ํฌ๋กญ ํ‰๊ฐ€ (Multi-Crop Evaluation)

๋ฐ€์ง‘ ํ‰๊ฐ€์™€ ๋‹ฌ๋ฆฌ ๋‹ค์ค‘ ํฌ๋กญ ํ‰๊ฐ€์—์„œ๋Š” ์ด๋ฏธ์ง€์˜ ์—ฌ๋Ÿฌ ๋ถ€๋ถ„์„ ์ž˜๋ผ์„œ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค:

  1. ์„ธ๋ฐ€ํ•œ ์ž…๋ ฅ ์ด๋ฏธ์ง€ ์ƒ˜ํ”Œ๋ง:
    • ๋‹ค์ค‘ ํฌ๋กญ ํ‰๊ฐ€๋Š” ์ž…๋ ฅ ์ด๋ฏธ์ง€์˜ ์„ธ๋ฐ€ํ•œ ์ƒ˜ํ”Œ๋ง์„ ์ œ๊ณตํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฐ ํฌ๋กญ์— ๋Œ€ํ•ด ์žฌ๊ณ„์‚ฐํ•ด์•ผ ํ•˜๋ฏ€๋กœ ํšจ์œจ์„ฑ์ด ๋–จ์–ด์ง€์ง€๋งŒ, ์„ฑ๋Šฅ ํ–ฅ์ƒ ๊ฐ€๋Šฅ์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.
  2. ๋‹ค์–‘ํ•œ ๊ฒฝ๊ณ„ ์กฐ๊ฑด:
    • ๋ฐ€์ง‘ ํ‰๊ฐ€์™€ ๋‹ค์ค‘ ํฌ๋กญ ํ‰๊ฐ€๋Š” ๋‹ค๋ฅธ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ๋ฐ€์ง‘ ํ‰๊ฐ€์—์„œ๋Š” Convolution Feature Map ์ด ์ด๋ฏธ์ง€์˜ ์ด์›ƒ ๋ถ€๋ถ„์œผ๋กœ Padding๋˜์ง€๋งŒ, ๋‹ค์ค‘ ํฌ๋กญ ํ‰๊ฐ€๋Š” ์˜(0)์œผ๋กœ Padding ๋ฉ๋‹ˆ๋‹ค.
    • ์ด๋Ÿฌํ•œ ์ฐจ์ด๋กœ ์ธํ•ด Network์˜ ์ˆ˜์šฉ ํ•„๋“œ๊ฐ€ ์ฆ๊ฐ€ํ•˜์—ฌ ๋” ๋งŽ์€ Context๋ฅผ ํฌ์ฐฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

ํ…Œ์ŠคํŠธ ์‹œ๊ฐ„ ์ฆ๊ฐ• (Test Time Augmentation)

ํ…Œ์ŠคํŠธ ์‹œ๊ฐ„ ์ฆ๊ฐ• ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด ๋‹จ์ผ ์ด๋ฏธ์ง€๋ฅผ ์—ฌ๋Ÿฌ ํฌ๊ธฐ๋กœ ์กฐ์ •ํ•˜๊ณ , ๋‹ค์–‘ํ•œ ํฌ๋กญ์„ ํ‰๊ฐ€ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š” ๋‹จ์ผ Scale Test ๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

Efficiency Considerations

Multi-crop evaluation can take considerably more time than dense evaluation, but it often improves accuracy.

Combining the two methods by averaging their softmax outputs can improve performance even further.


Implementation Details

VGGNet์˜ ๊ตฌํ˜„์€ ๊ณต๊ฐœ๋œ C++ Caffe ํˆด๋ฐ•์Šค(Jia, 2013)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์ง€๋งŒ, ์—ฌ๋Ÿฌ GPU์—์„œ ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋ช‡ ๊ฐ€์ง€ ์ค‘์š”ํ•œ ์ˆ˜์ •์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ฃผ์š” ์ˆ˜์ • ์‚ฌํ•ญ๊ณผ ๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

Multi-GPU Training

Multi-GPU training exploits data parallelism. The main steps are as follows (a modern analogue is sketched after the list):

  1. Splitting the batch:
    • Each batch of training images is split into several GPU batches, which are processed in parallel, one per GPU.
  2. Gradient computation:
    • The gradient of each GPU batch is computed in parallel on its GPU.
  3. Gradient averaging:
    • The GPU batch gradients are averaged to obtain the gradient of the full batch.
  4. Synchronous gradient computation:
    • Gradient computation is synchronised across the GPUs, so the result is exactly the same as when training on a single GPU.
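
For reference, here is a hedged sketch of the same data-parallel idea with today's PyTorch API (the paper itself modified Caffe; nn.DataParallel is only a rough analogue):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

# nn.DataParallel replicates the model on every visible GPU, splits each input
# batch across the replicas, runs forward/backward in parallel, and accumulates
# the gradients on the default device before the optimizer step.
model = vgg16()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```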

 

GPU ์‹œ์Šคํ…œ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ

  • ์„ฑ๋Šฅ ํ–ฅ์ƒ:
    • ์šฐ๋ฆฌ๋Š” ๋„คํŠธ์›Œํฌ ํ›ˆ๋ จ ์†๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋‹จ์ผ ์‹œ์Šคํ…œ์— ์„ค์น˜๋œ ์—ฌ๋Ÿฌ GPU๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ๋” ๋ณต์žกํ•œ ๋ณ‘๋ ฌํ™” ๋ฐฉ๋ฒ•(Krizhevsky, 2014)์ด ์ตœ๊ทผ ์ œ์•ˆ๋˜์—ˆ์ง€๋งŒ, ๋‹จ์ˆœํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ๋„ 4๊ฐœ์˜ GPU ์‹œ์Šคํ…œ์—์„œ ๋‹จ์ผ GPU ์‚ฌ์šฉ์— ๋น„ํ•ด 3.75๋ฐฐ์˜ ์†๋„ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ํ•˜๋“œ์›จ์–ด ๊ตฌ์„ฑ:
    • Network ํ›ˆ๋ จ์€ ๋„ค ๊ฐœ์˜ NVIDIA Titan Black GPU๊ฐ€ ์žฅ์ฐฉ๋œ ์‹œ์Šคํ…œ์—์„œ ์ˆ˜ํ–‰๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
    • ๋‹จ์ผ Network ํ›ˆ๋ จ์—๋Š” ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ 2-3์ฃผ๊ฐ€ ์†Œ์š”๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

ํ•ฉ์„ฑ๊ณฑ ์‹ ๊ฒฝ๋ง(CNN) ํˆด๋ฐ•์Šค Caffe์˜ ์ˆ˜์ •

Caffe๋Š” ConvNet ํ›ˆ๋ จ ๋ฐ ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ๊ณต๊ฐœ๋œ C++ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ž…๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ ์ˆ˜์ • ์‚ฌํ•ญ์„ ํ†ตํ•ด Caffe๋ฅผ ๊ฐœ์„ ํ•˜์˜€์Šต๋‹ˆ๋‹ค:

  1. ๋‹ค์ค‘ GPU Training(ํ›ˆ๋ จ) ๋ฐ Evaluation(ํ‰๊ฐ€) ์ง€์›:
    • ์—ฌ๋Ÿฌ GPU์—์„œ Training ๋ฐ Evaluation(ํ‰๊ฐ€)๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก Caffe ์ฝ”๋“œ๋ฅผ ์ˆ˜์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.
  2. ์ „์ฒด ์ด๋ฏธ์ง€์—์„œ Training(ํ›ˆ๋ จ) ๋ฐ Evaluation(ํ‰๊ฐ€):
    • ์ „์ฒด ํฌ๊ธฐ์˜ ์ด๋ฏธ์ง€๋ฅผ ์—ฌ๋Ÿฌ Scale(ํฌ๊ธฐ)์—์„œ Training(ํ›ˆ๋ จ) ๋ฐ Evaluation(ํ‰๊ฐ€)ํ•  ์ˆ˜ ์žˆ๋„๋ก ์ˆ˜์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
  3. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ ํ™œ์šฉ:
    • ๋‹ค์ค‘ GPU ํ›ˆ๋ จ์—์„œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ํ™œ์šฉํ•˜์—ฌ Batch Gradient(๋ฐฐ์น˜ ๊ทธ๋ž˜๋””์–ธํŠธ)๋ฅผ ๋ณ‘๋ ฌ๋กœ ๊ณ„์‚ฐํ•˜๊ณ  ํ‰๊ท ํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

Implementation(๊ตฌํ˜„)์˜ ์ฃผ์š” ํŠน์ง•

  • ๋‹ค์ค‘ ์Šค์ผ€์ผ ์ง€์›:
    • ํ›ˆ๋ จ ๋ฐ ํ…Œ์ŠคํŠธ ์‹œ ์—ฌ๋Ÿฌ ์Šค์ผ€์ผ์„ ์ง€์›ํ•˜์—ฌ ๋„คํŠธ์›Œํฌ์˜ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ํšจ์œจ์„ฑ:
    • ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ํ™œ์šฉํ•˜์—ฌ ํ›ˆ๋ จ ์†๋„๋ฅผ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

 


Classification Experiments

This section presents the image classification results obtained with the proposed ConvNet architectures on the ILSVRC-2012 dataset.

 

ILSVRC-2012 ๋ฐ์ดํ„ฐ์…‹์—๋Š” 1000๊ฐœ์˜ ํด๋ž˜์Šค ์ด๋ฏธ์ง€๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์œผ๋ฉฐ, ํ›ˆ๋ จ ์„ธํŠธ(130๋งŒ ์žฅ), ๊ฒ€์ฆ ์„ธํŠธ(5๋งŒ ์žฅ), ํ…Œ์ŠคํŠธ ์„ธํŠธ(10๋งŒ ์žฅ)๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค.

๋ถ„๋ฅ˜ ์„ฑ๋Šฅ์€ ๋‘ ๊ฐ€์ง€ ์ธก์ • ๋ฐฉ๋ฒ•์œผ๋กœ ํ‰๊ฐ€๋ฉ๋‹ˆ๋‹ค. top-1 ์˜ค๋ฅ˜์™€ top-5 ์˜ค๋ฅ˜. top-1 ์˜ค๋ฅ˜๋Š” ์ž˜๋ชป ๋ถ„๋ฅ˜๋œ ์ด๋ฏธ์ง€์˜ ๋น„์œจ์ด๋ฉฐ, top-5 ์˜ค๋ฅ˜๋Š” ์ƒ์œ„ 5๊ฐœ์˜ ์˜ˆ์ธก ํด๋ž˜์Šค์— ์ •๋‹ต ํด๋ž˜์Šค๊ฐ€ ํฌํ•จ๋˜์ง€ ์•Š์€ ๋น„์œจ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

 

Single-Scale Evaluation

The performance of the individual ConvNet models is first evaluated at a single scale, with the test image size set as follows:

Q = S for a fixed S, and Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax]. The findings are as follows:

  • ๋กœ์ปฌ ์‘๋‹ต ์ •๊ทœํ™”(LRN) ์‚ฌ์šฉ ์—ฌ๋ถ€:
    • A-LRN ๋„คํŠธ์›Œํฌ๋Š” LRN์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์€ A ๋„คํŠธ์›Œํฌ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.
    • ๋”ฐ๋ผ์„œ, ๋” ๊นŠ์€ ์•„ํ‚คํ…์ฒ˜(B–E)์—์„œ๋Š” ์ •๊ทœํ™”๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.
  • ๋„คํŠธ์›Œํฌ ๊นŠ์ด ์ฆ๊ฐ€์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ๋ณ€ํ™”:
    • Network์˜ depth(๊นŠ์ด)๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ Classification Error(๋ถ„๋ฅ˜ ์˜ค๋ฅ˜)๊ฐ€ ๊ฐ์†Œํ–ˆ์Šต๋‹ˆ๋‹ค.
    • ์˜ˆ๋ฅผ ๋“ค์–ด, 11๊ฐœ Layer(๋ ˆ์ด์–ด)๋ฅผ ๊ฐ€์ง„ A Network์—์„œ 19๊ฐœ Layer(๋ ˆ์ด์–ด)๋ฅผ ๊ฐ€์ง„ E Network๋กœ ์ด๋™ํ• ์ˆ˜๋ก ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
    • ๊ตฌ์„ฑ C๋Š” 1×1 Convolution Layer 3๊ฐœ๋ฅผ ํฌํ•จํ•˜์ง€๋งŒ, 3×3 Convolution Layer๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ตฌ์„ฑ D๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์กŒ์Šต๋‹ˆ๋‹ค.
    • ์ด๋Š” ์ถ”๊ฐ€์ ์ธ ๋น„์„ ํ˜•์„ฑ์ด ๋„์›€์ด ๋˜์ง€๋งŒ(C๊ฐ€ B๋ณด๋‹ค ๋‚˜์Œ), ๊ณต๊ฐ„์  ๋ฌธ๋งฅ์„ ํฌ์ฐฉํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•จ์„ ์‹œ์‚ฌํ•ฉ๋‹ˆ๋‹ค(D๊ฐ€ C๋ณด๋‹ค ๋‚˜์Œ).
  • ์Šค์ผ€์ผ ์ง€ํ„ฐ๋ง์˜ ํšจ๊ณผ:
    • Training(ํ›ˆ๋ จ) ์‹œ Scale Jittering (S ∈ [256; 512])์„ ์‚ฌ์šฉํ•˜๋ฉด ๋‹จ์ผ ์Šค์ผ€์ผ(S = 256 ๋˜๋Š” S = 384)๋กœ ํ›ˆ๋ จํ•œ ๊ฒƒ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋” ์ข‹์•„์ง‘๋‹ˆ๋‹ค. ์ด๋Š” Scale Jittering์„ ํ†ตํ•œ ํ›ˆ๋ จ ์„ธํŠธ ํ™•์žฅ์ด ๋ฉ€ํ‹ฐ ์Šค์ผ€์ผ ์ด๋ฏธ์ง€ ํ†ต๊ณ„๋ฅผ ์บก์ฒ˜ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋จ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.
  • ๋‹จ์ผ ์Šค์ผ€์ผ ํ‰๊ฐ€ ๊ฒฐ๊ณผ (Table 3)
| ConvNet config (Table 1) | train scale S | test scale Q | top-1 val. error (%) | top-5 val. error (%) |
| --- | --- | --- | --- | --- |
| A | 256 | 256 | 29.6 | 10.4 |
| A-LRN | 256 | 256 | 29.7 | 10.5 |
| B | 256 | 256 | 28.7 | 9.9 |
| C | 256 | 256 | 28.1 | 9.4 |
| C | 384 | 384 | 28.1 | 9.3 |
| C | [256;512] | 384 | 27.3 | 8.8 |
| D | 256 | 256 | 27.0 | 8.8 |
| D | 384 | 384 | 26.8 | 8.7 |
| D | [256;512] | 384 | 25.6 | 8.1 |
| E | 256 | 256 | 27.3 | 9.0 |
| E | 384 | 384 | 26.9 | 8.7 |
| E | [256;512] | 384 | 25.5 | 8.0 |

 

Multi-Scale Evaluation

Multi-scale evaluation runs the model over several rescaled versions of the test image (i.e. different values of Q) and then averages the resulting class posteriors.

  • Choice of test scales given the training scale:
    • Models trained with a fixed S were evaluated over three test image sizes close to the training one: Q = {S - 32, S, S + 32}.
    • Models trained with scale jittering (S ∈ [256; 512]) were evaluated over a wider range of sizes: Q = {Smin, 0.5(Smin + Smax), Smax}.
  • Multi-scale evaluation results (Table 4):
| ConvNet config (Table 1) | train scale S | test scales Q | top-1 val. error (%) | top-5 val. error (%) |
| --- | --- | --- | --- | --- |
| B | 256 | 224,256,288 | 28.2 | 9.6 |
| C | 256 | 224,256,288 | 27.7 | 9.2 |
| C | 384 | 352,384,416 | 27.8 | 9.2 |
| C | [256;512] | 256,384,512 | 26.3 | 8.2 |
| D | 256 | 224,256,288 | 26.6 | 8.6 |
| D | 384 | 352,384,416 | 26.5 | 8.6 |
| D | [256;512] | 256,384,512 | 24.8 | 7.5 |
| E | 256 | 224,256,288 | 26.9 | 8.7 |
| E | 384 | 352,384,416 | 26.7 | 8.6 |
| E | [256;512] | 256,384,512 | 24.8 | 7.5 |

 

Multi-Crop Evaluation

Dense ConvNet evaluation is compared with multi-crop evaluation, and the two evaluation methods are also combined.

  • Multi-crop evaluation:
    • Multi-crop evaluation classifies several crops taken from the image.
    • It can give better accuracy, but it is less efficient.
  • Comparison of the evaluation methods:
    • Multi-crop evaluation performs slightly better than dense evaluation, and combining the two works better still, presumably due to their different treatment of the convolution boundary conditions.
  • Multi-crop evaluation results (Table 5):
| ConvNet config (Table 1) | evaluation method | top-1 val. error (%) | top-5 val. error (%) |
| --- | --- | --- | --- |
| D | dense | 24.8 | 7.5 |
| D | multi-crop | 24.6 | 7.5 |
| D | multi-crop & dense | 24.4 | 7.2 |
| E | dense | 24.8 | 7.5 |
| E | multi-crop | 24.6 | 7.4 |
| E | multi-crop & dense | 24.4 | 7.1 |

 

ConvNet Fusion

This section looks at combining the outputs of several models to improve performance. ConvNet fusion is performed by averaging the softmax class posteriors of multiple networks, which improves accuracy by exploiting the complementarity of the individual models.

ILSVRC ์ œ์ถœ ์‹œ์ ์˜ ๋„คํŠธ์›Œํฌ ์œตํ•ฉ

ILSVRC ์ œ์ถœ ๋‹น์‹œ์—๋Š” ๋‹จ์ผ ์Šค์ผ€์ผ ๋„คํŠธ์›Œํฌ์™€ ๋ฉ€ํ‹ฐ ์Šค์ผ€์ผ ๋ชจ๋ธ D๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด 7๊ฐœ์˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์œตํ•ฉํ–ˆ์Šต๋‹ˆ๋‹ค.

  • ๋‹จ์ผ ์Šค์ผ€์ผ ๋„คํŠธ์›Œํฌ: ๊ณ ์ •๋œ ์Šค์ผ€์ผ๋กœ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ
  • ๋ฉ€ํ‹ฐ ์Šค์ผ€์ผ ๋ชจ๋ธ D: ์—ฌ๋Ÿฌ ์Šค์ผ€์ผ์—์„œ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ

์ด ์œตํ•ฉ ๋„คํŠธ์›Œํฌ๋Š” ILSVRC ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ 7.3%์˜ top-5 ์˜ค๋ฅ˜์œจ์„ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค.

 

ILSVRC ์ œ์ถœ ์ดํ›„์˜ ๋„คํŠธ์›Œํฌ ์œตํ•ฉ

์ œ์ถœ ํ›„, ๋‘ ๊ฐ€์ง€ ๋ฉ€ํ‹ฐ ์Šค์ผ€์ผ ๋ชจ๋ธ์˜ ๊ฒฐํ•ฉ์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ๋”์šฑ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋‹ค์Œ ๋‘ ๋ชจ๋ธ์„ ์œตํ•ฉํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. ๋ชจ๋ธ D: Scale Jittering(S ∈ [256; 512])์œผ๋กœ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ
  2. ๋ชจ๋ธ E: Scale Jittering(S ∈ [256; 512])์œผ๋กœ ํ›ˆ๋ จ๋œ ๋ชจ๋ธ

์ด ๋‘ ๋ชจ๋ธ์„ ์œตํ•ฉํ•œ ๊ฒฐ๊ณผ, ํ…Œ์ŠคํŠธ ์„ธํŠธ์—์„œ top-5 ์˜ค๋ฅ˜์œจ์„ 7.0%๋กœ ๊ฐ์†Œ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ, ๋ฐ€์ง‘ ํ‰๊ฐ€์™€ ๋ฉ€ํ‹ฐ ํฌ๋กญ ํ‰๊ฐ€๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ฒฐ๊ณผ, ์„ฑ๋Šฅ์ด ๋”์šฑ ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

  • ๋ฐ€์ง‘ ํ‰๊ฐ€์™€ ๋ฉ€ํ‹ฐ ํฌ๋กญ ํ‰๊ฐ€ ๊ฒฐํ•ฉ: ์ด ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด ๋‹จ์ผ ๋ชจ๋ธ์ด ๊ฐ๊ฐ์˜ ํ‰๊ฐ€ ๋ฐฉ๋ฒ•์—์„œ ์–ป์€ Softmax ์ถœ๋ ฅ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
๊ฒฐํ•ฉ๋œ ConvNet ๋ชจ๋ธ top-1 ๊ฒ€์ฆ ์˜ค๋ฅ˜ (%) top-5 ๊ฒ€์ฆ ์˜ค๋ฅ˜ (%) top-5 ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜ (%)
ILSVRC ์ œ์ถœ (D/256/224,256,288), (D/384/352,384,416), (D/[256;512]/256,384,512), (C/256/224,256,288), (C/384/352,384,416), (E/256/224,256,288), (E/384/352,384,416) 24.7 7.5 7.3
์ œ์ถœ ํ›„ (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), ๋ฐ€์ง‘ ํ‰๊ฐ€ 24.0 7.1 7.0
์ œ์ถœ ํ›„ (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), ๋ฉ€ํ‹ฐ ํฌ๋กญ 23.9 7.2 -
์ œ์ถœ ํ›„ (D/[256;512]/256,384,512), (E/[256;512]/256,384,512), ๋ฉ€ํ‹ฐ ํฌ๋กญ & ๋ฐ€์ง‘ ํ‰๊ฐ€ 23.7 6.8 6.8

 

Comparison with the State of the Art

ILSVRC-2014 challenge results

In the classification task of the ILSVRC-2014 challenge, the "VGG" team took 2nd place.

The final submission, an ensemble of 7 networks, achieved a top-5 test error of 7.3%.

After the submission, an ensemble of the two best multi-scale models was used to improve the result further:

  • Single-network evaluation: the best single network achieves a top-5 test error of 7.0%.
  • Network ensemble evaluation: combining the two multi-scale models achieves a top-5 test error of 6.8%.

 

GoogLeNet๊ณผ์˜ ๋น„๊ต

GoogLeNet(Szegedy et al., 2014)์€ ILSVRC-2014 ๋ถ„๋ฅ˜ ๊ณผ์ œ์—์„œ 1์œ„๋ฅผ ์ฐจ์ง€ํ•œ ๋ชจ๋ธ๋กœ, 6.7%์˜ top-5 ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค. GoogLeNet์€ ๋งค์šฐ ๊นŠ์€ ConvNet(22๊ฐœ์˜ Weight Layer)๊ณผ ์ž‘์€ Convoultion Filter๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. GoogLeNet์˜ ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€๋Š” ์šฐ๋ฆฌ์˜ ๊ฒƒ๋ณด๋‹ค ๋” ๋ณต์žกํ•˜๋ฉฐ, ์ฒซ ๋ฒˆ์งธ Layer์—์„œ Feature Map์˜ ๊ณต๊ฐ„ ํ•ด์ƒ๋„๋ฅผ ๋” ๊ณต๊ฒฉ์ ์œผ๋กœ ์ค„์—ฌ ๊ณ„์‚ฐ๋Ÿ‰์„ ๊ฐ์†Œ์‹œํ‚ต๋‹ˆ๋‹ค.

  • ๋‹จ์ผ ๋„คํŠธ์›Œํฌ ์„ฑ๋Šฅ: ์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์ด ๋‹จ์ผ GoogLeNet ๋ชจ๋ธ๋ณด๋‹ค 0.9% ๋” ๋‚ฎ์€ ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.
  • ๋„คํŠธ์›Œํฌ ์•™์ƒ๋ธ” ์„ฑ๋Šฅ: GoogLeNet์˜ ์•™์ƒ๋ธ” ๊ฒฐ๊ณผ๋Š” 6.7%๋กœ, ์šฐ๋ฆฌ์˜ ๊ฒฐ๊ณผ์™€ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

 

MSRA์™€์˜ ๋น„๊ต

MSRA(He et al., 2014) ํŒ€์€ 11๊ฐœ์˜ ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ 8.1%์˜ ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.

MSRA์˜ ๋‹จ์ผ ๋„คํŠธ์›Œํฌ ์„ฑ๋Šฅ์€ 9.1%์˜ ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.

 

Clarifai์™€์˜ ๋น„๊ต

Clarifai(Russakovsky et al., 2014)๋Š” ILSVRC-2013 ์ฑŒ๋ฆฐ์ง€์˜ ์Šน์ž๋กœ, ์™ธ๋ถ€ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ 11.7%, ์™ธ๋ถ€ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ ๊ฒฝ์šฐ 11.2%์˜ top-5 ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์€ ์™ธ๋ถ€ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ ๋„ Clarifai ๋ชจ๋ธ๋ณด๋‹ค ํ›จ์”ฌ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

Zeiler & Fergus์™€์˜ ๋น„๊ต

Zeiler & Fergus (Zeiler & Fergus, 2013)๋Š” 6๊ฐœ์˜ ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ 14.8%์˜ ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.

๋‹จ์ผ ๋„คํŠธ์›Œํฌ ์„ฑ๋Šฅ์€ 16.1%์˜ ์˜ค๋ฅ˜์œจ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์€ ์ด ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ›จ์”ฌ ๋” ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

OverFeat์™€์˜ ๋น„๊ต

OverFeat(Sermanet et al., 2014)์€ 7๊ฐœ์˜ ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ 13.6%์˜ ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค.

๋‹จ์ผ ๋„คํŠธ์›Œํฌ ์„ฑ๋Šฅ์€ 14.2%์˜ ์˜ค๋ฅ˜์œจ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์€ OverFeat ๋ชจ๋ธ๋ณด๋‹ค ํ›จ์”ฌ ๋” ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

Krizhevsky et al.๊ณผ์˜ ๋น„๊ต

Krizhevsky et al. (Krizhevsky et al., 2012)๋Š” 5๊ฐœ์˜ ๋„คํŠธ์›Œํฌ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ 16.4%์˜ ํ…Œ์ŠคํŠธ ์˜ค๋ฅ˜์œจ์„ ๊ธฐ๋กํ–ˆ์Šต๋‹ˆ๋‹ค

๋‹จ์ผ ๋„คํŠธ์›Œํฌ ์„ฑ๋Šฅ์€ 18.2%์˜ ์˜ค๋ฅ˜์œจ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ๋ชจ๋ธ์€ ์ด ๋ชจ๋ธ๋“ค๋ณด๋‹ค ํ›จ์”ฌ ๋” ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

 

์ข…ํ•ฉ ํ‰๊ฐ€

์šฐ๋ฆฌ์˜ ๋งค์šฐ ๊นŠ์€ ConvNet ๋ชจ๋ธ์€ ๊ธฐ์กด ๋ชจ๋ธ๋“ค๊ณผ ๋น„๊ตํ•˜์—ฌ ์ƒ๋‹นํžˆ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ํŠนํžˆ, ๋‹จ์ผ ๋„คํŠธ์›Œํฌ ์„ฑ๋Šฅ์—์„œ๋Š” ๋ชจ๋“  ๊ธฐ์กด ๋ชจ๋ธ์„ ๋Šฅ๊ฐ€ํ–ˆ์œผ๋ฉฐ, ๋„คํŠธ์›Œํฌ ์•™์ƒ๋ธ”์„ ํ†ตํ•ด ๋”์šฑ ํ–ฅ์ƒ๋œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ์—ฐ๊ตฌ๋Š” ๊นŠ์€ ๋„คํŠธ์›Œํฌ๊ฐ€ ๋” ๋‚˜์€ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•  ์ˆ˜ ์žˆ์Œ์„ ์ž…์ฆํ•˜์˜€์Šต๋‹ˆ๋‹ค.

  • ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค๊ณผ ๋น„๊ต ๊ฒฐ๊ณผ (Table 7)
| Method | top-1 val. error (%) | top-5 val. error (%) | top-5 test error (%) |
| --- | --- | --- | --- |
| VGG (2 nets, multi-crop & dense eval.) | 23.7 | 6.8 | 6.8 |
| VGG (1 net, multi-crop & dense eval.) | 24.4 | 7.1 | 7.0 |
| VGG (ILSVRC submission, 7 nets, dense eval.) | 24.7 | 7.5 | 7.3 |
| GoogLeNet (1 net) | - | 7.9 | - |
| GoogLeNet (7 nets) | - | 6.7 | - |
| MSRA (11 nets) | - | - | 8.1 |
| MSRA (1 net) | 27.9 | 9.1 | 9.1 |
| Clarifai (multiple nets, no external data) | - | - | 11.7 |
| Clarifai (1 net, no external data) | - | - | 12.5 |
| Zeiler & Fergus (6 nets) | 36.0 | 14.7 | 14.8 |
| Zeiler & Fergus (1 net) | 37.5 | 16.0 | 16.1 |
| OverFeat (7 nets) | 34.0 | 13.2 | 13.6 |
| OverFeat (1 net) | 35.7 | 14.2 | - |
| Krizhevsky et al. (5 nets) | 38.1 | 16.4 | 16.4 |
| Krizhevsky et al. (1 net) | 40.7 | 18.2 | - |

Conclusion

In this paper, very deep convolutional networks (up to 19 weight layers) were evaluated for large-scale image classification.

The results demonstrate that representation depth is beneficial for classification accuracy.

They also show that state-of-the-art performance on the ImageNet challenge dataset can be achieved with a conventional ConvNet architecture simply by substantially increasing its depth.

In addition, the VGGNet models generalise well to other tasks and datasets, matching or outperforming pipelines built around less deep image representations.

These results confirm, once again, the importance of depth in visual representations.


That wraps up the paper review; in the next post I will come back with a VGG16 implementation in PyTorch~