๊ด€๋ฆฌ ๋ฉ”๋‰ด

Doby's Lab


AI/Concepts

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

๋„๋น„(Doby) 2023. 1. 22. 13:18

โœ… Contents

  • Intro
  • Abstract
  • 1. Introduction
  • 2. Towards Reducing Internal Covariate Shift
  • 3. Normalization via Mini-Batch Statistics
  • 4. Experiments & 5. Conclusion (skip)
  • Outro
  • Reference

โœ… Intro

๐Ÿ“„ Motivation

Batch Normalization์— ๋Œ€ํ•ด ์•Œ๊ฒŒ ๋œ ํ›„, ๋‚จ์•„์žˆ๋Š” ๊ถ๊ธˆ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ง์ ‘ ์›๋ฌธ์„ ๋ณด์•„์•ผ๊ฒ ๋‹ค๋Š” ์ƒ๊ฐ์ด ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

๊ถ๊ธˆํ–ˆ๋˜ ๊ฒƒ์œผ๋กœ๋Š” BN(X)์™€ Activation์˜ ์„ ํ›„ ๊ด€๊ณ„๋‚˜ ๋„คํŠธ์›Œํฌ๋งˆ๋‹ค Batch Normalization์˜ ๋ฐฉ๋ฒ•์ด ๋‹ค๋ฅธ ๊ฒƒ๋“ค์ด ๊ถ๊ธˆํ•˜๊ธฐ๋„ ํ–ˆ๊ณ ,

'Internal Covariate Shift๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ ๋‹ค๋ฅธ ๊ฒƒ์„ ํ•ด๊ฒฐํ•œ๋‹ค'๋Š” ์ถ”ํ›„์— ๋‚˜์˜จ ๋‹ค๋ฅธ ๋…ผ๋ฌธ์—์„œ ํ–ˆ๋˜ ์–˜๊ธฐ์˜€์ง€๋งŒ ์กฐ๊ธˆ ๊ทผ ๋ณธ์— ๋Œ€ํ•ด ๋‹ค๋ฃจ์–ด์„œ ๊ถ๊ธˆ์ฆ์„ ํ•ด๊ฒฐํ•ด๋ณด๊ณ  ์‹ถ์—ˆ์Šต๋‹ˆ๋‹ค.

 

Batch Normalization์— ๋Œ€ํ•œ ๊ฐœ๋…์„ ๋‹ค๋ฃฌ ํฌ์ŠคํŒ… (๊ธฐ์ดˆ์ ์ธ ๋‚ด์šฉ)

https://draw-code-boy.tistory.com/504

 

Batch Normalization์ด๋ž€? (Basic)

Batch Normalization Batch Normalization๋Š” ๋ชจ๋ธ์ด ๋ณต์žกํ•  ๋•Œ, ์ •ํ™•ํžˆ๋Š” Layer๊ฐ€ ๊นŠ์–ด์ง€๋ฉด์„œ ๊ฐ€์ง€๋Š” ๋ณต์žก๋„๊ฐ€ ๋†’์„ ๋•Œ ์ผ์–ด๋‚˜๋Š” Overfitting์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. Batch Normalization์ด ์–ด๋–ป๊ฒŒ ์ž‘๋™๋˜๋Š”์ง€

draw-code-boy.tistory.com

๐Ÿ“„ Paper Link

https://arxiv.org/abs/1502.03167

 



โœ… Abstract

DNN ๊ตฌ์กฐ์˜ Network๋ฅผ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ์ด์ „ layer์—์„œ Parameter๊ฐ€ ๋‹ฌ๋ผ์ง์— ๋”ฐ๋ผ ๊ฐ layer์˜ input์˜ Distribution์ด ๋‹ฌ๋ผ์ ธ์„œ ํ•™์Šต์‹œํ‚ค๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ํ˜„์ƒ์€ ๋‚ฎ์€ Learning Rate, ์กฐ์‹ฌ์Šค๋Ÿฌ์šด Parameter Initialization์„ ์š”๊ตฌํ•˜๊ณ , Distribution์ด ๋‹ฌ๋ผ์ง์— ๋”ฐ๋ผ์„œ non-linearity(= Activation)์—์„œ์˜ saturation(Activation์˜ ํŽธ๋ฏธ๋ถ„์ด 0์ด ๋˜์–ด ์—…๋ฐ์ดํŠธ๊ฐ€ ์—†์–ด์ง€๋Š” ํ˜„์ƒ)์— ๋”ฐ๋ผ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๊ฐ€ ์ค‘๋‹จ๋˜๋Š” ์‚ฌํƒœ๊นŒ์ง€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ํ˜„์ƒ์„ 'Internal Covariate Shift'๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ mini-batch์— ๋”ฐ๋ฅธ Normalization์„ ์ง„ํ–‰ํ•˜๊ณ , ์ด๋ฅผ ๋ชจ๋ธ์˜ ํ•œ ๋ถ€๋ถ„์œผ๋กœ ์ ์šฉ์„ ์‹œ์ผœ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

 

์ œ์‹œ๋œ ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ๋” ๋†’์€ Learning Rate๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๊ณ , Initialization์— ๋Œ€ํ•ด ์กฐ๊ธˆ ๋œ ์ฃผ์˜ํ•ด๋„ ๋˜๊ณ , Regularization์˜ ์—ญํ• ๊นŒ์ง€ ํ•˜์—ฌ Dropout์˜ ์˜์กด์„ฑ์„ ์ค„์ด๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.


โœ… 1. Introduction

๐Ÿ“„ Domain Adaptation

Introduction์—์„œ๋Š” SGD์— ๋Œ€ํ•œ ์„ค๋ช…๊ณผ mini-batch์˜ ์ด์ ์— ๋Œ€ํ•ด ๋งํ•˜๋ฉฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ , ๊ฐ layers์˜ input์—์„œ ๋ถ„ํฌ์˜ ๋ณ€ํ™”๋Š” ๊ฐ layer๋กœ ํ•˜์—ฌ๊ธˆ ์ƒˆ๋กœ์šด ๋ถ„ํฌ์— ์ง€์†์ ์œผ๋กœ ์ ์‘ํ•ด์•ผ ํ•œ๋‹ค๋Š” ๋ฌธ์ œ๋ฅผ ์‚ผ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. (Domain Adaptation)

 

๊ทธ๋Ÿฌ๋‚˜, Covariate Shift๋Š” ํ•˜์œ„์˜ sub-network๋‚˜ layer์—๋„ ์ ์‘๋˜๊ธฐ ์œ„ํ•ด์„œ Learning System ์ „์ฒด๋ฅผ ๋„˜์–ด ํ™•์žฅ๋  ์ˆ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์™œ ๊ทธ๋Ÿฐ์ง€์— ๋Œ€ํ•ด์„œ ํ•œ network๋ฅผ ์˜ˆ๋กœ ๋“ค์–ด ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

$$ l = F_2(F_1(u, \theta_{1}), \theta_{2}) $$

์ˆ˜์‹์— ๋Œ€ํ•ด ์„ค๋ช…ํ•˜๋ฉด, ์•„๋ž˜์™€ ๊ฐ™์€ ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

$$ F(input, parameter) $$

๋˜ํ•œ, \(F_2(F_1(u, \theta_{1}), \theta_{2})\)๋Š”\(x=F_1(u, \theta_{1})\)๋กœ ํ‘œํ˜„ํ•œ๋‹ค๋ฉด, \(F_2(x, \theta_{2})\)๋กœ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์ด์— ๋”ฐ๋ฅธ \(\theta_{2}\)์˜ mini-batch Gradient Descent์˜ ๊ณผ์ •์€ ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

$$ \theta_2 = \theta_2 - \alpha\frac{1}{m}\sum_{i=1}^{m}
\frac{\partial F_2(x_i, \theta_2)}{\partial \theta_2} $$

 

์ด๋Ÿฌํ•œ ๊ตฌ์กฐ๋Š” \(F_2\)๋ผ๋Š” ๋‹จ์ผ ๋„คํŠธ์›Œํฌ์— \(x\)๋ผ๋Š” input์ด ๋“ค์–ด๊ฐ€๋Š” ๊ฒƒ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋ฏ€๋กœ, training data๋‚˜ test data๊ฐ€ ๋ชจ๋‘ ๊ฐ™์€ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ง€๋Š” ๊ฒƒ์ด ํšจ์œจ์ ์ด๋ผ๋Š” ๋ง์„ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

input data์˜ ๋ถ„ํฌ๋ฅผ ์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ ๊ณ ์ •ํ•จ์œผ๋กœ์จ \(\theta_2\)๊ฐ€ ๊ณ„์† ๋ณ€ํ™”๋˜๋Š” ๋ถ„ํฌ์— ๋”ฐ๋ผ ์ƒˆ๋กญ๊ฒŒ ์ ์‘ํ•  ํ•„์š”๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.


๐Ÿ“„ Gradient Vanishing & Saturation

๋˜ํ•œ, sub-network์— ๋Œ€ํ•œ input์˜ ๊ณ ์ •๋œ ๋ถ„ํฌ๋Š” sub-network์˜ ์™ธ๋ถ€ layer๋“ค์—๋„ ๊ธ์ •์ ์ธ ํšจ๊ณผ๋ฅผ ์ดˆ๋ž˜ํ•  ๊ฒƒ์ด๋ผ๋Š” ๋ง์„ ํ•ฉ๋‹ˆ๋‹ค.

 

layer๋ฅผ ํ†ตํ•ด ์ถœ๋ ฅํ•œ output(\(Wu + b\))์„ ์‹œ๊ทธ๋ชจ์ด๋“œ์˜ ์ž…๋ ฅ ๊ฐ’์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜์˜€์„ ๋•Œ, ์ด ๊ฐ’์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์ง์— ๋”ฐ๋ผ Gradient Vanishing์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ, ํ•™์Šต์˜ ์†๋„๊ฐ€ ๋Š๋ ค์ง€๋Š” ๊ฒƒ์€ ๋ฌผ๋ก ์ž…๋‹ˆ๋‹ค.

 

๋˜ํ•œ, ์‹œ๊ทธ๋ชจ์ด๋“œ์— ๋Œ€ํ•œ ์ž…๋ ฅ๊ฐ’์€ W, b์˜ ์˜ํ–ฅ๊ณผ ๊ทธ ์•„๋ž˜์˜ layer์˜ ๋ชจ๋“  parameter๋“ค์˜ ์˜ํ–ฅ์„ ๋ฐ›๊ธฐ ๋•Œ๋ฌธ์— Back-Propagation์˜ ๊ณผ์ •์—์„œ Vanishing ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ Saturation ํ˜„์ƒ๊นŒ์ง€ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

(์ด์— ๋Œ€ํ•ด ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด sigmoid๋ณด๋‹ค๋Š” ReLU activation์„ ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค.)

 

๊ทธ๋ž˜์„œ, nonlinearity(= activation)์— ๋Œ€ํ•œ input์˜ distribution์ด ์•ˆ์ •์„ฑ์„ ๊ฐ–์ถ˜๋‹ค๋ฉด, Vanishing์ด๋‚˜ Saturation์˜ ๋ฌธ์ œ์— Gradient Descent๊ฐ€ ๋œ ๋ฐฉ์น˜๋  ์ˆ˜ ์žˆ๊ณ , ํ•™์Šต์„ ๋” ๊ฐ€์†์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


๐Ÿ“„ Suggest 'Batch Normalization'

layer๋ฅผ ๊ฑฐ์นจ์— ๋”ฐ๋ผ ๋ถ„ํฌ๊ฐ€ ๋ณ€ํ•˜๋Š” ํ˜„์ƒ์„ 'Internal Covariate Shift'๋ผ๊ณ  ์–ธ๊ธ‰ํ•ฉ๋‹ˆ๋‹ค.

์ด์— ๋”ฐ๋ผ ์ œ์‹œํ•œ ๋ฐฉ์•ˆ์ด 'Batch Normalization'์ž…๋‹ˆ๋‹ค.

 

๊ฐ input(mini-batch)์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ํ†ตํ•ด Normalization์„ ํ•จ์œผ๋กœ์จ ์–ป๋Š” ์ด์ ๋“ค์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • Accelerated training by allowing much higher learning rates
  • Better gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values
  • Regularizes the model and reduces the need for Dropout

โœ… 2. Towards Reducing Internal Covariate Shift

๐Ÿ“„ Whitening

๊ทธ๋ž˜์„œ Fixed Distribution์„ ์œ„ํ•ด Whitening ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•ด ๋ด…๋‹ˆ๋‹ค.

 

Whitening์ด๋ž€ ํ‰๊ท ์„ 0, ๋ถ„์‚ฐ์„ 1๋กœ ๋งŒ๋“ค๊ณ , ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ decorrelated ํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

- ์ด ๊ณผ์ •์—์„œ Covariance Matrix์™€ Inverse๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ„์‚ฐ๋Ÿ‰์ด ์—„์ฒญ ๋งŽ์•„์ง‘๋‹ˆ๋‹ค.

- ๋˜ํ•œ, ์ด ๊ณผ์ •์—์„œ ์„ ํ˜•๋Œ€์ˆ˜ํ•™์ด ์“ฐ์—ฌ ์„ ํ˜•๋Œ€์ˆ˜ํ•™์„ ๊ณต๋ถ€ํ•ด์•ผ๊ฒ ๋‹ค๋Š” ๊ณ„๊ธฐ๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

ํ•˜์ง€๋งŒ, Whitening์€ Optimization(= Gradient Descent)์˜ ๊ณผ์ •๊ณผ ๋ณ„๊ฐœ๋กœ ์ด๋ฃจ์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— Gradient Descent์˜ ๊ณผ์ •์—์„œ Whitening์— ๋Œ€ํ•œ ์—…๋ฐ์ดํŠธ๋„ ๋˜์–ด์•ผ ํ•˜๋Š”๋ฐ ๊ทธ๋Ÿฌ์งˆ ๋ชป ํ•ด์„œ Gradient Descent์˜ ํšจ๊ณผ๋ฅผ ๊ฐ์†Œ์‹œํ‚ต๋‹ˆ๋‹ค.

 

์˜ˆ๋ฅผ ๋“ค์–ด, ํ•œ layer์— ๋Œ€ํ•œ input์„ \(u\)๋ผ๊ณ  ๋‘๊ณ , ํ•™์Šต์„ ํ†ตํ•ด ๋งŒ๋“ค์–ด์ง„ bias๋ฅผ \(b\)๋ผ๊ณ  ํ•˜๊ณ ,

์ด๋กœ ์ธํ•ด \(x=u+b\)๋ผ๋Š” ๊ฒฐ๊ณผ๊ฐ€ ๋งŒ๋“ค์–ด์ง„๋‹ค๊ณ  ๊ฐ€์ •ํ•ด ๋ด…์‹œ๋‹ค.

์ „์ฒด ๋ฐ์ดํ„ฐ์˜ ํ‰๊ท ์„ ๋บŒ์œผ๋กœ์จ normalize๋ฅผ ํ•ฉ๋‹ˆ๋‹ค.

$$ \hat{x} = x - E[x] $$

(where \(x = u + b\), \(\chi = \{x_{1...N}\}\) is the set of values of \(x\) over the training set, and \(E[x] = \frac{1}{N}\sum_{i=1}^{N}x_i\))

 

Gradient Descent์— ๋”ฐ๋ผ \(b = b + \Delta b\)๋กœ ์—…๋ฐ์ดํŠธ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

(\(\Delta b = -\alpha\frac{\partial l}{\partial \hat{x}}\))

 

๊ทธ๋Ÿผ ์—…๋ฐ์ดํŠธ ๊ณผ์ •์—์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ์ผ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

$$ u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b] $$

 

์ด ๊ณผ์ •์— ๋”ฐ๋ผ b์˜ ์˜ํ–ฅ์€ ์‚ฌ๋ผ์ง€๊ณ , Loss์—์„œ๋Š” ๋ณ€ํ•จ์ด ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ, Back-Propagation์€ ๊ณ„์† ์ง„ํ–‰๋˜๋‹ˆ \(b\)๋งŒ ๋ฌด๊ธฐํ•œ์œผ๋กœ ์ปค์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋Š” ๋‹จ์ˆœํžˆ E[x]๋งŒ ๋นผ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ํ‘œ์ค€ํŽธ์ฐจ๋กœ ๋‚˜๋ˆ„์–ด์ฃผ๊ฑฐ๋‚˜ scaling์˜ ๊ณผ์ •๊นŒ์ง€ ํฌํ•จ๋  ๊ฒฝ์šฐ ๋”์šฑ ๋ชจ๋ธ์ด ์•…ํ™”๋˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

๊ฒฐ๋ก ์ ์œผ๋กœ, Normalization์ด Gradient descent Optimization์˜ ๊ณผ์ •์— ํ•„์ˆ˜์ ์œผ๋กœ ํฌํ•จ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 

๊ทธ๋ ‡๊ฒŒ ํ•จ์œผ๋กœ์จ, ๋ชจ๋ธ์˜ Parameter๋“ค์ด Normalization์„ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.


๐Ÿ“„ Whitening Back-Propagation

๊ทธ๋ž˜์„œ, \(x\)๋ฅผ input vector๋ผ๊ณ  ํ•ด๋ด…์‹œ๋‹ค. ๊ทธ๋ฆฌ๊ณ , \(\chi\)๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์ž…๋‹ˆ๋‹ค.

Normalize ํ•œ ๊ฐ’์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

$$ \hat{x} = Norm(x,\chi)$$

 

์ด๋ ‡๊ฒŒ ๋˜์—ˆ์„ ๋•Œ, Back-Propagation์˜ ๊ด€์ ์—์„œ ๋ณด๋ฉด

$$ \frac{\partial Norm(x,\chi)}{\partial x}, \frac{\partial Norm(x,\chi)}{\partial \chi} $$

 

์œ„ derivative๋“ค์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ณผ์ •์—์„œ Jacobian, Covariance Matrix, inverse ๋“ฑ์˜ ๊ฐœ๋…์ด ๋‚˜์˜ค๋Š”๋ฐ ์•„์ง ์„ ํ˜•๋Œ€์ˆ˜ํ•™์— ๊ด€ํ•ด ์ž˜ ๋ชจ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ๋Ÿ‰์ด ์—„์ฒญ ๋งŽ๋‹ค๋Š” ๊ฒƒ๋งŒ ์•Œ๊ณ  ๋„˜์–ด๊ฐ€๊ฒ ์Šต๋‹ˆ๋‹ค.

 

์ฆ‰, ๊ณ„์‚ฐ๋Ÿ‰์ด ์—„์ฒญ ๋งŽ๋‹ค๋Š” ๊ฒƒ ๋•Œ๋ฌธ์— ๋Œ€์ฒด์ ์ธ ๋ฐฉ๋ฒ•์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

 

๋Œ€์ฒด์ ์ธ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์š”๊ตฌ ์‚ฌํ•ญ์œผ๋กœ๋Š” ๋ฏธ๋ถ„์ด ๊ฐ€๋Šฅํ•œ Normalization ๋ฐฉ๋ฒ•์ด์–ด์•ผ ํ•˜๊ณ ,

parameter๋ฅผ ์—…๋ฐ์ดํŠธํ•  ๋•Œ, ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ ๋ถ„์„์ด ํ•„์š”๋กœ ์—†์—ˆ์œผ๋ฉด ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์ด์ „์˜ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๋Š” ๋‹จ์ผ sample์— ๋Œ€ํ•œ ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•˜๊ฑฐ๋‚˜, Image Network์˜ ๊ฒฝ์šฐ์—๋Š” ํŠน์ •ํ•œ ์ฃผ์–ด์ง„ ์œ„์น˜์˜ feature map์„ ์ด์šฉํ•ด ๊ณ„์‚ฐํ•˜๋„๋ก ํ–ˆ๋Š”๋ฐ, ์ด๋Ÿฌํ•œ ๊ฒƒ์€ ๋„คํŠธ์›Œํฌ์˜ ์ •๋ณด์— ๋Œ€ํ•ด ์†์‹ค์ด ์šฐ๋ ค๋˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ, ์ „์ฒด์ ์ธ ๋ฐ์ดํ„ฐ์…‹์ด ํ•„์š”ํ•˜์ง€๋งŒ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ ๊ณ„์‚ฐ๋Ÿ‰์ด ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— whitening์„ ๋Œ€์ฒดํ•  ๋ฐฉ๋ฒ•์ด ํ•„์š”ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.


โœ… 3. Normalization via Mini-Batch Statistics

whitening์ด ๊ณ„์‚ฐ๋Ÿ‰๋„ ๋งŽ๊ณ , ๋ชจ๋“  ๊ณณ์—์„œ ๋ฏธ๋ถ„ ๊ฐ€๋Šฅ์ด์ง€๋Š” ์•Š๊ธฐ ๋•Œ๋ฌธ์— 2๊ฐ€์ง€๋ฅผ Simplification(๊ฐ„๋‹จํ™”) ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“„ Simplification (1): Normalize each scalar feature independently

๊ฐ dimension ๋ณ„๋กœ scalar ๊ฐ’๋“ค์˜ Normalization์„ ์ง„ํ–‰ํ•˜์ž๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ Normalization์˜ ๊ธฐ๋ฒ•์€ decorrelated ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ์—๋„ ์ˆ˜๋ ด์„ ๊ฐ€์†ํ™”์‹œํ‚ต๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, \(x\)๋ผ๋Š” input vector๊ฐ€ d-dimension์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค๋ฉด, \(x = (x^{(1)}...x^{(d)})\)๋กœ ๋ณด์•„์„œ ์•„๋ž˜์™€ ๊ฐ™์ด Normalize ํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค.

$$ \hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}} $$
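As a minimal NumPy sketch of this per-dimension normalization (the data and shapes here are made up for illustration):

```python
import numpy as np

# 1000 samples with 3 features of very different scales and means
x = np.random.randn(1000, 3) * np.array([10.0, 0.5, 2.0]) + np.array([5.0, -1.0, 0.0])
x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0))   # normalize each dimension independently
print(x_hat.mean(axis=0).round(3), x_hat.var(axis=0).round(3))  # ≈ [0 0 0] and ≈ [1 1 1]
```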

 

๊ทธ๋ฆฌ๊ณ , ๊ฐ„๋‹จํ•˜๊ฒŒ Normalize๋งŒ ํ•ด์„œ๋Š” ์•ˆ ๋ฉ๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, sigmoid ํ•จ์ˆ˜๋กœ Normalized ๋œ ๊ฐ’๋“ค์„ ์ด์šฉํ•˜๋ฉด sigmoid์˜ Non-linearity๊ฐ€ ๋ณด์žฅ๋˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

๋ฌด์Šจ ์†Œ๋ฆฌ๋ƒ๋ฉด Gaussian Normal Distribution์— ๋”ฐ๋ฅด๋ฉด ์ •๊ทœํ™”๋œ ๊ฐ’๋“ค์˜ 99.7%๋Š” \([m-3\sigma, m+3\sigma]\)์— ์†ํ•œ๋‹ค๊ณ  ์•Œ๋ ค์ ธ ์žˆ์Šต๋‹ˆ๋‹ค. ์ •๊ทœํ™”๋œ ๊ฐ’๋“ค์„ ๋ชจ๋‘ sigmoid๋ฅผ ํ†ตํ•ด ํ†ต๊ณผ์‹œํ‚ค๋ฉด sigmoid์˜ Non-linearity๋ฅผ ์žƒ๊ฒŒ ๋œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋ฉด์„œ ๋…ผ๋ฌธ์—์„œ๋Š” ์ด ๋ง์„ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.

 

"the transformation inserted in the network can represent the identity transform."

 

transformation์„ Network์— ์ ์šฉํ•  ๋•Œ, ์˜๋„์ ์ธ ๋ณ€ํ™˜์„ ์ค„ ๋ฟ, ๋‹ค๋ฅธ ๋ณ€ํ™˜์„ ์ฃผ์ง€ ๋ง๋ผ๋Š” ๋œป์œผ๋กœ ๋‹ค๊ฐ€์™€์„œ ์ ์–ด๋‘์—ˆ์Šต๋‹ˆ๋‹ค.

๋ณ€ํ™˜์ด identity ํ•˜๊ฒŒ(=ํ•ญ๋“ฑ ํ•˜๊ฒŒ) ์ด๋ฃจ์–ด์ ธ์•ผ ํ•œ๋‹ค๋Š” ๋ง์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ, ์–ด๋–ป๊ฒŒ Non-linearity๋ฅผ ๋ณด์žฅ์‹œํ‚ฌ ๊ฑด์ง€์— ๋Œ€ํ•œ ๊ฒƒ์€ Normalize ๋œ ๊ฐ’์— parameter๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

$$ y^{(k)} = \gamma^{(k)}\hat{x}^{(k)} + \beta^{(k)} $$

 

\(\gamma^{(k)}\)๋Š” ์ดˆ๊ธฐ ๊ฐ’์ด \(\sqrt{Var[x^{(k)}]}\)๋กœ ์„ธํŒ…๋˜๊ณ , \(\beta^{(k)}\)๋Š” \(E[x^{(k)}]\)๋กœ ์„ธํŒ…์ด ๋˜์–ด ํ•™์Šต์„ ํ†ตํ•ด ์ตœ์ ์˜ Parameter๋ฅผ ์ฐพ์•„๊ฐ‘๋‹ˆ๋‹ค.


๐Ÿ“„ Simplification (2): mini-batches in stochastic gradient training

๊ฐ mini-batch์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ตฌํ•˜์—ฌ Gradient descent๋ฅผ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์ž๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

mini-batch \(B\)์˜ ์‚ฌ์ด์ฆˆ๋ฅผ \(m\)์ด๋ผ๊ณ  ๊ณ ๋ คํ•˜๊ณ , \(x^{(k)}\)๋Š” ํŽธ์˜์„ฑ์„ ์œ„ํ•ด \(k\)๋ฅผ ์ƒ๋žตํ•˜๋ฉด, ์•„๋ž˜์™€ ๊ฐ™์€ mini-batch๊ฐ€ ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.

$$ B = \{x_{1...m}\} $$

 

๊ทธ๋ฆฌ๊ณ , normalized ๋œ ๊ฐ’๋“ค \(\hat{x}_{1...m}\)์„ \(\gamma\)์™€ \(\beta\)์— ์˜ํ•œ ์„ ํ˜•์ ์ธ ๋ณ€ํ™˜์„ ๊ฑฐ์น˜๋ฉด Batch Normalization์ด ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

$$ BN_{\gamma, \beta} : x_{1...m} \rightarrow y_{1...m} $$

 

๊ฒฐ๊ณผ์ ์œผ๋กœ Batch Normalizaing Transform์— ๋Œ€ํ•ด ์ •๋ฆฌํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

\(\epsilon\)์ด๋ผ๋Š” ์ƒ์ˆ˜ ๊ฐ’์„ ๋”ํ•ด์ฃผ๋Š” ์ด์œ ๋Š” ๋ถ„์‚ฐ์ด 0์ด ๋˜๋Š” ๊ฒฝ์šฐ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ์ˆ˜์น˜์  ์•ˆ์ •์„ฑ์„ ์œ„ํ•ด ๋„ฃ์–ด์ค€๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

(Figure: the Batch Normalizing Transform, Algorithm 1 in the paper)
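Since the algorithm itself appears as an image in the original post, here is a minimal NumPy sketch of the same training-time forward pass; the function name and shapes are my own.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalizing Transform for a (m, d) mini-batch."""
    mu = x.mean(axis=0)                      # mini-batch mean, shape (d,)
    var = x.var(axis=0)                      # mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each feature
    y = gamma * x_hat + beta                 # scale and shift
    return y, x_hat, mu, var

# toy usage
x = np.random.randn(32, 4) * 5.0 + 3.0       # mini-batch of 32 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
y, *_ = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.var(axis=0))         # ≈ 0 and ≈ 1 per feature
```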

\(\gamma\)์™€ \(\beta\)๊ฐ€ ํ•™์Šต์„ ํ†ตํ•ด optimal ํ•œ ๊ฐ’์„ ์ฐพ์•„๊ฐˆ ๋•Œ, ์•„๋ฌด๋Ÿฐ ๊ณณ์—๋„ ์˜์กด์ ์ด์ง€ ์•Š์„ ๊ฑฐ๋ž€ ์ƒ๊ฐ์„ ํ•˜๋ฉด ์•ˆ ๋ฉ๋‹ˆ๋‹ค.

์˜คํžˆ๋ ค \(BN_{\gamma, \beta}(x)\)๋Š” ๋งŽ์€ ์ƒ˜ํ”Œ๋“ค์—๊ฒŒ ๋งŽ์ด ์˜์กดํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ , Batch Normalization์˜ parameter๋“ค์€ Back-Propagation์˜ ๊ณผ์ •์—์„œ Gradient Descent๋ฅผ ํ†ตํ•ด ํ•™์Šตํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋ฏธ๋ถ„์˜ ์—ฐ์‡„ ๋ฒ•์น™์— ์˜ํ•ด ์•„๋ž˜์™€ ๊ฐ™์ด ์ •๋ฆฌ ๊ฐ€๋Šฅํ•˜๋ฉฐ, ๋ฏธ๋ถ„์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์ด์— ๋”ฐ๋ผ Batch Normalization layer๋ฅผ ํ†ต๊ณผํ•จ์œผ๋กœ์จ, Internal Covariate Shift๋ฅผ ์ค„์ด๋Š” ๋ถ„ํฌ๋ฅผ ๊ฐ–๋Š” ๊ฐ’๋“ค์„ ๋‹ค์Œ sub-network๋กœ ๋„˜๊ธธ ์ˆ˜ ์žˆ๊ฒŒ ๋˜์–ด ํ•™์Šต์„ accelerate ์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋” ๋‚˜์•„๊ฐ€, Non-Linearity๋ฅผ ๋ณด์žฅํ•˜์—ฌ Identity Transform(ํ•ญ๋“ฑ ๋ณ€ํ™˜, ๋ณ„๋‹ค๋ฅธ ์†์‹ค์ด ์—†๋„๋ก)์ด ๋˜๋„๋ก ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋„คํŠธ์›Œํฌ์˜ ๋Šฅ๋ ฅ์„ ๋ณด์กดํ•ฉ๋‹ˆ๋‹ค.


๐Ÿ“„ 3.1 Training and Inference with Batch-Normalized Networks

Training์ด ์–ด๋–ป๊ฒŒ ๋˜๋Š”์ง€ ์•Œ์•˜์œผ๋‚˜ Inference๊ฐ€ ์ง„ํ–‰๋  ๋•Œ๋Š” ์–ด๋”˜๊ฐ€ ๋ถ€์กฑํ•œ ์ ์ด ๋ณด์ž…๋‹ˆ๋‹ค.

Inference ๋‹จ๊ณ„์—์„œ 'ํ‰๊ท ', '๋ถ„์‚ฐ'์€ ์–ด๋–ค ๊ฒƒ์„ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š”๊ฐ€?

๋…ผ๋ฌธ์—์„œ๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€ ๋ฐฉ์‹์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. 2๊ฐ€์ง€ ๋ฐฉ์‹ ๋‹ค ํ›ˆ๋ จ ์„ธํŠธ์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Population statistics

$$ \begin{align}
E[x] &= E_{B}[\mu_{B}] \\
Var[x] &= \frac{m}{m-1}\cdot E_{B}[\sigma_{B}^{2}]
\end{align} $$

Moving Average

$$ \frac{P_M+P_{M-1}+\cdot\cdot\cdot+P_{M-(n-1)}}{n} = \frac{1}{n}\sum_{i=0}^{n-1}P_{M-i} $$

 

Population statistics๋Š” ๋น„ํšจ์œจ์ ์ด๋ผ ๋ณดํ†ต Moving Average์˜ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

Inference๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ \(\gamma\)์™€ \(\beta\)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Scaling๊ณผ Shift๋ฅผ ํ•ฉ๋‹ˆ๋‹ค.


๐Ÿ“„ 3.2 Batch-Normalized Convolutional Networks

Covolution layer์—์„œ \(Wx+b\) ํ˜•ํƒœ๋กœ Batch Normalization์„ ํ•˜๋Š”๋ฐ, ์ด ๋•Œ๋Š” \(\beta\)๊ฐ€ shift์˜ ์—ญํ• ์„ ํ•˜๋ฉด์„œ \(b\)์˜ ์—ญํ• ์„ ๋Œ€์ฒดํ•˜๊ธฐ ๋•Œ๋ฌธ์— \(b\)๋ฅผ ์—†์•ฑ๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ , CNN์˜ ์„ฑ์งˆ์„ ์œ ์ง€์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฐ Channel์„ ๊ธฐ์ค€์œผ๋กœ Batch Normalization์— ๋Œ€ํ•œ ๋ณ€์ˆ˜๊ฐ€ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

Convolution์„ ์ ์šฉํ•œ Feature Map์˜ ํฌ๊ธฐ๊ฐ€ p x q๋ผ๊ณ  ํ•˜๊ณ , Mini-batch์˜ ์‚ฌ์ด์ฆˆ๊ฐ€ m์ด๋ผ๊ณ  ํ•œ๋‹ค๋ฉด,

Channel ๋ณ„๋กœ m x p x q์— ๋Œ€ํ•œ Mean๊ณผ Variance๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ Channel์ด n๊ฐœ ๋ผ๋ฉด, ์ด n๊ฐœ์˜ Batch Normalization Parameter(\(\gamma, \beta\))๊ฐ€ ์ƒ๊ธฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ฒฐ๊ณผ์ ์œผ๋กœ, \(\gamma\)์™€ \(\beta\)๋Š” Channel๋ณ„๋กœ ์กด์žฌํ•˜์—ฌ CNN์˜ ์„ฑ์งˆ์„ ์‚ด๋ฆฝ๋‹ˆ๋‹ค.


๐Ÿ“„ 3.3 Batch Normalization enables higher learning rates

Batch Normalization์„ ํ•จ์œผ๋กœ์จ parameter์˜ ์ž‘์€ ๋ณ€ํ™”๊ฐ€ ํฌ๊ฒŒ ๋ณ€๋™ ์‚ฌํ•ญ์„ ๊ฐ€์ ธ์˜ค์ง€ ์•Š๊ฒŒ ๋จ์œผ๋กœ์จ, ๋” ๋†’์€ Learning rate๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

 

๋˜ํ•œ, Non-linearity์˜ ์ƒ์‹ค์— ๋Œ€ํ•œ ๋ฌธ์ œ๋ฅผ ์˜ˆ๋ฐฉํ•จ์œผ๋กœ์จ Saturated Regime ๊ฐ™์€ ์œ„ํ—˜์„ฑ๋„ ์ค„์˜€์Šต๋‹ˆ๋‹ค.


๐Ÿ“„ 3.4 Batch Normalization regularizes the model

Batch Normalization์ด Regularization์˜ ์—ญํ• ์„ ํ•˜๋ฉด์„œ Overfitting์„ ์ค„์ด๊ธฐ ์œ„ํ•œ Dropout์„ ์—†์• ๊ฑฐ๋‚˜ ์ค„์—ฌ๋„ ๋œ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.


โœ… 4. Experiments & 5. Conclusion (skip)

๊ฒฐ๊ตญ Batch Normalization์ด ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ณ๋Š”๋‹ค๋Š” ์–˜๊ธฐ๋กœ ์ƒ๋žตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.


โœ… Outro

๊ถ๊ธˆ์ฆ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋…ผ๋ฌธ์„ ์ฝ๋‹ค๊ฐ€ Batch Normalization์„ ์ ์šฉํ•จ์— ๋”ฐ๋ผ ์–ป๋Š” ์ด์ ๋“ค๋„ ๊ณต๋ถ€ํ•˜๊ฒŒ ๋˜๋ฉด์„œ ๊ธฐ์กด ํ”„๋กœ์ ํŠธ(Cat & Dog Classification Version 2)์—์„œ ๋””๋ฒจ๋กญ์‹œํ‚ฌ ๋ถ€๋ถ„๋“ค์ด ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

์•„๋ž˜์™€ ๊ฐ™์€ ๋ถ€๋ถ„๋“ค์„ ์ ์šฉํ•˜์—ฌ Version 2.1์„ ๋งŒ๋“ค์–ด๋ณผ ์ƒ๊ฐ์ž…๋‹ˆ๋‹ค.

  • Regularization์˜ ์—ญํ• ์„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— Dropout์˜ ํ•„์š”์„ฑ์„ ์ค„์ž„.
  • 3. Normalization via Mini-Batch Statistics์—์„œ Non-linearity์— ๋Œ€ํ•œ ๋ฐฉ์•ˆ์„ ๋‹ค๋ฃจ๋Š” ๊ฒƒ์„ ๋ณด์•„ BN(X) -> Activation์˜ ๊ตฌ์กฐ๋กœ ์“ฐ์ด๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์ž„.

๊ทธ๋ฆฌ๊ณ , Batch Normalization์— ๋Œ€ํ•œ ํ›„์† ๋…ผ๋ฌธ 'How Does Batch Normalization Help Optimization'๋„ ์–ธ์  ๊ฐ€ ๊ธฐํšŒ๊ฐ€ ๋œ๋‹ค๋ฉด ์ฝ์–ด ๋ด์•ผ ํ•  ๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹ค. Intenal Covariate Shift๊ฐ€ ์–ด๋–ป๊ฒŒ ์ œ๊ฑฐ๋˜์—ˆ๋Š”์ง€ ์„ค๋ช…์ด ๋ถ€์กฑํ•˜์—ฌ ์‹คํ—˜์„ ํ†ตํ•ด BN์ด ์„ฑ๋Šฅ์ด ์ข‹์€ ์ด์œ ์— ๋Œ€ํ•ด ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

 

๋˜ํ•œ, 2023๋…„์˜ ๊ณ„ํš์ด์—ˆ๋˜ ์„ ํ˜•๋Œ€์ˆ˜ํ•™, ํ†ต๊ณ„ํ•™, ๋ฏธ์ ๋ถ„ํ•™์— ๋Œ€ํ•ด ๊ณ„ํš์„ ์„ธ์šฐ๋ฉด์„œ๋„ '์•„์ง ๋‚ด๊ฐ€ ํ•„์š”ํ• ๊นŒ'๋ผ๋Š” ๊ฑฐ๋ฆฌ๋‚Œ์— ๋Œ€ํ•œ ์ƒ๊ฐ์ด ์žˆ์—ˆ์ง€๋งŒ, ์ฒ˜์Œ์œผ๋กœ ์›๋ก ์ ์ธ ์ด์•ผ๊ธฐ๋ฅผ ๋‹ค๋ฃจ์–ด๋ณด๋ฉด์„œ '๋‹น์žฅ ํ•„์š”ํ•˜๊ตฌ๋‚˜'๋ฅผ ๊นจ๋‹ฌ์€ ๊ณต๋ถ€์˜€์Šต๋‹ˆ๋‹ค.

์˜ฌํ•ด ๊ณ„ํš์— ํ™•์‹คํžˆ ๊ฑฐ๋ฆฌ๋‚Œ ์—†์ด ์ง„ํ–‰ํ•ด์•ผ ํ•  ๋ถ€๋ถ„๋“ค์ด๋ผ๋Š” ํ™•์‹ ์ด ๋“ค์—ˆ์—ˆ์Šต๋‹ˆ๋‹ค.

 

์ถ”๊ฐ€์ ์œผ๋กœ, ์˜์–ด์— ๋Œ€ํ•œ ๊ณต๋ถ€๋„ ์ •๋ง ๋งŽ์ด ํ•„์š”ํ•˜๊ตฌ๋‚˜๋ฅผ ๋Š๊ผˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋ฅด๋Š” ๋‹จ์–ด๋งŒ ์•Œ์•„๋‚ด๋ฉด ํ•ด์„์ด ๋งค๋„๋Ÿฌ์šธ ๊ฑฐ๋ž€ ์ƒ๊ฐ์„ ํ–ˆ์—ˆ๋Š”๋ฐ ๋ฌธ๋งฅ์ƒ ๋ฌด์Šจ ์–˜๊ธฐ๋ฅผ ํ•˜๋Š” ๊ฑด๊ฐ€ ์‹ถ์–ด์„œ ํ•ด์„๋ณธ์„ ๋งŽ์ด ์ฐพ์•„๋ณด์•„์•ผ ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฌผ๋ก , ์ˆ˜ํ•™์  ์ง€์‹์ด ์—†๋‹ค๋Š” ๊ฒƒ๋„ ํ•ด์„์— ๋Œ€ํ•ด ํ•œ๋ชซํ–ˆ์Šต๋‹ˆ๋‹ค.

 

๋งˆ์ง€๋ง‰์œผ๋กœ, ๋‹ค์Œ ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ๋ถ€ํ„ฐ๋Š” ์กฐ๊ธˆ ๋” ๊ฐ„๋žตํ™”ํ•˜์—ฌ ์ž‘์„ฑํ•ด ๋ณผ ์ƒ๊ฐ์ž…๋‹ˆ๋‹ค. ์กฐ๊ธˆ ๋” ํ•„์š”ํ•œ ํ•„์ˆ˜์ ์ธ ์ •๋ณด๋“ค์„ ์ œ๊ณตํ•˜์—ฌ ๋ถˆํ•„์š”ํ•œ ์‹œ๊ฐ„๊ณผ ๊ธ€์„ ์ค„์—ฌ์„œ ๋” ํšจ์œจ์ ์œผ๋กœ ๊ธ€์„ ์ž‘์„ฑํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค :)


โœ… Reference

  • Whitening transformation: https://dive-into-ds.tistory.com/19
  • [Deep Learning] Batch Normalization: https://eehoeskrap.tistory.com/430
  • Batch Normalization (gaussian37's blog): https://gaussian37.github.io/dl-concept-batchnorm/
  • Batch Normalization explanation and implementation: https://shuuki4.wordpress.com/2016/01/13/batch-normalization-%EC%84%A4%EB%AA%85-%EB%B0%8F-%EA%B5%AC%ED%98%84/
  • Understanding Batch Normalization properly: https://lifeignite.tistory.com/47

 
