๊ด€๋ฆฌ ๋ฉ”๋‰ด

Doby's Lab

backward()๋Š” ์—ญ์ „ํŒŒ๋ฅผ ์–ด๋–ป๊ฒŒ ํ• ๊นŒ? (Autograd์˜ Node) ๋ณธ๋ฌธ

Code about AI/PyTorch

backward()๋Š” ์—ญ์ „ํŒŒ๋ฅผ ์–ด๋–ป๊ฒŒ ํ• ๊นŒ? (Autograd์˜ Node)

๋„๋น„(Doby) 2023. 11. 19. 15:13

๐Ÿค” Problem

PyTorch์˜ Tensor๋Š” requires_grad๊ฐ€ True๋กœ ๋˜์–ด์žˆ์„ ๋•Œ, ๋ณ€ํ™”๋„์— ๋Œ€ํ•œ ์—ฐ์‚ฐ์˜ ์ถ”์ ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

์ฆ‰, ๊ฐ Tensor์— ๋Œ€ํ•ด์„œ .grad ์†์„ฑ๊ณผ .grad_fn ์†์„ฑ์ด ์ƒ๊น๋‹ˆ๋‹ค.

 

.grad๋Š” ํ˜„์žฌ Tensor์— ๋Œ€ํ•ด ๋ชฉ์ ํ•จ์ˆ˜๊ฐ€ ์–ผ๋งˆํผ ๋ณ€ํ–ˆ๋‚˜์— ๋Œ€ํ•œ ๋ณ€ํ™”๋„์˜ ๊ฐ’, ์ฆ‰ ๋ฏธ๋ถ„ ๊ฐ’์„ ๋‹ด๊ณ  ์žˆ์œผ๋ฉฐ,

.grad_fn์€ ์ด์ „ Tensor์— ๋Œ€ํ•ด์„œ ํ˜„์žฌ Tensor๋ฅผ ๋ฏธ๋ถ„ํ•ด ์ค„ ๋•Œ, ์–ด๋– ํ•œ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๋ฏธ๋ถ„์„ ํ•ด์ฃผ์–ด์•ผ ํ•˜๋Š”์ง€ ํŠน์ • ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๋ฏธ๋ถ„ ํ•จ์ˆ˜ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. (๊ฐ’์ด ์•„๋‹Œ ํ•จ์ˆ˜ ์ •๋ณด์ž„์„ ์œ ์˜) ์˜ˆ๋ฅผ ๋“ค์–ด, b = a + 2๋ผ๋ฉด, b์—๋Š” a์— ๋Œ€ํ•ด ๋ฏธ๋ถ„์„ ํ•  ๋•Œ, ๋”ํ•˜๊ธฐ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ์ƒ์„ฑ์ด ๋˜์—ˆ์œผ๋‹ˆ ๋”ํ•˜๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ ๋งŒ๋“ค์–ด์กŒ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ณ  ์žˆ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ ์ œ๊ฐ€ ๊ถ๊ธˆํ•œ ์ ์ด์ž ์ด๋ฒˆ ํฌ์ŠคํŒ…์˜ ์ฃผ์ œ๋Š” 'backward()๋ฅผ ํ˜ธ์ถœํ–ˆ์„ ๋•Œ, ๋„๋Œ€์ฒด ์–ด๋–ป๊ฒŒ .grad๋ฅผ ๊ฐฑ์‹ ํ•˜๋Š”๊ฐ€?'์ž…๋‹ˆ๋‹ค. ๋„ˆ๋ฌด ๋‹น์—ฐํ•˜๊ฒŒ๋„ ์ด๋ก ์ ์œผ๋กœ๋Š” ์—ญ์ „ํŒŒ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•˜๋ฉด, .grad์—๋Š” ๋ณ€ํ™”๋„์— ๋Œ€ํ•œ ์ •๋ณด๊ฐ€ ๋“ค์–ด๊ฐ€๋Š” ๊ฒƒ์ด ๋งž์ง€๋งŒ, ํ˜„์žฌ ์ œ๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ์ •๋ณด๋กœ๋Š” ๊ทธ๊ฑด ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค๊ณ  ํŒ๋‹จํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

 

์ด๋ ‡๊ฒŒ ํŒ๋‹จํ•œ ๊ทผ๊ฑฐ์—๋Š” Tensor์—์„œ .grad์™€ .grad_fn๊ณผ ๊ฐ™์€ ์†์„ฑ, ํ˜น์€ ๋‹ค๋ฅธ ์†์„ฑ๋“ค์—์„œ๋„ ํ˜„์žฌ Tensor์—์„œ๋Š” ์ด์ „ Tensor์— ๋Œ€ํ•œ ์œ„์น˜ ์ •๋ณด(์†์„ฑ)๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์—ˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

 

์™œ๋ƒํ•˜๋ฉด, ํ˜„์žฌ Tensor๋Š” ์ด ์œ„์น˜ ์ •๋ณด๋ฅผ ์•Œ๊ณ  ์žˆ์–ด์•ผ ์ด ์œ„์น˜์—๊ฒŒ ์–ด๋–ค ์ •๋ณด๋“  ๋„˜๊ฒจ์ค„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์—ญ์ „ํŒŒ๋ฅผ ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” 'ํ˜„์žฌ Tensor๋Š” ์–ด๋– ํ•œ Tensor๋กœ๋ถ€ํ„ฐ ๋‚˜์™”๋Š”์ง€' ๋ฌด์กฐ๊ฑด ์•Œ๊ณ  ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

์ด๋ฒˆ ํฌ์ŠคํŒ…์˜ ๋‚ด์šฉ์€ ๋‹ค์†Œ ํ˜ผ์žก์Šค๋Ÿฌ์šธ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋งจ ํ•˜๋‹จ์˜ Summary๋ฅผ ๊ผญ ์ฝ์–ด์ฃผ์‹œ๊ฑฐ๋‚˜ ๋Œ“๊ธ€์„ ๋‚จ๊ฒจ์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค :)


๐Ÿง Computational Graph (์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„)

์ด ๋ฌธ์ œ์™€ ๊ด€๋ จํ•ด์„œ ์ œ์ผ ๋ฐ€์ ‘ํ•œ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ํ‚ค์›Œ๋“œ๋Š” Computational Graph์ž…๋‹ˆ๋‹ค. ์ด์— ๋Œ€ํ•œ ์ •๋ณด๋Š” PyTorch ํ•œ๊ตญ ์‚ฌ์šฉ์ž ๋ชจ์ž„์—์„œ๋„ ์ด๋ฏธ ์ œ๊ณต์„ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Forward Propagation์„ ํ†ตํ•ด์„œ ๊ตฌ์ถ•์ด ๋˜๋Š” Computational Graph๋Š” ๋™์ ์œผ๋กœ ๊ตฌ์ถ•์ด ๋ฉ๋‹ˆ๋‹ค. ๋™์ ์œผ๋กœ ๊ตฌ์ถ•์ด ๋˜๋Š” ์ด์œ ๋Š” ์ด๋‹ค์Œ์— ์–ด๋– ํ•œ Forward๊ฐ€ ๋ฐœ์ƒํ• ์ง€ ๋ชจ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์ •์ ์œผ๋กœ ๊ด€๋ฆฌ๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ Computational Graph์—์„œ๋Š” ํ˜„์žฌ Tensor์˜ ์ด์ „ Tensor๋Š” ๋ฌด์—‡์ธ์ง€์™€ ๊ฐ™์€ ์ •๋ณด๋ฅผ ๊ด€๋ฆฌํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.


๐Ÿ˜€ Solution: Autograd์˜ Node

ํ•˜์ง€๋งŒ, ์œ„์—์„œ ๋งํ•œ Computational Graph๋Š” PyTorch์— ๋Œ€ํ•œ ์„ค๋ช…์ด์—ˆ์„ ๋ฟ, ์• ์ดˆ์— ๋ฏธ๋ถ„์„ ์ž‘๋™์‹œํ‚ค๊ณ  ์žˆ๋Š” Autograd์—์„œ๋Š” Computational Graph์— ๋Œ€ํ•ด ๋” ๊นŠ์ด ๋‹ค๋ฃจ๊ณ  ์žˆ์„ ๊ฑฐ๋ผ๋Š” ์ƒ๊ฐ์ด ๋“ค์—ˆ๊ณ , ์ด์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ Autograd์˜ ๊ณต์‹ ๋ฌธ์„œ๋ฅผ ์ฐพ์•„๋ณด์•˜์Šต๋‹ˆ๋‹ค.

Implementation, autograd 1.0.0 documentation: https://autograd.readthedocs.io/en/latest/implementation.html

 

ํ•ด๋‹น ๊ณต์‹ ๋ฌธ์„œ์—์„œ๋Š” Autograd๊ฐ€ ์–ด๋–ป๊ฒŒ ๊ตฌ์„ฑ์ด ๋˜์–ด์žˆ๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ํฌ๊ฒŒ Variable, Block, Node๋กœ ์ •๋ฆฌํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์ด 3๊ฐ€์ง€์—์„œ ๊ด€๋ฆฌํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ๋“ค์ด ์กฐ๊ธˆ์”ฉ ๋‹ค๋ฅด๊ณ , ํ—ท๊ฐˆ๋ฆด ์ˆ˜ ์žˆ๋Š” ๋ชจํ˜ธ์„ฑ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์›ํ•˜๋Š” ์ •๋ณด๊ฐ€ ์žˆ๋Š” Node์— ๋Œ€ํ•ด์„œ๋งŒ ์ด์•ผ๊ธฐํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

Node๋Š” Autograd์—์„œ ๊ด€๋ฆฌํ•˜๋Š” Data Structure๋กœ y = f(x)์™€ ๊ฐ™์€ ํ•˜๋‚˜์˜ ํ•จ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ณผ์ •์— ๋Œ€ํ•ด ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„๋ฅผ ์ƒ์„ฑํ•˜๋ฉฐ, ํ•จ์ˆ˜ ๋‚ด๋ถ€์˜ ๊ฐ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๊ด€๊ณ„์™€ Gradient๋ฅผ Node๋กœ ๊ด€๋ฆฌํ•ฉ๋‹ˆ๋‹ค. y = g(f(x))๋„ ๊ฒฐ๊ตญ ํŽผ์ณ๋ณด์•˜์„ ๋•Œ, ํ•˜๋‚˜์˜ ํ•จ์ˆ˜๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์—ฐ์‚ฐ์œผ๋กœ ์—ฎ์ธ ๊ฒƒ๊ณผ ๊ฐ™๊ธฐ ๋•Œ๋ฌธ์— Node๋ฅผ ํ•˜๋‚˜์˜ Tensor๋ผ ๋ณด๊ณ  ์ดํ•ดํ•ด๋„ ๋ฌด๋ฐฉํ•ฉ๋‹ˆ๋‹ค.

 

Forward๊ฐ€ ์ผ์–ด๋‚˜๋Š” ๊ฒฝ์šฐ, PyTorch๊ฐ€ ์•„๋‹Œ PyTorch ๋‚ด๋ถ€์˜ Autograd์—์„œ๋Š” Computational Graph๋ฅผ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ทธ๋ž˜ํ”„์˜ ์ •๋ณด๊ฐ€ ๋‹ด๊ธด ๊ฒƒ์ด Autograd์˜ Node์ž…๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ Node์ธ Loss๋ฅผ Root๋กœ ์žก๊ณ , ์—ฐ์‚ฐ์˜ ์ˆœ๋ฐฉํ–ฅ๊ณผ ๋ฐ˜๋Œ€์ธ ์—ญ๋ฐฉํ–ฅ์œผ๋กœ ํŠธ๋ฆฌ๋ฅผ ํ˜•์„ฑํ•ฉ๋‹ˆ๋‹ค. ์œ„ ๊ทธ๋ฆผ์—์„œ ํ™”์‚ดํ‘œ๋ฅผ ๋ฐ˜๋Œ€๋กœ ๋ฐ”๊พผ ๊ฒƒ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๐Ÿง Node

Node์—๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€ ์†์„ฑ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

 

1. children

2. gradient

 

์‚ฌ์‹ค, ์ด๋ฏธ childrens๋ผ๋Š” ์†์„ฑ์„ ๊ฐ€์ง„ ๊ฒƒ๋ถ€ํ„ฐ '์•„ ์ด์ „ Tensor์— ๋Œ€ํ•œ ์œ„์น˜ ์ •๋ณด๋Š” Autograd์˜ Node์—์„œ ๊ด€๋ฆฌ๊ฐ€ ๋˜๊ฒ ๊ตฌ๋‚˜'๋ผ๋Š” ๊ฑธ ์ง์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. children์—์„œ ๊ด€๋ฆฌํ•˜๋Š” ๊ฒƒ์€ ํ˜„์žฌ Node์˜ Child Node๊ฐ€ ๋ฌด์—‡์ธ์ง€, ๊ทธ๋ฆฌ๊ณ  Child Node์™€ ์–ด๋– ํ•œ ๋ฏธ๋ถ„ ๊ฐ’(Gradient)์„ ๊ฐ–๋Š”์ง€์— ๋Œ€ํ•ด Jacobian Matrix ํ˜•ํƒœ๋กœ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

x = Variable(2)
y = sin(x)

# After the forward pass, y's node records x's node as a child,
# together with the local derivative d(sin x)/dx = cos(x)
y.node.childrens = [{'node': x.node, 'jacobian': cos(x.data)}]

๊ทธ๋ฆฌ๊ณ , gradient์—์„œ๋Š” ํ˜„์žฌ Node์— ๋Œ€ํ•œ root(๋ชฉ์ ํ•จ์ˆ˜)์˜ ๋ฏธ๋ถ„ ๊ฐ’์„ ๊ด€๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

 

๊ณต์‹ ๋ฌธ์„œ์—์„œ ์ œ๊ณตํ•˜๋Š” child์˜ gradient(์ด์ „ Tensor์˜ .grad)๊ฐ€ ๊ฐฑ์‹ ์ด ๋˜์–ด๊ฐ€๋Š” ๊ณผ์ •์„ ๋‚˜ํƒ€๋‚ด๋Š” ์ฝ”๋“œ๋ฅผ ๊ฐ€์ ธ์™”์Šต๋‹ˆ๋‹ค.

for child in self.childrens:
    node, jacobian = child['node'], child['jacobian']
    # chain rule: d(Loss)/d(child) = d(Loss)/d(self) * d(self)/d(child)
    new_grad = np.dot(self.gradient, jacobian)
    node.update_gradient(new_grad)  # update/accumulate the child's gradient

์ด ์ฝ”๋“œ์—์„œ ์•Œ ์ˆ˜ ์žˆ๋Š” ๊ฑด self(= ํ˜„์žฌ Node)์—์„œ child(= ์ด์ „ Tensor)์— ์ ‘๊ทผ์„ ํ•˜์—ฌ ํ˜„์žฌ Node์— ๋Œ€ํ•œ ๋ชฉ์  ํ•จ์ˆ˜์˜ Gradient์™€ child Node์— ๋Œ€ํ•ด ํ˜„์žฌ Node๋ฅผ ๋ฏธ๋ถ„ํ•œ ๊ฐ’์„ ๊ณฑํ•˜์—ฌ child Node์˜ gradient(์ด์ „ Tensor์˜ .grad)๊ฐ€ ๊ฐฑ์‹ ์ด ๋˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ ๊ณฑํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋ฏธ๋ถ„ ๊ฐ’์ด ๊ตฌํ•ด์ง€๋Š” ๊ฒƒ์€ ๋ฏธ๋ถ„์˜ ์—ฐ์‡„ ๋ฒ•์น™์— ์˜ํ•ด์„œ ๊ฐ€๋Šฅํ•œ ๊ฒƒ์ด๋ฉฐ ์ด ๋ง์„ ์ˆ˜์‹์œผ๋กœ ๋‚˜ํƒ€๋ƒˆ์„ ๋•Œ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

Related post on the chain rule of differentiation (in Korean): https://draw-code-boy.tistory.com/517

$$ \begin{aligned}
\text{self.gradient} &= \frac{\partial\,\text{Loss}}{\partial\,\text{Now}} \\
\text{child['jacobian']} &= \frac{\partial\,\text{Now}}{\partial\,\text{child}} \\
\frac{\partial\,\text{Loss}}{\partial\,\text{child}} &= \frac{\partial\,\text{Loss}}{\partial\,\text{Now}} \cdot \frac{\partial\,\text{Now}}{\partial\,\text{child}}
\end{aligned} $$
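Putting the pieces together, here is a small self-contained sketch of the Node structure as I understand it from the documentation. The Node class and the concrete numbers below are purely illustrative, not the actual autograd or PyTorch implementation:

import numpy as np

class Node:
    def __init__(self):
        self.childrens = []   # [{'node': child_node, 'jacobian': d(self)/d(child)}]
        self.gradient = 0.0   # d(Loss)/d(this node), filled in during backward

    def update_gradient(self, grad):
        self.gradient += grad  # gradients arriving over multiple paths are summed

    def backward(self):
        # push this node's gradient down to every child via the chain rule
        for child in self.childrens:
            node, jacobian = child['node'], child['jacobian']
            node.update_gradient(np.dot(self.gradient, jacobian))
            node.backward()

# Example: loss = (sin x)^2 at x = 2
x_node, y_node, loss_node = Node(), Node(), Node()
y_node.childrens = [{'node': x_node, 'jacobian': np.cos(2.0)}]         # dy/dx = cos(x)
loss_node.childrens = [{'node': y_node, 'jacobian': 2 * np.sin(2.0)}]  # d(loss)/dy = 2y
loss_node.gradient = 1.0                                               # d(loss)/d(loss)
loss_node.backward()
print(x_node.gradient)  # 2*sin(2)*cos(2) = sin(4) ≈ -0.757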

 

๋ฌผ๋ก , Autograd์—์„œ๋„ Computational Graph๋Š” ์–ด๋– ํ•œ ๋‹ค๋ฅธ Forward๊ฐ€ ๋ฐœ์ƒํ• ์ง€ ๋ชจ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ๋™์ ์œผ๋กœ ๊ด€๋ฆฌ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.


โœ… Summary

์ด๋ฒˆ ๋‚ด์šฉ์ด ๋ณต์žกํ–ˆ๋˜ ๋งŒํผ ์ œ๊ฐ€ ์ฐพ์•„๋ณด๊ณ ์ž ํ–ˆ๋˜ ์ •๋ณด์— ๋Œ€ํ•ด์„œ ๊ฐ„๋‹จํ•˜๊ฒŒ ์ •๋ฆฌ๋ฅผ ํ•˜์ž๋ฉด, PyTorch์—์„œ Back Propagation์ด ์ผ์–ด๋‚  ๋•Œ ํ˜„์žฌ Tensor์—์„œ ์ด์ „ Tensor์—๊ฒŒ Gradient์™€ ๊ฐ™์€ ์ •๋ณด๋“ค์„ ๋„˜๊ธธ ๋•Œ, ์ด์ „ Tensor์˜ ์œ„์น˜๋ฅผ  ์–ด๋–ป๊ฒŒ ์•Œ๊ณ ์žˆ๋Š”๊ฐ€์— ๋Œ€ํ•œ ์˜๋ฌธ์„ ํ’ˆ์—ˆ์—ˆ์Šต๋‹ˆ๋‹ค.

 

์ด์— ๋Œ€ํ•ด์„œ๋Š” Autograd์˜ Computational Graph๋ฅผ ํ†ตํ•ด์„œ ํ˜„์žฌ Node์˜ Child Node ์ •๋ณด๋กœ ๊ด€๋ฆฌ๊ฐ€ ๋˜๊ณ  ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๊ณต์‹ ๋ฌธ์„œ๋ฅผ ํ†ตํ•ด ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.

 

 
