๊ด€๋ฆฌ ๋ฉ”๋‰ด

Doby's Lab

nn.Parameter(), ์ด๊ฑธ ์จ์•ผ ํ•˜๋Š” ์ด์œ ๊ฐ€ ๋ญ˜๊นŒ? (tensor์™€ ๋ช…๋ฐฑํ•˜๊ฒŒ ๋‹ค๋ฅธ ์ ) ๋ณธ๋ฌธ

Code about AI/PyTorch

nn.Parameter(), ์ด๊ฑธ ์จ์•ผ ํ•˜๋Š” ์ด์œ ๊ฐ€ ๋ญ˜๊นŒ? (tensor์™€ ๋ช…๋ฐฑํ•˜๊ฒŒ ๋‹ค๋ฅธ ์ )

๋„๋น„(Doby) 2024. 4. 29. 00:28

๐Ÿค” Problem

๋ฌธ๋“ ์˜ˆ์ „์— ViT๋ฅผ ๊ตฌํ˜„ํ•ด ๋†“์€ ์ฝ”๋“œ๋ฅผ ๋ณด๋‹ค๊ฐ€ ๊ทธ๋Ÿฐ ์ƒ๊ฐ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. '๋‚ด๊ฐ€ ์ €๊ธฐ์„œ nn.Parameter()๋ฅผ ์™œ ์ผ๋”๋ผ?' ์ง€๊ธˆ ์ƒ๊ฐํ•ด ๋ณด๋ฉด, ๊ทธ๋ƒฅ tensor๋ฅผ ์จ๋„ ๋˜‘๊ฐ™์€ ์ฝ”๋“œ์ผ ํ…๋ฐ ๋ง์ž…๋‹ˆ๋‹ค.

 

์ด๋•Œ ๋‹น์‹œ์— Attention์„ ๊ตฌํ˜„ํ•˜๋ฉด์„œ Query, Key, Value๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๊ธฐ ์œ„ํ•œ ๋ชฉ์ ์œผ๋กœ Weight Matrix๊ฐ€ ํ•„์š”ํ–ˆ์—ˆ๊ณ , ์—ฌ๋Ÿฌ ์˜คํ”ˆ ์†Œ์Šค๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด์„œ ๊ตฌํ˜„ํ•˜๋‹ค๊ฐ€ ๋ฌด์‹ฌํ•˜๊ฒŒ ์ผ์—ˆ๋˜ ๊ธฐ์–ต์ด ๋‚ฉ๋‹ˆ๋‹ค.

import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, embedding_length, qkv_vec_length):
        '''
        embedding_length : length of a single embedding -> becomes the row count of W_(qkv)
        qkv_vec_length   : column count of the W_(qkv) matrices
        '''
        super().__init__()

        # Query matrix (weight)
        # (Note: requires_grad=True is redundant here, since nn.Parameter
        #  already defaults to requires_grad=True.)
        self.W_q = nn.Parameter(torch.randn(embedding_length, qkv_vec_length, requires_grad=True))

        # Key matrix (weight)
        self.W_k = nn.Parameter(torch.randn(embedding_length, qkv_vec_length, requires_grad=True))

        # Value matrix (weight)
        self.W_v = nn.Parameter(torch.randn(embedding_length, qkv_vec_length, requires_grad=True))

        self.softmax = nn.Softmax(dim=-1)
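
The snippet above stops at __init__. For context, here is a minimal sketch of how a forward pass could use these weight matrices; the shapes and the exact forward logic are my assumption, not part of the original code.

    def forward(self, x):
        # x: (batch, seq_len, embedding_length)
        Q = x @ self.W_q  # -> (batch, seq_len, qkv_vec_length)
        K = x @ self.W_k
        V = x @ self.W_v

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)
        attention = self.softmax(scores)
        return attention @ V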

 

๋ญ ์–ด์จŒ๋“  ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ์•Œ๊ณ  ์‹ถ์€ ๋ฌธ์ œ, ํ•ด๊ฒฐํ•˜๊ณ  ์‹ถ์€ ๋ฌธ์ œ๋Š” 'nn.Parameter()๋ฅผ ์™œ ์จ์•ผ ํ•˜๋Š”๊ฐ€?'๊ฐ€ ์งˆ๋ฌธ์ด์ž ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค.


๐Ÿค” Documentation

์ด๋Ÿฐ ๊ฑด ์•„๋งˆ ๊ณต์‹ ๋ฌธ์„œ์— ๋น ๋ฅด๊ฒŒ ๋‹ต์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์„ ๊ฑฐ๋ผ ํŒ๋‹จํ•ด์„œ Documentation์„ ํ™•์ธํ•ด ๋ดค์Šต๋‹ˆ๋‹ค.

https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html

 


 

์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ •๋ณด๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€์˜€์Šต๋‹ˆ๋‹ค.

(1) Declaring a value with nn.Parameter() marks it as a parameter of the module, so it appears in model.parameters() and model.named_parameters().

(2) The torch.no_grad() context does not affect Parameter creation: a Parameter created inside no_grad mode still defaults to requires_grad=True, so gradients keep being computed for it.
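
As a quick aside, point (2) is easy to verify directly. This check is my own, not from the documentation page:

import torch
import torch.nn as nn

with torch.no_grad():
    p = nn.Parameter(torch.randn(3))  # Parameter creation ignores no_grad mode
    t = torch.randn(3)                # a plain tensor does not require grad

print(p.requires_grad)  # True
print(t.requires_grad)  # False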

 

์‚ฌ์‹ค 2๋ฒˆ์€ ์•„์˜ˆ ํ•ด๋‹น ํฌ์ŠคํŒ…๊ณผ ๊ด€๋ จ ์—†๋Š” ๊ฑฐ ๊ฐ™์œผ๋‹ˆ ์ œ์™ธ๋ฅผ ํ•˜๊ณ , 1๋ฒˆ์„ ๋ด์•ผ ํ•˜๋Š”๋ฐ ์ด์— ๋Œ€ํ•ด์„œ๋Š” ์ฒ˜์Œ์— ์ด๋ ‡๊ฒŒ ํ•ด์„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ € ๊ฐ™์€ ๊ฒฝ์šฐ์—๋Š” model.parameters(), ํ˜น์€ model.named_parameters()๋ฅผ ๋ชจ๋ธ ํ”„๋ฆฌ์ง•์ด๋‚˜ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ์š”์†Œ๋ฅผ ํ™•์ธํ•˜๋ ค๊ณ  ํ•  ๋•Œ, ๋งŽ์ด ์‚ฌ์šฉํ•ด ์™”์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ torchsummary๋‚˜ print(model)์— ๋Œ€ํ•ด์„œ ์ฐจ์ด์ ์ด ์žˆ๋Š” ๊ฑด๊ฐ€?๋ฅผ ์ƒ๊ฐํ•ด์„œ ๊ฐ„๋‹จํ•œ ์ฝ”๋“œ๋ฅผ ์ž‘์—…ํ•ด์„œ ๋Œ๋ ค๋ดค์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์˜ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด์ฃผ๊ณ , 3๊ฐ€์ง€ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ ์š”์†Œ์— ๋Œ€ํ•ด์„œ ์ถœ๋ ฅํ•ด ๋ณธ ๊ฒฐ๊ณผ, ์ฐจ์ด์ ์€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered as a module parameter via nn.Parameter()
        self.param1 = nn.Parameter(
            torch.tensor([1, 2, 3], dtype=torch.float32))
        # Plain tensor attribute: NOT registered as a parameter
        self.param2 = torch.tensor([4, 5, 6], dtype=torch.float32)
        self.li = nn.Linear(3, 1)

    def forward(self, x):
        x = x * self.param1
        x = x * self.param2
        x = self.li(x)

        return x

 

1๋ฒˆ์งธ ๋ฐฉ๋ฒ•: model.named_parameters()

model = Model()

for name, param in model.named_parameters():
    print(name, param)

# OUTPUT
param1 Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
li.weight Parameter containing:
tensor([[-0.2087,  0.4349,  0.3196]], requires_grad=True)
li.bias Parameter containing:
tensor([-0.5716], requires_grad=True)

2๋ฒˆ์งธ ๋ฐฉ๋ฒ•: summary(model, (BATCH_SIZE, 3))

from torchsummary import summary

BATCH_SIZE = 16
summary(model, (BATCH_SIZE, 3))

# OUTPUT
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                [-1, 16, 1]               4
================================================================
Total params: 4
Trainable params: 4
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------

3๋ฒˆ์งธ ๋ฐฉ๋ฒ•: print(model)

print(model)

# OUTPUT
Model(
  (li): Linear(in_features=3, out_features=1, bias=True)
)

 

์ฐจ์ด์ ์€ 1๋ฒˆ์งธ ๋ฐฉ๋ฒ•์—์„œ๋งŒ nn.Parameter()์— ๋Œ€ํ•ด์„œ ์ถœ๋ ฅ์„ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. 2, 3๋ฒˆ์งธ ๋ฐฉ๋ฒ•์ด ์ถœ๋ ฅ์„ ์•ˆ ํ•œ๋‹ค๋Š” ๊ฒŒ ์กฐ๊ธˆ ์˜์•„ํ•˜๊ธด ํ–ˆ์Šต๋‹ˆ๋‹ค. 1, 2, 3๋ฒˆ ๋ฐฉ๋ฒ• ๋ชจ๋‘ tensor๋กœ ์„ ์–ธ๋œ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” ์ถœ๋ ฅํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

 

๋ฌผ๋ก , 2, 3๋ฒˆ์งธ ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ์ถœ๋ ฅํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” nn.Parameter() ์ž์ฒด๋ฅผ ํ•˜๋‚˜์˜ Layer์ฒ˜๋Ÿผ ์ทจ๊ธ‰ํ•˜์—ฌ ๊ณ„์ธต์ ์ธ ๊ตฌ์กฐ๋กœ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, ํ•ด๋‹น ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•ด์„œ ์–ด๋– ํ•œ ์—ฐ์‚ฐ์ด ์ด๋ฃจ์–ด์งˆ์ง€ ๋ชจ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์–ด์ฐŒ ๋ณด๋ฉด ๋‹น์—ฐํ•œ ๊ฒฐ๊ณผ์ธ ๊ฑฐ ๊ฐ™์•„ ๋ณด์ด๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ, ๊ณ ์ž‘ ์ด๋Ÿฐ ์ฐจ์ด์ ์„ ์ฃผ์ž๊ณ  nn.Parameter()๋ฅผ ๋งŒ๋“ค์—ˆ์„๊นŒ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด๋ฉด ์ด ์ฐจ์ด์ ์€ ํฌ๊ฒŒ ์„ค๋“๋ ฅ์ด ์—†์—ˆ์Šต๋‹ˆ๋‹ค. ๋ฌด์–ธ๊ฐ€ ๋” ํ•ต์‹ฌ์ ์ธ ์ด์œ ๊ฐ€ ์žˆ์„ ๊ฑฐ๋ผ ํŒ๋‹จํ–ˆ์Šต๋‹ˆ๋‹ค.


๐Ÿ˜€ Solution: optimizer

๊ทธ์— ๋Œ€ํ•œ ๋‹ต์€ Stackoverflow์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. 

https://stackoverflow.com/questions/51373919/the-purpose-of-introducing-nn-parameter-in-pytorch

 


 

์—ฌ๊ธฐ์„œ ๋‹ต์„ ์•Œ์•„๋ƒˆ์„ ๋•Œ๋Š” ์ •๋ง ๊ฐ€๊นŒ์šด ๊ณณ์— ๊ทธ ๋‹ต์ด ์žˆ์—ˆ๊ณ , '์™€ ์ด๊ฑด ๋„ˆ๋ฌด ๋ฉ์ฒญํ–ˆ๋‹ค'๋ผ๋Š” ์ƒ๊ฐ์ด ๋“ค ์ •๋„์˜€์Šต๋‹ˆ๋‹ค.

nn.Parameter()๋ฅผ ์“ฐ๋ฉด, model.parameters(), model.named_parameters()์—์„œ ๋‚˜ํƒ€๋‚œ๋‹ค๋ผ๋Š” ๊ฒƒ์€ ์œ„์—์„œ๋„ ์—ฌ๋Ÿฌ ๋ฒˆ ์–˜๊ธฐํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ด ์šฉ๋„๋ฅผ ์œ„์—์„œ๋Š” ๋‹จ์ˆœํžˆ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ ์š”์†Œ ์ถœ๋ ฅ์„ ์œ„ํ•ด์„œ๋ผ๊ณ  ํŒ๋‹จํ–ˆ์—ˆ๊ณ ์š”.

 

ํ•˜์ง€๋งŒ, ์šฐ๋ฆฌ๊ฐ€ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ๋Š” ์–ด๋–ป๊ฒŒ ํ•™์Šต์„ ์‹œํ‚ต๋‹ˆ๊นŒ?

์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์„ ์–ธํ•ด์„œ ํ•ด๋‹น ์˜ตํ‹ฐ๋งˆ์ด์ €์—๊ฒŒ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„˜๊ฒจ์ฃผ์ง€ ์•Š์•˜๋‚˜์š”?

 

๋„ค! ๋งž์Šต๋‹ˆ๋‹ค. ๊ทธ๋•Œ ์šฐ๋ฆฌ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„˜๊ฒจ์ค„ ๋•Œ ์ด๋Ÿฐ ์ฝ”๋“œ๋ฅผ ์ผ์Šต๋‹ˆ๋‹ค.

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.001)

model.parameters()๋ฅผ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋„˜๊ฒจ์„œ ์˜ตํ‹ฐ๋งˆ์ด์ €์—๊ฒŒ ํ•™์Šต์„ ์‹œ์ผœ์•ผ ํ•˜๋Š” ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋ฌด์—‡์ธ์ง€ ์•Œ๋ ค์ฃผ๊ณ  ์žˆ์—ˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์•„, ๊ทธ๋Ÿฌ๋ฉด tensor๋กœ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๊ฒƒ์€ ์šฐ๋ฆฌ์˜ ์˜๋„์ ์ธ ๋…ผ๋ฆฌ๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต์„ ๋ชป ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด๊ฒƒ์€ ์‚ฌ์†Œํ•œ ์ฐจ์ด์ ์ด ์•„๋‹ˆ๋ผ, ์น˜๋ช…์ ์ธ ์ฐจ์ด์ ์ด ์žˆ์—ˆ๋‹ค๋Š” ๊ฑฐ์˜ˆ์š”! 

 

์ด๊ฒŒ ์‚ฌ์‹ค์ธ์ง€ ํ™•์ธํ•ด ๋ณด๊ธฐ ์œ„ํ•ด์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ํ›ˆ๋ จ์„ ์‹œํ‚ค๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด์„œ ์‹คํ–‰ํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค.

def train():
    model = Model()
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    input_data = torch.tensor([1, 2, 3], dtype=torch.float32)
    output_data = model(input_data)
    target_data = torch.tensor([1], dtype=torch.float32)

    loss = loss_fn(output_data, target_data)

    _ANNOTATION_MARK = 10
    print('#' * _ANNOTATION_MARK +
          '[Before Back Propagation]' + '#' * _ANNOTATION_MARK)
    print('param1: ', model.param1.data)
    print('param2: ', model.param2.data)
    print('li: ', model.li.weight.data)

    # Back Propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('#' * _ANNOTATION_MARK +
          '[After Back Propagation]' + '#' * _ANNOTATION_MARK)
    print('param1: ', model.param1.data)
    print('param2: ', model.param2.data)
    print('li: ', model.li.weight.data)

 

์ด์— ๋Œ€ํ•œ ์‹คํ–‰ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

##########[Before Back Propagation]##########
param1:  tensor([1., 2., 3.])
param2:  tensor([4., 5., 6.])
li:  tensor([[-0.3608, -0.2601, -0.4624]])
##########[After Back Propagation]##########
param1:  tensor([0.9055, 1.8296, 2.4548])
param2:  tensor([4., 5., 6.])
li:  tensor([[-0.0987,  1.0501,  3.0752]])

nn.Parameter()๋ฅผ ํ†ตํ•ด์„œ ์„ ์–ธ๋œ param1์€ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฐ’์ด ๋ณ€๊ฒฝ ๋˜์—ˆ์ง€๋งŒ, tensor๋กœ ์„ ์–ธ๋œ param2๋Š” ํ•™์Šต์„ ํ•˜์ง€ ๋ชปํ•ด์„œ ๊ฐ’์ด ๊ทธ๋Œ€๋กœ์ธ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


โœ… Result

์ด์ œ๋Š” ๋‹ค์Œ ์งˆ๋ฌธ์— ๋Œ€๋‹ต์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 'nn.Parameter()์™€ tensor๊ฐ€ ๋ฌด์Šจ ์ฐจ์ด๊ฐ€ ์žˆ๊ธธ๋ž˜ nn.Parameter()๋ฅผ ์“ฐ๋Š”๊ฐ€?'๋ผ๋Š” ์งˆ๋ฌธ์— ๋Œ€ํ•ด์„œ 'nn.Parameter()๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ชจ๋ธ(๋ชจ๋“ˆ) ๋‚ด์— ์„ ์–ธ์„ ํ•ด์•ผ ์˜ตํ‹ฐ๋งˆ์ด์ €์— model.parameters()๋ฅผ ํ†ตํ•ด ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋„˜๊ฒจ์„œ ํ•ด๋‹น ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.'๋ผ๊ณ  ๋‹ตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋„ˆ๋ฌด ๊ธด ๊ฑฐ ๊ฐ™์œผ๋‹ˆ ์กฐ๊ธˆ ๋” ์š”์•ฝ์„ ํ•˜๋ฉด, 'tensor๊ฐ€ ์•„๋‹Œ nn.Parameter()๋กœ ์„ ์–ธ์„ ํ•ด์•ผ ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ํ†ตํ•ด ํ•™์Šต์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.'๋กœ ์š”์•ฝํ•  ์ˆ˜ ์žˆ์„ ๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹ค.


๐Ÿ“‚ Reference

The purpose of introducing nn.Parameter in pytorch — Stack Overflow: https://stackoverflow.com/questions/51373919/the-purpose-of-introducing-nn-parameter-in-pytorch

 
