๊ด€๋ฆฌ ๋ฉ”๋‰ด

Doby's Lab

nn.Parameter(), ์ด๊ฑธ ์จ์•ผ ํ•˜๋Š” ์ด์œ ๊ฐ€ ๋ญ˜๊นŒ? (tensor์™€ ๋ช…๋ฐฑํ•˜๊ฒŒ ๋‹ค๋ฅธ ์ ) ๋ณธ๋ฌธ

Code about AI/PyTorch

nn.Parameter(), ์ด๊ฑธ ์จ์•ผ ํ•˜๋Š” ์ด์œ ๊ฐ€ ๋ญ˜๊นŒ? (tensor์™€ ๋ช…๋ฐฑํ•˜๊ฒŒ ๋‹ค๋ฅธ ์ )

๋„๋น„(Doby) 2024. 4. 29. 00:28

๐Ÿค” Problem

๋ฌธ๋“ ์˜ˆ์ „์— ViT๋ฅผ ๊ตฌํ˜„ํ•ด ๋†“์€ ์ฝ”๋“œ๋ฅผ ๋ณด๋‹ค๊ฐ€ ๊ทธ๋Ÿฐ ์ƒ๊ฐ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. '๋‚ด๊ฐ€ ์ €๊ธฐ์„œ nn.Parameter()๋ฅผ ์™œ ์ผ๋”๋ผ?' ์ง€๊ธˆ ์ƒ๊ฐํ•ด ๋ณด๋ฉด, ๊ทธ๋ƒฅ tensor๋ฅผ ์จ๋„ ๋˜‘๊ฐ™์€ ์ฝ”๋“œ์ผ ํ…๋ฐ ๋ง์ž…๋‹ˆ๋‹ค.

 

์ด๋•Œ ๋‹น์‹œ์— Attention์„ ๊ตฌํ˜„ํ•˜๋ฉด์„œ Query, Key, Value๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๊ธฐ ์œ„ํ•œ ๋ชฉ์ ์œผ๋กœ Weight Matrix๊ฐ€ ํ•„์š”ํ–ˆ์—ˆ๊ณ , ์—ฌ๋Ÿฌ ์˜คํ”ˆ ์†Œ์Šค๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด์„œ ๊ตฌํ˜„ํ•˜๋‹ค๊ฐ€ ๋ฌด์‹ฌํ•˜๊ฒŒ ์ผ์—ˆ๋˜ ๊ธฐ์–ต์ด ๋‚ฉ๋‹ˆ๋‹ค.

import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, embedding_length, qkv_vec_length):
        '''
        embedding_length : length of a single embedding -> becomes the row count of W_(qkv)
        qkv_vec_length   : column count of the W_(qkv) matrices
        '''
        super().__init__()

        # Query matrix (weight)
        # (Note: requires_grad=True is redundant here, since nn.Parameter
        #  already defaults to requires_grad=True.)
        self.W_q = nn.Parameter(torch.randn(embedding_length, qkv_vec_length, requires_grad=True))

        # Key matrix (weight)
        self.W_k = nn.Parameter(torch.randn(embedding_length, qkv_vec_length, requires_grad=True))

        # Value matrix (weight)
        self.W_v = nn.Parameter(torch.randn(embedding_length, qkv_vec_length, requires_grad=True))

        self.softmax = nn.Softmax(dim=-1)
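
The snippet above stops at __init__. For context, here is a minimal sketch of how a forward pass could use these weight matrices; the shapes and the exact forward logic are my assumption, not part of the original code.

    def forward(self, x):
        # x: (batch, seq_len, embedding_length)
        Q = x @ self.W_q  # -> (batch, seq_len, qkv_vec_length)
        K = x @ self.W_k
        V = x @ self.W_v

        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)
        attention = self.softmax(scores)
        return attention @ V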

 

๋ญ ์–ด์จŒ๋“  ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ์•Œ๊ณ  ์‹ถ์€ ๋ฌธ์ œ, ํ•ด๊ฒฐํ•˜๊ณ  ์‹ถ์€ ๋ฌธ์ œ๋Š” 'nn.Parameter()๋ฅผ ์™œ ์จ์•ผ ํ•˜๋Š”๊ฐ€?'๊ฐ€ ์งˆ๋ฌธ์ด์ž ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค.


๐Ÿค” Documentation

์ด๋Ÿฐ ๊ฑด ์•„๋งˆ ๊ณต์‹ ๋ฌธ์„œ์— ๋น ๋ฅด๊ฒŒ ๋‹ต์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์„ ๊ฑฐ๋ผ ํŒ๋‹จํ•ด์„œ Documentation์„ ํ™•์ธํ•ด ๋ดค์Šต๋‹ˆ๋‹ค.

https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html

 


 

์–ป์„ ์ˆ˜ ์žˆ๋Š” ์ •๋ณด๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€์˜€์Šต๋‹ˆ๋‹ค.

(1) Declaring a value with nn.Parameter() marks it as a parameter of the module, so it appears in model.parameters() and model.named_parameters().

(2) The torch.no_grad() context does not affect Parameter creation: a Parameter created inside no_grad mode still defaults to requires_grad=True, so gradients keep being computed for it.
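
As a quick aside, point (2) is easy to verify directly. This check is my own, not from the documentation page:

import torch
import torch.nn as nn

with torch.no_grad():
    p = nn.Parameter(torch.randn(3))  # Parameter creation ignores no_grad mode
    t = torch.randn(3)                # a plain tensor does not require grad

print(p.requires_grad)  # True
print(t.requires_grad)  # False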

 

์‚ฌ์‹ค 2๋ฒˆ์€ ์•„์˜ˆ ํ•ด๋‹น ํฌ์ŠคํŒ…๊ณผ ๊ด€๋ จ ์—†๋Š” ๊ฑฐ ๊ฐ™์œผ๋‹ˆ ์ œ์™ธ๋ฅผ ํ•˜๊ณ , 1๋ฒˆ์„ ๋ด์•ผ ํ•˜๋Š”๋ฐ ์ด์— ๋Œ€ํ•ด์„œ๋Š” ์ฒ˜์Œ์— ์ด๋ ‡๊ฒŒ ํ•ด์„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ € ๊ฐ™์€ ๊ฒฝ์šฐ์—๋Š” model.parameters(), ํ˜น์€ model.named_parameters()๋ฅผ ๋ชจ๋ธ ํ”„๋ฆฌ์ง•์ด๋‚˜ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ์š”์†Œ๋ฅผ ํ™•์ธํ•˜๋ ค๊ณ  ํ•  ๋•Œ, ๋งŽ์ด ์‚ฌ์šฉํ•ด ์™”์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ torchsummary๋‚˜ print(model)์— ๋Œ€ํ•ด์„œ ์ฐจ์ด์ ์ด ์žˆ๋Š” ๊ฑด๊ฐ€?๋ฅผ ์ƒ๊ฐํ•ด์„œ ๊ฐ„๋‹จํ•œ ์ฝ”๋“œ๋ฅผ ์ž‘์—…ํ•ด์„œ ๋Œ๋ ค๋ดค์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์˜ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด์ฃผ๊ณ , 3๊ฐ€์ง€ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ ์š”์†Œ์— ๋Œ€ํ•ด์„œ ์ถœ๋ ฅํ•ด ๋ณธ ๊ฒฐ๊ณผ, ์ฐจ์ด์ ์€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered as a module parameter via nn.Parameter()
        self.param1 = nn.Parameter(
            torch.tensor([1, 2, 3], dtype=torch.float32))
        # Plain tensor attribute: NOT registered as a parameter
        self.param2 = torch.tensor([4, 5, 6], dtype=torch.float32)
        self.li = nn.Linear(3, 1)

    def forward(self, x):
        x = x * self.param1
        x = x * self.param2
        x = self.li(x)

        return x

 

1๋ฒˆ์งธ ๋ฐฉ๋ฒ•: model.named_parameters()

model = Model()

for name, param in model.named_parameters():
    print(name, param)

# OUTPUT
param1 Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
li.weight Parameter containing:
tensor([[-0.2087,  0.4349,  0.3196]], requires_grad=True)
li.bias Parameter containing:
tensor([-0.5716], requires_grad=True)

2๋ฒˆ์งธ ๋ฐฉ๋ฒ•: summary(model, (BATCH_SIZE, 3))

from torchsummary import summary

BATCH_SIZE = 16
summary(model, (BATCH_SIZE, 3))

# OUTPUT
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Linear-1                [-1, 16, 1]               4
================================================================
Total params: 4
Trainable params: 4
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------

3๋ฒˆ์งธ ๋ฐฉ๋ฒ•: print(model)

print(model)

# OUTPUT
Model(
  (li): Linear(in_features=3, out_features=1, bias=True)
)

 

์ฐจ์ด์ ์€ 1๋ฒˆ์งธ ๋ฐฉ๋ฒ•์—์„œ๋งŒ nn.Parameter()์— ๋Œ€ํ•ด์„œ ์ถœ๋ ฅ์„ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. 2, 3๋ฒˆ์งธ ๋ฐฉ๋ฒ•์ด ์ถœ๋ ฅ์„ ์•ˆ ํ•œ๋‹ค๋Š” ๊ฒŒ ์กฐ๊ธˆ ์˜์•„ํ•˜๊ธด ํ–ˆ์Šต๋‹ˆ๋‹ค. 1, 2, 3๋ฒˆ ๋ฐฉ๋ฒ• ๋ชจ๋‘ tensor๋กœ ์„ ์–ธ๋œ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” ์ถœ๋ ฅํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

 

๋ฌผ๋ก , 2, 3๋ฒˆ์งธ ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด ์ถœ๋ ฅํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” nn.Parameter() ์ž์ฒด๋ฅผ ํ•˜๋‚˜์˜ Layer์ฒ˜๋Ÿผ ์ทจ๊ธ‰ํ•˜์—ฌ ๊ณ„์ธต์ ์ธ ๊ตฌ์กฐ๋กœ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด, ํ•ด๋‹น ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•ด์„œ ์–ด๋– ํ•œ ์—ฐ์‚ฐ์ด ์ด๋ฃจ์–ด์งˆ์ง€ ๋ชจ๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ์–ด์ฐŒ ๋ณด๋ฉด ๋‹น์—ฐํ•œ ๊ฒฐ๊ณผ์ธ ๊ฑฐ ๊ฐ™์•„ ๋ณด์ด๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ, ๊ณ ์ž‘ ์ด๋Ÿฐ ์ฐจ์ด์ ์„ ์ฃผ์ž๊ณ  nn.Parameter()๋ฅผ ๋งŒ๋“ค์—ˆ์„๊นŒ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด๋ฉด ์ด ์ฐจ์ด์ ์€ ํฌ๊ฒŒ ์„ค๋“๋ ฅ์ด ์—†์—ˆ์Šต๋‹ˆ๋‹ค. ๋ฌด์–ธ๊ฐ€ ๋” ํ•ต์‹ฌ์ ์ธ ์ด์œ ๊ฐ€ ์žˆ์„ ๊ฑฐ๋ผ ํŒ๋‹จํ–ˆ์Šต๋‹ˆ๋‹ค.


๐Ÿ˜€ Solution: optimizer

๊ทธ์— ๋Œ€ํ•œ ๋‹ต์€ Stackoverflow์—์„œ ์ฐพ์„ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. 

https://stackoverflow.com/questions/51373919/the-purpose-of-introducing-nn-parameter-in-pytorch

 


 

์—ฌ๊ธฐ์„œ ๋‹ต์„ ์•Œ์•„๋ƒˆ์„ ๋•Œ๋Š” ์ •๋ง ๊ฐ€๊นŒ์šด ๊ณณ์— ๊ทธ ๋‹ต์ด ์žˆ์—ˆ๊ณ , '์™€ ์ด๊ฑด ๋„ˆ๋ฌด ๋ฉ์ฒญํ–ˆ๋‹ค'๋ผ๋Š” ์ƒ๊ฐ์ด ๋“ค ์ •๋„์˜€์Šต๋‹ˆ๋‹ค.

nn.Parameter()๋ฅผ ์“ฐ๋ฉด, model.parameters(), model.named_parameters()์—์„œ ๋‚˜ํƒ€๋‚œ๋‹ค๋ผ๋Š” ๊ฒƒ์€ ์œ„์—์„œ๋„ ์—ฌ๋Ÿฌ ๋ฒˆ ์–˜๊ธฐํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, ์ด ์šฉ๋„๋ฅผ ์œ„์—์„œ๋Š” ๋‹จ์ˆœํžˆ ๋ชจ๋ธ์˜ ๊ตฌ์„ฑ ์š”์†Œ ์ถœ๋ ฅ์„ ์œ„ํ•ด์„œ๋ผ๊ณ  ํŒ๋‹จํ–ˆ์—ˆ๊ณ ์š”.

 

ํ•˜์ง€๋งŒ, ์šฐ๋ฆฌ๊ฐ€ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ๋Š” ์–ด๋–ป๊ฒŒ ํ•™์Šต์„ ์‹œํ‚ต๋‹ˆ๊นŒ?

์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์„ ์–ธํ•ด์„œ ํ•ด๋‹น ์˜ตํ‹ฐ๋งˆ์ด์ €์—๊ฒŒ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„˜๊ฒจ์ฃผ์ง€ ์•Š์•˜๋‚˜์š”?

 

๋„ค! ๋งž์Šต๋‹ˆ๋‹ค. ๊ทธ๋•Œ ์šฐ๋ฆฌ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋„˜๊ฒจ์ค„ ๋•Œ ์ด๋Ÿฐ ์ฝ”๋“œ๋ฅผ ์ผ์Šต๋‹ˆ๋‹ค.

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.001)

model.parameters()๋ฅผ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋„˜๊ฒจ์„œ ์˜ตํ‹ฐ๋งˆ์ด์ €์—๊ฒŒ ํ•™์Šต์„ ์‹œ์ผœ์•ผ ํ•˜๋Š” ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋ฌด์—‡์ธ์ง€ ์•Œ๋ ค์ฃผ๊ณ  ์žˆ์—ˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์•„, ๊ทธ๋Ÿฌ๋ฉด tensor๋กœ ๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๊ฒƒ์€ ์šฐ๋ฆฌ์˜ ์˜๋„์ ์ธ ๋…ผ๋ฆฌ๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต์„ ๋ชป ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด๊ฒƒ์€ ์‚ฌ์†Œํ•œ ์ฐจ์ด์ ์ด ์•„๋‹ˆ๋ผ, ์น˜๋ช…์ ์ธ ์ฐจ์ด์ ์ด ์žˆ์—ˆ๋‹ค๋Š” ๊ฑฐ์˜ˆ์š”! 

 

์ด๊ฒŒ ์‚ฌ์‹ค์ธ์ง€ ํ™•์ธํ•ด ๋ณด๊ธฐ ์œ„ํ•ด์„œ ์•„๋ž˜์™€ ๊ฐ™์€ ํ›ˆ๋ จ์„ ์‹œํ‚ค๋Š” ํ•จ์ˆ˜๋ฅผ ๋งŒ๋“ค์–ด์„œ ์‹คํ–‰ํ•ด ๋ณด์•˜์Šต๋‹ˆ๋‹ค.

def train():
    model = Model()
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()

    input_data = torch.tensor([1, 2, 3], dtype=torch.float32)
    output_data = model(input_data)
    target_data = torch.tensor([1], dtype=torch.float32)

    loss = loss_fn(output_data, target_data)

    _ANNOTATION_MARK = 10
    print('#' * _ANNOTATION_MARK +
          '[Before Back Propagation]' + '#' * _ANNOTATION_MARK)
    print('param1: ', model.param1.data)
    print('param2: ', model.param2.data)
    print('li: ', model.li.weight.data)

    # Back Propagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    print('#' * _ANNOTATION_MARK +
          '[After Back Propagation]' + '#' * _ANNOTATION_MARK)
    print('param1: ', model.param1.data)
    print('param2: ', model.param2.data)
    print('li: ', model.li.weight.data)

 

์ด์— ๋Œ€ํ•œ ์‹คํ–‰ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

##########[Before Back Propagation]##########
param1:  tensor([1., 2., 3.])
param2:  tensor([4., 5., 6.])
li:  tensor([[-0.3608, -0.2601, -0.4624]])
##########[After Back Propagation]##########
param1:  tensor([0.9055, 1.8296, 2.4548])
param2:  tensor([4., 5., 6.])
li:  tensor([[-0.0987,  1.0501,  3.0752]])

nn.Parameter()๋ฅผ ํ†ตํ•ด์„œ ์„ ์–ธ๋œ param1์€ ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฐ’์ด ๋ณ€๊ฒฝ ๋˜์—ˆ์ง€๋งŒ, tensor๋กœ ์„ ์–ธ๋œ param2๋Š” ํ•™์Šต์„ ํ•˜์ง€ ๋ชปํ•ด์„œ ๊ฐ’์ด ๊ทธ๋Œ€๋กœ์ธ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


โœ… Result

์ด์ œ๋Š” ๋‹ค์Œ ์งˆ๋ฌธ์— ๋Œ€๋‹ต์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 'nn.Parameter()์™€ tensor๊ฐ€ ๋ฌด์Šจ ์ฐจ์ด๊ฐ€ ์žˆ๊ธธ๋ž˜ nn.Parameter()๋ฅผ ์“ฐ๋Š”๊ฐ€?'๋ผ๋Š” ์งˆ๋ฌธ์— ๋Œ€ํ•ด์„œ 'nn.Parameter()๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ชจ๋ธ(๋ชจ๋“ˆ) ๋‚ด์— ์„ ์–ธ์„ ํ•ด์•ผ ์˜ตํ‹ฐ๋งˆ์ด์ €์— model.parameters()๋ฅผ ํ†ตํ•ด ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋„˜๊ฒจ์„œ ํ•ด๋‹น ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.'๋ผ๊ณ  ๋‹ตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋„ˆ๋ฌด ๊ธด ๊ฑฐ ๊ฐ™์œผ๋‹ˆ ์กฐ๊ธˆ ๋” ์š”์•ฝ์„ ํ•˜๋ฉด, 'tensor๊ฐ€ ์•„๋‹Œ nn.Parameter()๋กœ ์„ ์–ธ์„ ํ•ด์•ผ ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ํ†ตํ•ด ํ•™์Šต์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.'๋กœ ์š”์•ฝํ•  ์ˆ˜ ์žˆ์„ ๊ฑฐ ๊ฐ™์Šต๋‹ˆ๋‹ค.


๐Ÿ“‚ Reference

The purpose of introducing nn.Parameter in pytorch — Stack Overflow: https://stackoverflow.com/questions/51373919/the-purpose-of-introducing-nn-parameter-in-pytorch

 
