x.clone()은 정말 Residual Connection을 할까? (Memory 공유, Immutability)

Notice

Recent Posts

Recent Comments

Link

깃허브

« 2025/08 »
일	월	화	수	목	금	토
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31

Tags more

Archives

Today

Total

관리 메뉴

Doby's Lab

x.clone()은 정말 Residual Connection을 할까? (Memory 공유, Immutability) 본문

Code about AI/PyTorch

x.clone()은 정말 Residual Connection을 할까? (Memory 공유, Immutability)

도비(Doby) 2024. 4. 27. 19:04

🤔 Problem

이번에 ResNet을 PyTorch로 직접 구현해 보면서 약간의 의구심(?)이 들었던 부분이 있습니다. Residual Connection을 구현할 때 크게 2가지 방법으로 구현을 하는데, '두 코드 모두 Residual Connection을 수행하는가?'가 의문이자 이번 포스팅에서 알아내고 싶은 문제점입니다.

+ 코드에 대해서만 다룰 것이니 Residual Connection에 대한 개념의 언급은 따로 없습니다.

첫 번째 코드는 torchvision 라이브러리 내에 ResNet을 구현해 둔 소스코드입니다.

해당 코드에서는 identity = x와 같은 방법으로 복사를 합니다.

( https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py 143 line)

def forward(self, x):
	... # 일련의 과정(1)
	identity = x
	... # 일련의 과정(2)
	x += identity
	return x

두 번째 코드는 익명의 누군가(?)가 ResNet을 구현한 소스코드입니다.

해당 코드에서는 identity = x.clone()과 같은 방법으로 복사를 합니다.

( https://github.com/JayPatwardhan/ResNet-PyTorch/blob/master/ResNet/ResNet.py 57 line)

def forward(self, x):
	... # 일련의 과정(1)
	identity = x.clone()
	... # 일련의 과정(2)
	x += identity
	return x

첫 번째 코드와 같은 방식으로 Residual Connection을 구현하는 것이 일반적이지만, 두 번째 코드와 같은 방식으로 구현된 것도 종종 찾아볼 수 있었습니다. 하지만, 조금 의구심을 품었던 계기가 있었습니다. 첫 번째 코드는 두 번째 코드와 다른 점이 원본 변수(텐서)에 대해서 메모리를 공유한다는 것입니다. '메모리를 공유하고 있다면, x에 대한 연산들이 일어날 때도 공유를 하는 것은 문제가 되지 않는가? 그러면 Residual Conneciton이 아니라 똑같은 연산을 한 결과 2개를 합치는 꼴이 아닌가?'라는 문제점을 제안하며 이번 포스팅을 작성합니다.

😀 What is `x.clone()`?

그러면 두 코드에 대한 차이점을 보기 위해서 x.clone()이 무엇을 수행하는 코드인지 알아야 합니다.

https://pytorch.org/docs/stable/generated/torch.clone.html

torch.clone — PyTorch 2.3 documentation

Shortcuts

pytorch.org

x.clone()이란 x라는 텐서에 대해서 복사가 이루어지는데 이때, 메모리는 새로운 메모리를 할당하며, 기존의 .grad 속성은 가져가지 않고, .grad_fn은 CloneBackward라는 연산 함수를 등록해 줍니다.

즉, x.clone()은 기존 텐서의 값들(데이터)에 대해서 공유가 아닌 복사(새 메모리 할당)를 바로 해버리는 것입니다. (.grad는 제외를 하고, 데이터에 대해서만 말입니다.)

이 차이점을 직접 보기 위해서 아래와 같은 코드를 만들어줬습니다.

def clone_test():
    warnings.filterwarnings(action='ignore')
    a = torch.tensor([2, 3, 4], dtype=torch.float32, requires_grad=True)
    a.grad = torch.tensor([1, 2, 3], dtype=torch.float32)
    b = a.clone()
    c = a

    li = ['a', 'b', 'c']
    for tensor_ in li:
        print('=' * 60)
        print(f'Tensor: {tensor_}')
        print(f'Tensor Memory Address: {str(hex(eval(tensor_).data_ptr())).upper()}')
        print(f'Tensor Value: {eval(tensor_).data}')
        print(f'Tensor Requires Grad: {eval(tensor_).requires_grad}')
        print(f'Tensor Gradient: {eval(tensor_).grad}')
        print(f'Tensor Backward Function: {eval(tensor_).grad_fn}')

    warnings.filterwarnings(action='default')

그에 대한 결과는 아래와 같습니다.

============================================================
Tensor: a
Tensor Memory Address: 0X143E7370D40
Tensor Value: tensor([2., 3., 4.])
Tensor Requires Grad: True
Tensor Gradient: tensor([1., 2., 3.])
Tensor Backward Function: None
============================================================
Tensor: b
Tensor Memory Address: 0X143E7370E00
Tensor Value: tensor([2., 3., 4.])
Tensor Requires Grad: True
Tensor Gradient: None
Tensor Backward Function: <CloneBackward0 object at 0x00000143E9DFDC30>
============================================================
Tensor: c
Tensor Memory Address: 0X143E7370D40
Tensor Value: tensor([2., 3., 4.])
Tensor Requires Grad: True
Tensor Gradient: tensor([1., 2., 3.])
Tensor Backward Function: None

x.clone()이 어떤 것을 수행하는지 이를 통해서 살펴볼 수 있었고, identity = x라는 방식과 핵심적인 차이점은 데이터를 복사함에 있어서 메모리를 공유하느냐 안 하느냐(새로운 메모리 할당)가 가장 큰 차이점입니다.

🤔 `identity = x`에 대한 의심

그러면 저의 의구심의 방향은 첫 번째 코드로 향합니다. x.clone()은 메모리 공유를 하고 있지 않기 때문에 원본 x에 대해 어떠한 연산이 이루어지더라도 관계가 없다는 것은 확실해졌으니 말입니다.

하지만, 여기서 살짝 이미 정답은 나와있습니다. 왜냐하면, identity = x를 쓰는 방식의 코드는 torchvision 라이브러리의 코드입니다. 너무 공식적인 코드라 제가 생각한 문제를 야기하지는 않겠지만, 그래도 궁금하니 알아봅시다.

두 코드의 차이점을 직접 보기 위해서 모델을 하나 만들었습니다. ~~(모델이라 하기엔 그렇지만)~~

# (1) MAKE RESIDUAL MODEL
input_data = torch.tensor([1.], dtype=torch.float32)
weight1 = torch.tensor([2.], dtype=torch.float32, requires_grad=True)
weight2 = torch.tensor([3.], dtype=torch.float32, requires_grad=True)
target_data = torch.tensor([8.5], dtype=torch.float32)

feature = weight1 * input_data  # 2.0

# make identity for residual connection
identity = None  # 2.0
print('HOW TO RESIDUAL CONNECTION: ', end='')
if use_clone:
    print('identity = feature.clone()')
    identity = feature.clone()
else:
    print('identity = feature')
    identity = feature

feature2 = weight2 * feature  # 6.0

output_data = identity + feature2  # 8.0 = 2.0 + 6.0

loss = (output_data - target_data) ** 2  # 0.25 = (8.0 - 8.5) ** 2

코드가 다소 복잡해 보이지만, 그림으로 나타내면 아래와 같습니다.

Residual Connection을 위한 너무나 의도적인 모델입니다. 위 코드에 따라 입력 값이 1이면, 메모리를 계속 공유하지 않는다고 했을 때, 출력 값이 8이 나와야 하는 모델입니다. 이걸 두 가지 방식에 따라 수행했을 때 차이가 있었을까요? 출력을 해보았습니다. (출력에 관한 코드는 맨 아래에 한꺼번에 첨부하였습니다.)

(1) `identity = x`를 사용했을 때

HOW TO RESIDUAL CONNECTION: identity = feature

=========================[Memory Address]=========================
Identity's Address(data_ptr): 0X24E65892840, Address(id): 0X24E683404A0
Feature's  Address(data_ptr): 0X24E65892840, Address(id): 0X24E683404A0
Feature2's Address(data_ptr): 0X24E65891F40, Address(id): 0X24E68358C20

=========================[Gradient Function]=========================
input_data's grad_fn:     None
weight1's grad_fn:        None
feature's grad_fn:        <MulBackward0 object at 0x0000024E6831DCF0>
identity's grad_fn:       <MulBackward0 object at 0x0000024E6831DF00>
weight2's grad_fn:        None
feature2's grad_fn:       <MulBackward0 object at 0x0000024E6831DCF0>
output_data's grad_fn:    <AddBackward0 object at 0x0000024E6831DF00>
loss's grad_fn:           <PowBackward0 object at 0x0000024E6831F5B0>

=========================[Value Information]=========================
Output: 8.00
Target: 8.50
Loss: 0.25
Weight1's Gradient to update -4.00

(2) `identity = x.clone()`를 사용했을 때

HOW TO RESIDUAL CONNECTION: identity = feature.clone()

=========================[Memory Address]=========================
Identity's Address(data_ptr): 0X23F5C9B5880, Address(id): 0X23F5F420D60
Feature's  Address(data_ptr): 0X23F5C9B6000, Address(id): 0X23F5F4085E0
Feature2's Address(data_ptr): 0X23F5C9B5580, Address(id): 0X23F5F420DB0

=========================[Gradient Function]=========================
input_data's grad_fn:     None
weight1's grad_fn:        None
feature's grad_fn:        <MulBackward0 object at 0x0000023F5F3EDCF0>
identity's grad_fn:       <CloneBackward0 object at 0x0000023F5F3EDF00>
weight2's grad_fn:        None
feature2's grad_fn:       <MulBackward0 object at 0x0000023F5F3EDCF0>
output_data's grad_fn:    <AddBackward0 object at 0x0000023F5F3EDF00>
loss's grad_fn:           <PowBackward0 object at 0x0000023F5F3EF5B0>

=========================[Value Information]=========================
Output: 8.00
Target: 8.50
Loss: 0.25
Weight1's Gradient to update -4.00

결과가 같습니다! 똑같은 Output을 가지고 온다는 겁니다. 하지만, 이건 Forward Propagation을 만족한다는 소리일 뿐, Back Propagation은 확실하지 않습니다.

그래서 저는 weight1이 Back Propagation을 했을 때, 업데이트를 해야 하는 값을 출력하도록 해주었습니다. 그 결과도 같습니다! 즉, Forward Propagation, Backward Propagation의 결과가 같기 때문에 첫 번째 코드의 방식이든, 두 번째 코드의 방식이든 문제가 없다는 것입니다.

그러면 도대체 identity = x의 방식은 메모리를 공유한다고 했는데 어떻게 Residual Conneciton을 가능하게 했을까요? 다시 말해 메모리를 공유하고 있는데 같은 메모리의 두 변수에 대해서 어떻게 독립적인 값을 갖도록 했을까요.

😀 Tensor의 Mutable, or Immutable

이것이 가능한 이유를 찾기 위해서는 돌고 돌아야 했습니다. Garbage Collection이라는 개념까지 볼 뻔하다가 겨우 그까지는 가지 않았습니다. 어쨌든 찾은 근거는 결국에 또 기초에 모든 것이 있었습니다.

예전 파이썬 관련, 파이토치 관련 포스팅에서 파이썬의 모든 것은 객체이며, 이는 Mutable한가, Immutable한가를 다룬 적이 있었습니다.

간단하게 다시 언급하면 Mutable은 수정사항에 대해서 새로운 메모리를 할당할 필요 없이 그대로 수정할 수 있는 객체이고, Immutable은 수정을 하는데 기존 메모리에서 수정할 수 없으니 새로운 메모리를 할당해야 하는 객체입니다. (항상 이 개념에 대해서 블로그에서는 부수적으로만 다루어 왔어서 나중에 한 번 제대로 다루어 보겠습니다.)

어쨌든, 여기까지 설명했다면 감은 오셨을 겁니다. '아, Tensor가 Immutable 하기 때문에 굳이 x.clone()을 해주지 않아도 새로운 메모리를 할당하는 것이라서 Residual Connection을 할 때, 문제가 되지 않는 거구나!'

네, 정말 훌륭한 생각입니다만 사실 이건 틀렸습니다. 왜냐하면, Tensor는 Mutable 한 객체이거든요.

단, 특정 조건에 대해서만 말이에요. 리스트에서 원소 값 수정 같은 In-place Operation인 경우에는 새로운 메모리를 할당할 필요가 없습니다. Mutable하다는 의미입니다.

하지만, 다른 텐서와의 덧셈, 뺄셈, 곱셈 등과 같은 Arithmetic Operation에 대해서는 Immutability가 사용되어 새로운 메모리를 할당해야만 합니다.

그래서, Residual Connection에서는 Artihmetic Operation이 99.999999%로 쓰이기 때문에 우리가 identity = x와 같은 코드를 사용하더라도 x가 연산을 수행함에 따라 메모리 주소가 달라지기 때문에 문제가 되지 않는 것이었습니다.

✅ Result

자, 여기까지가 제가 생각한 문제점(궁금점)에 대한 해결책(해소책)이었습니다.

다시 한번 요약을 해보자면, Residual Connection을 구현하는 방법은 2가지가 있었습니다.

1. identity = x.clone()은 별도의 메모리를 가지기 때문에 문제가 되지 않는다고 판단하여, identity = x에 대해서만 알아보기 시작

2. Tensor가 연산을 수행함에 따라 별도의 메모리를 가진다. 하지만, Tensor는 Mutable 한 객체이다.

3. 이 점에서 Tensor가 In-place Operation에 대해서는 Mutability가 보장되지만, Arithmetic Operation에 대해서는 Immutability가 보장된다. 즉, 새로운 메모리가 할당된다.

4. 그래서 identity = x or identity = x.clone()에 대한 선택에 있어서 결과의 차이는 없다.

하지만, x.clone()은 .grad 속성에 대해서는 복사를 하지 않는다고 했으므로 x.clone()이 더 효율적일 수도 있겠다는 추측이 됩니다.

import torch
from memory_profiler import profile


@profile
def copy_tensor():
    a = torch.tensor([1], dtype=torch.float32)
    a.grad = torch.tensor([2], dtype=torch.float32)
    b = a
    # b = a.clone()


if __name__ == '__main__':
    copy_tensor()

그래서 memory_profiler라이브러리를 통해서 프로파일링을 한 결과, 텐서의 데이터에 대해서 새로운 메모리가 할당되기 때문에 x.clone()을 사용하지 않는 것이 더 효율적입니다. 어차피, x를 연산함에 따라 새로운 메모리를 할당해야 하기 때문입니다.

또한, identity나 x는 결국 input tensor로 weight가 아니기 때문에 gradient 값은 필요가 없어서 x.clone()의 필요성을 더 못 느끼게 되는 것이 아닌가 싶습니다.

결론적으로, 'Residual Connection을 할 때는 x.clone()을 쓸 필요가 없다. identity = x를 쓰면 된다.'로 글을 마무리할 수 있을 거 같습니다.

📄 Code for this post

import torch
import numpy as np
import argparse
import warnings


def residual_test(use_clone: bool = True):
    # (1) MAKE RESIDUAL MODEL
    input_data = torch.tensor([1.], dtype=torch.float32)
    weight1 = torch.tensor([2.], dtype=torch.float32, requires_grad=True)
    weight2 = torch.tensor([3.], dtype=torch.float32, requires_grad=True)
    target_data = torch.tensor([8.5], dtype=torch.float32)

    feature = weight1 * input_data  # 2.0

    # make identity for residual connection
    identity = None  # 2.0
    print('HOW TO RESIDUAL CONNECTION: ', end='')
    if use_clone:
        print('identity = feature.clone()')
        identity = feature.clone()
    else:
        print('identity = feature')
        identity = feature

    feature2 = weight2 * feature  # 6.0

    output_data = identity + feature2  # 8.0 = 2.0 + 6.0

    loss = (output_data - target_data) ** 2  # 0.25 = (8.0 - 8.5) ** 2

    loss.backward()

    # (2) PRINT INFORMATION
    _ANNOTATION_MARK = 25

    print('\n' + '=' * _ANNOTATION_MARK +
          '[Memory Address]' + '=' * _ANNOTATION_MARK)

    print(
        f'Identity\'s Address(data_ptr): {str(hex(identity.data_ptr())).upper()}, Address(id): {str(hex(id(identity))).upper()}')
    print(
        f'Feature\'s  Address(data_ptr): {str(hex(feature.data_ptr())).upper()}, Address(id): {str(hex(id(feature))).upper()}')
    print(
        f'Feature2\'s Address(data_ptr): {str(hex(feature2.data_ptr())).upper()}, Address(id): {str(hex(id(feature2))).upper()}')

    print('\n' + '=' * _ANNOTATION_MARK +
          '[Gradient Function]' + '=' * _ANNOTATION_MARK)
    components_li = ['input_data',
                     'weight1',
                     'feature',
                     'identity',
                     'weight2',
                     'feature2',
                     'output_data',
                     'loss']
    for component in components_li:
        temp_str = f'{component}\'s grad_fn:'
        print(f'{temp_str:25s} {eval(component).grad_fn}')
    # print(hex(weight1.data_ptr()))

    print('\n' + '=' * _ANNOTATION_MARK +
          '[Value Information]' + '=' * _ANNOTATION_MARK)
    print(f'Output: {output_data.item():.2f}')
    print(f'Target: {target_data.item():.2f}')
    print(f'Loss: {loss.item():.2f}')
    print(f'Weight1\'s Gradient to update {weight1.grad.item():.2f}')


def memory_test():
    a = np.array([1, 2, 3])
    print(f'a\'s address: {hex(id(a))}, a: {a}')
    a[2] = 2
    print(f'a\'s address: {hex(id(a))}, a: {a}')
    b = np.array([2, 4, 6])
    a = a + b
    print(f'a\'s address: {hex(id(a))}, a: {a}')


def clone_test():
    warnings.filterwarnings(action='ignore')
    a = torch.tensor([2, 3, 4], dtype=torch.float32, requires_grad=True)
    a.grad = torch.tensor([1, 2, 3], dtype=torch.float32)
    b = a.clone()
    c = a

    li = ['a', 'b', 'c']
    for tensor_ in li:
        print('=' * 60)
        print(f'Tensor: {tensor_}')
        print(
            f'Tensor Memory Address: {str(hex(eval(tensor_).data_ptr())).upper()}')
        print(f'Tensor Value: {eval(tensor_).data}')
        print(f'Tensor Requires Grad: {eval(tensor_).requires_grad}')
        print(f'Tensor Gradient: {eval(tensor_).grad}')
        print(f'Tensor Backward Function: {eval(tensor_).grad_fn}')

    warnings.filterwarnings(action='default')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--use-clone', action='store_true')
    args = parser.parse_args()

    residual_test(use_clone=args.use_clone)
    # memory_test()
    # clone_test()

'Code about AI > PyTorch' 카테고리의 다른 글

DropPath란 무엇이며, Dropout과 무슨 차이가 있을까? (timm 활용 및 오픈소스 분석) (0)	2024.05.12
Tensor는 서로 다른 ndim에 대해서 어떻게 연산할까? (Broadcasting Semantics) (0)	2024.05.04
nn.Parameter(), 이걸 써야 하는 이유가 뭘까? (tensor와 명백하게 다른 점) (3)	2024.04.29
backward()는 역전파를 어떻게 할까? (Autograd의 Node) (3)	2023.11.19
optimizer.step()은 정말 가중치를 건들까? (Call-by-Assignment) (0)	2023.11.18

'Code about AI/PyTorch' Related Articles

Doby's Lab

x.clone()은 정말 Residual Connection을 할까? (Memory 공유, Immutability) 본문

x.clone()은 정말 Residual Connection을 할까? (Memory 공유, Immutability)

🤔 Problem

😀 What is x.clone()?

🤔 identity = x에 대한 의심

(1) identity = x를 사용했을 때

(2) identity = x.clone()를 사용했을 때

😀 Tensor의 Mutable, or Immutable

✅ Result

📄 Code for this post

'Code about AI > PyTorch' 카테고리의 다른 글

티스토리툴바

😀 What is `x.clone()`?

🤔 `identity = x`에 대한 의심

(1) `identity = x`를 사용했을 때

(2) `identity = x.clone()`를 사용했을 때