[Multi GPU] Training with Multiple GPUs

1. Core code

import torch.nn as nn


# device_ids : the GPUs used for training
# output_device : the GPU where the outputs are gathered, i.e. the loss is computed on output_device.
resnet_model = nn.DataParallel(resnet_model, device_ids=[0,1,2],output_device=2)

- How it works.

Source: https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

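1. At every iteration, split the batch into as many chunks as there are GPUs. (scatter)

2. Copy the model to each GPU. (replicate)

3. Run the forward pass on each GPU.

4. Gather the outputs from every GPU onto a single GPU. (gather)

5. On that single GPU, compare the outputs with the labels to compute the loss.

6. Split the gradient of the loss with respect to the outputs and send each part to its GPU. (scatter)

7. Run the backward pass on each device using the gradient it received.

8. Collect all gradients on a single GPU and update the model parameters.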
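The steps above can be mimicked without any GPU. The following is a minimal pure-Python sketch, not how nn.DataParallel is actually implemented: the "model" is a single hypothetical weight w with y = w * x, the loss is the mean squared error, and the helper names split_batch and data_parallel_step are made up for illustration. It only aims to show that scattering the batch, running forward and backward per device, and summing the per-device gradients should give the same gradient as processing the whole batch on one device.

```python
def split_batch(batch, num_devices):
    """Split a list into at most num_devices contiguous chunks (the scatter step)."""
    if not batch:
        return []
    chunk = -(-len(batch) // num_devices)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]


def data_parallel_step(w, xs, ys, num_devices):
    """One simulated iteration for the toy model y = w * x with an MSE loss."""
    # scatter: split the inputs across the simulated devices
    x_chunks = split_batch(xs, num_devices)
    # replicate: every device gets its own copy of the weight
    replicas = [w for _ in x_chunks]
    # forward: each device processes only its own chunk
    out_chunks = [[r * x for x in chunk] for r, chunk in zip(replicas, x_chunks)]
    # gather: concatenate the outputs on one device
    outputs = [o for chunk in out_chunks for o in chunk]
    # loss: computed once, on the device that holds the gathered outputs
    n = len(ys)
    loss = sum((o - y) ** 2 for o, y in zip(outputs, ys)) / n
    # scatter the gradient of the loss with respect to each output
    grad_out = [2.0 * (o - y) / n for o, y in zip(outputs, ys)]
    grad_chunks = split_batch(grad_out, num_devices)
    # backward: each device computes its local gradient for w (d(out)/dw = x)
    local_grads = [
        sum(g * x for g, x in zip(g_chunk, x_chunk))
        for g_chunk, x_chunk in zip(grad_chunks, x_chunks)
    ]
    # reduce: sum the local gradients on one device before the update
    return loss, sum(local_grads)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0 * x for x in xs]
loss_1, grad_1 = data_parallel_step(0.5, xs, ys, num_devices=1)
loss_3, grad_3 = data_parallel_step(0.5, xs, ys, num_devices=3)
print("1 device :", loss_1, grad_1)
print("3 devices:", loss_3, grad_3)
```

Both calls should report the same loss and (up to floating-point rounding) the same gradient, which is why the parameter update does not depend on how many devices the batch was split across.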

2. Caveats

1. The first GPU ID in device_ids must be the same GPU that the model is placed on.

→ Otherwise, the following error occurs.

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

→ In other words, do not write code like this:

device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
resnet_model.to(device)
resnet_model = nn.DataParallel(resnet_model, device_ids=[0,1,2],output_device=2)

→ Solution (make the first GPU ID in device_ids match the GPU that resnet_model is on.)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
resnet_model.to(device)
resnet_model = nn.DataParallel(resnet_model, device_ids=[0,1,2],output_device=2)

2. If output_device is set to GPU n, the labels (ground-truth values) must also be moved to GPU n, because the loss is computed on output_device, as noted above.

→ In the code below, the second line from the bottom is labels.to(2). The device passed to labels.to() must match the output_device of nn.DataParallel.

# GPU setup
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
resnet_model.to(device)
resnet_model = nn.DataParallel(resnet_model, device_ids=[0,1,2],output_device=2)

# Training the model
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # Get the inputs and labels; DataParallel scatters the inputs across the GPUs itself
        inputs, labels = data

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = resnet_model(inputs)

        # print(f'Outputs are on device: {outputs.device}')
        # print(f'Labels are on device: {labels.device}')

        # Move the labels to output_device (GPU 2), where the outputs are gathered
        labels = labels.to(2)
        loss = criterion(outputs, labels)
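The two device rules above can be written down as a small sanity check. This is only a sketch: check_data_parallel_devices is a hypothetical helper (not part of PyTorch) that works on plain GPU indices rather than real tensors, so it can be run on a machine without GPUs. It assumes that output_device falls back to device_ids[0] when it is not given, which is the default described in the nn.DataParallel documentation.

```python
def check_data_parallel_devices(model_device, label_device, device_ids, output_device=None):
    """Return a list of problems; an empty list means the GPU indices are consistent.

    All arguments are plain GPU indices (ints); this helper is hypothetical
    and does not touch PyTorch at all.
    """
    problems = []
    if output_device is None:
        # Assumed default: outputs are gathered on the first listed GPU
        output_device = device_ids[0]
    if model_device != device_ids[0]:
        problems.append(
            f"model is on GPU {model_device}, but device_ids[0] is GPU {device_ids[0]}"
        )
    if label_device != output_device:
        problems.append(
            f"labels are on GPU {label_device}, but output_device is GPU {output_device}"
        )
    return problems


# The failing setup from caveat 1: model on GPU 1, device_ids starting at GPU 0
for problem in check_data_parallel_devices(1, 2, [0, 1, 2], output_device=2):
    print("problem:", problem)

# The corrected setup: model on GPU 0, labels on GPU 2
print(check_data_parallel_devices(0, 2, [0, 1, 2], output_device=2))
```

In real training code the same idea could be expressed by comparing outputs.device with labels.device right before the loss is computed.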

3. The batch size must be at least the number of GPUs in use, i.e. batch_size >= number of GPUs.

Source: https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html

→ In other words, avoid a setup like the following.

# Batch size set to 2.
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=2, shuffle=True, num_workers=2)

# ... (omitted) ...

# Number of GPUs set to 3.
resnet_model = nn.DataParallel(resnet_model, device_ids=[0,1,2],output_device=2)
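To see why a batch smaller than the number of GPUs is a problem, it helps to look at how many samples each GPU would receive. The sketch below is an assumption-laden illustration: it assumes the scatter step splits the batch the way torch.chunk does along dimension 0 (equal chunks of ceil(batch_size / num_gpus) samples, with a smaller last chunk and possibly fewer chunks than requested). chunk_sizes is a hypothetical helper, not a PyTorch function.

```python
def chunk_sizes(batch_size, num_gpus):
    """Per-GPU sample counts, assuming torch.chunk-style splitting of the batch."""
    size = -(-batch_size // num_gpus)  # ceiling division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


for batch_size in (256, 3, 2):
    sizes = chunk_sizes(batch_size, num_gpus=3)
    idle = 3 - len(sizes)
    print(f"batch_size={batch_size}: chunks={sizes}, GPUs without data={idle}")
```

Under this assumption, batch_size=2 on 3 GPUs produces only two chunks, so one GPU would receive no data at all, while batch_size=256 spreads the work over all three.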

3. Full multi-GPU example code

import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision import models

# Report how many GPUs are visible to PyTorch
if torch.cuda.is_available():
    device_count = torch.cuda.device_count()
    print(f"Number of available GPUs: {device_count}")
else:
    print("No GPU is available.")

# Define transformations for the training set and the test set
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])

# Load the CIFAR-10 dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=256, shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False, num_workers=2)

# Define the classes
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# Load the pre-trained ResNet-18 model
resnet_model = models.resnet18(weights='DEFAULT')




# Modify the last layer of ResNet-18 for CIFAR-10 (10 classes)
num_ftrs = resnet_model.fc.in_features
resnet_model.fc = nn.Linear(num_ftrs, 10)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(resnet_model.parameters(), lr=0.001, momentum=0.9)

# Check if GPU is available and move the model to GPU 0 (must match device_ids[0])
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
resnet_model.to(device)

####### Use multiple GPUs ########
resnet_model = nn.DataParallel(resnet_model, device_ids=[0,1,2],output_device=2)
##################################

# Training the model
for epoch in range(10):  
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # Get the inputs and labels; DataParallel scatters the inputs across the GPUs itself
        inputs, labels = data

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = resnet_model(inputs)

        # print(f'Outputs are on device: {outputs.device}')
        # print(f'Labels are on device: {labels.device}')

        # Move the labels to output_device (GPU 2), where the outputs are gathered
        labels = labels.to(2)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        # Print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print(f'[Epoch {epoch + 1}, Batch {i + 1}] loss: {running_loss / 100:.3f}')
            running_loss = 0.0

    # Save the model after each epoch
    os.makedirs("./pt_1/", exist_ok=True)
    torch.save(resnet_model.state_dict(), f"./pt_1/Multi_GPU_state_dict_{epoch}.pt")
    torch.save(resnet_model, f"./pt_1/Multi_GPU_{epoch}.pt")
    print('save model')

print('Finished Training')

# Save the trained model
PATH = './cifar_resnet18.pth'
torch.save(resnet_model.state_dict(), PATH)

# Testing the model
resnet_model.eval()  # Set the model to evaluation mode
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        # The outputs come back on output_device (GPU 2), so the labels must be there too
        images, labels = images.to(device), labels.to(2)
        outputs = resnet_model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct / total:.2f}%')
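One practical follow-up to the saving code above: the state_dict() of an nn.DataParallel wrapper stores the wrapped model under the attribute module, so every key is saved with a "module." prefix. Loading such a checkpoint into a plain, unwrapped ResNet-18 would then fail on the key names. The sketch below shows one possible way to handle this; strip_module_prefix is a hypothetical helper, and it is demonstrated on an ordinary dict so that it runs without PyTorch.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from every key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Stand-in for a checkpoint saved from a DataParallel model (values are placeholders)
saved = OrderedDict([("module.conv1.weight", "w0"), ("module.fc.bias", "b0")])
print(list(strip_module_prefix(saved).keys()))
```

With a real checkpoint, the cleaned dictionary could then be passed to load_state_dict() of an unwrapped model. An alternative is to save resnet_model.module.state_dict() in the first place, which avoids the prefix entirely.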

References:

https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

https://medium.com/daangn/pytorch-multi-gpu-%ED%95%99%EC%8A%B5-%EC%A0%9C%EB%8C%80%EB%A1%9C-%ED%95%98%EA%B8%B0-27270617936b
