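1. Environment Setup
This experiment trains the classic LeNet network on a handwritten-digit recognition task. Both runs in the comparison were carried out on 炼丹侠 A100 servers.
First, connect to the 炼丹侠 A100 cloud server with Tabby, then install the environment. The setup used here is CUDA 11.7 + Python 3.8 + pytorch/torchaudio/torchvision (the builds that match CUDA 11.7). The training code is available at: https://blog.csdn.net/eroDuanDian123456/article/details/12566...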
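Before training, it can help to confirm that the installed packages actually see the GPU. The following is a minimal sketch, not part of the original post: the helper name `describe_environment` is hypothetical, and the torch-specific fields are only filled in when PyTorch is installed.

```python
import importlib.util
import platform


def describe_environment():
    """Collect Python / PyTorch / CUDA details without failing if torch is missing."""
    info = {"python": platform.python_version()}
    for name in ("torch", "torchvision", "torchaudio"):
        info[name + "_installed"] = importlib.util.find_spec(name) is not None
    if info["torch_installed"]:
        import torch  # imported lazily so the check also runs on a bare interpreter
        info["torch_version"] = torch.__version__
        info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
        info["cuda_available"] = torch.cuda.is_available()
        if info["cuda_available"]:
            info["gpu_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

On a correctly configured server, `cuda_available` should be `True` and `cuda_build` should match the CUDA version chosen at install time.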
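2. Training Optimization
The original code only trains on the CPU. For the comparison, the network, the input data, and the labels were all moved to the GPU on top of the original code, so that the A100 is actually used to accelerate training. The modified code is as follows: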
import torch
import torch.nn as nn
import torchvision

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# MNIST dataset
train_dataset = torchvision.datasets.MNIST("dataset", train=True,
                                           transform=torchvision.transforms.ToTensor(),
                                           download=True)
# Evaluate on the held-out MNIST test split (train=False), not on the training data
test_dataset = torchvision.datasets.MNIST("dataset", train=False,
                                          transform=torchvision.transforms.ToTensor(),
                                          download=True)

# DataLoader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=128,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=100,
                                          shuffle=False)

# LeNet-5
class LeNet5(nn.Module):
    def __init__(self):
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)  # padding=2 pads 28x28 to 32x32 (28+2+2)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(400, 120)  # 400 = 16*5*5
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        in_size = x.size(0)  # batch size
        out = self.relu(self.pool(self.conv1(x)))
        out = self.relu(self.pool(self.conv2(out)))
        out = out.view(in_size, -1)  # flatten to (batch, 400)
        out = self.relu(self.fc1(out))
        out = self.relu(self.fc2(out))
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return self.fc3(out)

model = LeNet5().to(device)  # move the model to the GPU

# Loss function: cross-entropy loss
loss_func = torch.nn.CrossEntropyLoss()

# Optimizer
opt = torch.optim.Adam(model.parameters(), lr=0.001)

def train(epoch):
    model.train()
    for batch_index, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)  # move the batch to the GPU
        opt.zero_grad()  # clear gradients before backward
        output = model(data)
        loss = loss_func(output, target)
        # Backpropagate the error
        loss.backward()
        # Update the parameters
        opt.step()
        if batch_index % 20 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_index * len(data), len(train_loader.dataset),
                100. * batch_index / len(train_loader), loss.item()))

def test():
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)  # move the batch to the GPU
            output = model(data)
            # Accumulate the per-batch mean loss
            test_loss += loss_func(output, target).item()
            # Predicted label = index of the largest output
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
    # loss_func averages within each batch, so average over the number of batches
    test_loss /= len(test_loader)

    print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))

# Train for 10 epochs, then test
for epoch in range(1, 11):
    train(epoch)

test()
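
The input size of the first fully connected layer (400 = 16*5*5) follows from the convolution and pooling sizes in the network. The sketch below retraces that arithmetic in plain Python. The helper names `conv_out`, `pool_out`, and `lenet_flat_features` are hypothetical and only illustrate the standard output-size formulas, assuming stride 1 for the convolutions and non-overlapping 2x2 pooling as in the code above.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution (standard formula, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Spatial output size of non-overlapping max pooling (stride == kernel)."""
    return size // kernel


def lenet_flat_features(input_size=28):
    size = conv_out(input_size, kernel=5, padding=2)  # conv1: 28 -> 28
    size = pool_out(size, 2)                          # pool:  28 -> 14
    size = conv_out(size, kernel=5)                   # conv2: 14 -> 10
    size = pool_out(size, 2)                          # pool:  10 -> 5
    return 16 * size * size                           # 16 output channels


print(lenet_flat_features())  # 400
```

If the kernel size or padding is changed, the first argument of `nn.Linear` in `fc1` has to change to match the new value.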
3. Training Process
The original post embeds a video recording of the test run here.
4. Results
The CPU version took 152 seconds in total to train.
The 炼丹侠 A100-accelerated version took 33 seconds in total to train.
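The post does not say how these totals were measured. One possible way to time a run and compute the resulting speedup is sketched below. The helpers `timed` and `speedup` are hypothetical names. The optional `sync` argument is an assumption added because CUDA kernels are launched asynchronously; on a GPU one would typically pass `torch.cuda.synchronize` so the clock is read only after queued work has finished.

```python
import time


def timed(fn, sync=None):
    """Run fn() and return (result, elapsed_seconds).

    sync is an optional callable invoked before each clock read, e.g.
    torch.cuda.synchronize when timing GPU work.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start


def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of the baseline time to the accelerated time."""
    return baseline_seconds / accelerated_seconds


# Totals reported above: 152 s on the CPU, 33 s on the A100
print(f"speedup: {speedup(152, 33):.1f}x")  # speedup: 4.6x
```

With the reported totals, the A100 run is roughly 4.6 times faster than the CPU run.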