PyTorch RuntimeError: DataLoader worker (pid(s) 15332) exited unexpectedly


I am a beginner with PyTorch and I am just trying out some of the examples on this webpage. But I can't get the "super_resolution" program to run because of this error:

RuntimeError: DataLoader worker (pid(s) 15332) exited unexpectedly

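I searched online and found that some people suggest setting num_workers to 0. But if I do that, the program tells me that it is out of memory (on either the CPU or the GPU):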

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 9663676416 bytes. Buy new RAM!


RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 2.03 GiB already allocated; 0 bytes free; 2.03 GiB reserved in total by PyTorch)


I am using Python 3.8 on Win10 (64-bit) with PyTorch 1.4.0.

More complete error messages (--cuda means using the GPU, --threads x means passing x to the num_workers parameter):

  1. With command-line arguments --upscale_factor 1 --cuda
   File "E:\Python38\lib\site-packages\torch\utils\data\", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "E:\Python38\lib\multiprocessing\", line 108, in get
    raise Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Z:\super_resolution\", line 81, in <module>
  File "Z:\super_resolution\", line 48, in train
    for iteration, batch in enumerate(training_data_loader, 1):
  File "E:\Python38\lib\site-packages\torch\utils\data\", line 345, in __next__
    data = self._next_data()
  File "E:\Python38\lib\site-packages\torch\utils\data\", line 841, in _next_data
    idx, data = self._get_data()
  File "E:\Python38\lib\site-packages\torch\utils\data\", line 808, in _get_data
    success, data = self._try_get_data()
  File "E:\Python38\lib\site-packages\torch\utils\data\", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 16596, 9376, 12756, 9844) exited unexpectedly

  2. With command-line arguments --upscale_factor 1 --cuda --threads 0
   File "Z:\super_resolution\", line 81, in <module>
  File "Z:\super_resolution\", line 52, in train
    loss = criterion(model(input), target)
  File "E:\Python38\lib\site-packages\torch\nn\modules\", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "Z:\super_resolution\", line 21, in forward
    x = self.relu(self.conv2(x))
  File "E:\Python38\lib\site-packages\torch\nn\modules\", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "E:\Python38\lib\site-packages\torch\nn\modules\", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "E:\Python38\lib\site-packages\torch\nn\modules\", line 341, in conv2d_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 4.00 GiB total capacity; 2.03 GiB already allocated; 954.35 MiB free; 2.03 GiB reserved in total by PyTorch)

Originally posted by ihdv; translated under the CC BY-SA 4.0 license

2 answers

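There is no "complete" solution for GPU out-of-memory errors, but there are quite a few things you can do to reduce the memory demand. Also, make sure that you are not passing the training set and the test set to the GPU at the same time!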

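  1. Reduce the batch size to 1
  2. Reduce the dimensionality of the fully connected layers (they are the most memory-intensive)
  3. (Image data) Apply center cropping
  4. (Image data) Convert RGB data to grayscale
  5. (Text data) Truncate the input at n characters (this probably won't help much)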
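
To get a feel for how much suggestions 1, 3, and 4 can save, here is a rough back-of-the-envelope sketch in plain Python. It only counts the bytes of a single dense float32 tensor of shape (batch, channels, height, width). The function name tensor_bytes and every shape below are illustrative assumptions, not values taken from the question. As a sanity check, a hypothetical 64 x 64 x 256 x 256 float32 activation comes out at exactly 1024 MiB, the same size as the allocation that failed in the CUDA error in the question.

```python
def tensor_bytes(batch, channels, height, width, bytes_per_element=4):
    """Bytes needed for one dense tensor; 4 bytes per element means float32."""
    return batch * channels * height * width * bytes_per_element


MIB = 1024 ** 2

# Hypothetical activation: batch of 64, 64 feature maps, 256x256 pixels.
baseline = tensor_bytes(64, 64, 256, 256)
# Suggestion 1: reduce the batch size to 1.
small_batch = tensor_bytes(1, 64, 256, 256)
# Suggestion 3: center-crop to 128x128, which keeps a quarter of the pixels.
cropped = tensor_bytes(64, 64, 128, 128)
# Suggestion 4 affects the input tensor: 3 RGB channels versus 1 gray channel.
rgb_input = tensor_bytes(64, 3, 256, 256)
gray_input = tensor_bytes(64, 1, 256, 256)

for label, size in [
    ("baseline activation", baseline),
    ("batch size 1", small_batch),
    ("128x128 center crop", cropped),
    ("RGB input", rgb_input),
    ("grayscale input", gray_input),
]:
    print(f"{label}: {size / MIB:.0f} MiB")
```

Real training needs more than this, because every layer keeps its activations for the backward pass and the weights and gradients also take space, so treat these figures as a lower bound.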

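Alternatively, you can try running on Google Colaboratory (12-hour usage limit on a K80 GPU) or Next Journal, both of which provide up to 12 GB free of charge. In the worst case, you might have to train on your CPU. Hope this helps!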

Originally posted by ccl; translated under the CC BY-SA 4.0 license

This is the solution that worked for me. It may work for other Windows users too. Just remove or comment out num_workers to disable parallel loading.
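
For illustration, a minimal sketch of this workaround. DataLoader loads data in the main process when num_workers is 0, which is also its default when the argument is left out. The helper safe_num_workers is hypothetical and not part of PyTorch; it simply falls back to 0 on Windows and keeps the requested value elsewhere.

```python
import sys


def safe_num_workers(requested, platform=sys.platform):
    """Return a worker count for DataLoader: 0 on Windows, else the request.

    Hypothetical helper, not part of PyTorch. With 0 workers the data is
    loaded in the main process, so no worker process is started at all.
    """
    if requested < 0:
        raise ValueError("requested must be non-negative")
    return 0 if platform.startswith("win") else requested


# Intended use (train_set and the batch size are placeholders):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(train_set, batch_size=4, shuffle=True,
#                       num_workers=safe_num_workers(4))
print(safe_num_workers(4, platform="win32"))
print(safe_num_workers(4, platform="linux"))
```

Loading in the main process is slower, and as the question shows, it does not help when the underlying problem is running out of memory.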

Originally posted by Aneesh Cherian K; translated under the CC BY-SA 4.0 license
