如何将 Pytorch ( cuda) 与 A100 GPU 一起使用？

我试图将我当前的代码与 A100 gpu 一起使用，但出现此错误：

 ---> backend='nccl'
/home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2/lib/python3.8/site-packages/torch/cuda/__init__.py:104: UserWarning:
A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

这很令人困惑，因为它指向通常的 pytorch 安装，但没有告诉我 pytorch 版本 + cuda 版本的哪个组合用于我的特定硬件 (A100)。为 A100 安装 pytorch 的正确方法是什么？

这些是我尝试过的一些版本：

 # conda install -y pytorch==1.8.0 torchvision cudatoolkit=10.2 -c pytorch
# conda install -y pytorch torchvision cudatoolkit=10.2 -c pytorch
#conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge
# conda install -y pytorch==1.6.0 torchvision cudatoolkit=10.2 -c pytorch
#conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

# conda install pytorch torchvision torchaudio cudatoolkit=11.0 -c pytorch
# conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
# conda install -y pytorch torchvision cudatoolkit=9.2 -c pytorch # For Nano, CC
# conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

请注意，这可能很微妙，因为我过去在这台机器 + pytorch 版本上遇到过这个错误：

如何解决著名的 `unhandled cuda error, NCCL version 2.7.8` 错误？

奖励 1：

我仍然有错误：

 ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1423, in <module>
    main()
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1365, in main
    train(args=args)
  File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1385, in train
    args.opt = move_opt_to_cherry_opt_and_sync_params(args) if is_running_parallel(args.rank) else args.opt
  File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/distributed.py", line 456, in move_opt_to_cherry_opt_and_sync_params
    args.opt = cherry.optim.Distributed(args.model.parameters(), opt=args.opt, sync=syn)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 62, in __init__
    self.sync_parameters()
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 78, in sync_parameters
    dist.broadcast(p.data, src=root)
  File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

建议让 nvcca 和 pytorch.version.cuda 匹配的答案之一，但它们不匹配：

 (meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)"

11.1
(meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

我如何匹配它们？我这是错误吗？有人可以显示他们的 pip、conda 和 nvcca 版本以查看设置的效果吗？

更多错误信息：

 hal-dgx:21797:21797 [0] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21797:21797 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21797:21797 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21797:21797 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda11.1
hal-dgx:21805:21805 [2] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21799:21799 [1] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21805:21805 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21799:21799 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21811:21811 [3] NCCL INFO Bootstrap : Using [0]enp226s0:141.142.153.83<0> [1]virbr0:192.168.122.1<0>
hal-dgx:21811:21811 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hal-dgx:21811:21811 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21811:21811 [3] NCCL INFO Using network IB
hal-dgx:21799:21799 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21805:21805 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB [4]mlx5_4:1/IB [5]mlx5_5:1/IB [6]mlx5_6:1/IB [7]mlx5_7:1/IB ; OOB enp226s0:141.142.153.83<0>
hal-dgx:21799:21799 [1] NCCL INFO Using network IB
hal-dgx:21805:21805 [2] NCCL INFO Using network IB

hal-dgx:21797:27906 [0] misc/ibvwrap.cc:280 NCCL WARN Call to ibv_create_qp failed
hal-dgx:21797:27906 [0] NCCL INFO transport/net_ib.cc:360 -> 2
hal-dgx:21797:27906 [0] NCCL INFO transport/net_ib.cc:437 -> 2
hal-dgx:21797:27906 [0] NCCL INFO include/net.h:21 -> 2
hal-dgx:21797:27906 [0] NCCL INFO include/net.h:51 -> 2
hal-dgx:21797:27906 [0] NCCL INFO init.cc:300 -> 2
hal-dgx:21797:27906 [0] NCCL INFO init.cc:566 -> 2
hal-dgx:21797:27906 [0] NCCL INFO init.cc:840 -> 2
hal-dgx:21797:27906 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

hal-dgx:21811:27929 [3] misc/ibvwrap.cc:280 NCCL WARN Call to ibv_create_qp failed
hal-dgx:21811:27929 [3] NCCL INFO transport/net_ib.cc:360 -> 2
hal-dgx:21811:27929 [3] NCCL INFO transport/net_ib.cc:437 -> 2
hal-dgx:21811:27929 [3] NCCL INFO include/net.h:21 -> 2
hal-dgx:21811:27929 [3] NCCL INFO include/net.h:51 -> 2
hal-dgx:21811:27929 [3] NCCL INFO init.cc:300 -> 2
hal-dgx:21811:27929 [3] NCCL INFO init.cc:566 -> 2
hal-dgx:21811:27929 [3] NCCL INFO init.cc:840 -> 2
hal-dgx:21811:27929 [3] NCCL INFO group.cc:73 -> 2 [Async thread]

放后

import os
os.environ["NCCL_DEBUG"] = "INFO"

原文由 Charlie Parker 发布，翻译遵循 CC BY-SA 4.0 许可协议

如何将 Pytorch ( cuda) 与 A100 GPU 一起使用？

奖励 1：

你尚未登录，登录后可以

字节的 trae AI IDE 不支持类似 vscode 的 ssh remote 远程开发怎么办？

DataCap 中验证码无法显示，后台出现 NullPointerException 错误?

发现深拷贝和浅拷贝效果一致：请问一下有什么区别呢？

如何实现一个深拷贝函数？

Python 成员变量在多个子类实例间共享，如何避免？

分解质因素的算法很难，理解不了。请问有哪位大佬可以进行解释一下呢？

为什么 Qwen2.5-Omni-7B 官方教程都报错 Cannot import available module of Qwen2_5OmniModel in modelscope ？

Stack Overflow 翻译

如何将 Pytorch ( cuda) 与 A100 GPU 一起使用？

奖励 1：

你尚未登录，登录后可以

字节的 trae AI IDE 不支持类似 vscode 的 ssh remote 远程开发怎么办？

DataCap 中验证码无法显示，后台出现 NullPointerException 错误?

发现深拷贝和浅拷贝效果一致：请问一下有什么区别呢？

如何实现一个深拷贝函数？

Python 成员变量在多个子类实例间共享，如何避免？

分解质因素的算法很难，理解不了。 请问有哪位大佬可以进行解释一下呢？

为什么 Qwen2.5-Omni-7B 官方教程都报错 Cannot import available module of Qwen2_5OmniModel in modelscope ？

Stack Overflow 翻译

分解质因素的算法很难，理解不了。请问有哪位大佬可以进行解释一下呢？