Installing pytorch1.10-dtk22.04 at the Kunshan Center
1) Directory containing the local whl files (this tutorial uses dtk-22.04.1 as the example)
/public/software/apps/DeepLearning/whl/dtk-22.04.1
or
/public/software/apps/DeepLearning/whl/dtk-22.04.2
2) Set environment variables
Set the MIOpen variables:
export MIOPEN_SYSTEM_DB_PATH=/temp/pytorch-miopen-2.8
export MIOPEN_DEBUG_DISABLE_FIND_DB=1
export MIOPEN_DEBUG_CONV_WINOGRAD=0
export MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0
export HSA_USERPTR_FOR_PAGED_MEM=0
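Because these exports only apply to the current shell, it can help to confirm they are actually visible to Python before running anything. The following is a minimal sketch, not part of the official procedure: the helper name `mismatched_vars` is hypothetical, and the expected values are simply copied from the export lines above.

```python
import os

# Expected values, copied from the export lines in step 2.
EXPECTED_MIOPEN_ENV = {
    "MIOPEN_SYSTEM_DB_PATH": "/temp/pytorch-miopen-2.8",
    "MIOPEN_DEBUG_DISABLE_FIND_DB": "1",
    "MIOPEN_DEBUG_CONV_WINOGRAD": "0",
    "MIOPEN_DEBUG_CONV_IMPLICIT_GEMM": "0",
    "HSA_USERPTR_FOR_PAGED_MEM": "0",
}


def mismatched_vars(env=None):
    """Return {name: (expected, actual)} for each variable that differs.

    `env` defaults to the current process environment; a plain dict can be
    passed instead, which makes the helper easy to try out.
    """
    env = os.environ if env is None else env
    return {
        name: (expected, env.get(name))
        for name, expected in EXPECTED_MIOPEN_ENV.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    problems = mismatched_vars()
    for name, (expected, actual) in problems.items():
        print(f"{name}: expected {expected!r}, found {actual!r}")
    if not problems:
        print("All variables from step 2 are set as expected.")
```

A variable that was never exported is reported with `None` as its actual value.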
3) Create a conda environment (python3.7 is used as the example)
conda create -n pytorch_1.10-dtk_22.04 python=3.7
Add the conda environment's lib directory to the library search path:
export LD_LIBRARY_PATH=/public/home/username/miniconda3/envs/pytorch_1.10-dtk_22.04/lib:$LD_LIBRARY_PATH
4) Install PyTorch 1.10 in the conda environment (the python3.7-pytorch1.10 build is used as the example)
conda activate pytorch_1.10-dtk_22.04
pip install /public/software/apps/DeepLearning/whl/dtk-22.04/torch-1.10.0a0+git450cdd1.dtk22.4-cp37-cp37m-linux_x86_64.whl
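The `cp37` part of the wheel file name is the Python tag, so this wheel only installs into a CPython 3.7 environment. As a rough sketch (the helper names `wheel_python_tag` and `matches_interpreter` are hypothetical, introduced only for illustration), the tag can be compared with the running interpreter before calling pip:

```python
import sys


def wheel_python_tag(wheel_filename):
    """Return the Python tag (for example 'cp37') from a wheel file name.

    Wheel names end with <python tag>-<abi tag>-<platform tag>.whl,
    so the Python tag is the third field from the end.
    """
    stem = wheel_filename
    if stem.endswith(".whl"):
        stem = stem[: -len(".whl")]
    parts = stem.split("-")
    if len(parts) < 5:
        raise ValueError(f"not a valid wheel file name: {wheel_filename!r}")
    return parts[-3]


def matches_interpreter(wheel_filename, version=None):
    """Check whether the wheel's CPython tag matches a (major, minor) version.

    `version` defaults to the interpreter running this script.
    """
    major, minor = version or sys.version_info[:2]
    return wheel_python_tag(wheel_filename) == f"cp{major}{minor}"


if __name__ == "__main__":
    name = "torch-1.10.0a0+git450cdd1.dtk22.4-cp37-cp37m-linux_x86_64.whl"
    print(wheel_python_tag(name), matches_interpreter(name))
```

If the check fails, the conda environment was probably created with a different Python version than the one the wheel targets.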
5) Install dependencies
pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
6) Verify the installation (check whether the DCU can be used)
List the available partitions:
whichpartition
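Request a node and log in to the compute node for testing:
salloc -p <partition_name> -N 1 --gres=dcu:2
Log in to the node (use the name of the node that was allocated to you):
ssh <node_name>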
Switch the ROCm compiler version:
module switch compiler/rocm/dtk-22.04.1
7) Create a file named pytorch_env.sh in your home directory, put the export line below in it, then source the file
vi ~/pytorch_env.sh
export LD_LIBRARY_PATH=/public/software/apps/DeepLearning/PyTorch/lib:/public/software/apps/DeepLearning/PyTorch/lmdb-0.9.24-build/lib:/public/software/apps/DeepLearning/PyTorch/opencv-2.4.13.6-build/lib:/public/software/apps/DeepLearning/PyTorch/openblas-0.3.7-build/lib:$LD_LIBRARY_PATH
source ~/pytorch_env.sh
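A mistyped entry in a long `LD_LIBRARY_PATH` is easy to miss and typically only shows up later as a missing shared library at import time. The sketch below, assuming nothing beyond the Python standard library, lists the entries that do not exist as directories on the current node; the helper name `missing_library_dirs` is hypothetical.

```python
import os


def missing_library_dirs(ld_library_path=None):
    """Return the LD_LIBRARY_PATH entries that are not existing directories.

    `ld_library_path` defaults to the value in the current environment.
    Empty entries (for example from a trailing separator) are ignored.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    entries = [entry for entry in ld_library_path.split(os.pathsep) if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


if __name__ == "__main__":
    for entry in missing_library_dirs():
        print(f"not found: {entry}")
```

Run it on the compute node after sourcing the file, since that is where the paths need to resolve.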
Activate the pytorch_1.10-dtk_22.04 environment (logging in to the compute node drops the previously active environment, so it must be activated again):
conda activate pytorch_1.10-dtk_22.04
Inside the environment, run the following in order:
python
import torch
torch.cuda.is_available()
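If the last call returns True, PyTorch can see the DCU. The same check can be kept as a small script so it can be rerun after every login. This is only a sketch: the function name `dcu_report` is hypothetical, and it relies on the DCU being exposed through the `torch.cuda` namespace, as the interactive check above already assumes.

```python
def dcu_report():
    """Collect basic facts about the PyTorch install and device visibility.

    Returns a dict instead of raising, so the script also gives a useful
    answer when torch itself cannot be imported.
    """
    try:
        import torch
    except ImportError as exc:
        return {"torch_imported": False, "error": str(exc)}

    available = torch.cuda.is_available()
    return {
        "torch_imported": True,
        "version": torch.__version__,
        "device_available": available,
        # Only query the count when a device is reported as available.
        "device_count": torch.cuda.device_count() if available else 0,
    }


if __name__ == "__main__":
    for key, value in dcu_report().items():
        print(f"{key}: {value}")
```

An import error here usually points back to the library path set in step 7, while `device_available: False` suggests the script is not running on the allocated compute node or the compiler module from step 6 was not switched.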