System Platform

Environment

OS: Ubuntu 20.04

Setup

NVIDIA Docker Support

Install nvidia-container-toolkit:

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
    sudo apt-get install nvidia-container-toolkit -y

Docker Software

Install Docker:

    sudo apt-get install docker.io -y

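Grant the current user permission to use Docker. The new group membership only takes effect in a new login session, so log out and back in before checking it with groups:

    sudo gpasswd -a $USER docker
    groups
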
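The group change can also be checked programmatically. The following is a minimal Python sketch, assuming a Linux host; the function name group_status and the keys of the returned dictionary are illustrative and not part of Docker or any tool used above. It separates "the user is listed in the group database" from "the group is active in the current session", which is the distinction that usually causes confusion right after running gpasswd.

```python
import getpass
import grp
import os


def group_status(group_name="docker", user=None):
    """Report whether `user` is in `group_name` and whether this session has the group."""
    user = user or getpass.getuser()
    try:
        entry = grp.getgrnam(group_name)
    except KeyError:
        # The group does not exist, e.g. Docker is not installed yet.
        return {"group_exists": False, "in_group_db": False, "active_in_session": False}
    return {
        "group_exists": True,
        # Listed in /etc/group (or the configured group database).
        "in_group_db": user in entry.gr_mem,
        # Active for the current process; stays False until a new login session starts.
        "active_in_session": entry.gr_gid in os.getgroups() or entry.gr_gid == os.getegid(),
    }


if __name__ == "__main__":
    print(group_status())
```

If in_group_db is true but active_in_session is false, starting a new login session should be enough.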
Docker Container

Custom Image

Build

Dockerfile:

    FROM nvidia/cuda:12.0.1-cudnn8-devel-ubuntu20.04

    LABEL maintainer="KaleoFeng" \
      version="1.0-SNAPSHOT" \
      description="Nvidia CUDA, cuDNN and PyTorch"

    USER root

    RUN apt update; \
      apt-get install python3-pip -y; \
      pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    VOLUME [ "/data" ]

Environment notes:

CUDA v12.0 with cuDNN 8 (from the base image tag)
Python v3.8.10
PyTorch v2.1.0, built for cu118 (CUDA 11.8)

Usage

Pull the image:

    docker pull kaleofeng/cudatorch:1.0-SNAPSHOT

Run the container:

    docker run \
      -it \
      --name cudatorch \
      --volume $PWD/data:/data \
      --detach \
      --gpus all \
      kaleofeng/cudatorch:1.0-SNAPSHOT

Give the current user ownership of the data directory:

    sudo chown -R $USER data

Enter the container:

    docker exec -it cudatorch bash

Run nvidia-smi inside the container to confirm that the GPU is visible.

Output:

    root@766d2bc6e192:/# nvidia-smi
    Fri Jul 21 03:30:07 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:08.0 Off |                    0 |
    | N/A   31C    P8    11W /  70W |      2MiB / 15360MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

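The table above is meant for reading by eye. For scripted checks, nvidia-smi also offers a CSV query mode through its --query-gpu and --format options. The sketch below is one possible way to use it from Python. The helper names parse_gpu_csv and query_gpus, and the choice of queried fields, are assumptions made for illustration.

```python
import csv
import io
import shutil
import subprocess

# Fields to request from nvidia-smi; this selection is only an example.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(text)):
        values = [cell.strip() for cell in row]
        if len(values) == len(QUERY_FIELDS):  # skip blank or malformed lines
            gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Return GPU info, or an empty list when nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

An empty list from query_gpus() inside the container usually suggests that the container was started without --gpus all or that the container toolkit is not set up on the host.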
Run nvcc -V to check the CUDA compiler version.

Output:

    root@766d2bc6e192:/# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Fri_Jan__6_16:45:21_PST_2023
    Cuda compilation tools, release 12.0, V12.0.140
    Build cuda_12.0.r12.0/compiler.32267302_0

Check whether the installed PyTorch is the GPU build:

    python3 -c "import torch; print(torch.cuda.is_available())"

Output:

    root@766d2bc6e192:/# python3 -c "import torch; print(torch.cuda.is_available())"
    True

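A True result only says that PyTorch can see a CUDA device. A slightly fuller check, sketched below, also reports which CUDA version the wheel was built for and runs a tiny computation on the GPU. The function name gpu_smoke_test and the report keys are illustrative. The sketch falls back gracefully when PyTorch or a GPU is absent.

```python
def gpu_smoke_test():
    """Collect basic PyTorch/CUDA facts and run a tiny matmul on the GPU if possible."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        ones = torch.ones(2, 2, device="cuda")
        # Every entry of ones @ ones is 2, so the sum over the 2x2 result is 8.
        report["matmul_sum"] = float((ones @ ones).sum().item())
    return report


if __name__ == "__main__":
    for key, value in gpu_smoke_test().items():
        print(f"{key}: {value}")
```

With the cu118 wheels installed by the Dockerfile, built_for_cuda would be expected to differ from the CUDA 12.0 toolkit reported by nvcc, since the pip wheels bring their own CUDA runtime libraries.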
View environment information:

    python3 -m torch.utils.collect_env

Application Deployment

Preparation

Install Git LFS:

    sudo apt-get install git-lfs -y
    git lfs install

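Project Application

Prepare the Project

Upload the project:

    cd data
    # receive the archive over ZMODEM (needs lrzsz and a ZMODEM-capable terminal)
    rz -bey
    unzip ChatGLM2-6B-modified.zip -d ChatGLM2-6B
    cd ChatGLM2-6B
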
Download the model:

    mkdir models
    cd models
    git clone https://huggingface.co/THUDM/chatglm2-6b
    cd ..

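If Git LFS was not active during the clone, the large weight files are left as small text stubs ("pointer files") that begin with the line version https://git-lfs.github.com/spec/v1. The following sketch scans a directory for such stubs. The function name find_lfs_pointers is an assumption for illustration, and the path in the usage line simply follows the clone command above.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


if __name__ == "__main__":
    for stub in find_lfs_pointers("models/chatglm2-6b"):
        print("not downloaded:", stub)
```

Any file it lists can typically be fetched by running git lfs pull inside the cloned repository.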
Install Dependencies

Enter the container:

    docker exec -it cudatorch bash

Change into the project directory (/data/ChatGLM2-6B inside the container).

ChatGLM base dependencies:

    pip3 install -r requirements.txt
    pip3 install --upgrade typing-extensions
    pip3 install starlette==0.27.0

Required for P-Tuning:

    pip3 install datasets
    pip3 install jieba
    pip3 install rouge_chinese
    pip3 install nltk

Required for full-parameter fine-tuning:

    apt-get install libaio-dev -y
    pip3 install deepspeed

Check the environment with ds_report.

Output:

    root@766d2bc6e192:/data/ChatGLM2-6B# ds_report
    [2023-07-21 03:57:31,728] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    async_io ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    fused_lamb ............. [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    random_ltd ............. [NO] ....... [OKAY]
     [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
     [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
    sparse_attn ............ [NO] ....... [NO]
    spatial_inference ...... [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
    torch version .................... 2.0.1+cu118
    deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
    deepspeed info ................... 0.10.0, unknown, unknown
    torch cuda version ............... 11.8
    torch hip version ................ None
    nvcc version ..................... 12.0
    deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

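In the report above, every op shows [NO] under "installed", which per the report's own note means it will be JIT-compiled on first use. Only sparse_attn is marked incompatible. To pick out such rows automatically, the op table can be parsed with a few lines of Python. This is a rough sketch: the names parse_op_table and incompatible_ops are illustrative, the regular expression is fitted to the layout shown here, and treating [YES] as the "installed" marker is an assumption. ANSI color codes are stripped first because the tool may colorize its status markers.

```python
import re

# Matches rows such as: "async_io ............... [NO] ....... [OKAY]"
OP_LINE = re.compile(
    r"^(?P<name>\w+)\s+\.+\s+\[(?P<installed>\w+)\]\s+\.+\s+\[(?P<compatible>\w+)\]\s*$"
)
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def parse_op_table(report_text):
    """Map each op name to its installed/compatible flags."""
    ops = {}
    for line in ANSI_ESCAPE.sub("", report_text).splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            ops[match["name"]] = {
                "installed": match["installed"] == "YES",  # assumed marker for installed ops
                "compatible": match["compatible"] == "OKAY",
            }
    return ops


def incompatible_ops(report_text):
    """Return the sorted names of ops whose 'compatible' column is not [OKAY]."""
    table = parse_op_table(report_text)
    return sorted(name for name, status in table.items() if not status["compatible"])


if __name__ == "__main__":
    import sys

    # Example: ds_report | python3 check_ops.py  (the script name is hypothetical)
    print(incompatible_ops(sys.stdin.read()))
```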
Run training:

    cd ptuning/
    chmod +x ds_train_finetune.sh
    ./ds_train_finetune.sh