System Platform

Environment

OS: Ubuntu 20.04

Setup

NVIDIA Docker Support

Install nvidia-container-toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit -y
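
The toolkit assumes the NVIDIA driver is already present on the host; it is worth verifying that before going further:

nvidia-smi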

Docker

Install Docker

sudo apt-get install docker.io -y
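
Enabling and restarting the daemon makes Docker start on boot and pick up the NVIDIA runtime hook installed above; standard systemd commands:

sudo systemctl enable --now docker
docker --version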

Grant the current user Docker permissions

sudo gpasswd -a $USER docker
groups
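
The group change only takes effect in new login sessions; either log out and back in, or switch groups in the current shell. After that, a quick end-to-end GPU smoke test (the CUDA image tag here is just an example):

newgrp docker
docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu20.04 nvidia-smi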

Docker Container

Custom Image

Build

Dockerfile

FROM nvidia/cuda:12.0.1-cudnn8-devel-ubuntu20.04

LABEL maintainer="KaleoFeng" \
      version="1.0-SNAPSHOT" \
      description="Nvidia CUDA, cuDNN and PyTorch"

USER root

RUN apt-get update; \
    apt-get install python3-pip -y; \
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

VOLUME [ "/data" ]

Environment notes:

  • CUDA v12.0 with cuDNN v8
  • Python v3.8.10
  • PyTorch v2.0.1 (cu118 build)
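
The next section pulls a prebuilt image, but if you build it yourself, a minimal sketch, assuming the Dockerfile above sits in the current directory and reusing the tag referenced below:

docker build -t kaleofeng/cudatorch:1.0-SNAPSHOT .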

Usage

Pull the image

docker pull kaleofeng/cudatorch:1.0-SNAPSHOT

Run the container

docker run \
-it \
--name cudatorch \
--volume $PWD/data:/data \
--detach \
--gpus all \
kaleofeng/cudatorch:1.0-SNAPSHOT
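
Because the container is started detached, it is worth confirming it stayed up before exec'ing into it:

docker ps --filter name=cudatorch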

Grant the current user ownership of the data directory

sudo chown -R $USER data

Enter the container

docker exec -it cudatorch bash

Run the command

nvidia-smi

Output

root@766d2bc6e192:/# nvidia-smi
Fri Jul 21 03:30:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P8    11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Run the command

nvcc -V

Output

root@766d2bc6e192:/# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Check that PyTorch can use the GPU

python3 -c "import torch; print(torch.cuda.is_available())"

Output

root@766d2bc6e192:/# python3 -c "import torch; print(torch.cuda.is_available())"
True
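
Two more standard torch one-liners confirm which CUDA build PyTorch ships with and which device it sees:

python3 -c "import torch; print(torch.version.cuda)"             # CUDA build, e.g. 11.8
python3 -c "import torch; print(torch.cuda.get_device_name(0))"  # e.g. Tesla T4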

View environment info

python3 -m torch.utils.collect_env

Application Deployment

Preparation

Install Git LFS

sudo apt-get install git-lfs -y
git lfs install
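
git lfs install registers the LFS filters in the global Git config so that the model clone later fetches the real weight files instead of pointer stubs. To confirm the installation:

git lfs version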

Project Setup

Prepare the Project

Upload the project

cd data
rz -bey
unzip ChatGLM2-6B-modified.zip -d ChatGLM2-6B
cd ChatGLM2-6B
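
Note that rz above is provided by the lrzsz package and needs a ZMODEM-capable terminal (e.g. an SSH client that supports it); if the command is missing on the host, install it first:

sudo apt-get install lrzsz -y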

Download the model

mkdir models
cd models
git clone https://huggingface.co/THUDM/chatglm2-6b
cd ..
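
The Hugging Face clone pulls several gigabytes of weights through Git LFS. If the transfer was interrupted, the remaining objects can be fetched from inside the repository; the listing is just to confirm the weight shards are GB-sized files rather than small LFS pointer stubs:

cd chatglm2-6b
git lfs pull
ls -lh
cd ..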

Install Dependencies

Enter the container

docker exec -it cudatorch bash

Enter the project

cd /data/ChatGLM2-6B

ChatGLM base dependencies

pip3 install -r requirements.txt
pip3 install --upgrade typing-extensions
pip3 install starlette==0.27.0

Required for P-Tuning

pip3 install datasets
pip3 install jieba
pip3 install rouge_chinese
pip3 install nltk

Required for full-parameter fine-tuning

apt-get install libaio-dev -y
pip3 install deepspeed

Check the environment

ds_report

Output

root@766d2bc6e192:/data/ChatGLM2-6B# ds_report
[2023-07-21 03:57:31,728] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
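
The two warnings concern only the optional sparse_attn op, which this workflow does not use; every other op reports compatible and will be JIT-compiled by ninja on first use, so no further action is needed here.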

Run Training

cd ptuning/
chmod +x ds_train_finetune.sh
./ds_train_finetune.sh
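
While the script runs, GPU utilization and memory can be monitored from another shell with standard tooling:

watch -n 1 nvidia-smi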