Installing and Deploying Docker for NVIDIA GPUs

Background

GPUs are essential in the field of artificial intelligence. In this post, we walk through how to install Docker with NVIDIA GPU support on a server and deploy a GPU-enabled container.

Upgrade Your CUDA

Official website link

Choose the version that matches your system and install it.
(Screenshot: CUDA download page)
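
Before upgrading, it is worth checking which driver and CUDA toolkit versions are currently installed. A minimal check, assuming the driver and toolkit are already present:

# Driver version and the highest CUDA version it supports
nvidia-smi

# Version of the currently installed CUDA toolkit, if any
nvcc --version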

Installation

  1. Make sure the NVIDIA driver and the Docker engine are already installed on your system, and that the driver version is compatible with the Docker engine. You also need the nvidia-docker2 package, a plugin for NVIDIA Docker; installation instructions can be found at https://github.com/NVIDIA/nvidia-docker. You can confirm the prerequisites with the quick check below.
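
The following commands are a minimal sketch for that check; they only print versions and change nothing:

# The NVIDIA driver must be working
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The Docker engine must be installed (19.03+ is recommended for the --gpus flag)
docker --version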
1.1 Uninstall any old version of nvidia-docker (if installed)
sudo apt-get remove nvidia-docker

1.2 Add the nvidia-docker2 repository

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

1.3 Update the package list and install nvidia-docker2

sudo apt-get update
sudo apt-get install nvidia-docker2
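
Installing nvidia-docker2 registers an nvidia runtime with the Docker daemon, normally by writing /etc/docker/daemon.json. After installation the file typically looks roughly like this (exact contents may differ between package versions); the runtime: nvidia key in the docker-compose.yml later in this post relies on this registration:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}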

1.4 Restart the Docker daemon

sudo systemctl restart docker
  2. Test whether the installation succeeded with the following command
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
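
The --gpus flag (available since Docker 19.03) can also expose only part of the machine. A couple of variants using the same test image (if the 10.0-base tag is no longer available on Docker Hub, any current nvidia/cuda tag works for this smoke test):

# Expose only two GPUs
docker run --gpus 2 nvidia/cuda:10.0-base nvidia-smi

# Expose specific GPUs by index
docker run --gpus '"device=0,1"' nvidia/cuda:10.0-base nvidia-smi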
  3. If the installation succeeded, you will see output similar to the following
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:00:1F.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:00:20.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:00:21.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P100-PCIE...  On   | 00000000:00:22.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P100-PCIE...  On   | 00000000:00:23.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P100-PCIE...  On   | 00000000:00:24.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P100-PCIE...  On   | 00000000:00:25.0 Off |                    0 |
| N/A   32C    P0    32W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Deployment

  1. Create a new Dockerfile
FROM nvidia/cuda:11.7.0-cudnn8-runtime-ubuntu18.04

# Install basic build tools and an SSH server
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    wget \
    vim \
    openssh-server 

# Prepare the SSH daemon: runtime directory, root password, and permit root login
RUN mkdir -p /run/sshd
RUN echo "root:password" | chpasswd
RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
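
If you want to sanity-check the image before wiring it into Compose, a minimal sketch using the Docker CLI directly (the names cuda-ssh and cuda-test are only examples):

# Build the image from the Dockerfile above
docker build -t cuda-ssh .

# Run it with all GPUs and the same port mapping the compose file uses
docker run -d --gpus all -p 2201:22 --name cuda-test cuda-ssh

# The GPUs should be visible inside the container
docker exec -ti cuda-test nvidia-smi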
  2. Create a new docker-compose.yml
version: "3"
services:
  cuda-container:
    build: .
    runtime: nvidia
    container_name: cuda-container
    ports:
      - 2201:22
    volumes:
      - ./app:/app
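
The runtime: nvidia key depends on the runtime registered by nvidia-docker2 above. On newer Docker Compose releases that implement the Compose specification, GPUs can instead be requested through device reservations; a hedged alternative sketch of the same service:

version: "3"
services:
  cuda-container:
    build: .
    container_name: cuda-container
    ports:
      - 2201:22
    volumes:
      - ./app:/app
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]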
  3. Start the container with Docker Compose
docker-compose up -d
  4. Verify that the GPUs are accessible inside the container
docker exec -ti cuda-container nvidia-smi
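
Because the container also runs an SSH server mapped to host port 2201, you can log in with the root password set in the Dockerfile and work inside it directly (replace <server-ip> with your host's address; it is only a placeholder):

ssh -p 2201 root@<server-ip>

# Once logged in, the GPUs should again be visible
nvidia-smi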