Writing a Dockerfile


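A Dockerfile defines how a custom image is built. It consists of a sequence of instructions, and each instruction maps to one layer of the resulting Docker image; this is the principle of image layering.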

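# Constraints

Ubuntu 20.04 is the recommended operating system for the base image.
Ubuntu 20.04 currently has the best compatibility: the stable releases of most software install reliably on it. Ubuntu 22.04 has the poorest compatibility and often runs into installation failures or compatibility problems.
Choose the Python version to match the operating system version; otherwise a development environment initialized from the image may be missing some functionality.
- Ubuntu 18.04 pairs with Python 3.6
- Ubuntu 20.04 pairs with Python 3.8
- Ubuntu 22.04 pairs with Python 3.10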
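
The pairing above can be captured as a small lookup for use in build scripts. This is only a minimal sketch: the function name `python_for_ubuntu` is hypothetical and is not something the platform provides.

```shell
# Hypothetical helper: print the Python version that pairs with an Ubuntu
# release, following the list above. Returns non-zero for any other release.
python_for_ubuntu() {
    case "$1" in
        18.04) echo "3.6" ;;
        20.04) echo "3.8" ;;
        22.04) echo "3.10" ;;
        *)     echo "no recommended Python for Ubuntu $1" >&2; return 1 ;;
    esac
}

python_for_ubuntu 20.04
```

A build script could call the helper with the release of the chosen base image and stop early when it returns a non-zero status.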

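# Common instructions

When writing a Dockerfile on the platform, these are the instructions you usually need to know.

Note: The platform does not allow you to enter the FROM, EXPOSE, CMD, or ENTRYPOINT instructions in a Dockerfile yourself. Where they are described below, it is only to help you understand Dockerfiles.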

# FROM

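Description: Specifies the base image. If the image does not exist locally, it is pulled from the remote registry.
Example: FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04
Additional notes: On the platform, choosing the base image is a built-in feature, so you do not need FROM in your Dockerfile, and the platform does not accept a Dockerfile that specifies FROM again.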

# ENV

Description: Sets an environment variable that subsequent RUN instructions can use.
Format: ENV [variable] [value]
Example: ENV HOSTNAME server1.example.com

# RUN

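Description: Runs a command in the container and creates a new image layer. It is commonly used to install packages, and each RUN executed during the build adds to the image size.
Example: RUN yum install -y vim

The following installation methods are commonly used in RUN instructions: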

# apt

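Description: apt is the package manager of the Ubuntu operating system. Its source is the Ubuntu repositories, and the packages it installs are system-level packages that are installed system-wide.
Notes:
- apt uses overseas sources by default, which can be unstable. You can add a mirror in mainland China to /etc/apt/sources.list before installing other software with apt.
- Prefer apt where possible: the Ubuntu repositories carry stable releases that are more compatible with other software and packages. If you need a newer package, install it with conda or pip instead.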
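
Switching the apt source is usually a plain text substitution on the source list. The sketch below applies the substitution to a throwaway sample rather than to the real /etc/apt/sources.list, so it can be tried outside a container. The two `deb` lines are illustrative, and mirrors.aliyun.com is the mirror host used in the image slimming section of this document.

```shell
# Work on a throwaway sample so the sketch runs outside a container.
# The two deb lines are illustrative, not copied from a real image.
sample=$(mktemp)
cat > "$sample" <<'EOF'
deb http://archive.ubuntu.com/ubuntu focal main restricted
deb http://security.ubuntu.com/ubuntu focal-security main restricted
EOF

# Rewrite both Ubuntu hosts to the mirror host. In a Dockerfile the same
# expression would be applied in place to /etc/apt/sources.list with sed -i.
mirrored=$(sed 's/archive\.ubuntu\.com/mirrors.aliyun.com/g; s/security\.ubuntu\.com/mirrors.aliyun.com/g' "$sample")
echo "$mirrored"
rm -f "$sample"
```

Run `apt-get update` after the substitution so that the package index is fetched from the new source.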

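Related commands

| Command | Description |
| --- | --- |
| `apt-get update` | Refreshes the apt package index. |
| `apt-get upgrade` | Upgrades installed packages. |
| `apt-get install package_name` | Installs a package. |
| `apt-get -y install package_name` | Installs a package and answers yes to the confirmation prompts. |
| `apt-get install <package_1> <package_2> <package_3>` | Installs several packages at once. |
| `apt-get clean && rm -rf /var/lib/apt/lists/*` | Removes downloaded package files and index lists to reduce the image size. |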

# pip

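Description: pip is the package manager for Python. Its source is PyPI, which carries a wider range of packages and offers more versions of the same package to download.
Mirrors

pip downloads from an overseas server by default, which can be unstable. You can use a mirror in mainland China instead. Commonly used mirrors:
- Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/
- Douban: http://pypi.douban.com/simple/
- Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/
- University of Science and Technology of China (USTC): https://pypi.mirrors.ustc.edu.cn/simple/
Switching mirrors
- To switch globally and permanently, add pip config set global.index-url https://mirror.baidu.com/pypi/simple to the Dockerfile before any pip installation.
- To switch temporarily for individual packages, pass the mirror to pip install with the -i option in the Dockerfile, for example: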
```
pip install --no-cache-dir xxx -i https://mirror.baidu.com/pypi/simple --trusted-host mirror.baidu.com --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple/
```
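  The following two options in the command above are optional and can be left out.
  - --trusted-host marks the mirror host as trusted.
  - --extra-index-url names an additional index that pip searches alongside the one given by -i, so packages can still be found when the -i mirror has problems.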
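
The global setting written by `pip config set` ends up in a pip configuration file. As a rough sketch, the snippet below writes an equivalent file by hand into a temporary directory (a hypothetical location chosen for illustration) and points pip at it through the `PIP_CONFIG_FILE` environment variable, which pip reads as an additional configuration file.

```shell
# Write a pip configuration file equivalent to
#   pip config set global.index-url https://mirror.baidu.com/pypi/simple
# into a temporary directory (hypothetical location, for illustration only).
conf_dir=$(mktemp -d)
cat > "$conf_dir/pip.conf" <<'EOF'
[global]
index-url = https://mirror.baidu.com/pypi/simple
EOF

# PIP_CONFIG_FILE names an extra configuration file for pip to load.
export PIP_CONFIG_FILE="$conf_dir/pip.conf"
cat "$PIP_CONFIG_FILE"
```

Keeping the mirror in a configuration file means later `pip install` commands do not each need their own `-i` option.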
Official overseas source
To install from the official overseas source, you can raise the timeout with --default-timeout, for example pip install xxx --default-timeout=120.

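Common commands

| Command | Description |
| --- | --- |
| `pip install tensorflow` | Installs a package; here, the latest version of the tensorflow framework. |
| `pip install tensorflow==2.11.0` | Installs a specific version. |
| `pip install -U tensorflow` | Upgrades to the latest version. |
| `pip install tensorflow -i https://pypi.tuna.tsinghua.edu.cn/simple` | Installs from the specified source. |
| `pip download tensorflow` | Downloads a package without installing it. |
| `pip uninstall tensorflow` | Uninstalls a package. |
| `pip show tensorflow` | Shows information about an installed package. |
| `pip install --upgrade --no-deps --force-reinstall <packagename>` | Reinstalls a package without reinstalling its dependencies. |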

# Tips

Use && to merge several commands into one, for example:

apt-get install -y python3 python3-pip && ln -s /usr/bin/python3.8 /usr/bin/python && ln -s /usr/bin/pip3 /usr/bin/pip3.8

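End a line with && \ to merge several commands into one across multiple lines. Each command runs only after the previous one has finished successfully, for example: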

pip install --no-cache-dir tensorflow==2.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip install --no-cache-dir tensorboard jupyterlab -i https://pypi.tuna.tsinghua.edu.cn/simple
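
The behaviour of a `&&` chain can be checked in any shell. In this minimal sketch the three `step_*` functions are hypothetical stand-ins for real install commands; the chain stops at the first command that fails, which is why a failing step in a RUN instruction aborts the rest of that instruction.

```shell
# Hypothetical stand-ins for install commands; each one records that it ran.
log=""
step_ok()    { log="${log}ok ";    return 0; }
step_fail()  { log="${log}fail ";  return 1; }
step_never() { log="${log}never "; return 0; }

# The chain stops at the first failing command, so step_never is not executed.
if step_ok && step_fail && \
   step_never; then
    result="success"
else
    result="failure"
fi

echo "result=$result log=$log"
```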

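# Other instructions

The remaining instructions are not needed when editing a Dockerfile on the platform, but knowing them helps you understand how an image is built.

| Instruction | Description | Format / example |
| --- | --- | --- |
| MAINTAINER | Sets the image author, for example the user's email address. | - |
| COPY | Copies files or directories from the local host to the destination. The source must be in the same directory as the Dockerfile. | `COPY srcPath destPath` or `COPY ["src", "dest"]` |
| ADD | Copies the source to the destination. The source must be in the same directory as the Dockerfile or be a URL. A local tar archive is unpacked automatically; a file fetched from a URL is not. | `ADD html.tar /var/www` or `ADD http://ip/html.tar /var/www` |
| EXPOSE | If the container runs an application service, EXPOSE declares the port that the service exposes. | `EXPOSE 80` |
| VOLUME | Creates a mount point in the container. Put simply, it is the counterpart of -v: it designates a directory of the image to be mounted on the host. The platform already sets this by default, so you do not need to handle it. | `VOLUME ["/var/www/html"]` |
| WORKDIR | Sets the working directory for the RUN, CMD, and ENTRYPOINT instructions that follow. The directory is created if it does not exist. | `WORKDIR /opt` |
| CMD | Specifies the command or script to run when the container starts. If several CMD instructions are given, only the last one takes effect. A command such as /bin/bash passed when starting the container also counts as a CMD and overrides the CMD defined in the image. | `CMD ["executable", "param1", "param2"]` |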

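# Image slimming

When writing a Dockerfile, the following methods help you keep the image small while still achieving the purpose of your build.

Merging instructions: Merging instructions is the simplest and most convenient way to reduce the number of image layers. It saves space because caches and build tools can be cleaned up in the same layer that created them.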

RUN sed -i "s/archive.ubuntu.com/mirrors.aliyun.com/g; s/security.ubuntu.com/mirrors.aliyun.com/g" /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y curl make gcc && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

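Removing cache files left by RUN

Debian-based images

 # Switch to a mirror in mainland China and refresh the package index
 sed -i "s/deb.debian.org/mirrors.aliyun.com/g" /etc/apt/sources.list && apt update
 # --no-install-recommends skips recommended packages that are not strictly required
 apt install -y --no-install-recommends a b c && rm -rf /var/lib/apt/lists/*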

Alpine images

 # Switch to a mirror in mainland China
 sed -i 's/dl-cdn.alpinelinux.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apk/repositories
 # --no-cache installs without keeping a local package cache
 apk add --no-cache a b c && rm -rf /var/cache/apk/*

CentOS images

 # Switch to a mirror in mainland China and rebuild the metadata cache
 curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo && yum makecache
 yum install -y a b c && yum clean all

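Installing with pip without a cache
By default, pip install creates a pip cache directory at ~/.cache/pip under the current user's home directory. Adding --no-cache-dir to pip install ... is recommended, for example: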
```
pip install --no-cache-dir tensorboard jupyterlab -i https://pypi.mirrors.ustc.edu.cn/simple/
```

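Building in stages
Build the image in stages that follow the goals of each development phase. In AI development, for example:

- Environment debugging stage: install basic tools such as tmux, screen, vim, wget, and curl.
- Development stage: install the libraries the project depends on, such as matplotlib, pandas, seaborn, keras-tuner, and pillow.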

FROM nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04

# Install the tools the project needs
RUN gpg --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC 2>&1 > /dev/null && \
    gpg --export --armor A4B469963BF863CC | apt-key add - 2>&1 > /dev/null && apt-get update && \
    apt-get install -y tmux screen vim wget curl net-tools apt-utils unzip zip git openssl libaio1 iputils-ping openssh-server libssl-dev openssl make gcc libffi-dev zlib1g-dev libbz2-dev zlibc software-properties-common && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install the libraries the project depends on
RUN pip install --no-cache-dir tensorflow==2.7.0 protobuf==3.19.0 matplotlib pandas seaborn keras-tuner pillow tqdm einops scikit-learn fluidsynth pretty-midi gym tf-models-official opencv-python pyfluidsynth wurlitzer pyglet pyvirtualdisplay pydot -i https://pypi.mirrors.ustc.edu.cn/simple/ && \
 pip install --no-cache-dir tensorflow==2.7.0 protobuf==3.19.0 imageio matplotlib tensorflow_text==2.7.0 -i https://pypi.mirrors.ustc.edu.cn/simple/

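# Notes on installing common software

cuda
The official images come with a suitable CUDA version preinstalled in line with the official compatibility notes, so you do not need to install it again. If you must reinstall it, choose a CUDA version that follows the official compatibility notes; otherwise the environment may behave unpredictably or become unusable.

nccl
All official images already include nccl, so you do not need to install it again.

numpy
If the build log reports a version incompatibility, follow the error message and install the required version with pip install numpy==xxx.

imageio
When installing imageio, make sure the framework version already in the image (torch or, as in the command below, tensorflow) is not upgraded as a side effect. Pinning the framework version in the same pip command achieves this.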

pip install --no-cache-dir tensorflow==2.4.3 typing-extensions==3.7.4 async-lru==1.0.3 imageio matplotlib tensorflow_text==2.4.3

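matplotlib
When installing matplotlib, make sure the framework version already in the image (torch or, as in the command below, tensorflow) is not upgraded as a side effect. Pinning the framework version in the same pip command achieves this.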

pip install --no-cache-dir tensorflow==2.4.3 typing-extensions==3.7.4 async-lru==1.0.3 imageio matplotlib tensorflow_text==2.4.3

deepspeed

deepspeed is used together with triton==1.0.0. The command below pins triton to version 1.0.0.

Note: triton 1.0.0 cannot be installed on Python 3.10; downgrade Python to 3.9.
```
pip install --no-cache-dir triton==1.0.0 && pip install --no-cache-dir transformers[deepspeed]
```
Installing deepspeed depends on the libaio package, and the installation fails without it. Install libaio beforehand with the following command.
```
apt-get update && apt-get install -y libaio1 libaio-dev
```
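
The Python version constraint for triton 1.0.0 can be checked before the pin is attempted. The guard below is a minimal sketch: the function name `python_ok_for_triton_1_0` is hypothetical, and it encodes only the constraint stated in the note above (Python 3.10 does not work, Python 3.9 does); it makes no claim about a lower bound.

```shell
# Hypothetical guard: succeed only for a Python 3.x version below 3.10,
# matching the note that triton 1.0.0 cannot be installed on Python 3.10.
# Expects a version string with at least major.minor, such as 3.9 or 3.9.18.
python_ok_for_triton_1_0() {
    major=${1%%.*}      # text before the first dot, e.g. "3"
    rest=${1#*.}        # text after the first dot, e.g. "9.18"
    minor=${rest%%.*}   # text before the next dot, e.g. "9"
    [ "$major" -eq 3 ] && [ "$minor" -lt 10 ]
}

if python_ok_for_triton_1_0 "3.9.18"; then
    echo "3.9.18: triton==1.0.0 can be pinned"
fi
if ! python_ok_for_triton_1_0 "3.10.12"; then
    echo "3.10.12: switch to Python 3.9 first"
fi
```

In a build, the version string could come from `python3 -c 'import platform; print(platform.python_version())'` instead of a literal.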

apex
When apex is installed on cuda11.8+torch2.0.1, an error reports that torch requires cuda11.7. In that case, switch the official image to the cuda11.7+torch2.0.1 one.

# Dockerfile examples

The examples below show Dockerfiles as they are used on this platform, so they do not contain FROM and similar instructions.
Example 1

RUN gpg --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC 2>&1 > /dev/null && \
    gpg --export --armor A4B469963BF863CC | apt-key add - 2>&1 > /dev/null && apt-get update && \
    apt-get install -y tmux screen vim wget curl net-tools apt-utils unzip zip git openssl libaio1 iputils-ping openssh-server libssl-dev openssl make gcc libffi-dev zlib1g-dev libbz2-dev zlibc software-properties-common && \
    apt-get install -y xvfb python-opengl cmake g++ lsb-core build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev libjpeg-dev build-essential liblzma-dev libsqlite3-dev pkg-config libnuma-dev libgl1-mesa-glx xvfb python-opengl && \
    apt-get install -y python3 python3-pip && ln -s /usr/bin/python3.8 /usr/bin/python && ln -s /usr/bin/pip3 /usr/bin/pip3.8 && pip3 install --no-cache-dir --upgrade pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html && \
    pip install --no-cache-dir jupyterlab tensorboard -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install --no-cache-dir cmake lit utils matplotlib timm librosa resampy pandas boto3 botocore gymnasium -i https://pypi.tuna.tsinghua.edu.cn/simple && \
    pip install --no-cache-dir opencv-python tabulate lltm_cpp ray gym multiprocess datasets torchx torchviz -i https://pypi.tuna.tsinghua.edu.cn/simple && \
    pip install --no-cache-dir torchtext==0.9 -i https://pypi.tuna.tsinghua.edu.cn/simple

Example 2

RUN gpg --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC 2>&1 > /dev/null && \
    gpg --export --armor A4B469963BF863CC | apt-key add - 2>&1 > /dev/null && apt-get update && \
    apt-get install -y tmux screen vim wget curl net-tools apt-utils unzip zip git openssl libaio1 iputils-ping openssh-server libssl-dev openssl make gcc libffi-dev zlib1g-dev libbz2-dev zlibc software-properties-common && \
    apt-get install -y xvfb python-opengl cmake g++ lsb-core build-essential zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev wget libbz2-dev libjpeg-dev build-essential liblzma-dev libsqlite3-dev pkg-config libnuma-dev libgl1-mesa-glx xvfb python-opengl && \
    apt-get install -y python3 python3-pip && ln -s /usr/bin/python3.8 /usr/bin/python && ln -s /usr/bin/pip3 /usr/bin/pip3.8 && pip3 install --no-cache-dir --upgrade pip && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir tensorflow==2.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple && \
    pip install --no-cache-dir tensorboard jupyterlab -i https://pypi.tuna.tsinghua.edu.cn/simple
RUN pip install --no-cache-dir tensorflow==2.9.0 matplotlib pandas seaborn keras-tuner pillow tqdm einops scikit-learn fluidsynth pretty-midi gym tf-models-official opencv-python einops pyfluidsynth wurlitzer pyglet pyvirtualdisplay pydot  -i https://pypi.tuna.tsinghua.edu.cn/simple && \
    pip install --no-cache-dir imageio matplotlib tensorflow_text==2.9.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
ENV LD_LIBRARY_PATH=/usr/local/openmpi/lib/:${LD_LIBRARY_PATH} PATH=$PATH:/usr/local/openmpi/bin/:/usr/local/openmpi/lib/
RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.gz && tar xf *.tar.gz && cd openmpi-4.1.4 && \
    ./configure --prefix=/usr/local/openmpi && make -j all && make install && rm -rf /openmpi-4.1.4.tar.gz
RUN pip install --no-cache-dir cmake -i https://pypi.tuna.tsinghua.edu.cn/simple && \
    HOROVOD_NCCL_INCLUDE=/usr/include/ \
    HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu/ \
    HOROVOD_NCCL_LINK=SHARED \
    HOROVOD_GPU_OPERATIONS=NCCL \
    pip install --no-cache-dir horovod -i https://pypi.tuna.tsinghua.edu.cn/simple
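
Whether pip sees build variables such as HOROVOD_GPU_OPERATIONS depends on how they are set in the shell. A plain `VAR=value && command` assignment stays local to the shell and is not passed to the child process, while a `VAR=value command` prefix places the variable in the child's environment. The sketch below shows the difference with a hypothetical variable `DEMO_VAR` and a child `sh` process standing in for pip.

```shell
# DEMO_VAR is a hypothetical variable used only for this illustration.
unset DEMO_VAR

# Plain assignment followed by &&: the variable is a shell variable only,
# so the child process does not inherit it.
DEMO_VAR=NCCL && plain=$(sh -c 'echo "${DEMO_VAR:-unset}"')

unset DEMO_VAR

# Assignment as a command prefix: the variable is put into the environment
# of that one command.
prefixed=$(DEMO_VAR=NCCL sh -c 'echo "${DEMO_VAR:-unset}"')

echo "plain=$plain prefixed=$prefixed"
```

Using `export` before the command or an ENV instruction in the Dockerfile are other ways to make the variables visible to the build.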
