nvidia-docker多用户共享GPU服务器环境搭建

https://blog.csdn.net/hangvane123/article/details/88639279

  1. nvidia驱动

  2. docker-ce

  3. nvidia-docker

  4. pull带有cuda和cudnn的ubuntu镜像

    https://hub.docker.com/r/nvidia/cuda

    docker pull nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04勿下载runtime版,而是devel,否则很多配置文件找不到。

  5. 启动容器,进入容器

  6. apt-get update

  7. apt-get install python3-pip

  8. 安装与cuda, cudnn版本匹配的tensorflow-gpu

创建容器

#-it交互式终端运行 , 参数/bin/bash启动ubuntu, --name命名容器
# docker run -it -p [host_port]:[container_port](do not use 8888) --name=[container_name] [image_name] -v [host_path]:[container_path] /bin/bash

$ nvidia-docker run -itd --name=dev --shm-size 10g -v /data/mnt:/home -p 8180:8180 -p 8280:8280 -p 8380:8380 -p 8480:8480 -p 9380:22  nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 /bin/bash

# 查看所有容器
$ docker ps -a

进入容器

$: docker exec -it cuda10_0 env LANG=C.UTF-8 /bin/bash

添加已运行容器端口

  • 查看容器号 docker ps -a

  • 停止容器 docker stop

  • 查找容器目录 docker inspect [容器号ID] | grep Id

  • 停止docker服务(systemctl stop docker)

  • 查找容器路径:find / -name containers

  • 修改这个容器的/usr/local/docker/containers/...hostconfig.json文件中的端口:
    “PortBindings”:{“3306/tcp”:[{“HostIp”:”“,”HostPort”:”3307”}]} 前者是容器端口, 后者是宿主机端口。

  • 修改该容器的config.v2.json文件中的ExposedPorts。

  • 启动docker服务(systemctl start docker)

  • 启动容器

https://blog.csdn.net/lypeng_/article/details/98176138
https://blog.csdn.net/wesleyflagon/article/details/78961990

修改容器容量

# CID ---> /dev/mapper/docker
# 100 ---> 100G
./script.sh 100`
CID="62f54c85d02ec67b64c1ea15b0c3820edeea6744f7a052f0e795ea127d3fb28e"

SIZE=$1

if [ "$CID" != "" ] && [ "$SIZE" != "" ]; then
    DEV=$(basename $(echo /dev/mapper/docker-*-$CID));
    dmsetup table $DEV | sed "s/0 [0-9]* thin/0 $(($SIZE*1024*1024*1024/512)) thin/" | dmsetup load $DEV;
    dmsetup resume $DEV;
    xfs_growfs /dev/mapper/$DEV;
    docker start container_name
    docker exec -it container_name env LANG=C.UTF-8 /bin/bash
  echo "Resize $CID completed."
else
    echo "Usage: sh resize_container 459fd505311ad364309940ac24dcdb2bdfc68e3c3b0f291c9153fb54fbd46771 100";
fi

docker授权非root用户

sudo usermod -a -G docker AD\\yckj2939

sudo groupadd docker          # 添加docker用户组
sudo gpasswd -a $USER docker  # 将登陆用户加入到docker用户组中
newgrp docker                 # 更新用户组
docker ps                     # 测试docker命令是否可以使用sudo正常使用

systemctl restart docker      # !!!一定重启

容器内中文乱码

$ vim /etc/profile
$ export LANG=C.UTF-8

服务器docker中启可远程notebook

  • 启ubuntu容器,开出端口8888

nvidia-docker run -itd --name=cuda10_0 -v /mnt/docker_share:/home/centos -p 8888:8888 nvidia/cuda

  • 进入容器安装anaconda
$: docker exec -it cuda10_0 /bin/bash

$: cd ~ | ./Anaconda.sh
  • 配置notebook在.jupyter文件夹下
$: jupyter notebook --generate-config # 会自动生成.jupyter文件夹

$: jupyter notebook password

生成密钥: jupyter_notebook_config.json

修改文件: jupyter_notebook_config.py

c.NotebookApp.ip='*'
c.NotebookApp.password = u'生成的密钥' #  jupyter_notebook_config.json文件中的password字段
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888 #可指定一个端口, 访问时使用该端口(虽然运行jupyter时可以直接指定端口)
c.NotebookApp.notebook_dir = '/home'  # 自定义启动目录

https://blog.csdn.net/qq_42001765/article/details/96144442

  • 后台运行

nohup jupyter notebook --ip=0.0.0.0 --no-browser --allow-root --port 8888 > jupyter.log 2>&1 &

  • 远程访问
# server_ip:port
10.235.3.43:8888

容器内自动化脚本:

#!/bin/sh

# enter docker
cd ~
jupyter notebook --generate-config
echo 'Please input jupyter notebook password'
jupyter notebook password
echo 'Password input success!'

chmod 777 ~/.jupyter/jupyter_notebook_config.json
PASSWORD=$(cat ~/.jupyter/jupyter_notebook_config.json | grep password | awk -F '"' '{print $4}')
echo "c.NotebookApp.ip='*'" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.password = '$PASSWORD'" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.open_browser = False" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.port = 8480" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.notebook_dir = '/home'" >> ~/.jupyter/jupyter_notebook_config.py

nohup jupyter notebook --ip=0.0.0.0 --no-browser --allow-root --port 8480 > jupyter.log 2>&1 &

PyCharm + Docker炼丹炉

PyCharm Pro + Nvidia Docker

  • 参照上文增加容器22端口,SFTP默认使用22端口。

  • 远程服务器参数查看

$ docker port <your container name> 22
# 此操作将查看docker container中端口22,在远程服务器上端口的映射
# 输出结果如下所示
0.0.0.0:9380
# 表明只要ssh链接远程服务器的9380端口,实际是链接docker container中的22端口

ssh root@<服务器的ip地址> -p 9380
# 可以进入容器,passwd命令可以修改容器root密码
  • 进入容器配置ssh服务
# 修改root密码
$ passwd

# 安装open-ssh
$ apt-get update
$ apt-get install openssh-server

$ vim  /etc/ssh/sshd_config

$ server ssh restart
/etc/ssh/sshd_config修改以下位置:
# Subsystem  sftp  /usr/libexec/openssh/sftp-server
Subsystem  sftp  internal-sftp

PubkeyAuthentication yes 

AuthorizedKeysFile .ssh/authorized_keys #公钥文件路径(和上面生成的文件同)

PermitRootLogin yes

最常见的问题就是docker容器停了以后里面的SSH服务也会相应停止,记得去docker里重启一下ssh服务:

$ service ssh restart

https://www.cnblogs.com/ruiyang-/p/10158658.html

制作镜像并上传仓库

  • dockerhub上创建仓库jerrysu666/cuda10.0

  • 终端登录:docker login

  • 制作镜像:docker commit containerId dockerUserName/repoName

  • 镜像打标签:docker tag imageName dockerUserName/repoName[:tag]

  • 推送镜像: docker push dockerUserName/repoNme[:tag]

docker tag local-image:tagname new-repo:tagname
docker push new-repo:tagname

docker push jerrysu666/cuda10.0:tagname

docker-compose && dockerfile

# docker-compose
version: "2.3"
services:
  detectron2:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        USER_ID: ${USER_ID:-1000}
    runtime: nvidia
    shm_size: "8gb"
    ulimits:
      memlock: -1
      stack: 67108864
    ports:
      - "8170:8170"
      - "8270:8270"
      - "8370:8370"
      - "8470:8470"
      - "8570:22"
    volumes:
      - /data:/home
    environment:
      - DISPLAY=$DISPLAY
      - NVIDIA_VISIBLE_DEVICES=all
# dockerfile

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
        python3-opencv ca-certificates python3-dev git wget sudo && \
  rm -rf /var/lib/apt/lists/*

# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system  --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser

ENV PATH="/home/appuser/.local/bin:${PATH}"
RUN wget https://bootstrap.pypa.io/get-pip.py && \
        python3 get-pip.py --user && \
        rm get-pip.py

# install dependencies
# See https://pytorch.org/ for other options if you use a different version of CUDA
RUN pip install --user torch torchvision tensorboard cython -i https://pypi.doubanio.com/simple
RUN pip install --user 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

RUN pip install --user 'git+https://github.com/facebookresearch/fvcore'
# install detectron2
RUN git clone https://github.com/facebookresearch/detectron2 detectron2_repo
ENV FORCE_CUDA="1"
# This will build detectron2 for all common cuda architectures and take a lot more time,
# because inside `docker build`, there is no way to tell which architecture will be used.
ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
RUN pip install --user -e detectron2_repo

# Set a fixed model cache directory.
ENV FVCORE_CACHE="/tmp"
WORKDIR /home/appuser/detectron2_repo

Share on: TwitterFacebookEmail

Comments


Related Posts


Published

Category

Tools

Tags

Contact