nvidia-docker多用户共享GPU服务器环境搭建
https://blog.csdn.net/hangvane123/article/details/88639279
-
nvidia驱动
-
docker-ce
-
nvidia-docker
-
pull带有cuda和cudnn的ubuntu镜像
https://hub.docker.com/r/nvidia/cuda
docker pull nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
勿下载runtime版,而是devel,否则很多配置文件找不到。 -
启动容器,进入容器
-
apt-get update
-
apt-get install python3-pip
-
安装与cuda, cudnn版本匹配的tensorflow-gpu
创建容器
#-it交互式终端运行 , 参数/bin/bash启动ubuntu, --name命名容器
# docker run -it -p [host_port]:[container_port](do not use 8888) --name=[container_name] [image_name] -v [host_path]:[container_path] /bin/bash
$ nvidia-docker run -itd --name=dev --shm-size 10g -v /data/mnt:/home -p 8180:8180 -p 8280:8280 -p 8380:8380 -p 8480:8480 -p 9380:22 nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 /bin/bash
# 查看所有容器
$ docker ps -a
进入容器
$: docker exec -it cuda10_0 env LANG=C.UTF-8 /bin/bash
添加已运行容器端口
-
查看容器号 docker ps -a
-
停止容器 docker stop
-
查找容器目录 docker inspect [容器号ID] | grep Id
-
停止docker服务(systemctl stop docker)
-
查找容器路径:
find / -name containers
-
修改这个容器的
/usr/local/docker/containers/...
hostconfig.json文件中的端口:
“PortBindings”:{“3306/tcp”:[{“HostIp”:”“,”HostPort”:”3307”}]} 前者是容器端口, 后者是宿主机端口。 -
修改该容器的config.v2.json文件中的ExposedPorts。
-
启动docker服务(systemctl start docker)
-
启动容器
https://blog.csdn.net/lypeng_/article/details/98176138
https://blog.csdn.net/wesleyflagon/article/details/78961990
修改容器容量
# CID ---> /dev/mapper/docker
# 100 ---> 100G
./script.sh 100`
CID="62f54c85d02ec67b64c1ea15b0c3820edeea6744f7a052f0e795ea127d3fb28e"
SIZE=$1
if [ "$CID" != "" ] && [ "$SIZE" != "" ]; then
DEV=$(basename $(echo /dev/mapper/docker-*-$CID));
dmsetup table $DEV | sed "s/0 [0-9]* thin/0 $(($SIZE*1024*1024*1024/512)) thin/" | dmsetup load $DEV;
dmsetup resume $DEV;
xfs_growfs /dev/mapper/$DEV;
docker start container_name
docker exec -it container_name env LANG=C.UTF-8 /bin/bash
echo "Resize $CID completed."
else
echo "Usage: sh resize_container 459fd505311ad364309940ac24dcdb2bdfc68e3c3b0f291c9153fb54fbd46771 100";
fi
docker授权非root用户
sudo usermod -a -G docker AD\\yckj2939
sudo groupadd docker # 添加docker用户组
sudo gpasswd -a $USER docker # 将登陆用户加入到docker用户组中
newgrp docker # 更新用户组
docker ps # 测试docker命令是否可以使用sudo正常使用
systemctl restart docker # !!!一定重启
容器内中文乱码
$ vim /etc/profile
$ export LANG=C.UTF-8
服务器docker中启可远程notebook
- 启ubuntu容器,开出端口8888
nvidia-docker run -itd --name=cuda10_0 -v /mnt/docker_share:/home/centos -p 8888:8888 nvidia/cuda
- 进入容器安装anaconda
$: docker exec -it cuda10_0 /bin/bash
$: cd ~ | ./Anaconda.sh
- 配置notebook在.jupyter文件夹下
$: jupyter notebook --generate-config # 会自动生成.jupyter文件夹
$: jupyter notebook password
生成密钥: jupyter_notebook_config.json
修改文件: jupyter_notebook_config.py
c.NotebookApp.ip='*'
c.NotebookApp.password = u'生成的密钥' # jupyter_notebook_config.json文件中的password字段
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888 #可指定一个端口, 访问时使用该端口(虽然运行jupyter时可以直接指定端口)
c.NotebookApp.notebook_dir = '/home' # 自定义启动目录
https://blog.csdn.net/qq_42001765/article/details/96144442
- 后台运行
nohup jupyter notebook --ip=0.0.0.0 --no-browser --allow-root --port 8888 > jupyter.log 2>&1 &
- 远程访问
# server_ip:port
10.235.3.43:8888
容器内自动化脚本:
#!/bin/sh
# enter docker
cd ~
jupyter notebook --generate-config
echo 'Please input jupyter notebook password'
jupyter notebook password
echo 'Password input success!'
chmod 777 ~/.jupyter/jupyter_notebook_config.json
PASSWORD=$(cat ~/.jupyter/jupyter_notebook_config.json | grep password | awk -F '"' '{print $4}')
echo "c.NotebookApp.ip='*'" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.password = '$PASSWORD'" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.open_browser = False" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.port = 8480" >> ~/.jupyter/jupyter_notebook_config.py
echo "c.NotebookApp.notebook_dir = '/home'" >> ~/.jupyter/jupyter_notebook_config.py
nohup jupyter notebook --ip=0.0.0.0 --no-browser --allow-root --port 8480 > jupyter.log 2>&1 &
PyCharm + Docker炼丹炉
PyCharm Pro + Nvidia Docker
-
参照上文增加容器22端口,SFTP默认使用22端口。
-
远程服务器参数查看
$ docker port <your container name> 22
# 此操作将查看docker container中端口22,在远程服务器上端口的映射
# 输出结果如下所示
0.0.0.0:9380
# 表明只要ssh链接远程服务器的9380端口,实际是链接docker container中的22端口
ssh root@<服务器的ip地址> -p 9380
# 可以进入容器,passwd命令可以修改容器root密码
- 进入容器配置ssh服务
# 修改root密码
$ passwd
# 安装open-ssh
$ apt-get update
$ apt-get install openssh-server
$ vim /etc/ssh/sshd_config
$ server ssh restart
/etc/ssh/sshd_config修改以下位置:
# Subsystem sftp /usr/libexec/openssh/sftp-server
Subsystem sftp internal-sftp
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys #公钥文件路径(和上面生成的文件同)
PermitRootLogin yes
最常见的问题就是docker容器停了以后里面的SSH服务也会相应停止,记得去docker里重启一下ssh服务:
$ service ssh restart
https://www.cnblogs.com/ruiyang-/p/10158658.html
制作镜像并上传仓库
-
dockerhub上创建仓库
jerrysu666/cuda10.0
-
终端登录:
docker login
-
制作镜像:
docker commit containerId dockerUserName/repoName
-
镜像打标签:
docker tag imageName dockerUserName/repoName[:tag]
-
推送镜像:
docker push dockerUserName/repoNme[:tag]
docker tag local-image:tagname new-repo:tagname
docker push new-repo:tagname
docker push jerrysu666/cuda10.0:tagname
docker-compose && dockerfile
# docker-compose
version: "2.3"
services:
detectron2:
build:
context: .
dockerfile: Dockerfile
args:
USER_ID: ${USER_ID:-1000}
runtime: nvidia
shm_size: "8gb"
ulimits:
memlock: -1
stack: 67108864
ports:
- "8170:8170"
- "8270:8270"
- "8370:8370"
- "8470:8470"
- "8570:22"
volumes:
- /data:/home
environment:
- DISPLAY=$DISPLAY
- NVIDIA_VISIBLE_DEVICES=all
# dockerfile
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
python3-opencv ca-certificates python3-dev git wget sudo && \
rm -rf /var/lib/apt/lists/*
# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser
ENV PATH="/home/appuser/.local/bin:${PATH}"
RUN wget https://bootstrap.pypa.io/get-pip.py && \
python3 get-pip.py --user && \
rm get-pip.py
# install dependencies
# See https://pytorch.org/ for other options if you use a different version of CUDA
RUN pip install --user torch torchvision tensorboard cython -i https://pypi.doubanio.com/simple
RUN pip install --user 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
RUN pip install --user 'git+https://github.com/facebookresearch/fvcore'
# install detectron2
RUN git clone https://github.com/facebookresearch/detectron2 detectron2_repo
ENV FORCE_CUDA="1"
# This will build detectron2 for all common cuda architectures and take a lot more time,
# because inside `docker build`, there is no way to tell which architecture will be used.
ENV TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
RUN pip install --user -e detectron2_repo
# Set a fixed model cache directory.
ENV FVCORE_CACHE="/tmp"
WORKDIR /home/appuser/detectron2_repo