矿卡P106构建深度学习环境

日期 作者
2019-11-29 [email protected]

0. 环境

硬件:i58400 | 16G ddr4-2400 | 华硕P106-6G 矿卡,约等于GTX1060~GTX1070(大约350元,性能更好的P104约800,还有P102等)

OS: xubuntu-18.04

1. P106矿卡安装

设备断电情况下,矿卡插入PC的卡槽内,开机。

在命令行执行(依赖第2步,先配置下仓库)

ubuntu-drivers devices

如果没有任何输出那么说明矿卡硬件没有插好或者硬件本身有问题

正常情况输出样例:(倒数第四行是推荐你安装的驱动)

cxu@cxu-pc:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C07sv00001043sd000085FCbc03sc02i00
vendor   : NVIDIA Corporation
model    : GP106 [P106-100]
driver   : nvidia-driver-430 - distro non-free
driver   : nvidia-driver-410 - third-party free
driver   : nvidia-driver-415 - third-party free
driver   : nvidia-driver-440 - third-party free recommended
driver   : nvidia-driver-390 - third-party free
driver   : nvidia-driver-435 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

2. 矿卡驱动仓库配置

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

配置仓库目的是从命令行安装驱动。

当然也能从https://www.nvidia.com/Download/index.aspx?lang=en-us 下载打包好的官方驱动,搜索框按照下面的填写:产品类型:GeForce|产品系列:GeForce 10 Series|产品家族:GeForce GTX1060|..

我搜出来的是:https://www.nvidia.cn/download/driverResults.aspx/155002/cn

3. 安装矿卡驱动

安装驱动

apt install nvidia-driver-440

还有一种方法是

ubuntu-drivers autoinstall

后面的nvidia-driver-xxx需要看ubuntu-drivers devices命令的输出,里面有可以安装的驱动。此处选择了recommended的,也就是nvidia-driver-440。

测试驱动

nvidia-smi

正常输出如下:

cxu@cxu-pc:~$ nvidia-smi
Fri Nov 29 23:10:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26       Driver Version: 440.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  P106-100            Off  | 00000000:01:00.0 Off |                  N/A |
| 50%   23C    P8     8W / 120W |     57MiB /  6080MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       867      G   /usr/lib/xorg/Xorg                            56MiB |
+-----------------------------------------------------------------------------+

卸载驱动(如果要)

sudo nvidia-uninstall
sudo apt purge "*nvidia*"
# 然后要重启一下,目的是把内核已经加载的驱动模块卸载掉

# ===以下是备注====
# apt clean 删除下载的缓存软件包
# apt remove 删除安装的软件包
# apt purge 删除安装的软件包和相关的配置文件

常见错误

错误1 NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. ———————————————————————————

原因我遇到了2个:

1)硬件没插好,或者也许你买的卡本身就是坏的。

2)安装CUDA时,没看清楚选择了安装display driver, 这个driver和这一步骤安装的驱动产生冲突。判断是否冲突可以运行nvidia-smi, 如果冲突了【NVIDIA-SMI 440.26 Driver Version: 440.26 】,这个地方的版本(440.26)会不一样。

错误2: Failed to initialize NVML: Driver/library version mismatch

——————————————————————————

原因是卸载驱动之后内核里的还没有卸载掉,特别是先安装了高版驱动,然后又安装低版本驱动的情况下。最简单的办法就是重启一下。

https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch

问题3:有的时候再安装CUDA的时候不小心又安装了一次驱动,会造成很奇怪的问题,运行nvidia-smi可能会出现nvidia-smi版本和驱动版本对不上

————————

查看驱动版本 cat /proc/driver/nvidia/version, 看下和自己手工安装的能不能对上,如果对不上就出现驱动版本混乱了。需要重装。

需要用安装cuda时自带的命令 nvidia-uninstall来卸载,他会把内核里的东西全部清除。

卸载cuda运行 cuda/bin里的uninstall-xxx.pl

如果你是从网站下载的驱动包,那么卸载还需要用这个驱动包来卸载 ./xxxxxx-driver.run --uninstall

然后重新按照流程从头安装。

4. 安装CUDA(N卡的并行计算框架)

CUDA介绍

CUDA的介绍 https://blog.csdn.net/u014380165/article/details/77340765

CUDA安装

CUDA的下载地址 https://developer.nvidia.com/cuda-toolkit-archive

我选择安装CUDA 10.0

wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
# 10.0的补丁
wget http://developer.download.nvidia.com/compute/cuda/10.0/Prod/patches/1/cuda_10.0.130.1_linux.run

chmod 755 cuda_10.0.130_410.48_linux
chmod 755 cuda_10.0.130.1_linux.run

sudo service lightdm stop  #这行不能忽略
sudo ./cuda_10.0.130_410.48_linux
sudo ./cuda_10.0.130.1_linux.run
sudo service lightdm start 

在安装的时候一定不要再次安装驱动,看清楚这个选项,否则会造成执行nvidia-smi的时候出现错误。

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: n  #这个地方一定要选n

CUDA配置

CUDA安装之后终端出现提示

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.0
Samples:  Installed in /home/cxu, but missing recommended libraries

Please make sure that
 -   PATH includes /usr/local/cuda-10.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

Logfile is /tmp/cuda_install_1788.log

根据上面输出的提示做一些配置, 打开/etc/profile

export PATH=${PATH}:/usr/local/cuda/bin:
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64

由于我安装cuda时候选择创建了软链接cuda到cuda-10.0因此配置里可以直接用cuda目录。

测试CUDA

命令行测试
cxu@cxu-pc:~$ source /etc/profile
cxu@cxu-pc:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
运行CUDA自带的例子

安装CUDA的时候会问你是否要安装例子,我选了y, 然后就再我家目录下生成了一个目录 NVIDIA_CUDA-10.0_Samples

先编译

cd NVIDIA_CUDA-10.0_Samples
make all -j6
#编译完成生成了一个bin目录
cd bin/x86_64/linux/release/
./deviceQuery  #也可以执行其他的

输出

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "P106-100"
  CUDA Driver Version / Runtime Version          10.2 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6081 MBytes (6375997440 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

常见错误

错误1:It appears that an X server is running. Please exit X before installation.

——————————————————————

解决办法,安装前运行 sudo service lightdm stop

安装之后运行 sudo service lightdm start

https://askubuntu.com/questions/149206/how-to-install-nvidia-run

5. 安装cuDNN

cuDNN介绍

cuDNN(CUDA Deep Neural Network library):是NVIDIA打造的针对深度神经网络的加速库,是一个用于深层神经网络的GPU加速库。如果你要用GPU训练模型,cuDNN不是必须的,但是一般会采用这个加速库。

cuDNN版本选择与安装

cuDNN需要寻找和CUDA版本匹配的进行安装:https://developer.nvidia.com/rdp/cudnn-archive

下载的时候要注册并登陆,下载会有多个选择,只需要安装 cuDNN Library for Linux, 我下载到的文件名字是cudnn-10.0-linux-x64-v7.6.4.38.tgz

解压这个文件内容如下:

cxu@cxu-pc:~$ tree cuda
cuda
├── include
│   └── cudnn.h
├── lib64
│   ├── libcudnn.so -> libcudnn.so.7
│   ├── libcudnn.so.7 -> libcudnn.so.7.6.4
│   ├── libcudnn.so.7.6.4
│   └── libcudnn_static.a
└── NVIDIA_SLA_cuDNN_Support.txt

2 directories, 6 files

然后执行

sudo cp cuda/include/cudnn.h   /usr/local/cuda/include/
sudo cp -d cuda/lib64/lib*     /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h   /usr/local/cuda/lib64/libcudnn*

删除cuDNN(如果要)

如果要删除cuDNN,请执行:

sudo rm /usr/local/cuda/include/cudnn.h
sudo rm -r /usr/local/cuda/lib64/libcudnn*

6. 测试tensorFlow

至此GPU有关环境安装完毕,接下来测试一下tf能不能正常。

virtualenv -p /usr/bin/python3  venv
source venv/bin/activate
pip install tensorflow-gpu

打开ipython/python控制台

>> import tensorflow as tf
>> tf.test.gpu_device_name()

输出(看第7行):

2019-11-29 22:51:19.282039: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.282540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: P106-100 major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
2019-11-29 22:51:19.282569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 22:51:19.282578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 22:51:19.282587: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-11-29 22:51:19.282596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-11-29 22:51:19.282610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-11-29 22:51:19.282619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-11-29 22:51:19.285274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 22:51:19.285360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.285905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.286288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-29 22:51:19.287356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 22:51:19.288043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-29 22:51:19.288056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-29 22:51:19.288061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-29 22:51:19.288716: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.289252: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.289638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 5658 MB memory) -> physical GPU (device: 0, name: P106-100, pci bus id: 0000:01:00.0, compute capability: 6.1)
Out[3]: '/device:GPU:0'

其他问题

  • 安装CUDA时,它自带了一个驱动,一般不用,因为我安装了这个驱动之后执行nvidia-smi没有反应,可能配套的管理命令不如自己安装那么全面。
  • 这里有个问题没有解决:我先安装了nvidia-driver-440, 然后卸载想安装nvidia-driver-410,但每次重启还是440,还没找到原因。
build:   __BUILD_VERSION__