Date | Author | |
---|---|---|
2019-11-29 | [email protected] | |
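0. Environment
Hardware: i5-8400 | 16 GB DDR4-2400 | ASUS P106 6 GB mining card, roughly equivalent to a GTX 1060 to GTX 1070 (about 350 yuan; the faster P104 costs about 800 yuan, and there is also the P102, among others)
OS: Xubuntu 18.04
1. Installing the P106 mining card
With the machine powered off, insert the card into a slot on the motherboard, then boot.
Run the following on the command line (it depends on step 2, so configure the repository first):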
ubuntu-drivers devices
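If there is no output at all, the card is not seated properly or the hardware itself is faulty.
Sample of normal output (the fourth line from the bottom is the driver recommended for installation):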
cxu@cxu-pc:~$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C07sv00001043sd000085FCbc03sc02i00
vendor : NVIDIA Corporation
model : GP106 [P106-100]
driver : nvidia-driver-430 - distro non-free
driver : nvidia-driver-410 - third-party free
driver : nvidia-driver-415 - third-party free
driver : nvidia-driver-440 - third-party free recommended
driver : nvidia-driver-390 - third-party free
driver : nvidia-driver-435 - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
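The recommended driver can also be picked out of that output programmatically. This is a minimal sketch that assumes the output format shown above; `recommended_driver` is a hypothetical helper name, not part of ubuntu-drivers, and SAMPLE is a shortened copy of the listing.

```python
import re

# Shortened copy of the `ubuntu-drivers devices` output shown above.
SAMPLE = """\
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
vendor : NVIDIA Corporation
model : GP106 [P106-100]
driver : nvidia-driver-430 - distro non-free
driver : nvidia-driver-440 - third-party free recommended
driver : xserver-xorg-video-nouveau - distro free builtin
"""


def recommended_driver(text):
    """Return the package on the 'driver' line tagged 'recommended', or None."""
    for line in text.splitlines():
        match = re.match(r"\s*driver\s*:\s*(\S+)\s+-\s+(.*)", line)
        if match and "recommended" in match.group(2).split():
            return match.group(1)
    return None


print(recommended_driver(SAMPLE))
```

On a real machine the text could come from `subprocess.run(["ubuntu-drivers", "devices"], capture_output=True, text=True).stdout`.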
2. Configuring the driver repository
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
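The repository is configured so that the driver can be installed from the command line.
The packaged official driver can also be downloaded from https://www.nvidia.com/Download/index.aspx?lang=en-us (fill in the search form as follows: Product Type: GeForce | Product Series: GeForce 10 Series | Product: GeForce GTX 1060 | ..)
My search returned: https://www.nvidia.cn/download/driverResults.aspx/155002/cn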
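3. Installing the driver
Install the driver
sudo apt install nvidia-driver-440
Alternatively:
sudo ubuntu-drivers autoinstall
The nvidia-driver-xxx package name comes from the output of ubuntu-drivers devices, which lists the drivers available for installation. Here I chose the recommended one, nvidia-driver-440.
Test the driver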
nvidia-smi
Normal output:
cxu@cxu-pc:~$ nvidia-smi
Fri Nov 29 23:10:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.26 Driver Version: 440.26 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 P106-100 Off | 00000000:01:00.0 Off | N/A |
| 50% 23C P8 8W / 120W | 57MiB / 6080MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 867 G /usr/lib/xorg/Xorg 56MiB |
+-----------------------------------------------------------------------------+
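For scripted checks, nvidia-smi can print selected fields as CSV instead of the table above, via `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`. The sketch below parses one such line. The sample line is an illustration assembled from the values in the table, not captured from a real run, and `parse_gpu_query` is a hypothetical helper name.

```python
import csv
import io

# Illustrative line in the shape printed by:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# The values are copied from the table above, not captured from a real run.
SAMPLE = "P106-100, 440.26, 6080 MiB\n"


def parse_gpu_query(text):
    """Turn each CSV row into a dict keyed by the queried field names."""
    fields = ("name", "driver_version", "memory_total")
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(fields, (cell.strip() for cell in row)))
        for row in reader
        if row  # skip blank lines
    ]


gpus = parse_gpu_query(SAMPLE)
print(gpus)
```

An empty list would correspond to the case where no GPU is reported.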
Uninstall the driver (if needed)
sudo nvidia-uninstall
sudo apt purge "*nvidia*"
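# Then reboot, so that the driver module already loaded in the kernel is unloaded
# === Notes ===
# apt clean removes the cached package downloads
# apt remove removes installed packages
# apt purge removes installed packages along with their configuration files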
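Common errors
Error 1: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
——————————————————————————
I ran into two causes:
1) The card is not seated properly, or the card you bought is simply defective.
2) While installing CUDA I did not read carefully and chose to install its display driver, which conflicts with the driver installed in this step. To check for a conflict, run nvidia-smi: if there is one, the two versions shown in [NVIDIA-SMI 440.26 Driver Version: 440.26] (440.26 here) will differ.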
Error 2: Failed to initialize NVML: Driver/library version mismatch
——————————————————————————
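The cause is that the kernel module is still loaded after the driver has been uninstalled, especially when a newer driver was installed first and an older one afterwards. The simplest fix is to reboot.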
https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch
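Problem 3: Sometimes the driver gets installed a second time by accident while installing CUDA, which causes odd problems; for example, nvidia-smi may report an NVIDIA-SMI version that does not match the driver version.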
————————
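Check the driver version with cat /proc/driver/nvidia/version and compare it with the one you installed by hand. If they do not match, the driver versions are mixed up and a reinstall is needed.
Uninstall with nvidia-uninstall, the command that the CUDA installer's bundled driver provides; it also removes everything from the kernel.
To uninstall CUDA, run uninstall-xxx.pl in cuda/bin.
If you installed the driver from a package downloaded from the website, uninstall it with that same package: ./xxxxxx-driver.run --uninstall
Then reinstall from the beginning, following the steps above.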
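The version comparison described above can be sketched in Python. This is only a sketch: the helper names are hypothetical, and PROC_SAMPLE is an assumed shape of /proc/driver/nvidia/version with the trailing build date left out, so the patterns may need adjusting on a real system.

```python
import re

# Assumed shape of /proc/driver/nvidia/version (build date omitted).
PROC_SAMPLE = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.26\n"
# Header line copied from the nvidia-smi output shown earlier.
SMI_SAMPLE = "| NVIDIA-SMI 440.26 Driver Version: 440.26 CUDA Version: 10.2 |"


def kernel_module_version(proc_text):
    """Version of the loaded kernel module, or None if not found."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def smi_versions(smi_text):
    """(nvidia-smi version, driver version) from the header, or None."""
    match = re.search(r"NVIDIA-SMI\s+(\S+)\s+Driver Version:\s+(\S+)", smi_text)
    return (match.group(1), match.group(2)) if match else None


def versions_consistent(proc_text, smi_text):
    """True only when the kernel module and both nvidia-smi versions agree."""
    module = kernel_module_version(proc_text)
    smi = smi_versions(smi_text)
    if module is None or smi is None:
        return False
    return module == smi[0] == smi[1]


print(versions_consistent(PROC_SAMPLE, SMI_SAMPLE))
```

A False result corresponds to the mixed-up state described above, where a reinstall is the suggested remedy.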
4. Installing CUDA (NVIDIA's parallel computing framework)
About CUDA
An introduction to CUDA: https://blog.csdn.net/u014380165/article/details/77340765
Installing CUDA
CUDA download page: https://developer.nvidia.com/cuda-toolkit-archive
I chose to install CUDA 10.0.
wget https://developer.nvidia.com/compute/cuda/10.0/Prod/local_installers/cuda_10.0.130_410.48_linux
# Patch for 10.0
wget http://developer.download.nvidia.com/compute/cuda/10.0/Prod/patches/1/cuda_10.0.130.1_linux.run
chmod 755 cuda_10.0.130_410.48_linux
chmod 755 cuda_10.0.130.1_linux.run
sudo service lightdm stop # do not skip this line
sudo ./cuda_10.0.130_410.48_linux
sudo ./cuda_10.0.130.1_linux.run
sudo service lightdm start
Do not install the driver again during this installation. Watch for the prompt below; otherwise nvidia-smi will report errors afterwards.
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
(y)es/(n)o/(q)uit: n # be sure to answer n here
Configuring CUDA
After CUDA is installed, the terminal prints this summary:
===========
= Summary =
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-10.0
Samples: Installed in /home/cxu, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /tmp/cuda_install_1788.log
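Following the hints in that summary, open /etc/profile and add:
export PATH=${PATH}:/usr/local/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/lib64
Because I chose to create the symlink from cuda to cuda-10.0 when installing CUDA, the configuration can use the cuda directory directly.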
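Whether the two variables actually contain the CUDA directories can be checked with a few lines of Python. This is a minimal sketch; `missing_cuda_paths` is a hypothetical helper, and the default `/usr/local/cuda` assumes the symlink mentioned above exists.

```python
import os


def missing_cuda_paths(env, cuda_home="/usr/local/cuda"):
    """Return the CUDA directories absent from PATH / LD_LIBRARY_PATH."""
    wanted = {
        "PATH": f"{cuda_home}/bin",
        "LD_LIBRARY_PATH": f"{cuda_home}/lib64",
    }
    missing = {}
    for var, directory in wanted.items():
        # Split on ':' and ignore empty entries and trailing slashes.
        entries = [e.rstrip("/") for e in env.get(var, "").split(":") if e]
        if directory not in entries:
            missing[var] = directory
    return missing


# Check the environment of the current process.
print(missing_cuda_paths(dict(os.environ)))
```

An empty dict means both directories are present; after editing /etc/profile, a new login shell (or `source /etc/profile`) is needed before the check reflects the change.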
Testing CUDA
Command-line test
cxu@cxu-pc:~$ source /etc/profile
cxu@cxu-pc:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
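If the toolkit version is needed in a script, it can be read from that last line. A minimal sketch, assuming the `nvcc --version` format shown above; `nvcc_release` is a hypothetical helper name.

```python
import re

# Copy of the nvcc --version output shown above.
SAMPLE = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
"""


def nvcc_release(text):
    """Return ((major, minor), full_version) from nvcc output, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+),\s+V(\S+)", text)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2))), match.group(3)


print(nvcc_release(SAMPLE))
```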
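Running the bundled CUDA samples
The CUDA installer asks whether to install the samples. I answered y, and it created a NVIDIA_CUDA-10.0_Samples directory in my home directory.
Build first: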
cd NVIDIA_CUDA-10.0_Samples
make all -j6
# the build produces a bin directory
cd bin/x86_64/linux/release/
./deviceQuery # the other samples can be run too
Output:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "P106-100"
CUDA Driver Version / Runtime Version 10.2 / 10.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6081 MBytes (6375997440 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1709 MHz (1.71 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS
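The output reports driver version 10.2 next to runtime version 10.0. That combination is fine, because the CUDA version supported by the driver only has to be at least as new as the runtime. The rule can be sketched as below; `driver_supports_runtime` is a hypothetical helper, and the pattern assumes the deviceQuery line format shown above.

```python
import re

# The relevant line from the deviceQuery output above.
SAMPLE = "CUDA Driver Version / Runtime Version 10.2 / 10.0\n"


def driver_supports_runtime(text):
    """True if the driver's CUDA version >= the runtime version; None if absent."""
    match = re.search(
        r"CUDA Driver Version / Runtime Version\s+(\d+)\.(\d+)\s*/\s*(\d+)\.(\d+)",
        text,
    )
    if match is None:
        return None
    driver = (int(match.group(1)), int(match.group(2)))
    runtime = (int(match.group(3)), int(match.group(4)))
    # Tuples compare element by element, so (10, 2) >= (10, 0).
    return driver >= runtime


print(driver_supports_runtime(SAMPLE))
```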
Common errors
Error 1: It appears that an X server is running. Please exit X before installation.
——————————————————————
Fix: run sudo service lightdm stop before installing,
and sudo service lightdm start afterwards.
https://askubuntu.com/questions/149206/how-to-install-nvidia-run
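5. Installing cuDNN
About cuDNN
cuDNN (CUDA Deep Neural Network library) is NVIDIA's GPU-accelerated library for deep neural networks. It is not strictly required for training models on a GPU, but it is generally used.
Choosing and installing a cuDNN version
Pick the cuDNN build that matches your CUDA version: https://developer.nvidia.com/rdp/cudnn-archive
Downloading requires registering and logging in. Several downloads are offered; only cuDNN Library for Linux is needed. The file I got was cudnn-10.0-linux-x64-v7.6.4.38.tgz
Extracting it gives: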
cxu@cxu-pc:~$ tree cuda
cuda
├── include
│ └── cudnn.h
├── lib64
│ ├── libcudnn.so -> libcudnn.so.7
│ ├── libcudnn.so.7 -> libcudnn.so.7.6.4
│ ├── libcudnn.so.7.6.4
│ └── libcudnn_static.a
└── NVIDIA_SLA_cuDNN_Support.txt
2 directories, 6 files
Then run:
sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp -d cuda/lib64/lib* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
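To confirm which cuDNN version ended up in place, the version macros in cudnn.h can be read back. This is a sketch under assumptions: SAMPLE imitates the three macros as they appear in a cuDNN 7 header, with values matching the file name above, and `cudnn_version` is a hypothetical helper.

```python
import re

# Excerpt in the shape of the version macros in a cuDNN 7 cudnn.h.
# Values match the 7.6.4 archive named above; this is not the full header.
SAMPLE = """\
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
"""


def cudnn_version(header_text):
    """Return (major, minor, patch) from cudnn.h text, or None if a macro is missing."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#\s*define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


print(cudnn_version(SAMPLE))
```

On the installed system the text would come from reading /usr/local/cuda/include/cudnn.h, for example with `open("/usr/local/cuda/include/cudnn.h").read()`.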
Removing cuDNN (if needed)
To remove cuDNN, run:
sudo rm /usr/local/cuda/include/cudnn.h
sudo rm -r /usr/local/cuda/lib64/libcudnn*
6. Testing TensorFlow
The GPU environment is now fully installed. Next, check that TensorFlow works.
virtualenv -p /usr/bin/python3 venv
source venv/bin/activate
pip install tensorflow-gpu
Open an ipython/python console:
>> import tensorflow as tf
>> tf.test.gpu_device_name()
Output (see line 7):
2019-11-29 22:51:19.282039: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.282540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: P106-100 major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
2019-11-29 22:51:19.282569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 22:51:19.282578: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-29 22:51:19.282587: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-11-29 22:51:19.282596: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-11-29 22:51:19.282610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-11-29 22:51:19.282619: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-11-29 22:51:19.285274: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-11-29 22:51:19.285360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.285905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.286288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-29 22:51:19.287356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-11-29 22:51:19.288043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-29 22:51:19.288056: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2019-11-29 22:51:19.288061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2019-11-29 22:51:19.288716: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.289252: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-29 22:51:19.289638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 5658 MB memory) -> physical GPU (device: 0, name: P106-100, pci bus id: 0000:01:00.0, compute capability: 6.1)
Out[3]: '/device:GPU:0'
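Other issues
- The CUDA installer bundles its own driver, which I generally do not use: after I installed it, nvidia-smi gave no response, perhaps because the management tools that come with it are less complete than those of a separately installed driver.
- One problem remains unsolved: I installed nvidia-driver-440 first, then uninstalled it to install nvidia-driver-410, but after every reboot the driver is still 440. I have not found the cause yet.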