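Symptom
The client is OpenStack, which mounts our storage over NFS. One service process went into the D state (uninterruptible sleep) and stayed there for a long time without recovering, as shown below: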
[root@ECM-043 ~]# ps aux|grep nova-compute
nova 21409 1.4 0.0 2117692 110916 ? Dl 15:06 1:24 /opt/server/python27/bin/python /usr/bin/nova-compute --logfile /var/log/nova/compute.log
Troubleshooting steps
On the client, find out where the process is hung. Use lsof to list the files the process has open:
lsof -p 21409
nova-comp 21409 nova 22w REG 0,29 0 2199054015793 /var/lib/nova/instances/locks/nova-storage-registry-lock (block.beijing.wocloud.cn:/var/share/ezfs/shareroot/block-bj)
The client process has the file nova-storage-registry-lock open on the NFS mount. The next step is to check the state of that file on the server.
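On the server, look up the inode number of the nova-storage-registry-lock file.
The inode number is 2199054015793, the same value as the NODE column in the lsof output above.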
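The notes do not record the command used for the inode lookup. A minimal sketch, assuming the server-side path of the lock file is passed on the command line (the path is not given in the notes, so none is hard-coded here):

```python
import os
import sys


def inode_of(path):
    """Return the inode number of path, the value that `ls -i` and `stat` print."""
    return os.stat(path).st_ino


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path argument is supplied by the operator; it is not taken from the notes.
    print(inode_of(sys.argv[1]))
```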
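On every NFS server, check whether some other process is holding a lock on that file (replace inode in the command with the inode number):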
root@Storage5:~# cat /proc/locks |grep inode
1: POSIX ADVISORY WRITE 28331 00:14:2199053494291 0 EOF
2: POSIX ADVISORY WRITE 28165 00:14:1099511627783 0 0
3: POSIX ADVISORY READ 360769 00:0f:21590 4 4
4: POSIX ADVISORY WRITE 28331 00:14:2199054015793 0 EOF
5: POSIX ADVISORY WRITE 1022761 00:0f:673303453 0 EOF
6: POSIX ADVISORY WRITE 98842 00:0f:47181 0 EOF
7: POSIX ADVISORY WRITE 688677 00:0f:1186403778 0 EOF
8: POSIX ADVISORY READ 38754 00:0f:21590 4 4
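In the output above, entry 4 is the one that matters: PID 28331 on the server holds a whole-file write lock on inode 2199054015793. The same filtering can be sketched in Python. The function name `holders_of_inode` is made up for this example, and the parser assumes the column layout shown in the listing:

```python
def holders_of_inode(locks_text, inode):
    """Return (pid, lock_class, access) for each /proc/locks entry held on inode."""
    holders = []
    for line in locks_text.splitlines():
        fields = line.split()
        # Lines marked "->" are blocked waiters, not holders, so skip them.
        if len(fields) < 8 or fields[1] == "->":
            continue
        # Columns: index, class, ADVISORY/MANDATORY, READ/WRITE, pid,
        # major:minor:inode, start, end
        lock_class, access, pid, ident = fields[1], fields[3], fields[4], fields[5]
        if ident.rsplit(":", 1)[-1] == str(inode):
            holders.append((int(pid), lock_class, access))
    return holders


# Three entries copied from the server output above.
SAMPLE = """\
1: POSIX ADVISORY WRITE 28331 00:14:2199053494291 0 EOF
3: POSIX ADVISORY READ 360769 00:0f:21590 4 4
4: POSIX ADVISORY WRITE 28331 00:14:2199054015793 0 EOF
"""

if __name__ == "__main__":
    print(holders_of_inode(SAMPLE, 2199054015793))
```

On a live server, the text would come from reading /proc/locks instead of the sample string.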
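On the server, check which NFS clients are connected
NFS (v2/v3) is a stateless protocol, unlike CIFS, and this holds even when it runs over TCP, so the server keeps no session list. What it does keep is the set of hosts that rpc.statd monitors for lock recovery, one file per host under /var/lib/nfs/sm: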
root@Storage5:/var/lib/nfs/sm# ll
total 48
drwxr-xr-x 2 statd root 4096 Apr 13 09:03 ./
drwxr-xr-x 5 statd root 4096 Apr 10 18:21 ../
-rw------- 1 statd root 88 Apr 11 16:08 10.55.4.1
-rw------- 1 statd root 89 Apr 10 14:47 10.55.4.15
-rw------- 1 statd root 89 Apr 10 14:41 10.55.4.16
-rw------- 1 statd root 90 Apr 7 02:10 10.55.4.199
-rw-r----- 1 statd root 1350 Apr 11 13:42 10.55.4.205
-rw------- 1 statd root 89 Apr 11 16:00 10.55.4.31
-rw------- 1 statd root 89 Apr 10 14:44 10.55.4.32
-rw------- 1 statd root 89 Apr 10 14:34 10.55.4.37
-rw------- 1 statd root 89 Apr 11 16:04 10.55.4.54
-rw-r----- 1 statd root 2288 Apr 13 09:03 10.55.4.9
root@Storage5:/var/lib/nfs/sm# pwd
/var/lib/nfs/sm
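The directory listing can also be read programmatically, for example to see which monitor records changed most recently. This is only a sketch: `monitored_hosts` is a name chosen for this example, and it assumes the default statd directory shown above.

```python
import os


def monitored_hosts(sm_dir="/var/lib/nfs/sm"):
    """Return (host, mtime) pairs for each statd monitor record, newest first."""
    entries = []
    with os.scandir(sm_dir) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.name, entry.stat().st_mtime))
    return sorted(entries, key=lambda e: e[1], reverse=True)


if __name__ == "__main__" and os.path.isdir("/var/lib/nfs/sm"):
    for host, mtime in monitored_hosts():
        print(host, mtime)
```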
Further notes
Lock types on Linux
File locks come in two main kinds, flock and fcntl.
The two differ in granularity.
flock
Locks the whole file, for example:
root@scal61:/usr/share/pyshared/ezs3# cat /proc/locks
1: FLOCK ADVISORY WRITE 97862 00:12:229635 0 EOF
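2: FLOCK (lock class) ADVISORY (advisory lock, not enforced) WRITE (write lock: the holder has write access to the locked file) 4246 (PID of the holder) 00:12:20704 (MAJOR-DEVICE:MINOR-DEVICE:INODE-NUMBER of the locked file) 0 EOF (locked range; 0 to EOF means the whole file)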
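A minimal sketch of the whole-file behaviour, using Python's standard `fcntl.flock` on a local temporary file (not on NFS). The helper name `flock_conflicts` is made up for this example. A flock lock belongs to the open file description, so a second `open()` of the same file conflicts with the first even inside one process:

```python
import fcntl
import tempfile


def flock_conflicts(path):
    """Return True if a second open of path is refused an exclusive flock."""
    with open(path, "w") as first, open(path, "w") as second:
        fcntl.flock(first, fcntl.LOCK_EX)  # exclusive lock on the whole file
        try:
            # LOCK_NB makes the attempt fail immediately instead of blocking.
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # the second descriptor was refused
        finally:
            fcntl.flock(first, fcntl.LOCK_UN)
        return False


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(flock_conflicts(tmp.name))
```

While the first lock is held, it appears in /proc/locks as a FLOCK entry covering 0 to EOF.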
fcntl
Locks part of a file (a byte range). These locks appear as POSIX in /proc/locks, for example:
root@scal61:/usr/share/pyshared/ezs3# cat /proc/locks
3: POSIX ADVISORY WRITE 97830 08:03:524404 0 EOF
4: POSIX ADVISORY WRITE 4056 00:12:41028 0 EOF
5: POSIX ADVISORY READ 2351 00:12:10095 4 4
6: POSIX ADVISORY READ 2783 00:12:10052 4 4
7: POSIX ADVISORY READ 2783 00:12:14432 4 4
8: POSIX ADVISORY WRITE 2783 00:12:10050 0 0
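A sketch of byte-range locking with Python's standard `fcntl.lockf`, again on a local temporary file. POSIX locks are owned by the process, so the conflict only shows up from a second process; the child script and the helper names here are made up for this example:

```python
import fcntl
import subprocess
import sys
import tempfile

# Child script: exits 0 if the byte range could be locked, 1 if it is already held.
CHILD = """
import fcntl, sys
path, start, length = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
with open(path, "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start)
    except OSError:
        sys.exit(1)
"""


def other_process_can_lock(path, start, length):
    """Ask a separate process to try a non-blocking write lock on a byte range."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, path, str(start), str(length)]
    )
    return result.returncode == 0


def demo():
    """Lock bytes 0-9 here, then probe bytes 0-9 and 10-19 from another process."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * 20)
        f.flush()
        fcntl.lockf(f, fcntl.LOCK_EX, 10, 0)  # length 10 starting at offset 0
        return (
            other_process_can_lock(f.name, 0, 10),
            other_process_can_lock(f.name, 10, 10),
        )


if __name__ == "__main__":
    print(demo())
```

The first probe overlaps the locked range and is refused; the second covers a disjoint range of the same file and succeeds, which is the granularity difference from flock.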
From the CentOS documentation:
- FLOCK signifies the older-style UNIX file locks from a flock system call
- POSIX represents the newer POSIX locks from the lockf call (implemented on top of fcntl on Linux)
- ADVISORY means that the lock does not prevent other people from accessing the data; it only prevents other attempts to lock it
- MANDATORY means that no other access to the data is permitted while the lock is held
- The fourth column reveals whether the lock is allowing the holder READ or WRITE access to the file