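Background
In one project, the VIP resource on one node of an Oracle 11g RAC cluster went OFFLINE, so the cluster effectively dropped from two nodes to one.
Troubleshooting
- Check the cluster status with the crsctl command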
$ su - grid
$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
ONLINE ONLINE dbprd1
ONLINE ONLINE dbprd2
ora.CRS.dg
ONLINE ONLINE dbprd1
ONLINE ONLINE dbprd2
ora.DATA.dg
ONLINE ONLINE dbprd1
ONLINE ONLINE dbprd2
ora.LISTENER.lsnr
ONLINE OFFLINE dbprd1
ONLINE ONLINE dbprd2
ora.asm
ONLINE ONLINE dbprd1 Started
ONLINE ONLINE dbprd2 Started
ora.gsd
OFFLINE OFFLINE dbprd1
OFFLINE OFFLINE dbprd2
ora.net1.network
ONLINE ONLINE dbprd1
ONLINE ONLINE dbprd2
ora.ons
ONLINE ONLINE dbprd1
ONLINE ONLINE dbprd2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE dbprd2
ora.asdfprdb.db
1 ONLINE ONLINE dbprd1 Open
2 ONLINE ONLINE dbprd2 Open
ora.cvu
1 ONLINE ONLINE dbprd2
ora.dbprd1.vip
1 ONLINE OFFLINE
ora.dbprd2.vip
1 ONLINE ONLINE dbprd2
ora.oc4j
1 ONLINE ONLINE dbprd1
ora.scan1.vip
1 ONLINE ONLINE dbprd2
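The output shows that ora.dbprd1.vip (instance 1) is OFFLINE, and ora.LISTENER.lsnr on dbprd1 is OFFLINE as well. The listener is most likely down only because it depends on the VIP, so it can be set aside for now.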
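On a cluster with many resources, spotting the one OFFLINE row by eye is error-prone. The following is a minimal sketch, not part of the original procedure, that filters the tabular output for resources whose TARGET is ONLINE but whose STATE is not. The function name find_offline_resources is hypothetical, and it assumes the 11.2 layout shown above (a resource name line followed by one row per server or instance).

```shell
#!/bin/sh
# find_offline_resources (hypothetical helper): read `crsctl stat res -t`
# output on stdin and print "<resource> <state> <server>" for every row
# whose TARGET is ONLINE but whose STATE is something else.
find_offline_resources() {
    awk '
        $1 ~ /^ora\./ { name = $1; next }      # resource name line
        {
            # Cluster resource rows start with an instance number; local ones do not.
            i = ($1 ~ /^[0-9]+$/) ? 2 : 1
            target = $i; state = $(i + 1); server = $(i + 2)
            if (target == "ONLINE" && state != "" && state != "ONLINE")
                printf "%s %s %s\n", name, state, (server == "" ? "-" : server)
        }
    '
}

# Usage (as the grid user):
#   crsctl stat res -t | find_offline_resources
```

Resources such as ora.gsd, whose TARGET is OFFLINE on purpose, are skipped, so only genuine mismatches are reported.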
- Check cluster health
[grid@dbprd1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
All the other services are healthy.
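If this check needs to run from a script or a monitoring job, the four status lines can be reduced to a single exit code. This is only a sketch: crs_all_online is a hypothetical helper, and it assumes that a healthy stack reports every CRS- line as "is online", as in the output above.

```shell
#!/bin/sh
# crs_all_online (hypothetical helper): read `crsctl check crs` output on
# stdin; succeed only if at least one CRS- line was seen and every CRS- line
# ends with "is online".
crs_all_online() {
    awk '
        /^CRS-[0-9]+:/ {
            seen++
            if ($0 !~ /is online[ \t\r]*$/) bad++
        }
        END { exit ((seen > 0 && bad == 0) ? 0 : 1) }
    '
}

# Usage (as the grid user):
#   crsctl check crs | crs_all_online && echo "stack healthy"
```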
- Check the alert log
The alert log location can be found with the following SQL query:
[grid@dbprd1 ~]$ sqlplus / as sysdba
SQL> select * from v$diag_info where name ='Diag Alert';
INST_ID NAME
---------- ----------------------------------------------------------------
VALUE
--------------------------------------------------------------------------------
1 Diag Alert
/u01/app/grid_base/diag/asm/+asm/+ASM1/alert
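The alert log contains no error messages, which suggests the instance itself is not at fault. Note that the query above was run as the grid user, so the path shown belongs to the ASM instance (+ASM1).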
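Reading a long alert log by hand is slow, so a quick scan for ORA- codes helps. The sketch below uses a hypothetical helper, scan_alert_log. It also assumes the plain-text alert log sits in the trace directory next to the alert directory returned by the query (the Diag Alert directory holds the XML version); the example path in the comment is derived from that assumption and is not taken from the original environment.

```shell
#!/bin/sh
# scan_alert_log FILE (hypothetical helper): print, with line numbers, every
# line of FILE that contains an ORA- error code. Prints nothing if none exist.
scan_alert_log() {
    grep -n 'ORA-[0-9][0-9]*' "$1" || true
}

# Usage (path is an assumption based on the Diag Alert value shown above):
#   scan_alert_log /u01/app/grid_base/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
```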
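- Check the system log
The system log is at /var/log/messages and requires root privileges to read. The messages log is rotated periodically, so pick the file whose date range covers the time of the failure.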
-rw------- 1 root root 188K Jul 7 00:01 messages
-rw------- 1 root root 686K Jun 15 03:07 messages-20200615
-rw------- 1 root root 525K Jun 21 03:42 messages-20200621
-rw------- 1 root root 694K Jun 29 03:30 messages-20200629
-rw------- 1 root root 552K Jul 5 03:30 messages-20200705
The messages log shows no obvious errors either.
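Rather than guessing which rotated file holds a given moment, it can be simpler to search all of them for the timestamp. A minimal sketch follows, assuming the traditional syslog timestamp format ("Jun 29 13:02:08 ...") and the messages-YYYYMMDD naming shown in the listing; messages_around is a hypothetical helper name.

```shell
#!/bin/sh
# messages_around PREFIX [DIR] (hypothetical helper): print every line from
# DIR/messages and DIR/messages-* whose timestamp starts with PREFIX.
# PREFIX is used as a grep pattern, so keep it to plain text such as
# "Jun 29 13:0". DIR defaults to /var/log.
messages_around() {
    prefix=$1
    dir=${2:-/var/log}
    for f in "$dir"/messages "$dir"/messages-*; do
        [ -f "$f" ] || continue
        # -H prefixes each hit with the file name; no hits is not an error.
        grep -H -- "^$prefix" "$f" || true
    done
}

# Usage (as root), for the minutes around the failure seen later in the crsd log:
#   messages_around 'Jun 29 13:0'
```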
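- Check the crsd agent log
The VIP is managed by the crsd process, so check the log of its root agent. Note that the directory under log/ is named after the node's hostname (dbprd1 here), not the SID. The file is located at
/u01/app/11.2.0/grid/log/{hostname}/agent/crsd/orarootagent_root/orarootagent_root.log
The log contains the following error: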
CRS-5005: IP Address: 172.16.200.191 is already in use in the network. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0/grid/log/dbprd1/agent/crsd/orarootagent_root//orarootagent_root.log".
2020-06-29 13:02:08.366: [ora.dbprd1.vip][2503161600]{1:57860:43811} [start] (:CLSN00107:) clsn_agent::start }
2020-06-29 13:02:08.366: [ AGFW][2503161600]{1:57860:43811} Command: start for resource: ora.dbprd1.vip 1 1 completed with status: FAIL
2020-06-29 13:02:08.367: [ AGFW][2501060352]{1:57860:43811} Agent sending reply for: RESOURCE_START[ora.dbprd1.vip 1 1] ID 4098:3899996
2020-06-29 13:02:08.367: [ AGFW][2501060352]{1:57860:43811} Agent sending reply for: RESOURCE_START[ora.dbprd1.vip 1 1] ID 4098:3899996
2020-06-29 13:02:08.867: [ora.dbprd1.vip][2503161600]{1:57860:43811} [check] Failed to check 172.16.200.191 on eth0
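172.16.200.191 is the address this RAC uses as the VIP for dbprd1. The log indicates that another host on the same network had taken the address. The VIP resource was already stopped at this point, yet the address still answered ping, which confirms that some other host was using it. I reported this to the person in charge, and a check did turn up a Windows machine configured with that IP. After the Windows machine's IP was changed and the VIP was restarted, service returned to normal. The restart command is: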
[grid]
$ srvctl start vip -n dbprd1
Here dbprd1 is the node name.
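Before restarting the VIP, it is worth confirming that the address conflict is really gone. The sketch below pulls the conflicting address out of the agent log. The helper name conflicting_ips is hypothetical. The follow-up commands in the comments assume the iputils arping tool is installed and that eth0 is the public interface, as the log above suggests.

```shell
#!/bin/sh
# conflicting_ips (hypothetical helper): read an orarootagent log on stdin and
# print each distinct IP address reported by CRS-5005 as already in use.
conflicting_ips() {
    sed -n 's/.*CRS-5005: IP Address: \([0-9.][0-9.]*\) is already in use.*/\1/p' | sort -u
}

# Usage (as root on the affected node):
#   conflicting_ips < /u01/app/11.2.0/grid/log/dbprd1/agent/crsd/orarootagent_root/orarootagent_root.log
#
# Then, while the VIP is still offline, probe the address (assumes iputils arping):
#   arping -D -I eth0 -c 2 172.16.200.191   # any reply means another host still holds the IP
#
# After `srvctl start vip -n dbprd1`, confirm the result as the grid user:
#   srvctl status vip -n dbprd1
```

Checking with arping rather than ping has the advantage of showing the MAC address of the host that answers, which makes the offending machine easier to track down.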
References
- https://docs.oracle.com/database/121/RACAD/GUID-B3AF3FC7-2EC1-4A8B-A4D9-28CF0C239AF6.htm#RACAD7848
- https://support.oracle.com/knowledge/Oracle%20Database%20Products/1470361_1.html
The Ops Tough Cases series is my curated collection of hands-on operations write-ups. Every case comes from a real production environment.