
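Background

In one project, the VIP service on one node of an Oracle 11g RAC went offline, and the cluster dropped from two nodes to one.

Troubleshooting

  • Check the cluster status with crsctl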
$ su - grid
$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ARCH.dg
               ONLINE  ONLINE       dbprd1                                       
               ONLINE  ONLINE       dbprd2                                       
ora.CRS.dg
               ONLINE  ONLINE       dbprd1                                       
               ONLINE  ONLINE       dbprd2                                       
ora.DATA.dg
               ONLINE  ONLINE       dbprd1                                       
               ONLINE  ONLINE       dbprd2                                       
ora.LISTENER.lsnr
               ONLINE  OFFLINE      dbprd1                                       
               ONLINE  ONLINE       dbprd2                                       
ora.asm
               ONLINE  ONLINE       dbprd1                   Started             
               ONLINE  ONLINE       dbprd2                   Started             
ora.gsd
               OFFLINE OFFLINE      dbprd1                                       
               OFFLINE OFFLINE      dbprd2                                       
ora.net1.network
               ONLINE  ONLINE       dbprd1                                       
               ONLINE  ONLINE       dbprd2                                       
ora.ons
               ONLINE  ONLINE       dbprd1                                       
               ONLINE  ONLINE       dbprd2                                       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       dbprd2                                       
ora.asdfprdb.db
      1        ONLINE  ONLINE       dbprd1                   Open                
      2        ONLINE  ONLINE       dbprd2                   Open                
ora.cvu
      1        ONLINE  ONLINE       dbprd2                                       
ora.dbprd1.vip
      1        ONLINE  OFFLINE                                                   
ora.dbprd2.vip
      1        ONLINE  ONLINE       dbprd2                                       
ora.oc4j
      1        ONLINE  ONLINE       dbprd1                                       
ora.scan1.vip
      1        ONLINE  ONLINE       dbprd2         

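As the output shows, ora.dbprd1.vip is OFFLINE, and ora.LISTENER.lsnr on dbprd1 is OFFLINE as well. The listener is most likely down because of the VIP, so it can be set aside for now.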
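On a cluster with many resources, the mismatches are easy to miss by eye. Below is a minimal sketch that reads the tabular output of `crsctl stat res -t` on standard input and prints every resource whose TARGET is ONLINE while its STATE is not. The function name `find_down_resources` is hypothetical, and the parsing assumes the 11.2 column layout shown above.

```shell
# Hypothetical helper: list resources that should be ONLINE but are not.
# Input: the text produced by `crsctl stat res -t` (11.2 tabular layout assumed).
# Output: one line per problem entry: <resource> <state> <server or ->
find_down_resources() {
  awk '
    /^ora\./ { name = $1; next }            # a resource name line
    name != "" && NF >= 2 {
      # cluster resources start with an instance number, local ones do not
      i = ($1 ~ /^[0-9]+$/) ? 2 : 1
      target = $i; state = $(i + 1); server = $(i + 2)
      if (target == "ONLINE" && state != "ONLINE")
        printf "%s %s %s\n", name, state, (server == "" ? "-" : server)
    }
  '
}
```

It would be used as `crsctl stat res -t | find_down_resources`. Resources with TARGET OFFLINE, such as ora.gsd here, are deliberately skipped because they are not expected to run.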

  • Check cluster health
[grid@dbprd1 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

All other services are in a normal state.

  • Check the alert log

The location of the alert log can be found with the following SQL:

[grid@dbprd1 ~]$ sqlplus / as sysdba

SQL> select * from v$diag_info where name ='Diag Alert';

   INST_ID NAME
---------- ----------------------------------------------------------------
VALUE
--------------------------------------------------------------------------------
         1 Diag Alert
/u01/app/grid_base/diag/asm/+asm/+ASM1/alert

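The alert log contains no error messages, which indicates that the instance itself is not at fault.

  • Check the system log

The system log is /var/log/messages and requires root privileges to read. It is rotated periodically, so pick the file whose date covers the time of the failure.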

-rw-------  1 root   root   188K Jul  7 00:01 messages
-rw-------  1 root   root   686K Jun 15 03:07 messages-20200615
-rw-------  1 root   root   525K Jun 21 03:42 messages-20200621
-rw-------  1 root   root   694K Jun 29 03:30 messages-20200629
-rw-------  1 root   root   552K Jul  5 03:30 messages-20200705

The messages log shows no obvious errors either.
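Because a rotated file is named after the day it was rotated, the entries for a given day may sit in either of two files. A minimal sketch that sidesteps the guesswork by searching the live file and all rotated copies for one day's entries follows. The function name `messages_for_day` is hypothetical, and the sketch assumes the classic syslog timestamp prefix (for example `Jun 29`) and uncompressed rotated files, as in the listing above.

```shell
# Hypothetical helper: print all syslog lines for one day from the live
# messages file and every rotated copy in the same directory.
#   $1  day prefix in syslog format, e.g. 'Jun 29' (use 'Jul  5' for single digits)
#   $2  log directory, defaults to /var/log
messages_for_day() {
  day=$1
  dir=${2:-/var/log}
  # -h drops the file name prefix so the output reads as one continuous log
  grep -h "^$day " "$dir"/messages*
}
```

Run as root, for example `messages_for_day 'Jun 29' | less`.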

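  • Check the crsd agent log

The VIP is managed by the crsd process, so check the crsd agent log file, located at the path below ({hostname} is the node's host name, dbprd1 in this case):

/u01/app/11.2.0/grid/log/{hostname}/agent/crsd/orarootagent_root/orarootagent_root.log

The following error appears in the log: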

CRS-5005: IP Address: 172.16.200.191 is already in use in the network
. For details refer to "(:CLSN00107:)" in "/u01/app/11.2.0/grid/log/dbprd1/agent/crsd/orarootagent_root//orarootagent_root.log".

2020-06-29 13:02:08.366: [ora.dbprd1.vip][2503161600]{1:57860:43811} [start] (:CLSN00107:) clsn_agent::start }
2020-06-29 13:02:08.366: [    AGFW][2503161600]{1:57860:43811} Command: start for resource: ora.dbprd1.vip 1 1 completed with status: FAIL
2020-06-29 13:02:08.367: [    AGFW][2501060352]{1:57860:43811} Agent sending reply for: RESOURCE_START[ora.dbprd1.vip 1 1] ID 4098:3899996
2020-06-29 13:02:08.367: [    AGFW][2501060352]{1:57860:43811} Agent sending reply for: RESOURCE_START[ora.dbprd1.vip 1 1] ID 4098:3899996
2020-06-29 13:02:08.867: [ora.dbprd1.vip][2503161600]{1:57860:43811} [check] Failed to check 172.16.200.191 on eth0
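The agent log is verbose, so it can help to pull out only the addresses reported as conflicting. A minimal sketch follows; the function name `conflicting_ips` is hypothetical, and the pattern assumes the CRS-5005 wording shown in the excerpt above.

```shell
# Hypothetical helper: read an orarootagent log on stdin and print each
# IP address reported by CRS-5005 as already in use, once per address.
conflicting_ips() {
  sed -n 's/.*CRS-5005: IP Address: \([0-9.][0-9.]*\) is already in use.*/\1/p' |
    sort -u
}
```

For example: `conflicting_ips < orarootagent_root.log`, run in the directory that holds the log.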

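172.16.200.191 is the address this RAC uses as the VIP for dbprd1. The log indicates that another host on the same network has taken it. In addition, the VIP service was stopped at this point, yet the address still answered ping, which confirms that some other host was using it. I reported this to the person in charge, and a check did turn up a Windows machine using that IP. After the Windows machine's IP was changed, I restarted the VIP and the service returned to normal. The restart command is: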

As the grid user:

$ srvctl start vip -n dbprd1

dbprd1 is the node name.
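Since the failure came from an address conflict, it may be worth confirming that the address has gone quiet before starting the VIP again. The sketch below does that with ping and then checks the result with `srvctl status vip`. The function name `safe_start_vip` and the `PING_CMD` / `SRVCTL_CMD` override variables are hypothetical, added only so the logic can be dry-run. It assumes Linux `ping` options, and that the VIP is currently offline, so any reply must come from another host. A silent address is not proof that it is free, because a host may simply drop ICMP.

```shell
# Hypothetical helper: start a node VIP only if its address no longer answers ping.
#   $1  node name (e.g. dbprd1)
#   $2  VIP address (e.g. 172.16.200.191)
# PING_CMD and SRVCTL_CMD are illustrative hooks for dry runs, not Oracle settings.
safe_start_vip() {
  node=$1
  ip=$2
  # With the VIP offline, a reply means some other host still holds the address.
  if ${PING_CMD:-ping} -c 2 -W 1 "$ip" >/dev/null 2>&1; then
    echo "$ip still answers ping; not starting the VIP" >&2
    return 1
  fi
  ${SRVCTL_CMD:-srvctl} start vip -n "$node" &&
    ${SRVCTL_CMD:-srvctl} status vip -n "$node"
}
```

As the grid user this would be called as `safe_start_vip dbprd1 172.16.200.191`. Setting `SRVCTL_CMD=echo` prints the srvctl arguments instead of running them.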

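References

This post is part of the 运维攻坚 (Ops Troubleshooting) series, my curated collection of hands-on operations records. Every case comes from a real production environment.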

