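HBase cluster: RegionServer shuts down abnormally
This happened in the production environment after the cluster had been running for some time.
Investigating the error logs
RegionServer error messages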
2019-06-26 18:25:57,946 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 33506ms
No GCs detected
2019-06-26 18:25:57,945 WARN [regionserver/dn3.ambari/10.1.11.143:16020] util.Sleeper: We slept 34118ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2019-06-26 18:25:57,945 INFO [dn3.ambari,16020,1556242814495_ChoreService_1] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time
2019-06-26 18:25:58,241 FATAL [regionserver/dn3.ambari/10.1.11.143:16020] regionserver.HRegionServer: ABORTING region server dn3.ambari,16020,1556242814495: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing dn3.ambari,16020,1556242814495 as dead server
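The two WARN lines above both report a stall of over 30 seconds just before the abort. As a rough aid for scanning a longer RegionServer log for similar stalls, here is a minimal Python sketch. The function name `find_stalls` and the 10-second threshold are illustrative assumptions, not part of HBase; the regular expressions only match the two message formats shown above.

```python
import re

# Patterns for the two warning formats seen in the RegionServer log above.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")
SLEEP_RE = re.compile(r"We slept (\d+)ms instead of (\d+)ms")


def find_stalls(lines, threshold_ms=10000):
    """Return (kind, stalled_ms) for each warning at or above threshold_ms.

    threshold_ms is an arbitrary illustrative cutoff, not an HBase setting.
    """
    stalls = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            ms = int(m.group(1))
            if ms >= threshold_ms:
                stalls.append(("jvm_pause", ms))
            continue
        m = SLEEP_RE.search(line)
        if m:
            # Time slept beyond the intended sleep interval.
            extra = int(m.group(1)) - int(m.group(2))
            if extra >= threshold_ms:
                stalls.append(("overslept", extra))
    return stalls


# The two WARN lines copied from the log above.
sample = [
    "2019-06-26 18:25:57,946 WARN [JvmPauseMonitor] util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 33506ms",
    "2019-06-26 18:25:57,945 WARN [regionserver/dn3.ambari/10.1.11.143:16020] "
    "util.Sleeper: We slept 34118ms instead of 3000ms, this is likely due to a "
    "long garbage collecting pause and it's usually bad, see "
    "http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired",
]

for kind, ms in find_stalls(sample):
    print(kind, ms)
```

On the sample lines this reports a JVM pause of 33506 ms and an oversleep of 31118 ms (34118 ms minus the intended 3000 ms).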
gc.log contents
2019-06-26T18:24:53.434+0800: 5301797.558: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1179358K(1679360K)] 1212098K(2055424K), 0.3656619 secs] [Times: user=0.04 sys=0.00, real=0.37 secs]
2019-06-26T18:24:53.800+0800: 5301797.924: [CMS-concurrent-mark-start]
2019-06-26T18:25:19.302+0800: 5301823.426: [CMS-concurrent-mark: 25.435/25.502 secs] [Times: user=0.43 sys=0.27, real=25.50 secs]
2019-06-26T18:25:19.302+0800: 5301823.426: [CMS-concurrent-preclean-start]
2019-06-26T18:25:19.316+0800: 5301823.440: [CMS-concurrent-preclean: 0.014/0.014 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2019-06-26T18:25:19.316+0800: 5301823.440: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 2019-06-26T18:25:24.387+0800: 5301828.511: [CMS-concurrent-abortable-preclean: 3.768/5.071 secs] [Times: user=3.82 sys=0.00, real=5.07 secs]
2019-06-26T18:25:24.399+0800: 5301828.523: [GC (CMS Final Remark) [YG occupancy: 101036 K (376064 K)]2019-06-26T18:25:24.399+0800: 5301828.523: [Rescan (parallel) , 0.0181882 secs]2019-06-26T18:25:24.417+0800: 5301828.541: [weak refs processing, 16.3528683 secs]2019-06-26T18:25:40.770+0800: 5301844.894: [class unloading, 15.7067612 secs]2019-06-26T18:25:56.477+0800: 5301860.601: [scrub symbol table, 1.4339443 secs]2019-06-26T18:25:57.911+0800: 5301862.035: [scrub string table, 0.0010173 secs][1 CMS-remark: 1179358K(1679360K)] 1280395K(2055424K), 33.5452080 secs] [Times: user=0.83 sys=0.30, real=33.55 secs]
2019-06-26T18:25:57.944+0800: 5301862.068: [CMS-concurrent-sweep-start]
2019-06-26T18:26:47.724+0800: 5301911.848: [CMS-concurrent-sweep: 49.167/49.780 secs] [Times: user=3.82 sys=0.80, real=49.78 secs]
2019-06-26T18:26:47.724+0800: 5301911.848: [CMS-concurrent-reset-start]
2019-06-26T18:26:47.732+0800: 5301911.856: [CMS-concurrent-reset: 0.008/0.008 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
Heap
par new generation total 376064K, used 248314K [0x0000000080000000, 0x0000000099800000, 0x0000000099800000)
eden space 334336K, 65% used [0x0000000080000000, 0x000000008d522a58, 0x0000000094680000)
from space 41728K, 72% used [0x0000000094680000, 0x00000000963dc0c0, 0x0000000096f40000)
to space 41728K, 0% used [0x0000000096f40000, 0x0000000096f40000, 0x0000000099800000)
concurrent mark-sweep generation total 1679360K, used 66650K [0x0000000099800000, 0x0000000100000000, 0x0000000100000000)
Metaspace used 72518K, capacity 91224K, committed 405388K, reserved 1255424K
class space used 7491K, capacity 13244K, committed 200360K, reserved 1048576K
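The CMS Final Remark line is the stop-the-world phase that lines up with the roughly 33-second pause reported by the RegionServer. To see which sub-phases account for that time, the line can be broken down with a short Python sketch. This is only an illustration: the helper name `summarize_remark` is hypothetical, and the regular expressions are written against the exact line format shown above, so other JVM versions or GC flags may need different patterns.

```python
import re

# Sub-phase entries look like "[weak refs processing, 16.3528683 secs]".
PHASE_RE = re.compile(r"\[([A-Za-z ()]+?) ?, ([\d.]+) secs\]")
# Trailing summary: "[Times: user=0.83 sys=0.30, real=33.55 secs]".
TIMES_RE = re.compile(r"user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs")

# The CMS Final Remark line copied from gc.log above.
REMARK_LINE = (
    "2019-06-26T18:25:24.399+0800: 5301828.523: [GC (CMS Final Remark) "
    "[YG occupancy: 101036 K (376064 K)]"
    "2019-06-26T18:25:24.399+0800: 5301828.523: [Rescan (parallel) , 0.0181882 secs]"
    "2019-06-26T18:25:24.417+0800: 5301828.541: [weak refs processing, 16.3528683 secs]"
    "2019-06-26T18:25:40.770+0800: 5301844.894: [class unloading, 15.7067612 secs]"
    "2019-06-26T18:25:56.477+0800: 5301860.601: [scrub symbol table, 1.4339443 secs]"
    "2019-06-26T18:25:57.911+0800: 5301862.035: [scrub string table, 0.0010173 secs]"
    "[1 CMS-remark: 1179358K(1679360K)] 1280395K(2055424K), 33.5452080 secs] "
    "[Times: user=0.83 sys=0.30, real=33.55 secs]"
)


def summarize_remark(line):
    """Return (sub-phases sorted by duration, CPU and wall-clock times)."""
    phases = [(name, float(secs)) for name, secs in PHASE_RE.findall(line)]
    phases.sort(key=lambda p: p[1], reverse=True)
    user, sys_, real = map(float, TIMES_RE.search(line).groups())
    return phases, {"user": user, "sys": sys_, "real": real}


phases, times = summarize_remark(REMARK_LINE)
for name, secs in phases:
    print(f"{name:<22} {secs:>10.4f} s")
print("cpu (user+sys): %.2f s, wall clock: %.2f s"
      % (times["user"] + times["sys"], times["real"]))
```

For this line, weak refs processing (about 16.4 s) and class unloading (about 15.7 s) make up nearly all of the remark pause. The CPU time (user plus sys, about 1.13 s) is also far smaller than the 33.55 s of wall-clock time. That gap suggests the JVM spent most of the pause waiting rather than computing, which may be worth checking on the host, although the logs here do not show the cause.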
Following the URL given in the log, http://hbase.apache.org/book...., I consulted the official documentation for its recommendations.
In the end, I decided to first increase the RegionServer's maximum memory and then keep observing for a while.