RegionServer shuts down abnormally: org.apache.hadoop.hbase.YouAreDeadException

A RegionServer in my HBase cluster shuts down abnormally

This happens in the production environment after the cluster has been running for some time.

Investigating the error logs

RegionServer error messages

2019-06-26 18:25:57,946 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 33506ms
No GCs detected
2019-06-26 18:25:57,945 WARN  [regionserver/dn3.ambari/10.1.11.143:16020] util.Sleeper: We slept 34118ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2019-06-26 18:25:57,945 INFO  [dn3.ambari,16020,1556242814495_ChoreService_1] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time

2019-06-26 18:25:58,241 FATAL [regionserver/dn3.ambari/10.1.11.143:16020] regionserver.HRegionServer: ABORTING region server dn3.ambari,16020,1556242814495: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing dn3.ambari,16020,1556242814495 as dead server

gc.log output

2019-06-26T18:24:53.434+0800: 5301797.558: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1179358K(1679360K)] 1212098K(2055424K), 0.3656619 secs] [Times: user=0.04 sys=0.00, real=0.37 secs] 
2019-06-26T18:24:53.800+0800: 5301797.924: [CMS-concurrent-mark-start]
2019-06-26T18:25:19.302+0800: 5301823.426: [CMS-concurrent-mark: 25.435/25.502 secs] [Times: user=0.43 sys=0.27, real=25.50 secs] 
2019-06-26T18:25:19.302+0800: 5301823.426: [CMS-concurrent-preclean-start]
2019-06-26T18:25:19.316+0800: 5301823.440: [CMS-concurrent-preclean: 0.014/0.014 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
2019-06-26T18:25:19.316+0800: 5301823.440: [CMS-concurrent-abortable-preclean-start]
 CMS: abort preclean due to time 2019-06-26T18:25:24.387+0800: 5301828.511: [CMS-concurrent-abortable-preclean: 3.768/5.071 secs] [Times: user=3.82 sys=0.00, real=5.07 secs] 
2019-06-26T18:25:24.399+0800: 5301828.523: [GC (CMS Final Remark) [YG occupancy: 101036 K (376064 K)]2019-06-26T18:25:24.399+0800: 5301828.523: [Rescan (parallel) , 0.0181882 secs]2019-06-26T18:25:24.417+0800: 5301828.541: [weak refs processing, 16.3528683 secs]2019-06-26T18:25:40.770+0800: 5301844.894: [class unloading, 15.7067612 secs]2019-06-26T18:25:56.477+0800: 5301860.601: [scrub symbol table, 1.4339443 secs]2019-06-26T18:25:57.911+0800: 5301862.035: [scrub string table, 0.0010173 secs][1 CMS-remark: 1179358K(1679360K)] 1280395K(2055424K), 33.5452080 secs] [Times: user=0.83 sys=0.30, real=33.55 secs] 
2019-06-26T18:25:57.944+0800: 5301862.068: [CMS-concurrent-sweep-start]
2019-06-26T18:26:47.724+0800: 5301911.848: [CMS-concurrent-sweep: 49.167/49.780 secs] [Times: user=3.82 sys=0.80, real=49.78 secs] 
2019-06-26T18:26:47.724+0800: 5301911.848: [CMS-concurrent-reset-start]
2019-06-26T18:26:47.732+0800: 5301911.856: [CMS-concurrent-reset: 0.008/0.008 secs] [Times: user=0.00 sys=0.00, real=0.01 secs] 
Heap
 par new generation   total 376064K, used 248314K [0x0000000080000000, 0x0000000099800000, 0x0000000099800000)
  eden space 334336K,  65% used [0x0000000080000000, 0x000000008d522a58, 0x0000000094680000)
  from space 41728K,  72% used [0x0000000094680000, 0x00000000963dc0c0, 0x0000000096f40000)
  to   space 41728K,   0% used [0x0000000096f40000, 0x0000000096f40000, 0x0000000099800000)
 concurrent mark-sweep generation total 1679360K, used 66650K [0x0000000099800000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 72518K, capacity 91224K, committed 405388K, reserved 1255424K
  class space    used 7491K, capacity 13244K, committed 200360K, reserved 1048576K
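
One detail in the gc.log above is worth checking mechanically: the CMS Final Remark pause reports real=33.55 secs but only user=0.83 sys=0.30 of CPU time. A stop-the-world pause whose wall-clock time is far above its CPU time often suggests the GC threads were waiting (for example on swapped-out memory or on a busy CPU) rather than collecting, although the log alone does not prove which. The sketch below is a minimal Python filter for such lines. The function name, the threshold values, and the list of phase markers are my own assumptions, not part of HBase or the JVM, and the sample lines are abbreviated from the log above.

```python
import re

# Matches the "[Times: user=0.83 sys=0.30, real=33.55 secs]" suffix of a HotSpot GC log line.
TIMES_RE = re.compile(r"\[Times: user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs\]")

# Log markers for phases that stop all application threads (assumed list, extend as needed).
STW_MARKERS = ("CMS Initial Mark", "CMS Final Remark", "Full GC", "Allocation Failure")


def suspicious_pauses(lines, min_real_secs=1.0, ratio=5.0):
    """Return (phase, cpu_secs, real_secs) for stop-the-world pauses whose
    wall-clock time is at least `ratio` times the CPU time spent."""
    hits = []
    for line in lines:
        phase = next((m for m in STW_MARKERS if m in line), None)
        times = TIMES_RE.search(line)
        if phase is None or times is None:
            continue  # concurrent phase, or a line without timing information
        user, sys_, real = (float(g) for g in times.groups())
        cpu = user + sys_
        if real >= min_real_secs and real >= ratio * cpu:
            hits.append((phase, round(cpu, 2), real))
    return hits


if __name__ == "__main__":
    # Abbreviated copies of three lines from the gc.log above.
    sample = [
        "[GC (CMS Initial Mark) [1 CMS-initial-mark: 1179358K(1679360K)] 1212098K(2055424K), 0.3656619 secs] [Times: user=0.04 sys=0.00, real=0.37 secs]",
        "[CMS-concurrent-mark: 25.435/25.502 secs] [Times: user=0.43 sys=0.27, real=25.50 secs]",
        "[GC (CMS Final Remark) [1 CMS-remark: 1179358K(1679360K)] 1280395K(2055424K), 33.5452080 secs] [Times: user=0.83 sys=0.30, real=33.55 secs]",
    ]
    for phase, cpu, real in suspicious_pauses(sample):
        print(f"{phase}: cpu={cpu}s real={real}s")
```

On the sample lines only the Final Remark pause is reported; the short Initial Mark pause and the concurrent mark phase (which does not stop application threads) are skipped.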

How should I tune the JVM parameters so that a Full GC no longer causes the RegionServer to lose its connection to ZooKeeper?

1 answer

I followed the URL in the log, http://hbase.apache.org/book...., and checked the official documentation. It recommends the following:

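  1. Make sure you give the RegionServer plenty of RAM (in hbase-env.sh); the default of 1GB cannot sustain long-running imports.
  2. Make sure you don't swap; the JVM never behaves well under swapping.
  3. Make sure you are not CPU-starving the RegionServer thread. For example, if you run a MapReduce job with 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to cause longer garbage collection pauses.
  4. Increase the ZooKeeper session timeout.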
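
On point 4, note that the timeout a client asks for is not necessarily the one it gets. A ZooKeeper server clamps the requested session timeout to the range [minSessionTimeout, maxSessionTimeout], which default to 2 × tickTime and 20 × tickTime when they are not set. The sketch below shows that arithmetic in Python. The function name and the example numbers (a 90000 ms request, a 2000 ms tickTime) are assumptions for illustration and are not read from this cluster; check the actual values of zookeeper.session.timeout and the ZooKeeper server settings before drawing conclusions.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms,
                                  min_session_timeout_ms=None,
                                  max_session_timeout_ms=None):
    """Session timeout a ZooKeeper server grants for a requested value.

    The request is clamped to [minSessionTimeout, maxSessionTimeout];
    unset bounds default to 2 * tickTime and 20 * tickTime.
    """
    lo = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    hi = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lo, min(requested_ms, hi))


if __name__ == "__main__":
    # Hypothetical values: the client requests 90 s, the server uses tickTime=2000 ms.
    print(negotiated_session_timeout_ms(90000, 2000))
    # Same request after raising maxSessionTimeout on the server to 120 s.
    print(negotiated_session_timeout_ms(90000, 2000, max_session_timeout_ms=120000))
```

With these assumed values the first call yields 40000 ms and the second 90000 ms, so raising the timeout on the HBase side alone may have no effect if the server-side cap is lower. A longer timeout also has a cost: when a RegionServer really fails, it takes at least that long before its regions are reassigned.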

In the end I decided to first increase the RegionServer's maximum heap size and then keep observing the cluster for a while.
