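Introduction
As the title says, this article walks through tracking down and fixing a hard-to-diagnose Netty ByteBuf memory leak. Hands-on accounts like this are well worth studying and keeping as a reference.
Original article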
A Netty ByteBuf Memory Leak Story and the Lessons Learned | Logz.io
By: Asaf Mesika
A while ago, I was chasing a memory leak at Logz.io while refactoring our log receiver. We were using Netty, and after a major refactoring we noticed a gradual decrease in the machine's free memory.
Our first action was to try to run garbage collection to see if this was an on-heap or off-heap (utilizing ByteBuf) memory issue. We quickly found that it was an off-heap issue and started to read through the code to see where we forgot to call the release() method on the ByteBuf type. We could not find anything obvious — but that is usually the case when it comes to memory leaks.
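One way to make the on-heap versus off-heap check concrete is to compare heap usage with the JDK's direct buffer pool after requesting a GC. The sketch below is only an illustration using standard JDK management APIs; the class name `MemorySnapshot` is hypothetical, and depending on how Netty is configured, its allocator may reserve direct memory that this pool counter does not show.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

/** Hypothetical helper: compares heap usage with the JDK's "direct" buffer pool. */
final class MemorySnapshot {

    /** Bytes currently in use on the Java heap. */
    static long heapUsedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Bytes held by direct ByteBuffers that the JDK tracks, or -1 if the pool is not exposed. */
    static long directUsedBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Allocate 1 MiB outside the heap so the direct pool has something to report.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(1 << 20);
        System.gc(); // only a hint; the JVM may ignore it
        System.out.println("heap used:       " + heapUsedBytes());
        System.out.println("direct used:     " + directUsedBytes());
        System.out.println("buffer capacity: " + offHeap.capacity());
    }
}
```

If heap usage stays flat after a GC while the machine's free memory keeps shrinking, the leak is more likely off-heap, which matches the diagnosis described above.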
Then, I noticed that there was a message that appeared only once when we started the application:
ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()
At first, I did not pay much attention to the message because it only appeared once. So, I figured that it was a single ByteBuf that I forgot to release and that I would fix it the following week. After a couple of days, we noticed that the host’s free memory was still decreasing. So, I realized that I needed to understand more about this error.
In the reference counted objects section in Netty’s documentation, there was a detailed section entitled “Troubleshooting buffer leaks.” When I read that part of the documentation, I did not understand it completely until I read the following:
Netty adds a hook to the ByteBuf code so that when a GC occurs, it checks whether the buffer was released; if it was not, it prints the error message above. One important detail is that it performs this check for only a fraction of the byte buffers (sampling), so seeing the error message just once probably means the leak happens far more often than once.
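To see why a single message can hide many leaks, here is a back-of-the-envelope estimate. It is a sketch, not Netty code: `LeakSamplingEstimate` and `estimateLeaks` are hypothetical names, and the 1% tracked fraction is an assumption based on how Netty's documentation describes the default simple level.

```java
/** Hypothetical back-of-the-envelope estimate; not part of Netty. */
final class LeakSamplingEstimate {

    /**
     * Expected number of leaked buffers, given how many leak reports were seen
     * and what fraction of buffers the detector tracks.
     */
    static double estimateLeaks(long reportsSeen, double trackedFraction) {
        if (trackedFraction <= 0.0 || trackedFraction > 1.0) {
            throw new IllegalArgumentException("trackedFraction must be in (0, 1]");
        }
        return reportsSeen / trackedFraction;
    }

    public static void main(String[] args) {
        // Assumption: roughly 1% of buffers are tracked by the detector.
        double estimate = estimateLeaks(1, 0.01);
        System.out.println("One report suggests roughly " + Math.round(estimate) + " leaked buffers");
    }
}
```

The number is only an expectation under the assumed sampling rate, but it shows why a lone report deserves attention.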
Once I understood that, I added the JVM option
-Dio.netty.leakDetectionLevel=advanced
as recommended. When the application started, I saw two error messages instead of one. More importantly, the log message now included the location in the code where the unreleased ByteBuf had been created, which pointed me to where I was causing the leak. The first takeaway: Do not ignore memory leak messages. Immediately switch the leak detection level to advanced mode with the JVM command-line argument to find the origin of the leak.
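If editing the JVM command line is inconvenient, the same property can be set from code. The sketch below assumes it runs before any Netty class is loaded, since Netty reads the property when its leak detector initializes; `LeakDetectionConfig` is a hypothetical helper, and the four level names are the ones listed in Netty's documentation. The error message above also mentions `ResourceLeakDetector.setLevel()` as an alternative.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper that sets Netty's leak detection level through a system property. */
final class LeakDetectionConfig {

    /** Property name taken from the error message quoted above. */
    static final String PROPERTY = "io.netty.leakDetectionLevel";

    /** Level names documented by Netty. */
    static final Set<String> LEVELS = Set.of("disabled", "simple", "advanced", "paranoid");

    /** Validates the level name, sets the property, and returns the normalized name. */
    static String apply(String level) {
        String normalized = level.trim().toLowerCase(Locale.ROOT);
        if (!LEVELS.contains(normalized)) {
            throw new IllegalArgumentException("Unknown leak detection level: " + level);
        }
        // Assumption: this runs before the first Netty class is loaded.
        System.setProperty(PROPERTY, normalized);
        return normalized;
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + "=" + apply("advanced"));
    }
}
```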
The second takeaway: When hunting down ByteBuf memory leaks, run a "find usage" on the class and trace your code upwards through the calling hierarchy until you reach the code that actually created the buffer, even if the answer seems obvious, and especially if third-party code is causing the problem.
The third takeaway came from a side effect of switching the leak detection level to advanced mode. When I ran my performance load test, the receiver barely reached 25 MB/sec, whereas the same machine usually handles 200 MB/sec. The build under test also contained other new code, so I was not sure what was causing the slowdown.
I started commenting out code until I had reached a point where my handler simply did nothing — the handler practically looked like a copy-paste of the Discard Server example from Netty’s documentation.
When I removed the
-Dio.netty.leakDetectionLevel=advanced
JVM option, the speed returned to normal. I was amazed! So, just to boil this article down to a single point to remember: The leak detection level’s advanced mode may slow down Netty by a factor of 10.
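Given that cost, one option is to enable advanced mode only where throughput does not matter. The sketch below is one possible policy, not something the article prescribes: `LeakLevelPolicy` and the environment names "dev" and "test" are hypothetical, and every other environment falls back to the default simple level.

```java
import java.util.Locale;

/** Hypothetical policy: expensive leak tracking only outside production. */
final class LeakLevelPolicy {

    /** Returns the leak detection level name to use for the given environment. */
    static String levelFor(String environment) {
        if (environment == null) {
            return "simple";
        }
        switch (environment.trim().toLowerCase(Locale.ROOT)) {
            case "dev":
            case "test":
                return "advanced"; // detailed leak origins, at a throughput cost
            default:
                return "simple";   // Netty's default level
        }
    }

    /** Builds the JVM flag shown in the article for the given environment. */
    static String jvmFlagFor(String environment) {
        return "-Dio.netty.leakDetectionLevel=" + levelFor(environment);
    }

    public static void main(String[] args) {
        System.out.println(jvmFlagFor("test"));
        System.out.println(jvmFlagFor("production"));
    }
}
```

With a policy like this, load tests meant to measure throughput should run under the production setting so the detector does not distort the numbers.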
Have you had any experiences with memory leaks using Netty and had learned some lessons as a result? If so, I’d love to hear your stories in the comments below!