【Netty】A Netty ByteBuf Memory Leak Story and the Lessons Learned

Introduction

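As the title says, this article recounts a hard-to-diagnose Netty ByteBuf memory leak and how it was tracked down and fixed in practice. First-hand accounts like this are well worth studying and keeping as a reference.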

Original text

A Netty ByteBuf Memory Leak Story and the Lessons Learned | Logz.io

By: Asaf Mesika

A while ago, I was chasing a memory leak we had at Logz.io while refactoring our log receiver. We were using Netty, and after a major refactoring we noticed that the machine's free memory was gradually decreasing.

Our first step was to trigger a garbage collection to see whether this was an on-heap or an off-heap (ByteBuf-backed) memory issue. We quickly found that it was an off-heap issue and started reading through the code to see where we had forgotten to call the release() method on a ByteBuf. We could not find anything obvious, but that is usually the case when it comes to memory leaks.
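To make the "forgotten release()" failure mode concrete, here is a minimal sketch of the reference-counting contract. It deliberately uses a toy class rather than Netty itself: `ToyBuffer`, `handleLeaky`, and `handleSafe` are hypothetical names invented for illustration, not part of Netty or of the author's code. In real Netty code the same shape is usually achieved by calling `release()` on the ByteBuf, or `ReferenceCountUtil.release(msg)`, inside a `finally` block.

```java
// Toy model of a reference-counted buffer. This is NOT Netty's ByteBuf;
// all names here are hypothetical and exist only for illustration.
final class ReleaseDemo {

    static final class ToyBuffer {
        private int refCnt = 1; // a freshly created buffer holds one reference

        int refCnt() { return refCnt; }

        // Returns true when the last reference is dropped.
        boolean release() {
            if (refCnt == 0) throw new IllegalStateException("buffer already released");
            refCnt--;
            return refCnt == 0;
        }
    }

    // Leaky: an exception (or an early return) skips the release call.
    static void handleLeaky(ToyBuffer buf, boolean fail) {
        if (fail) throw new IllegalArgumentException("bad message");
        buf.release();
    }

    // Safe: try/finally releases the buffer on every code path.
    static void handleSafe(ToyBuffer buf, boolean fail) {
        try {
            if (fail) throw new IllegalArgumentException("bad message");
        } finally {
            buf.release();
        }
    }

    public static void main(String[] args) {
        ToyBuffer leaked = new ToyBuffer();
        try { handleLeaky(leaked, true); } catch (IllegalArgumentException ignored) { }
        System.out.println("leaky path refCnt = " + leaked.refCnt()); // still 1: never freed

        ToyBuffer freed = new ToyBuffer();
        try { handleSafe(freed, true); } catch (IllegalArgumentException ignored) { }
        System.out.println("safe path refCnt = " + freed.refCnt());   // 0: released
    }
}
```

The leaky variant looks correct on the happy path, which is one reason such leaks are hard to spot by reading the code alone.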

Then I noticed a message that appeared only once, when we started the application:

ERROR i.n.u.ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call ResourceLeakDetector.setLevel()

At first, I did not pay much attention to the message because it appeared only once. I figured that it was a single ByteBuf that I had forgotten to release and that I would fix it the following week. After a couple of days, we noticed that the host's free memory was still decreasing, and I realized that I needed to understand more about this error.

The reference counted objects section of Netty's documentation contains a detailed section entitled "Troubleshooting buffer leaks." When I first read that part of the documentation, I did not understand it completely until I grasped the following:

Netty adds a hook to the ByteBuf code so that when a GC occurs, it checks whether the buffer was released; if it was not, it prints the error message above. One important detail is that it performs this check for only a fraction of the buffers (sampling), so when you see this error message only once, the leak has probably happened many more times than that.
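The effect of sampling on what shows up in the log can be illustrated with a small, deterministic sketch. Two assumptions are made purely for illustration: the detector tracks exactly every 128th allocation (a real detector may sample randomly and at a different rate, so check the configuration of your Netty version), and every buffer in the run is leaked. `SamplingDemo` and its methods are hypothetical names.

```java
// Toy model of a sampling leak detector. The interval of 128 and the
// "every N-th allocation" rule are assumptions made for illustration only.
final class SamplingDemo {

    // Number of leak reports produced when `leaked` buffers are never
    // released and only every `interval`-th allocation is tracked.
    static int reportedLeaks(int leaked, int interval) {
        int reports = 0;
        for (int i = 1; i <= leaked; i++) {
            if (i % interval == 0) {
                reports++; // this buffer was tracked, so its leak is visible
            }
        }
        return reports;
    }

    // Rough estimate of how many real leaks stand behind the observed reports.
    static long estimatedLeaks(int reports, int interval) {
        return (long) reports * interval;
    }

    public static void main(String[] args) {
        int interval = 128;
        for (int leaked : new int[] {100, 128, 1000}) {
            System.out.println(leaked + " leaked buffers -> "
                    + reportedLeaks(leaked, interval) + " report(s)");
        }
        System.out.println("1 report suggests roughly "
                + estimatedLeaks(1, interval) + " leaked buffers");
    }
}
```

Under these assumptions, 100 leaked buffers produce no report at all and 1000 produce only 7, which is why a single LEAK line deserves more attention than its frequency suggests.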

Once I understood that, I added the JVM option

-Dio.netty.leakDetectionLevel=advanced

as recommended. When the application started, I now saw two error messages instead of one. The log message also contained one more important detail: the location in the code where I had created the specific ByteBuf that had not been released. This helped me pinpoint where I was causing the leak.

The first takeaway: do not ignore memory leak messages. Immediately switch the leak detection level to advanced mode with the JVM command-line argument to find the origin of the leak.

The second takeaway: when hunting down ByteBuf memory leaks, run a "find usage" on the class and trace your code upwards through the call hierarchy until you reach the code that actually created the buffer. Do this even if the cause seems obvious, and especially if third-party code is causing the problem.

The third takeaway came from a side effect of changing the leak detection level to advanced mode. When I ran my performance load test, I noticed that the receiver barely reached 25 MB/sec, whereas the usual rate on the same machine is 200 MB/sec. The build I was testing also contained other new code, so I was not sure what had caused the slowdown.

I started commenting out code until my handler did nothing at all; it practically looked like a copy-paste of the Discard Server example from Netty's documentation. The throughput was still low.

When I removed the

-Dio.netty.leakDetectionLevel=advanced

JVM option, the speed returned to normal. I was amazed! So, to boil this article down to a single point to remember: the leak detection level's advanced mode may slow Netty down by roughly a factor of 10 (in this case, from about 200 MB/sec to 25 MB/sec).
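One way to avoid leaving the expensive level switched on by accident is a small startup check. The sketch below is only an assumption about how such a guard could look: it reads the same system property quoted in this article using plain JDK calls, treats `advanced` and `paranoid` as the costly levels, and uses a hypothetical `production` flag passed on the command line. `LeakLevelGuard` is an invented name, and the value parsing here is simplified and is not Netty's own parsing logic.

```java
// Hypothetical startup guard; not part of Netty. It only inspects the
// system property that the JVM option in this article sets.
final class LeakLevelGuard {

    // Property name taken from the JVM option quoted in the article.
    static final String PROP = "io.netty.leakDetectionLevel";

    // True for the levels that record where buffers were created or accessed.
    // An unset property counts as not expensive.
    static boolean isExpensive(String level) {
        if (level == null) return false;
        String v = level.trim();
        return v.equalsIgnoreCase("advanced") || v.equalsIgnoreCase("paranoid");
    }

    // Returns a warning for production runs with an expensive level, else "".
    static String startupWarning(String level, boolean production) {
        if (production && isExpensive(level)) {
            return "WARNING: " + PROP + "=" + level.trim()
                    + " can reduce throughput considerably; use it only while hunting a leak";
        }
        return "";
    }

    public static void main(String[] args) {
        // "prod" as the first argument is a made-up convention for this sketch.
        boolean production = args.length > 0 && args[0].equals("prod");
        String warning = startupWarning(System.getProperty(PROP), production);
        if (!warning.isEmpty()) {
            System.err.println(warning);
        }
    }
}
```

Run with, for example, `java -Dio.netty.leakDetectionLevel=advanced LeakLevelGuard prod` to see the warning. Logging rather than failing keeps the option available for deliberate debugging sessions.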

Have you run into memory leaks while using Netty and learned some lessons as a result? If so, I'd love to hear your stories in the comments below!

阿东
