1. Business background
Push messages are used heavily in mobile scenarios. They help operators achieve their operational goals more efficiently (for example, pushing marketing activities to users or reminding them of new APP features).
The push system must satisfy two requirements:
- Messages are delivered to users within seconds and without delay, supporting millions of pushes per second and millions of long connections on a single machine.
- Multiple display forms are supported, such as notifications, plain text, and transparent transmission of custom messages.
These requirements are exactly what makes the system challenging to develop and maintain. The figure below is a brief sketch of the push system (API -> Push Module -> Mobile).
2. Background of the problem
During the stability and stress testing of the push system's long-connection cluster, a process would randomly hang after running for a while. The probability was small (roughly once a month), but it affected the timeliness of message delivery to some clients.
The long-connection node (Broker system) in the push system is developed on top of Netty; this node maintains the long connections between the server and mobile devices. After the problem occurred online, Netty's memory-leak monitoring parameters were added for troubleshooting, but no leak was reported.
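The article does not list the exact monitoring parameters that were used; purely as an illustration, Netty's built-in buffer leak detector can be tightened via the io.netty.leakDetection.level system property or programmatically, for example:

import io.netty.util.ResourceLeakDetector;

public class LeakDetectionConfig {
    public static void enableParanoidLeakDetection() {
        // PARANOID samples every buffer allocation; use only while troubleshooting,
        // since it adds noticeable overhead.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
}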
Since the long-connection node is built on Netty, a brief introduction to Netty follows to help readers understand the rest of the article.
3. Netty introduction
Netty is a high-performance, asynchronous, event-driven NIO framework built on the APIs provided by Java NIO. It supports TCP, UDP, and file transfer. As the most popular NIO framework, Netty is widely used in Internet services, big-data distributed computing, gaming, telecommunications, and so on; open-source components such as HBase, Hadoop, Bees, and Dubbo are also built on Netty's NIO framework.
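To give a feel for the programming model, here is a minimal, self-contained echo-style server bootstrap. It is not the Broker's actual code; the class name, port, and handler are invented for illustration only.

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class MinimalNettyServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);   // accepts incoming connections
        EventLoopGroup workerGroup = new NioEventLoopGroup();  // handles I/O for accepted channels
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(bossGroup, workerGroup)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                         @Override
                         public void channelRead(ChannelHandlerContext ctx, Object msg) {
                             ctx.writeAndFlush(msg); // echo the received bytes back to the client
                         }
                     });
                 }
             });
            ChannelFuture f = b.bind(8080).sync();
            f.channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
}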
4. Problem analysis
4.1 Conjecture
The initial guess was that the crashes were caused by the sheer number of long connections, but after checking the logs and analyzing the code, this was ruled out.
Number of long connections: about 390,000, as shown below:
Each channel occupies about 1,456 bytes. Even calculated at 400,000 long connections, that is only about 1,456 B x 400,000 ≈ 580 MB, which by itself would not cause excessive memory usage.
4.2 View GC log
Checking the GC log, we found that before the process hung, full GC was occurring frequently (about every 5 minutes) without memory being reclaimed. An off-heap (direct) memory leak was suspected.
4.3 Analyze the heap memory situation
The ChannelOutboundBuffer objects occupy nearly 5 GB of memory, which basically pins down the cause of the leak: ChannelOutboundBuffer holds too many entries. Reading the ChannelOutboundBuffer source code shows that the data in it was never written out, so it kept piling up; internally, ChannelOutboundBuffer is a linked list.
4.4 The analysis above shows that the data was not written out. Why does this happen?
The code does check whether a connection is available (Channel.isActive) and closes connections that time out. From past experience, the problem occurs when the connection is half-open (the client was closed abnormally): as long as the two sides are not exchanging data, nothing goes wrong.
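The article does not show that timeout-cleanup code; a common way to implement this kind of idle-connection close in Netty is an IdleStateHandler, sketched below purely as an assumption about what such a check might look like.

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

// Hypothetical pipeline setup: close a channel that has been idle for 300 seconds.
// ch.pipeline().addLast(new IdleStateHandler(0, 0, 300));
// ch.pipeline().addLast(new IdleConnectionCloser());
public class IdleConnectionCloser extends ChannelInboundHandlerAdapter {
    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            // Channel.isActive() can still return true for a half-open connection,
            // so an idle timeout is used to reclaim it.
            ctx.close();
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}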
Based on this conjecture, we tried to reproduce the problem in the test environment:
1) Simulate a client cluster and establish connections to the long-connection server, then set a firewall rule on the client nodes to simulate an abnormal network between server and client (i.e., Channel.isActive() returns true, but data cannot actually be sent).
2) Reduce the off-heap memory limit and keep sending test messages (about 1 KB each) to the clients above (a minimal sketch of such a sender follows this list).
3) With the off-heap memory limited to 128 MB, the problem in fact reproduced after a little more than 90,000 writes.
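The article does not include the test sender's code; the following is a minimal sketch of what such a flooding write loop might look like (the class name and message format are made up). It deliberately performs no isWritable()/water-mark check, so every message is queued in the ChannelOutboundBuffer until direct memory runs out.

import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;

public class FloodWriter {
    // Keep writing 1 KB messages to a channel whose peer can no longer receive data.
    public static void flood(Channel channel, int messageCount) {
        byte[] payload = new byte[1024];
        for (int i = 0; i < messageCount && channel.isActive(); i++) {
            // Allocate a direct buffer and copy the payload into it.
            ByteBuf buf = channel.alloc().directBuffer(payload.length).writeBytes(payload);
            // No backpressure check: the entry piles up in the ChannelOutboundBuffer
            // because the socket cannot actually send the data.
            channel.writeAndFlush(buf);
        }
    }
}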
5. Problem solving
5.1 Use the autoRead mechanism
When the channel is not writable, turn off autoRead:
public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
    if (!ctx.channel().isWritable()) {
        Channel channel = ctx.channel();
        ChannelInfo channelInfo = ChannelManager.CHANNEL_CHANNELINFO.get(channel);
        String clientId = "";
        if (channelInfo != null) {
            clientId = channelInfo.getClientId();
        }
        LOGGER.info("channel is unwritable, turn off autoread, clientId:{}", clientId);
        channel.config().setAutoRead(false);
    }
}
Turn autoRead back on when the channel becomes writable again:
@Override
public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
    Channel channel = ctx.channel();
    ChannelInfo channelInfo = ChannelManager.CHANNEL_CHANNELINFO.get(channel);
    String clientId = "";
    if (channelInfo != null) {
        clientId = channelInfo.getClientId();
    }
    if (channel.isWritable()) {
        LOGGER.info("channel is writable again, turn on autoread, clientId:{}", clientId);
        channel.config().setAutoRead(true);
    }
}
Description:
autoRead enables finer-grained rate control. When autoRead is on, Netty registers the read event for us, and whenever the channel is readable, Netty reads data from it. When autoRead is turned off, Netty does not register the read event, so even if the peer sends data, the read event does not fire and no data is read from the channel. Once the TCP receive buffer (recv_buffer) fills up, no more data is accepted from the peer.
5.2 Set the high and low water marks
serverBootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK, new WriteBufferWaterMark(1024 * 1024, 8 * 1024 * 1024));
Note: the high and low water marks work together with the isWritable() check described below.
5.3 Add a channel.isWritable() check
Besides verifying channel.isActive(), whether a channel can be used also needs to be judged with channel.isWritable(): isActive() only tells whether the connection is alive, while isWritable() tells whether more data can be written to it.
private void writeBackMessage(ChannelHandlerContext ctx, MqttMessage message) {
    Channel channel = ctx.channel();
    // add the channel.isWritable() check
    if (channel.isActive() && channel.isWritable()) {
        ChannelFuture cf = channel.writeAndFlush(message);
        if (cf.isDone() && cf.cause() != null) {
            LOGGER.error("channelWrite error!", cf.cause());
            ctx.close();
        }
    }
}
Note: isWritable() keeps the ChannelOutboundBuffer from growing without bound; its judgment is based on the high and low water marks configured for the channel.
5.4 Problem verification
After the modification, the test was repeated, and no error occurred even after sending more than 270,000 messages.
6. Analysis of the solution
The general data flow in Netty is: data is read, handed to a business thread for processing, and sent out after processing (the whole process is asynchronous). To improve network throughput, Netty adds a ChannelOutboundBuffer between the business layer and the socket.
When channel.write() is called, the data is not actually written to the socket; it is first written to the ChannelOutboundBuffer, and only when channel.flush() is called is it really written to the socket. Because this buffer sits in the middle there is a rate-matching problem, and the buffer is unbounded (a linked list). If the rate of channel.write() calls is not controlled, a large amount of data piles up in the buffer; if on top of that the socket cannot send data at all (the isActive() check is useless at this point) or sends it slowly, the likely result is resource exhaustion. And if the ChannelOutboundBuffer holds DirectByteBuffers, the problem becomes even harder to troubleshoot.
The process can be abstracted as follows:
From the above analysis, it can be seen that problems arise if step 1 writes too fast (faster than the data can be processed and sent) or if the downstream cannot send data out. This is really a rate-matching problem.
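To make the write/flush split concrete, here is a small illustrative snippet (not from the article; the class name and messages are invented) showing where the data sits at each step:

import io.netty.buffer.ByteBufUtil;
import io.netty.channel.Channel;

public class WriteVsFlushDemo {
    // Illustrative only: assumes some already-connected Netty Channel.
    public static void send(Channel channel) {
        channel.write(ByteBufUtil.writeUtf8(channel.alloc(), "msg1")); // queued in ChannelOutboundBuffer, nothing on the wire yet
        channel.write(ByteBufUtil.writeUtf8(channel.alloc(), "msg2")); // also queued; the linked list keeps growing
        channel.flush();                                               // queued entries are now handed to the socket
        channel.writeAndFlush(ByteBufUtil.writeUtf8(channel.alloc(), "msg3")); // write followed by flush in one call
    }
}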
7. Netty source code description
Above the high water mark
When the capacity of the ChannelOutboundBuffer exceeds the high watermark threshold, isWritable() returns false, the channel is set to be unwritable (setUnwritable), and fireChannelWritabilityChanged() is triggered.
private void incrementPendingOutboundBytes(long size, boolean invokeLater) {
    if (size == 0) {
        return;
    }
    long newWriteBufferSize = TOTAL_PENDING_SIZE_UPDATER.addAndGet(this, size);
    if (newWriteBufferSize > channel.config().getWriteBufferHighWaterMark()) {
        setUnwritable(invokeLater);
    }
}

private void setUnwritable(boolean invokeLater) {
    for (;;) {
        final int oldValue = unwritable;
        final int newValue = oldValue | 1;
        if (UNWRITABLE_UPDATER.compareAndSet(this, oldValue, newValue)) {
            if (oldValue == 0 && newValue != 0) {
                fireChannelWritabilityChanged(invokeLater);
            }
            break;
        }
    }
}
Below the low water mark
When the capacity of the ChannelOutboundBuffer is lower than the low water mark threshold, isWritable() returns true, the channel is set to be writable, and fireChannelWritabilityChanged() is triggered.
private void decrementPendingOutboundBytes(long size, boolean invokeLater, boolean notifyWritability) {
    if (size == 0) {
        return;
    }
    long newWriteBufferSize = TOTAL_PENDING_SIZE_UPDATER.addAndGet(this, -size);
    if (notifyWritability && newWriteBufferSize < channel.config().getWriteBufferLowWaterMark()) {
        setWritable(invokeLater);
    }
}

private void setWritable(boolean invokeLater) {
    for (;;) {
        final int oldValue = unwritable;
        final int newValue = oldValue & ~1;
        if (UNWRITABLE_UPDATER.compareAndSet(this, oldValue, newValue)) {
            if (oldValue != 0 && newValue == 0) {
                fireChannelWritabilityChanged(invokeLater);
            }
            break;
        }
    }
}
8. Summary
When the capacity of the ChannelOutboundBuffer exceeds the configured high water mark, isWritable() returns false, indicating that messages have piled up and the write rate needs to be reduced.
When the capacity of the ChannelOutboundBuffer drops below the low water mark, isWritable() returns true, indicating that the backlog is small and the write rate can be increased again. After applying the three fixes above, the service has been observed online for half a year without the problem recurring.
Author: vivo Internet server team-Zhang Lin