Exploring the RocketMQ Source Code, Series 1: Transaction Messages from the Producer's Perspective
1. Introduction
Apache RocketMQ is a well-known open source messaging middleware. It was born at Alibaba and donated to Apache in 2016. From RocketMQ 4.0 to the latest release, v4.7.1, it has won widespread attention and praise both inside Alibaba and in the wider community.
Out of interest and for work, I recently studied part of the RocketMQ 4.7.1 code. Along the way I ran into plenty of confusion, and gained even more insight.
This article takes the sender's perspective and reads the RocketMQ Producer source code to analyze how transaction message sending works. Note that all code quoted here comes from RocketMQ 4.7.1. "Sending" in this article means only the path from the Producer to the Broker; it does not cover the Broker delivering the message to the Consumer.
2. Macro overview
RocketMQ transaction message sending process:
figure 1
In the source code, the sendMessageInTransaction method of TransactionMQProducer, RocketMQ's transaction message producer, delegates to the sendMessageInTransaction method of DefaultMQProducerImpl. Stepping into that method, the whole sending process of a transaction message is laid out clearly:
First, the message is checked before sending and the necessary properties are filled in, including the one that marks it as a prepared (half) transaction message.
source code list-1
public TransactionSendResult sendMessageInTransaction(final Message msg,
final LocalTransactionExecuter localTransactionExecuter, final Object arg)
throws MQClientException {
TransactionListener transactionListener = getCheckListener();
if (null == localTransactionExecuter && null == transactionListener) {
throw new MQClientException("tranExecutor is null", null);
}
// ignore DelayTimeLevel parameter
if (msg.getDelayTimeLevel() != 0) {
MessageAccessor.clearProperty(msg, MessageConst.PROPERTY_DELAY_TIME_LEVEL);
}
Validators.checkMessage(msg, this.defaultMQProducer);
SendResult sendResult = null;
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_TRANSACTION_PREPARED, "true");
MessageAccessor.putProperty(msg, MessageConst.PROPERTY_PRODUCER_GROUP, this.defaultMQProducer.getProducerGroup());
Enter the sending process:
source code list-2
try {
sendResult = this.send(msg);
} catch (Exception e) {
throw new MQClientException("send message Exception", e);
}
Whether to execute the local transaction is decided by the result the broker returns. If the half message was sent successfully, the local transaction starts executing:
source code list-3
LocalTransactionState localTransactionState = LocalTransactionState.UNKNOW;
Throwable localException = null;
switch (sendResult.getSendStatus()) {
case SEND_OK: {
try {
if (sendResult.getTransactionId() != null) {
msg.putUserProperty("__transactionId__", sendResult.getTransactionId());
}
String transactionId = msg.getProperty(MessageConst.PROPERTY_UNIQ_CLIENT_MESSAGE_ID_KEYIDX);
if (null != transactionId && !"".equals(transactionId)) {
msg.setTransactionId(transactionId);
}
if (null != localTransactionExecuter) {
localTransactionState = localTransactionExecuter.executeLocalTransactionBranch(msg, arg);
} else if (transactionListener != null) {
log.debug("Used new transaction API");
localTransactionState = transactionListener.executeLocalTransaction(msg, arg);
}
if (null == localTransactionState) {
localTransactionState = LocalTransactionState.UNKNOW;
}
if (localTransactionState != LocalTransactionState.COMMIT_MESSAGE) {
log.info("executeLocalTransactionBranch return {}", localTransactionState);
log.info(msg.toString());
}
} catch (Throwable e) {
log.info("executeLocalTransactionBranch exception", e);
log.info(msg.toString());
localException = e;
}
}
break;
case FLUSH_DISK_TIMEOUT:
case FLUSH_SLAVE_TIMEOUT:
case SLAVE_NOT_AVAILABLE: // When the slave broker is unavailable, roll back the half message and skip the local transaction
localTransactionState = LocalTransactionState.ROLLBACK_MESSAGE;
break;
default:
break;
}
When the local transaction finishes, the second phase is handled according to the local transaction state:
source code list-4
try {
this.endTransaction(sendResult, localTransactionState, localException);
} catch (Exception e) {
log.warn("local transaction execute " + localTransactionState + ", but end broker transaction failed", e);
}
// Assemble the send result
// ...
return transactionSendResult;
}
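The control flow of the four listings above can be condensed into a small stand-alone sketch. The enums only mirror the names used in the RocketMQ source; `TwoPhaseFlowSketch`, `decide`, and the `Supplier` standing in for the local transaction are hypothetical names introduced for illustration, not RocketMQ APIs.

```java
import java.util.function.Supplier;

final class TwoPhaseFlowSketch {
    // Simplified stand-ins that mirror the enum names in the RocketMQ source.
    enum SendStatus { SEND_OK, FLUSH_DISK_TIMEOUT, FLUSH_SLAVE_TIMEOUT, SLAVE_NOT_AVAILABLE }
    enum LocalTransactionState { COMMIT_MESSAGE, ROLLBACK_MESSAGE, UNKNOW }

    /**
     * Decides which state the second phase reports to the broker.
     * localTransaction is a hypothetical stand-in for the user's local transaction.
     */
    static LocalTransactionState decide(SendStatus halfMessageStatus,
                                        Supplier<LocalTransactionState> localTransaction) {
        if (halfMessageStatus != SendStatus.SEND_OK) {
            // Half message not safely stored: roll back without running the local transaction.
            return LocalTransactionState.ROLLBACK_MESSAGE;
        }
        try {
            LocalTransactionState state = localTransaction.get();
            // A null result is treated as UNKNOW, as in source code list-3.
            return state == null ? LocalTransactionState.UNKNOW : state;
        } catch (RuntimeException e) {
            // An exception leaves the outcome undetermined; the broker checks back later.
            return LocalTransactionState.UNKNOW;
        }
    }

    public static void main(String[] args) {
        // prints COMMIT_MESSAGE
        System.out.println(decide(SendStatus.SEND_OK, () -> LocalTransactionState.COMMIT_MESSAGE));
        // prints ROLLBACK_MESSAGE
        System.out.println(decide(SendStatus.SLAVE_NOT_AVAILABLE, () -> LocalTransactionState.COMMIT_MESSAGE));
        // prints UNKNOW
        System.out.println(decide(SendStatus.SEND_OK, () -> { throw new IllegalStateException("db down"); }));
    }
}
```

The sketch leaves out everything that touches the network; it only shows how the first-phase result and the local transaction outcome together determine what the second phase sends.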
Next, we analyze the code of each phase in depth.
3. A deeper look inside
3.1 First-phase sending
Focus on the send method. Stepping into it, we find that the first phase of a RocketMQ transaction message uses the SYNC (synchronous) mode:
source code list-5
public SendResult send(Message msg,
long timeout) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
return this.sendDefaultImpl(msg, CommunicationMode.SYNC, null, timeout);
}
This is easy to understand. A transaction message decides whether to execute the local transaction based on the result of the first phase, so the producer must block and wait for the broker's ack.
Let's open DefaultMQProducerImpl.java and look at the implementation of sendDefaultImpl. Reading this method shows how the producer behaves during the first phase of transaction message sending. Note that the method is not specific to transaction messages, or even to SYNC mode, so reading it gives a fairly complete picture of RocketMQ's message sending mechanism.
The logic of this code flows so smoothly that I could not bear to cut it into pieces. To save space, the more complicated but less informative parts are replaced with comments, keeping the flow as complete as possible. The parts I consider important or easy to overlook are marked with comments, and some details are explained later.
source code list-6
private SendResult sendDefaultImpl(
Message msg,
final CommunicationMode communicationMode,
final SendCallback sendCallback,
final long timeout
) throws MQClientException, RemotingException, MQBrokerException, InterruptedException {
this.makeSureStateOK();
// 1. Message validity check. See below.
Validators.checkMessage(msg, this.defaultMQProducer);
final long invokeID = random.nextLong();
long beginTimestampFirst = System.currentTimeMillis();
long beginTimestampPrev = beginTimestampFirst;
long endTimestamp = beginTimestampFirst;
// Get the publish route info for the topic (mainly the brokers); if it is not found locally, fetch it from the name server
TopicPublishInfo topicPublishInfo = this.tryToFindTopicPublishInfo(msg.getTopic());
if (topicPublishInfo != null && topicPublishInfo.ok()) {
boolean callTimeout = false;
MessageQueue mq = null;
Exception exception = null;
SendResult sendResult = null;
// 2. Send retry mechanism. See below.
int timesTotal = communicationMode == CommunicationMode.SYNC ? 1 + this.defaultMQProducer.getRetryTimesWhenSendFailed() : 1;
int times = 0;
String[] brokersSent = new String[timesTotal];
for (; times < timesTotal; times++) {
// mq == null on the first attempt; later attempts carry the previous broker's info
String lastBrokerName = null == mq ? null : mq.getBrokerName();
// 3. How does RocketMQ select a queue when sending? The broker fault avoidance mechanism
MessageQueue mqSelected = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);
if (mqSelected != null) {
mq = mqSelected;
brokersSent[times] = mq.getBrokerName();
try {
beginTimestampPrev = System.currentTimeMillis();
if (times > 0) {
//Reset topic with namespace during resend.
msg.setTopic(this.defaultMQProducer.withNamespace(msg.getTopic()));
}
long costTime = beginTimestampPrev - beginTimestampFirst;
if (timeout < costTime) {
callTimeout = true;
break;
}
// Core send logic
sendResult = this.sendKernelImpl(msg, mq, communicationMode, sendCallback, topicPublishInfo, timeout - costTime);
endTimestamp = System.currentTimeMillis();
// Feeds RocketMQ's broker avoidance mechanism; only takes effect when sendLatencyFaultEnable == true
this.updateFaultItem(mq.getBrokerName(), endTimestamp - beginTimestampPrev, false);
switch (communicationMode) {
// 4. RocketMQ's three CommunicationModes. See below.
case ASYNC: // asynchronous mode
return null;
case ONEWAY: // one-way mode
return null;
case SYNC: // synchronous mode
if (sendResult.getSendStatus() != SendStatus.SEND_OK) {
if (this.defaultMQProducer.isRetryAnotherBrokerWhenNotStoreOK()) {
continue;
}
}
return sendResult;
default:
break;
}
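} catch (RemotingException e) {
// ...
// retried automatically
} catch (MQClientException e) {
// ...
// retried automatically
} catch (MQBrokerException e) {
// ...
// retried automatically only when the response code is NOT_IN_CURRENT_UNIT (205)
// otherwise not retried; the exception is thrown
} catch (InterruptedException e) {
// ...
// not retried; the exception is thrown
}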
} else {
break;
}
}
if (sendResult != null) {
return sendResult;
}
// Assemble the info message, which is finally thrown as an MQClientException
// ... ...
// On timeout, throw RemotingTooMuchRequestException
if (callTimeout) {
throw new RemotingTooMuchRequestException("sendDefaultImpl call timeout");
}
// Fill in the MQClientException details
// ...
}
validateNameServerSetting();
throw new MQClientException("No route info of this topic: " + msg.getTopic() + FAQUrl.suggestTodo(FAQUrl.NO_TOPIC_ROUTE_INFO),
null).setResponseCode(ClientErrorCode.NOT_FOUND_TOPIC_EXCEPTION);
}
3.1.1 Message validity check
source code list-7
Validators.checkMessage(msg, this.defaultMQProducer);
This method verifies the validity of the message, covering both the topic and the message body. The topic name must conform to the naming rules and must not be one of the built-in system topics. The message body length must be greater than 0 and no more than 1024 * 1024 * 4 bytes = 4 MB.
source code list-8
public static void checkMessage(Message msg, DefaultMQProducer defaultMQProducer)
throws MQClientException {
if (null == msg) {
throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL, "the message is null");
}
// topic
Validators.checkTopic(msg.getTopic());
Validators.isNotAllowedSendTopic(msg.getTopic());
// body
if (null == msg.getBody()) {
throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL, "the message body is null");
}
if (0 == msg.getBody().length) {
throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL, "the message body length is zero");
}
if (msg.getBody().length > defaultMQProducer.getMaxMessageSize()) {
throw new MQClientException(ResponseCode.MESSAGE_ILLEGAL,
"the message body size over max value, MAX: " + defaultMQProducer.getMaxMessageSize());
}
}
3.1.2 Send retry mechanism
When a message fails to send, the Producer retries automatically. In SYNC mode the maximum number of send attempts is retryTimesWhenSendFailed + 1, which is 3 by default.
Note that not every failure is retried. From the source code above, the producer retries automatically in the following three cases:
1) A RemotingException or an MQClientException is thrown.
2) An MQBrokerException is thrown and its ResponseCode is NOT_IN_CURRENT_UNIT = 205.
3) In SYNC mode, no exception is thrown but the send status is not SEND_OK, and retryAnotherBrokerWhenNotStoreOK is enabled.
Before each attempt, the producer first checks whether the time already spent on previous attempts exceeds the timeout (3000 ms by default). If it does, the producer stops sending and reports a timeout instead of retrying. Two points are worth spelling out:
1) The automatic retries inside the producer are invisible to the business application, so the send time the application observes includes the time spent on all retries;
2) Once the timeout is exceeded, this send has failed, and the reason is the timeout. The failure is eventually thrown as a RemotingTooMuchRequestException.
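The interplay between the retry count and the overall time budget can be illustrated with a minimal sketch. It is not RocketMQ code: `RetryBudgetSketch`, `sendWithRetry`, and `SendTimeoutException` are hypothetical names, and every exception is treated as retryable for simplicity.

```java
import java.util.concurrent.Callable;

final class RetryBudgetSketch {
    /** Hypothetical stand-in for a timeout failure. */
    static final class SendTimeoutException extends Exception {
        SendTimeoutException(String message) { super(message); }
    }

    /**
     * Tries the send up to 1 + retryTimes times, but never starts an attempt
     * once the overall timeout budget has been used up.
     */
    static <T> T sendWithRetry(Callable<T> attempt, int retryTimes, long timeoutMillis)
            throws Exception {
        long begin = System.currentTimeMillis();
        Exception last = null;
        for (int times = 0; times < 1 + retryTimes; times++) {
            long cost = System.currentTimeMillis() - begin;
            if (timeoutMillis < cost) {
                // The budget covers all attempts together, not each attempt separately.
                throw new SendTimeoutException("call timeout after " + times + " attempt(s)");
            }
            try {
                return attempt.call();
            } catch (Exception e) {
                last = e; // retried silently; the caller only sees the final outcome
            }
        }
        throw last != null ? last : new IllegalStateException("no attempt was made");
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, succeeds on the third attempt (retryTimes = 2, budget = 3000 ms).
        String result = sendWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated network error");
            }
            return "SEND_OK";
        }, 2, 3000L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Because the elapsed time is measured from the first attempt, a caller who passes a 3000 ms timeout waits at most roughly that long in total, however many retries happen inside.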
One thing to point out: the official RocketMQ documentation states that the send timeout is 10 s, that is, 10000 ms, and many people online also believe it is 10 s. However, the code clearly says 3000 ms, and after debugging I confirmed that the default timeout is indeed 3000 ms. I suggest the RocketMQ team check the documentation and correct it promptly if it is wrong.
figure 2
3.1.3 Broker fault avoidance mechanism
source code list-8
MessageQueue mqSelected = this.selectOneMessageQueue(topicPublishInfo, lastBrokerName);
This line selects the queue before sending.
It involves latencyFaultTolerance, a core mechanism behind the high availability of RocketMQ message sending. The mechanism is part of the Producer's load balancing and is controlled by sendLatencyFaultEnable. The default is false, which leaves the broker fault-delay mechanism off; setting it to true on the Producer turns the mechanism on.
With fault avoidance enabled, queue selection takes the working state of each broker into account and avoids brokers that are currently unhealthy; an unhealthy broker is avoided for a period of time. With fault avoidance disabled, the next queue is selected in order, but on a retry the producer tries to choose a queue on a different broker from the one used last time. After every send, the broker's state information is maintained through the updateFaultItem method.
source code list-9
public void updateFaultItem(final String brokerName, final long currentLatency, boolean isolation) {
if (this.sendLatencyFaultEnable) {
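// Compute how long to avoid the broker. isolation says whether the broker must be isolated. If so, a latency of 30 s is assumed:
// the latency table is searched backward for the first entry below 30 s, and its index gives the avoidance period (30 s maps to a 10 min avoidance);
// otherwise, the avoidance duration is determined by the latency of the last send;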
long duration = computeNotAvailableDuration(isolation ? 30000 : currentLatency);
this.latencyFaultTolerance.updateFaultItem(brokerName, currentLatency, duration);
}
}
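The lookup described in the comment above can be sketched as follows. The two tables are assumed values modeled on the latency table in RocketMQ 4.x's MQFaultStrategy and should be checked against the source before being relied on; `AvoidanceDurationSketch` is a hypothetical class name.

```java
final class AvoidanceDurationSketch {
    // Assumed values modeled on MQFaultStrategy in RocketMQ 4.x (milliseconds).
    static final long[] LATENCY_MAX = {50L, 100L, 550L, 1000L, 2000L, 3000L, 15000L};
    static final long[] NOT_AVAILABLE_DURATION = {0L, 0L, 30000L, 60000L, 120000L, 180000L, 600000L};

    /** Scans the latency table from the largest entry down and returns the matching avoidance period. */
    static long computeNotAvailableDuration(long currentLatency) {
        for (int i = LATENCY_MAX.length - 1; i >= 0; i--) {
            if (currentLatency >= LATENCY_MAX[i]) {
                return NOT_AVAILABLE_DURATION[i];
            }
        }
        return 0L; // fast enough: the broker is not avoided at all
    }

    public static void main(String[] args) {
        System.out.println(computeNotAvailableDuration(30000L)); // isolation case, prints 600000
        System.out.println(computeNotAvailableDuration(600L));   // prints 30000
        System.out.println(computeNotAvailableDuration(40L));    // prints 0
    }
}
```

With these tables, a broker that answered in under 550 ms is never avoided, while an isolated broker (treated as a 30 s latency) is avoided for 10 minutes.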
Now step into the selectOneMessageQueue method:
source code list-10
public MessageQueue selectOneMessageQueue(final TopicPublishInfo tpInfo, final String lastBrokerName) {
if (this.sendLatencyFaultEnable) {
// Fault avoidance enabled
try {
int index = tpInfo.getSendWhichQueue().getAndIncrement();
for (int i = 0; i < tpInfo.getMessageQueueList().size(); i++) {
int pos = Math.abs(index++) % tpInfo.getMessageQueueList().size();
if (pos < 0)
pos = 0;
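// Take the next message queue in order as the candidate send queue
MessageQueue mq = tpInfo.getMessageQueueList().get(pos);
// Use this queue if its broker is available and either this is the first send
// or the broker is the same as the one used by the previous queue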
if (latencyFaultTolerance.isAvailable(mq.getBrokerName())) {
if (null == lastBrokerName || mq.getBrokerName().equals(lastBrokerName))
return mq;
}
}
final String notBestBroker = latencyFaultTolerance.pickOneAtLeast();
int writeQueueNums = tpInfo.getQueueIdByBroker(notBestBroker);
if (writeQueueNums > 0) {
final MessageQueue mq = tpInfo.selectOneMessageQueue();
if (notBestBroker != null) {
mq.setBrokerName(notBestBroker);
mq.setQueueId(tpInfo.getSendWhichQueue().getAndIncrement() % writeQueueNums);
}
return mq;
} else {
latencyFaultTolerance.remove(notBestBroker);
}
} catch (Exception e) {
log.error("Error occurred when selecting message queue", e);
}
return tpInfo.selectOneMessageQueue();
}
// Fault avoidance disabled: select the queue by incrementing an index that starts at a random value
return tpInfo.selectOneMessageQueue(lastBrokerName);
}
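The path taken when fault avoidance is disabled, round-robin selection that tries to skip the broker used by the previous attempt, can be sketched in isolation. This is a simplified illustration, not the TopicPublishInfo implementation: `QueueSelectionSketch` and its nested `Queue` class are hypothetical, and the index starts at 0 instead of a random value to keep the behavior deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class QueueSelectionSketch {
    /** Hypothetical minimal queue descriptor: broker name plus queue id. */
    static final class Queue {
        final String brokerName;
        final int queueId;
        Queue(String brokerName, int queueId) {
            this.brokerName = brokerName;
            this.queueId = queueId;
        }
    }

    private final List<Queue> queues;
    private final AtomicInteger next = new AtomicInteger(0);

    QueueSelectionSketch(List<Queue> queues) {
        this.queues = queues;
    }

    /** Plain round-robin over all queues. */
    Queue selectOne() {
        int pos = Math.floorMod(next.getAndIncrement(), queues.size());
        return queues.get(pos);
    }

    /** Round-robin that skips queues on the broker used by the previous attempt. */
    Queue selectOne(String lastBrokerName) {
        if (lastBrokerName == null) {
            return selectOne(); // first attempt: nothing to avoid
        }
        for (int i = 0; i < queues.size(); i++) {
            Queue candidate = selectOne();
            if (!candidate.brokerName.equals(lastBrokerName)) {
                return candidate;
            }
        }
        // Every queue lives on the same broker: fall back to plain round-robin.
        return selectOne();
    }

    public static void main(String[] args) {
        QueueSelectionSketch selector = new QueueSelectionSketch(Arrays.asList(
                new Queue("broker-a", 0), new Queue("broker-a", 1),
                new Queue("broker-b", 0), new Queue("broker-b", 1)));
        Queue first = selector.selectOne(null);
        Queue retry = selector.selectOne(first.brokerName);
        System.out.println(first.brokerName + " -> " + retry.brokerName); // prints broker-a -> broker-b
    }
}
```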
3.1.4 Three CommunicationModes of RocketMQ
source code list-11
public enum CommunicationMode {
SYNC,
ASYNC,
ONEWAY,
}
These three modes apply to the stage in which the message travels from the sender to the broker; they do not cover the broker delivering the message to subscribers.
The differences between the three sending modes:
- One-way mode: ONEWAY. The sender just sends the message and does not care about the broker's processing result. Because so little processing is involved, the send time is very short and the throughput is high, but there is no guarantee that the message is reliable and not lost. It is often used for huge volumes of unimportant messages, such as heartbeats.
- Asynchronous mode: ASYNC. After handing the message to the broker, the sender does not wait for the broker to process it. An asynchronous thread handles the message, and once processing completes the sender is notified of the result through a callback. If an exception occurs during asynchronous processing, the send is retried internally before a failure is reported to the sender (up to 3 attempts in total by default, invisible to the sender). In this mode the sender waits little, throughput is high, and the message is reliable. It suits high-volume traffic with important messages.
- Synchronous mode: SYNC. The sender waits until the broker finishes processing and explicitly returns success or failure. Before the sender receives a failure result, the send also goes through internal retries (up to 3 attempts in total by default, invisible to the sender). In this mode the sender blocks waiting for the result, so the wait is longer, but the message is reliable. It suits low-volume traffic with important messages. Note that the first phase of a transaction message, sending the half message, uses synchronous mode.
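From the caller's point of view, the three modes differ mainly in how, and whether, the result comes back. The following sketch imitates that with an in-process stand-in for the broker; `CommunicationModeSketch`, `brokerStore`, and the three `send*` methods are hypothetical and unrelated to the real client API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

final class CommunicationModeSketch {
    /** Hypothetical in-process "broker" that stores a message and acknowledges it. */
    static String brokerStore(String msg) {
        return "SEND_OK:" + msg;
    }

    /** SYNC: the caller blocks until the broker's result is available. */
    static String sendSync(String msg) {
        return brokerStore(msg);
    }

    /**
     * ASYNC: the caller returns at once; the result arrives later through the callback.
     * The returned future exists only so that a demo or test can wait for completion.
     */
    static CompletableFuture<Void> sendAsync(String msg, Consumer<String> onResult) {
        return CompletableFuture.supplyAsync(() -> brokerStore(msg)).thenAccept(onResult);
    }

    /** ONEWAY: fire and forget; no result is ever reported back. */
    static void sendOneway(String msg) {
        CompletableFuture.runAsync(() -> brokerStore(msg));
    }

    public static void main(String[] args) {
        System.out.println(sendSync("half message")); // prints SEND_OK:half message
        sendAsync("order event", result -> System.out.println("callback: " + result)).join();
        sendOneway("heartbeat"); // nothing to observe on the caller side
    }
}
```

Only the synchronous call hands the result back on the calling thread, which is why the first phase of a transaction message relies on it.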
The implementation differences are also visible in the sendKernelImpl method. ONEWAY and SYNC share one branch there, which does no extra preparation. Compared with the synchronous call, the asynchronous call to sendMessage takes several additional arguments: the callback (sendCallback), topicPublishInfo carrying the topic's routing metadata, the client instance (mQClientFactory), and the retry count (retryTimesWhenSendAsyncFailed). In addition, in asynchronous mode a message whose body was compressed is cloned first.
source code list-12
switch (communicationMode) {
case ASYNC:
Message tmpMessage = msg;
boolean messageCloned = false;
if (msgBodyCompressed) {
//If msg body was compressed, msgbody should be reset using prevBody.
//Clone new message using commpressed message body and recover origin massage.
//Fix bug:https://github.com/apache/rocketmq-externals/issues/66
tmpMessage = MessageAccessor.cloneMessage(msg);
messageCloned = true;
msg.setBody(prevBody);
}
if (topicWithNamespace) {
if (!messageCloned) {
tmpMessage = MessageAccessor.cloneMessage(msg);
messageCloned = true;
}
msg.setTopic(NamespaceUtil.withoutNamespace(msg.getTopic(), this.defaultMQProducer.getNamespace()));
}
long costTimeAsync = System.currentTimeMillis() - beginStartTime;
if (timeout < costTimeAsync) {
throw new RemotingTooMuchRequestException("sendKernelImpl call timeout");
}
sendResult = this.mQClientFactory.getMQClientAPIImpl().sendMessage(
brokerAddr,
mq.getBrokerName(),
tmpMessage,
requestHeader,
timeout - costTimeAsync,
communicationMode,
sendCallback,
topicPublishInfo,
this.mQClientFactory,
this.defaultMQProducer.getRetryTimesWhenSendAsyncFailed(),
context,
this);
break;
case ONEWAY:
case SYNC:
long costTimeSync = System.currentTimeMillis() - beginStartTime;
if (timeout < costTimeSync) {
throw new RemotingTooMuchRequestException("sendKernelImpl call timeout");
}
sendResult = this.mQClientFactory.getMQClientAPIImpl().sendMessage(
brokerAddr,
mq.getBrokerName(),
msg,
requestHeader,
timeout - costTimeSync,
communicationMode,
context,
this);
break;
default:
assert false;
break;
}
The official documentation contains the following diagram, which describes the asynchronous communication flow in detail:
figure 3
3.2 Second-phase sending
Source code list-3 shows the execution of the local transaction. localTransactionState links the result of the local transaction to the second-phase sending of the transaction message.
Note that if the first phase returns FLUSH_DISK_TIMEOUT, FLUSH_SLAVE_TIMEOUT, or SLAVE_NOT_AVAILABLE (for example, when the slave broker is unavailable), localTransactionState is set to ROLLBACK_MESSAGE and the local transaction is not executed. After that, the endTransaction method handles the second-phase commit; see source code list-4. The implementation of endTransaction is as follows:
source code list-13
public void endTransaction(
final SendResult sendResult,
final LocalTransactionState localTransactionState,
final Throwable localException) throws RemotingException, MQBrokerException, InterruptedException, UnknownHostException {
final MessageId id;
if (sendResult.getOffsetMsgId() != null) {
id = MessageDecoder.decodeMessageId(sendResult.getOffsetMsgId());
} else {
id = MessageDecoder.decodeMessageId(sendResult.getMsgId());
}
String transactionId = sendResult.getTransactionId();
final String brokerAddr = this.mQClientFactory.findBrokerAddressInPublish(sendResult.getMessageQueue().getBrokerName());
EndTransactionRequestHeader requestHeader = new EndTransactionRequestHeader();
requestHeader.setTransactionId(transactionId);
requestHeader.setCommitLogOffset(id.getOffset());
switch (localTransactionState) {
case COMMIT_MESSAGE:
requestHeader.setCommitOrRollback(MessageSysFlag.TRANSACTION_COMMIT_TYPE);
break;
case ROLLBACK_MESSAGE:
requestHeader.setCommitOrRollback(MessageSysFlag.TRANSACTION_ROLLBACK_TYPE);
break;
case UNKNOW:
requestHeader.setCommitOrRollback(MessageSysFlag.TRANSACTION_NOT_TYPE);
break;
default:
break;
}
requestHeader.setProducerGroup(this.defaultMQProducer.getProducerGroup());
requestHeader.setTranStateTableOffset(sendResult.getQueueOffset());
requestHeader.setMsgId(sendResult.getMsgId());
String remark = localException != null ? ("executeLocalTransactionBranch exception: " + localException.toString()) : null;
// Send the second-phase message in oneway mode
this.mQClientFactory.getMQClientAPIImpl().endTransactionOneway(brokerAddr, requestHeader, remark,
this.defaultMQProducer.getSendMsgTimeout());
}
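The switch in endTransaction boils down to a mapping from the local transaction state to a flag value. In the sketch below, the constants are assumed to mirror MessageSysFlag in RocketMQ 4.x, where the transaction type occupies bits 2 and 3 of the system flag; `EndTransactionFlagSketch` and `commitOrRollbackFlag` are hypothetical names.

```java
final class EndTransactionFlagSketch {
    enum LocalTransactionState { COMMIT_MESSAGE, ROLLBACK_MESSAGE, UNKNOW }

    // Assumed to mirror MessageSysFlag in RocketMQ 4.x.
    static final int TRANSACTION_NOT_TYPE = 0;
    static final int TRANSACTION_COMMIT_TYPE = 0x2 << 2;   // 8
    static final int TRANSACTION_ROLLBACK_TYPE = 0x3 << 2; // 12

    /** Maps the local transaction state to the flag carried by the second-phase request. */
    static int commitOrRollbackFlag(LocalTransactionState state) {
        switch (state) {
            case COMMIT_MESSAGE:
                return TRANSACTION_COMMIT_TYPE;
            case ROLLBACK_MESSAGE:
                return TRANSACTION_ROLLBACK_TYPE;
            default:
                // UNKNOW: neither commit nor rollback; the broker will check back later.
                return TRANSACTION_NOT_TYPE;
        }
    }

    public static void main(String[] args) {
        for (LocalTransactionState state : LocalTransactionState.values()) {
            System.out.println(state + " -> " + commitOrRollbackFlag(state));
        }
    }
}
```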
As for why the second phase is sent in oneway mode, my understanding is that transaction messages have a dedicated reliability mechanism to fall back on: the check-back.
3.3 Transaction check-back
When a certain time has passed and the Broker still has not received a definite commit or rollback for the second phase of a transaction message, it cannot know what happened to the Producer (the producer may be down, or it may have sent a commit that was lost to network jitter, and so on), so the Broker takes the initiative and starts a check-back.
The check-back mechanism for transaction messages lives mostly on the broker side. RocketMQ's broker isolates transaction messages in different stages using three different topics: half messages, Op messages, and real messages, so that the Consumer only sees messages whose commit has finally been confirmed and which need to be delivered. The detailed implementation is not covered here; a separate post may interpret it from the Broker's perspective.
Back on the Producer side: when a check-back request arrives from the Broker, the Producer checks the local transaction state for the message and decides to commit or roll back based on the result. This requires the Producer to provide a check-back implementation for such cases.
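One common way to make check-backs answerable is to record the outcome of each local transaction under its transaction id. The sketch below illustrates the idea with an in-memory map. The two method names echo RocketMQ's TransactionListener (executeLocalTransaction and checkLocalTransaction), but the signatures are simplified and `CheckBackSketch` itself is hypothetical; a real application would keep the outcome in durable storage, typically in the same database transaction as the business change.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CheckBackSketch {
    enum LocalTransactionState { COMMIT_MESSAGE, ROLLBACK_MESSAGE, UNKNOW }

    /** Hypothetical local record of transaction outcomes, keyed by transaction id. */
    private final Map<String, LocalTransactionState> outcomes = new ConcurrentHashMap<>();

    /** Runs the local transaction and remembers its outcome for later check-backs. */
    LocalTransactionState executeLocalTransaction(String transactionId, Runnable businessLogic) {
        LocalTransactionState state;
        try {
            businessLogic.run();
            state = LocalTransactionState.COMMIT_MESSAGE;
        } catch (RuntimeException e) {
            state = LocalTransactionState.ROLLBACK_MESSAGE;
        }
        outcomes.put(transactionId, state);
        return state;
    }

    /** Answers a broker check-back from the recorded outcome; UNKNOW if nothing is recorded yet. */
    LocalTransactionState checkLocalTransaction(String transactionId) {
        return outcomes.getOrDefault(transactionId, LocalTransactionState.UNKNOW);
    }

    public static void main(String[] args) {
        CheckBackSketch listener = new CheckBackSketch();
        listener.executeLocalTransaction("tx-1", () -> { /* business update succeeds */ });
        System.out.println(listener.checkLocalTransaction("tx-1")); // prints COMMIT_MESSAGE
        System.out.println(listener.checkLocalTransaction("tx-2")); // prints UNKNOW
    }
}
```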
Of course, under normal circumstances it is not recommended to return the UNKNOW state proactively, since it imposes extra check-back overhead on the broker. Triggering the check-back mechanism only when an unpredictable failure occurs is the more reasonable choice.
In addition, check-backs in version 4.7.1 are not unlimited; a message is checked at most 15 times:
source code list-14
/**
* The maximum number of times the message was checked, if exceed this value, this message will be discarded.
*/
@ImportantField
private int transactionCheckMax = 15;
Appendix
The official default parameters of the Producer are listed below (the timeout parameter was discussed earlier; debugging shows the default is 3000 ms, not 10000 ms):
figure 4
RocketMQ is an excellent open source messaging middleware, and many developers have built on it. For example, Ant Group's commercial product SOFAStack MQ is a financial-grade messaging middleware re-developed on the RocketMQ kernel, with a lot of excellent work in message management and control and in transparent operation and maintenance.
I hope RocketMQ continues to grow with even greater vitality through the joint efforts of the many developers in the community.
We are the SRE team of Alibaba Cloud Intelligence Global Technical Services. We aim to be a technology-based, service-oriented engineering team focused on the high availability of business systems, providing professional and systematic SRE services that help customers make better use of the cloud and build more stable and reliable business systems on it. We hope to share more techniques that help enterprise customers move to the cloud, use it well, and run their business on it more stably and reliably.