Background

I haven't shared any Java troubleshooting for a long time. Recently, I helped a colleague investigate a problem:

When consuming from Pulsar, the same message was consumed repeatedly.

Investigation

When he told me about this, I was skeptical. From previous experience, and from what Pulsar's official documentation and API say, redelivery works like this:

A message is redelivered only when the consumer's ackTimeout is set and consumption exceeds that timeout. ackTimeout is off by default, and a look at the code confirmed that it had not been turned on.

Could negativeAcknowledge() have been called instead (calling it also triggers redelivery)? We use a third-party library, https://github.com/majusko/pulsar-java-spring-boot-starter, and it calls this method only when an exception is thrown.

After reviewing the code, I found nowhere that would throw an exception, and I did not see one occur at any point during the whole process. This was a bit weird.
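
To make the ack/nack reasoning concrete, here is a minimal sketch in plain Java of how a listener wrapper of this kind might behave. The `Handler` interface and the `dispatch` method are hypothetical stand-ins, not the starter's actual code and not Pulsar's API. The only behaviour assumed is the one described above: acknowledge when the handler returns normally, negatively acknowledge when it throws.

```java
// Hypothetical stand-in for a framework's listener wrapper; not the real starter code.
class AckNackSketch {

    /** Hypothetical business callback, playing the role of the annotated consume method. */
    interface Handler {
        void handle(String payload) throws Exception;
    }

    /** Returns "ack" if the handler returns normally, "nack" if it throws. */
    static String dispatch(String payload, Handler handler) {
        try {
            handler.handle(payload);
            return "ack";
        } catch (Exception e) {
            // The exception is swallowed here, so nothing shows up in the business logs.
            return "nack";
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch("order-1", p -> { /* handled normally */ }));
        System.out.println(dispatch("order-2", p -> {
            throw new IllegalStateException("something went wrong");
        }));
    }
}
```

In this sketch the second message would be negatively acknowledged even though the caller never sees an exception, which is the kind of silent redelivery being considered here.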

Reproduction

To understand the whole story, I asked him in detail how he had run into it.

It started with a bug in the business logic. He was single-stepping through the message consumption in the debugger, and shortly after one debugging session he received the same message again.

Strangely, though, debugging did not lead to repeated consumption every time. As the saying goes, a bug that can be reproduced 100% of the time is already more than half solved.

So the first step in our investigation was to reproduce the problem reliably.

To rule out IDEA itself as the cause (unlikely as that was), and since pausing in the debugger is effectively a sleep from the code's point of view, we decided to sleep directly in the consumption logic and see whether the problem could be reproduced.

In testing, sleeping for anywhere from a few seconds to tens of seconds did not reproduce it. When we finally slept for one minute, something magical happened: it reproduced every time!

Now that it could be reproduced reliably, the rest should have been easy. My own business code also uses Pulsar, so I planned to reproduce it in my own project to make debugging more convenient.

And then the weird thing happened again: I could not reproduce it on my side.

Not seeing duplicates is the expected behaviour, of course, but it left me with nothing to debug.

Trusting in modern science, I reasoned that the only difference between the two of us was the project, so I compared the code on both sides.

    @PulsarConsumer(
            topic = xx,
            clazz = Xx.class,
            subscriptionType = SubscriptionType.Shared
    )
    public void consume(Data msg) {
        log.info("consume msg:{}", msg.getOrderId());
        Lock lock = redisLockRegistry.obtain(msg.getOrderId());
        if (lock.tryLock()) {
            try {
                orderService.do(msg.getOrderId());
            } catch (Exception e) {
                log.error("consumer msg:{} err:", msg.toString(), e);
            } finally {
                lock.unlock();
            }
        }

    }

Sure enough, my colleague's code takes a lock, a distributed lock based on Redis. That was my slap-the-thigh moment: could the lock be expiring before it is released, so that the unlock call throws an exception?

To verify this, with the reproduction in place, I set a breakpoint in the framework's Pulsar consumption code.

Sure enough, the case was solved. The exception message was very clear: the lock's timeout had already passed.

Once the exception reached the framework, the message was negatively acknowledged straight away and the exception itself was swallowed, which is why we had not noticed it before.

After checking the source code of RedisLockRegistry, we found that the default lock expiry is exactly one minute, which is why sleeping for tens of seconds had not reproduced the problem.
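
The failure path can be simulated in plain Java without Redis or Pulsar. The sketch below is an illustration under stated assumptions: `FakeExpiringLock` is a hypothetical stand-in for a lock that expires in the store (it is not RedisLockRegistry), a fake clock replaces the real sleep, and the outer `catch` plays the role of the framework that negatively acknowledges when an exception escapes the handler.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simulation only: all names here are hypothetical, and no Redis or Pulsar is involved.
class LockExpirySketch {

    /** A lock that expires expireAfterMs after it was acquired, measured on a fake clock. */
    static final class FakeExpiringLock {
        private final long expireAfterMs;
        private final AtomicLong clockMs; // fake clock, advanced manually
        private long lockedAtMs;

        FakeExpiringLock(long expireAfterMs, AtomicLong clockMs) {
            this.expireAfterMs = expireAfterMs;
            this.clockMs = clockMs;
        }

        boolean tryLock() {
            lockedAtMs = clockMs.get();
            return true;
        }

        void unlock() {
            // Releasing a lock that has already expired is reported as an error.
            if (clockMs.get() - lockedAtMs > expireAfterMs) {
                throw new IllegalStateException("lock expired before unlock");
            }
        }
    }

    /** Same shape as the consume method above: try / catch / finally around the business call. */
    static String consume(long workMs, long expireAfterMs) {
        AtomicLong clock = new AtomicLong();
        FakeExpiringLock lock = new FakeExpiringLock(expireAfterMs, clock);
        try {
            if (lock.tryLock()) {
                try {
                    clock.addAndGet(workMs); // stands in for the sleep or debugger pause
                } catch (Exception e) {
                    System.err.println("business error: " + e); // never reached in this sketch
                } finally {
                    lock.unlock(); // throws from finally, outside the reach of the catch above
                }
            }
            return "ack";
        } catch (RuntimeException e) {
            // Plays the role of the framework: an escaped exception leads to a negative ack.
            return "nack (" + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(consume(30_000, 60_000)); // work shorter than the expiry
        System.out.println(consume(61_000, 60_000)); // work longer than the expiry
    }
}
```

With a 60-second expiry, the 30-second case returns normally, while the 61-second case throws from the `finally` block. The inner `catch (Exception e)` never sees that exception, which matches why nothing appeared in the business logs.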

Summary

Afterwards, I asked my colleague why the lock had been added there, because as far as I could see no lock was needed at all. It turned out that he had copied the code from someone else and had not given it much thought.

So there are some lessons to be learned from this:

  • Although Ctrl+C/Ctrl+V is convenient, you still have to consider your own business scenario fully.
  • When using third-party APIs, you need to fully understand their functions and parameters.

Your likes and shares are the greatest support for me


crossoverJie