37
头图

foreword

After so many years of development, I have encountered countless bugs. However, no matter how complicated the bug is, as long as you study the code carefully and debug it, you can always find the reason.

However, this bug recently encountered in the company. This bug is very simple at first glance, but it is very evil. For a period of time, I even couldn't understand it. Over a period of several days, the probability of recurrence is very low. It's almost difficult to catch any trail.

So this article is very realistic to record the whole process of finding and troubleshooting this bug.

cause

Colleagues made a request before and submitted the test. Test colleagues halfway through the test. Found an error in the background

 com.alibaba.fastjson.JSONException: can not cast to String, value : {"code":"00","msg":"成功","data":{这里是正确的数据}}
    at com.alibaba.fastjson.util.TypeUtils.castToInt(TypeUtils.java:564) ~[fastjson-1.2.29.jar:na]
    at com.alibaba.fastjson.serializer.IntegerCodec.deserialze(IntegerCodec.java:89) ~[fastjson-1.2.29.jar:na]

At a glance, you can see that this error is very simple, it is a fastjson conversion type error. Caused by the mismatch between the accepted json and the type to be converted. Basically check the code, it can be solved at a glance.

As a result, my colleagues looked at it for a while and found no problem.

Then run the debug remotely in the test environment, hit a breakpoint to prepare for debugging, and the test colleague operates it, but it is fine.

Then release the breakpoint and run normally. After a while, the same error occurred again.

Continue to hit the breakpoint, or normal.

This phenomenon made my colleagues a little confused. I saw this in the group and thought it was very interesting, but the result actually depends on whether or not to observe, what the hell is this? .

I took over and looked at the code, and at first glance, it was completely fine.

The code I've stripped down and some pseudo-code processing looks like this:

 public <T> T executeLua(String luaName, Class<T> c, Object... args){
    String json = 执行器.执行(luaName,args);
  log.info("执行结果为:{}",json);
  T resut = JSON.parseObject(json, c);
  return result;
}

When calling, call it like this:

 LuaResult result = executeLua("xxxx", LuaResult.class, args);

The generic T of the returned object is a LuaResult object, and its structure is also very simple:

 public class LuaResult<T> {

    protected String code;

    protected String msg;

    protected T data;
  
  ...
}

According to the printed json result, the returned data is the correct result and completely matches the structure of the LuaResult object. Although no generic type is added when calling, the data in LuaResult is not used in the previous code, so the type of JsonObject is still retained after parsing. From an operational point of view, it's perfectly fine. I also used the local main method to run with the same data, and it can be parsed normally. Even if I tried loop parsing several times locally and multi-threaded parsing, there is no problem.

So why do I get an occasional error when I arrive at the server environment?

I also decided to debug remotely, I don't believe there will be a Schrodinger bug.

I hit a breakpoint remotely, my test colleague started doing business, and everything worked perfectly.

So I removed the breakpoint, and after a while, my test colleague sent me a screenshot of the error. Everything is really the same as the previous colleagues. . .

This time I am also messy, such a simple error report, I can't see the problem, I still can't debug it? ? And the json string returned by the log is completely normal business return data.

Too evil.

Investigation (1)

First of all, it is certain that the returned data is completely correct, which can be seen when the log is printed. There is no low-level error, json does not match the return type.

The problem is obvious, it is the parsing problem of fastjson, and the version of fastjson used by the project is relatively old, version 1.2.29.

But what is the problem? It was a simple mistake, but it is very strange to report it occasionally when the data and structure are completely correct.

And I simply don't believe in the fact that when it's back to normal during debug, it's going to be a problem during normal operation.

And in the next time, the bug seems to disappear, whether it is debug startup or normal startup, it no longer appears.

Seeing this, some people will definitely say that this is not easy. Change the json parsing framework, or upgrade fastjson.

For changing the framework, I have privately chatted with my colleagues to change to jackson. First of all, there are many places in the system to be changed, and secondly, even if you change it, the problem will not recur. But for the problem, it is equivalent to bypassing, and not really solving it head-on. And I myself am very interested in the source of this bug. So I don't plan to change it, it's just positive.

For upgrading fastjson, there is actually speculation that it is caused by some bugs in fastjson. But you have to come up with evidence, otherwise even if there is no recurrence after the upgrade, it cannot be proved that it was fixed because of the upgrade of fastjson, because this bug is extremely occasional. In other words, it is still necessary to find the root cause of the problem. You can't just rely on long-term observations to prove whether it's repaired.

Finally, by chance, I finally caught a breakpoint in debug.

This also shows that there is no Schrödinger bug, at least I can catch it when I observe it.

The breakpoint is at the following location (some business data has been changed)

 //json为:{"code":"00","msg":"成功","data":{"xxx":21,"yyy":5}}
//c为LuaResult.class
T resut = JSON.parseObject(json, c);

Use IDEA's Evaluate tool to view the execution results

Then continue to execute repeatedly, probably after pressing Enter dozens of times, an error finally appeared:

This is very strange. Is there, the same data, the same code. Do it dozens of times and get an error.

But the problem must be on the fastjson side. The next step is to get to the root of the problem. In order to seize this reason, I really reluctant to upgrade immediately.

Investigation (2)

I mentioned before that I didn't add generics to the call, so I added a generic to this call with the idea of giving it a try, and replaced the parameter with TypeReference

 LuaResult<Map<String,Object>> result = executeLua("xxxx", new TypeReference<LuaResult<Map<String,Object>>>(){}, args);

Go to the breakpoint again, and use the Evaluate tool to diagnose. I have clicked it hundreds of times, but the error has not occurred.

Then it should be determined that it should be caused by generics. By searching for related keywords, I found an article about fastjson about generic parsing. The general meaning is: fastjson will use the last generic type by default for those same objects that do not define a clear generic type. That is, fastjson caches the previous generic definition of the same object.

In this way, I can roughly understand the accidental cause of this monster bug.

To this end, I specially wrote a piece of code locally to simulate the occurrence of this bug.

You can also copy it locally to run it. The fastjson version should be 1.2.29.

 public static void main(String... args) throws Exception {
    try {
        String json = "{\"code\":\"00\",\"msg\":\"成功\",\"data\":{\"xxx\":21,\"yyy\":5}}";
        LuaResult result = JSON.parseObject(json, LuaResult.class);
        System.out.println(result);
    }catch (Exception e){
        log.error("出错了", e);
    }

    try {
        String json1 = "{\"msg\":\"成功\",\"data\":\"31\",\"code\":\"00\"}";
        LuaResult<Integer> result = JSON.parseObject(json1, new TypeReference<LuaResult<Integer>>(){});
        System.out.println(result);
    }catch (Exception e){
        log.error("出错了", e);
    }

         //上面2个没问题,到执行第三段的时候,就会必现
    try {
        String json = "{\"code\":\"00\",\"msg\":\"成功\",\"data\":{\"xxx\":21,\"yyy\":5}}";
        LuaResult result = JSON.parseObject(json, LuaResult.class);
        System.out.println(result);
    }catch (Exception e){
        log.error("出错了", e);
    }
}

An error occurs after execution:

The interesting thing is that the error occurs only when the three pieces of code are in this order, and the order of the three pieces of code is changed, and it is normal again.

This is a must-have bug. True to form!

Since the fastjson version used by the project is version 1.2.29, I upgraded version by version, wondering which version will fix this bug. When I tried version 1.2.33, this bug finally disappeared. Go to github to find the release information of version 1.2.33, and found that the author has fixed the issue of parsing errors without parameter generics.

in conclusion

So the conclusion is: this Schrödinger bug is caused by the lax bug of generic parsing in fastjson in earlier versions.

Now that we have reached this level, we will continue to study further and find that fastjson is under version 1.2.33, and there is a cache phenomenon for the generic type of the same object. It is no problem if the LuaResult with generic definition has been executed all the time before. If the first few are with generics and the latter are executed without generics, there will be a problem. It will be parsed according to the last cached generic type. Since the previous generic type is Integer, no matter what the json data is, it will be converted according to the Integer type, which leads to the error of can not cast to int.

As for why in the beginning, there was an illusion similar to Schrödinger, and now there is a conclusion, this is because of the order in which the students were tested. When it happens to be debugged, it is not executed in this order, and when it is running normally, it is executed in this order. I guess it might just be a coincidence.

If you know the root cause, you just need to upgrade to the latest version to fix it.

Some people will think that I have gone around in a circle, just upgrade and it's over. Why bother to find out why.

On the one hand, I think it is necessary to be technically rigorous, to know why, but also to know why. I am very interested (obsessed) in how this came about. So even if it can be solved by upgrading a version number, it is necessary to figure out the ins and outs.

If you like articles on the similar bug finding process, please let me know in the comments, and I will arrange more in the future.

My name is Bo Saidong, an interesting and in-depth developer. I pay attention to [Yuanren Tribe], and I will share technology and open source related content. Will also share some views on life.


铂赛东
1.2k 声望10.4k 粉丝

开源作者&内容创作者,专注于架构,开源,微服务,分布式等领域的技术研究和原创分享