How to troubleshoot problems without source code? Arthas brute force search memory objects to solve the problem of occasional time-consuming online

Like it first, then watch it, develop a good habit

background

A certain rule engine system of the company is mainly used for the functions of cost calculation and business logic verification. However, due to some historical reasons indescribable , causing the system no documentation and no source , even the configuration is by word of mouth.

Although this system is relatively old, it is a commercial product after all, and its functions are relatively complete and easy to use. After the system was connected to the business system for calculating fees and verifying the rules, it went online soon.

However, after going online, the service time-consuming problem will occasionally appear. Under normal circumstances, the service response time is 20ms or less. But when there is a problem, the service time will to tens of seconds , and the time point for each occurrence of a long time is not fixed. Moreover, during the long time-consuming period, all requests to the service will be time-consuming, not just individual requests will be affected.

Theoretically, there should be such problems as testing environments; however, because the testing environment restarts frequently, the occasional long time-consuming may be considered to be restarting, and no one has paid attention.

Although there is no information, the problem still needs to be solved. Who can withstand the time-consuming tens of seconds from time to time. This rule engine is still the core system of the business, so you can earn less insurance premiums for one minute overtime...

With the mentality of giving it a try, study it first.

After investigation

This kind of occasional and time-consuming problem is troublesome to troubleshoot. It cannot be reproduced stably, and there is no regularity. It is not easy to do other environment simulations. In addition, this system has no documentation and no source code, and no one understands its structure and process. It is more laborious to check.

Watch monitoring

After reviewing the monitoring history, I found that every time there is a long time-consuming, the utilization of CPU & disk IO will increase, but it is not too high. When the CPU is high, it is only 50 or 60, and the disk is even lower, with only a slight fluctuation. This fluctuation will bring tens of seconds of time-consuming, which is a bit unreasonable.

See log

The CPU block is only slightly higher, but it doesn't explain what the problem is. The root cause must be found. So I checked the log of that system again to see if I could find any clues.

I stuck time, looked at the log, tossed for half an hour, and finally saw a key clue on the log:

2021-05-08 10:12:35.897 INFO  [com.马赛克.rules.res.execution] (default-threads - 65) 规则集 /ansNcreckonRuleApp/1.0/ansncreckonrule/61.0 已在 58 秒后解析完毕。

58 seconds...The time point of this log line can match the actual time-consuming time point of our occurrence. In this 58-second range, there are indeed a large number of request time-consuming, and they are all less than 58 seconds or slightly larger 58 seconds.

In addition to this time-consuming printing, the GC log is also a bit suspicious. Each time this time-consuming log appears, there will be a GC activity log , but the pause time is not long:

2021-06-07T17:36:44.755+0800: [GC2021-06-07T17:36:44.756+0800: [DefNew: 879616K->17034K(3531079K), 0.0472190 secs] 3549012K->2686430K(10467840K), 0.0473160 secs] [Times: user=0.05 sys=0.00, real=0.2 secs]

even more strange...not every time after GC, there will be a time-consuming log

GC issues

After carefully checking the gc log, I found that shortly before each slow response, there will be a GC, but it is not a FULL GC, and the pause time is not long, far less than the time-consuming request. Moreover, the CPU utilization rate in the GC process is not high, which is quite normal from the data point of view.

Now there are (a little bit) clue:

During the long time-consuming period, the CPU utilization has increased
There is log printing during the long time-consuming period, and the time length can be matched when loading something
Before each long time consuming, there is GC activity
Guess the possible cause
After all, there is no source code, and no one understands it. I don’t have a chance to tear the code, I can only rely on guessing...

From the above clues, the CPU utilization increased after the GC activity, and then a line of loading log was printed. The time can also correspond to the long time-consuming request of the system.

And every time there is a GC activity shortly before the long time-consuming log, then the timing of this resource loading is related to GC, and GC will affect resource loading...

Thinking of this, I probably guessed the reason. Weak references are probably used to maintain this resource cache. After GC, the resources of weak references are reclaimed, so they need to be reloaded (for detailed explanation of weak references and test results, please refer to "Strong references in Java/ Soft references/weak references/phantom references and GC strategy " ).

Since this may be the reason, if I find a place where this weak reference maintains the resource cache, changing it to a strong reference will solve the problem! Although there is no source code, it can still be decompiled. After decompilation, change the code in the cache. There is no big problem.

Locate the resource loading point

Finally found the possible reason, but there is a problem... How do I know where this time-consuming log line is printed? In which category? I don't even know what web container this service uses.

No way, go to Arthas , but it certainly can't be tested directly in the production environment. So I made a new set of temporary test environment to troubleshoot this problem.

Fortunately, there is a loggerName prefix log. Through the trace Arthas , all classes under the package name trace can be used in the form of wildcards:

# trace 该包名下的所有类，所有方法，只显示耗时大于 1000ms 的方法
# 由于这个输出结果可能会比较多，所以 > 输出到文件，并且后台运行

trace com.马赛克.rules.res.* * '#cost > 1000' > /app/slow_trace.log &

trace command took more than ten seconds to return. It affected 169 classes and 1617 methods. It can be seen that wildcard matching is dangerous... If I play like this, I might be pulled out to sacrifice to the heavens.

Combined with the situation described above, there will be this time-consuming problem after GC. Now let's manually trigger a GC.

Use Arthas's vmtool to directly forceGC (using jmap or other means can also be used):

vmtool --action forceGc

After forceGC, retest the system interface

As in the above situation, it turned out that there was a long time-consuming. After the request was returned, the same time-consuming log was printed, but it took longer than the production environment (because I used arthas trace enhancement)

2020-04-08 12:30:35.897 INFO  [com.马赛克.rules.res.execution] (default-threads - 65) 规则集 /ansNcreckonRuleApp/1.0/ansncreckonrule/61.0 已在 70 秒后解析完毕。

At the same time Arthas trace the log file is written in, but also with content (link is too long, can not paste the code, only the screenshot): This trace log has more than a thousand lines, and part of the content has been deleted here.

Arthas' trace command will not perform full link tracking, so the trace result is only the current path. But it's already good, and the scope of the problem has narrowed again.

From the picture above, we can see IlrRulesetProvider:getRuleset is the main time-consuming point, so let’s take a look at the gameplay in this method first.

For the sake of simplicity, here temporarily use Arthas's jad command to directly decompile this class and look at the logic inside:

jad ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider

The code is a bit too much, here are some cuts, and only the key parts are kept:

public final XURulesetImpl getRuleset(IlrXURulesetArchiveInformation archive, IlrXUContext xuCtx, boolean waitParsing, Listener listener) throws IlrRulesetArchiveInformationNotFoundException, IlrRulesetCreationException, IlrRulesetAlreadyParsingException, IlrRulesetCacheException, XUException {
        String canonicalPath = archive.getCanonicalPath().toString();
        this.logger.finest("IlrRulesetProvider.getRuleset " + canonicalPath + " " + waitParsing);
        if (!archive.getProperties().isShareable()) {
            return this.factory.createRuleset(archive, xuCtx);
        } else {
            ClassLoader cl = archive.getXOMClassLoader();

            XURulesetImpl ruleset;
            while(true) {
                synchronized(this.parsingRulesets) {
                    
                    // 字面意思是，从缓存中获取规则集，有的话直接 return 了
                    ruleset = (XURulesetImpl)this.getCache().getRuleset(canonicalPath, cl);
                    if (ruleset != null) {
                        return ruleset;
                    }
                    
                    // 第一个加载的线程，将当前资源添加到 parsingRulesets 同时跳出 while
                    if (!this.parsingRulesets.contains(archive)) {
                        this.parsingRulesets.add(archive);
                        break;
                    }

                    if (!waitParsing) {
                        throw new IlrRulesetAlreadyParsingException("XU.ERROR.10406", (String[])null);
                    }
                    
                    // 这里的 wait……应该是防止并发访问，当其他线程也进入该代码块时 wait 等待第一个线程加载完成唤醒
                    try {
                        this.parsingRulesets.wait();
                    } catch (InterruptedException var20) {
                        throw new IlrRulesetCreationException("XU.ERROR.10009", (String[])null, var20);
                    }
                }
            }

            if (!this.useWorkManager(archive)) {
                this.logger.finest("IlrRulesetProvider.getRuleset doesn't use the workmanager " + this.workManager, (Object[])null, xuCtx);

                XURulesetImpl var9;
                try {
                    // 创建资源
                    ruleset = this.factory.createRuleset(archive, xuCtx);
                    
                    // 创建完成，添加到缓存
                    this.getCache().addRuleset(ruleset);
                    var9 = ruleset;
                } finally {
                    this.parsingStopped(archive);
                }

                return var9;
            } 
        }
    }

Although a lot of deletions have been made, the code is still a bit unclear. After all, the decompiled code reading interference is a bit big. Here is a brief explanation of the above code logic:

Get resources from the cache container first
If you can't get it, execute the creation logic
The creation is complete, add to the cache again
If other threads are already loading during loading, they will actively wait for the first created thread to wake up

Combined with our above guess:

It is possible that the system maintains the resource cache with a weak reference (Weak Reference). When the resource is recycled after GC, it needs to be reloaded.

The problem lies with this cache container! Just look at this cache to know that there must be weakly referenced code!

`Decompile and find code`

060c6b82f6619a Decompilation is still too much trouble to see the code in the terminal. It is better to be intuitive in the IDE, so you still have to pull the code down and analyze it, otherwise it will be jad

There are many Jar packages under this system. You have to find out which Jar these related classes are in first, using Arthas's sc command, which is also very simple:

sc view JVM loaded class information

sc -d ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider

 class-info        ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider
 
 # 这里是我们关心的信息，该 class 所在的 jar 
 code-source       /content/jrules-res-xu-JBOSS61EAP.rar/ra.jar
 name              ilog.rules.res.xu.ruleset.internal.IlrRulesetProvider
 isInterface       false
 isEnum            false
 isAnonymousClass  false
 isArray           false
 isLocalClass      false
 isMemberClass     false
 isPrimitive       false
 isSynthetic       false
 simple-name       IlrRulesetProvider
 modifier          final,public
 annotation
 interfaces        com.ibm.rules.res.xu.ruleset.internal.RulesetParsingWork$Listener
 super-class       +-java.lang.Object
 class-loader      +-ModuleClassLoader for Module "deployment.jrules-res-xu-JBOSS61EAP.rar:main" from Service Module Loader
                     +-sun.misc.Launcher$AppClassLoader@2d74e4b3
                       +-sun.misc.Launcher$ExtClassLoader@5552bb15
 classLoaderHash   2d022d73                             

Affect(row-cnt:1) cost in 79 ms.

After finding this ra.jar, drag this jar into the IDE to decompile, but there are some related classes for this class that are not in this ra.jar. Still the same way, find the associated class, then sc -d find the jar location, copy it to the local ide decompile

After repeated tossing a few times, I finally got all the 4 related jar packages back to the local, and now I can happily watch the code in the IDE.

`Analyze the mechanism of the cache container`

The first is the createCache method above. After analysis, it is found that the implementation class of cache is IlrRulesetCacheImpl . There are only two methods that this class needs to care about, getRuleset and addRuleset:

public void addRuleset(IlrXURuleset executableRuleset) {
    synchronized(this.syncObject) {
        //...
        this.entries.add(new IlrRulesetCacheEntry(executableRuleset, this.maxIdleTimeOutSupport));
        //...

    }
}

public IlrXURuleset getRuleset(String canonicalRulesetPath, ClassLoader xomClassLoader) {
    // ...
    
    List<IlrRulesetCacheEntry> cache = this.entries;
    synchronized(this.syncObject) {
        Iterator iterator = cache.iterator();

        while(iterator.hasNext()) {
            IlrRulesetCacheEntry entry = (IlrRulesetCacheEntry)iterator.next();
            IlrXURuleset ruleset = (IlrXURuleset)entry.rulesetReference.get();
            if (ruleset == null) {
                iterator.remove();
            } else if (entry.canonicalRulesetPath.equals(canonicalRulesetPath) && (entry.xomClassLoader == xomClassLoader || entry.xomClassLoader != null && entry.xomClassLoader.getParent() == xomClassLoader)) {
                return ruleset;
            }
        }
    }
    // ...
    return null;
}

After reading these two methods, it is obvious that entries is the key data storage set. Let’s see how it is played:

protected transient List<IlrRulesetCacheEntry> entries = new ArrayList();

It turned out to be just an ArrayList, continue to look at the class IlrRulesetCacheEntry

public class IlrRulesetCacheEntry {
    protected String canonicalRulesetPath = null;
    protected ClassLoader xomClassLoader = null;
    protected IlrReference<IlrXURuleset> rulesetReference = null;
    
    public IlrRulesetCacheEntry(IlrXURuleset executableRuleset, boolean maxIdleTimeOutSupport) {
        this.canonicalRulesetPath = executableRuleset.getCanonicalRulesetPath();
        this.xomClassLoader = executableRuleset.getXOMClassLoader();
        long maxIdleTime = executableRuleset.getRulesetArchiveProperties().getMaxIdleTime();
        
        // 注意这里是关键，根据 maxIdleTime 的值选择强引用和弱引用
        if (maxIdleTime != 0L && (!maxIdleTimeOutSupport || maxIdleTime == -1L)) {
            this.rulesetReference = new IlrWeakReference(executableRuleset);
        } else {
            this.rulesetReference = new IlrStrongReference(executableRuleset);
        }

    }
}

The code is very straightforward. Different reference strategies are used according to the difference of maxIdleTime. If it is not equal to 0, it is weak reference, and if it is equal to 0, it is strong reference; but you still have to look at the codes of these two Reference classes:

// 弱引用，继承 WeakReference
public class IlrWeakReference<T> extends WeakReference<T> implements IlrReference<T> {
    public IlrWeakReference(T t) {
        super(t);
    }
}

// 强引用
public class IlrStrongReference<T> implements IlrReference<T> {
    private T target;

    IlrStrongReference(T target) {
        this.target = target;
    }

    public T get() {
        return this.target;
    }
}

There is nothing special about these two classes, and they have the same meaning as the class name; IlrWeakReference inherits from WeakReference, which is a weak reference. When a GC occurs, the referenced object will be deleted.

Although I found this weak reference, I still need to verify whether this weak reference is really used

`Verify that weak references are used`

Here, use Arthas's vmtool command to look at the real-time objects in the cache:

vmtool --action getInstances --className ilog.rules.res.xu.ruleset.cache.internal.IlrRulesetCacheImpl --express 'instances[0].entries.get(0)'
@IlrRulesetCacheEntry[
    canonicalRulesetPath=@String[/ansNcreckonRuleApp/1.0/ansncreckonrule/1.0],
    xomClassLoader=@XOMClassLoader[com.ibm.rules.res.persistence.internal.XOMClassLoader@18794875],
   
 # 这里可以看到，rulesetRef 的实例是 IlrWeakReference
rulesetReference=@IlrWeakReference[ilog.rules.res.xu.ruleset.cache.internal.IlrWeakReference@dbd2972],
]

Judging from the result, the stone hammer is a weak citation.

But it is maxIdleTime after all that causes the weak reference, and we still need to find the source of maxIdleTime...

`Looking for maxIdleTime`

In IlrRulesetCacheEntry , you can see that maxIdleTime is obtained from IlrRulesetArchiveProperties :

long maxIdleTime = executableRuleset.getRulesetArchiveProperties().getMaxIdleTime();

Then continue to look at the class IlrRulesetArchiveProperties

public long getMaxIdleTime() {
    
    // 从 properties 里获取 key 为 ruleset.maxIdleTime 的 value
    String result = this.get("ruleset.maxIdleTime");
    return result == null ? -1L : Long.valueOf(result);
}

public String get(Object key) {
    String result = (String)this.properties.get(key);
    return result == null && this.defaults != null ? (String)this.defaults.get(key) : result;
}

The default value returned by getMaxIdleTime is -1, which means that if this maxIdleTime value is not configured, the weak reference strategy will also be used by default.

So far, the problem has been accurately located. The weakly referenced caching strategy caused the resource cache to be emptied when GC, and reloading the resource caused a long time.

But this system has no source code and no documentation. Where can I change this maxIdleTime...

But here are all here, I have already seen the IlrRulesetArchiveProperties class, it is better to see what values are configured in this class, and whether there is maxIdleTime

vmtool --action getInstances --className com.ibm.rules.res.xu.ruleset.internal.CRERulesetImpl --express 'instances[0].getRulesetArchiveProperties()'

@IlrRulesetArchivePropertiesImpl[
    @String[ruleset.engine]:@String[cre],
    @String[ruleset.status]:@String[enabled],
    @String[ruleset.bom.enabled]:@String[true],
    @String[ruleset.managedxom.uris]:@String[resuri://ans-nc-xom.zip/54.0,resuri://ruleapp-core-model-1.5.2.jar/2.0],
]

From the results, we can see that there is no maxIdleTime property in properties, which matches our conclusion above. The maxIdleTime property is not configured, the default is -1, so a weak reference is used

`Find the configuration method of maxIdleTime`

Although you can't see the comments in the decompiled code, you can still guess from the class name. The class name IlrRulesetArchiveProperties should mean "rule set archive properties".

Although I don't understand the business rules of this system, someone understands it! So I found the old man who was in charge of the configuration of this system, and asked him to explain to me the various concepts of this system. My brother is also very friendly, so he drew a picture for me directly This rule engine system has the concept of an App, which is the meaning of the application often referred to in the saas platform; each App can create multiple rule sets, which is the RuleSet in the figure, and each RuleSet is our business rule, such as Cost calculation formulas or logical verification rules, and each RuleSet will also record multiple historical versions of each change.

Seeing this picture, I also understand that the IlrRulesetArchiveProperties located above are not the properties of the rule set? This product has a console, there should be a place for this attribute configuration, otherwise, why do you make such a class? The rule sets are created in the console, so the rule set properties should also be configurable!

As I expected, there is a concept of attributes in the rule set of the console, and attributes can also be added.

The most important thing is that ruleset.bom.enabled the console is also available when I use arthas vmtool above. It can prove that the configuration here should be the configuration of IlrRulesetArchiveProperties; as long as maxIdleTime is added to this rule set, is it right? Can strong references be used?

Then I carefully clicked the button to add attributes...

Sure enough, there is an option of maxIdleTime. After selecting maxIdleTime, set it to 0.

`Validation results`

After the configuration is completed, restart to verify that it takes effect. Who knows if this system supports hot refresh? It is safer to restart to verify after the change.

First verify the properties of the rule set:

vmtool --action getInstances --className com.ibm.rules.res.xu.ruleset.internal.CRERulesetImpl --express 'instances[0].getRulesetArchiveProperties()'
@IlrRulesetArchivePropertiesImpl[
# 新增配置
    @String[ruleset.maxIdleTime]:@String[0],
    @String[ruleset.engine]:@String[cre],
    @String[ruleset.status]:@String[enabled],
    @String[ruleset.bom.enabled]:@String[true],
    @String[ruleset.managedxom.uris]:@String[resuri://ans-nc-xom.zip/54.0,resuri://ruleapp-core-model-1.5.2.jar/2.0],
]

As you can see from the above, our new configuration has taken effect, and there is already this maxIdleTime in the rule set.

Let's see if the reference in the cache has become a strong reference:

vmtool --action getInstances --className ilog.rules.res.xu.ruleset.cache.internal.IlrRulesetCacheImpl --express 'instances[0].entries.get(0).rulesetReference'

# 这里是强引用
@IlrStrongReference[
    target=@CRERulesetImpl[com.ibm.rules.res.xu.ruleset.internal.CRERulesetImpl@28160472],
]

After the maxIdleTime is increased, the cache of the rule set becomes a strong reference, and there will be no more cases of being reclaimed due to GC under the strong reference!

Use vmtool to run fullgc again, and try to see if it will reload:

vmtool --action forceGc

After a dozen times of execution, there was no long time-consuming problem. The problem should be solved.

Then I synchronized this configuration to the test environment. After running for three days, I went up to manually forceGc from time to time, and this long time-consuming problem did not recur.

`Why is it not time-consuming after every GC?`

As mentioned at the beginning of this article, there will be a GC activity shortly before each long time consuming, but not every time after GC there will be a long time consuming.

Objects maintained by weak references do not mean that they will be emptied during GC; only during GC, if the weakly referenced object has no other references, as in the following example:

Map<String,Object> dataMap = new HashMap<>();
WeakReference ref = new WeakReference(dataMap);
System.gc();
System.out.println(ref.get());
dataMap = null;
System.gc();
System.out.println(ref.get());

//output
{}
null

In the first gc, the data in ref will not be cleared, and the weakly referenced data before the second gc, which has no other references, will be cleared at this time.

Considering the problems of this system, although the weak reference cache is used in the rule set, if the caller still holds the rule set object during GC and is not released, the weak reference rule set cache will not be emptied either; therefore There will be this not every GC will cause a reload, but every reload is due to GC problems

But I don’t think there is any need to continue. Where is the place where the rule set object is held? This time-consuming problem is not very much related; after changing to a strong reference, there will be no more GC reclaiming the object, then Do you care about who holds it? (Actually, I am lazy. I don’t have the documentation and the source code. I’m bald if I look for problems.)

`Write at the back`

I thought it was necessary to decompile first, and then modify the code to repackage the problem. In the end, it was solved without changing a single line of code. I was a little surprised...

However, the design idea of this rule engine is quite good. In order to save memory, a configurable caching strategy is used, but the default use of weak references to maintain the rule set is too conservative. For most server-side scenarios, the amount of memory released is not bad, and you can use it as much as you want. If it is released, it will cause greater problems. The time-consuming reloading is the most unacceptable.

Originality is not easy, unauthorized reprinting is prohibited. If my article is helpful to you, please like/favorite/follow to encourage and support it ❤❤❤❤❤❤

How to troubleshoot problems without source code? Arthas brute force search memory objects to solve the problem of occasional time-consuming online

background

After investigation

Watch monitoring

See log

GC issues

Guess the possible cause

Locate the resource loading point

`Decompile and find code`

`Analyze the mechanism of the cache container`

`Verify that weak references are used`

`Looking for maxIdleTime`

`Find the configuration method of maxIdleTime`

`Validation results`

`Why is it not time-consuming after every GC?`

`Write at the back`

空无

`引用和评论`

PDF 那些事

Java8的新特性

Java11的新特性

Java5的新特性

Java9的新特性

Java13的新特性

Java7的新特性