Author: Lu Yanbo
Starting from a Java Agent error report and moving through JVM internals, glibc thread safety, and pthread thread-local storage, this article tracks down a strange Java Agent error step by step.
Background
Several Alibaba Cloud products provide Java Agents for users. When multiple Java Agents are used together, the total agent startup time increases, and because each agent is stored separately, memory usage and resource consumption increase as well.
MSE launched the one-java-agent project, which coordinates the individual Java Agents; it also supports more efficient and convenient bytecode injection.
Each Java Agent runs as a plugin of one-java-agent and is started on its own thread during the premain stage, which reduces startup time from O(n) to O(1) and lowers the overall Java Agent load time.
Problem
Recently, however, while verifying a new version of the Agent, the following errors appeared in the premain stage of one-java-agent:
2022-06-15 06:22:47 [oneagent plugin arms-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: arms-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ArmsAgent/arms-bootstrap-1.7.0-SNAPSHOT.jar
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:113)
at com.alibaba.oneagent.plugin.PluginManagerImpl.startOnePlugin(PluginManagerImpl.java:294)
at com.alibaba.oneagent.plugin.PluginManagerImpl.access$200(PluginManagerImpl.java:22)
at com.alibaba.oneagent.plugin.PluginManagerImpl$2.run(PluginManagerImpl.java:325)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.InternalError: null
at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:100)
... 4 common frames omitted
2022-06-16 09:51:09 [oneagent plugin ahas-java-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: ahas-java-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ahas-java-agent/ahas-java-agent.jar
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:113)
at com.alibaba.oneagent.plugin.PluginManagerImpl.startOnePlugin(PluginManagerImpl.java:294)
at com.alibaba.oneagent.plugin.PluginManagerImpl.access$200(PluginManagerImpl.java:22)
at com.alibaba.oneagent.plugin.PluginManagerImpl$2.run(PluginManagerImpl.java:325)
at java.lang.Thread.run(Thread.java:855)
Caused by: java.lang.IllegalArgumentException: null
at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:100)
... 4 common frames omitted
Readers familiar with Java Agents may notice that the error occurs in the call to Instrumentation.appendToSystemClassLoaderSearch.
But first, the path passed to appendToSystemClassLoaderSearch exists; second, the real cause of the error lies in the native C/C++ code, which is harder to troubleshoot.
But anyway, let's get to the bottom of why this error occurs.
First, let's sort out the call chain. The rest of the analysis follows it:
- Instrumentation.appendToSystemClassLoaderSearch (java)
  - appendToClassLoaderSearch0 (JNI)
    `- appendToClassLoaderSearch
       |- AddToSystemClassLoaderSearch
       |  `- create_class_path_zip_entry
       |     `- stat
       `- convertUft8ToPlatformString
          `- iconv
Add logs to confirm the scene
Because the problem appears with roughly 10% probability in the container environment, it is relatively easy to reproduce, so I took the latest dragonwell8 code, added logs, and confirmed the scene.
First, I added a log statement at the real JNI entry point, that is, at the start of appendToClassLoaderSearch.
After adding that log, the problem became even more baffling:
- When no error is reported, "appendToClassLoaderSearch entry" is printed.
- When the error occurs, "appendToClassLoaderSearch entry" is not printed. Was this code not executed at all?
This doesn't match the error log. Could the stack trace be deceiving us?
After a hard night, I asked colleagues on the Dragonwell team for advice the next day. This is how the experts write logs:
- tty->print_cr("internal error");
- If the above doesn't work, use printf("xxx\n");fflush(stdout);
With logging done this way, our log lines finally showed up.
This is the first pitfall: printf must be followed by fflush to make sure the output actually appears.
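The printf-plus-fflush advice can be wrapped in a small helper. The following is only a minimal sketch: debug_log is a hypothetical name, not something from the JVM source. The reason it matters is that when stdout is not a terminal (for example when it is redirected to a file or a pipe, as container logs usually are), stdio fully buffers it by default, so a message can sit in the buffer long after printf returns.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not part of HotSpot): print one formatted message
 * and flush stdout right away, so the text does not stay in the stdio
 * buffer when stdout is a pipe or a file.
 * Returns the number of characters printed, or a negative value on error. */
int debug_log(const char *fmt, ...)
{
    assert(fmt != NULL);

    va_list ap;
    va_start(ap, fmt);
    int n = vprintf(fmt, ap);
    va_end(ap);

    if (n >= 0 && fflush(stdout) != 0) {
        return -1;          /* the message was formatted but not delivered */
    }
    return n;
}
```

A call such as debug_log("appendToClassLoaderSearch entry\n") at the top of the native function would then be visible immediately, even if the process crashes shortly afterwards.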
Analyze the code
After that, I kept adding logs and finally found that create_class_path_zip_entry returned NULL.
Could it not find the corresponding jar file?
Investigating further, I found that stat fails with "No such file or directory". But as mentioned earlier, the jar file path exists. Is stat not thread-safe?
I checked the documentation [1] and found that stat is thread-safe.
So I went back, looked again, and noticed that the path passed to stat was abnormal: sometimes it was empty, and sometimes it was /home/admin/.opt/ArmsAgent/plugins/ahas-java-agent/ahas-java-agent.jarSHOT.jar. As the tail of that string shows, two strings had been written into the same piece of memory, and the length of the string had become an irregular number as well.
Now the direction was clear: find out where this string is produced. It is produced by convertUft8ToPlatformString.
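A diagnostic of the kind that exposes this symptom might look like the sketch below. It is an illustration only: check_jar_path is a hypothetical helper, not a function in the JVM. It logs the exact string and length handed to stat together with the resulting error, which is how a path that is correct on the Java side can be seen to arrive corrupted on the native side.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical diagnostic: stat() the given path and log exactly what was
 * passed in, including its length.
 * Returns 0 if the path exists, otherwise the errno value set by stat(). */
int check_jar_path(const char *path)
{
    assert(path != NULL);

    struct stat st;
    int err = (stat(path, &st) == 0) ? 0 : errno;

    printf("stat(\"%s\") len=%zu -> %s\n",
           path, strlen(path), err == 0 ? "ok" : strerror(err));
    fflush(stdout);         /* make sure the line is not lost in the buffer */
    return err;
}
```

An empty path or a path with a mangled tail both come back as ENOENT, matching the "No such file or directory" seen in the investigation.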
Problem with character encoding conversion?
So I started to debug the logic of utf8ToPlatform. To avoid repeatedly adding logs and restarting the container, I ran gdb directly on ECS to debug the JVM.
It turned out that under Linux, utf8ToPlatform is a plain memcpy, and the destination of the memcpy is on the stack.
That is unlikely to have thread-safety issues.
Later, I checked carefully and found that it was related to environment variables. On ECS, the encoding-related environment variable is LANG=en_US.UTF-8. In the container, centos:7 does not set this variable by default, and in that case the encoding the JVM reads is ANSI_X3.4-1968.
This is the second pitfall: environment variables affect the conversion to the local encoding.
Combining these observations with the code, I found that in the container environment the string does have to go through iconv to be converted from UTF-8 to ANSI_X3.4-1968.
It also follows that if LANG=en_US.UTF-8 is set manually in the container, the problem should not occur. Additional verification confirmed this.
After adding more logs, I finally confirmed that the target string gets corrupted during the iconv call.
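The locale dependence described above can be checked with a few lines of C. This is a sketch under assumptions, not the JDK's actual code: platform_codeset and codeset_needs_conversion are hypothetical names. It uses the standard setlocale and nl_langinfo calls to read the codeset selected by the environment, then decides whether a UTF-8 string can be copied as-is or has to be converted.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Codeset name of the locale selected by the environment (LC_ALL, then
 * LC_CTYPE, then LANG). With none of them set, the "C" locale applies,
 * and glibc reports its codeset as "ANSI_X3.4-1968", i.e. plain ASCII.
 * Note: setlocale() changes the locale of the whole process. */
const char *platform_codeset(void)
{
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}

/* Returns 1 if a UTF-8 string must be converted (for example with iconv)
 * before it is handed to the platform, 0 if a byte-for-byte copy is enough. */
int codeset_needs_conversion(const char *codeset)
{
    assert(codeset != NULL);
    if (codeset[0] == '\0') {
        return 0;   /* unknown codeset: this sketch falls back to a copy */
    }
    return strcmp(codeset, "UTF-8") != 0 && strcmp(codeset, "utf8") != 0;
}
```

With LANG=en_US.UTF-8 the second function returns 0 for the reported codeset, which corresponds to the memcpy path seen on ECS; with no locale variables set on glibc it returns 1, which corresponds to the iconv path seen in the container.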
Is iconv thread-unsafe?
iconv is not thread safe!
According to the iconv documentation, iconv is not completely thread-safe.
In plain terms, before calling iconv you need to open an iconv_t with iconv_open, and this iconv_t must not be used by multiple threads at the same time.
At this point the problem was almost fully located: the JVM stores the iconv_t in a global variable, so when multiple threads append to the class loader search path, they may call iconv at the same time, resulting in a race.
This is the third pitfall: iconv is not thread-safe.
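To make the rule concrete, the sketch below shows the iconv_open / iconv / iconv_close sequence with a descriptor that never leaves the calling function, so it can never be shared between threads. This is not how the JVM does it (the JVM keeps one global descriptor, which is the source of the race); utf8_to_codeset is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Converts a NUL-terminated UTF-8 string to `to_codeset`, using a
 * descriptor that is private to this call.
 * Returns the number of bytes written (without the NUL), or -1 on error. */
long utf8_to_codeset(const char *to_codeset, const char *utf8,
                     char *out, size_t out_size)
{
    assert(to_codeset != NULL && utf8 != NULL && out != NULL);
    if (out_size == 0) {
        return -1;
    }

    iconv_t cd = iconv_open(to_codeset, "UTF-8");   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1) {
        return -1;
    }

    char  *in_ptr   = (char *)utf8;     /* iconv takes a non-const char ** */
    size_t in_left  = strlen(utf8);
    char  *out_ptr  = out;
    size_t out_left = out_size - 1;     /* keep room for the terminator */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1) {
        /* Emit any closing shift sequence for stateful encodings. */
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);
    }
    iconv_close(cd);
    if (rc == (size_t)-1) {
        return -1;
    }

    *out_ptr = '\0';
    return (long)(out_ptr - out);
}
```

Opening a descriptor on every call is simple but pays the iconv_open cost each time; caching one descriptor per thread avoids that cost while still respecting the rule.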
How to fix
Fix one-java-agent first
On the Java side the change is easy: just add a lock around the call.
But there is a design problem. The Instrumentation object is already used all over the code, and adding a lock now means changing almost every call site, so the cost of the modification is relatively high.
So in the end, the problem was solved with a proxy class, InstrumentationWrapper [2].
Other code only needs to use InstrumentationWrapper, and it will not trigger this problem.
Should the JVM be fixed?
Then we analyzed the JVM-side code and found that, because iconv_t is not thread-safe, the appendToClassLoaderSearch0 method is not thread-safe either. Can it be solved elegantly?
In a Java program, this could be solved by simply storing the iconv_t in a ThreadLocal.
On the C++ side it is less direct. C++11 does support thread_local, but first, JDK 8 does not use C++11 (see the relevant JEP); second, a thread_local raw iconv_t handle only gives per-thread get and set. Opening it on first use and closing it when the thread ends would require additional wrapper code, so by itself it does not automatically release the iconv_t when the thread exits.
So do we fall back to pthread? pthread provides thread-specific data, which makes something similar possible:
- pthread_key_create creates a key for thread-local storage
- pthread_setspecific puts a value into thread-local storage
- pthread_getspecific fetches a value from thread-local storage
- Most importantly, pthread_once guarantees that the pthread_key_t is initialized only once.
- It should also be mentioned that the second parameter of pthread_key_create is a destructor callback invoked when a thread ends; we can use it to close the iconv_t and avoid a resource leak.
In short, pthread provides full life-cycle management for thread-local data. So the final code initializes the thread-local storage with make_key; see the dragonwell fix [3].
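The pattern built from these calls can be sketched as follows. This is not the code from the dragonwell pull request, only an illustration under assumptions: apart from make_key, which the article names, every identifier (iconv_key, close_iconv, thread_iconv, utf8_to_platform, PLATFORM_CODESET) is hypothetical, and the target codeset is fixed to ASCII for simplicity.

```c
#include <assert.h>
#include <iconv.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption for this sketch: a fixed target codeset. The JVM reads the
 * real one from the process locale. */
#define PLATFORM_CODESET "ASCII"

static pthread_key_t  iconv_key;                 /* one key for all threads */
static pthread_once_t iconv_key_once = PTHREAD_ONCE_INIT;
static int            iconv_key_status;          /* result of key creation */

/* Key destructor: called at thread exit with that thread's non-NULL value. */
static void close_iconv(void *value)
{
    iconv_t *cd = value;
    iconv_close(*cd);
    free(cd);
}

/* pthread_once runs this exactly once, however many threads race to it. */
static void make_key(void)
{
    iconv_key_status = pthread_key_create(&iconv_key, close_iconv);
}

/* Returns the calling thread's own descriptor, opening it on first use. */
static iconv_t *thread_iconv(void)
{
    pthread_once(&iconv_key_once, make_key);
    if (iconv_key_status != 0) {
        return NULL;
    }
    iconv_t *cd = pthread_getspecific(iconv_key);
    if (cd != NULL) {
        return cd;
    }
    cd = malloc(sizeof *cd);
    if (cd == NULL) {
        return NULL;
    }
    *cd = iconv_open(PLATFORM_CODESET, "UTF-8");
    if (*cd == (iconv_t)-1) {
        free(cd);
        return NULL;
    }
    if (pthread_setspecific(iconv_key, cd) != 0) {
        iconv_close(*cd);
        free(cd);
        return NULL;
    }
    return cd;
}

/* Converts UTF-8 to PLATFORM_CODESET; returns bytes written or -1. */
long utf8_to_platform(const char *utf8, char *out, size_t out_size)
{
    assert(utf8 != NULL && out != NULL);
    iconv_t *cd = thread_iconv();
    if (cd == NULL || out_size == 0) {
        return -1;
    }
    char  *in_ptr   = (char *)utf8;
    size_t in_left  = strlen(utf8);
    char  *out_ptr  = out;
    size_t out_left = out_size - 1;

    iconv(*cd, NULL, NULL, NULL, NULL);   /* reset state left by earlier calls */
    if (iconv(*cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        return -1;
    }
    *out_ptr = '\0';
    return (long)(out_ptr - out);
}
```

Each thread that calls utf8_to_platform gets its own iconv_t, so concurrent calls no longer touch shared conversion state, and the destructor registered through pthread_key_create closes the descriptor when that thread exits.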
After compiling the JDK, building the image, and restarting the pods in batches several times, the problem described at the beginning of the article no longer occurred.
Summary
Throughout the process, from Java to JNI/JVMTI, to glibc, to pthread, I ran into quite a few pitfalls:
- printf needs to be followed by fflush to make sure the output appears
- Environment variables affect the conversion to the local character encoding
- iconv is not thread-safe
- pthread thread-local storage provides full life-cycle management of thread-local variables
By following the call stack and the code to reconstruct the problem step by step and propose a fix, I hope this case helps you understand a little more about Java and the JVM.
Reference link:
[1] Documentation:
https://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html
[2] Link to one-java-agent fix:
https://github.com/alibaba/one-java-agent/issues/31
[3] Link to dragonwell fix:
https://github.com/alibaba/dragonwell8/pull/346
[4] one-java-agent brings you a more convenient and non-intrusive microservice governance method:
https://www.aliyun.com/product/aliware/mse