- Porting EdgeDB Network I/O Code: While a significant portion of EdgeDB's network I/O code was being ported from Python to Rust, work was under way on a new HTTP fetch feature using `reqwest`. Tests initially passed locally and on x86_64 CI runners but started failing intermittently on ARM64 CI runners.
- CI Output and Initial Observations: The CI output showed a hung test process with no errors in the logs. It initially looked like a deadlock, but it later turned out that the process had crashed.
- Connecting to ARM64 Runner: To figure out what was happening, Sully and Matt connected directly to the ARM64 runner. They SSH'd into the CI machine and tried to find the hung process but couldn't as it was running in a Docker container with its own process namespace.
- Finding the Core Dump: It was determined that the process had crashed and a core dump was found. Loading the core dump into `gdb` initially faced issues due to missing files. By copying the relevant libraries out of the container and telling `gdb` where to find them, they were able to get more useful information.
- Backtrace and Disassembly: The backtrace revealed that the crash was not in the new HTTP code but in `getenv`. Disassembling the `getenv` function showed that it was crashing while loading a byte.
- Inspecting Environment Block: Inspecting the environment block using `gdb` showed that it seemed valid and consistent, but there was a load from an invalid memory location.
- The Real Culprit: setenv and getenv: `setenv` is not safe in a multithreaded environment and was suspected to be the cause. Reading the disassembly and cross-referencing with the C code showed that a race condition was likely occurring between threads calling `setenv` and `getenv`.
- Offending Code in openssl-probe: `openssl-probe` was found to be setting the `SSL_CERT_FILE` and `SSL_CERT_DIR` environment variables, which was likely triggering the crash.
- Assembly Skills and a Curious Operator: The assembly skills of the developers were a bit rusty, and they noticed a curious exclamation mark in the assembly code indicating the “pre-index” addressing mode.
- Preconditions for the Crash: The crash was caused by a memory-moving `realloc` triggered by `setenv` at the same time another thread was calling `getenv`. The number of environment variables and other factors needed to be just right for the crash to occur.
- Migration Plan: In the end, it was decided to migrate away from `reqwest`'s `rust-native-tls`/`openssl` backend to `rustls` on Linux to avoid similar issues. Another option was to hold the Python Global Interpreter Lock to prevent races with Python threads. The Rust project has identified this as an issue and plans to make the environment-setter functions unsafe in the 2024 edition, and the glibc project has made `getenv` more thread-safe.
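
The race described above only occurs when one thread changes the environment while another reads it. As an illustration only, and not the fix the summary describes (that was the move to `rustls`), the sketch below shows a general way to avoid the pattern in Rust: read `SSL_CERT_FILE` and `SSL_CERT_DIR` once before any threads start, then hand the values to workers explicitly. The `CertPaths` type and the `cert_files_seen_by_workers` helper are hypothetical names invented for this example.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;
use std::sync::Arc;
use std::thread;

/// Hypothetical snapshot of certificate locations (not an EdgeDB or reqwest type).
#[derive(Debug, Clone, PartialEq)]
pub struct CertPaths {
    pub cert_file: Option<PathBuf>,
    pub cert_dir: Option<PathBuf>,
}

impl CertPaths {
    /// Build the snapshot from any key -> value lookup, so a fake
    /// lookup can stand in for the real environment in tests.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<OsString>,
    {
        CertPaths {
            cert_file: lookup("SSL_CERT_FILE").map(PathBuf::from),
            cert_dir: lookup("SSL_CERT_DIR").map(PathBuf::from),
        }
    }

    /// Read the real process environment once. Assumption: this is
    /// called at startup, before any other threads exist.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| env::var_os(key))
    }
}

/// Spawn `n` workers that read the shared snapshot instead of calling
/// into the environment, and collect the certificate file each one saw.
pub fn cert_files_seen_by_workers(paths: CertPaths, n: usize) -> Vec<Option<PathBuf>> {
    let shared = Arc::new(paths);
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let p = Arc::clone(&shared);
            thread::spawn(move || p.cert_file.clone())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Because the workers only ever touch an immutable `Arc<CertPaths>`, nothing in this code mutates the environment after startup. It does not protect against C libraries in the same process that call `setenv` or `getenv` on their own, which is the situation the summary describes.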