The C standard library is not thread-safe, and even safe Rust did not save us | Gel Blog

  • Porting EdgeDB Network I/O Code: While porting a significant portion of EdgeDB's network I/O code from Python to Rust, the developers were working on a new HTTP fetch feature built on reqwest. Tests passed locally and on x86_64 CI runners but failed intermittently on ARM64 CI runners.
  • CI Output and Initial Observations: The CI output showed a hung test process with no errors in the logs. It looked like a deadlock at first, but the process turned out to have crashed.
  • Connecting to ARM64 Runner: To figure out what was happening, Sully and Matt connected directly to the ARM64 runner. They SSH'd into the CI machine and tried to find the hung process but couldn't as it was running in a Docker container with its own process namespace.
  • Finding the Core Dump: Once it was clear that the process had crashed, a core dump was found. Loading it into gdb ran into problems at first because files were missing. After copying the relevant libraries out of the container and telling gdb where to find them, the developers got more useful information.
  • Backtrace and Disassembly: The backtrace revealed that the crash was not in the new HTTP code but in getenv. Disassembling the getenv function showed that it was crashing while loading a byte.
  • Inspecting Environment Block: Inspected in gdb, the environment block seemed valid and consistent, yet the crash was a load from an invalid memory location.
  • The Real Culprit: setenv and getenv: setenv is not safe to call in a multithreaded process, so it was suspected as the cause. Reading the disassembly alongside the C source suggested a race between one thread calling setenv and another calling getenv.
  • Offending Code in openssl-probe: openssl-probe was found to be setting the SSL_CERT_FILE and SSL_CERT_DIR environment variables, which was likely triggering the crash.
  • Assembly Skills and a Curious Operator: The assembly skills of the developers were a bit rusty, and they noticed a curious exclamation mark in the assembly code indicating the “pre-index” address mode.
  • Preconditions for the Crash: The crash required setenv to trigger a realloc that moved memory at the same moment another thread was inside getenv. The number of environment variables and other factors had to be just right for it to occur.
  • Migration Plan: In the end, the decision was to migrate on Linux from reqwest's rust-native-tls/openssl backend to rustls to avoid similar issues. Another option was to hold the Python Global Interpreter Lock to prevent races with Python threads. The Rust project has identified the problem and plans to make the environment-setter functions unsafe in the 2024 edition, and the glibc project has added more thread-safety to getenv.
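
One general way to stay clear of the setenv/getenv race summarized above is to never write to the environment once threads exist: read the variables a single time and pass the values around as ordinary data. The following is a minimal Rust sketch under that assumption, using only the standard library. It is not the fix the authors chose (they moved to rustls), and the `CertPaths` struct and `cert_paths` function are hypothetical names introduced for illustration.

```rust
use std::env;
use std::sync::OnceLock;
use std::thread;

/// Hypothetical holder for the two certificate locations named in the post.
#[derive(Debug, Clone, PartialEq)]
struct CertPaths {
    file: Option<String>,
    dir: Option<String>,
}

/// Process-wide snapshot, filled in at most once.
static CERT_PATHS: OnceLock<CertPaths> = OnceLock::new();

/// Reads SSL_CERT_FILE and SSL_CERT_DIR on the first call only.
/// Later calls return the cached snapshot and do not touch the environment.
fn cert_paths() -> &'static CertPaths {
    CERT_PATHS.get_or_init(|| CertPaths {
        file: env::var("SSL_CERT_FILE").ok(),
        dir: env::var("SSL_CERT_DIR").ok(),
    })
}

fn main() {
    // Take the snapshot before any worker thread is spawned.
    let snapshot = cert_paths();

    // Workers read the cached values; nothing here calls an environment setter.
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| cert_paths().clone()))
        .collect();

    for handle in handles {
        assert_eq!(&handle.join().unwrap(), snapshot);
    }
}
```

Because nothing in this sketch calls a setter, its readers never race with a write from the same code. It does not help when a dependency writes to the environment on its own, which is what openssl-probe was doing.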