Tokio + prctl = nasty bug

  • Bug encounter: The author found a bug in HyperQueue, a distributed task scheduler written in Rust. After version 0.21.0 was released, several issue reports described tasks being terminated after a few seconds for no obvious reason. One report was stranger still: the very last task executed always failed.
  • Bug origin: A user provided a reproducer. The bug was traced to a 2024 commit that slightly changed how HyperQueue spawns tasks: the blocking process spawn was offloaded to a separate thread with tokio::task::spawn_blocking. The goal was better performance, since spawning processes can be a bottleneck on some systems.
  • Figuring out the cause: The author found that the tasks were being killed with SIGTERM. HyperQueue sets PR_SET_PDEATHSIG on each task so that it receives SIGTERM when its parent dies, but the kernel treats the thread that created the process as the "parent", not the parent process as a whole. Once spawning moved to a Tokio blocking worker thread, the signal fired when that thread exited. The worker thread was shut down after approximately ten seconds of inactivity, at which point the kernel sent SIGTERM to the spawned task.
  • Fixing the bug: The bug was fixed by reverting the commit that introduced the task-spawning optimization. The author also added a test to guard against similar regressions. The test is not perfect, but it is better than nothing.
  • Conclusion: The bug was found, diagnosed and fixed in less than an hour. The author shared the bug as it was an interesting case and hoped others would find it useful. The author also noted that the test suite should be improved to catch such bugs in the future.
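
The mechanism described above can be reproduced without Tokio. The following is a minimal, Linux-only sketch and not HyperQueue's actual code: a plain std::thread stands in for Tokio's blocking worker thread, the child is a hypothetical `sleep 30` command, and prctl is declared by hand (with the `unsafe extern` syntax, which assumes a reasonably recent Rust toolchain) so that no external crate is needed. The function name is made up for illustration.

```rust
use std::io;
use std::os::raw::{c_int, c_ulong};
use std::os::unix::process::{CommandExt, ExitStatusExt};
use std::process::Command;
use std::thread;

// Linux constants from <sys/prctl.h> and <signal.h>.
const PR_SET_PDEATHSIG: c_int = 1;
const SIGTERM: c_int = 15;

unsafe extern "C" {
    // Declared by hand so the sketch needs no external crates.
    fn prctl(option: c_int, ...) -> c_int;
}

/// Spawns `sleep 30` from a helper thread that exits right after spawning,
/// then returns the signal (if any) that terminated the child.
fn signal_after_spawner_thread_exits() -> io::Result<Option<i32>> {
    // Stand-in for a Tokio blocking worker: a thread that spawns and exits.
    let spawner = thread::spawn(|| {
        let mut cmd = Command::new("sleep");
        cmd.arg("30");
        unsafe {
            // The closure runs in the forked child, just before exec.
            cmd.pre_exec(|| {
                // Ask the kernel to send SIGTERM when the "parent" dies.
                // The parent here is the spawning thread, not the process.
                if prctl(PR_SET_PDEATHSIG, SIGTERM as c_ulong) != 0 {
                    return Err(io::Error::last_os_error());
                }
                Ok(())
            });
        }
        cmd.spawn()
    });

    // Joining means the spawning thread is finished, although the
    // parent process (this program) is still alive.
    let mut child = spawner.join().expect("spawner thread panicked")?;

    // Returns long before the 30 seconds elapse if the child was signalled.
    let status = child.wait()?;
    Ok(status.signal())
}

fn main() -> io::Result<()> {
    let signal = signal_after_spawner_thread_exits()?;
    println!("child terminated by signal: {:?}", signal);
    Ok(())
}
```

On Linux this should report signal 15 (SIGTERM) almost immediately, even though the parent process never exits. With a thread pool the same thing happens later, when the pool retires the idle thread that performed the spawn.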