Measuring and improving rustls's multithreaded performance

  • System configuration: Benchmarks were run on a bare-metal Hetzner RX170 server: an Ampere Altra Q80-30 (80 cores) with 128GB of memory, running Debian 12 (Bookworm) with GCC 12.2.0 and Clang 14.0.6. All cores were set to the performance CPU frequency governor.
  • How benchmarks work: The same benchmark can be run in many threads simultaneously. Each thread performs the same benchmarking operation and does not contend with the others, except through the TLS library's internals. Each benchmark measures a TLS client "connecting" over a memory buffer, so network latency, system calls and similar overheads are excluded.
  • Versions:

    • BoringSSL and OpenSSL: Same as the previous report. Both OpenSSL 3.4.0 (the latest release at the time of writing) and OpenSSL 3.0.14 (as packaged by Debian) are tested.
    • rustls: The tested version is 0.23.16 with aws-lc-rs 1.10.0 / aws-lc-sys 0.22.0, plus three additional commits: a5d510ea, 44522ad0 and d1c33f86.
  • Measurements:

    • BoringSSL: ~/bench/openssl-bench $ BENCH_MULTIPLIER=2 setarch -R make threads BORINGSSL=1
    • OpenSSL 3.4.0: ~/bench/openssl-bench $ BENCH_MULTIPLIER=2 setarch -R make threads
    • OpenSSL 3.0.14: ~/bench/openssl-bench $ BENCH_MULTIPLIER=2 setarch -R make threads HOST_OPENSSL=1
    • rustls: ~/bench/rustls $ BENCH_MULTIPLIER=2 setarch -R make -f admin/bench-measure.mk threads
  • Initial results: This report focuses on server performance. The graphs show handshakes per second per thread, so a library that scales perfectly keeps a flat line as the thread count grows. OpenSSL 3 has scalability problems; BoringSSL and rustls do not. However, rustls's ticket resumption performance looks suspect in both TLS1.2 and TLS1.3.
  • Improving rustls's ticket resumption performance: rustls's TicketSwitcher was changed to use a reader-writer lock (RwLock), so that threads using the current ticket key no longer contend with one another. The fix was released in rustls 0.23.17.
  • Measuring worst-case latency at high concurrency: Tested with commit fc6b4a19. Running 80 threads at once produces satisfying htop output. The latency distribution graphs show rustls with the tightest distribution, followed by BoringSSL, while OpenSSL 3.4 and 3.0 have the widest.
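The threaded benchmark structure described under "How benchmarks work" can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual harness from openssl-bench or rustls's admin/bench-measure.mk: fake_handshake is a hypothetical CPU-bound stand-in for a real TLS handshake over memory buffers, and run_threads (also hypothetical) reports each thread's rate in handshakes per second, the unit used in the graphs.

```rust
use std::hint::black_box;
use std::sync::{Arc, Barrier};
use std::thread;
use std::time::Instant;

/// HYPOTHETICAL stand-in for one TLS handshake over in-memory buffers.
/// It only burns CPU with an LCG so the harness has something to time.
fn fake_handshake(seed: u64) -> u64 {
    let mut x = seed;
    for _ in 0..1_000 {
        x = x
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
    }
    x
}

/// Runs `iterations` handshakes on each of `threads` threads, all started
/// together, and returns each thread's rate in handshakes per second.
fn run_threads(threads: usize, iterations: u64) -> Vec<f64> {
    let barrier = Arc::new(Barrier::new(threads));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                barrier.wait(); // line all threads up before timing starts
                let start = Instant::now();
                let mut acc = 0u64;
                for i in 0..iterations {
                    acc ^= fake_handshake(black_box(t as u64 * iterations + i));
                }
                black_box(acc); // keep the work from being optimised away
                let secs = start.elapsed().as_secs_f64().max(1e-9);
                iterations as f64 / secs
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    for threads in [1, 2, 4, 8] {
        let rates = run_threads(threads, 2_000);
        let mean = rates.iter().sum::<f64>() / rates.len() as f64;
        println!("{threads:>2} threads: {mean:.0} handshakes/s per thread");
    }
}
```

Because the threads share nothing, any drop in the per-thread rate as threads are added would come from shared state inside the code under test, which is exactly what the real benchmarks are designed to expose.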
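The idea behind the TicketSwitcher change can be illustrated with a small sketch, assuming a simplified design. The hypothetical Switcher type below keeps its ticket keys behind std::sync::RwLock, so the common path (reading the current key) takes only a shared lock that many threads can hold at once, and the exclusive write lock is taken only when the key is due for rotation. This is not rustls's actual code, and TicketKey is a placeholder with no real cryptography.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

/// HYPOTHETICAL placeholder for a ticket-encryption key (no real crypto).
#[derive(Clone, Debug, PartialEq)]
struct TicketKey {
    id: u64,
}

struct State {
    current: TicketKey,
    previous: Option<TicketKey>, // kept so older tickets can still be decrypted
    next_rotation: Instant,
}

/// HYPOTHETICAL simplified key switcher, not rustls's TicketSwitcher.
struct Switcher {
    lifetime: Duration,
    state: RwLock<State>,
}

impl Switcher {
    fn new(lifetime: Duration) -> Self {
        Switcher {
            lifetime,
            state: RwLock::new(State {
                current: TicketKey { id: 0 },
                previous: None,
                next_rotation: Instant::now() + lifetime,
            }),
        }
    }

    /// Fast path: a shared read lock, which many threads can hold at once.
    fn current_key(&self, now: Instant) -> TicketKey {
        {
            let state = self.state.read().unwrap();
            if now < state.next_rotation {
                return state.current.clone();
            }
        } // read guard dropped here, before the write lock is requested
        self.rotate_if_due(now)
    }

    /// Slow path: an exclusive write lock, taken only when rotation is due.
    fn rotate_if_due(&self, now: Instant) -> TicketKey {
        let mut state = self.state.write().unwrap();
        // Re-check: another thread may have rotated while we waited.
        if now >= state.next_rotation {
            let fresh = TicketKey { id: state.current.id + 1 };
            let old = std::mem::replace(&mut state.current, fresh);
            state.previous = Some(old);
            state.next_rotation = now + self.lifetime;
        }
        state.current.clone()
    }
}

fn main() {
    let switcher = Arc::new(Switcher::new(Duration::from_secs(3600)));
    let now = Instant::now();
    // Eight threads read the current key concurrently without blocking each other.
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let sw = Arc::clone(&switcher);
            thread::spawn(move || sw.current_key(now).id)
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 0);
    }
    // Once the lifetime has passed, the next caller rotates the key.
    let later = now + Duration::from_secs(3601);
    println!("current key: {:?}", switcher.current_key(later));
    println!("previous key: {:?}", switcher.state.read().unwrap().previous);
}
```

The re-check after acquiring the write lock matters: several threads may notice at the same moment that rotation is due, and only the first one to get the write lock should actually rotate.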
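Latency distributions like those in the worst-case latency graphs are commonly summarised by percentiles. The sketch below assumes per-handshake latencies have already been collected as Duration samples and computes p50, p99 and the maximum with the nearest-rank method; a tighter distribution shows up as a smaller gap between p50 and the maximum. The names percentile, LatencySummary and summarize are hypothetical and are not taken from any of the benchmark tools.

```rust
use std::time::Duration;

/// Nearest-rank percentile of an already sorted, non-empty slice.
/// `p` is in 0.0..=100.0.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    assert!(!sorted.is_empty(), "need at least one sample");
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    // rank 0 (p == 0) maps to the smallest sample; the clamp guards p > 100.
    let index = rank.saturating_sub(1).min(sorted.len() - 1);
    sorted[index]
}

/// HYPOTHETICAL summary of one run's handshake latencies.
#[derive(Debug)]
struct LatencySummary {
    p50: Duration,
    p99: Duration,
    max: Duration,
}

fn summarize(mut samples: Vec<Duration>) -> LatencySummary {
    samples.sort();
    LatencySummary {
        p50: percentile(&samples, 50.0),
        p99: percentile(&samples, 99.0),
        max: *samples.last().expect("need at least one sample"),
    }
}

fn main() {
    // Illustrative synthetic data: 1..=100 µs, recorded in reverse order.
    let samples: Vec<Duration> = (1..=100).rev().map(Duration::from_micros).collect();
    let summary = summarize(samples);
    println!("{summary:?}");
}
```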