- Publication date: May 2025.
- The challenge: Broadcasting at line rate: Bluesky is built on ATproto, a core part of which is the "firehose", a stream of events representing changes to the network's state. The firehose comes in two versions: full-fat and jetstream. Jetstream's average message size is around half a kilobyte, and the event rate varies between roughly 300 and 500 events per second. A relay follows an upstream feed and rebroadcasts it to clients. The goal is to build a simple jetstream relay called "jetrelay" that demonstrates transferable techniques.
- Multicast and backfill: The key property of a relay is that it sends the same data to every client, websocket framing included. This is called multicast, and on local networks it is usually done with UDP multicast. But jetstream runs over TCP, so multicast-for-TCP has to be implemented. Clients don't all see events at the same time, and backfill lets a client reconnect and receive the events it missed. This means the relay needs to save event data to disk.
- Trick #1: Bypassing userspace with sendfile(): As new events arrive, they are appended to a file ready for sending. The sendfile() syscall is used to copy data from the file to the client's socket. It's easy to use and cheap as the data is in the kernel's page cache. This design naturally performs write-batching, sending data efficiently even when clients are falling behind.
- Trick #2: Handling many clients in parallel with io_uring: sendfile() is synchronous and blocking, so each client needs a dedicated thread. But io_uring allows preparing and submitting multiple sendfile()s in a single syscall and getting completion events. The main runloop submits sendfile()s for non-up-to-date clients and updates their cursors on completion. The number of syscalls doesn't depend on the number of clients. io_uring emulates sendfile() with two splice()s.
- Not a trick: Getting new clients connected: Jetrelay has an event writer thread, an I/O orchestration thread, and a client handshake thread. The event writer thread writes events to the file and updates an index. The handshake thread listens for new clients, performs the handshake, and initializes their file offsets using the index. A mutex is used to protect the index when it's read and written by different threads.
- Trick #3: Discarding old data with FALLOC_FL_PUNCH_HOLE: The file grows without bound, so old data needs to be discarded. fallocate() is used to deallocate regions of the file so that they take up no disk space. An index maps timestamps to byte-offsets; once data is older than a certain age, its entries are removed from the index and the corresponding region of the file is deallocated. Clients whose cursors are in the deallocated region are kicked off.
- Testing it out: Jetrelay was tested locally over loopback and over a real network. With loopback, the fd limit needed to be increased to handle 20k clients. On a real network with a 10 Gbps connection, 6000 clients could keep up easily, and 9000 clients could keep up for a while until the data-rate exceeded 10 Gbps. Testing with different CPU quotas showed that throughput was bottlenecked by the CPU with 8 cores or fewer. Compared with the official server, jetrelay handled more clients before saturating, although the two are optimized for different use cases.
- Wrap-up: The code of jetrelay shows how events are written to the file, old data is dropped, sendfile() requests are created, and client cursors are updated. Jetrelay is a tech demo and more work is needed for real deployment, including operational and security features and missing features like filtering. In a real setup, jetrelay could be run behind nginx.
- Appendix: The other 90%: Jetrelay needs various improvements for real use, such as backfill on startup, better websocket handling, systemd notifications, logging, Prometheus metrics, and a fancy dashboard. Security features like per-IP rate limiting, preventing DoS, timing out clients, and preventing large data sends are also needed. Missing features include filtering by collection and DID, and in-band stream control. More testing, including fuzzing, is required.
- Appendix: Thoughts on ATproto and the “push-based internet”: The original internet is pull-based, while push-based systems need a middleman. ATproto is similar to RSS but with improvements like a map-based data structure and signed records. Because the relay no longer has to be trusted, a client can use whichever relay is closest. HTTP is for the pull-based internet and RSS has been used for push, and ATproto could be a new standard.
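
The append-then-sendfile() design from Trick #1, together with the per-client cursors from Trick #2, can be sketched roughly as follows. This is a minimal illustration and not jetrelay's code: it uses Python's blocking `os.sendfile` (Linux) instead of io_uring, the events are newline-terminated placeholders without websocket framing, and the names `append_event`, `catch_up`, `read_all` and `demo` are hypothetical.

```python
import os
import socket
import tempfile


def append_event(log_fd: int, payload: bytes) -> int:
    """Append one event to the log and return the new end-of-log offset."""
    os.write(log_fd, payload)  # log_fd is opened with O_APPEND
    return os.fstat(log_fd).st_size


def catch_up(sock_fd: int, log_fd: int, cursor: int, log_end: int) -> int:
    """Send log bytes [cursor, log_end) to one client; return its new cursor."""
    while cursor < log_end:
        # The kernel copies from the file to the socket; the payload never
        # passes through a userspace buffer.
        sent = os.sendfile(sock_fd, log_fd, cursor, log_end - cursor)
        if sent == 0:
            break
        cursor += sent
    return cursor


def read_all(sock: socket.socket) -> bytes:
    """Read from a socket until the peer signals end-of-stream."""
    chunks = []
    while chunk := sock.recv(4096):
        chunks.append(chunk)
    return b"".join(chunks)


def demo() -> list[bytes]:
    """Two clients at different cursors both catch up from the same file."""
    with tempfile.TemporaryDirectory() as tmp:
        log_fd = os.open(os.path.join(tmp, "events.log"),
                         os.O_RDWR | os.O_CREAT | os.O_APPEND, 0o600)
        behind_tx, behind_rx = socket.socketpair()  # stand-ins for TCP clients
        recent_tx, recent_rx = socket.socketpair()
        try:
            append_event(log_fd, b"event-1\n")
            recent_cursor = append_event(log_fd, b"event-2\n")  # saw 1 and 2
            log_end = append_event(log_fd, b"event-3\n")
            behind_cursor = 0  # this client has seen nothing yet

            # The lagging client receives all three events in a single batch.
            behind_cursor = catch_up(behind_tx.fileno(), log_fd,
                                     behind_cursor, log_end)
            recent_cursor = catch_up(recent_tx.fileno(), log_fd,
                                     recent_cursor, log_end)
            assert behind_cursor == recent_cursor == log_end

            behind_tx.shutdown(socket.SHUT_WR)
            recent_tx.shutdown(socket.SHUT_WR)
            return [read_all(behind_rx), read_all(recent_rx)]
        finally:
            os.close(log_fd)
            for s in (behind_tx, behind_rx, recent_tx, recent_rx):
                s.close()
```

Calling `demo()` returns what each client received. The client that started at offset 0 gets every event from one call, which is the write-batching effect: the further behind a client is, the more data each sendfile() moves.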
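
The timestamp-to-offset index shared by the event writer and handshake threads, and pruned in Trick #3, might look something like the sketch below. The class and method names are hypothetical, and two details are assumptions rather than facts from the source: that a reconnecting client resumes at the first event at or after its cursor timestamp, and that the newest entry is never pruned.

```python
import bisect
import threading
from typing import List, Optional


class OffsetIndex:
    """Maps event timestamps to byte offsets in the event file."""

    def __init__(self) -> None:
        self._lock = threading.Lock()     # shared by writer and handshake threads
        self._timestamps: List[int] = []  # ascending order
        self._offsets: List[int] = []     # offset at which each event starts

    def append(self, timestamp: int, offset: int) -> None:
        """Called by the event writer after appending an event to the file."""
        with self._lock:
            self._timestamps.append(timestamp)
            self._offsets.append(offset)

    def offset_for(self, cursor_ts: int) -> Optional[int]:
        """Offset of the first event at or after cursor_ts, or None if the
        cursor is newer than everything in the index."""
        with self._lock:
            i = bisect.bisect_left(self._timestamps, cursor_ts)
            return self._offsets[i] if i < len(self._offsets) else None

    def prune_before(self, cutoff_ts: int) -> int:
        """Drop entries older than cutoff_ts and return the offset of the
        oldest retained event; bytes before it are no longer needed."""
        with self._lock:
            if not self._offsets:
                return 0
            i = bisect.bisect_left(self._timestamps, cutoff_ts)
            i = min(i, len(self._offsets) - 1)  # always keep the newest entry
            del self._timestamps[:i]
            del self._offsets[:i]
            return self._offsets[0]
```

The value returned by `prune_before` marks the region that could then be handed to fallocate() with FALLOC_FL_PUNCH_HOLE (which Linux requires to be combined with FALLOC_FL_KEEP_SIZE). That call is left out here because Python's standard library has no hole-punching wrapper.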