A Hacker News comment by GuB-42 mentioned encoding unlimited data in a single emoji using ZWJ (zero-width joiner) sequences. The main points are:
- Unicode represents text as a sequence of codepoints. For simple Latin-alphabet text, each visible character corresponds to exactly one codepoint.
- Unicode designates 256 codepoints as variation selectors (VS-1 to VS-256) to modify the presentation of the preceding character.
- We can concatenate variation selectors to represent arbitrary byte strings. For example, to encode "hello" as `[0x68, 0x65, 0x6c, 0x6c, 0x6f]`, we convert each byte to a variation selector and concatenate them after a base character.
- Decoding is straightforward: convert each variation selector back to its byte.
- This can be abused, such as sneaking data past human content filters or watermarking text.
- Regarding LLMs, tokenizers seem to preserve variation selectors as tokens, but models generally don't try to decode them internally. However, with a code interpreter, some models like Gemini 2 Flash and Claude can solve it.
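
The encode and decode steps in the list above can be sketched in Python. This is a minimal illustration, not the original author's code: the function names are hypothetical, and the mapping of byte value `n` to VS-(n+1) is an assumption. It relies on the two Unicode variation selector ranges, U+FE00 to U+FE0F (VS-1 to VS-16) and U+E0100 to U+E01EF (VS-17 to VS-256).

```python
from typing import Optional


def byte_to_variation_selector(b: int) -> str:
    """Map a byte (0-255) to one of the 256 variation selectors.

    Assumed mapping: bytes 0-15 use U+FE00..U+FE0F (VS-1..VS-16),
    bytes 16-255 use U+E0100..U+E01EF (VS-17..VS-256).
    """
    if not 0 <= b <= 255:
        raise ValueError("byte out of range")
    if b < 16:
        return chr(0xFE00 + b)
    return chr(0xE0100 + (b - 16))


def variation_selector_to_byte(ch: str) -> Optional[int]:
    """Inverse of the mapping above; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, data: bytes) -> str:
    """Append one variation selector per byte after the base character."""
    return base + "".join(byte_to_variation_selector(b) for b in data)


def decode(text: str) -> bytes:
    """Collect the first contiguous run of variation selectors as bytes."""
    out = bytearray()
    for ch in text:
        b = variation_selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            # The run of selectors has ended; stop decoding.
            break
    return bytes(out)


hidden = encode("😊", b"hello")
print(len(hidden))      # 1 base codepoint + 5 selectors
print(decode(hidden))
```

The encoded string usually renders as the base character alone, since variation selectors have no visible glyph of their own. Whether they survive copy and paste depends on the application handling the text.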