Smuggling arbitrary data through an emoji

Published on February 13

A Hacker News comment by GuB-42 mentioned that an unlimited amount of data could, in theory, be encoded in a single emoji using ZWJ (zero-width joiner) sequences. The main points are:

Unicode represents text as a sequence of codepoints. For simple Latin-alphabet text, codepoints map one-to-one to the characters shown on screen.
Unicode designates 256 codepoints as variation selectors (VS-1 to VS-256) to modify the presentation of the preceding character.
We can concatenate variation selectors to represent arbitrary byte strings. For example, to encode "hello", whose bytes are [0x68, 0x65, 0x6c, 0x6c, 0x6f], we convert each byte to a variation selector and append the selectors to a base character.
Decoding is straightforward: convert each variation selector back to its byte.
This can be abused, for example to sneak data past human content filters or to watermark text.
As for LLMs, tokenizers seem to preserve variation selectors as tokens, but models generally do not try to decode them on their own. Given a code interpreter, however, some models such as Gemini 2 Flash and Claude can decode the hidden data.
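
The encoding and decoding steps described above can be sketched in Python as follows. This is a minimal illustration, not necessarily the exact scheme the original author used. The byte-to-selector mapping is an assumption: bytes 0 to 15 map to VS-1 through VS-16 (U+FE00 to U+FE0F) and bytes 16 to 255 map to VS-17 through VS-256 (U+E0100 to U+E01EF). The function names `encode` and `decode` are chosen here for illustration.

```python
from typing import Optional


def byte_to_variation_selector(byte: int) -> str:
    """Map a byte (0-255) to one of the 256 variation selectors.

    Assumed mapping: byte n -> VS-(n+1).
    """
    if not 0 <= byte <= 255:
        raise ValueError("byte must be in range 0..255")
    if byte < 16:
        # VS-1 .. VS-16 occupy U+FE00 .. U+FE0F
        return chr(0xFE00 + byte)
    # VS-17 .. VS-256 occupy U+E0100 .. U+E01EF
    return chr(0xE0100 + (byte - 16))


def variation_selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for non-selector characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, data: bytes) -> str:
    """Append one variation selector per byte after the base character."""
    return base + "".join(byte_to_variation_selector(b) for b in data)


def decode(text: str) -> bytes:
    """Collect the first contiguous run of variation selectors as bytes."""
    out = bytearray()
    for ch in text:
        b = variation_selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            # The payload has ended once a non-selector follows it.
            break
    return bytes(out)


if __name__ == "__main__":
    hidden = encode("😊", b"hello")
    print(hidden)          # typically renders as just the emoji
    print(len(hidden))     # number of codepoints: base + one per byte
    print(decode(hidden))
```

Because the selectors usually have no visible rendering of their own, the encoded string tends to look identical to the bare base character, while its codepoint count grows by one per hidden byte.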
