Smuggling arbitrary data through an emoji

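A Hacker News comment by GuB-42 mentioned that ZWJ (zero-width joiner) sequences could, in theory, encode an unlimited amount of data in a single emoji. The approach summarized here uses variation selectors instead. The main points are: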

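  • Unicode represents text as a sequence of codepoints. In simple Latin-alphabet text, each character maps to exactly one codepoint.
  • Unicode designates 256 codepoints as variation selectors (VS-1 to VS-16 at U+FE00 to U+FE0F, VS-17 to VS-256 at U+E0100 to U+E01EF). A variation selector modifies the presentation of the preceding character and has no visible rendering of its own.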
  • We can concatenate variation selectors to represent arbitrary byte strings. For example, to encode "hello" as [0x68, 0x65, 0x6c, 0x6c, 0x6f], we convert each byte to a variation selector and concatenate them after a base character.
  • Decoding is straightforward: convert each variation selector back to its byte.
  • This can be abused, for example to sneak data past human content filters or to watermark text.
  • Regarding LLMs, tokenizers seem to preserve variation selectors as tokens, but models generally don't try to decode them internally. Given a code interpreter, however, some models such as Gemini 2 Flash and Claude can decode the hidden message.
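
The encoding and decoding steps described above can be sketched in Python. This is a minimal illustration, not the original author's code: the function names are hypothetical, and the byte-to-selector mapping (bytes 0 to 15 onto U+FE00 to U+FE0F, bytes 16 to 255 onto U+E0100 to U+E01EF) is an assumption, since the summary does not say which byte corresponds to which selector.

```python
from typing import Optional


def byte_to_variation_selector(byte: int) -> str:
    """Map one byte to a variation selector (assumed mapping, see lead-in)."""
    if not 0 <= byte <= 255:
        raise ValueError("byte must be in range 0..255")
    if byte < 16:
        return chr(0xFE00 + byte)          # VS-1 .. VS-16
    return chr(0xE0100 + (byte - 16))      # VS-17 .. VS-256


def variation_selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for any other character."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, data: bytes) -> str:
    """Append one variation selector per byte after the base character."""
    return base + "".join(byte_to_variation_selector(b) for b in data)


def decode(text: str) -> bytes:
    """Collect the first contiguous run of variation selectors as bytes."""
    out = bytearray()
    for ch in text:
        b = variation_selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break  # the run of selectors has ended
    return bytes(out)


encoded = encode("😊", b"hello")
print(len(encoded))      # 6 codepoints: the emoji plus five selectors
print(decode(encoded))   # b'hello'
```

The encoded string usually renders as the bare emoji, because variation selectors have no glyph of their own, yet it carries the five extra codepoints wherever it is copied intact.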