fleetwood.dev

  • Gall's Law: A complex system that works evolved from a simple working system.
  • Problem Statement: Self-attention in transformers needs positional information to determine relationships between tokens. Without it, the same token in different positions has identical outputs.
  • Motivating Example: In the sentence "The dog chased another dog", the same word "dog" represents different entities. Without positional encoding, the model treats them the same.
  • Desirable Properties:

    • Unique encoding for each position across sequences.
    • Linear relation between encoded positions.
    • Generalizes to longer sequences.
    • Generated by a deterministic process the model can learn.
    • Extensible to multiple dimensions.
  • Integer Position Encoding: Add the integer position value to each component of the token embedding. This fails because the values are unbounded and grow with position, swamping the semantic signal.
  • Binary Position Encoding: Convert the position to binary and add each bit to the corresponding component. This bounds the values, but the resulting encoding is discrete and "jumpy" rather than smooth. (Both baselines are sketched in code after the footnotes.)
  • Sinusoidal Positional Encoding: Use sine and cosine functions with geometrically increasing wavelengths, drawing even components from sine and odd components from cosine, with a base of θ = 10,000. It acts as a smooth version of the binary encoding and can express relative positions (sketched in code after the footnotes).
  • Absolute vs Relative Position Encoding: Sinusoidal encoding assigns a vector to each absolute position, but thanks to the trigonometric angle-addition identities the encoding of position pos + k is a linear transformation (a rotation) of the encoding of position pos (shown after the footnotes). In self-attention, relative position matters more than absolute position.
  • Positional Encoding in Context: Adding positional information directly to the token embedding pollutes the semantic information it carries; the key is to shift from an additive to a multiplicative encoding.
  • Rotary Positional Embedding (RoPE): Decompose the query and key vectors into 2D pairs and rotate each pair by an angle proportional to the token's position. Applying the rotations to q and k before their dot product makes the attention score depend only on relative position and yields a performance boost (sketched in code after the footnotes).
  • Extending RoPE to n Dimensions: For 2D data such as images, encode horizontal and vertical relative positions independently by pairing components within the same dimension (sketched in code after the footnotes).
  • The Future of Positional Encoding: Future breakthroughs may be inspired by signal processing or designed for low-precision arithmetic.
  • Conclusion: Positional encoding is important in transformers and can be discovered by iterative improvement. There is room for future innovation.
  • References: Links to relevant papers and blogs.
  • Footnotes:

    • Binary and Sinusoidal animations are from a specific video.
    • Using θ = 10000 gives a theoretical upper bound on context length.
    • Parts based on a fantastic post by Amirhossein Kazemnejad.
    • Empirical evidence from a post by EleutherAI.
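
The sketches below are not from the original post; they are minimal NumPy illustrations of the encodings summarized above, with illustrative function names. First, the two naive baselines: adding the raw integer position versus adding the bits of its binary representation.

```python
# Minimal sketch (assumption: function names are illustrative, not from the post).
import numpy as np

def integer_position_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Every component of position p gets the value p (unbounded magnitude)."""
    return np.repeat(np.arange(seq_len, dtype=np.float32)[:, None], dim, axis=1)

def binary_position_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Component i of position p gets the i-th bit of p (bounded but 'jumpy')."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    bits = (positions >> np.arange(dim)) & 1       # (seq_len, dim) via broadcasting
    return bits.astype(np.float32)

print(integer_position_encoding(4, 3))  # values grow without bound
print(binary_position_encoding(4, 3))   # bounded in {0, 1} but discontinuous
```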
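Next, a minimal sketch of the sinusoidal encoding, assuming an even embedding dimension and the base θ = 10,000 from the original Transformer paper.

```python
# Minimal sketch of sinusoidal positional encoding (assumes d_model is even).
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int, theta: float = 10_000.0) -> np.ndarray:
    positions = np.arange(seq_len, dtype=np.float32)[:, None]    # (seq_len, 1)
    i = np.arange(d_model // 2, dtype=np.float32)[None, :]       # (1, d_model/2)
    angles = positions / theta ** (2 * i / d_model)              # (seq_len, d_model/2)
    pe = np.empty((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

# Usage: the encoding is simply added to the token embeddings, e.g.
# x = token_embeddings + sinusoidal_position_encoding(seq_len, d_model)
```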
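Why relative positions fall out of the sinusoidal scheme: for each frequency, the angle-addition identities make the encoding at position p + k a rotation of the encoding at position p, with an angle that depends only on the offset k.

```latex
% For each frequency \omega_i = 1 / 10000^{2i/d_{\text{model}}}, the pair
% (\sin(\omega_i p), \cos(\omega_i p)) at position p + k is a rotation of the
% pair at position p by an angle depending only on k:
\begin{pmatrix} \sin\bigl(\omega_i (p+k)\bigr) \\ \cos\bigl(\omega_i (p+k)\bigr) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{pmatrix}
```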
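A minimal RoPE sketch, assuming an even head dimension and the same base θ = 10,000; `rope` is a hypothetical helper, not the post's code. The point is that q and k are rotated before the dot product, so the attention score depends only on their relative offset.

```python
# Minimal RoPE sketch (assumes dim is even; consecutive pairs (2j, 2j+1) are rotated).
import numpy as np

def rope(x: np.ndarray, theta: float = 10_000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, dim)."""
    seq_len, dim = x.shape
    positions = np.arange(seq_len, dtype=np.float32)[:, None]          # (seq_len, 1)
    freqs = theta ** (-np.arange(0, dim, 2, dtype=np.float32) / dim)   # (dim/2,)
    angles = positions * freqs                                         # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin   # 2D rotation of each pair
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# Usage inside attention (illustrative): rotate q and k, then take dot products.
# scores = rope(q) @ rope(k).T / np.sqrt(head_dim)
```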
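Finally, a minimal sketch of the 2D extension, again with hypothetical helpers: the first half of each vector is rotated by the horizontal coordinate and the second half by the vertical coordinate, so the two axes' relative positions are encoded independently.

```python
# Minimal 2D RoPE sketch (an assumption, not the post's exact code).
import numpy as np

def rotate_pairs(x: np.ndarray, pos: np.ndarray, theta: float = 10_000.0) -> np.ndarray:
    """Rotate consecutive pairs of x (n, d) by angles pos * freq, d even."""
    d = x.shape[1]
    freqs = theta ** (-np.arange(0, d, 2, dtype=np.float32) / d)   # (d/2,)
    angles = pos[:, None] * freqs                                  # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

def rope_2d(x: np.ndarray, xpos: np.ndarray, ypos: np.ndarray) -> np.ndarray:
    """x: (n_tokens, dim) with dim divisible by 4; xpos, ypos: (n_tokens,) coordinates."""
    half = x.shape[1] // 2
    return np.concatenate(
        [rotate_pairs(x[:, :half], xpos),   # first half: horizontal position
         rotate_pairs(x[:, half:], ypos)],  # second half: vertical position
        axis=1,
    )
```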