vu128:高效的可变长度整数

  • Why variable-length integers: Basic goal is to encode small numbers in fewer bytes. VLQ and LEB128 are popular for TLV formats like Protocol Buffers. The author initially planned to use LEB128 but found implementations too complex. A quick hack using a length-prefix byte was made and later benchmarked, showing it to be much faster than the optimized LEB128 code in rustc_serialize/leb128.rs. The conclusion is that LEB128 is a poor fit for today's deeply pipelined processor architectures and a new encoding named vu128 is introduced.
  • The vu128 format: Composed of up to N+1 octets for an N-octet unsigned integer. Supports integers up to 128 bits. Three layouts based on value range: single-byte for [0, 27), unary length prefix for [27, 228), and binary length prefix for [228, 2128). Signed integers use "zigzag" encoding. IEEE-754 floating-point values are byte-swapped to big-endian before encoding.
  • Rust implementation: Written in Rust with basic bit operations. Faster than LEB128 code in the Rust compiler on AMD Ryzen 7 3700X. Github: https://github.com/jmillikin/..., crates.io: https://crates.io/crates/vu128. The decoder accepts over-long encoding and truncates large values.
  • Python implementation: Intended for non-Rust readers. Not optimized for performance. Functions encode and decode vu128.
  • Appendix A: Original vu128 format: Before revisions based on online discussions. If integer is less than 0xF0, stored in a single octet. Otherwise, first octet is 0xF0 | ((number of non-zero octets) - 1), and non-zero octets are appended in little-endian order.
  • Appendix B: Original Rust implementation: Used for initial benchmarking before revisions. Encoder and decoder functions with macros for different integer sizes. Signed integers use "zigzag" encoding. Floating-point values are byte-swapped.
  • Earliest reference: Concept first mentioned in MIDI 1.0 in 1983 (big-endian variant). Little-endian variant named LEB128 in DWARF v2 in 1993.
阅读 10
0 条评论