Zero version

Lucene-Core version 8.8.2

One introduction

Lucene's index design relies on disk storage, and an inverted index supports tokenized search by storing a large amount of redundant data. Lucene therefore uses many compression techniques that trade time for space, so that as much data as possible is stored in as little disk space as possible. VInt is one of the more interesting of these designs.

Two technical principles

1 Overview

An ordinary int in Java occupies 4 bytes.
But when the value lies between -128 and 127, a single byte is enough to hold it, and the other three bytes are redundant (the same reasoning applies to the ranges that fit in two or three bytes). Values that actually need all four bytes are comparatively rare. VInt stands for variable-length int: it allocates bytes on demand in order to reduce this redundancy.

2 Byte indicator bit

A normal byte has eight data bits, while a byte in a VInt has only seven: the highest bit becomes an indicator bit that says whether another byte follows. The value is split into 7-bit groups, and the lowest group is written first.

  • The highest bit is 1, which means the next byte still belongs to the current value
  • The highest bit is 0, which means this is the last byte of the value
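
The byte layout described above can be sketched in a few lines of Java. This is a minimal illustration, not Lucene's own class: the class name VIntLayoutSketch and the helpers encode and toBits are hypothetical, and the bytes are collected in a ByteArrayOutputStream instead of being written to an index file.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration; not part of Lucene.
class VIntLayoutSketch {

    // Splits the value into 7-bit groups, lowest group first, and sets the
    // highest bit of every byte except the last one.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {         // bits above the low seven remain
            out.write((value & 0x7F) | 0x80);  // 7 data bits, indicator bit = 1
            value >>>= 7;
        }
        out.write(value);                      // last byte, indicator bit = 0
        return out.toByteArray();
    }

    // Renders each byte as eight binary digits, separated by spaces.
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            String s = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", s).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits(encode(127)));   // 01111111
        System.out.println(toBits(encode(128)));   // 10000000 00000001
        System.out.println(toBits(encode(16384))); // 10000000 10000000 00000001
    }
}
```

In the output, every byte except the last starts with 1, and the last byte starts with 0.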

3 Side effects of VInt

  • For positive numbers, each byte carries only seven data bits, so a large int (2^28 or above) takes 5 bytes, one more than a plain int. VInt makes no attempt to solve this, because such values are uncommon in real production data.
  • For negative numbers, the highest bit (the sign bit) is 1, so the value always takes 5 bytes and cannot be compressed. This is the reason zigzag encoding is introduced.
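
To make both side effects concrete, the following sketch counts how many bytes a raw VInt (without zigzag encoding) would take. The class VIntLengthSketch and the method vIntLength are hypothetical names used only for this illustration; the loop condition is the same one used when writing a VInt.

```java
// Hypothetical helper class for illustration; not part of Lucene.
class VIntLengthSketch {

    // Number of bytes needed to store the value as a raw VInt.
    static int vIntLength(int value) {
        int length = 1;
        while ((value & ~0x7F) != 0) {  // more than seven significant bits left
            value >>>= 7;               // unsigned shift, so negatives terminate
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(vIntLength(127));            // 1
        System.out.println(vIntLength(128));            // 2
        System.out.println(vIntLength((1 << 28) - 1));  // 4
        System.out.println(vIntLength(1 << 28));        // 5
        System.out.println(vIntLength(-1));             // 5
    }
}
```

The last line shows the problem with negative numbers: even -1, whose absolute value is tiny, needs five bytes when written directly.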

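4 Zigzag encoding

Zigzag encoding uses shift and XOR operations to move the sign bit from the highest position to the lowest position of the data, so that numbers with a small absolute value, whether positive or negative, become small non-negative numbers.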
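
As a quick illustration of the resulting mapping, the sketch below applies the shift-and-XOR expression (the same expression that appears in the source code section) to a few sample values. The class name ZigZagMappingSketch and the sample values are chosen only for this example.

```java
// Hypothetical helper class for illustration; not part of Lucene.
class ZigZagMappingSketch {

    // Same expression as the zigZagEncode method quoted in the source code section.
    static int zigZagEncode(int i) {
        return (i >> 31) ^ (i << 1);
    }

    public static void main(String[] args) {
        int[] samples = {0, -1, 1, -2, 2, 200, -200};
        for (int v : samples) {
            System.out.println(v + " -> " + zigZagEncode(v));
        }
        // Prints:
        // 0 -> 0
        // -1 -> 1
        // 1 -> 2
        // -2 -> 3
        // 2 -> 4
        // 200 -> 400
        // -200 -> 399
    }
}
```

Non-negative values land on even numbers and negative values on odd numbers, so the lowest bit of the encoded value carries the sign.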

Three Demo

Suppose the three int values 1, 200 and -1 need to be serialized. The specific steps of the algorithm are as follows.

1 Binarization

  • 1 in binary is 00000000 00000000 00000000 00000001
  • 200 in binary is 00000000 00000000 00000000 11001000
  • -1 in binary (two's complement) is 11111111 11111111 11111111 11111111

2 Shift left by one bit, filling the lowest bit with 0

  • 1 becomes 00000000 00000000 00000000 00000010
  • 200 becomes 00000000 00000000 00000001 10010000
  • -1 becomes 11111111 11111111 11111111 11111110

3 XOR operation

XOR yields 1 where the two bits differ and 0 where they are the same. The shifted value is XORed with a mask made of 32 copies of the original sign bit (i >> 31).

  • For positive numbers (and 0), the mask is 00000000 00000000 00000000 00000000, so the value is unchanged

    • The processing expression for 1 is 00000000 00000000 00000000 00000010 ^ 00000000 00000000 00000000 00000000 = 00000000 00000000 00000000 00000010
    • The processing expression for 200 is 00000000 00000000 00000001 10010000 ^ 00000000 00000000 00000000 00000000 = 00000000 00000000 00000001 10010000
  • For negative numbers, the mask is 11111111 11111111 11111111 11111111, so every bit is flipped

    • The processing expression for -1 is 11111111 11111111 11111111 11111110 ^ 11111111 11111111 11111111 11111111 = 00000000 00000000 00000000 00000001
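
The binarization, shift and XOR steps can be checked by letting Java print each intermediate value. This is only a sketch for verification: the class ZigZagStepsSketch and its helpers bits32 and steps are hypothetical names, and the output is grouped into bytes to match the notation used in this demo.

```java
// Hypothetical helper class for illustration; not part of Lucene.
class ZigZagStepsSketch {

    // Formats an int as 32 binary digits, grouped into four bytes.
    static String bits32(int v) {
        String s = String.format("%32s", Integer.toBinaryString(v)).replace(' ', '0');
        return s.substring(0, 8) + " " + s.substring(8, 16) + " "
                + s.substring(16, 24) + " " + s.substring(24);
    }

    // Returns {original, shifted left by one, sign mask, XOR result}.
    static String[] steps(int v) {
        int shifted = v << 1;   // step 2: shift left, lowest bit becomes 0
        int mask = v >> 31;     // all 0 for v >= 0, all 1 for v < 0
        return new String[] {
            bits32(v), bits32(shifted), bits32(mask), bits32(mask ^ shifted)
        };
    }

    public static void main(String[] args) {
        for (int v : new int[] {1, 200, -1}) {
            String[] s = steps(v);
            System.out.println("value   " + v);
            System.out.println("binary  " + s[0]);
            System.out.println("shifted " + s[1]);
            System.out.println("mask    " + s[2]);
            System.out.println("result  " + s[3]);
        }
    }
}
```

For -1 the result line reads 00000000 00000000 00000000 00000001, which is the value 1.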

4 Process the data in units of eight bits

When writing, the zigzag-encoded value is consumed seven bits at a time, starting from the lowest bits, and each group is stored in one byte whose highest bit is the indicator bit. When reading, the data is read eight bits at a time: the first bit of each byte is treated as the indicator bit, and if it says more data follows, another eight bits are read.

  • For the number 1 (zigzag-encoded as 00000010)

    • Serialization process:

      • Take the lowest seven bits, 0000010. All the remaining higher bits are 0, so there is no more data: put 0 in front, giving 00000010
    • Reading process:

      • Read the serialized data 00000010. The first bit is 0, which means there is only one byte and no other data behind it, so the data is 0000010
      • The last bit is 0, which means it is a positive number (or zero), so the XOR mask is all 0
      • Shift the data right by one bit, filling the front with 0, to get 00000001. XOR with the all-0 mask leaves it unchanged, so the final value is 1
  • For the number 200 (zigzag-encoded as 00000001 10010000)

    • Serialization process:

      • Take the lowest seven bits, 0010000. The remaining higher bits are not all 0, so put 1 in front, giving 10010000
      • Take the next seven bits, 0000011. The remaining higher bits are all 0, so put 0 in front, giving 00000011
      • The combined data is 10010000 00000011
    • Reading process:

      • Read the serialized data 10010000. The first bit is 1, which means another byte follows, and the data is 0010000
      • Then read 00000011. The first bit is 0, which means there is no other data, and the data is 0000011
      • Place the later group in front of the earlier one: 0000011 0010000, which is 00000001 10010000
      • The last bit is 0, which means it is a positive number, so the XOR mask is all 0
      • Shift the data right by one bit, filling the front with 0. The final value is 00000000 11001000, which is 200
  • For the number -1 (zigzag-encoded as 00000001)

    • Serialization process:

      • Take the lowest seven bits, 0000001. All the remaining higher bits are 0, so there is no more data: put 0 in front, giving 00000001
    • Reading process:

      • Read the serialized data 00000001. The first bit is 0, which means there is only one byte and no other data behind it, so the data is 0000001
      • The last bit is 1, which means it is a negative number, so the XOR mask is all 1
      • Shift the data right by one bit, filling the front with 0, to get 00000000 00000000 00000000 00000000. XOR with the all-1 mask gives 11111111 11111111 11111111 11111111, which is -1
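
The serialization and reading processes for the three numbers can be replayed with a small round-trip sketch. It is an assumption-laden illustration rather than Lucene's code: the class ZIntRoundTripSketch is hypothetical, a byte array stands in for the disk, and the reader uses a simple loop with no validation of malformed input.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class for illustration; not part of Lucene.
class ZIntRoundTripSketch {

    // Zigzag-encodes the value, then writes it in 7-bit groups, lowest first.
    static byte[] writeZInt(int value) {
        int z = (value >> 31) ^ (value << 1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7F) != 0) {
            out.write((z & 0x7F) | 0x80);  // more data follows
            z >>>= 7;
        }
        out.write(z);                      // last byte
        return out.toByteArray();
    }

    // Reassembles the 7-bit groups, then undoes the zigzag encoding.
    // Assumes the bytes were produced by writeZInt above.
    static int readZInt(byte[] bytes) {
        int z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (b & 0x7F) << shift;
            shift += 7;
            if (b >= 0) break;             // indicator bit is 0: last byte
        }
        return (z >>> 1) ^ -(z & 1);
    }

    // Hex dump of the serialized bytes, e.g. "90 03".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int v : new int[] {1, 200, -1}) {
            byte[] bytes = writeZInt(v);
            System.out.println(v + " -> [" + hex(bytes) + "] -> " + readZInt(bytes));
        }
        // Prints:
        // 1 -> [02] -> 1
        // 200 -> [90 03] -> 200
        // -1 -> [01] -> -1
    }
}
```

The bytes 90 03 are 10010000 00000011 in binary, matching the combined data for 200.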

Four source code

0 Process

The source code below is called in the following order:

  • Lucene has an int value to store
  • Call zigZagEncode(...) to encode the int into a zint
  • Call the writeVInt(...) method to encode the zint into a vint and write it to disk or another in-memory container
  • Call the readVInt() method to read a vint from disk or another in-memory container and restore it to a zint
  • Call the zigZagDecode(...) method to decode the zint back into the int

1 writeZInt

The writeZInt(...) method is in org.apache.lucene.store.DataOutput:

    // Writes an int value after zigzag-encoding it
    public final void writeZInt(int i) throws IOException {
      // BitUtil.zigZagEncode(i) performs the zigzag encoding
      writeVInt(BitUtil.zigZagEncode(i));
    }

    // Writes a VInt
    public final void writeVInt(int i) throws IOException {
      // Loop while any bit above the lowest seven is still set
      while ((i & ~0x7F) != 0) {
        // writeByte(...) persists the byte to the file; its details can be ignored for now
        writeByte((byte)((i & 0x7F) | 0x80));
        i >>>= 7;
      }
      writeByte((byte)i);
    }

2 zigZagEncode

The zigZagEncode(...) method is in org.apache.lucene.util.BitUtil:

    // For a positive number or 0, i >> 31 yields a mask of all 0 bits
    // For a negative number, i >> 31 yields a mask of all 1 bits
    public static int zigZagEncode(int i) {
      return (i >> 31) ^ (i << 1);
    }

3 readZInt

The readZInt() method is in org.apache.lucene.store.DataInput:

    public int readZInt() throws IOException {
      return BitUtil.zigZagDecode(readVInt());
    }

    public int readVInt() throws IOException {
      // Read one byte from disk
      byte b = readByte();
      // b >= 0 means the highest bit is 0, so no more bytes follow; the same check is repeated below
      if (b >= 0) return b;
      int i = b & 0x7F;
      // Read another byte
      b = readByte();
      i |= (b & 0x7F) << 7;
      if (b >= 0) return i;
      // Read another byte
      b = readByte();
      i |= (b & 0x7F) << 14;
      if (b >= 0) return i;
      // Read another byte
      b = readByte();
      i |= (b & 0x7F) << 21;
      if (b >= 0) return i;
      // Read another byte; under VInt encoding a value takes at most five bytes
      b = readByte();
      // Only the low four bits of the fifth byte carry data (4 * 7 + 4 = 32)
      i |= (b & 0x0F) << 28;
      if ((b & 0xF0) == 0) return i;
      throw new IOException("Invalid vInt detected (too many bits)");
    }
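
The special handling of the fifth byte may be easier to see in a loop form with hand-built inputs. The sketch below is a hypothetical rewrite for illustration only (the class VIntFifthByteSketch is not part of Lucene): it reads from a byte array instead of calling readByte(), and it assumes the array is not truncated.

```java
import java.io.IOException;

// Hypothetical helper class for illustration; not part of Lucene.
class VIntFifthByteSketch {

    // Loop form of the unrolled reader: bytes 1 to 4 contribute seven data
    // bits each (28 bits in total), and byte 5 may contribute only four.
    static int readVInt(byte[] bytes) throws IOException {
        int i = 0;
        for (int n = 0; n < 4; n++) {
            byte b = bytes[n];
            i |= (b & 0x7F) << (7 * n);
            if (b >= 0) return i;          // indicator bit is 0: done
        }
        byte b = bytes[4];
        i |= (b & 0x0F) << 28;             // bits 28..31 of the int
        if ((b & 0xF0) == 0) return i;     // no indicator bit, no extra data bits
        throw new IOException("Invalid vInt detected (too many bits)");
    }

    public static void main(String[] args) throws IOException {
        byte[] max = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x07};
        byte[] allOnes = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x0F};
        byte[] tooManyBits = {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 0x1F};

        System.out.println(readVInt(max) == Integer.MAX_VALUE); // true
        System.out.println(readVInt(allOnes));                  // -1
        try {
            readVInt(tooManyBits);
        } catch (IOException e) {
            System.out.println(e.getMessage());  // Invalid vInt detected (too many bits)
        }
    }
}
```

A fifth byte of 0x1F would need a 33rd bit, which an int does not have, so the reader rejects it instead of silently dropping data.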

4 zigZagDecode

The zigZagDecode(...) method is in org.apache.lucene.util.BitUtil:

    // Decoding is the exact inverse of zigZagEncode(...)
    public static int zigZagDecode(int i) {
      return ((i >>> 1) ^ -(i & 1));
    }
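
That decoding really undoes encoding can be spot-checked on a few values, including the two extremes of the int range. The sketch below copies the two one-line expressions into a hypothetical class, ZigZagInverseSketch, so that it runs without Lucene on the classpath; the sample values are arbitrary.

```java
// Hypothetical helper class for illustration; not part of Lucene.
class ZigZagInverseSketch {

    static int zigZagEncode(int i) {
        return (i >> 31) ^ (i << 1);
    }

    static int zigZagDecode(int i) {
        // i >>> 1 drops the sign flag; -(i & 1) is all 1 bits when the flag is set
        return (i >>> 1) ^ -(i & 1);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 200, -200, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int v : samples) {
            int encoded = zigZagEncode(v);
            int decoded = zigZagDecode(encoded);
            // The encoded value is shown as unsigned, which is how VInt treats it
            System.out.println(v + " -> " + Integer.toUnsignedString(encoded)
                    + " -> " + decoded);
        }
    }
}
```

Integer.MAX_VALUE and Integer.MIN_VALUE encode to 4294967294 and 4294967295 when read as unsigned values, so zigzag encoding saves nothing for numbers with a large absolute value; it only helps when the absolute value is small.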

三流