
Introduction

Sets are a fundamental abstraction in software. There are many ways to implement a set, such as hash sets and trees. For a set of integers, a bitmap (also known as a bitset or bit vector) is a good choice. Using n bits, we can represent the integer range [0, n). If integer i is in the set, the i-th bit is set to 1. The intersection, union, and difference of such sets can then be implemented with bitwise AND, bitwise OR, and bitwise AND NOT on machine words, and computers perform these bit operations very quickly.
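As a minimal sketch of the idea (not the bitset library's actual implementation), a set over [0, n) can be kept in a slice of 64-bit words. The type `bits` and its helper functions are hypothetical names chosen for this illustration; set difference is written as AND NOT (`&^` in Go).

```go
package main

import "fmt"

// bits is a hypothetical plain bitmap: bit i marks whether integer i is present.
type bits []uint64

// newBits allocates enough 64-bit words to cover the range [0, n).
func newBits(n int) bits { return make(bits, (n+63)/64) }

func (b bits) add(i int) { b[i/64] |= 1 << uint(i%64) }

func (b bits) contains(i int) bool { return b[i/64]&(1<<uint(i%64)) != 0 }

// members lists the integers in the set in ascending order.
func (b bits) members() []int {
	var out []int
	for i := 0; i < len(b)*64; i++ {
		if b.contains(i) {
			out = append(out, i)
		}
	}
	return out
}

// combine applies op word by word; a and b are assumed to have equal length.
func combine(a, b bits, op func(x, y uint64) uint64) bits {
	out := make(bits, len(a))
	for i := range a {
		out[i] = op(a[i], b[i])
	}
	return out
}

func and(a, b bits) bits    { return combine(a, b, func(x, y uint64) uint64 { return x & y }) }
func or(a, b bits) bits     { return combine(a, b, func(x, y uint64) uint64 { return x | y }) }
func andNot(a, b bits) bits { return combine(a, b, func(x, y uint64) uint64 { return x &^ y }) }

func main() {
	x, y := newBits(128), newBits(128)
	for _, v := range []int{1, 2, 3, 100} {
		x.add(v)
	}
	for _, v := range []int{2, 3, 4} {
		y.add(v)
	}
	fmt.Println(and(x, y).members())    // [2 3]
	fmt.Println(or(x, y).members())     // [1 2 3 4 100]
	fmt.Println(andNot(x, y).members()) // [1 100]
}
```

Each operation touches 64 elements per machine instruction, which is where the speed of bitmaps comes from.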

In the last article I introduced the bitset library.

A bitset can consume a lot of memory in some scenarios. For example, setting only the 1,000,000th bit still requires more than 100 KB of memory, because every bit below it has to be allocated as well. To address this, the author of the bitset library developed a compressed bitmap library: roaring.
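The memory figure can be checked with a little arithmetic. The sketch below assumes a plain bitmap backed by 64-bit words, as in the earlier description; `bytesForBit` is a hypothetical helper, not part of any library.

```go
package main

import "fmt"

// bytesForBit returns how many bytes an uncompressed bitmap made of 64-bit
// words needs so that bit index i can be set.
func bytesForBit(i uint64) uint64 {
	words := i/64 + 1 // words 0..i/64 must all exist
	return words * 8  // 8 bytes per word
}

func main() {
	fmt.Println(bytesForBit(1000000)) // 125008 bytes, roughly 122 KB
}
```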

This article first introduces how to use roaring, and then analyzes roaring's file storage format.

Install

The code in this article uses Go Modules.

Create a directory and initialize:

 $ mkdir -p roaring && cd roaring
$ go mod init github.com/darjun/go-daily-lib/roaring

Install the roaring library:

 $ go get -u github.com/RoaringBitmap/roaring

Usage

Basic operation

 package main

import (
  "fmt"

  "github.com/RoaringBitmap/roaring"
)

func main() {
  bm1 := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
  fmt.Println(bm1.String())         // {1,2,3,4,5,100,1000}
  fmt.Println(bm1.GetCardinality()) // 7
  fmt.Println(bm1.Contains(3))      // true

  bm2 := roaring.BitmapOf(1, 100, 500)
  fmt.Println(bm2.String())         // {1,100,500}
  fmt.Println(bm2.GetCardinality()) // 3
  fmt.Println(bm2.Contains(300))    // false

  bm3 := roaring.New()
  bm3.Add(1)
  bm3.Add(11)
  bm3.Add(111)
  fmt.Println(bm3.String())         // {1,11,111}
  fmt.Println(bm3.GetCardinality()) // 3
  fmt.Println(bm3.Contains(11))     // true

  bm1.Or(bm2)                       // union; the result is stored in bm1
  fmt.Println(bm1.String())         // {1,2,3,4,5,100,500,1000}
  fmt.Println(bm1.GetCardinality()) // 8
  fmt.Println(bm1.Contains(500))    // true

  bm2.And(bm3)                      // intersection; the result is stored in bm2
  fmt.Println(bm2.String())         // {1}
  fmt.Println(bm2.GetCardinality()) // 1
  fmt.Println(bm2.Contains(1))      // true
}

The above demonstrates two ways to create a roaring bitmap:

  • roaring.BitmapOf() : Pass in collection elements, create a bitmap and add those elements
  • roaring.New() : Create an empty bitmap

First, we create the bitmap bm1: {1,2,3,4,5,100,1000}. We print its string representation and its size, and check whether 3 is in the set.

Then we create the bitmap bm2: {1,100,500} and print the same three pieces of information.

Then we create an empty bitmap bm3, add the elements 1, 11, and 111 in turn, and again print the same three pieces of information.

Next, we take the union of bm1 and bm2; the result is stored directly in bm1. Because a set holds each element only once, bm1 now contains {1,2,3,4,5,100,500,1000} and its size is 8.

Finally, we take the intersection of bm2 and bm3; the result is stored directly in bm2. bm2 now contains only {1} and its size is 1.

As you can see, the basic operations provided by roaring are roughly the same as those of bitset. The naming is completely different, however, so pay special attention when using it.

  • bm.String() : returns the string representation of the bitmap
  • bm.Add(n) : add element n
  • bm.GetCardinality() : Returns the Cardinality of the set, that is, the number of elements
  • bm1.And(bm2) : Execute set intersection, which will modify bm1
  • bm1.Or(bm2) : Execute set union, which will modify bm1

Iteration

Roaring bitmaps support iteration.

 func main() {
  bm := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)

  i := bm.Iterator()
  for i.HasNext() {
    fmt.Println(i.Next())
  }
}

As with iterators in many programming languages, we first call the bitmap's Iterator() method to obtain an iterator, then call HasNext() to check whether there is a next element and Next() to return it.

The above code outputs 1, 2, 3, 4, 5, 100, 1000 in sequence.

Parallel operation

Roaring supports parallel execution of set operations on bitmaps. You can specify how many goroutines to use for an intersection, union, and so on, and pass in a variable number of bitmaps:

 func main() {
  bm1 := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
  bm2 := roaring.BitmapOf(1, 100, 500)
  bm3 := roaring.BitmapOf(1, 10, 1000)

  bmAnd := roaring.ParAnd(4, bm1, bm2, bm3)
  fmt.Println(bmAnd.String())         // {1}
  fmt.Println(bmAnd.GetCardinality()) // 1
  fmt.Println(bmAnd.Contains(1))      // true
  fmt.Println(bmAnd.Contains(100))    // false

  bmOr := roaring.ParOr(4, bm1, bm2, bm3)
  fmt.Println(bmOr.String())         // {1,2,3,4,5,10,100,500,1000}
  fmt.Println(bmOr.GetCardinality()) // 9
  fmt.Println(bmOr.Contains(10))     // true
}

Parallel operation uses the Par* version of the corresponding interface. The first parameter specifies the number of workers, and then any number of bitmaps are passed in.

Write and read

Roaring can write compressed bitmaps to files in a format compatible with implementations in other languages. That is, we can use Go to write a roaring bitmap to a file, then send it over the network to another machine, where it can be read using a C++ or Java implementation.

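 func main() {
  bm := roaring.BitmapOf(1, 3, 5, 7, 100, 300, 500, 700)

  buf := &bytes.Buffer{}
  // WriteTo returns (size, err); always check the error.
  if _, err := bm.WriteTo(buf); err != nil {
    log.Fatal(err)
  }

  newBm := roaring.New()
  // ReadFrom also returns (size, err).
  if _, err := newBm.ReadFrom(buf); err != nil {
    log.Fatal(err)
  }
  if bm.Equals(newBm) {
    fmt.Println("write and read back ok.")
  }
}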
  • WriteTo(w io.Writer) : writes to an io.Writer, which can be memory (bytes.Buffer), a file (os.File), or even the network (net.Conn)
  • ReadFrom(r io.Reader) : reads from an io.Reader, whose source can likewise be memory, a file, or the network

Note that WriteTo returns size and err, and the error needs to be handled. ReadFrom also returns size and err, which need to be handled in the same way.

64-bit version

By default, roaring bitmaps can only be used to store 32-bit integers. So the roaring bitmap can contain at most 4294967296 ( 2^32 ) integers.

Roaring also provides an extension for storing 64-bit integers, namely github.com/RoaringBitmap/roaring/roaring64. The interface it provides is basically the same. However, the 64-bit version is not guaranteed to be compatible with the formats used by other implementations such as Java/C++.

Storage format

Roaring can write bitmaps to files and read them back, in a format that is compatible across multiple languages. Let's take a look at this storage format.

Roaring bitmaps store only 32-bit integers by default. When serializing, these integers are stored in containers. Each container has a key and a cardinality (the number of elements, in the range [1, 2^16]), each described by 16 bits. The key is the most significant 16 bits of an element, so keys fall in the range [0, 65536). Thus, if two integers share the same most significant 16 bits, they are stored in the same container. This has another benefit: it takes up less space, because a container only needs to store the least significant 16 bits of each element.
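The split into key and container value is plain bit arithmetic. Here is a small sketch; `split` and `join` are hypothetical helper names used only for this illustration.

```go
package main

import "fmt"

// split breaks a 32-bit value into the container key (most significant 16 bits)
// and the value stored inside the container (least significant 16 bits).
func split(v uint32) (key, low uint16) {
	return uint16(v >> 16), uint16(v)
}

// join is the inverse of split.
func join(key, low uint16) uint32 {
	return uint32(key)<<16 | uint32(low)
}

func main() {
	for _, v := range []uint32{1, 65535, 65536, 131073} {
		key, low := split(v)
		fmt.Println(v, "-> key", key, "low", low, "-> joined", join(key, low))
	}
}
```

Here 1 and 65535 share key 0 and would land in the same container, while 65536 (key 1) and 131073 (key 2) would each need a container of their own.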

All integers are stored in little endian.

Overview

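The storage format used by roaring is laid out as follows, from the beginning of the stream to the end: Cookie Header, Descriptive Header, Offset Header (optional), and finally the containers.

Each part is introduced below, from top to bottom.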

The stream begins with a Cookie Header. It identifies whether a binary stream is a roaring bitmap and stores a small amount of information.

The word cookie is an interesting choice; it originally means a small biscuit. My understanding is that it refers to a small object, which is why cookies in HTTP are only used to store small amounts of information. The same goes for the Cookie Header here.

Next is the Descriptive Header. As the name suggests, it describes the containers. Containers are covered in detail later.

Next comes an optional Offset Header. It records the offset of each container relative to the beginning of the stream, which gives us random access to any container.

The last part is the container that stores the actual data. There are 3 types of containers in roaring:

  • array (array type): a sorted array of 16-bit integers
  • bitset (bitset type): uses the bitset introduced in the previous article to store data
  • run: named after run-length encoding, which encodes data as length + value. For example, "0000000000" can be encoded as "10,0", meaning ten 0s. The run container works in a similar way and is described in detail later.
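To make the run-length idea concrete, here is a minimal sketch of generic run-length encoding on a string. It is not roaring's run container format; the function name `rle` and the ';' separator between runs are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// rle encodes each run of identical bytes as "count,char" and joins the
// runs with ';'.
func rle(s string) string {
	var parts []string
	for i := 0; i < len(s); {
		j := i
		for j < len(s) && s[j] == s[i] {
			j++
		}
		parts = append(parts, fmt.Sprintf("%d,%c", j-i, s[i]))
		i = j
	}
	return strings.Join(parts, ";")
}

func main() {
	fmt.Println(rle("0000000000")) // 10,0
	fmt.Println(rle("0001111"))    // 3,0;4,1
}
```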

This layout is designed so that the data can be accessed randomly without loading the entire bitmap into memory. The range covered by each container is also independent of the others, which makes parallel computation easy.

Cookie Header

There are two types of Cookie Header, occupying 64 bits and 32 bits respectively.

In the first type, the value of the first 32 bits is 12346, and the next 32 bits hold the number of containers (denoted n). This cookie value also means that no run containers follow. The magic number 12346 is defined as the constant SERIAL_COOKIE_NO_RUNCONTAINER, whose meaning is self-explanatory.

In the second type, the least significant 16 bits of the first 32 bits hold the value 12347, and the most significant 16 bits hold the number of containers minus 1. Shifting the 32-bit value right by 16 bits and adding 1 gives the number of containers n. Since a header of this type never describes 0 containers, storing n - 1 lets 16 bits represent one more container, up to 65536. This trick is used in many places, such as redis. The next (n+7)/8 bytes form a bitset that indicates which of the following containers are run containers. Each bit corresponds to one container: 1 means the container is a run container, and 0 means it is not.
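The two header variants can be sketched from the writer's side as well. The function `cookieHeader` below is a hypothetical helper, not part of the roaring API; it assumes 1 <= n <= 65536 and that runFlags is either nil or has exactly n entries.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	serialCookieNoRun = 12346 // SERIAL_COOKIE_NO_RUNCONTAINER
	serialCookie      = 12347 // SERIAL_COOKIE
)

// cookieHeader builds the Cookie Header bytes for n containers.
// runFlags[i] reports whether container i is a run container.
func cookieHeader(n int, runFlags []bool) []byte {
	hasRun := false
	for _, f := range runFlags {
		hasRun = hasRun || f
	}
	if !hasRun {
		// First type: 32-bit cookie followed by a 32-bit container count.
		buf := make([]byte, 8)
		binary.LittleEndian.PutUint32(buf[0:], serialCookieNoRun)
		binary.LittleEndian.PutUint32(buf[4:], uint32(n))
		return buf
	}
	// Second type: cookie in the low 16 bits, n-1 in the high 16 bits,
	// followed by (n+7)/8 bytes of run flags.
	buf := make([]byte, 4+(n+7)/8)
	binary.LittleEndian.PutUint32(buf[0:], serialCookie|uint32(n-1)<<16)
	for i, f := range runFlags {
		if f {
			buf[4+i/8] |= 1 << uint(i%8)
		}
	}
	return buf
}

func main() {
	fmt.Printf("% x\n", cookieHeader(1, nil))                       // 3a 30 00 00 01 00 00 00
	fmt.Printf("% x\n", cookieHeader(3, []bool{true, false, true})) // 3b 30 02 00 05
}
```

In the second line of output, `02 00` is the container count minus 1 and `05` (binary 101) flags containers 0 and 2 as run containers.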

Because integers are stored in little-endian order, the first 16 bits of the stream must be 12346 or 12347. Any other value means the file is corrupted, and the program can exit immediately.

Descriptive Header

After the Cookie Header comes the Descriptive Header. It uses a pair of 16-bit values to describe each container: one stores the key (the most significant 16 bits of the integers), and the other stores the container's cardinality minus 1 (the same trick again), where the cardinality is the number of integers stored in the container. If there are n containers, the Descriptive Header takes 32n bits, or 4n bytes.

After scanning the Descriptive Header, we know the type of each container. If the cookie value is 12347, a bitset after the cookie indicates whether each container is a run container. For a non-run container, a cardinality less than or equal to 4096 means it is an array container; otherwise, it is a bitset container.
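The decision rule described above fits in a few lines. This is a sketch under the stated rules; `containerType` is a hypothetical helper and returns a string only for readability.

```go
package main

import "fmt"

// containerType names the kind of container i, given the run-flag bitset
// (nil when the cookie is 12346) and the container's cardinality.
func containerType(i int, runFlags []byte, card int) string {
	if runFlags != nil && runFlags[i/8]&(1<<uint(i%8)) != 0 {
		return "run"
	}
	if card <= 4096 {
		return "array"
	}
	return "bitset"
}

func main() {
	fmt.Println(containerType(0, nil, 8))           // array
	fmt.Println(containerType(0, nil, 5000))        // bitset
	fmt.Println(containerType(2, []byte{0x05}, 10)) // run
	fmt.Println(containerType(1, []byte{0x05}, 10)) // array
}
```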

Offset Header

An Offset Header exists when any of the following conditions are met:

  • The value of the cookie is SERIAL_COOKIE_NO_RUNCONTAINER (i.e. 12346)
  • The value of the cookie is SERIAL_COOKIE (i.e. 12347) and there are at least 4 containers. This threshold is defined as the constant NO_OFFSET_THRESHOLD = 4

The Offset Header uses a 32-bit value for each container to store the offset of the corresponding container from the beginning of the stream, in bytes.
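Putting the three headers together, we can compute where the first container starts. The sketch below follows the sizes given in the text; `hasOffsetHeader` and `headerSize` are hypothetical helpers. The first result, 16, agrees with the "offset 0 16" line in the sample output near the end of this article.

```go
package main

import "fmt"

const (
	serialCookieNoRun = 12346 // SERIAL_COOKIE_NO_RUNCONTAINER
	serialCookie      = 12347 // SERIAL_COOKIE
	noOffsetThreshold = 4     // NO_OFFSET_THRESHOLD
)

// hasOffsetHeader applies the two conditions listed above.
func hasOffsetHeader(cookie uint16, n int) bool {
	return cookie == serialCookieNoRun ||
		(cookie == serialCookie && n >= noOffsetThreshold)
}

// headerSize returns the combined size in bytes of the Cookie, Descriptive and
// (optional) Offset headers, i.e. the offset of the first container.
func headerSize(cookie uint16, n int) int {
	var size int
	if cookie == serialCookieNoRun {
		size = 8 // 32-bit cookie + 32-bit container count
	} else {
		size = 4 + (n+7)/8 // 32-bit cookie + run-flag bitset
	}
	size += 4 * n // Descriptive Header: 16-bit key + 16-bit (cardinality - 1)
	if hasOffsetHeader(cookie, n) {
		size += 4 * n // Offset Header: one 32-bit offset per container
	}
	return size
}

func main() {
	fmt.Println(headerSize(serialCookieNoRun, 1)) // 16
	fmt.Println(headerSize(serialCookie, 2))      // 13
	fmt.Println(headerSize(serialCookie, 4))      // 37
}
```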

Container

Next comes the container that actually stores the data. As briefly mentioned earlier, there are three types of containers.

array

An array container stores a sorted sequence of 16-bit unsigned integers, which makes binary search possible. Each 16-bit value is only the least significant 16 bits of the data. Recall that each container has a 16-bit key in the Descriptive Header; combining the key with a stored value gives the actual integer.

If the container has x values, it takes 2x bytes of space.
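Because the values are sorted, a membership test is a binary search on the low 16 bits. The following is a sketch using the standard library's sort.Search; `arrayContains` is a hypothetical helper, not roaring's internal code.

```go
package main

import (
	"fmt"
	"sort"
)

// arrayContains reports whether the 32-bit value v is present in an array
// container with the given key and sorted low-16-bit values.
func arrayContains(key uint16, lows []uint16, v uint32) bool {
	if uint16(v>>16) != key {
		return false // v belongs to a different container
	}
	low := uint16(v)
	i := sort.Search(len(lows), func(i int) bool { return lows[i] >= low })
	return i < len(lows) && lows[i] == low
}

func main() {
	lows := []uint16{1, 3, 5, 7, 100, 300, 500, 700}
	fmt.Println(arrayContains(0, lows, 300))     // true
	fmt.Println(arrayContains(0, lows, 4))       // false
	fmt.Println(arrayContains(0, lows, 65536+3)) // false: its key would be 1
	fmt.Println(len(lows) * 2)                   // 16 bytes when serialized
}
```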

bitmap/bitset

The bitset container always occupies 8 KB and is serialized in 64-bit units called words. If the value j is present, bit j%64 of word j/64 is set to 1 (both 0-based).
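As a quick illustration of the word and bit arithmetic, consider the value 130: it lives in word 130/64 = 2 at bit 130%64 = 2. The type `bitsetContainer` below is a hypothetical stand-in for the real container.

```go
package main

import "fmt"

// bitsetContainer holds 1024 64-bit words = 65536 bits = 8 KB.
type bitsetContainer [1024]uint64

func (c *bitsetContainer) set(j uint16) { c[j/64] |= 1 << (j % 64) }

func (c *bitsetContainer) has(j uint16) bool { return c[j/64]&(1<<(j%64)) != 0 }

func main() {
	var c bitsetContainer
	c.set(130)
	fmt.Println(130/64, 130%64)         // 2 2
	fmt.Println(c[2] == 1<<2)           // true
	fmt.Println(c.has(130), c.has(131)) // true false
	fmt.Println(len(c) * 8)             // 8192 bytes
}
```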

run

A run container starts with a 16-bit integer giving the number of runs. Each run is then stored as a pair of 16-bit integers: the first is the starting value, and the second is the length minus 1 (the same trick once more). For example, 11,4 represents the values 11,12,13,14,15.
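The (start, length - 1) pairs can be sketched in both directions. The type `run` and the functions `toRuns` and `expand` are hypothetical names for this example; `toRuns` assumes its input is sorted and free of duplicates.

```go
package main

import "fmt"

// run is one (start, length-1) pair as stored in a run container.
type run struct{ start, lenMinus1 uint16 }

// toRuns compresses sorted, distinct 16-bit values into runs.
func toRuns(vals []uint16) []run {
	var runs []run
	for _, v := range vals {
		n := len(runs)
		// Extend the last run if v directly follows its final value.
		if n > 0 && uint32(runs[n-1].start)+uint32(runs[n-1].lenMinus1)+1 == uint32(v) {
			runs[n-1].lenMinus1++
			continue
		}
		runs = append(runs, run{start: v})
	}
	return runs
}

// expand turns runs back into individual values. The loop counter is a
// uint32 so that a run of 65536 values does not overflow.
func expand(runs []run) []uint16 {
	var out []uint16
	for _, r := range runs {
		for j := uint32(0); j <= uint32(r.lenMinus1); j++ {
			out = append(out, uint16(uint32(r.start)+j))
		}
	}
	return out
}

func main() {
	fmt.Println(expand([]run{{11, 4}}))                       // [11 12 13 14 15]
	fmt.Println(toRuns([]uint16{11, 12, 13, 14, 15, 20, 21})) // [{11 4} {20 1}]
}
```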

Parse the code by hand

The most effective way to verify that we really understand the roaring layout is to parse a file by hand. The standard library package encoding/binary makes it easy to handle both big-endian and little-endian data.

Define constants:

 const (
  SERIAL_COOKIE_NO_RUNCONTAINER = 12346
  SERIAL_COOKIE                 = 12347
  NO_OFFSET_THRESHOLD           = 4
)

Read Cookie Header:

 func readCookieHeader(r io.Reader) (cookie uint16, containerNum uint32, runFlagBitset []byte) {
  binary.Read(r, binary.LittleEndian, &cookie)
  switch cookie {
  case SERIAL_COOKIE_NO_RUNCONTAINER:
    var dummy uint16
    binary.Read(r, binary.LittleEndian, &dummy)
    binary.Read(r, binary.LittleEndian, &containerNum)

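  case SERIAL_COOKIE:
    var u16 uint16
    binary.Read(r, binary.LittleEndian, &u16)
    containerNum = uint32(u16) + 1 // the high 16 bits store (number of containers - 1)
    buf := make([]uint8, (containerNum+7)/8)
    io.ReadFull(r, buf) // a plain r.Read may return fewer bytes than requested
    runFlagBitset = buf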

  default:
    log.Fatal("unknown cookie")
  }

  fmt.Println(cookie, containerNum, runFlagBitset)
  return
}

Read Descriptive Header:

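 // KeyCard pairs a container's key (most significant 16 bits) with its cardinality.
// Note: card is a uint16, so a completely full container (65536 values) would
// wrap around to 0; widen it to uint32 if such containers must be handled.
type KeyCard struct {
  key  uint16
  card uint16
}

func readDescriptiveHeader(r io.Reader, containerNum uint32) []KeyCard {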
  var keycards []KeyCard
  var key uint16
  var card uint16
  for i := 0; i < int(containerNum); i++ {
    binary.Read(r, binary.LittleEndian, &key)
    binary.Read(r, binary.LittleEndian, &card)
    card += 1
    fmt.Println("container", i, "key", key, "card", card)

    keycards = append(keycards, KeyCard{key, card})
  }

  return keycards
}

Read Offset Header:

 func readOffsetHeader(r io.Reader, cookie uint16, containerNum uint32) {
  if cookie == SERIAL_COOKIE_NO_RUNCONTAINER ||
    (cookie == SERIAL_COOKIE && containerNum >= NO_OFFSET_THRESHOLD) {
    // have offset header
    var offset uint32
    for i := 0; i < int(containerNum); i++ {
      binary.Read(r, binary.LittleEndian, &offset)
      fmt.Println("offset", i, offset)
    }
  }
}

Read the container and call different functions depending on the type:

 // array
func readArrayContainer(r io.Reader, key, card uint16, bm *roaring.Bitmap) {
  var value uint16
  for i := 0; i < int(card); i++ {
    binary.Read(r, binary.LittleEndian, &value)
    bm.Add(uint32(key)<<16 | uint32(value))
  }
}

// bitmap
func readBitmapContainer(r io.Reader, key, card uint16, bm *roaring.Bitmap) {
  var u64s [1024]uint64
  for i := 0; i < 1024; i++ {
    binary.Read(r, binary.LittleEndian, &u64s[i])
  }

  bs := bitset.From(u64s[:])
  for i := uint32(0); i < 65536; i++ { // 1024 words x 64 bits = 65536 bits
    if bs.Test(uint(i)) {
      bm.Add(uint32(key)<<16 | i)
    }
  }
}

// run
func readRunContainer(r io.Reader, key uint16, bm *roaring.Bitmap) {
  var runNum uint16
  binary.Read(r, binary.LittleEndian, &runNum)

  var startNum uint16
  var length uint16
  for i := 0; i < int(runNum); i++ {
    binary.Read(r, binary.LittleEndian, &startNum)
    binary.Read(r, binary.LittleEndian, &length)
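    // length holds (run length - 1); iterate in uint32 so that a run covering
    // the whole container (stored length 65535) does not overflow uint16.
    for j := uint32(0); j <= uint32(length); j++ {
      bm.Add(uint32(key)<<16 | (uint32(startNum) + j))
    }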
  }
}

Integration:

 func main() {
  data, err := os.ReadFile("../roaring.bin") // os.ReadFile replaces the deprecated ioutil.ReadFile (Go 1.16+)
  if err != nil {
    log.Fatal(err)
  }

  r := bytes.NewReader(data)
  cookie, containerNum, runFlagBitset := readCookieHeader(r)

  keycards := readDescriptiveHeader(r, containerNum)
  readOffsetHeader(r, cookie, containerNum)

  bm := roaring.New()
  for i := uint32(0); i < uint32(containerNum); i++ {
    if runFlagBitset != nil && runFlagBitset[i/8]&(1<<(i%8)) != 0 {
      // run
      readRunContainer(r, keycards[i].key, bm)
    } else if keycards[i].card <= 4096 {
      // array
      readArrayContainer(r, keycards[i].key, keycards[i].card, bm)
    } else {
      // bitmap
      readBitmapContainer(r, keycards[i].key, keycards[i].card, bm)
    }
  }

  fmt.Println(bm.String())
}

I saved the bytes.Buffer from the write-and-read example to a file named roaring.bin. The program above can parse this file:

 12346 1 []
container 0 key 0 card 8
offset 0 16
{1,3,5,7,100,300,500,700}

Bitmap restored successfully 😀
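Going the other way is a useful cross-check. The sketch below builds, by hand, the bytes that the layout described above implies for a bitmap whose values all fit in a single array container. `serializeArrayOnly` is a hypothetical helper and handles only this one case (one key, between 1 and 4096 sorted values, no run containers); it is not a replacement for WriteTo.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// serializeArrayOnly writes a bitmap consisting of one array container.
// lows must be sorted and contain between 1 and 4096 values.
func serializeArrayOnly(key uint16, lows []uint16) []byte {
	var buf bytes.Buffer // writes to a bytes.Buffer never fail
	le := binary.LittleEndian
	binary.Write(&buf, le, uint32(12346))       // Cookie Header: no run containers
	binary.Write(&buf, le, uint32(1))           // Cookie Header: number of containers
	binary.Write(&buf, le, key)                 // Descriptive Header: key
	binary.Write(&buf, le, uint16(len(lows)-1)) // Descriptive Header: cardinality - 1
	binary.Write(&buf, le, uint32(16))          // Offset Header: 8 + 4 + 4 bytes precede the container
	binary.Write(&buf, le, lows)                // array container: 2 bytes per value
	return buf.Bytes()
}

func main() {
	data := serializeArrayOnly(0, []uint16{1, 3, 5, 7, 100, 300, 500, 700})
	fmt.Println(len(data))         // 32
	fmt.Printf("% x\n", data[:16]) // 3a 30 00 00 01 00 00 00 00 00 07 00 10 00 00 00
}
```

The first 16 bytes are the three headers (cookie 12346, 1 container, key 0, cardinality - 1 = 7, offset 16), which lines up with the values printed by the parser above.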

Summary

In this article we first introduced how to use roaring compressed bitmaps. Leaving the internal implementation aside, compressed bitmaps are used in much the same way as ordinary bitmaps.

We then analyzed the storage format in detail with the help of 8 diagrams.

Finally, we deepened our understanding of the format by parsing a file by hand.

If you find a fun and easy-to-use Go library, you are welcome to submit an issue on the Go daily library GitHub 😄

References

  1. Roaring GitHub: https://github.com/RoaringBitmap/roaring
  2. Roaring file format: https://github.com/RoaringBitmap/RoaringFormatSpec
  3. Go daily library GitHub: https://github.com/darjun/go-daily-lib

About me

My blog: https://darjun.github.io

You are welcome to follow my WeChat official account [GoUpUp]. Let's learn and make progress together~

