Application of Efficiently Compressed Bitmap in Recommender System

Author: vivo Internet Technology - Ke Jiachen

1. Background

When a user browses a module in the game center or app store, they swipe through the screen repeatedly, and each swipe asks the game recommendation service for more games to display. This period of time is called a user session.

In a session, a user generally performs a dozen or so swipes, and each swipe sends a request to the recommendation service. Within a session, the recommendation service therefore needs to deduplicate the games it has already recommended, so that the same game is not recommended repeatedly and the user experience does not suffer.

The simplified business process is as follows: a swipe triggers a recommendation request. The client passes the blacklisted games of the previous page through the game center server to the recommendation system. The recommendation system accumulates the blacklist information of every request in the session and stores it in Redis as the total filtering set, and these blacklisted games are filtered out during recall and scoring.
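The accumulate-then-filter flow described above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and a plain in-memory map stands in for the filtering set that the article keeps in Redis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the per-session filtering set kept in Redis.
class SessionBlacklist {
    // sessionId -> appIds already shown in that session
    private final Map<String, Set<Integer>> store = new HashMap<>();

    // Merge the blacklist carried by the current request into the session's total set.
    void accumulate(String sessionId, List<Integer> shownAppIds) {
        store.computeIfAbsent(sessionId, k -> new HashSet<>()).addAll(shownAppIds);
    }

    // Drop every recalled candidate that was already shown in this session.
    List<Integer> filter(String sessionId, List<Integer> candidates) {
        Set<Integer> seen = store.getOrDefault(sessionId, Collections.emptySet());
        List<Integer> result = new ArrayList<>();
        for (Integer appId : candidates) {
            if (!seen.contains(appId)) {
                result.add(appId);
            }
        }
        return result;
    }
}
```

Each swipe would call accumulate with the previous page's games and then filter on the freshly recalled candidates.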

Taking the actual business scenario as an example, the session of a user browsing a given module generally lasts no more than 10 minutes, and each page of the module displays about 20 games. Assuming that each user session involves 15 swipes, a session needs to store the appIds (integer IDs) of 300 blacklisted games. This business scenario is therefore not suited to persistent storage, and the business problem comes down to choosing a suitable data structure for caching a series of integer sets while keeping memory overhead low.

2. Technical selection analysis

Next, we randomly select the appIds of 300 games ([2945340, ...... , 2793501,3056389]) as a set and compare the storage cost of intset, bloom filter, and RoaringBitMap.

2.1 intset

The experimental results show that storing the set of 300 games in an intset occupies 1.23KB. Each of the 300 integer appIds can be represented by a 4-byte int32, so according to the data structure of intset, its overhead is only encoding + length + the space occupied by 300 ints.

typedef struct intset {
    uint32_t encoding;          // encoding type
    uint32_t length;            // number of elements
    int8_t contents[];          // element storage
} intset;
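As a rough check of that arithmetic, the sketch below estimates the size of an intset from the struct layout: two 4-byte header fields plus the packed elements, with the element width chosen as the narrowest of 2, 4, or 8 bytes that fits every value. The class and method names are illustrative, and this is not Redis source code.

```java
// Sketch of the intset size arithmetic; names are illustrative.
class IntsetSizeEstimate {
    // Bytes per element for the narrowest encoding that holds every value
    // (2-byte, 4-byte, or 8-byte signed integers).
    static int elementWidth(long[] values) {
        int width = 2;
        for (long v : values) {
            if (v < Integer.MIN_VALUE || v > Integer.MAX_VALUE) {
                return 8;
            } else if (v < Short.MIN_VALUE || v > Short.MAX_VALUE) {
                width = 4;
            }
        }
        return width;
    }

    // Header (encoding + length, 4 bytes each) plus the packed elements.
    static long estimateBytes(long[] values) {
        return 4 + 4 + (long) values.length * elementWidth(values);
    }
}
```

For 300 appIds in the range of the sample values this gives 8 + 300 * 4 = 1208 bytes, or about 1.18KB. The difference from the measured 1.23KB is presumably Redis object and allocator overhead, which this estimate does not model.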

2.2 bloom filter

A Bloom filter is built on a bitmap. The bitmap itself is an array of bits that can record integers: arr[N] = 1 means that the number N is stored in the array.

A Bloom filter first applies a hash function to the data and maps it to the corresponding position in the bitmap. To reduce collisions (different data may have the same hash value), several hash functions are used to map the same data multiple times. In our business, we assume that 10,000 games are online, and the scenario does not tolerate misjudgment, so the error rate must be kept at 10^-5. According to the bloom filter calculator https://hur.st/bloomfilter/?n=10000&p=1.0E-5&m=&k= , 17 hash functions are required and the bitmap must reach 29KB to meet the business needs. Such a large overhead is clearly not the result we want.
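The calculator's figures can be reproduced with the standard Bloom filter sizing formulas, m = ceil(-n ln p / (ln 2)^2) for the number of bits and k = round(m / n * ln 2) for the number of hash functions. The sketch below is a minimal version of that calculation; the class and method names are illustrative.

```java
// Standard Bloom filter sizing formulas; names are illustrative.
class BloomSizing {
    // Optimal bit count m = ceil(-n * ln(p) / (ln 2)^2).
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Optimal number of hash functions k = round(m / n * ln 2).
    static int optimalHashes(long n, long m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }
}
```

With n = 10,000 and p = 10^-5 this yields roughly 240,000 bits, a little over 29KB, and 17 hash functions, in line with the numbers quoted above.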

2.3 RoaringBitMap

Both RoaringBitMap and the bloom filter essentially use a bitmap for storage. The bloom filter, however, maps the stored data through multiple hash functions. If two game appIds map to the same positions after hashing, they are judged to be duplicates, so there is a certain false positive rate, and meeting the accuracy requirement of this business scenario makes the space overhead very large. RoaringBitMap compresses the original data directly, and its accuracy is 100%.

The experimental results show that RoaringBitMap needs only 0.5KB for the set of 300 games, which is smaller than using intset directly (1.23KB), making it the first choice in this business scenario. So let's focus on analyzing why RoaringBitMap is so efficient.

2.3.1 Data Structure

Each RoaringBitMap contains a RoaringArray, which stores all the data, and its structure is as follows:

short[] keys;
Container[] values;
int size;

The idea is to divide 32-bit unsigned integers into buckets (containers) by their upper 16 bits. The upper 16 bits are stored as keys in short[] keys, so there can be at most 2^16 = 65536 buckets. To store a value, find the container by the value's upper 16 bits and put the lower 16 bits into that container, which lives in Container[] values. size is the number of valid entries currently in keys and values.
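The split into a container key and an in-container value is plain bit arithmetic. The following sketch shows it on one of the sample appIds from the article; the class and method names are illustrative and are not taken from the RoaringBitMap library.

```java
// Splitting a 32-bit value into a container key and an in-container value.
class RoaringKeySplit {
    // Upper 16 bits: selects the container (the entry in keys).
    static int highBits(int value) {
        return value >>> 16;
    }

    // Lower 16 bits: the value stored inside that container.
    static int lowBits(int value) {
        return value & 0xFFFF;
    }

    // Rebuild the original value from its two halves.
    static int join(int high, int low) {
        return (high << 16) | low;
    }
}
```

For appId 2945340 the key is 44 and the value stored in the container is 61756, since 44 * 65536 + 61756 = 2945340.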

For ease of understanding, the following figure shows three containers:

Image quoted from: https://arxiv.org

The container whose upper 16 bits are 0000H stores the first 1000 multiples of 62.
The container whose upper 16 bits are 0001H stores the 100 numbers in the interval [2^16, 2^16+100).
The container whose upper 16 bits are 0002H stores all even numbers in the interval [2×2^16, 3×2^16), 2^15 in total.
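To make the three containers concrete, the sketch below rebuilds the example values, groups them by their upper 16 bits, and estimates the payload of each container. The sizing rule is an assumption taken from the Roaring papers rather than from this article: up to 4096 values are kept as a sorted array of 16-bit values, and larger containers switch to a fixed 8KB bitmap. Run containers are not modeled, and all names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Rebuilds the three example containers and estimates their payload sizes.
class ContainerDemo {
    // Count how many values fall into each upper-16-bit bucket.
    static Map<Integer, Integer> cardinalities(int[] values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int v : values) {
            counts.merge(v >>> 16, 1, Integer::sum);
        }
        return counts;
    }

    // The three example sets from the text, concatenated.
    static int[] exampleValues() {
        int[] out = new int[1000 + 100 + (1 << 15)];
        int n = 0;
        for (int i = 0; i < 1000; i++) out[n++] = 62 * i;           // first 1000 multiples of 62
        for (int i = 0; i < 100; i++) out[n++] = (1 << 16) + i;     // [2^16, 2^16 + 100)
        for (int v = 2 << 16; v < (3 << 16); v += 2) out[n++] = v;  // even numbers in [2*2^16, 3*2^16)
        return out;
    }

    // Assumed rule: array of 2-byte values up to 4096 entries, else an 8192-byte bitmap.
    static int payloadBytes(int cardinality) {
        return cardinality <= 4096 ? 2 * cardinality : 8192;
    }
}
```

Under this rule the first two containers stay small arrays (2000 and 200 bytes), while the third, with 2^15 values, is cheaper as a bitmap than as an array of 65536 bytes.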

Of course, the details of the data structure are not our focus here; interested readers can consult the relevant materials. Now let's analyze how RoaringBitMap helps us save overhead in the recommendation business. The containers in RoaringBitMap are divided into ArrayContainer, BitmapContainer, and RunContainer, but their compression methods fall mainly into two types, called variable-length compression and fixed-length compression. The two methods suit different scenarios.

2.3.2 Compression method and thinking

Suppose we have two sets of numbers: [112, 113, 114, 115, 116, 117, 118] and [112, 115, 116, 117, 120, 121].

Using variable-length compression, they can be recorded as:

112,1,1,1,1,1,1: the size used is 7bit + 6bit = 13bit, and the compression ratio is (7 * 32bit) / 13bit = 17.23
112,3,1,1,3,1: the size used is 7bit + 2bit + 1bit + 1bit + 2bit + 1bit = 14bit, and the compression ratio is (6 * 32bit) / 14bit = 13.7
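The bit counts for the variable-length scheme can be checked mechanically: store the first value, then each gap to the previous value, and count the minimal number of bits for each. The sketch below does exactly that, following the article's simplified counting, which ignores any delimiters between fields. The names are illustrative.

```java
// Bit counting for the first-value-plus-gaps scheme used in the text.
class DeltaBits {
    // Number of bits needed to write a positive value without leading zeros.
    static int bitLength(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    // Bits used when a sorted set is stored as its first value followed by
    // the gaps between neighbours, each written with its minimal bit length.
    static int deltaEncodedBits(int[] sorted) {
        int bits = bitLength(sorted[0]);
        for (int i = 1; i < sorted.length; i++) {
            bits += bitLength(sorted[i] - sorted[i - 1]);
        }
        return bits;
    }

    // Ratio against plain 32-bit storage of every element.
    static double compressionRatio(int[] sorted) {
        return 32.0 * sorted.length / deltaEncodedBits(sorted);
    }
}
```

Applied to the two sets above, deltaEncodedBits returns 13 and 14 bits respectively.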

Using fixed-length compression, they can be recorded as:

112, 6: the size used is 7bit + 3bit = 10bit, and the compression ratio is (7 * 32bit) / 10bit = 22.4
112, 115, 3, 120, 2: the size used is 7bit + 7bit + 2bit + 7bit + 2bit = 25bit, and the compression ratio is (6 * 32bit) / 25bit = 7.68

Obviously, how sparsely the numbers are arranged affects both compression methods. Variable-length compression suits sparsely distributed sets of numbers, and fixed-length compression suits continuously distributed ones. But when the data is too sparse, even variable-length compression does poorly.

Assume the storage range of the set is Integer.MAX_VALUE and the set to store consists of these 6 numbers: [2^3 - 1, 2^9 - 1, 2^15 - 1, 2^25 - 1, 2^25, 2^30 - 1]. With variable-length compression it is expressed as: 2^3 - 1, 2^9 - 2^3, 2^15 - 2^9, 2^25 - 2^15, 1, 2^30 - 2^25 - 1. The size used is 3bit + 9bit + 15bit + 25bit + 1bit + 30bit = 83bit, and the compression ratio is (6 * 32bit) / 83bit = 2.3.

This compression ratio is no better than that of fixed-length compression: in such extreme cases only the small integers are compressed, and offset compression cannot improve the efficiency for the widely spaced values.

2.3.3 Business Analysis

In a more extreme case, for the data set [2^25 - 1, 2^26 - 1, 2^27 - 1, 2^28 - 1, 2^29 - 1, 2^30 - 1], RoaringBitMap compression still needs 25 + 26 + 27 + 28 + 29 + 30 = 165bit, so the compression ratio is only (6 * 32bit) / 165bit = 1.16, and that is before counting the data structure's components, bit alignment, structure overhead, pointer sizes, and other costs. In particularly sparse cases, using RoaringBitMap may even make things worse.
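Both sparse examples can be verified with the same simplified counting used earlier: the first value and every later gap are written with their minimal bit length. For the second set, counting the gaps happens to give the same total as summing the bit lengths of the values themselves. The sketch below is only a check of the arithmetic, not a model of RoaringBitMap's real layout, and its names are illustrative.

```java
// Checks the bit totals of the two sparse example sets.
class SparseSetCheck {
    // Total bits when the first value and every later gap are written with
    // their minimal bit length (the counting scheme used in the text).
    static int gapBits(int[] sorted) {
        int bits = 0;
        int previous = 0;
        for (int v : sorted) {
            bits += 32 - Integer.numberOfLeadingZeros(v - previous);
            previous = v;
        }
        return bits;
    }

    // [2^3 - 1, 2^9 - 1, 2^15 - 1, 2^25 - 1, 2^25, 2^30 - 1]
    static int[] sparseA() {
        return new int[] {(1 << 3) - 1, (1 << 9) - 1, (1 << 15) - 1,
                          (1 << 25) - 1, 1 << 25, (1 << 30) - 1};
    }

    // [2^25 - 1, 2^26 - 1, 2^27 - 1, 2^28 - 1, 2^29 - 1, 2^30 - 1]
    static int[] sparseB() {
        return new int[] {(1 << 25) - 1, (1 << 26) - 1, (1 << 27) - 1,
                          (1 << 28) - 1, (1 << 29) - 1, (1 << 30) - 1};
    }
}
```

gapBits gives 83 bits for the first set and 165 bits for the second, against 192 bits for six plain 32-bit integers.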

For the game business, however, the total number of games is more than 10,000, and the identifying appId is generally an auto-incrementing primary key. Its randomness is small and its distribution is not particularly sparse, so in theory the data compresses well. In addition, a RoaringBitMap is stored in Redis as a string of bits, so it is not affected by the overhead of Redis's own data structures, which saves a lot of extra space.

3. Summary

In this article, we discussed the overhead of three data structures, intset, bloom filter, and RoaringBitMap, for storing integer sets in Redis in the filtering and deduplication business.

Among them, the traditional bloom filter increases storage overhead in the game recommendation scenario, because of the accuracy requirement and because mapping short IDs saves little space, so it is not suitable for storing data in this business scenario. The intset structure meets the business needs, but its space usage is not optimal and leaves room for optimization.

In the end, we chose RoaringBitMap for storage. In the filtering set saved by the game recommendation business, game IDs are by and large auto-incrementing integers and are not arranged very sparsely, so the compression characteristics of RoaringBitMap can be put to good use to save space. We randomly selected a set of 300 game IDs for testing. As the table shows, compared with the 1.23KB used by the intset structure, RoaringBitMap uses only 0.5KB, a compression ratio of about 2.4.

For an in-memory database like Redis, using appropriate data structures to improve storage efficiency brings large benefits: it not only saves hardware costs considerably, but also reduces the thread blocking caused by fork and the latency of a single call. The improvement in system performance is significant, so RoaringBitMap is a very good fit for this scenario.
