Why do you need a Bloom filter
Imagine how you would handle the following scenarios:
- Checking whether a phone number has already been registered
- Checking whether a user has already participated in a flash-sale event
- A flood of forged requests querying large numbers of non-existent IDs, none of which hit the cache: how do you avoid cache penetration?
The conventional approach to these problems is to query the database directly and let it absorb the load. If the traffic is light, this method works: just keep it simple.
An improved approach is to maintain the set of elements in a list/set/tree and test membership against it, but the time or space complexity will be relatively high. In a microservice setup you can use the list/set data structures in Redis, but when the data volume is very large, the memory requirements of this solution can become prohibitive.
These scenarios have one thing in common; the problem can be abstracted as: how do we efficiently determine that an element is not in a set?
So is there a better solution, one that achieves both low time complexity and low space complexity?
Yes: the Bloom filter.
What is a Bloom filter
The Bloom filter was proposed by Burton Howard Bloom in 1970. It is essentially a very long binary vector combined with a series of random mapping functions. A Bloom filter can be used to test whether an element is in a set; its advantage is that its space efficiency and query time far exceed those of general-purpose algorithms.
Working principle
When an element is added to the set, it is mapped by k hash functions to k points (offsets) in a bit array, and those bits are set to 1. When querying, we only need to check whether all of these points are 1 to know (approximately) whether the element is in the set: if any of these points is 0, the element is definitely not in the set; if they are all 1, the element is probably in the set. This is the basic idea of the Bloom filter.
Simply put: prepare a bit array of length m with all bits initialized to 0; for each element, run k hash functions over it, take each result modulo m to obtain k positions, and set the corresponding bits in the array to 1.
Advantages and disadvantages of bloom filters
Advantages:
- The space footprint is very small: it stores no data itself, only bits indicating whether data exists, which also provides a degree of confidentiality.
- Insertion and query are both O(k), constant time, where k is the number of hash functions.
- The hash functions can be computed independently of each other, which allows acceleration at the hardware instruction level.
Disadvantages:
- False positives.
- Elements cannot be deleted.
False positive rate
A Bloom filter can determine with 100% certainty that an element is not in the set, but it may misjudge an element as present, because as more elements are added, the k positions generated by the hash functions for different elements start to collide.
Wikipedia has a mathematical derivation of the false positive rate (see the link at the end of the article); here we give the conclusions directly. Assume:
- Bit array length m
- Number of hash functions k
- Expected number of elements n
- Expected false positive rate ε
To find the appropriate m and k when creating a Bloom filter, we can derive them from n and ε:
m = -n * ln(ε) / (ln 2)^2
k = (m / n) * ln 2
Guava and Redisson in Java use this algorithm to estimate the optimal m and k when implementing their Bloom filters:

```java
// compute the number of hash functions
@VisibleForTesting
static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}

// compute the bit array length
@VisibleForTesting
static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
        p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
```
Cannot delete
Some of the k positions in the bit array are shared by multiple elements, so if we reset all k bits of one element to 0, we directly corrupt other elements.
This means we cannot handle element deletion when using a Bloom filter.
The usual workaround is to periodically rebuild the filter to clear out stale data. If the filter is backed by Redis, don't delete the original key directly during the rebuild; instead, build the new filter under a new key, swap it in with the RENAME command, and then delete the old data.
Bloom filter source code analysis in go-zero
core/bloom/bloom.go
A Bloom filter has two core components:
- a bit array
- hash functions
The bit array in go-zero's Bloom filter is implemented on top of Redis's bitmap. Since Redis is used, distributed scenarios are naturally supported; the hash function is MurmurHash3.
Why can Redis's bitmap serve as the bit array?
There is no separate bitmap data structure in Redis; the underlying implementation is the dynamic string (SDS), and Redis strings are stored as binary. For example, the ASCII code of 'a' is 97, which is 01100001 in binary; to turn it into 'b' (01100010) we only need to flip the last two bits. This can be done with SETBIT (note that Redis numbers bits starting from the most significant bit of the first byte):
```
set foo a
OK
get foo
"a"
setbit foo 6 1
(integer) 0
setbit foo 7 0
(integer) 1
get foo
"b"
```
The dynamic string underlying the bitmap supports dynamic expansion: when a bit is set at a high offset, the bits in the newly allocated positions are automatically filled with 0. A bit array of up to 2^32-1 bits is supported (occupying 512 MB of memory); note that allocating such a large block of memory can block the Redis process.
From the algorithm described above, implementing a Bloom filter requires three main things:
- Run the hash function k times to compute k positions.
- On insert, set the bits at the k positions in the bit array to 1.
- On query, check whether the bits at the k positions are all 1; if any of them is 0, the element definitely does not exist.
Let's take a look at how go-zero is implemented:
Object definition
```go
// maps is the number of hash function passes,
// fixed at 14
const maps = 14

type (
    // Filter defines the Bloom filter struct
    Filter struct {
        bits   uint
        bitSet bitSetProvider
    }

    // bitSetProvider defines the bit array operations
    bitSetProvider interface {
        check([]uint) (bool, error)
        set([]uint) error
    }
)
```
Bit array operation interface implementation
First, you need to understand two lua scripts:

```go
// ARGV: the array of offsets
// KEYS[1]: the key that setbit operates on
// set all positions to 1
setScript = `
for _, offset in ipairs(ARGV) do
    redis.call("setbit", KEYS[1], offset, 1)
end
`

// ARGV: the array of offsets
// KEYS[1]: the key that getbit operates on
// check whether all positions are 1
testScript = `
for _, offset in ipairs(ARGV) do
    if tonumber(redis.call("getbit", KEYS[1], offset)) == 0 then
        return false
    end
end
return true
`
```
Why do we have to use lua scripts?
Because the entire sequence of bit operations for one element must execute atomically.
```go
// redisBitSet is the Redis-backed bit array
type redisBitSet struct {
    store *redis.Client
    key   string
    bits  uint
}

// check tests whether the bits at all offsets are 1
// yes: the element probably exists
// no: the element definitely does not exist
func (r *redisBitSet) check(offsets []uint) (bool, error) {
    args, err := r.buildOffsetArgs(offsets)
    if err != nil {
        return false, err
    }

    // run the script
    resp, err := r.store.Eval(testScript, []string{r.key}, args)
    // note: the underlying client is go-redis, and redis.Nil
    // (the key does not exist) needs special handling
    if err == redis.Nil {
        return false, nil
    } else if err != nil {
        return false, err
    }

    exists, ok := resp.(int64)
    if !ok {
        return false, nil
    }
    return exists == 1, nil
}

// set sets the bits at all k offsets to 1
func (r *redisBitSet) set(offsets []uint) error {
    args, err := r.buildOffsetArgs(offsets)
    if err != nil {
        return err
    }

    _, err = r.store.Eval(setScript, []string{r.key}, args)
    // the underlying client is go-redis; redis.Nil means the key
    // does not exist, which needs special handling
    if err == redis.Nil {
        return nil
    } else if err != nil {
        return err
    }
    return nil
}

// buildOffsetArgs converts the offsets to a string array, because go-redis
// expects []string arguments when executing lua scripts
func (r *redisBitSet) buildOffsetArgs(offsets []uint) ([]string, error) {
    var args []string
    for _, offset := range offsets {
        if offset >= r.bits {
            return nil, ErrTooLargeOffset
        }
        args = append(args, strconv.FormatUint(uint64(offset), 10))
    }
    return args, nil
}

// del deletes the underlying key
func (r *redisBitSet) del() error {
    _, err := r.store.Del(r.key)
    return err
}

// expire sets automatic expiration
func (r *redisBitSet) expire(seconds int) error {
    return r.store.Expire(r.key, seconds)
}

func newRedisBitSet(store *redis.Client, key string, bits uint) *redisBitSet {
    return &redisBitSet{
        store: store,
        key:   key,
        bits:  bits,
    }
}
```
// 构建偏移量offset字符串数组,因为go-redis执行lua脚本时参数定义为[]stringy
// 因此需要转换一下
func (r *redisBitSet) buildOffsetArgs(offsets []uint) ([]string, error) {
var args []string
for _, offset := range offsets {
if offset >= r.bits {
return nil, ErrTooLargeOffset
}
args = append(args, strconv.FormatUint(uint64(offset), 10))
}
return args, nil
}
// 删除
func (r *redisBitSet) del() error {
_, err := r.store.Del(r.key)
return err
}
// 自动过期
func (r *redisBitSet) expire(seconds int) error {
return r.store.Expire(r.key, seconds)
}
func newRedisBitSet(store *redis.Client, key string, bits uint) *redisBitSet {
return &redisBitSet{
store: store,
key: key,
bits: bits,
}
}
At this point the bit array operations are fully implemented. Next, let's see how k hash calculations produce k bit positions.
k hashes compute k positions
```go
// getLocations runs k hash calculations to produce k offsets
func (f *Filter) getLocations(data []byte) []uint {
    // allocate a slice with the required capacity
    locations := make([]uint, maps)
    // maps is the k value, defined by the author as the constant 14
    for i := uint(0); i < maps; i++ {
        // hash with the MurmurHash3 algorithm, appending a different
        // fixed byte i to the data on each iteration
        hashValue := hash.Hash(append(data, byte(i)))
        // take the result modulo the bit array length to get the offset
        locations[i] = uint(hashValue % uint64(f.bits))
    }
    return locations
}
```
Insert and query
Adding and querying are simple to implement: just combine the functions above.

```go
// Add adds an element
func (f *Filter) Add(data []byte) error {
    locations := f.getLocations(data)
    return f.bitSet.set(locations)
}

// Exists checks whether an element exists
func (f *Filter) Exists(data []byte) (bool, error) {
    locations := f.getLocations(data)
    isSet, err := f.bitSet.check(locations)
    if err != nil {
        return false, err
    }
    return isSet, nil
}
```
Suggestions for improvement
The overall implementation is simple and efficient, so is there room for improvement?
Personally, I think so. As shown above, there is a mathematical formula for computing the optimal m and k automatically, so the creation parameters could be changed to:
- Expected total number of elements: expectedInsertions
- Expected false positive rate: falseProbability
This would be friendlier: although the author's comment does describe the error rate, in practice many developers have no intuition for the bit array length, and it is hard to know, given some number of bits, what the expected error rate will be.
```go
// New create a Filter, store is the backed redis, key is the key for the bloom filter,
// bits is how many bits will be used, maps is how many hashes for each addition.
// best practices:
// elements - means how many actual elements
// when maps = 14, formula: 0.7*(bits/maps), bits = 20*elements, the error rate is 0.000067 < 1e-4
// for detailed error rate table, see http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
func New(store *redis.Redis, key string, bits uint) *Filter {
    return &Filter{
        bits:   bits,
        bitSet: newRedisBitSet(store, key, bits),
    }
}
```
```go
// expectedInsertions - the expected total number of elements
// falseProbability - the desired false positive rate
// (this could also be done with the option pattern without breaking compatibility)
func NewFilter(store *redis.Redis, key string, expectedInsertions uint, falseProbability float64) *Filter {
    bits := optimalNumOfBits(expectedInsertions, falseProbability)
    k := optimalNumOfHashFunctions(bits, expectedInsertions)
    return &Filter{
        bits:   bits,
        bitSet: newRedisBitSet(store, key, bits),
        k:      k, // Filter would gain a k field, replacing the fixed maps constant
    }
}

// optimalNumOfHashFunctions computes the optimal number of hash functions
func optimalNumOfHashFunctions(m, n uint) uint {
    return uint(math.Round(float64(m) / float64(n) * math.Log(2)))
}

// optimalNumOfBits computes the optimal bit array length
// (negate after converting to float64; -n on a uint would underflow)
func optimalNumOfBits(n uint, p float64) uint {
    return uint(-float64(n) * math.Log(p) / (math.Log(2) * math.Log(2)))
}
```
Back to the original question
How do we prevent illegal IDs from causing cache penetration?
Because the ID does not exist, the request cannot hit the cache and the traffic goes straight to the database; meanwhile the record does not exist in the database either, so nothing can be written back to the cache. Under high concurrency this greatly increases the pressure on the database.
There are two solutions:
- Use a Bloom filter. When data is written to the database, it must be written to the Bloom filter at the same time. If dirty data can occur (for example, deletions), the Bloom filter needs to be rebuilt periodically; when Redis is the backing store, the bloom key must not be deleted directly during the rebuild. Instead, build the new filter under a new key and swap it in with the RENAME command, then delete the old data.
- When neither the cache nor the database is hit, write a null value with a short expiration time into the cache.
References
Bloom filter: principle, and the implementation in Guava
Project address
https://github.com/zeromicro/go-zero
Welcome to use go-zero and star it to support us!
WeChat exchange group
Follow the "Practice" official account to get the QR code for the community exchange group.