[Redis advanced (3)] large data volume precision deduplication

Brief description

A Bloom filter can be used to deduplicate large amounts of data in very little space, at the cost of a small, controllable false positive rate.

A Bloom filter can be understood as a slightly imprecise set structure. When you use its contains method to check whether an object exists, it may misjudge. But it is not wildly imprecise: as long as the parameters are set reasonably, its accuracy can be controlled well enough that misjudgments occur only with a small probability. It is generally used for deduplicating large amounts of data.

When the Bloom filter says that a value exists, the value may in fact not exist; when it says that a value does not exist, it definitely does not exist. By analogy: when it says it does not know you, it really has never met you; when it says it has seen you, it may never have met you at all, but your face resembles some combination of faces it already knows, so it mistakenly concludes that it has seen you before.
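To make this asymmetry concrete, here is a minimal in-memory sketch of a Bloom filter. It is only an illustration, not the Redis module's implementation: the class name `TinyBloomFilter`, the bit count, and the use of salted SHA-256 hashes are all assumptions chosen for readability.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative in-memory Bloom filter (not the Redis implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True only if every position is set. Other items may have set
        # those same positions, which is where false positives come from.
        return all(self.bits[pos] for pos in self._positions(item))


bf = TinyBloomFilter()
for name in ["alice", "bob", "carol"]:
    bf.add(name)

print("alice" in bf)  # True: an added item is never reported as absent
```

An item that was added always finds all of its bits set, so "does not exist" is always correct. An item that was never added can still find all of its bits set by other items, which is the "similar face" case.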

Redis did not get an official Bloom filter until Redis 4.0 introduced the module (plug-in) mechanism. The Bloom filter is loaded into the Redis server as a module and gives Redis a powerful Bloom-based deduplication capability.

Application scenario

In a crawler system, we need to deduplicate URLs so that pages already crawled are not crawled again. But there are a great many URLs, tens of millions to hundreds of millions, and storing them all in a set would waste a lot of space. This is where a Bloom filter is worth considering. It greatly reduces the storage needed for deduplication, at the cost of the crawler missing a small number of pages (a new URL that triggers a false positive is wrongly treated as already crawled).
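A rough back-of-the-envelope comparison shows where the saving comes from. The sketch below uses the textbook sizing formula for a classic Bloom filter, m = -n ln(p) / (ln 2)^2 bits; the crawl size, average URL length, and error rate are assumed numbers, and the actual memory used by the Redis module may differ.

```python
import math


def bloom_size_bits(n_items, error_rate):
    """Optimal bit count m = -n * ln(p) / (ln 2)^2 for a classic Bloom filter."""
    return math.ceil(-n_items * math.log(error_rate) / (math.log(2) ** 2))


n_urls = 100_000_000   # assumed crawl size
avg_url_bytes = 64     # assumed average URL length
error_rate = 0.001     # assumed acceptable false positive rate

raw_bytes = n_urls * avg_url_bytes                      # URLs stored verbatim
bloom_bytes = bloom_size_bits(n_urls, error_rate) / 8   # bits -> bytes

print(f"raw URLs:     {raw_bytes / 2**30:.2f} GiB")
print(f"Bloom filter: {bloom_bytes / 2**30:.2f} GiB")
```

At a 0.1% error rate the formula works out to roughly 14.4 bits per element, regardless of how long each URL is.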

Bloom filters are widely used in NoSQL databases. HBase, Cassandra, LevelDB, and RocksDB all use Bloom filter structures internally. Bloom filters can significantly reduce the number of database IO requests: when a user queries a row, the in-memory Bloom filter first screens out the large number of requests for rows that do not exist, and only the remaining requests go to disk.
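The read path described here can be sketched as follows. This is a hypothetical toy, not how any of the databases named above is actually implemented: `FilteredStore` is an invented name, a dict stands in for the disk, and a plain Python set stands in for the Bloom filter so the example stays self-contained.

```python
class FilteredStore:
    """Hypothetical key-value store that checks a membership filter before 'disk'."""

    def __init__(self, membership_filter):
        self.filter = membership_filter  # anything supporting add() and `in`
        self.disk = {}                   # stand-in for slow on-disk storage
        self.disk_reads = 0

    def put(self, key, value):
        self.filter.add(key)
        self.disk[key] = value

    def get(self, key):
        if key not in self.filter:
            return None          # definitely absent: skip the disk read
        self.disk_reads += 1     # present, or a false positive: must check disk
        return self.disk.get(key)


store = FilteredStore(set())  # a plain set stands in for a real Bloom filter
store.put("row:1", "hello")
store.get("row:1")  # passes the filter, costs one disk read
store.get("row:2")  # rejected by the filter, no disk read
print(store.disk_reads)  # 1
```

With a real Bloom filter in place of the set, a false positive only costs one unnecessary disk read; it never returns wrong data, because the disk lookup still has the final say.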

Spam filtering in email systems also commonly uses Bloom filters. Because of this, a normal email occasionally ends up in the spam folder. This is caused by a false positive, and the probability is very low.

Install

Use docker directly

> docker pull redislabs/rebloom  # pull the image
> docker run -p6379:6379 redislabs/rebloom  # run the container
> redis-cli  # connect to the Redis service in the container

Plug-in installation

# Download, compile, and install the Rebloom module
wget https://github.com/RedisLabsModules/rebloom/archive/v1.1.1.tar.gz

# Extract
tar zxvf v1.1.1.tar.gz
cd rebloom-1.1.1
make

# Add the corresponding parameter to the Redis service start command
rebloom_module="/usr/local/rebloom/rebloom.so"
daemon --user ${REDIS_USER-redis} "$exec $REDIS_CONFIG --loadmodule $rebloom_module --daemonize yes --pidfile $pidfile"

# Restart the Redis service

Test command
bf.add test testValue
If the command succeeds, the module has been loaded successfully.

Use

The Bloom filter has two basic commands: bf.add adds an element, and bf.exists checks whether an element exists. Their usage is similar to sadd and sismember for sets. Note that bf.add can only add one element at a time. To add several elements at once, use bf.madd. Similarly, to check several elements at once, use bf.mexists.

127.0.0.1:6379> bf.add codehole user1
(integer) 1

127.0.0.1:6379> bf.exists codehole user1
(integer) 1

127.0.0.1:6379> bf.madd codehole user4 user5 user6

127.0.0.1:6379> bf.mexists codehole user4 user5 user6 user7
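For deduplication from application code, these commands can be wrapped in small helpers. The sketch below assumes a redis-py style client whose `execute_command` method sends raw commands, and relies on bf.add and bf.madd replying 1 for an element that was newly added and 0 for one that was (probably) already present. The helper names `is_new` and `filter_new` are hypothetical.

```python
def is_new(client, key, item):
    """Add item to the filter; return True if it was not already (probably) present."""
    return client.execute_command("BF.ADD", key, item) == 1


def filter_new(client, key, items):
    """Add all items with one BF.MADD call; return only those reported as new."""
    items = list(items)
    if not items:
        return []  # BF.MADD needs at least one element
    flags = client.execute_command("BF.MADD", key, *items)
    return [item for item, flag in zip(items, flags) if flag == 1]
```

A crawler could then call something like `is_new(client, "codehole", url)` once per URL and skip the URL whenever it returns False, keeping in mind that a small fraction of genuinely new URLs will be skipped because of false positives.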

Parameter tuning

The Bloom filter we used above has default parameters; it is created automatically the first time we add an element. Redis also supports Bloom filters with custom parameters, which must be created explicitly with bf.reserve before the first add. If the key already exists, bf.reserve reports an error. bf.reserve takes three parameters: key, error_rate, and initial_size. The lower the error rate, the more space is required. The initial_size parameter is the expected number of elements. When the actual number exceeds this value, the false positive rate rises.

It is therefore necessary to set a sufficiently large value in advance, so that it is not exceeded and the false positive rate does not rise. If bf.reserve is not used, the default error_rate is 0.01 and the default initial_size is 100.
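The effect of exceeding the expected size can be estimated with the textbook formulas for a classic fixed-size Bloom filter: with m bits, k hash functions, and n inserted elements, the false positive rate is approximately (1 - e^(-kn/m))^k. The sketch below is only an approximation under that model; the function names are hypothetical, and the Redis module's internal layout and behavior past capacity may differ.

```python
import math


def optimal_params(capacity, error_rate):
    """Bits m and hash count k for a classic Bloom filter sized for `capacity`."""
    m = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round(m / capacity * math.log(2)))
    return m, k


def false_positive_rate(m, k, n):
    """Approximate false positive rate after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k


# Size the filter for 50000 elements at a 0.1% error rate,
# then see what happens as more elements than planned are added.
m, k = optimal_params(capacity=50_000, error_rate=0.001)
for n in (25_000, 50_000, 100_000, 200_000):
    print(f"{n:>7} items -> {false_positive_rate(m, k, n):.5f}")
```

Under this model the rate stays near the target up to the planned capacity and climbs steeply beyond it, which is why the expected size should include some headroom.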

import random
import string

import redis

client = redis.StrictRedis()

CHARS = string.ascii_lowercase


def random_string(n):
    """Return a random string of n lowercase letters."""
    return ''.join(random.choice(CHARS) for _ in range(n))


# Generate unique random user IDs, then split them in half:
# one half is added to the filter, the other half is only queried.
users = list(set(random_string(64) for _ in range(100000)))
print('total users', len(users))
users_train = users[:len(users) // 2]
users_test = users[len(users) // 2:]

falses = 0
client.delete("codehole")
# Added line: create the filter explicitly (error_rate 0.001, initial_size 50000)
client.execute_command("bf.reserve", "codehole", 0.001, 50000)
for user in users_train:
    client.execute_command("bf.add", "codehole", user)
print('all trained')
# None of users_test was added, so every hit here is a false positive
for user in users_test:
    ret = client.execute_command("bf.exists", "codehole", user)
    if ret == 1:
        falses += 1

print(falses, len(users_test))

The output is as follows:

total users 100000
all trained
6 50000

We see that the false positive rate is about 0.012% (6 out of 50000), well below the configured 0.1%. The Bloom filter's error rate is itself a probabilistic estimate, though. As long as the measured rate is not much higher than the configured one, this is normal.

If the initial_size estimate of the Bloom filter is too large, it wastes storage space. If the estimate is too small, it hurts accuracy. Before using the filter, estimate the number of elements as accurately as possible, and add a certain amount of headroom in case the actual number of elements unexpectedly turns out much higher than the estimate.

The smaller the Bloom filter's error_rate, the more storage space is needed. For use cases that do not need to be very precise, a somewhat larger error_rate does no harm. For example, in news deduplication, a higher false positive rate only means that a small fraction of articles is not shown to readers who should have seen them, and the overall readership of the articles will not change much because of it.
