Brief description
A Bloom filter is used to deduplicate large amounts of data approximately, in far less space than an exact set.
A Bloom filter can be thought of as an imprecise set. When you call its contains method to check whether an object exists, it may give a wrong answer. It is not wildly imprecise, though: as long as the parameters are set sensibly, its accuracy can be controlled well, and false positives occur only with a small probability.
When a Bloom filter says that a value exists, the value may not actually exist; when it says that a value does not exist, the value is definitely absent. By analogy: when it says it does not know you, it really has never seen you; when it says it has seen you, it may never have met you at all, but your face resembles some combination of the faces it does know, so it mistakenly concludes that it has seen you before.
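The asymmetry described above can be illustrated with a toy filter. This is only a minimal sketch for intuition, not how the Redis module is implemented: the class name TinyBloom, the bit count, and the use of salted SHA-256 as the hash family are all assumptions made for this example.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter: num_bits bits, num_hashes positions per item."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item):
        # True only if every position is set. Other items may have set
        # those same positions, which is where false positives come from.
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloom()
for name in ["alice", "bob", "carol"]:
    bloom.add(name)

print(bloom.contains("alice"))  # always True: added items are never missed
print(bloom.contains("dave"))   # usually False, but True is possible
```

Because add only ever sets bits and never clears them, an item that was added always passes contains. An item that was never added passes only when all of its positions happen to have been set by other items.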
The official Redis Bloom filter only became available once Redis 4.0 introduced the plug-in (module) mechanism. The Bloom filter is loaded into the Redis server as a plug-in, giving Redis a powerful Bloom-based deduplication capability.
Application scenario
In a crawler system, we need to deduplicate URLs so that pages that have already been crawled are not crawled again. But there can be tens of millions to hundreds of millions of URLs, and storing them all in a set would waste a great deal of space. This is where a Bloom filter is worth considering. It greatly reduces the storage needed for deduplication, at the cost of making the crawler miss a small number of pages, because a false positive makes an uncrawled URL look as if it had already been crawled.
Bloom filters are widely used in NoSQL databases. The commonly used HBase, Cassandra, LevelDB, and RocksDB all have Bloom filter structures inside. Bloom filters can significantly reduce the number of database IO requests. When a user queries a row, the in-memory Bloom filter first filters out the large number of requests for rows that do not exist, and only the remaining requests go to disk.
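The read path described above can be sketched as follows. GuardedStore is a hypothetical class written for this example and does not reflect the internals of any of the databases named; a plain Python set stands in for the Bloom filter, so this sketch never produces a false positive, whereas a real filter occasionally would.

```python
class GuardedStore:
    """Hypothetical key-value store that checks a filter before reading 'disk'."""

    def __init__(self, rows):
        self.disk = dict(rows)        # stand-in for on-disk data
        self.filter = set(self.disk)  # stand-in for an in-memory Bloom filter
        self.disk_reads = 0

    def get(self, key):
        if key not in self.filter:
            return None               # definitely absent: no disk read needed
        self.disk_reads += 1          # present, or a false positive: must check
        return self.disk.get(key)


store = GuardedStore({"row1": "a", "row2": "b"})
for key in ["row1", "missing1", "missing2", "missing3"]:
    store.get(key)
print(store.disk_reads)  # 1
```

Only the lookup for the existing row reaches the disk stand-in; the three lookups for missing rows are answered from memory.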
Spam filtering in mail systems also commonly uses Bloom filters. Because of this, a normal email occasionally ends up in the spam folder. This is caused by a false positive, and the probability is very low.
Install
Use docker directly
> docker pull redislabs/rebloom # pull the image
> docker run -p6379:6379 redislabs/rebloom # run the container
> redis-cli # connect to the Redis server in the container
Plug-in installation
# Download, build, and install the Rebloom plug-in
wget https://github.com/RedisLabsModules/rebloom/archive/v1.1.1.tar.gz
# Extract the archive
tar zxvf v1.1.1.tar.gz
cd rebloom-1.1.1
make
# Add the corresponding parameter to the Redis service start command
rebloom_module="/usr/local/rebloom/rebloom.so"
daemon --user ${REDIS_USER-redis} "$exec $REDIS_CONFIG --loadmodule $rebloom_module --daemonize yes --pidfile $pidfile"
# Restart the Redis service
Test command
bf.add test testValue
If the command succeeds, the plug-in has been loaded successfully.
Use
The Bloom filter has two basic commands: bf.add adds an element, and bf.exists queries whether an element exists. Their usage is similar to sadd and sismember for sets. Note that bf.add can only add one element at a time. To add several elements at once, use bf.madd. Similarly, to query whether several elements exist at once, use bf.mexists.
127.0.0.1:6379> bf.add codehole user1
(integer) 1
127.0.0.1:6379> bf.exists codehole user1
(integer) 1
127.0.0.1:6379> bf.madd codehole user4 user5 user6
127.0.0.1:6379> bf.mexists codehole user4 user5 user6 user7
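The same batch commands can be issued from Python. The sketch below assumes a redis-py style client that exposes execute_command, as in the script later in this article, and assumes the server answers bf.madd and bf.mexists with one integer per item. The helper names add_many and filter_unseen are made up for this example.

```python
def add_many(client, key, items):
    """Add items with one bf.madd call; return one flag per item.

    A flag is True when the item was newly added, and False when the
    filter reports that it may already have been present.
    """
    items = list(items)
    if not items:
        return []
    replies = client.execute_command("bf.madd", key, *items)
    return [reply == 1 for reply in replies]


def filter_unseen(client, key, items):
    """Return the items that bf.mexists reports as definitely absent."""
    items = list(items)
    if not items:
        return []
    replies = client.execute_command("bf.mexists", key, *items)
    return [item for item, reply in zip(items, replies) if reply == 0]
```

With a local server that has the plug-in loaded, passing redis.StrictRedis() as client and "codehole" as key reproduces the session above. Items returned by filter_unseen are certain to be new, while items it drops are only probably present.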
Parameter tuning
The Bloom filter used above has the default parameters; it is created automatically the first time we add an element. Redis also supports Bloom filters with custom parameters, which must be created explicitly with bf.reserve before the first add. If the key already exists, bf.reserve reports an error. bf.reserve takes three parameters: key, error_rate, and initial_size. The lower the error_rate, the more space is required. The initial_size parameter is the expected number of elements; once the actual number exceeds this value, the false positive rate rises.
It is therefore necessary to set a sufficiently large value in advance, so that the limit is not exceeded and the false positive rate does not increase. If bf.reserve is not used, the default error_rate is 0.01 and the default initial_size is 100.
import random
import string

import redis

client = redis.StrictRedis()


def random_string(n):
    # Build a random lowercase string of length n.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(n))


# Generate unique random user names and split them in half:
# one half is added to the filter, the other half is only queried.
users = list(set(random_string(64) for _ in range(100000)))
print('total users', len(users))
half = len(users) // 2
users_train = users[:half]
users_test = users[half:]

falses = 0
client.delete("codehole")
# Added line: create the filter explicitly with error_rate 0.001
# and initial_size 50000 instead of relying on the defaults.
client.execute_command("bf.reserve", "codehole", 0.001, 50000)
for user in users_train:
    client.execute_command("bf.add", "codehole", user)
print('all trained')
# None of the test users were added, so every hit is a false positive.
for user in users_test:
    ret = client.execute_command("bf.exists", "codehole", user)
    if ret == 1:
        falses += 1
print(falses, len(users_test))
The output is as follows:
total users 100000
all trained
6 50000
We see that the false positive rate is about 0.012% (6 out of 50000), which is much lower than the configured 0.1%. The configured value is itself only a probabilistic estimate, so some deviation is expected. As long as the measured rate is not much higher than the configured one, this is normal.
If the initial_size estimate of the Bloom filter is too large, storage space is wasted; if it is too small, accuracy suffers. Before using the filter, estimate the number of elements as accurately as possible and add some headroom, in case the actual number of elements turns out to be much higher than the estimate.
The smaller the error_rate of the Bloom filter, the more storage space is needed. Where high precision is not required, a slightly larger error_rate does no harm. For example, in news deduplication, a higher false positive rate only prevents a small portion of articles from being seen by the right readers, and the overall readership of the articles does not change much because of it.
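To get a feel for this trade-off, the textbook formulas for a classic Bloom filter can be evaluated directly. This is a rough sketch under the assumption that those formulas apply: the function names are made up for this example, and the plug-in's real memory layout and hash count may differ, so treat the numbers as estimates rather than the module's exact usage.

```python
import math


def optimal_bits(n, p):
    """Bits needed for n elements at false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hashes(m, n):
    """Hash count that minimizes the error for m bits and n elements."""
    return max(1, round(m / n * math.log(2)))


def expected_error(m, k, n):
    """Approximate false positive rate after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k


n = 50000
# Lower error rates cost more space for the same number of elements.
for p in (0.01, 0.001, 0.0001):
    m = optimal_bits(n, p)
    k = optimal_hashes(m, n)
    print(p, m // 8 // 1024, "KiB,", k, "hashes")

# Overfilling a filter sized for n elements raises the error rate sharply.
m = optimal_bits(n, 0.001)
k = optimal_hashes(m, n)
print(expected_error(m, k, n))      # close to the target 0.001
print(expected_error(m, k, 2 * n))  # far above 0.001
```

Under these formulas, each tenfold reduction in error_rate adds a fixed number of bits per element, while inserting twice the planned number of elements pushes the error rate well beyond the target. That is why both the headroom on initial_size and a realistic error_rate matter.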