
Brief introduction

HyperLogLog provides an approximate de-duplicated counting solution for very large data volumes, such as a website's UV (unique visitors). Although the count is not exact, the error is small: the standard error is 0.81%.
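Where does 0.81% come from? The standard error of a HyperLogLog is roughly 1.04/sqrt(m), where m is the number of internal registers, and the Redis implementation uses 2^14 = 16384 registers, so 1.04/128 ≈ 0.81%. A quick check in Python:

import math

# Standard error of a HyperLogLog is about 1.04 / sqrt(m), where m is the
# number of registers. Redis uses m = 2**14 = 16384 registers, which is
# where the 0.81% figure comes from.
m = 2 ** 14
print("%.2f%%" % (1.04 / math.sqrt(m) * 100))  # prints "0.81%"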

Commands

HyperLogLog provides two commands, pfadd and pfcount, whose meanings are clear from their names: one adds elements to the counter and the other retrieves the count. pfadd is used the same way as the set command sadd: when a user ID arrives, just add it. pfcount is used the same way as scard: it returns the count value directly.

127.0.0.1:6379> pfadd codehole user1 user2 user3
(integer) 3
127.0.0.1:6379> pfcount codehole
(integer) 3

Next, we use a script to pour in much more data and see whether the count stays accurate, and if not, how big the gap is.

import redis

client = redis.StrictRedis()
# Add 100,000 distinct user IDs to the HyperLogLog, then read back the count.
for i in range(100000):
    client.pfadd("codehole", "user%d" % i)
print(100000, client.pfcount("codehole"))

> python pftest.py
100000 99723
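The gap on this run is 277 out of 100,000, a relative error of roughly 0.28%, comfortably inside the 0.81% standard error. If you want to measure the gap yourself, here is a minimal sketch (the key name codehole:test and a local Redis on the default port are assumptions) that keeps an exact count in a Python set alongside the HyperLogLog and prints the relative error:

import redis

client = redis.StrictRedis()
client.delete("codehole:test")  # start from an empty counter

exact = set()
for i in range(100000):
    user = "user%d" % i
    exact.add(user)                       # exact de-duplication
    client.pfadd("codehole:test", user)   # approximate de-duplication

estimate = client.pfcount("codehole:test")
error = abs(estimate - len(exact)) / float(len(exact))
print("exact=%d estimate=%d error=%.2f%%" % (len(exact), estimate, error * 100))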

pfmerge

In addition to pfadd and pfcount above, HyperLogLog provides a third command, pfmerge, which merges several HyperLogLog counters into a new one: the destination ends up counting the union of all the source sets.

For example, suppose the website has two pages with similar content, and the operations team asks for the data of the two pages to be merged, including their UV counts. This is where pfmerge comes in handy.

# Merge the three keys into the first key: test_uv
127.0.0.1:6379> PFMERGE test_uv test_uv2 test_uv3
OK
127.0.0.1:6379> pfcount test_uv
(integer) 5
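The same thing works from redis-py. Below is a minimal sketch (the key names page1_uv, page2_uv and both_uv are made up for illustration) that merges the UV counters of two pages into a third key:

import redis

client = redis.StrictRedis()
client.delete("page1_uv", "page2_uv", "both_uv")  # start clean

client.pfadd("page1_uv", "user1", "user2", "user3")
client.pfadd("page2_uv", "user3", "user4", "user5")

# pfmerge stores the union into the destination key, so the overlapping
# visitor (user3) is only counted once.
client.pfmerge("both_uv", "page1_uv", "page2_uv")
print(client.pfcount("both_uv"))  # expected to print 5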
