
In mobile Internet business scenarios, data volumes are huge, and we often need to store this kind of information: a key is associated with a data set, and we then need to run statistics over that set.

For example:

  • Count the daily active and monthly active users of an APP;
  • Count how many different accounts visit a page every day (Unique Visitors, UV);
  • Count the number of distinct terms users search for every day;
  • Count the number of registered IPs.

Usually, the user and visit counts we face are huge: users at the level of millions or tens of millions, and access records at the level of tens of millions or even billions.

Today, "Code Brother" uses different data types to implement the same feature, counting how many different accounts visit a page every day, gradually leading to the principles of HyperLogLog together with hands-on Java Redisson practice.

Here is a handy trick: the official Redis website now lets you run Redis commands online: https://redis.io/ . As shown in the figure:

Redis running online

Implemented using Set

A user who visits a website multiple times in one day should only be counted once, so it is natural to implement this with a Redis Set.

For example, when the user with WeChat ID "Xiao Caiji" visits the article "Why is Redis so fast", we store this information in the Set:

 SADD Redis为什么这么快:uv 肖菜鸡 谢霸哥 肖菜鸡
(integer) 2

"Xiao Caiji" visited the "Why is Redis so fast" page many times, but the deduplication of Set ensures that the same WeChat ID is never recorded twice.

Use the SCARD command to get the page UV of "Why is Redis so fast". The command returns the number of elements in a Set (here, the number of user IDs).

 SCARD Redis为什么这么快:uv
(integer) 2
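The dedup behavior of SADD and SCARD can be sketched in plain Java with a HashSet as a local stand-in (no Redis involved; the visitor names are the sample users from this article):

```java
import java.util.HashSet;
import java.util.Set;

public class SetUv {
    public static void main(String[] args) {
        // A HashSet deduplicates just like a Redis Set:
        // repeated visits by the same user are stored once.
        Set<String> uv = new HashSet<>();
        uv.add("肖菜鸡");
        uv.add("谢霸哥");
        uv.add("肖菜鸡"); // duplicate visit, ignored

        // Equivalent of SCARD: the number of distinct visitors.
        System.out.println(uv.size()); // prints 2
    }
}
```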

Implemented using Hash

You can also implement this with the Hash type: use the user ID as a field of the Hash, and execute the HSET command to set that field's value to 1 when the user visits the page.

Even if "Xiao Caiji" visits the page repeatedly and the command is executed repeatedly, it only sets the "肖菜鸡" field's value to 1 again.

Finally, use the HLEN command to count the number of fields in the Hash; that number is the UV.

As follows:

 HSET Redis为什么这么快 肖菜鸡 1
// count the UV
HLEN Redis为什么这么快
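The same idea can be sketched locally with a Java HashMap, where put mirrors HSET and size mirrors HLEN (a plain-Java illustration, not the Redisson API):

```java
import java.util.HashMap;
import java.util.Map;

public class HashUv {
    public static void main(String[] args) {
        // Map keys play the role of Hash fields set by HSET.
        Map<String, Integer> page = new HashMap<>();
        page.put("肖菜鸡", 1);
        page.put("肖菜鸡", 1); // repeat visit just overwrites the same field
        page.put("谢霸哥", 1);

        // Equivalent of HLEN: the number of fields is the UV.
        System.out.println(page.size()); // prints 2
    }
}
```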

Implemented using Bitmap

The underlying data structure of Bitmap is the SDS of the String type, used to store a bit array. Redis uses all 8 bits of each byte, and each bit represents the binary state of one element (either 0 or 1).

Bitmap provides the GETBIT and SETBIT operations to read and write the bit at a given position of the bit array via an offset value. Note that offsets start from 0.

Bitmap can be regarded as an array of bits: each cell of the array stores only 0 or 1, and the array subscript is called the offset.

For a visual picture, imagine each byte of the buf array as a row of 8 cells, each cell representing one bit of that byte, as shown in the following figure:

Bitmap

8 bits form one byte, so Bitmap saves a great deal of storage space. This is Bitmap's big advantage.

How to use Bitmap to count the number of unique user visits to a page?

Bitmap provides the SETBIT and BITCOUNT operations. The former writes the bit at a given position of the bit array via an offset value; again, offsets start from 0.

The latter counts the number of bits set to 1 in the given bit array.

Note that we need to map each "WeChat ID" to a number first, because the offset is an array subscript.

Suppose we map "Xiao Caiji" to the code 6.

Step one: execute the following command to record that the user coded 6 ("Xiao Caiji") visited the article "Using Redis data types to achieve billion-level data statistics".

 SETBIT 巧用Redis数据类型实现亿级数据统计 6 1

Step two: count the page visits with the BITCOUNT command, which counts the number of bits set to 1 in the given bit array.

 BITCOUNT 巧用Redis数据类型实现亿级数据统计
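The JDK's BitSet makes a handy local analog of SETBIT and BITCOUNT (the offsets 6 and 13 are made-up encodings for two users, just for illustration):

```java
import java.util.BitSet;

public class BitmapUv {
    public static void main(String[] args) {
        // Each user ID is mapped to an integer offset in advance.
        BitSet page = new BitSet();
        page.set(6);   // like SETBIT <page> 6 1  -- "Xiao Caiji" encoded as 6
        page.set(6);   // repeat visits set the same bit, no change
        page.set(13);  // another user, encoded as 13

        // Equivalent of BITCOUNT: the number of bits set to 1.
        System.out.println(page.cardinality()); // prints 2
    }
}
```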

HyperLogLog: The King Solution

Set works, but if an article goes viral and reaches tens of millions of visitors, a single Set has to store tens of millions of user IDs; with more pages, the memory cost becomes prohibitive.

The same is true for the Hash data type.

As for Bitmap, it is better suited to "binary-state statistics" scenarios, and its counts are exact. Although it uses less memory than Hash, it still occupies a large amount of memory for massive data.

What to do?

These are all typical "cardinality statistics" scenarios. Cardinality statistics: counting the number of unique elements in a set.

The advantage of HyperLogLog is that the memory it requires does not grow with the size of the set. No matter how many elements the set contains, the memory HyperLogLog needs for its calculation is always fixed, and very small.

Each HyperLogLog costs at most 12 KB of memory and, with a standard error of 0.81%, can count the cardinality of nearly 2^64 elements.

Redis in action

HyperLogLog could not be simpler to use: three commands, PFADD, PFCOUNT, and PFMERGE, conquer the world.

PFADD

Add each user ID that visits the page to the HyperLogLog.

 PFADD Redis主从同步原理:uv userID1 userID2 userID3

PFCOUNT

Use PFCOUNT to obtain the UV of the article "Redis Master-Slave Synchronization Principle".

 PFCOUNT Redis主从同步原理:uv

PFMERGE usage scenarios

In addition to PFADD and PFCOUNT above, HyperLogLog also provides PFMERGE.

Syntax:

 PFMERGE destkey sourcekey [sourcekey ...]

For example, suppose our website has two pages with similar content, and the operations team wants the data of these two pages merged.

That includes merging the UV of the two pages, and this is where PFMERGE comes in handy: the same user visiting both pages is counted only once.

As shown below, two HyperLogLogs, Redis数据 and MySQL数据, store the user access data of the two pages respectively.

 PFADD Redis数据 user1 user2 user3
PFADD MySQL数据 user1 user2 user4
PFMERGE 数据库 Redis数据 MySQL数据
PFCOUNT 数据库 // returns 4

PFMERGE merges multiple HyperLogLogs into one, whose cardinality is close to that of the union of the observed sets of all input HyperLogLogs.

user1 and user2 each visited both pages, but each is counted as only one visit.

Redisson in action

"Code Brother" has uploaded the full source code to GitHub: https://github.com/MageByte-Zero/springboot-parent-pom.git

pom dependencies

 <dependency>
  <groupId>org.redisson</groupId>
  <artifactId>redisson-spring-boot-starter</artifactId>
  <version>3.16.7</version>
</dependency>

Add data to the log

 // add a single element
public <T> void add(String logName, T item) {
  RHyperLogLog<T> hyperLogLog = redissonClient.getHyperLogLog(logName);
  hyperLogLog.add(item);
}

// add a whole collection to the HyperLogLog
public <T> void addAll(String logName, List<T> items) {
  RHyperLogLog<T> hyperLogLog = redissonClient.getHyperLogLog(logName);
  hyperLogLog.addAll(items);
}

Merge

 /**
 * Merge the logs in otherLogNames into logName
 *
 * @param logName       the current log
 * @param otherLogNames other logs to merge into the current one
 * @param <T>
 */
public <T> void merge(String logName, String... otherLogNames) {
  RHyperLogLog<T> hyperLogLog = redissonClient.getHyperLogLog(logName);
  hyperLogLog.mergeWith(otherLogNames);
}

Count the cardinality

 public <T> long count(String logName) {
  RHyperLogLog<T> hyperLogLog = redissonClient.getHyperLogLog(logName);
  return hyperLogLog.count();
}

Unit tests

 @Slf4j
@RunWith(SpringRunner.class)
@SpringBootTest(classes = RedissionApplication.class)
public class HyperLogLogTest {

    @Autowired
    private HyperLogLogService hyperLogLogService;

    @Test
    public void testAdd() {
        String logName = "码哥字节:Redis为什么这么快:uv";
        String item = "肖菜鸡";
        hyperLogLogService.add(logName, item);
        log.info("Added element [{}] to log [{}].", item, logName);
    }

    @Test
    public void testCount() {
        String logName = "码哥字节:Redis为什么这么快:uv";
        long count = hyperLogLogService.count(logName);
        log.info("logName = {} count = {}.", logName, count);
    }

    @Test
    public void testMerge() {
        ArrayList<String> items = new ArrayList<>();
        items.add("肖菜鸡");
        items.add("谢霸哥");
        items.add("陈小白");

        String otherLogName = "码哥字节:Redis多线程模型原理与实战:uv";
        hyperLogLogService.addAll(otherLogName, items);
        log.info("Added {} elements to log [{}].", items.size(), otherLogName);

        String logName = "码哥字节:Redis为什么这么快:uv";
        hyperLogLogService.merge(logName, otherLogName);
        log.info("Merged {} into {}.", otherLogName, logName);

        long count = hyperLogLogService.count(logName);
        log.info("count after merge = {}.", count);
    }
}

How it works

HyperLogLog is a probabilistic data structure: it uses a probabilistic algorithm to count the approximate cardinality of a set. The algorithm originates from the Bernoulli process.

A Bernoulli process is a coin-tossing experiment: toss a fair coin, and it lands heads or tails, each with probability 1/2.

The Bernoulli process here is to keep tossing the coin until it lands heads, recording the number of tosses as k.

For example, if the first toss lands heads, k is 1; if the first toss is tails, keep tossing, and if heads first appears on the third toss, k is 3.

For n Bernoulli processes, we get n toss counts k1, k2, ..., kn, whose maximum is k_max.

A mathematical derivation gives the conclusion: 2^k_max can be used as an estimate of n.

In other words, you can approximate the number of Bernoulli processes from the maximum number of tosses.

Therefore, the basic idea of HyperLogLog is to use the maximum position of the first 1 in the bit strings of the set's elements to estimate the overall cardinality. However, this estimate has a large error. To reduce it, HyperLogLog introduces bucket averaging: it computes the harmonic mean across m buckets.
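The bucket-averaging idea can be sketched as a tiny estimator. This is an illustration only: the bias-correction constant alpha and the bucket values are omitted or made up, so it is not Redis's exact implementation.

```java
public class BucketEstimate {
    // Raw harmonic-mean estimate over m buckets, each holding its k_max.
    // Real HyperLogLog multiplies this by a bias-correction constant alpha,
    // omitted here for clarity.
    static double estimate(int[] buckets) {
        double sum = 0;
        for (int k : buckets) {
            sum += Math.pow(2, -k); // each bucket contributes 2^-k
        }
        int m = buckets.length;
        return (double) m * m / sum; // m times the harmonic mean of the 2^k values
    }

    public static void main(String[] args) {
        // If every bucket's k_max grows by 1, the estimate doubles,
        // mirroring the 2^k_max intuition of a single Bernoulli process.
        int[] low  = {3, 3, 3, 3};
        int[] high = {4, 4, 4, 4};
        System.out.println(estimate(low));                  // prints 32.0
        System.out.println(estimate(high) / estimate(low)); // prints 2.0
    }
}
```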

HyperLogLog in Redis has 2^14 = 16384 buckets in total, and each bucket is 6 bits, as shown in the figure below.

Image source: 程序员历小冰

The full principle of HyperLogLog is quite involved; this article only sketches the intuition, and interested readers can dig deeper on their own.

Redis optimizes HyperLogLog's storage. When the count is small, it uses a sparse representation, which takes very little space.

Only when the count grows large and the sparse representation exceeds a threshold is it converted into the dense representation, which occupies 12 KB.

Why only 12 KB?

The HyperLogLog implementation uses 2^14 = 16384 buckets; each bucket stores its count in 6 bits, which can represent values up to 63, so the total memory occupied is 2^14 * 6 / 8 = 12 KB.
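The arithmetic above can be checked in a few lines:

```java
public class HllMemory {
    public static void main(String[] args) {
        int buckets = 1 << 14;     // 2^14 = 16384 buckets
        int bitsPerBucket = 6;     // 6 bits per bucket, values up to 63
        int bytes = buckets * bitsPerBucket / 8;
        System.out.println(bytes); // prints 12288, i.e. 12 KB
    }
}
```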

Summary

We implemented UV statistics with Set, Hash, Bitmap, and HyperLogLog:

  • Hash: the approach is simple and the count is exact; it works with small data, but massive data costs a great deal of memory.
  • Bitmap: suited to "binary-state statistics" scenarios (see my earlier article for details); for statistics across a large number of different pages it still occupies a lot of memory.
  • Set: relies on deduplication; one Set storing tens of millions of user IDs, multiplied across many pages, consumes far too much memory.
  • HyperLogLog: in Redis, each HyperLogLog key needs only 12 KB of memory to count the cardinality of nearly 2^64 distinct elements. Because HyperLogLog only computes the cardinality from its input and does not store the input elements themselves, it cannot return individual elements the way a Set can.
  • HyperLogLog is an algorithm, not unique to Redis.
  • Its purpose is cardinality statistics, so it is not a collection: it stores no element data, only the count.
  • Its space cost is tiny, and it supports extremely large volumes of input.
  • At its core is a cardinality estimation algorithm, reflected in its memory use during computation and in how merges are handled; the final value carries a certain error.
  • In Redis, each HyperLogLog key occupies up to 12 KB of memory for counting (per the official documentation).
  • The PFADD command does not allocate 12 KB at once; memory grows gradually with the cardinality. PFMERGE, however, merges the source keys into a 12 KB dense key, which follows naturally from how the merge works (the value of each bucket must be compared one by one when merging two HyperLogLogs).
  • On the error: the cardinality estimate is an approximation with a standard error of 0.81%, an acceptable range.
  • Redis optimizes HyperLogLog storage: when the count is small it uses a sparse representation with a tiny footprint; as the count grows and the sparse form exceeds a threshold, it is converted once into the dense form, occupying 12 KB.
