1. Overview
1.1 Business Background
When recommending videos, vivo short video needs to filter out videos that users have already watched, to avoid repeatedly recommending the same video and hurting the experience. When processing a recommendation request, videos are first recalled based on user interests (roughly 2,000 to 10,000 videos per request); the recalled videos are then deduplicated against the videos the user has already watched, so that only unwatched videos remain; finally, the remaining videos are ranked and the highest-scoring ones are delivered to the user.
1.2 Current status
The current recommendation deduplication is implemented with Redis ZSets. The server writes the videos reported by the client's playback tracking events and the videos delivered to the client into Redis ZSets under different keys. After recall, the recommendation algorithm reads the user's complete playback and delivery records (the entire ZSets) from Redis, builds an in-memory Set, and deduplicates by checking whether each recalled video already exists in the delivered or played Set. The general process is shown in Figure 1.
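For reference, a minimal sketch of this in-memory filtering step (names and structure are illustrative, not vivo's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the in-memory dedup step: recalled videos are filtered
// against the user's played and delivered records, both read from Redis ZSets.
public class RecallDedup {

    public static List<String> filterWatched(List<String> recalledVideoIds,
                                             Set<String> playedVideoIds,
                                             Set<String> deliveredVideoIds) {
        List<String> candidates = new ArrayList<>();
        for (String videoId : recalledVideoIds) {
            // keep only videos that appear in neither record
            if (!playedVideoIds.contains(videoId) && !deliveredVideoIds.contains(videoId)) {
                candidates.add(videoId);
            }
        }
        return candidates;
    }
}
```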
(Figure 1: Current status of short video deduplication)
Deduplication should ideally be based on the videos a user has actually watched, but those are reported through the client's tracking events and arrive with some delay. The server therefore also keeps the user's last 100 delivery records for deduplication, which guarantees that a video already delivered will not be recommended again even before its tracking event is reported. Since delivered videos may never actually be exposed to the user, only 100 of them are kept, so that a video the user did not watch can still be recommended once another 100 records have been delivered.
The main problem with the current solution is its large Redis memory footprint, because video IDs are stored in the Redis ZSet as raw strings. To control memory usage and keep read/write performance acceptable, we cap the length of each user's playback record; the current limit is 10,000 entries per user, which hurts the product experience for heavy users.
2. Solution Research
2.1 Mainstream solutions
First, the storage format. Video deduplication only needs to answer whether a video exists in the record, so there is no need to store the original video ID. The most common approach is to store several hash values of each video in a Bloom filter, which cuts the storage space by several times or even an order of magnitude.
Second, the storage medium. To support 90 days (three months) of playback records instead of the current crude cap of 10,000 entries, the required Redis capacity would be enormous. For example, assuming 50 million users, an average of 10,000 videos played per user over 90 days, and 25 B of memory per video ID, a total of 12.5 TB would be needed. Since the deduplication data is ultimately read into memory anyway, we can trade some read performance for a much larger storage space. Moreover, the Redis we currently use is not persistent: if Redis fails, the data is lost and hard to recover (with this much data, recovery would take a very long time).
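The 12.5 TB figure is simply the product of those three assumptions:

$$
5\times 10^{7}\ \text{users} \times 10^{4}\ \tfrac{\text{videos}}{\text{user}} \times 25\ \tfrac{\text{B}}{\text{video}} = 1.25\times 10^{13}\ \text{B} \approx 12.5\ \text{TB}
$$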
At present, the most commonly used solution in the industry is disk KV storage (typically built on RocksDB for persistence, with SSDs as the underlying disks).
2.2 Technical selection
First, playback records. Since at least three months of playback history must be supported, storing the watched videos in Bloom filters greatly reduces the space compared with storing raw video IDs. Designing for 50 million users, even Bloom-filter-encoded playback records would still amount to more than a terabyte if kept in Redis. Considering that the filtering is ultimately performed in the service's local memory, slightly lower read performance is acceptable, so we use persistent disk KV storage to hold the playback records in Bloom filter form.
Second, delivery records. Only the latest 100 delivered videos need to be kept per user, so the overall data volume is small; and since records beyond the latest 100 must be evicted, we keep using Redis to store the most recent 100 delivery records.
3. Scheme design
Based on the above technical selection, we plan to add a unified deduplication service that supports writing delivery and playback records and deduplicating recalled videos against them. The key point is how playback tracking events are stored into Bloom filters after being received. As shown in Figure 2, writing a playback event into the disk KV in Bloom filter form takes three steps: first, read and deserialize the Bloom filter, creating a new one if it does not exist; second, add the played video ID to the Bloom filter; third, serialize the updated Bloom filter and write it back to the disk KV.
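A minimal sketch of these three steps, assuming Guava's BloomFilter for the filter itself and a hypothetical DiskKvClient interface for the disk KV (the real client API will differ):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of the three write steps; DiskKvClient is a hypothetical byte[] get/put interface.
public class BloomWriter {

    interface DiskKvClient {
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    private static final long EXPECTED_INSERTIONS = 10_000;   // assumed capacity per filter
    private static final double FPP = 0.01;                   // assumed false-positive rate

    public void appendPlayRecords(DiskKvClient kv, String key, List<String> playedVideoIds)
            throws IOException {
        // Step 1: read and deserialize the Bloom filter; create it if absent.
        byte[] stored = kv.get(key);
        BloomFilter<CharSequence> filter = (stored == null)
                ? BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8),
                                     EXPECTED_INSERTIONS, FPP)
                : BloomFilter.readFrom(new ByteArrayInputStream(stored),
                                       Funnels.stringFunnel(StandardCharsets.UTF_8));

        // Step 2: add the played video IDs to the Bloom filter.
        for (String videoId : playedVideoIds) {
            filter.put(videoId);
        }

        // Step 3: serialize the updated filter and write it back to the disk KV.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        filter.writeTo(out);
        kv.put(key, out.toByteArray());
    }
}
```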
(Figure 2: Main steps of unified deduplication service)
The whole process is straightforward, but to support tens of millions of users (we design for a target of 50 million), four issues need to be considered:
- First, videos are delivered swipe by swipe (5 to 10 videos per swipe), while playback events are reported per video, so for deduplication the write QPS is higher than the read QPS. At the same time, unlike Redis, a disk KV's write performance is lower than its read performance. How to write Bloom filters into the disk KV for 50 million users is therefore an important problem.
- Second, since Bloom filters do not support deletion, data older than a certain period must be expired and evicted, otherwise data that is no longer used will occupy storage forever. How to expire and evict Bloom filters is another important problem.
- Third, the server and the algorithm currently interact directly through Redis. We want to build a unified deduplication service that the algorithm calls to filter out watched videos. The server side uses a Java technology stack while the algorithm side uses C++, so the Java service must be callable from C++. We ended up exposing a gRPC interface for the algorithm to call, with Consul as the registry; this part is not the focus and will not be elaborated.
- Fourth, after switching to the new scheme, we want to migrate the playback records previously stored in Redis ZSets into Bloom filters, so that the upgrade is smooth and the user experience is not affected. Designing the migration scheme is also an important problem.
3.1 Overall Process
The overall flow of the unified deduplication service and its interaction with upstream and downstream systems is shown in Figure 3. When the server delivers videos, it saves the delivery records to the corresponding Redis key through the deduplication service's Dubbo interface; using a Dubbo call ensures the delivery records are written immediately. The service also listens to the playback tracking events and stores them into the disk KV in Bloom filter form; for performance reasons this uses a batch-write scheme, described in detail below. Finally, the service exposes an RPC interface for the recommendation algorithm, which filters out the videos the user has already watched from the recalled videos.
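A minimal sketch of the delivery-record write, assuming a Jedis client and an illustrative key pattern; keeping only the latest 100 entries is done by trimming the ZSet after each insert:

```java
import redis.clients.jedis.Jedis;

// Sketch of writing a delivery record while keeping only the 100 most recent
// entries per user; the key pattern and the Jedis client are assumptions.
public class DeliveryRecorder {

    private static final int MAX_DELIVERED = 100;

    public void recordDelivery(Jedis jedis, String userId, String videoId) {
        String key = "delivered:user:" + userId;              // hypothetical key pattern
        jedis.zadd(key, System.currentTimeMillis(), videoId); // score by delivery time
        // trim so only the latest 100 delivered videos remain
        jedis.zremrangeByRank(key, 0, -(MAX_DELIVERED + 1));
    }
}
```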
(Figure 3: The overall process of unified deduplication service)
The write performance of a disk KV is much worse than its read performance, especially when values are large. Since direct writes cannot keep up at a scale of tens of millions of daily active users, we need a write-aggregation scheme: the playback records of the same user within a time window are aggregated and written together, which greatly reduces the write frequency and relieves the write pressure on the disk KV.
3.2 Traffic Aggregation
To aggregate write traffic, we temporarily stage played videos in Redis and then, at intervals, build Bloom filters from the staged videos and write them to the disk KV. Specifically, we considered two approaches: writing at most once every N minutes, and batch writing via scheduled tasks. Below we describe our design and considerations for traffic aggregation and Bloom filter writing.
3.2.1 Near real-time writing
Ideally, a playback event reported by the client would be applied to the Bloom filter and saved to the disk KV immediately. To reduce write frequency, however, we can first save the played video ID to Redis and write the disk KV at most once every N minutes. We call this the near real-time write scheme.
The simplest idea is to save a value in Redis on each write and let it expire after N minutes. When a playback event arrives, we check whether the value exists: if it does, the disk KV has already been written within the last N minutes and we skip this write; otherwise we perform the disk KV write. The intent is to avoid writing immediately when data arrives and instead wait N minutes to accumulate a small batch. The value acts like a "lock" that protects the disk KV from being written more than once every N minutes: as shown in Figure 4, while it is held, further lock attempts fail and no disk KV writes occur. From the perspective of the tracking data stream, this "lock" turns a continuous stream into micro-batches every N minutes, thereby aggregating traffic and reducing the write pressure on the disk KV.
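A minimal sketch of this guard, assuming a Jedis client; the key names and the value of N are illustrative:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Sketch of the "one write per N minutes" guard described above, using SET NX EX
// as a best-effort lock; key names and N are assumptions.
public class NearRealTimeWriter {

    private static final int N_MINUTES = 5; // assumed aggregation window

    public boolean tryAcquireWriteSlot(Jedis jedis, String userId) {
        String lockKey = "dedup:write:lock:" + userId;        // hypothetical key
        // Succeeds only if no write has happened within the last N minutes.
        String result = jedis.set(lockKey, "1", SetParams.setParams().nx().ex(N_MINUTES * 60));
        return "OK".equals(result);
    }

    public void onPlayEvent(Jedis jedis, String userId, String videoId) {
        // stage the played video ID first
        jedis.zadd("played:staging:" + userId, System.currentTimeMillis(), videoId);
        if (tryAcquireWriteSlot(jedis, userId)) {
            // flush the staged IDs for this user into the Bloom filter / disk KV here
        }
    }
}
```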
(Figure 4: Near real-time write scheme)
The idea behind near real-time writing is simple and its advantage is obvious: video IDs from playback events are written into the Bloom filter almost in real time, and the staging window is short (N minutes), so data does not linger in the Redis ZSet for long. However, a closer analysis reveals several special scenarios that must be considered:
First, the value saved in Redis is effectively a distributed lock, and in practice it is hard to guarantee that such a "lock" is absolutely safe. Two playback events may therefore both decide that a disk KV write is needed, and the staged data they read may differ. Because the disk KV does not natively support a Bloom filter structure, each write must first read the current Bloom filter from the disk KV, add the new video IDs, and write it back; with two concurrent writers, data can be lost.
Second, the data of the last N minutes is only flushed to the disk KV when the user's next playback event arrives. If there are many inactive users, a large amount of staged data will remain in Redis and occupy space. If a scheduled task is then used to flush this data, the concurrent-write data-loss problem from the first scenario easily reappears.
So although the starting point of the near real-time write scheme is straightforward, it becomes more and more complicated on closer inspection, and we had to look for other schemes.
3.2.2 Batch write
Since the near real-time write scheme is complex, we consider a simpler one: writing the staged data to the disk KV in batches through scheduled tasks. We mark the data to be written; if we write once an hour, we can tag the staged data with its hour value. Scheduled tasks may inevitably fail, however, so a compensation mechanism is needed. A common approach is to have each run also process the data from the previous one or two hours, but that is not elegant. Inspired by the time wheel, we designed a batch write scheme for the Bloom filters on that basis.
Connecting the hour values end to end gives a ring, and the staged data is stored at the node marked by its hour value. Data with the same hour value (for example, 11 o'clock every day) is then stored together, so if today's data is not synchronized to the disk KV because the task did not run or failed, it gets another chance to be compensated the next day.
Following this idea, we can take the hour value modulo some number to shorten the interval between two compensation opportunities. For example, with modulo 8 as shown in Figure 5, the data of 1:00~2:00 and of 9:00~10:00 both fall on node 1 of the time ring, so there is a compensation opportunity after 8 hours; in other words, the modulus is the compensation interval.
(Figure 5: Batch write scheme)
So what should the compensation interval be? This is worth thinking about, because the value affects how the data to be written is distributed around the ring. Our business has busy and idle hours, and busy hours produce more data. Based on the busy/idle pattern of short videos, we finally set the compensation interval to 6, so that busy-hour traffic falls evenly on the nodes of the ring.
After settling on the compensation interval, we still felt that waiting 6 hours for compensation is too long: users may watch many videos within 6 hours, and if the data is not synchronized to the disk KV in time it occupies a lot of Redis memory; moreover, we stage the playback records in Redis ZSets, whose performance suffers when they grow too long. We therefore run a scheduled task every hour, so that the second task compensates for the first; if that compensation also fails, the data gets a final, bottom-line chance after one full lap of the ring.
Careful readers will notice that in Figure 5 the "data to be written" and the scheduled tasks are not placed on the same nodes of the ring. We designed it this way to keep the scheme simple: a scheduled task only operates on data that is no longer changing, which avoids concurrency problems. It is like garbage collection in the Java virtual machine: you cannot collect garbage in a room while people are still throwing garbage into it. So each task node on the ring only processes the data on the previous node, ensuring no concurrent conflicts and keeping the scheme simple.
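One possible reading of the slot assignment and the previous-node rule, as a minimal sketch (ring size 6 per the text; the exact task scheduling details are assumptions):

```java
import java.time.LocalDateTime;

// Sketch of mapping play events and the hourly task onto the time ring,
// assuming a compensation interval (ring size) of 6.
public class TimeRing {

    private static final int RING_SIZE = 6; // compensation interval chosen above

    // Node that newly staged play data is attached to.
    public static int slotForWrite(LocalDateTime now) {
        return now.getHour() % RING_SIZE;
    }

    // The hourly task only processes the previous node, so the node currently
    // being written to is never read concurrently.
    public static int slotForTask(LocalDateTime now) {
        return (now.getHour() % RING_SIZE + RING_SIZE - 1) % RING_SIZE;
    }
}
```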
The batch write scheme is simple and free of concurrency problems, but it needs to stage an hour of data in Redis ZSets, which may exceed their comfortable length; in practice, however, an average user does not play an extremely large number of videos within an hour, so this is acceptable. We therefore chose the batch write scheme: simple, elegant, and efficient. On this basis, we still need to design how to stage the played video IDs of a very large number of users.
3.3 Data Sharding
To support 50 million daily active users, we need to design a data sharding scheme for the scheduled batch writes. First, we still stage the list of played videos in a Redis ZSet, because this data is also needed, before the Bloom filter is written, to filter out videos the user has already watched. As mentioned earlier, data is staged for one hour, and a user normally plays far fewer than 10,000 videos in an hour, so this is generally not a problem. Besides the video IDs themselves, we also need to record which users produced playback data during that hour; otherwise the scheduled task would not know whose playback records to write into Bloom filters. Storing 50 million user IDs requires data sharding.
Combined with the time ring introduced in the batch write section, we designed the data sharding scheme shown in Figure 6: the 50 million users are hashed into 5,000 Sets, so each Set stores at most about 10,000 user IDs and its performance is not affected. Each node on the time ring stores data according to this sharding scheme; expanded as in the lower part of Figure 6, the key played:user:${time node number}:${user hash value} stores the IDs of all users who produced playback data on a given shard at a given time node.
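A minimal sketch of the key layout, assuming 5,000 shards and the key pattern above; the hash function and the staging-ZSet key are illustrative assumptions:

```java
// Sketch of the data-sharding keys; hash function and key literals are assumptions
// modeled on the pattern played:user:${time node number}:${user hash value}.
public class PlayShardKeys {

    private static final int SHARD_COUNT = 5_000;

    // Non-negative hash of the user ID, mapped into 5,000 shards.
    public static int shardOf(String userId) {
        return Math.floorMod(userId.hashCode(), SHARD_COUNT);
    }

    // Set holding the IDs of users who produced play data on this ring node and shard.
    public static String playUserListKey(int timeNode, String userId) {
        return "played:user:" + timeNode + ":" + shardOf(userId);
    }

    // ZSet staging the videos a user played within the hour (assumed key pattern).
    public static String playHistoryKey(String userId) {
        return "played:videos:" + userId;
    }
}
```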
(Figure 6: Data Sharding Scheme)
Correspondingly, the scheduled tasks are also sharded, with each task shard responsible for a range of data shards. If the two were mapped one-to-one, the distributed scheduled task would need 5,000 shards; that would make per-shard retries easier but would put pressure on task scheduling, and in fact the company's scheduling platform does not support 5,000 shards. We therefore split the scheduled task into 50 shards: task shard 0 handles data shards 0~99, task shard 1 handles data shards 100~199, and so on.
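A minimal sketch of the task-to-data shard mapping under these numbers (the processing call is a placeholder):

```java
// Sketch of mapping 50 task shards onto 5,000 data shards, 100 data shards per task shard.
public class TaskSharding {

    private static final int DATA_SHARDS = 5_000;
    private static final int TASK_SHARDS = 50;
    private static final int SHARDS_PER_TASK = DATA_SHARDS / TASK_SHARDS; // 100

    public void run(int taskShard) {
        int first = taskShard * SHARDS_PER_TASK;            // task shard 0 -> 0, 1 -> 100, ...
        int last = first + SHARDS_PER_TASK - 1;             // task shard 0 -> 99, 1 -> 199, ...
        for (int dataShard = first; dataShard <= last; dataShard++) {
            processDataShard(dataShard);
        }
    }

    private void processDataShard(int dataShard) {
        // read the playback user list for this shard on the previous time-ring node,
        // then build and write the Bloom filters (omitted in this sketch)
    }
}
```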
3.4 Data Elimination
For the short-video deduplication scenario, we generally guarantee that a video will not be recommended to a user again within three months of being watched, so expired data must be eliminated. Bloom filters do not support deletion, so we store the user's playback history in per-month Bloom filters and set a corresponding expiration time; as shown in Figure 7, the expiration time is currently set to 6 months. When reading, the data of the last 4 months is selected according to the current time for deduplication. Four months are needed because the current month is incomplete: to guarantee no repeat recommendations within three months, we must read three full months plus the current month.
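A minimal sketch of the month-based key layout and the four-month read window; the key pattern is an assumption:

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-month Bloom filter keys and the read window used for dedup.
public class MonthlyBloomKeys {

    private static final DateTimeFormatter MONTH_FMT = DateTimeFormatter.ofPattern("yyyyMM");

    // One Bloom filter per user per month (assumed key pattern).
    public static String bloomKey(String userId, YearMonth month) {
        return "bloom:played:" + userId + ":" + month.format(MONTH_FMT);
    }

    // Keys to read for dedup: the current partial month plus the previous three full months.
    public static List<String> keysForDedup(String userId, YearMonth currentMonth) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            keys.add(bloomKey(userId, currentMonth.minusMonths(i)));
        }
        return keys;
    }
}
```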
(Figure 7: Data elimination scheme)
We also considered the expiration time carefully. Since data is stored by month, new data is mostly created at the beginning of a month; if the expiration time were simply set to exactly 6 months later, the beginning of each month would see both a surge of new data being written and a surge of old data being evicted, putting pressure on the database. We therefore spread the expiration times out: first, each key expires on a random day within the month 6 months later; second, the expiration is scheduled during idle hours, for example 00:00~05:00, to reduce the pressure on the database while it cleans up.
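A minimal sketch of how such a spread-out expiration time could be computed (the exact idle window and randomization are assumptions):

```java
import java.time.LocalDateTime;
import java.time.YearMonth;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of spreading expiration times: a random day in the month six months later,
// at a random time inside the 00:00~05:00 idle window.
public class ExpireTimeSpreader {

    public static LocalDateTime randomExpireTime(YearMonth dataMonth) {
        YearMonth expireMonth = dataMonth.plusMonths(6);
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        int day = rnd.nextInt(1, expireMonth.lengthOfMonth() + 1); // any day of that month
        int hour = rnd.nextInt(0, 5);                              // idle window 00:00~05:00
        int minute = rnd.nextInt(0, 60);
        return expireMonth.atDay(day).atTime(hour, minute);
    }
}
```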
3.5 Scheme Summary
Putting together the three designs above, traffic aggregation, data sharding, and data elimination, the overall design is shown in Figure 8. Playback tracking data flows from left to right: from the Kafka data source into Redis for staging, and finally into the disk KV for persistence.
(Figure 8: Overall scheme flow)
First, after consuming a playback tracking event from Kafka, we append the video to the user's playback history according to the user ID, and at the same time determine the corresponding time-ring node from the current time and the hash of the user ID, saving the user ID into that node's user list. Then, each shard of the distributed scheduled task fetches its playback-user data shards for the previous time-ring node, reads each user's playback records and updates them into the Bloom filter read from the disk KV, and finally serializes the Bloom filter and writes it back to the disk KV.
4. Data Migration
To migrate smoothly from the current Redis ZSet-based deduplication to Bloom-filter-based deduplication, the playback records generated before the unified deduplication service went online must be migrated, so that the user experience is not affected. We designed and tried two schemes, and after comparison and improvement arrived at the final one.
We had already implemented batch generation of Bloom filters from raw playback records and their storage in the disk KV. So the first migration plan only needed to move the historical data in the old Redis (generated before the deduplication service went online) into the new Redis, after which the scheduled tasks would take care of the rest, as shown in Figure 9. Incremental data generated after the service goes online is written by listening to the playback tracking events, and old and new data are double-written so that we can fall back when needed.
(Figure 9: Migration plan 1)
However, we overlooked two problems. First, the new Redis is only used for staging, so its capacity is much smaller than the old Redis; the data cannot be migrated in one pass and would have to be moved in many batches. Second, the storage format in the new Redis differs from the old one: besides the played video lists, the playback user lists are also needed. After consulting the DBAs, we learned that such a migration would be difficult to carry out.
Since migrating the data wholesale is troublesome, we considered not migrating it up front at all: at deduplication time, we check whether the user has been migrated; if not, we also read the user's old data for filtering and trigger a migration of that user's old data into the new Redis (including writing the playback user list). After three months the old data can simply expire, and the migration is complete, as shown in Figure 10. This scheme solves the format mismatch between the old and new Redis, avoids the capacity requirement of a one-off migration by triggering migration on user requests, and is also precise: only users active within those three months actually have their data migrated.
(Figure 10: Migration plan 2)
We carried out the migration according to the second plan, but during online testing we found that migrating the old data on a user's first request made the deduplication interface's latency high and unstable. Video deduplication is a critical step in recommendation and is latency-sensitive, so we had to keep looking for a new migration scheme. We then noticed that when the Bloom filters are generated in batches, the task reads the playback user list for a time-ring node, fetches each user's played video list by user ID, and then generates the Bloom filter and saves it to the disk KV; at that point, migrating the historical data only requires additionally reading the user's historical playback records from the old Redis. To trigger this Bloom filter generation for a user, we just need to save the user ID into the corresponding playback user list on the time ring. The final scheme is shown in Figure 11.
(Figure 11: Final migration scheme)
First, the DBAs scan the keys of the playback records in the old Redis (which contain the user IDs) and export them to files. Then we import the exported files into Kafka through the big data platform, and a consumer parses the data and writes the user IDs into the playback user list of the current time-ring node. Finally, when the distributed batch task reads a user from the playback user list, if that user has not been migrated, it also reads the historical playback records from the old Redis and updates them into the Bloom filter together with the new playback records, then saves the result to the disk KV.
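A minimal sketch of the migration hook inside the batch task, assuming a Jedis client; the key names and the migration flag are illustrative:

```java
import java.util.HashSet;
import java.util.Set;

import redis.clients.jedis.Jedis;

// Sketch of merging old-Redis history into the batch write for not-yet-migrated users.
public class MigrationAwareBatchTask {

    public Set<String> collectVideosToWrite(Jedis newRedis, Jedis oldRedis, String userId) {
        // videos staged in the new Redis for this user (assumed key pattern)
        Set<String> videoIds = new HashSet<>(newRedis.zrange("played:videos:" + userId, 0, -1));

        String migratedFlagKey = "migrated:" + userId;        // hypothetical flag key
        if (!newRedis.exists(migratedFlagKey)) {
            // first time this user is processed: merge the full history from the old Redis
            videoIds.addAll(oldRedis.zrange("old:played:" + userId, 0, -1));
            // simplified: in practice the flag would be set only after a successful write
            newRedis.set(migratedFlagKey, "1");
        }
        return videoIds; // caller updates the Bloom filter and writes it to the disk KV
    }
}
```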
5. Summary
This article introduced the design and thinking behind building a Bloom-filter-based recommendation deduplication service for short videos. Starting from the problem, we designed and refined the solution step by step, striving for simplicity and elegance, and we hope it offers some reference value to readers. Due to limited space, some aspects are not covered and many technical details are not elaborated; questions and discussion are welcome.
Author: vivo internet server team - Zhang Wei