Abstract: GaussDB (for Redis) readily handles core data storage for recommendation systems and safeguards enterprise-grade applications.

This article is shared from the Huawei Cloud Community: "GaussDB (for Redis) Demystified, Issue 13: How to Solve the Recommendation System Storage Problem?", by the Gauss Redis official blog.

1. Reflections prompted by a bad recommendation

After the Qixi Festival, a friend of the author ran into an embarrassing situation: when his girlfriend opened his shopping app, a series of recommendations popped up automatically: roses with free shipping, gifts to move her to tears, romantic night lights... Thinking back to Qixi, she had received none of these gifts. So the question arose: come clean, who did you give them to?

To help his friend rebuild trust, the author did some technical digging: this had to be the "recommendation system" misfiring.

A recommendation system is an information filtering system that quickly analyzes massive volumes of user behavior data, predicts user preferences, and makes effective recommendations. In businesses such as product recommendation and advertising, the recommendation system carries significant weight. According to Amazon's 2019 annual report, 40% of its revenue came from its stable in-house recommendation system.

In the opening example, it was the recommendation system that caused the awkward scene. The author decided to back his friend up and make the case with solid technical knowledge!

2. What does the recommendation system look like?

Generally speaking, a mature recommendation system has three key components: distributed computing, feature storage, and the recommendation algorithm. None of them can be left out.

The following walks through a complete recommendation system in which GaussDB (for Redis) stores the core feature data. This design is also one of the more mature best practices among Huawei Cloud customer cases.

Part 1: Obtain feature data

  • Raw data collection

Likes, favorites, comments, purchases... These behaviors make up the raw data. They occur continuously, so the data volume is huge. The data is passed downstream through streaming components such as Kafka and Redis Streams, or stored in a data warehouse for later extraction and use.

  • Distributed computing

Raw data is scattered and noisy, so algorithms cannot use it directly. Large-scale offline and online computation is needed to process it. Spark and Flink are typical big data computing engines, and their powerful distributed computing capabilities are indispensable to recommendation systems.

  • Feature data storage

The processed data, namely features and tags, is the valuable data source that the recommendation algorithm needs. Depending on the scenario, it may also be called a user portrait or an item portrait. This data is shared and reused repeatedly: it is used to train algorithm models and also to serve the production environment.

Reliable storage of feature data is therefore a critical part of the recommendation system.
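
As a rough illustration of how a user portrait can map onto a KV store, the sketch below lays one portrait out as a hash under a namespaced key. The key naming scheme (`feature:<entity>:<id>`) and the feature names are assumptions made for this example, not something the original system prescribes, and a plain dict stands in for the database so the code runs without a server.

```python
from typing import Dict, Mapping, Union

FeatureValue = Union[str, int, float]


def feature_key(entity: str, entity_id: Union[str, int]) -> str:
    """Build a namespaced key such as 'feature:user:42' (naming is an assumption)."""
    return f"feature:{entity}:{entity_id}"


def to_hash_fields(features: Mapping[str, FeatureValue]) -> Dict[str, str]:
    """Redis hash fields and values are strings, so stringify every feature."""
    return {name: str(value) for name, value in features.items()}


# A plain dict stands in for the KV store; each value plays the role of one hash.
store: Dict[str, Dict[str, str]] = {}

# Hypothetical user portrait produced by the computing layer.
user_portrait = {"age_bucket": "25-34", "fav_category": "flowers", "ctr_7d": 0.12}
store[feature_key("user", 42)] = to_hash_fields(user_portrait)

print(store["feature:user:42"])
```

Against a real Redis-compatible instance, the same fields could be written with a client call such as redis-py's `hset(key, mapping=fields)`.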

Part 2: Consume feature data

  • Offline model training

With the key feature data in place, the business can start training algorithm models. Only by making full use of the feature store and the latest behavioral data to keep refining the recommendation algorithm can the overall quality of the recommendation system improve, and with it the user experience.

  • Online inference prediction

Once trained, the algorithm model is deployed to the online production environment. It continues to read from the existing feature store, runs inference on the user's real-time behavior, quickly predicts the content that best matches the user, builds a recommendation list, and pushes it to the end user.
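
The read path described above can be sketched as follows. This is a minimal illustration, not the scoring method of any real system: the "model" is a simple dot product between user and item feature vectors, and the keys, feature names, and weights are all hypothetical. Dicts stand in for the feature store.

```python
from typing import Dict, List, Tuple

# In-memory stand-ins for the feature store; keys and feature names are hypothetical.
user_features: Dict[str, Dict[str, float]] = {
    "user:42": {"likes_flowers": 0.9, "likes_gadgets": 0.2},
}
item_features: Dict[str, Dict[str, float]] = {
    "item:rose": {"likes_flowers": 1.0, "likes_gadgets": 0.0},
    "item:night_light": {"likes_flowers": 0.4, "likes_gadgets": 0.6},
    "item:usb_hub": {"likes_flowers": 0.0, "likes_gadgets": 1.0},
}


def score(user: Dict[str, float], item: Dict[str, float]) -> float:
    """Dot product over shared feature names; a placeholder for a trained model."""
    return sum(weight * item.get(name, 0.0) for name, weight in user.items())


def recommend(user_key: str, top_n: int = 2) -> List[Tuple[str, float]]:
    """Fetch the user's features, score every candidate item, return the top N."""
    user = user_features[user_key]
    ranked = sorted(
        ((item_key, score(user, feats)) for item_key, feats in item_features.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


print([item_key for item_key, _ in recommend("user:42")])
```

Every recommendation request triggers feature lookups like these, which is why the latency and availability of the feature store matter so much online.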

3. The storage problem of the recommendation system

Clearly, feature data is the link that ties the whole system together. Because the KV data model maps closely onto feature data, Redis is a common component in recommendation systems.

In the system described above, the database is GaussDB (for Redis) rather than open source Redis. The reason is that open source Redis still has obvious pain points in big data scenarios:

1. Data cannot be stored reliably

A recommendation system needs more than a KV database: it needs to keep its data safely for the long term. But open source Redis is designed primarily for cache acceleration rather than data storage. It is, after all, a purely in-memory design. Even with AOF persistence enabled, data is typically flushed to disk only about once per second, so the most recent writes can be lost and storage is not fully reliable.

2. Data volume cannot scale up, and cost cannot come down

Businesses that rely on recommendations usually have a large user base, and as the business grows, more feature data must be stored. Memory commonly costs more than 10 times as much as ultra-high-speed SSD of the same capacity. So once the data reaches tens or hundreds of GB, open source Redis becomes increasingly expensive, which is why it is generally used only as a "small" cache. In addition, the fork that open source Redis performs for persistence requires memory headroom to be reserved, which lowers capacity utilization and wastes hardware resources.

3. Poor bulk-load performance

Feature data must be refreshed regularly, and large bulk-load jobs often run hourly or daily. If the storage component is not robust enough, a burst of writes can bring the database down and, with it, the whole recommendation system. This can lead to the kind of embarrassing user experience described at the beginning.

Open source Redis does not cope well with heavy writes. In a typical cluster, half of the nodes are standby nodes that can serve only read requests. When a large volume of writes arrives, the master nodes are prone to failure, which can set off a chain reaction.

In principle, a more complex architecture is not a better one. Given the choice, who would not want a single reliable storage engine that fits the KV shape of feature data while also delivering on cost, performance, and reliability?

4. Wish we had met sooner: GaussDB (for Redis)

Unlike open source Redis, GaussDB (for Redis) is built on an architecture that separates storage from compute, which brings key technical benefits to big data scenarios such as recommendation systems:

1. Reliable storage

Data is persisted at the command level and stored redundantly as three copies in the underlying storage pool, for truly zero data loss.

2. Cost reduction and efficiency enhancement

High-performance persistence technology plus fine-grained storage pools help enterprises cut database costs by more than 75%.

3. Strong write capacity

With a multi-threaded design in which every node accepts writes, write capacity is strong enough to handle both Spark bulk loads and real-time updates with ease.

As Huawei Cloud's enterprise-grade database, GaussDB (for Redis) provides stable and reliable KV storage, making it an excellent choice for the core data of a recommendation system.

5. Seamless integration: the freedom to store whatever you want

In practice, using Redis as the storage backend for Spark has become a mainstream solution, and having Flink look up dimension tables in Redis is also very common. Both have connectors for Redis. GaussDB (for Redis) is fully compatible with the Redis protocol and works out of the box, so users can create an instance and connect their services at any time.

1. Spark-Redis-Connector

Spark-Redis-Connector maps Spark RDDs and DataFrames to the String, Hash, List, Set, and other structures in a GaussDB (for Redis) instance. Users can use familiar Spark SQL syntax to access GaussDB (for Redis) and complete key tasks such as loading, updating, and extracting feature data.

Usage is simple:

1) Reading Hash, List, and Set structures into a Spark RDD takes a single line of code.
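
The snippet from the original post is not reproduced here, so the following is only a sketch of what a read could look like from PySpark. It uses the DataFrame data source of spark-redis (format name `org.apache.spark.sql.redis` with the `table` and `key.column` options) rather than the RDD helpers. The table and column names are hypothetical, and the function takes an existing `SparkSession` as a parameter, so the spark-redis package must already be on the Spark classpath.

```python
from typing import Any, Dict

# Data source name registered by the spark-redis connector.
REDIS_FORMAT = "org.apache.spark.sql.redis"


def redis_table_options(table: str, key_column: str) -> Dict[str, str]:
    """Options telling spark-redis which hashes to read and which column holds the key."""
    return {"table": table, "key.column": key_column}


def read_features(spark: Any, table: str, key_column: str) -> Any:
    """Load hashes stored under '<table>:<key>' as a DataFrame.

    `spark` is an existing SparkSession; table and column names are hypothetical.
    """
    return (
        spark.read.format(REDIS_FORMAT)
        .options(**redis_table_options(table, key_column))
        .load()
    )
```

A call such as `read_features(spark, "user_features", "user_id")` would then return a DataFrame that can be queried with Spark SQL. Connection details are normally supplied through Spark configuration keys such as `spark.redis.host`, `spark.redis.port`, and `spark.redis.auth`.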

2) When the recommendation system needs to load the database or update feature data, the write is just as simple.
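
As with the read path, the original snippet is not reproduced, so this is a hedged sketch of a write through the spark-redis DataFrame data source. Each row becomes one hash keyed by the chosen column. The table name, column name, and default save mode are assumptions for illustration.

```python
from typing import Any


def write_features(df: Any, table: str, key_column: str, mode: str = "append") -> None:
    """Persist a DataFrame of features, one Redis hash per row.

    `df` is a Spark DataFrame; `table` and `key_column` are hypothetical names.
    """
    (
        df.write.format("org.apache.spark.sql.redis")
        .option("table", table)            # key prefix for the hashes
        .option("key.column", key_column)  # column whose value completes the key
        .mode(mode)                        # "append" adds or updates rows
        .save()
    )
```

For example, `write_features(feature_df, "user_features", "user_id")` would store each user's features under a key of the form `user_features:<user_id>`. Choosing `mode="overwrite"` instead replaces the existing table contents, so it should be used with care in a periodic refresh job.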

2. Flink-Redis-Connector

Flink is as popular a computing engine as Spark, and it also has a mature way to connect to Redis. Using the connector available for Flink, or Flink together with the Jedis client, you can read from and write to Redis from Flink jobs.

Take a simple word-frequency count in Flink as an example: after Flink processes the data source, the results can be stored in GaussDB (for Redis).
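
The Flink job from the original post is not reproduced here. The sketch below is plain Python, not Flink code: it only mirrors the data flow of the example, counting words and then writing each count into a hash with `hincrby` so that repeated batches accumulate. `FakeRedis` is a hypothetical in-memory stand-in, and the hash name `word_count` is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable


class FakeRedis:
    """Minimal in-memory stand-in exposing hincrby/hgetall like a Redis client."""

    def __init__(self) -> None:
        self._hashes: Dict[str, Dict[str, int]] = {}

    def hincrby(self, name: str, key: str, amount: int = 1) -> int:
        bucket = self._hashes.setdefault(name, {})
        bucket[key] = bucket.get(key, 0) + amount
        return bucket[key]

    def hgetall(self, name: str) -> Dict[str, int]:
        return dict(self._hashes.get(name, {}))


def count_words(lines: Iterable[str]) -> Counter:
    """The 'processing' step: split each line on whitespace and count words."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def sink_counts(client: FakeRedis, counts: Counter, hash_name: str = "word_count") -> None:
    """The 'sink' step: add each word's count to one field of a hash."""
    for word, n in counts.items():
        client.hincrby(hash_name, word, n)


client = FakeRedis()
sink_counts(client, count_words(["to be or not to be", "to store or not to store"]))
print(client.hgetall("word_count"))
```

Because the sink increments rather than overwrites, processing a second batch of lines adds to the stored totals, which matches how a streaming job updates its results over time.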

6. Conclusion

Big data applications place very high demands on the storage of core data. The cloud database GaussDB (for Redis) has a cloud-native architecture that separates storage from compute. It is fully compatible with the Redis protocol and leads across the board in stability and reliability. When storing massive volumes of core data, it can also bring enterprises considerable cost savings. Looking ahead, GaussDB (for Redis) has the potential to become a new star in the next wave of big data.

7. Appendix

Author of this article: Huawei Cloud Database GaussDB (for Redis) team

Send resumes for positions in Hangzhou, Xi'an, or Shenzhen to: yuwenlong4@huawei.com

GaussDB (for Redis) product homepage:
https://www.huaweicloud.com/product/gaussdbforredis.html

For more technical articles, follow the official blog of Gauss Redis:
https://bbs.huaweicloud.com/community/usersnew/id_1614151726110813
