Abstract: In the mobile Internet era, there are a huge number of users, a large number of tags, and a huge amount of user tag data. In the user portrait system, different companies have different implementation schemes for label storage and query. The current mainstream implementation scheme adopts the ElasticSearch scheme. However, building a user profile platform based on ElasticSearch often faces problems such as insufficient flexibility, high resource overhead, and inconvenience in development without SQL interface. To this end, this article provides a method for constructing a user portrait system based on Huawei MRS ClickHouse.

This article is shared from the HUAWEI cloud community " based on MRS-ClickHouse to build a user portrait system program introduction ", author: hourongqi.

1. Business scenario

User portrait is the labeling of user information. The user portrait system conducts in-depth analysis and mining of the collected data in various dimensions, and labels different users with different tags, thereby portraying the full picture of the customer. Through the user portrait system, each user can be accurately positioned, so that it can be applied to business scenarios such as personalized recommendation and precision marketing. The user profile system has been widely adopted by various companies and is one of the important ways for big data to be implemented.

In the era of mobile Internet, there are a huge number of users, a large number of tags, and a huge amount of data for user tags. In the user portrait system, different companies have different implementation schemes for label storage and query. The current mainstream implementation scheme adopts the ElasticSearch scheme. However, building a user profile platform based on ElasticSearch often faces problems such as insufficient flexibility, high resource overhead, and inconvenience in development without SQL interface. To this end, this article provides a method for constructing a user portrait system based on Huawei MRS ClickHouse.

2. Why build a label query system based on MRS-ClickHouse

2.1 Introduction to MRS-ClickHouse

MRS-ClickHouse is a columnar database for online analysis and processing. Its core features are extreme compression rate and extremely fast query performance. MRS-ClickHouse supports SQL query, and the query performance is good, especially the performance of aggregate analysis query based on large wide table is very good, which is an order of magnitude faster than other analytical databases.
image.png

ClickHouse has the following characteristics:

complete DBMS function
ClickHouse has a complete database management function, with the basic functions of a DBMS, including DDL, DML, authority control, data backup and recovery, and distributed management.
columnar storage and data compression
ClickHouse is a database that uses columnar storage. Data is organized in columns. Data belonging to the same column will be stored together, and columns will also be stored in different files. When performing data query, columnar storage can reduce the data scanning range and the size of data transmission, and improve the efficiency of data query.
vectorized execution engine
ClickHouse uses the SIMD instructions of the CPU to implement vectorized execution. The full name of SIMD is Single Instruction Multiple Data, that is, using a single instruction to manipulate multiple pieces of data, and an implementation method to improve performance through data parallelism. Its principle is to achieve parallel operation of data at the CPU register level.
relational model and SQL query
ClickHouse completely uses SQL as the query language, and provides a standard protocol SQL query interface, so that the existing third-party analysis visualization system can be easily integrated and docked with it.
At the same time, ClickHouse uses a relational model, so the cost of migrating systems built on traditional relational databases or data warehouses to ClickHouse will become lower.
data fragmentation and distributed query
The ClickHouse cluster consists of one or more shards, and each shard corresponds to 1 service node of ClickHouse. The upper limit of the number of shards depends on the number of nodes (1 shard can only correspond to 1 service node).
ClickHouse provides the concepts of Local Table and Distributed Table. A local table is equivalent to a piece of data. The distributed table itself does not store any data, it is the access agent of the local table, and its role is similar to the middleware of the sub-database. With the help of distributed tables, it is possible to proxy access to multiple data shards, thereby realizing distributed queries.

2.2 Introduction to Bitmap Index

Bitmap is a data structure that associates certain specific values with array subscripts. In the bitmap, each element occupies 1 bit. When the bit is 1, it means that the corresponding element has the specific value. Otherwise, it means no.

For example:

ID set: [0,1,4,5,6,7,9,10,13,14]
The bitmap can be expressed as: 11001111 01100110

As shown below:
image.png

A bitmap index is a special index that uses a bitmap, and is mainly created for a large number of columns with the same value. Each bit in the position code in the bitmap indicates the presence or absence of the corresponding data row. Bitmap indexes are suitable for columns with fixed values, such as gender, marital status, administrative district, and so on. It is not suitable for columns with discrete values such as ID number and consumption amount. In the user portrait scene, each label corresponds to a large number of people. The number of tags is a limited enumeration value, which is very suitable for bitmap indexing.

For example:

Suppose there are two labels, one is label 1-holding precious metals, and the other is label 2-holding insurance. The labels owned by each cardholder are shown in the table below.

From it, we can see that the set of IDs of cardholders with tag 1-holding precious metals is: [0, 1, 4, 5, 6, 7, 9, 10, 13, 14]. The ID cluster of cardholders with label 2-who holds insurance is: [2, 3, 5, 7, 8, 11, 12, 13, 15].
image.png

When we need to query users who have this tag at the same time, based on the bitmap index, we only need to perform bit operations on the bitmaps corresponding to the two tags to get the final result. In this way, the storage space of the label data is very small, and the label calculation speed is very fast.
image.png

2.3 MRS-ClickHouse natively supports bitmap index

Before the advent of ClickHouse, if you want to apply bitmap indexes to user portrait scenes, you need to build the bitmap data structure and manage the bitmap indexes yourself, and the threshold for use is high. The good news is that MRS-ClickHouse natively provides support for bitmap data structure and location indexing, and encapsulates the construction and maintenance of bitmaps inside ClickHouse. It becomes very simple for users to build a bitmap index based on ClickHouse.

ClickHouse bitmap constructor:
image.png

ClickHouse position operation function:
image.png

ClickHouse bitmap operation result acquisition function:
image.png

In summary, why did you choose to build a label query system based on ClickHouse?

  1. ClickHouse query speed is fast, the fastest can reach sub-second response;
  2. ClickHouse has a built-in bitmap data structure to facilitate the construction of bitmap indexes and improve label query performance;
  3. Based on the JDBC/SQL interface, the development is easier;
  4. Based on the MPP architecture, it can be scaled horizontally;

3. How to build a label query system based on MRS-ClickHouse

Create an original label table in ClickHouse, and import the original label data into it. Then build a label bit chart based on the original label table, and create a corresponding distributed table. The upper label query application is based on the label bitmap (distributed table) for label query.

The process is shown in the figure below:
image.png

The detailed process is described below.

Step 1: Create label original table, import label original data

First, create a label original table to save the original data of the label. The label result data calculated by the upstream system is written into this table. This table is a local table. The statement to build a table is as follows:

CREATE TABLE IF NOT EXISTS tbl_tag_src ON CLUSTER default_cluster(
    tagname String,   --标签名称
    tagvalue String,  --标签值
    userid UInt64 
)ENGINE = ReplicatedMergeTree('/clickhouse/default/tables/{shard}/tbl_tag_src ','{replica}')
PARTITION BY tagname
ORDER BY tagvalue;

Then create a distributed table:

CREATE TABLE IF NOT EXISTS default.tbl_tag_src_all ON CLUSTER default_cluster 
AS tbl_tag_src 
ENGINE = Distributed(default_cluster, default, tbl_tag_src, rand());

The data preview is as follows:
image.png

Step 2: Create a label bitmap and build a label bitmap

To create a label bit chart, first create a local table. The local table is used to save the label bitmap data. The creation statement is as follows:

-- 创建位图表,先创建本地表
CREATE TABLE IF NOT EXISTS tbl_tag_bitmap ON CLUSTER default_cluster
(
    tagname String,   --标签名称
    tagvalue String,  --标签值
    tagbitmap AggregateFunction(groupBitmap, UInt64 )  --userid集合
)
ENGINE = ReplicatedAggregatingMergeTree('/clickhouse/default/tables/{shard}/ tbl_tag_bitmap ','{replica}')
PARTITION BY tagname
ORDER BY (tagname, tagvalue)
SETTINGS index_granularity = 128;

Then create the corresponding distributed table. Distributed tables are used for upper-level applications to query tags. The statement to build a table is as follows:

CREATE TABLE IF NOT EXISTS default.tbl_tag_bitmap_all ON CLUSTER default_cluster
(
    tagname String,   --标签名称
    tagvalue String,  --标签值
    tagbitmap AggregateFunction(groupBitmap, UInt64 )  --userid集合
)
ENGINE = Distributed(default_cluster, default, tbl_tag_bitmap, rand());

Import the data of the original label table into the label bit chart. And during the import process, use the groupBitmapState() function to build a bitmap. The SQL statement is as follows:

-- 导入数据, 将同一个标签的所有userid使用groupBitmapState函数合并成一个bitmap
INSERT INTO tbl_tag_bitmap_all
SELECT tagname,tagvalue,groupBitmapState(userid)
FROM tbl_tag_src_all
GROUP BY tagname,tagvalue;

Step 3: Quickly retrieve tags based on distributed tables

Query a list of userids who hold precious metal products and whose gender is male:

WITH
    (
        SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = '持有产品' AND tagvalue = '贵金属' LIMIT 1
    ) AS bitmap1,
    (
        SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = '性别' AND tagvalue = '男' LIMIT 1
    ) AS bitmap2
SELECT bitmapToArray(bitmapAnd(bitmap1, bitmap2)) AS res

Separately count the total number of male and female customers with insurance:

---- 查询持有保险的客户中,男性人数:
WITH
    (
        SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = '持有产品' AND tagvalue = '保险' LIMIT 1
    ) AS bitmap1,
    (
        SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = '性别' AND tagvalue = '男' LIMIT 1
    ) AS bitmap2
SELECT bitmapCardinality(bitmapAnd(bitmap1, bitmap2)) AS res

---- 查询持有保险的客户中,女性人数:
WITH
    (
        SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = '持有产品' AND tagvalue = '保险' LIMIT 1
    ) AS bitmap1,
    (
        SELECT tagbitmap FROM tbl_tag_bitmap_all WHERE tagname = '性别' AND tagvalue = '女' LIMIT 1
    ) AS bitmap2
SELECT bitmapCardinality(bitmapAnd(bitmap1, bitmap2)) AS res

4. Summary

In view of the massive tag query in the user portrait scene, the traditional scheme has problems such as insufficient flexibility, high resource consumption, lack of SQL interface development, and so on. Based on Huawei MRS-ClickHouse, it is very convenient to construct a bitmap index and realize real-time retrieval of massive label data. MRS-ClickHouse greatly reduces development costs, responds to tag queries faster, and makes precision marketing more convenient.

Huawei Cloud FusionInsight MRS cloud-native data lake has been widely used in government, finance, operators, large enterprises, Internet and other industries, working with 800+ partners to serve 3000+ government and enterprise customers in 60+ countries and regions around the world.

Click to follow to learn about Huawei Cloud's fresh technology for the first time~


华为云开发者联盟
1.4k 声望1.8k 粉丝

生于云,长于云,让开发者成为决定性力量