Author: Jimmy
With the advent of the Marketing 3.0 era, companies increasingly need powerful CDP capabilities to break down severe data silos, warm up leads, and activate customers. But what is a CDP, and what are the key characteristics of a good one? While answering these questions, this article describes in detail the construction of Aifanfan's tenant-level real-time CDP, covering component selection under the goal of an advanced architecture, an introduction to the platform architecture, and the key implementations of its core modules.
1. What is CDP
1.1 Origin of CDP
CDP (Customer Data Platform) is a concept that has become popular in recent years. As the times and the broader environment have changed, owned media channels have multiplied, customer management and marketing have become harder, and the problem of data silos has grown more and more serious. CDP was born to help market to customers better. Looking back along the timeline, two stages preceded its emergence:
- In the CRM era, businesses interacted with existing and potential customers via phone calls, text messages, and emails, and performed data analytics to help drive retention and sales;
- In the DMP stage, companies placed advertisements and ran media promotion campaigns on the major Internet platforms.
CRM, DMP, and CDP have different core functions, but comparing them side by side makes CDP easier to understand: the three differ greatly in data attributes, data storage, and data usage.
There are a few key differences as follows:
CRM vs CDP
- Account management: CRM focuses on sales records and management; CDP focuses more on marketing.
- Contacts: CRM reaches customers mainly by telephone, QQ, and email; CDP also covers user accounts on the tenant's own media (for example, the company's own website, app, official account, and applet).
DMP vs CDP
- Data type: DMP is mainly anonymous data; CDP is mainly real-name data.
- Data storage: DMP data is only short-term storage; CDP data is long-term storage.
1.2 Definition of CDP
In 2013, MarTech analyst David Raab first proposed the concept of CDP, and the CDP Institute he later founded gave the authoritative definition: packaged software that creates a persistent, unified customer database that is accessible to other systems.
The definition has three parts:
- Packaged software: deployed on the enterprise's own resources, installed and upgraded as a unified software package, with no custom development required.
- Persistent, unified customer database: extracts data from the enterprise's many types of business systems, forms a unified customer view based on identifiers in the data, stores it long-term, and supports personalized marketing based on customer behavior.
- Accessible to other systems: other systems can use CDP data to analyze and manage customers, and can take restructured, processed customer data away in various forms.
1.3 Classification of CDPs
The C (Customer) in CDP covers all customer-related functions, not just marketing. Different scenarios correspond to different types of CDPs; the types differ mainly in functional scope, and there is a progressive relationship between the categories.
Mainly divided into four categories:
- Data CDPs: mainly customer data management, including multi-source data collection, identity recognition, unified customer storage, access control, etc.
- Analytics CDPs: in addition to the capabilities of Data CDPs, they include customer segmentation, sometimes extending to machine learning, predictive modeling, revenue attribution, and more.
- Campaign CDPs: in addition to the functions of Analytics CDPs, they include cross-channel customer strategies (Customer Treatments), such as personalized marketing, content recommendation, and other real-time interactions.
- Delivery CDPs: in addition to the functions of Campaign CDPs, they include Message Delivery, such as email, on-site messages, in-app messages, and advertisements.
Compared with Analytics CDPs, Campaign CDPs and Delivery CDPs offer more functions and are closer to what is called MA (Marketing Automation) in China. In functional scope, the CDP described in this article is an Analytics CDP. Aifanfan also has a dedicated MA system, to which this CDP provides data support.
2. Challenges and goals
2.1 Challenges
With the advent of the Marketing 3.0 era, in Aifanfan's private-domain products the CDP mainly provides enterprises with online and offline data management, and lets them use refined customer segments to run various growth activities (such as automated marketing, holiday promotion notifications, birthday messages, and live events). More importantly, enterprises can conduct more personalized, accurate, and timely secondary real-time marketing based on genuinely real-time user behavior, helping them warm up leads, activate customers, and improve the conversion of private-domain marketing. Building a real-time CDP (Real-Time CDP, abbreviated RT-CDP) good enough to drive the upper-level marketing business faces many challenges.
2.1.1 Business level
Enterprises have many data channels and different data forms
Besides its official website, files, apps, and in-house systems, an enterprise today has many owned-media channels (such as WeChat official accounts, Douyin enterprise accounts, Baijiahao accounts, and various mini programs), and the data structures of these scenarios are not unified. How can enterprise data be connected to RT-CDP efficiently? This is a systemic problem that thousands of business owners urgently need solved on the subject of customer data fusion.
Different ecosystems cannot be connected, and users cannot gain 360-degree insights
Scattered data makes it difficult to identify a unique user identity, so a comprehensive, continuously updated user portrait cannot be established, leaving knowledge of users fragmented and one-sided and insight insufficient. For example, in an actual marketing scenario, an enterprise may want to issue a coupon to a user who visits both its official website and its applet; but because one person's behavior is scattered across channels under different identifiers, cross-channel behavior analysis is impossible and the enterprise's demand cannot be met.
Customer grouping rules are complicated
Different companies have different businesses, so customers need personalized labels matched to business characteristics. For example, when running marketing activities, companies want to label users who have passed through particular journey nodes, participated in a live broadcast, and so on, with different labels for different scenarios, so that user groups can be subdivided and marketing refined.
How to serve both B2B2C and B2C enterprises well with one platform, with little industry experience to draw on
Aifanfan's customers span many industries; some are B2C and some are B2B2C. Compared with B2C, the complexity of B2B2C business scenarios increases exponentially. While managing both B and C portraits, we must also accommodate the logic of upper-level services, such as identity fusion strategies and behavior-based segment selection. In addition, many business scenarios have unclear business boundaries.
2.1.2 Technical level
High requirements for real-time and accurate identification of omni-channel
Today a customer's behavior is cross-source, cross-device, and cross-media, and behavior trajectories are severely fragmented. Accurately identifying customers in real time and connecting their behavior trajectories is an important prerequisite for good marketing results. Achieving high-performance real-time identification across multiple sources and identities is a big challenge.
Requires the ability to process massive amounts of data in real time and with low latency
Customers now have more choices, and their intent is less clear. Real-time marketing based on customer behavior, and real-time secondary interaction based on customer feedback, are the keys to improving marketing results. For example, when a company's marketing department sends a bulk SMS, whether a customer clicks it, and what further actions follow a click, indicate different degrees of intent. Marketing and sales personnel need to follow up promptly according to customer actions; only by grasping these changes in real time can the conversion of marketing activities be driven efficiently. How can massive data drive the business in real time?
Need a scalable architecture
In a multi-tenant context, Aifanfan manages massive data for thousands of small and medium-sized enterprises. As the number of companies served grows, rapidly and continuously improving the platform's service capability requires an advanced technical architecture. Achieving high performance, low latency, scalability, and high fault tolerance at the same time is also a big technical challenge.
How to balance multi-tenant features and performance
Aifanfan's private-domain products serve small and medium-sized enterprises as a SaaS service, so a CDP with multi-tenant characteristics is a basic capability. Although an SME's customers generally number from 100,000 to one million, as its marketing activities accumulate its data volume grows linearly, and for medium and large enterprises, customer scale means data grows even faster. In addition, it is hard to pre-warm query models for every enterprise's different query dimensions. Under these premises, balancing scalability and service performance is a difficult problem.
Diverse Deployment Scalability
The CDP currently serves small and medium-sized enterprises mainly as a SaaS service, but supporting OP deployment (On-Premise, localized deployment) for major customers in the future is not ruled out. How should components be selected to support both modes of service?
2.2 RT-CDP construction goals
2.2.1 Key business capabilities
After analysis and business abstraction, we believe a really good RT-CDP needs the following key characteristics:
- Flexible data docking: the ability to connect customer systems with various data structures and many types of data sources, with data onboarding possible at any time.
- Simultaneous support for B2C and B2B data models: customers in different industries are served by one set of services.
- Unified user and enterprise portraits: including attributes, behaviors, tags (static, dynamic/rule-based, and predictive), intelligent scoring, preference models, and more.
- Real-time omni-channel identity recognition and management: breaking data silos and connecting multi-channel identities is the key to a unified user view and the premise of cross-channel user marketing.
- Powerful user segmentation (user grouping): enterprises can segment users with multi-dimensional, multi-window combinations of attributes, behaviors, identities, labels, etc., for precise user marketing.
- Real-time user interaction and activation: In the face of rapid changes in user habits, real-time perception of user behavior for real-time automated marketing is particularly important.
- Secure user data management: long-term and secure data storage is the basic requirement of a data management platform.
2.2.2 Advanced Technology Architecture
While clarifying the business goals of the platform, an advanced technical architecture is also a goal of platform construction. For the platform architecture, we have the following core goals:
1. Streaming data driven
In traditional databases and data processing, the pattern is mainly "data passive, query active": data rests in the database until a user issues a query, and even when the data changes, the user must re-issue the same query to get updated results. But as data volume grows and the demand for timely perception of data changes rises, this pattern can no longer support the way we need to interact with data.
Modern system designs, as shown in the figure below, lean toward architectures that actively drive other systems, such as domain-event-driven business. The same holds for data processing: "active data, passive query".
For example, when an enterprise wants to find users who have visited its applet in order to send them SMS messages, how do the two approaches differ?
- Traditional method: user data is first stored in the storage engine; before the enterprise sends SMS messages, the query conditions are converted into SQL, and qualified users are then filtered out of the massive data.
- Modern method: as user data flows into the data system, the user portrait is enriched, and the enriched portrait is judged against the enterprise's query conditions. This is merely a rule judgment on a single user's data rather than filtering over massive data (see the sketch after this list).
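To make the contrast concrete, here is a minimal, framework-free sketch of the "active data, passive query" idea; the event type and rule are hypothetical, and in the real platform this judgment runs inside streaming jobs rather than plain Java.

```java
import java.util.function.Predicate;

public class ActiveDataSketch {

    // Hypothetical enriched event: one user's behavior after portrait enrichment.
    record UserEvent(String userId, String channel, String eventType) {}

    public static void main(String[] args) {
        // Rule: "users who visited the enterprise applet" -- evaluated per event
        // on arrival, instead of a SQL scan over the whole user table at send time.
        Predicate<UserEvent> rule =
                e -> "applet".equals(e.channel()) && "visit".equals(e.eventType());

        UserEvent incoming = new UserEvent("u-1001", "applet", "visit");
        if (rule.test(incoming)) {
            // In the real pipeline this would notify the downstream SMS/MA module.
            System.out.println("user " + incoming.userId() + " matches the rule");
        }
    }
}
```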
2. Stream computing processing
Traditional data processing is mostly offline computing and batch computing. Offline computing is "data at rest, query in motion"; batch computing accumulates data to a certain volume and then processes it with specific logic. Although the two process data differently, both are essentially batch processing, so delay is inherent.
Stream computing removes the concept of batches entirely and processes streaming data in real time, that is, continuous computation over unbounded, dynamic data, achieving millisecond-level latency. In today's fiercely competitive era of massive data, this matters greatly for enterprise insight: the faster data is mined, the higher its business value.
3. Integrated practice
Batch-stream unification
In the field of big data processing there are several typical architectures (Lambda, Kappa, Kappa+). The Lambda architecture uses separate stacks for batch computing and real-time computing, so the same logic is sometimes developed as two sets of code, which easily produces inconsistent data indicators and brings maintenance difficulties. The Kappa and Kappa+ architectures simplify the distributed computing architecture by taking real-time event processing as the core while accommodating both batch and streaming scenarios. In most companies' actual production architectures, a mixture of the two is still common, because a thoroughly real-time architecture faces many difficulties, such as data storage and the large-window aggregations that are easier to handle in batch computing.
Unified programming
The unified data processing programming paradigm is an important choice for two reasons:
- In actual business scenarios, batch and stream processing coexist, and as distributed data processing develops, new distributed processing frameworks will keep being introduced;
- Across companies, using multiple computing frameworks side by side is still common.
A unified paradigm improves programming flexibility, supports job development for both batch and streaming scenarios, and lets one set of processing programs execute on any computing framework, which also makes it easier for the platform to switch to a better computing engine later (see the sketch below).
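As a hedged illustration, the following minimal Beam pipeline (Java SDK) counts elements and formats the result; the same code runs on Flink, Spark, or the local DirectRunner depending only on the --runner flag passed at launch. The input data is a placeholder.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamPortabilitySketch {
    public static void main(String[] args) {
        // The runner is chosen at launch, e.g. --runner=FlinkRunner,
        // so one set of job code executes on any supported engine.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(Create.of("view", "click", "view"))        // placeholder events
         .apply(Count.perElement())                        // batch or stream, same API
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()));

        p.run().waitUntilFinish();
    }
}
```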
4. Scalability as a prerequisite
This mainly refers to the scalability of the architecture. A scalable architecture can reasonably control resource costs while stabilizing the service business, so as to sustainably support the rapid development of the business.
Separation of computing and storage
In today's era of massive data, some scenarios need only high processing power while others need only massive storage. The traditional storage-compute-integrated architecture requires high-spec service nodes (many cores, large memory, high-performance local disks, etc.) to satisfy both, which clearly wastes resources and harms cluster stability, for example through node overload, and scattering data across such nodes also weakens data consistency. The storage-compute-separated architecture follows the distributed design philosophy: computing resources and storage resources are controlled separately per business scenario, allocating resources reasonably, and it also helps guarantee cluster data consistency, reliability, scalability, and stability.
Dynamic scaling
The main purpose of dynamic scaling is to improve resource utilization and reduce enterprise costs. In practice, the platform sometimes needs to ride out short-term peaks and troughs of traffic (real-time message volume) during otherwise stable periods; for example, on major holidays many enterprises run marketing activities at the same time and message volume surges. At other times, as the number of enterprises served by Aifanfan keeps growing, message volume increases linearly, requiring long-term expansion. The former is hard to predict and carries high operation and maintenance costs if handled manually. Therefore, a cluster resource management capability that can dynamically scale out or in based on combined rules such as time and load is also an important consideration in architecture construction.
3. Technical selection
There is no one-size-fits-all framework, only suitable trade-offs, so selection must be based on one's own business characteristics and architectural goals. Combined with the RT-CDP construction goals, we researched and settled on components for the following core scenarios.
3.1 A new attempt to store identity relationships
Cross-channel identity mapping (ID Mapping) in a CDP is the core of the data pipeline, and it demands consistent, real-time, high-performance handling of data.
How does traditional ID Mapping work?
1. Storing identity relationships in a relational database, generally as multiple tables and multiple rows. This scheme has two problems:
- Data concurrency is high and real-time write capability is limited;
- Identity recognition generally requires multi-hop relational queries; finding the expected data in a relational database takes multiple joins, so query performance is very low.
2. Using Spark GraphX for periodic computation: user behaviors are stored in graph storage or Hive, Spark periodically loads the identity information from user behaviors into memory in one pass, and GraphX computes user connectivity from the cross-relationships. This scheme also has two problems:
- It is not real-time; such scenarios were mostly offline aggregation with periodic actions toward users;
- As the data volume grows, computation takes longer and longer, and the latency of the results grows with it.
How do we do it?
With the development of graph technology in recent years, more and more business problems are being solved with graphs. The product capabilities and ecosystem integration of open-source graph frameworks have matured, and their communities have become increasingly active. We therefore model identity relationships as a graph and use the graph's natural multi-hop query capability for real-time identity judgment and fusion.
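For illustration, a real-time multi-hop identity lookup against a graph might look like the sketch below, using the Nebula Java client. The space name cdp_identity, the same_as edge type, and the nGQL statement are assumptions for this example (exact syntax differs across Nebula versions), not our production schema.

```java
import com.vesoft.nebula.client.graph.NebulaPoolConfig;
import com.vesoft.nebula.client.graph.data.HostAddress;
import com.vesoft.nebula.client.graph.data.ResultSet;
import com.vesoft.nebula.client.graph.net.NebulaPool;
import com.vesoft.nebula.client.graph.net.Session;
import java.util.Collections;

public class IdentityLookupSketch {
    public static void main(String[] args) throws Exception {
        NebulaPool pool = new NebulaPool();
        pool.init(Collections.singletonList(new HostAddress("127.0.0.1", 9669)),
                  new NebulaPoolConfig());
        Session session = pool.getSession("root", "nebula", false);
        session.execute("USE cdp_identity;");

        // Walk up to three hops of "same_as" edges from one known identifier,
        // collecting every identity connected to the same person.
        ResultSet rs = session.execute(
                "GO 1 TO 3 STEPS FROM \"wx_openid_abc\" OVER same_as BIDIRECT "
              + "YIELD dst(edge) AS identity;");
        System.out.println(rs);

        session.release();
        pool.close();
    }
}
```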
Graph framework comparison
Focused research can also start from the latest graph database rankings, and there are more and more comparisons of the mainstream graph databases to consult. Among distributed, open-source graph databases are HugeGraph, DGraph, and Nebula. We have mainly used DGraph and Nebula in production: because Aifanfan services are built cloud-native, DGraph was chosen in the early stage of platform construction, but we later found its horizontal scaling limited and had to migrate from DGraph to Nebula.
There are very few comparisons of DGraph and Nebula online; here is a brief summary of the differences:
- Cluster architecture: DGraph integrates computing and storage; its storage is BadgerDB, transparent to the outside. Nebula separates reads and writes and uses RocksDB storage by default (unless the storage engine is replaced from source, which some companies do), so there is a read/write amplification problem.
- Data sharding: DGraph shards by predicate (roughly, a point type), which easily produces data hotspots; to support multi-tenancy, tenant-granularity predicates must be created dynamically so data distributes as evenly as possible (DGraph Enterprise Edition adds more multi-tenant features, but it is paid and still does not address hotspots). Nebula shards by edge and partitions by vid, so there are no hotspot issues, but the number of partitions must be budgeted when creating a graph space, as it is hard to change afterwards.
- Full-text search: DGraph supports it; Nebula provides a listener that syncs to ES.
- Query syntax: DGraph has its own query syntax; Nebula has its own syntax and also supports Cypher (Neo4j's graph query language), which fits graph logic better.
- Transaction support: DGraph supports transactions based on MVCC; Nebula essentially does not, with edge-write transactions supported only in the latest version (2.6.1).
- Synchronous writing: DGraph and Nebula both support asynchronous and synchronous writing.
- Cluster stability: The DGraph cluster is more stable; the stability of Nebula needs to be improved, and there are occasional crashes under certain operations.
- Ecosystem: DGraph's ecosystem integration is more mature, for example with cloud native; Nebula's ecosystem integration is more diverse, e.g. nebula-flink-connector and nebula-spark-connector, though the maturity of the various integrations still has room to improve.
3.2 Stream Computing Engine Selection
For the comparison of mainstream computing frameworks, such as Apache Flink, Spark Streaming, Storm, there is a lot of information on the Internet, please do your own research.
Choose Apache Flink as the stream batch computing engine
Apache Flink is an open-source platform for distributed stream and batch data processing that has developed rapidly in recent years. It is the distributed computing framework that best fits the Dataflow model, delivering high-performance stream computing with good fault tolerance, state management, and high availability, and its integrations with other components are increasingly mature. So we chose Apache Flink as our stream-batch computing engine.
Choose Apache Beam as the programming framework
As distributed data processing technology keeps developing, excellent distributed processing frameworks will keep emerging. Apache Beam is an incubator project Google contributed to the Apache Foundation in 2016. Its goal is to unify the programming paradigms of batch and stream processing so that a data processing program an enterprise develops can execute on any distributed computing engine. While unifying the paradigm, Beam also provides strong extensibility and supports new versions of computing frameworks promptly. So we chose Apache Beam as our programming framework.
3.3 Mass Storage Engine Selection
Among the storage components of the Hadoop ecosystem, HDFS generally supports high-throughput batch scenarios and HBase supports low-latency scenarios requiring random reads and writes, but it is difficult for a single component to provide both capabilities. In addition, how data is updated in real time under stream computing also affects the choice of storage component. Apache Kudu is Cloudera's open-source columnar storage engine and a typical HTAP (hybrid transactional/analytical processing) system; TiDB and OceanBase are also exploring the HTAP direction, though the scenarios each first focused on differ, and you can compare them yourself. Apache Kudu's vision is fast analytics on fast and changing data, as the following figure of Kudu's positioning shows:
Combined with our platform construction philosophy, real-time, high-throughput data storage and update is the core goal, while the QPS of complex queries from data applications is not high (the core business scenario is real-time customer processing over real-time streams). Since Cloudera Impala integrates seamlessly with Kudu, we finally settled on Impala + Kudu as the platform's data storage and query engine.
Analytics Enhancement: Doris
With Impala + Kudu, supporting OP deployment is no problem at all, because a single enterprise's data volume and query QPS are limited: the enterprise needs only a very simple architecture to support its data management needs, which improves platform stability and reliability and reduces its operations and resource costs. However, Impala's concurrency is limited (Impala 4.0 did introduce multi-threading, which improved concurrent processing considerably), while Aifanfan's private-domain business currently focuses on SaaS, where we want high-concurrency, millisecond-level data analysis; that is hard for this architecture to deliver, so we introduced Doris as the engine for analysis scenarios. We chose Doris because it is an OLAP engine based on an MPP architecture. Compared with open-source analysis engines such as Druid and ClickHouse, Doris has the following characteristics (see the connection sketch after this list):
- Supports multiple data models: Aggregate, Unique, and Duplicate;
- Supports Rollup and materialized views;
- Query performance on both single tables and multi-table joins is very good;
- Supports the MySQL protocol, so access and learning costs are low;
- Does not need to integrate the Hadoop ecosystem, so cluster operation and maintenance costs are much lower.
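As a small illustration of the MySQL-protocol point above, the stock MySQL JDBC driver can query Doris directly; the host, port, credentials, and table below are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DorisJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Doris speaks the MySQL protocol, so no Doris-specific driver is needed.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://doris-fe:9030/cdp_analytics", "analyst", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT tenant_id, COUNT(*) AS events "
                   + "FROM behavior GROUP BY tenant_id LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```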
3.4 Rule Engine Research
The real-time rule engine is mainly used for customer grouping. Drawing on Meituan's published rule-engine comparison, the characteristics of several engines (along with others such as URule and Easy Rules) are as follows:
In RT-CDP, customer grouping rules have many categories and combinations, complex rule computation, many operators, and large time-window spans, or even no window at all. No open-source rule engine in the industry meets these business needs well, so we chose to build our own.
4. Platform Architecture
4.1 Overall Architecture
Aifanfan's private-domain products have two main parts: RT-CDP and MA. Together they roughly cover the functional scope of Delivery CDPs; the RT-CDP described in this article is equivalent to Analytics CDPs, in short mainly covering customer data management and data analysis insight.
RT-CDP itself comprises five parts: data sources, data collection, real-time data warehouse, data application, and common components. Apart from the common components, which provide horizontal support, the other four parts are the standard four stages from data onboarding to data application:
- Data sources: sources here include not only a customer's own private data but also owned-media data in the various ecosystems, such as WeChat official accounts, WeChat mini programs, WeCom (Qiwei) leads, Baidu mini programs, Douyin enterprise accounts, and third-party ecosystem behavioral data.
- Data collection: most small and medium-sized enterprises have weak or no R&D capability, so helping them connect their own systems to Aifanfan RT-CDP quickly is the key consideration at this layer. For this we encapsulated a general collection SDK that lowers enterprises' collection costs and is compatible with popular front-end frameworks such as uni-app. In addition, because data sources are varied and their data structures differ, we built a unified collection service to simplify the continual onboarding of new data sources; it manages the growing set of data channels and handles processing such as data encryption and decryption, cleaning, and conversion. Its purpose is to provide flexible data access and reduce data docking costs.
- Real-time computing and storage: after collection, cross-channel identity recognition is performed and the data is converted into a structured, unified customer portrait. As far as data management is concerned, this layer also keeps the fragmented customer data that enterprises feed into the CDP for subsequent customer analysis. After this layer, a cross-channel customer identity relationship graph and unified portrait are formed, and a unified view serves as the data interface for the layers above. This layer also carries the routine data warehouse functions: data quality, resource management, job management, data security, and so on.
- Data application: this layer mainly provides enterprises with product functions for customer management and analysis insight, such as rich prospect portraits, customer groups built from freely combined rules, and flexible customer analysis. Multiple data output methods are also provided for other systems to use.
- Common components: the RT-CDP service relies on Aifanfan's advanced infrastructure, manages services with cloud-native concepts, and uses Aifanfan's log platform and distributed tracing for service operation, maintenance, and monitoring. The rapid iteration of CDP capabilities also rests on complete CI/CD capabilities; from development to deployment, everything runs under an agile mechanism of continuous integration and continuous delivery.
4.2 Core modules
Put simply, RT-CDP collects multi-channel data in real time and on schedule, performs identity recognition on the data, applies data processing and mapping (such as dimension joins, data aggregation, and data layering), persists the results in structured form, and finally outputs them to the outside world in real time.
RT-CDP comprises six modules: collection service, Connectors, Identity Service, real-time computing, unified portrait, and real-time rule engine. The figure above depicts the interaction between the core modules from the perspective of data interaction form and data flow. Left to right is the main data direction, from data entering the platform to data output to the external systems the platform interacts with; in the upper middle are the bidirectional interactions between real-time computing and Identity Service, and between the real-time rule engine and the unified portrait.
The following describes the functions of each core module in combination with the data processing stage:
1. Data source & collection
Data interaction between data sources and RT-CDP falls into two patterns: real-time inflow and batch pull. For these two scenarios we abstract two modules: the real-time collection service and Connectors.
- Real-time collection service: this module connects the enterprise's existing owned-media data sources, domain events from Aifanfan's business systems, and the third-party platforms Aifanfan cooperates with. The main problems at this layer are the differing API protocols of the media platforms, filling in business parameters when chaining scenario behaviors together, and an ever-growing set of user events. We abstract a data Processor and custom Processor Plugins in this module to reduce manual intervention for new scenarios.
- Connectors: this module connects the data sources of the enterprise's own business systems, such as MySQL, Oracle, PG, and other business databases. This data does not need real-time access; scheduled batch pulls suffice. What must be solved here is support for many different data source types, so we abstract the Connector with extension capabilities, plus general scheduling to support it. The two scenarios share one problem: how to read and onboard data with all kinds of structures quickly? For this we abstract the data definition model (Schema), described in detail later.
2. Data processing
- Identity Service: this module provides cross-channel customer identification, a precise ID Mapping applied in real time to customer data entering RT-CDP. It persists the customer identity relationship graph in Nebula, updates Nebula synchronously in real time according to incoming data and the identity fusion strategy, and fills the identification results back into the real-time message. Data entering the CDP continues onward only after Identity Service has identified it; this service determines whether customer interactions in marketing journeys meet expectations, and also sets the throughput ceiling of RT-CDP.
- Real-time computing: this module includes all batch and stream jobs for data processing, transformation, and distribution. Job development is currently abstracted over Apache Beam, and we try to run all batch and stream jobs on Flink; some operation and maintenance jobs still use Spark and will be gradually removed.
- Unified portrait: this module mainly persists the large volume of prospect portraits. Hot data is stored in Kudu, while warm and cold time-series data is periodically moved to Parquet. Prospect portraits include customer attributes, behaviors, tags, customer groups, and aggregated customer extension data. Although tags and customer groups are separate aggregate roots, they share a consistent storage mechanism at the storage level. A standard RT-CDP should also manage fragmented customer data, so how the unified portrait interacts with data-lake data is a focus of subsequent construction.
- Unified query service: in RT-CDP, customer data is scattered across the graph database, Kudu, the enhanced analysis engine, and the data lake, but users see only business objects such as attributes, behaviors, tags, and customer groups. To keep usage transparent at the product level, we built this unified query service with unified views and cross-source queries; it supports storage engines such as Impala, Doris, MySQL, Presto, and ES, as well as cross-source access to APIs.
- Real-time rule engine: this module provides real-time rule judgment based on Flink to support rule-driven business scenarios such as segment selection, rule-based static tagging, and rule labels.
3. Data output
Data output is supported in many forms, including OpenAPI, Webhook, and message subscription. On the one hand, this makes it convenient for enterprises to obtain prospects' real-time behavior after CDP integration and run full-chain user management together with their own downstream business systems; on the other hand, it provides the real-time behavior stream that drives the marketing loop of the upper-level MA. One special note: MA's journey nodes also require many real-time rule judgments with various calibers, some of which are hard to implement in memory on the nodes, so RT-CDP also outputs real-time judgment results for MA.
4.3 Key Implementations
4.3.1 Data Definition Model
Why do you need Schema?
As mentioned above, the data structures of an enterprise's many channels differ, and different tenants have different business characteristics and need customizable data. For both problems, RT-CDP needs the ability to define data structures flexibly in order to connect enterprise data.
In addition, RT-CDP itself manages two types of data: fragmented customer data and unified user portraits. For the former, data-lake technology can give enterprises storage, query, and analysis capabilities without caring about the content of the data itself, i.e. schemaless data management. The latter requires structured data management, because queries, grouping, and analysis must combine different dimensions. Could the latter also be served schemalessly? Enumerating the create, read, update, and delete scenarios shows the limitations and disproves it.
What is a Schema?
A Schema is a description of a data structure. Schemas can reference each other and constrain the fields, field types, and values in the data, and fields can be customized. With this unified specification, enterprises can onboard and flexibly manage their own data quickly. For example, an enterprise can abstract its business entities and attributes according to its industry characteristics and define a schema for each business entity; information shared across business entities can be extracted into a new schema that multiple schemas then reference; and each schema can carry the enterprise's own custom business fields. The enterprise only needs to submit data conforming to the corresponding schema structure, and can then use the data under a uniform standard.
The following entities illustrate the characteristics of a Schema (a code sketch follows the list):
- Field: Field is the most basic data unit and the smallest granular element that composes Schema.
- Schema: a collection of fields and schema references. A schema can contain multiple Fields, and fields can be customized (name, type, value list, etc.); it can also reference one or more other schemas, and references can be carried as arrays. For example, a Schema can contain data for multiple Identity structures.
- Behavior: the various behaviors of prospects or enterprises, also carried by Schemas. Different Behaviors can define their own unique Fields.
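Here is a minimal sketch of these entities in Java; the shapes and field names are illustrative, not the platform's actual definitions.

```java
import java.util.List;

public class SchemaModelSketch {

    // Field: the smallest unit, with a name, a type, and optional value constraints.
    record Field(String name, String type, List<String> allowedValues) {}

    // Schema: a collection of fields plus references to other schemas.
    record Schema(String name, List<Field> fields, List<Schema> references) {}

    public static void main(String[] args) {
        Schema identity = new Schema("Identity",
                List.of(new Field("idType", "string", List.of("phone", "wx_openid")),
                        new Field("idValue", "string", List.of())),
                List.of());

        // A Profile schema reuses Identity; a tenant could add custom fields here.
        Schema profile = new Schema("Profile",
                List.of(new Field("name", "string", List.of()),
                        new Field("industry", "string", List.of())),
                List.of(identity));

        System.out.println(profile);
    }
}
```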
As shown in the figure above, after industry abstraction, Aifanfan RT-CDP has many industry-generic schemas built in, including common Identity, Profile, Behavior, and other schema types. In the unified prospect portrait managed by Aifanfan RT-CDP, Identity, Profile, Tag, Segment, etc. are all business aggregate roots, and to support both the B and C data models, some B-granularity aggregate roots exist.
How does Schema simplify data access?
Here we need the concept of a Dataset: a data set whose structure is defined by a Schema. Enterprises define different datasets for different data sources; in data source management, imported data is structured according to its dataset. One dataset can correspond to multiple data sources, or to one type of data within a single data source; the latter is more common. A dataset can also contain multiple batches of data, that is, an enterprise can import the same dataset periodically in batches. During data access, as shown in the figure below, enterprises can bind a different Schema to each Dataset, each Schema can reference and reuse other sub-Schemas, and the RT-CDP Schema parsing engine then automatically persists the data, writing to different data tables according to the data definition. Real-time customer behavior likewise has its data structure defined through schemas, after which data is accessed continuously.
Extension 1: Using field mapping to solve unbounded multi-tenant field growth
What is the problem?
Aifanfan RT-CDP is a multi-tenant platform, and under multi-tenancy each enterprise has its own business data. A small or medium-sized enterprise may have hundreds or thousands of prospect data fields, and KA (key account) enterprises even more. As a SaaS service, how does the CDP support storing and analyzing so many fields in one model? With an engine whose columns can expand without limit, storage could simply be flattened by tenant + field. But for structured real-time storage, Aifanfan CDP chose Kudu, and Kudu officially recommends no more than 300 columns per table (it supports a few thousand at most), so that approach does not work.
What is our solution?
On the premise of tenant isolation, we solve this with field reuse, which is also reflected in the figure introducing the Schema model: the actual Profile and Event tables contain generic attr fields. The key points are as follows (a sketch follows the list):
- The fact table contains only fields with no business meaning;
- During data access and query, the mapping between business fields (logical fields) and fact fields converts the data before it is exchanged with the front end and tenants.
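A minimal sketch of the mapping idea, with hypothetical business fields and fact columns:

```java
import java.util.Map;

public class FieldMappingSketch {
    public static void main(String[] args) {
        // Per-tenant mapping from business (logical) fields to generic fact
        // columns, e.g. loaded from a metadata store.
        Map<String, String> tenantMapping = Map.of(
                "company_size", "attr1",
                "annual_budget", "attr2");

        // Writing: translate the business field to its fact column before persisting.
        String factColumn = tenantMapping.get("company_size"); // -> "attr1"
        System.out.println("store company_size into column " + factColumn);

        // Reading works the same way in reverse, so tenants only ever see
        // their own logical field names, never attr1/attr2.
    }
}
```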
4.3.2 Identity Service
This service can also be called ID Mapping, but compared with traditional ID Mapping the business scenario differs, so the functional focus differs too. Traditional ID Mapping mostly handles anonymous data in advertising scenarios, based on offline, model-driven probabilistic identification; ID Mapping in a CDP identifies more exact data, so it is more accurate and demands higher match rates and real-time performance.
To this end, we designed an identity relationship model that supports both B2B2C and B2C businesses. After standardized tenant data access, the identity relationship graph keeps growing and splitting based on the continuously arriving data. At the functional level, we support custom identity types and identity weights, and tenants can customize identity fusion actions for different identities. In addition, based on our industry analysis, common identities and fusion strategies are built in for tenants to use directly.
At the architecture level, Identity Service (ID Mapping) is built on cloud native + Nebula Graph and achieves tenant data isolation, real-time reads and writes, high-performance reads and writes, and horizontal scaling.
1. Cloud native + Nebula Graph
We deploy Nebula Graph on K8s to reduce operation and maintenance costs. Mainly we:
- Use the Nebula Operator to automate the operation and maintenance of our Nebula cluster under k8s;
- Use Statefulset to manage Nebula-related stateful node Pods;
- Each node uses local SSD disks to ensure the performance of graph storage services.
2. Optimize reading and writing
In general, Identity Service reads far more than it writes, but new tenants and new scenarios also demand high write capability, so read and write performance must both be taken into account. Under the premise of concurrency locking, reads and writes are optimized as follows:
- Design the data model to minimize the number of IOs in Nebula;
- Reasonable use of Nebula syntax to avoid redundant memory operations of Graphd;
- For queries, reduce deep traversals as much as possible; for updates, control write granularity to reduce the business impact of the lack of transactions.
Extension 1: How to stitch prospect identities when users are not logged in
For the scenarios of one person with multiple devices, and of a single device used by multiple people, we use offline correction to stitch the identities together.
4.3.3 Real-time storage and calculation
4.3.3.1 Stream Computing
The core capabilities of Aifanfan RT-CDP are realized by Apache Flink + Kafka. Stream computing on top of real-time streams achieves millisecond data latency.
The core data flow is shown in the figure above, which mainly includes the following parts after simplification:
- The collected and formatted data is uniformly sent to the cdp-ingest topic;
- RT-CDP has a unified entry job (Entrance Job) responsible for data cleaning, validation, schema parsing, identification, and so on, and then distributes data according to tenant attributes. Because this is the RT-CDP entry job and must scale horizontally, it is a stateless job.
- After distribution, different job groups carry out processing logic such as data transformation, persistence, and aggregation, on the one hand enriching prospect portraits, and on the other providing the data foundation for selecting prospects along more dimensions.
- Finally, the open data is distributed downstream, including to external systems and to business modules such as data analysis, the real-time rule engine, and policy models, for more real-time driving.
Extension 1: Data Routing
Why do routing?
As a basic data platform, Aifanfan RT-CDP serves not only tenants outside Baidu but also Baidu-internal tenants, even Aifanfan itself; not only small and medium-sized enterprises but also medium and large ones. For the former, the required service stability levels differ: how do we keep internal and external service capabilities from affecting each other? For the latter, enterprises of different scales have different prospect volumes, and operations such as RT-CDP segment selection consume different amounts of time and resources: how do we avoid unfair resource distribution?
How do we do it?
To address these problems, we use a data routing mechanism. We maintain a mapping between tenants and data-flow topics, which can be assigned by tenant characteristics or adjusted dynamically by tenant needs. The Entrance Job then distributes data according to this tenant mapping, sending it to job groups with different resource ratios for separate processing. This separates internal from external traffic, and resources can also be controlled according to individual tenant needs (see the sketch below).
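A hedged sketch of such routing with Flink's KafkaSink topic selector (the 1.14+ connector API); the topic names, the tenant-extraction logic, and the "ka-" prefix convention are all assumptions for illustration.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class TenantRoutingSketch {

    // Hypothetical lookup: route each tenant to a dedicated or shared topic.
    static String topicFor(String message) {
        String tenantId = message.split(",", 2)[0];     // assume "tenantId,payload"
        return tenantId.startsWith("ka-") ? "cdp-route-ka" : "cdp-route-sme";
    }

    public static KafkaSink<String> buildSink() {
        return KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.<String>builder()
                        // Choose the topic per record by tenant mapping,
                        // instead of writing everything to one fixed topic.
                        .setTopicSelector(TenantRoutingSketch::topicFor)
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
    }
}
```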
Extension 2: Custom Trigger batch write
Kudu's random read/write performance is relatively poor compared with HBase and the like. To reach a write capability of hundreds of thousands of TPS, we made some logical optimizations for Kudu writes, mainly a custom Trigger (two triggers: record count + time window) that turns single writes into batched writes while keeping millisecond-level latency.
Specific scheme: when the buffered batch exceeds N records, or the time window exceeds M milliseconds, a write operation is triggered.
A typical tenant marketing activity generates a large number of prospect behaviors, including system events and real-time user behavior, and this batched writing effectively improves throughput (see the sketch below).
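A minimal, framework-free sketch of the count-or-time trigger; N and M are placeholders, and in production this logic sits in a Flink operator whose flush performs one batched Kudu write.

```java
import java.util.ArrayList;
import java.util.List;

public class CountOrTimeTrigger<T> {

    private final int maxCount;         // N: flush when this many records buffer up
    private final long maxDelayMillis;  // M: or when the window has been open this long
    private final List<T> buffer = new ArrayList<>();
    private long windowStart = System.currentTimeMillis();

    public CountOrTimeTrigger(int maxCount, long maxDelayMillis) {
        this.maxCount = maxCount;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Returns a batch to flush when either condition fires, otherwise null. */
    public synchronized List<T> offer(T record) {
        buffer.add(record);
        long now = System.currentTimeMillis();
        if (buffer.size() >= maxCount || now - windowStart >= maxDelayMillis) {
            List<T> batch = new ArrayList<>(buffer);
            buffer.clear();
            windowStart = now;
            return batch;   // the caller performs one batched write to Kudu
        }
        return null;
    }
}
```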
4.3.3.2 Real-time storage
RT-CDP mainly contains three kinds of data: fragmented tenant data, unified prospect profiles, and offline analysis data. We store data in two clusters: one holds the unified prospect portrait and hot data with time-series attributes, the other holds cold data and data for offline computing, and each cluster integrates data-lake capabilities. On top of them we developed a unified Query Engine that supports cross-source, cross-cluster data queries and is transparent about the underlying storage engines.
Extension 1: Enhanced storage based on data tiering
Why do you need tiering?
If data were stored entirely in Kudu, cost would be high on the one hand (a Kudu cluster needs SSD disks for good performance); on the other hand, tenants mostly care about customers' real-time behavior changes within a recent period (a month, half a year, etc.), and older historical data is used very rarely.
Tiering mechanism
Considering everything, and to save resource costs, we chose Parquet as the extension storage and apply hot/cold tiered storage to the massive time-series data.
By usage frequency we divide data into three tiers: hot, warm, and cold. Hot data is what tenants use frequently, within three months. Warm data is used less often, generally only for selecting individual customer groups, and spans three months to one year. Cold data is basically unused by tenants and lies beyond one year. To balance performance, hot and warm data stay in the same cluster, while cold data lives in another cluster (the same cluster that feeds the policy models).
Specific plan:
- A unified view is established on top of hot, warm, and cold, and the upper layer performs data query according to the view.
- Scheduled offline migrations then move data from hot to warm and from warm to cold each day, and the views are updated immediately after each migration.
Extension 2: Managing mappings along the prospect fusion path to avoid data migration
Why do you need to manage mappings?
Prospect portraits carry a lot of behavior data, and fusion may happen frequently. If data were migrated on every prospect fusion, the migration cost would be very high; moreover, when prospect behavior involves warm and cold data, Delete operations are not possible. In similar situations the industry mostly compromises, for example migrating only a recent window of the user's hot data and leaving older history untouched; that solution is not ideal.
Mapping Management Mechanism
So we changed our thinking and solved the problem by maintaining the prospects' fusion paths.
Specific plan:
- Add a potential customer fusion relationship table (user_change_rela) to maintain the mapping relationship;
- Create views over the fusion relationship table and the time-series tables (such as event), keeping the business layer transparent.
For the fusion relationship table we also optimized the strategy: instead of maintaining every intermediate relationship along the path, we maintain only the direct relationship between each point on the path and the endpoint. This keeps relational queries from degrading even when a prospect's fusion path involves many prospects.
For example, when a prospect is fused twice (affId=1001 is first fused into 1002, then into 1003), user_change_rela changes as shown in the following figure and in the sketch below:
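A minimal in-memory sketch of this path-compression idea (the real user_change_rela is a table, and the ids are the illustrative ones above):

```java
import java.util.HashMap;
import java.util.Map;

public class FusionRelaSketch {

    private final Map<Long, Long> rela = new HashMap<>(); // old id -> endpoint id

    public void merge(long fromId, long toId) {
        // Repoint every id whose endpoint was fromId straight to the new endpoint,
        // so all mappings stay direct and no chain walk is ever needed.
        rela.replaceAll((old, end) -> end == fromId ? toId : end);
        rela.put(fromId, toId);
    }

    public long resolve(long id) {
        return rela.getOrDefault(id, id);   // always a single lookup
    }

    public static void main(String[] args) {
        FusionRelaSketch r = new FusionRelaSketch();
        r.merge(1001, 1002);
        r.merge(1002, 1003);
        System.out.println(r.resolve(1001)); // 1003, resolved in one hop
    }
}
```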
4.3.3.3 Analysis Enhancements
We chose Baidu's open-source Apache Doris as the data-enhanced analysis engine, providing customer insight capabilities for the Aifanfan customer-acquisition edition, such as journey analysis, crowd analysis, marketing effect analysis, fission analysis, and live broadcast analysis.
To make subsequent OP deployments easy to detach, we use the data output by the CDP as the data source for enhanced analysis, apply logical processing in Flink jobs (such as cleaning, dimension joins, and data flattening), and finally write the data to Doris with flink-doris-connector.
Writing to Doris directly through the connector has two advantages:
- Compared with Routine Load, flink-doris-connector needs one less Kafka hop;
- flink-doris-connector is more flexible in data processing than Routine Load.
flink-doris-connector is implemented on top of Doris's Stream Load, importing data via FE redirect to BE. In our actual use of flink-doris-connector, we flush every 10 s, with each batch submitting at most one million rows per write. For Doris, this batched flushing is friendlier than writing each record individually.
Extension 1: Difference between Routine Load and Stream Load
Routine Load method
It submits a resident import task in Doris, which writes data to Doris by continuously subscribing to and consuming JSON-format messages from Kafka.
From an implementation point of view, the FE manages the import task, and the task imports data via Stream Load on the BE.
Stream Load method
It uses the streaming compute framework Flink to consume Kafka business data and writes to Doris over HTTP using the Stream Load method.
From an implementation point of view, the framework writes data synchronously to Doris through the BE, and the Coordinator BE returns the import status directly once the write succeeds. When importing, it is best to use the same label for the same batch of data: repeated requests for the same batch are then accepted only once, which guarantees At-Most-Once (see the sketch below).
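A hedged sketch of a raw Stream Load call with an idempotency label, using Java's built-in HTTP client; the host, database, table, and credentials are placeholders, and production code must also handle the FE-to-BE 307 redirect and parse the JSON status in the response.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class StreamLoadSketch {
    public static void main(String[] args) throws Exception {
        String body = "{\"tenant_id\": \"t1\", \"event\": \"click\"}";
        String auth = Base64.getEncoder().encodeToString("user:passwd".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://doris-fe:8030/api/cdp/behavior/_stream_load"))
                .header("Authorization", "Basic " + auth)
                // Reusing the same label for the same batch means duplicate
                // submissions are accepted only once (At-Most-Once, as above).
                .header("label", "behavior-batch-20220301-0001")
                .header("format", "json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> resp =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());   // JSON containing the load Status
    }
}
```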
4.3.4 Real-time rules engine
In Aifanfan's private-domain products, flexible grouping is an important product capability. How do we group customers in real time with complex, flexible rules over prospect attributes, identities, behaviors, and other dimensions? The real-time rule engine here was built for exactly this. The feature itself is not new; DMPs have had similar capabilities, and many CDPs and customer management platforms do too, but achieving real-time, high-throughput rule judgment under multi-tenancy and massive data is a challenge.
In Aifanfan RT-CDP, on the one hand the number of tenants is large: how does a SaaS service support high-performance multi-tenant grouping? On the other hand, Aifanfan RT-CDP aims for real-time judgment over real-time streams. We therefore developed a real-time rule engine based on multi-layer data. Here is a brief description; a separate article will introduce it later.
What's the problem?
The traditional implementation translates the rules into one complex SQL statement when a tenant triggers a grouping request, in real time or on schedule, and runs an ad-hoc SQL query over the tenant's prospect data pool, generally with a layer of inverted indexes over prospects. With few tenants, or under OP deployment, the query speed is acceptable. However, implementing a rule engine over real-time streams must solve the following problems:
- Real-time judgment of massive data
- Memory footprint of window granularity data aggregation
- Window storm under sliding window
- Data aggregation problem with no window rules
- Window data update after prospect data change
Real-time rules engine implementation
Like many products, Aifanfan's rule-based grouping is mainly a combination of two layers of And/Or rules. Based on the rules' characteristics, we divide them into the types shown in the figure below: ordinary attribute operations (P1, P2), ordinary identity operations (I1), small-window behavior judgments (E1), large-window behavior judgments (E2), and windowless behavior judgments (E3).
For rule flexibility and efficient data processing, we defined a set of rule-parsing algorithms, then use Flink's powerful distributed computing and state management capabilities to drive the real-time rule engine. The concept of streaming data was covered above; here, following a prospect behavior as it enters real-time rule judgment shows more intuitively how data is filled in along the stream: the prospect's corresponding attribute information is supplemented, the complete prospect record is then judged against the rules in real time, and prospects that satisfy the rule are finally written into the Segment table (see the sketch below).
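A minimal sketch of evaluating a two-layer And/Or rule tree against one enriched prospect; the types, fields, and rules are illustrative, and the real engine additionally manages windows and state in Flink.

```java
import java.util.List;
import java.util.function.Predicate;

public class RuleTreeSketch {

    record Prospect(String industry, int appletVisits7d, boolean attendedLive) {}

    // The segment rule is the And of groups; each group is an Or of leaf rules.
    static Predicate<Prospect> and(List<Predicate<Prospect>> groups) {
        return p -> groups.stream().allMatch(g -> g.test(p));
    }

    static Predicate<Prospect> or(List<Predicate<Prospect>> leaves) {
        return p -> leaves.stream().anyMatch(l -> l.test(p));
    }

    public static void main(String[] args) {
        Predicate<Prospect> segmentRule = and(List.of(
                or(List.of(p -> p.industry().equals("education"),   // P1-style
                           p -> p.industry().equals("retail"))),    // P2-style
                or(List.of(p -> p.appletVisits7d() >= 3,            // E1, small window
                           p -> p.attendedLive()))));               // E3, windowless

        Prospect p = new Prospect("retail", 4, false);
        System.out.println(segmentRule.test(p)); // true -> written to the Segment table
    }
}
```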
In addition, the rule engine is a service independent of business objects such as Segment, and can support rule-related business scenarios such as grouping, tagging, and MA journey nodes.
4.3.5 Extension
4.3.5.1 Elastic Cluster
With the help of cloud capabilities, Aifanfan has achieved storage-compute separation and dynamic scaling of resources. Flexible scaling strategies can be customized: resources increase or decrease with message volume, the cluster grows in real time during peaks to provide computing power, and shrinks during troughs to cut costs promptly.
Our cluster has four types of nodes: Master, Core, Task, and Client, as shown above.
- Master node: cluster management node, deploy NameNode, ResourceManager and other processes, and achieve automatic migration when components fail;
- Core nodes: computing and data storage nodes, deploying processes such as DataNode