Big Data Insight Portrait Automation in Practice

By Ding Long, Senior System Test Engineer, NetEase Cloud Merchant

1. What is Consumer Insights?

Consumer insight builds on big data and takes it one level further: it serves customers through analysis and insight. Once a vertical industry's business is understood in depth, the data can be further applied and analyzed and delivered to enterprises as more valuable insight reports, which directly reflect the current state of consumers and guide enterprise decision-making.

This essentially subverts the service model of consulting companies. Their model is to deliver a report once per period, with much of the time spent on data collection and report writing. Once we distill the analysis methods and build data dashboards, we can deliver insight consulting as SaaS, which lets enterprises see data changes more frequently and cross-analyze data indicators more flexibly.

Own-product insight portrait: presents 20 brand-based values together with big data portraits.

Replacement insight: for a given mobile phone, shows users' device-replacement status.

2. Business realization

(1) Overall structure

Data access and storage:

A Spark application, deployed on the NetEase Mammoth data platform, synchronizes the wide table and label table data from Hive to MySQL and ClickHouse. It synchronizes the data as follows:

Write label classification, label, and label enumeration data to MySQL
Read wide table data from Hive and write it to ClickHouse directly through INSERT statements
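
The second step can be sketched on the test side as follows. This is a minimal illustration rather than the actual Spark job: the class name, the table name user_wide, and the column names are all hypothetical, and the sketch only builds the parameterized INSERT text that a batch writer would bind values to.

```java
import java.util.Collections;
import java.util.List;

/** Builds a parameterized INSERT for batch-writing wide-table rows (hypothetical names). */
final class WideTableInsertSketch {

    // Table and column names passed in here are illustrative, not the real schema.
    static String insertSql(String table, List<String> columns) {
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        String cols = String.join(", ", columns);
        // One "?" placeholder per column, to be bound through a PreparedStatement batch.
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    public static void main(String[] args) {
        System.out.println(insertSql("user_wide", List.of("user_id", "brand", "price_band")));
    }
}
```

Keeping the statement parameterized means the same text can be reused for every batch of rows read from Hive.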

Data service:

User portraits: mainly serves the insight algorithms of different industries, providing various label combinations and generating visual insight portraits and data that reflect the current state of different types of consumers
Tag management: mainly responsible for the hierarchical management of tags and the processing of special tags, and provides an interface for external queries of the tag list. Tag classifications, tags, and tag enumerations are all stored in MySQL

Given this business structure, the top testing priority is to verify that the big data insight portrait algorithms and the synchronized big data labels are correct, and to ensure that the user portraits are reasonable.
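
The three tag levels named above (classification, tag, enumeration) could be modeled roughly as below. This is only a sketch under assumptions: the class and field names, the sample values, and the "classification/tag" path format are hypothetical and do not come from the real MySQL schema.

```java
import java.util.ArrayList;
import java.util.List;

/** Three-level tag model: classification -> tag -> enumeration values (hypothetical names). */
final class TagHierarchySketch {

    static final class Tag {
        final String name;
        final List<String> enumValues;

        Tag(String name, List<String> enumValues) {
            this.name = name;
            this.enumValues = enumValues;
        }
    }

    static final class TagClassification {
        final String name;
        final List<Tag> tags;

        TagClassification(String name, List<Tag> tags) {
            this.name = name;
            this.tags = tags;
        }
    }

    // Flattens the hierarchy into "classification/tag" paths, as a tag-list query might return.
    static List<String> tagPaths(List<TagClassification> classifications) {
        List<String> paths = new ArrayList<>();
        for (TagClassification c : classifications) {
            for (Tag t : c.tags) {
                paths.add(c.name + "/" + t.name);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Tag gender = new Tag("gender", List.of("male", "female"));
        TagClassification demographics = new TagClassification("demographics", List.of(gender));
        System.out.println(tagPaths(List.of(demographics)));
    }
}
```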

(2) Data flow

After cleaning, analysis, and distillation, the consumer insight data is given data tags and stored in Hive. A deployed Spark application reads the wide table data from Hive and stores it in ClickHouse, and stores the tag tables in MySQL. The final business presentation is calculated from the wide table in ClickHouse.

Based on the above data flow and business usage, the overall test approach is:

Verify data integrity and accuracy against the ClickHouse wide table
Test the insight algorithms behind the business portrait presentation

3. Current status

Large data volume: after each synchronization, manually verifying the various labeling rules takes a long time, often a day or more.

Too many insight algorithms: the insight algorithms for different industries vary with the combination of conditions, which greatly increases the regression workload.

Much repetitive work: the data is updated frequently, about once every half month, and must be re-checked after each synchronization, which means a lot of repeated work. The same is true for the algorithms.

4. Solutions

(1) Data verification automation

After data synchronization completes, the data synchronization module triggers QA's automatic label verification through an interface. The automated verification platform verifies the synchronized label data asynchronously and triggers an alarm notification if it finds a problem.

DataValidataController provides the API that developers call. ValidataImpl implements data integrity verification (data magnitude, tag magnitude, period-over-period data ratio, and tag search) and data accuracy verification (data uniqueness and data correlation).
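
The asynchronous trigger-and-alarm flow can be pictured with a small sketch. It assumes nothing about the real platform: the class name, the in-memory ALARMS list (standing in for an actual notification channel), and the check names are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Runs label checks asynchronously and records an alarm for each failure (hypothetical names). */
final class AsyncValidationSketch {

    // Collected alarm messages; a real platform would send a notification instead.
    static final List<String> ALARMS = new CopyOnWriteArrayList<>();

    // Runs one named check off the caller's thread and records an alarm if it fails.
    static CompletableFuture<Boolean> validateAsync(String checkName, Supplier<Boolean> check) {
        return CompletableFuture.supplyAsync(check)
                .exceptionally(error -> false) // a check that throws counts as failed
                .thenApply(passed -> {
                    if (!passed) {
                        ALARMS.add("label verification failed: " + checkName);
                    }
                    return passed;
                });
    }

    public static void main(String[] args) {
        // The caller (for example the synchronization module) could return immediately;
        // join() is used here only so the demo waits for the result.
        validateAsync("tag magnitude", () -> false).join();
        System.out.println(ALARMS);
    }
}
```

Because the check runs on another thread, the synchronization job is not blocked while a long verification is in progress.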
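
Two of these checks are easy to illustrate in isolation. The sketch below is not the ValidataImpl code: the class and method names and the tolerance value are assumptions, and it works on in-memory values rather than on query results.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** In-memory versions of a period-over-period check and a uniqueness check (hypothetical names). */
final class DataChecksSketch {

    // Period-over-period change: (current - previous) / previous.
    static double chainRatio(long previous, long current) {
        if (previous <= 0) {
            throw new IllegalArgumentException("previous count must be positive");
        }
        return (current - previous) / (double) previous;
    }

    // True when the row count moved by no more than the tolerance (for example 0.2 = 20%).
    static boolean magnitudeStable(long previous, long current, double tolerance) {
        return Math.abs(chainRatio(previous, current)) <= tolerance;
    }

    // Returns the ids that appear more than once, in the order the duplicates are found.
    static Set<String> duplicateIds(List<String> ids) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : ids) {
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(magnitudeStable(1000, 1500, 0.2));
        System.out.println(duplicateIds(List.of("u1", "u2", "u1")));
    }
}
```

In practice the counts and ids would come from queries against the previous and current synchronization batches.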

(2) Insight portrait algorithm service

The big data insight portrait algorithm service replaces cumbersome hand-written SQL and manual calculation: it automatically generates the corresponding algorithm SQL from the project information, performs the different calculations across the SQL results, provides summaries of the insight algorithms, and exposes the corresponding algorithm API services.

Overall idea: based on the current consumer insight business architecture, we built an automated testing architecture on top of a ClickHouse module.

Application layer: summarizes the various algorithms by label, price condition, and industry, including own-product insight, competitor insight, replacement insight, and industry segmentation insight. Algorithms for follow-up requirements and feature iterations will also be provided in this layer.

Data service: provides the corresponding algorithm API services so that each algorithm can be called through an interface.
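
As an illustration of generating algorithm SQL from project information, the sketch below builds a simple tag-distribution query. Everything specific in it is an assumption: the Project fields, the price column, and the table and tag names are hypothetical, and the article does not describe the real project settings or queries.

```java
import java.util.ArrayList;
import java.util.List;

/** Generates a tag-distribution query from hypothetical project settings. */
final class InsightSqlSketch {

    // Hypothetical project information; minPrice and maxPrice may be null (no bound).
    static final class Project {
        final String wideTable;
        final String tagColumn;
        final Integer minPrice;
        final Integer maxPrice;

        Project(String wideTable, String tagColumn, Integer minPrice, Integer maxPrice) {
            this.wideTable = wideTable;
            this.tagColumn = tagColumn;
            this.minPrice = minPrice;
            this.maxPrice = maxPrice;
        }
    }

    // Builds "users per tag value", optionally restricted to a price range.
    // Identifiers are concatenated directly, so they must come from trusted configuration.
    static String distributionSql(Project p) {
        List<String> conditions = new ArrayList<>();
        if (p.minPrice != null) {
            conditions.add("price >= " + p.minPrice);
        }
        if (p.maxPrice != null) {
            conditions.add("price < " + p.maxPrice);
        }
        String where = conditions.isEmpty() ? "" : " WHERE " + String.join(" AND ", conditions);
        return "SELECT " + p.tagColumn + " AS tag_value, count() AS users FROM " + p.wideTable
                + where + " GROUP BY " + p.tagColumn;
    }

    public static void main(String[] args) {
        System.out.println(distributionSql(new Project("user_wide", "age_band", 2000, 4000)));
    }
}
```

Generating the text from one place keeps the test queries consistent when labels or price conditions change between projects.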

 @RequestMapping("/userProfile")
 public String userProfile(@RequestParam String projectId)

 @RequestMapping("/insight/summary")
 public String summaryInsight(@RequestParam String projectId, @RequestParam String key)

 @RequestMapping("/insight/avg")
 public String avgInsight(@RequestParam String projectId, @RequestParam String key)

ClickHouse connection: ClickHouse is connected in the same way as MySQL, which is convenient and fast.

Integrate goapi (interface management platform) and overmind (enterprise R&D efficiency platform) to implement CI:

goapi implements the scenario use cases and result assertions. Using overmind's requirement CI and modular calls to functional scenarios, the scenario use cases are triggered automatically when an insight requirement is developed, which provides automatic verification and gates for test and release.
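
The article does not say which client is used. Assuming a Java service that goes through JDBC, the similarity might look like the sketch below: only the URL differs between the two stores. The class name, host, and database name are hypothetical, and open() needs the matching JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Shows that, under a JDBC assumption, both stores are opened with the same call. */
final class JdbcConnectSketch {

    // Builds a URL such as jdbc:clickhouse://host:8123/db or jdbc:mysql://host:3306/db.
    static String jdbcUrl(String subprotocol, String host, int port, String database) {
        return "jdbc:" + subprotocol + "://" + host + ":" + port + "/" + database;
    }

    // Same call for both stores; requires the corresponding JDBC driver at runtime.
    static Connection open(String url, String user, String password) throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) {
        // Host and database names are placeholders; no connection is opened here.
        System.out.println(jdbcUrl("clickhouse", "localhost", 8123, "insight"));
        System.out.println(jdbcUrl("mysql", "localhost", 3306, "insight"));
    }
}
```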

(3) Pitfalls encountered

Although ClickHouse offers a rich set of functions, some of them perform very poorly. Testing showed that the hasAny function is very time-consuming when calculating big data and industry averages, so both development and testing avoid it and use a rows-to-columns transformation instead.
Big data portraits require relatively high precision, and ClickHouse can lose precision when handling high-precision values, so precision loss needs to be handled explicitly in such calculations.
ClickHouse performs poorly when WHERE conditions keep changing. If the business's WHERE conditions are rich and variable and several projects request insights at the same time, portraits are returned quite slowly.
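
The article does not describe how precision loss was handled. One common approach, assumed here purely for illustration, is to do the final ratio arithmetic in decimal with an explicit scale and rounding mode when computing expected values on the test side. The class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Decimal arithmetic for portrait percentages, with explicit scale and rounding. */
final class PrecisionSketch {

    // Share of a segment as a percentage, rounded half-up to the given number of decimals.
    static BigDecimal sharePercent(long part, long total, int scale) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return BigDecimal.valueOf(part)
                .multiply(BigDecimal.valueOf(100))
                .divide(BigDecimal.valueOf(total), scale, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Binary floating point cannot represent 0.1 or 0.2 exactly, so this prints false.
        System.out.println(0.1 + 0.2 == 0.3);
        // Decimal arithmetic with a fixed scale gives a reproducible result: 33.33
        System.out.println(sharePercent(1, 3, 2));
    }
}
```

Fixing the scale and rounding mode in one place makes the expected value comparable with what the portrait page displays.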

(4) Achievements

Improved test coverage

Before introduction: sampling covered only data label magnitude and label search.
After introduction: coverage includes data magnitude, tag magnitude, period-over-period data ratio, tags going online and offline, tag search, tag row uniqueness, and tag association.

Improved test efficiency

Before introduction: data synchronization testing required 1 person-day, and insight algorithm testing required 2 person-days.
After introduction: regression testing requires no manual effort.

5. Summary

The consumer insight business will take on user analysis for new industries in the future. Whatever industry is connected, the essence is to analyze user tags and behaviors to form portraits and insight capabilities, and the two pillars underneath, data and algorithms, can be iterated on and maintained well in the automated testing service for big data portraits.

About the author

Ding Long is a senior system testing engineer at NetEase Cloud Merchant, currently responsible mainly for testing its questionnaire survey and consumer insight products.
