NetEase Cloud Music started building its real-time computing platform in 2018. After several years of development, the platform has worked its way into nearly every Cloud Music business. This article is a hands-on share by Dayu, starting from a routine operations question and walking through recent progress and future plans of the Cloud Music real-time computing platform. The main topics are:
- Platform features
- Batch-stream integration
- Future plans
Since the NetEase Cloud Music real-time data warehouse platform went live a year and a half ago, the real-time data warehouse has begun to take shape. We now run 300+ real-time warehouse tables and 1,200+ tasks, about 1,000 of which are SQL tasks; total Kafka egress traffic has reached 18 GB/s, and the platform serves 200+ users.
The growth in data volume and users has brought ever greater challenges to the platform's usability and stability, including the stability of Kafka and of the cluster, operations pressure, and plenty of early technical debt. Business growth has exposed weaknesses in the infrastructure, but it has also left us with a great deal of experience in platform construction and operations.
1. Platform Features
For the platform's overall capabilities, you can refer to "Technical Transformation of Cloud Music Real-time Data Warehouse and Some Future Plans"; here we mainly introduce some of our latest work.
"My task is lagging. How do I scale it up, and why is it lagging?"
This is a question we run into all the time in daily operations, and it is usually time-consuming to answer because there are many possible causes. To deal with it, we have done some work to strengthen our operations capabilities.
1. More complete IO metrics
IO problems are one of the most frequent causes of the issue above, covering message-reading efficiency, dimension-table JOIN efficiency, SINK efficiency, and so on. The performance and stability of third-party storage directly affect the stability of real-time tasks. To locate such problems quickly, we added a large set of IO-related metrics.
1.1 Kafka consumer-side performance metrics
1.2 Read-path deserialization metrics
These include:
- Deserialization RT
- Proportion of deserialization errors
On the format side, we developed a set of format proxies that report the relevant metrics and skip malformed records without modifying the original format code. Adding the property format.proxy to specify the proxy class is enough to wrap a format in different ways.
For example, specifying format.proxy=magina reports the performance metrics above; specifying format.proxy=ds parses the log format encapsulated by DS and delegates parsing of the Body part to the proxied format, so there is no need to develop a separate format for DS-encapsulated logs, while the performance metrics are still reported and features such as skipping bad messages are still available.
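As a rough illustration of how the proxy is attached, a flow-table DDL might look like the sketch below. Only the format.proxy values come from the description above; the table name, fields, topic, and the remaining connector/format options are standard Flink Kafka options used here as assumptions, since the platform's actual DDL conventions are not shown in this article.
```sql
-- Sketch only: 'format.proxy' is the platform extension described above;
-- everything else (table, topic, fields, standard Kafka/JSON options) is illustrative.
CREATE TABLE iplay_ods.ods_demo_log (
  `timestamp` BIGINT,
  os          STRING,
  sceneid     STRING,
  body        STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'ods_demo_log',
  'properties.bootstrap.servers' = 'kafka-broker:9092',
  'format' = 'json',
  -- wrap the underlying format: report deserialization RT and error-rate
  -- metrics, and skip records that fail to parse
  'format.proxy' = 'magina'
  -- or 'format.proxy' = 'ds' to parse DS-encapsulated logs and delegate
  -- the Body part to the underlying format
);
```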
1.3 Dimension-table JOIN metrics
On the dimension-table JOIN side we added:
- Response time of data lookups
- Local cache hit rate
- Proportion of query retries
- Proportion of records successfully JOINed, etc.
1.4 Data-write performance metrics
- Data serialization RT
- Average response time of writes to the external data source, etc.
The whole set of IO metrics is implemented as a common wrapper around the top-level Flink Connector interfaces, and the related connector code has been refactored accordingly. As long as a connector is implemented against our interfaces, the detailed metrics are reported automatically and the connector author does not need to care about them.
2. The Kafka partition problem
The number of Kafka partitions is another reason task performance cannot scale. For exactly-once semantics, read performance, and read stability, Flink reads Kafka messages by actively pulling them. This ties the parallelism of the Kafka-reading tasks to the number of partitions and severely limits how far a task's performance can be scaled. Take the following case as an example:
```sql
SET 'table.exec.state.ttl' = '1h';
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '10s';
SET 'table.exec.mini-batch.size' = '100000';

INSERT INTO music_kudu_online.music_kudu_internal.ads_ab_rtrs_user_metric_hour
SELECT
  from_unixtime(`timestamp`, 'yyyy-MM-dd') AS dt,
  from_unixtime(`timestamp`, 'HH') AS `hour`,
  os, sceneid, parent_exp, `exp`, exp_type, userid,
  count(1) AS pv
FROM iplay_ods.ods_rtrs_ab_log
INNER JOIN abtest_online.abtest.abtest_sence_metric_relation
  FOR SYSTEM_TIME AS OF ods_rtrs_ab_log.proctime
  ON ods_rtrs_ab_log.sceneid = abtest_sence_metric_relation.sceneid
GROUP BY from_unixtime(`timestamp`, 'yyyy-MM-dd'),
         from_unixtime(`timestamp`, 'HH'),
         os, sceneid, parent_exp, `exp`, exp_type, userid;
```
This is a fully aggregated real-time task. The DAG Flink generates for this SQL looks roughly like this:
If the flow table ods_rtrs_ab_log we read has 5 partitions and the SQL task runs with a parallelism of 7, then, constrained by the Kafka partition count and by Flink's operator chaining, the message reading, the dimension-table JOIN, and the MINI BATCH operations are all bound to the Kafka partitions and cannot scale out. The dimension-table JOIN in particular is an IO operation whose parallelism seriously affects overall performance, and at that point the only way to improve performance is to increase the number of Kafka partitions.
But that is a heavyweight operation and is very likely to affect other tasks that read the same flow table. To solve this, we modified the Kafka connector to support adding an extra shuffle through configuration. For the job above, for example, we added:
```sql
'connector.rebalance.keys' = 'sceneid,parent_exp,userid'
```
After being read, messages are hash-partitioned by the sceneid, parent_exp, and userid fields, which greatly improves the scalability of the job; and because the keyBy is on specified fields, it also significantly raises the dimension-table JOIN cache hit rate and the effectiveness of MINI BATCH.
Besides the configuration above, we also support adding a random rebalance, a rescale, and splitting the parsing step into its own operator, to further improve scalability. Note that an extra shuffle brings additional thread and network overhead, so machine load also has to be watched when enabling these options. Although an extra shuffle improves scalability, the additional network and thread cost means that on a machine that is already under pressure it can easily backfire and make performance worse for the same resources, so it has to be tuned to your own job and environment.
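The article shows only the property itself, not where it is attached. As a minimal sketch, here it is passed through Flink's dynamic table options hint on the source table; whether the platform's modified connector accepts the option this way, rather than only in the flow-table DDL, is an assumption.
```sql
-- Sketch: hash-distribute records by sceneid, parent_exp and userid right after
-- the Kafka source, so downstream parallelism is no longer capped by the number
-- of Kafka partitions. Attaching the option via a hint is an assumption.
SELECT sceneid, parent_exp, userid, count(1) AS pv
FROM iplay_ods.ods_rtrs_ab_log
  /*+ OPTIONS('connector.rebalance.keys' = 'sceneid,parent_exp,userid') */
GROUP BY sceneid, parent_exp, userid;
```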
3. Kafka usage optimization
With traffic growing rapidly, Kafka stability is another major problem we face, including rack bandwidth limits, cross-datacenter bandwidth, jitter during Kafka scale-out and scale-in, and Kafka configuration issues. We have run into essentially every problem others have. To address them, we did the following work:
3.1 A mirroring service to relieve bandwidth pressure and protect high-priority tasks
We built a mirroring service on top of Flink: a second Kafka cluster is deployed in a different machine-room module, and the mirroring service keeps the two Kafka clusters in sync. The primary Kafka serves the more important P0-level real-time tasks, while less critical tasks read from the mirror cluster.
We use YARN's label feature so that, by choosing different queues, tasks land in the intended machine room, reducing cross-datacenter bandwidth consumption. To make it easy for users to switch between Kafka clusters, we also extended the Flink flow table so that one flow table can be mounted on multiple Kafka clusters at the same time and the cluster can be switched with a simple configuration change. After a round of task sorting and switching, Kafka bandwidth usage has improved significantly.
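To make "one flow table, multiple Kafka clusters" concrete, here is a minimal sketch. The property name 'kafka.cluster' and the value 'mirror' are purely hypothetical; the article only says the cluster can be switched through a simple configuration.
```sql
-- Hypothetical option name: point a non-P0 task at the mirror cluster while
-- the flow table's name and schema stay unchanged.
SELECT os, count(1) AS pv
FROM iplay_ods.ods_rtrs_ab_log
  /*+ OPTIONS('kafka.cluster' = 'mirror') */
GROUP BY os;
```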
3.2 Better Kafka monitoring
In daily work we found that many developers do not know much about Kafka itself, and because operational experience was lacking, our control over Kafka was not very strict in the early days, which caused many usage problems. So we integrated the data of Music's internal Kafka monitoring service, combined it with our platform's task lineage, and built our own Kafka monitoring service.
The system as a whole is still fairly rudimentary. Besides tracking the relationships between Kafka, flow tables, and tasks, we actively monitor the following:
- Whether the number of partitions of a Kafka topic is reasonable: we mainly watch for too few or too many partitions, with the emphasis on too few, to prevent downstream tasks from being unable to keep up because the partition count is too small;
- Skew in Kafka partition production: to prevent poor downstream processing performance caused by unbalanced data across Kafka's own partitions;
- Skew in Kafka partition consumption: to prevent data going unconsumed when Kafka partitions change and downstream tasks have not enabled partition discovery;
- Alerts on traffic surges and drops: traffic alerts on key queues to guarantee the quality of real-time data.
Kafka version upgrade: to solve the stability problems of Kafka scale-out and the lack of resource isolation, the Music public technology team did some secondary development on top of Kafka 2.x and turned the whole Kafka service into a platform, supporting smooth topic expansion and resource isolation.
Similar to YARN's label technology, it allows different topics to be placed on machines in different regions, provides a complete message mirroring service with offset replication, and offers a unified Kafka operations and monitoring platform. This part will be covered in detail in a later article.
3.3 Partitioned flow table technology
After the real-time data warehouse went live, we found that the following situations greatly affected job stability and the usability of flow tables:
- Very often we only need 1% of the data in a flow table, but because there is no way to read on demand, we have to spend a lot of resources parsing and reading the other 99%, wasting a great deal of bandwidth and compute. SQL development itself offers no way to parse a log on demand, so every message has to be parsed in full, consuming even more compute.
- When we split a big topic into many small topics by experience and by business line, one table becomes many small tables, and users need a lot of background knowledge to understand which messages each of these small tables with identical schemas contains. Usability is poor, the design does not fit the overall design logic of a data warehouse, and unifying batch and stream table metadata in the future becomes essentially impossible.
In offline scenarios we have many ways to solve these problems and cut unnecessary IO, such as bucketing, storing data sorted, Parquet's predicate pushdown, and partitioned tables. But for real-time tables there does not seem to be a good option among the existing public solutions. So, to solve the problems above, we developed a partitioning scheme for flow tables, whose overall idea is similar to Hive table partitioning:
We implement our own partitioning scheme for real-time flow tables using the SupportsFilterPushDown interface provided by the Flink Table Source: one partition corresponds to one topic, and partitions not needed by the user's query conditions are filtered out, reducing unnecessary reads (see the sketch at the end of this subsection). The first version is already online, and the Cloud Music exposure logs have been split in a first pass. Along the way we also switched from the previous JSON format to the AVRO data format. The optimization effect is clear in practice:
- The AVRO format alone brings at least 30%+ bandwidth savings, and message parsing is about twice as fast as parsing Music's original log format.
- With the partitioned flow table, we migrated the first 4 consumption tasks of the exposure logs and have already freed 7 physical machines, saving more than 75% of compute and bandwidth resources on average.
These are admittedly fairly extreme cases, but they suggest that once partitioned flow table technology is fully rolled out and adopted, it will be an optimization that brings a qualitative change.
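As a rough sketch of how a partitioned flow table is meant to be used: a query that filters on a partition column lets the source prune topics via SupportsFilterPushDown. Treating os as a partition column of ods_rtrs_ab_log is an assumption for illustration; the actual partition keys and DDL syntax are not shown in the article.
```sql
-- Illustrative: the WHERE predicate on the (assumed) partition column is pushed
-- down through SupportsFilterPushDown, so the source only subscribes to the
-- Kafka topics backing the matching partitions instead of reading the whole log.
SELECT sceneid, userid, count(1) AS pv
FROM iplay_ods.ods_rtrs_ab_log
WHERE os = 'android'   -- partition predicate, pruned at the source
GROUP BY sceneid, userid;
```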
2. Batch-Stream Integration
Real-time data warehousing has always been a major goal of the Cloud Music data platform team, and behind that goal, batch-stream integration is a "term", "concept", "technology", or "product" that we cannot avoid. Before sharing our work, let me first recount a conversation I had with an algorithm engineer I ran into in the elevator:
Algorithm engineer: When will your batch-stream integration go live? We are waiting to use it.
Me: What do you actually need right now?
Algorithm engineer: Many of our real-time metrics we have to develop ourselves; we cannot directly reuse ready-made data warehouse data the way we can offline.
This conversation shows that what the algorithm engineers want is not some batch-stream integration technology; what they want is ready-made real-time data warehouse data that improves their development efficiency. So behind "batch-stream integration", what do the business stakeholders in different roles actually want?
For operations, product managers, executives, and analysts:
What they want to see are accurate, real-time, analyzable reports, and the key word is analyzable. When the result data fluctuates abnormally, we must have real-time detailed data available for analysis and queries so we can investigate the cause of the fluctuation. When the boss has a new idea and wants to run a second round of analysis on an existing report, we must be able to provide detailed, analyzable data to produce the answer.
Take real-time daily active user statistics: the common approach is to deduplicate, exactly or approximately, against a KV store such as Redis keyed by user ID and then compute the real-time DAU. But when DAU fluctuates abnormally, the data in Redis is not analyzable, so it is hard to explain the fluctuation quickly and same-day analysis is impossible. That kind of solution and result is clearly not good enough.
For data warehouse developers:
- Unified metadata management for the real-time and offline warehouses, a unified model, and unified storage, reducing the cost of building and operating the warehouse and improving its overall usability;
- Unified development code: one set of SQL solving both offline and real-time development, reducing development and operating costs and eliminating the large discrepancies between real-time and offline results that come from different business interpretations and different logic.
For algorithm engineers:
A unified real-time/offline warehouse table they can use directly. A unified model lowers the barrier to understanding the business and improves the usability of warehouse data; an easy-to-use warehouse metadata service makes it convenient for algorithm engineers to do secondary feature development and improves model development efficiency. It also provides accurate, real-time, analyzable data on model effectiveness, improving the iteration speed of algorithm models.
Summed up, the goals of batch-stream integration cover three areas:
- Unified code: one set of SQL covering both real-time and offline business development;
- Unified warehouse metadata: one table that can be read both offline and in real time, a unified model for a batch-stream-integrated warehouse;
- Real-time report data: this differs from unified warehouse metadata. Product reports must be able to query real-time results within seconds, whereas unified warehouse data usually only needs to be stored in real time and is not as sensitive to OLAP query efficiency as report data.
1. Unified code
Since real-time SQL itself is not yet particularly mature, a lot of logic that is easy to implement offline either cannot be implemented in real time or has stability problems.
The industry is still exploring this. Alibaba's main approach is to use a single Flink engine to unify real-time and offline SQL, but that is also still being put into practice. Our approach is to build out the lower layers of the warehouse so that the ADS-layer business logic is shielded from some of real-time SQL's limitations, allowing a single set of SQL for product report development; that is also a direction we can try in the future. Besides trying to unify SQL at the report layer, we have done some work and planning on unified code:
- Unified UDFs: we integrated and upgraded the platform framework to Flink 1.12 and unified offline and real-time UDFs;
- Unified metadata management: on the Flink SQL side we integrate with the metadata center service and provide reads and writes in the catalog.db.table form. To unify metadata, we also wrapped Spark SQL a second time and integrated it with the metadata center, so that heterogeneous data sources can be read and written in the catalog.db.table form as well (a small sketch follows below).
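As a minimal sketch of what the catalog.db.table convention means in practice, reusing the catalogs and tables from the job earlier in this article; that the identical statements run unchanged in both Flink SQL and the wrapped Spark SQL is an assumption.
```sql
-- Real-time side: read the Kafka-backed flow table by its three-part name.
SELECT sceneid, count(1) AS pv
FROM iplay_ods.ods_rtrs_ab_log
GROUP BY sceneid;

-- Offline/OLAP side: the Kudu-backed result table is addressed the same way;
-- the metadata center resolves each catalog.db.table to its concrete storage.
SELECT dt, `hour`, sceneid, pv
FROM music_kudu_online.music_kudu_internal.ads_ab_rtrs_user_metric_hour
WHERE dt = '2021-01-01';
```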
Scenario-based, configuration-style batch-stream integration: for some simple business scenarios we plan to build scenario-based batch-stream-integrated implementations, such as batch-stream-integrated indexing tasks and a batch-stream-integrated ETL cleansing platform. Due to resource constraints this is still in the planning stage.
Unifying batch and stream SQL also has, with current technology, a bigger prerequisite: the complexity of the logs themselves, which comes down to how standardized and complete the event tracking is. Unlike offline processing, real-time processing cannot push large amounts of attribution logic and association logic onto the data side; setting aside reasonableness and cost, a great deal of that work can be done in offline scenarios.
Real-time scenarios, however, are very sensitive to performance and stability. Putting a large amount of logic on the data side brings many things that cannot be implemented at all, high implementation costs, and many stability and data-latency problems. If event tracking is not governed well, building the whole real-time data warehouse becomes a problem. For this reason, Cloud Music launched the Sugon tracking governance project together with the Youshu team, thoroughly reworking how tracking is implemented across Cloud Music's products and improving its standardization and accuracy, which in turn reduces the development cost of the real-time data warehouse.
2. Unified data warehouse metadata
There are currently two types of solutions in the industry:
- The first is to build a batch-stream mapping layer. The solution Alibaba has disclosed is of this kind. It is well suited to older products that already have both a real-time warehouse and an offline warehouse: a unified mapping-layer view on top of the two provides an integrated user experience. The overall principle is shown in the figure below:
- The second is to build a new metadata system in which multiple storages, such as HDFS and Kafka, are mounted under a single schema. Data is written to all of them at write time, and readers choose the appropriate storage depending on how they read. Arctic, developed by the NetEase Shufan product team, takes this approach:
The overall idea is to wrap multiple storages such as Iceberg, Kafka, and HBase and use different storage in different scenarios. On top of that, Arctic has done a lot of secondary development on Iceberg to solve the problem of updating DWS data, providing CopyOnWrite and MergeOnRead capabilities similar to Hudi's to address Flink's stability problems with full aggregation. Cloud Music has already piloted it in some new business scenarios, and dozens of batch-stream-integrated tables are online. If you want to learn more about Arctic, you can reach out to NetEase Shufan's real-time computing team; I will not go into more detail here.
3. Real-time report data
Providing real-time report data mainly depends on the OLAP engine and storage. The storage side must support real-time data updates while also serving queries within seconds. In many cases there is no way to write the final results directly into storage: report data involves a lot of flexible ad-hoc queries, and writing final results directly would require real-time cube capabilities like Kylin's, which puts too much pressure on development and on Flink's own computation, wastes a lot of resources and storage, brings stability problems and a heavy development workload, and leaves very limited room for secondary analysis. So at this layer we need the OLAP engine to provide second-level query latency over at least tens of billions of rows. Our main solutions currently use two kinds of storage, Kudu and ClickHouse. Taking our old version of ABTest as an example, the solution is as follows:
For the latest hour-level and day-level results, we use Impala to read Kudu on the fly and join in the freshest data; for day-level data older than one day, or hour-level data older than two hours, we precompute with Spark and store it in a result table. The two parts are UNIONed together and served to users, guaranteeing both the timeliness of the results and the overall query experience.
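A rough sketch of what the serving query could look like. The table and column names (ads_abtest_metric_his precomputed by Spark, ads_abtest_metric_rt in Kudu) and the ${today} placeholder are hypothetical; only the pattern of UNIONing fresh Kudu results with Spark-precomputed history comes from the description above.
```sql
-- Hypothetical names; pattern: precomputed history UNION ALL fresh results.
SELECT dt, `hour`, sceneid, metric_value
FROM ads_abtest_metric_his             -- Spark-precomputed history
WHERE dt < '${today}'
UNION ALL
SELECT dt, `hour`, sceneid, metric_value
FROM ads_abtest_metric_rt              -- latest results in Kudu, read via Impala
WHERE dt = '${today}';
```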
3. Future Plans
**Improvement of operations and maintenance tools**
The development of real-time SQL lowers the difficulty of real-time data development and greatly lowers the barrier to entry. But partly because real-time SQL itself is immature and something of a black box, and partly because many engineers bring offline SQL habits or MySQL-style database experience into real-time task development, this creates a lot of operations pressure on the platform. So building better operations tooling and more complete real-time task metrics is one of our main directions going forward.
**Maturing partitioned flow table technology**
Partitioned flow tables are a technology that can bring a qualitative change to the Cloud Music real-time platform in resource usage, Kafka load, and data warehouse construction. We have only finished the first version; going forward we will keep improving dynamic partition awareness, partition modification, schema changes, operations monitoring, and adoption.
**Scenario-based batch-stream integration**
For example, batch-stream-integrated index task construction and batch-stream-integrated ETL tools, with unified log cleansing rules, to lay a solid foundation for a batch-stream-integrated data warehouse.
**Exploration of batch-stream integrated storage**
- Investigate the solutions currently available in the industry and, combined with Music's business scenarios, provide a complete solution that lowers the development threshold and improves the development efficiency of real-time reports;
- Build out the logical layer of batch-stream integration, etc.
Finally, attached is the real-time computing solution architecture of the NetEase Shufan Youshu team: a high-performance, one-stop real-time big data processing solution built on Apache Flink, widely applicable to streaming data processing scenarios.