
Abstract: This article is compiled from the speech delivered by Lin Jia, head of the Billing Real-time Platform and SDK technology at NetEase Interactive Entertainment Technology Center, at the industry practice session of Flink Forward Asia 2021. It is divided into three parts:

  1. Start with an in-app purchase payment
  2. Two-line development of real-time SDK and platformization
  3. Towards real-time full correlation

Click to view live replay & speech PDF

When it comes to NetEase Interactive Entertainment, the first thing that comes to mind is games. As one of NetEase's core business lines, keeping the game business running stably and reliably is naturally the top priority, and the most critical part of the game business is the reliability of the in-app purchase service. This article's sharing therefore starts with a single in-app purchase.

1. Start with an in-app purchase payment

img

When a player purchases an item in the game, the client first communicates with the channel provider and the billing center to complete the order and payment. The billing center also interacts with the channel provider to verify the legitimacy of the client's order and its payment status, and the game server is notified to ship the item only if the order is legitimate. Across this whole process, the logs and monitoring data points generated by each participant may differ in source, data structure, and time pacing. The process also involves communication networks, databases, monitoring systems, and so on, which makes it very complicated.

img

The continuous, massive generation of data, the session-level associations between records, the heterogeneity of data sources and data structures, and the inconsistent time pacing are all reasons why we chose real-time processing.

img

Before 2017, our processing methods were relatively outdated: network disks, rsync, T+1 offline tasks, and the like.

img

Numerous components, a fragmented technology stack, low timeliness, and coarse-grained resource usage meant that resources could not be used evenly, and overall efficiency was low.

img

The figure above is a schematic of the resource profile of our previous offline computing business, in which the previous day's data reports were computed in the early morning. Before streaming computation became popular, this was a very common large-scale usage pattern: a large number of machines ran Spark offline tasks in the early morning to compute the previous day's results. For reports to be delivered on time, the offline cluster needed substantial computing power, so a large amount of machine resources was stacked up, and those machines sat idle for much of the day, resulting in low resource efficiency.

img

If this kind of computing task could be made real-time, the required computing power could be spread across each time slice, avoiding the severe resource spike in the early morning. The computing power of these machines could then be hosted on a resource management platform and shared with other businesses, improving overall efficiency.

So how should a real-time framework be chosen? After in-depth research and experimentation, we finally chose Flink: the features it provides fit our scenario very well. The following figure lists some of our considerations in the technical architecture selection.

img

2. Dual-line development of real-time SDK and platformization

Since 2018, NetEase Interactive Entertainment has pursued a dual-track development plan to comprehensively push its data center, JFlink, toward real-time processing.

img

After many iterations, we have now formed a one-stop operation and maintenance platform plus an SDK that supports configuration-based development, completing the advance from "usable" to "practical". The next step is to make users love using it.

How to improve the efficiency of both manpower and code is something we paid great attention to when designing JFlink from the very beginning. We hoped to maximize output with limited manpower, so configurability and modularization of the SDK became particularly important: every real-time job should be describable with one set of configuration semantics.

img

The connectors, processing functions, and data flow objects commonly used in JFlink are encapsulated in the SDK, so that they can be assembled and used in a configurable form. The SDK also provides a unified configuration grammar that, given a job described as configuration, dynamically organizes the Flink DAG, so one SDK package can cover a variety of data services, improving code reuse and efficiency.
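To make the idea concrete, here is a minimal, purely illustrative sketch of how a declarative job description can be turned into a running pipeline. The registry names (`SOURCES`, `OPERATORS`, `SINKS`) and config fields are hypothetical stand-ins for the JFlink SDK's real configuration grammar, and plain Python generators stand in for Flink streams.

```python
# Hypothetical sketch: assembling a pipeline from a declarative job config,
# in the spirit of the JFlink SDK's configuration grammar. All registry and
# field names are illustrative, not the real SDK's.

SOURCES = {"kafka": lambda conf: iter(conf.get("demo_records", []))}
OPERATORS = {
    "filter": lambda conf: (lambda s: (r for r in s if r.get(conf["field"]) == conf["equals"])),
    "project": lambda conf: (lambda s: ({k: r[k] for k in conf["fields"]} for r in s)),
}
SINKS = {"memory": lambda conf, s: list(s)}

def build_and_run(job_conf):
    """Wire source -> operators -> sink exactly as the config describes."""
    stream = SOURCES[job_conf["source"]["type"]](job_conf["source"])
    for op in job_conf.get("operators", []):
        stream = OPERATORS[op["type"]](op)(stream)
    return SINKS[job_conf["sink"]["type"]](job_conf["sink"], stream)

job = {
    "source": {"type": "kafka",
               "demo_records": [{"type": "pay", "amt": 1}, {"type": "view", "amt": 2}]},
    "operators": [{"type": "filter", "field": "type", "equals": "pay"},
                  {"type": "project", "fields": ["amt"]}],
    "sink": {"type": "memory"},
}
```

Calling `build_and_run(job)` yields `[{"amt": 1}]`: the same job definition could be re-targeted to other sources or sinks by editing configuration alone, which is the property the SDK's grammar aims for.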

img

On top of the SDK, a real-time business can be started by writing or generating the UDFs for a Kafka source, a TiDB sink, and an intermediate aggregation window, without any additional development.
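Such a "Kafka source, windowed aggregation, TiDB sink" job might be described roughly as follows. The field names below are invented for illustration; the SDK's actual configuration keys are not documented here.

```python
# Purely illustrative job description for "Kafka source -> windowed
# aggregation -> TiDB sink". Every key name is hypothetical.
job_config = {
    "source": {"type": "kafka", "topic": "billing-logs", "group.id": "jflink-demo"},
    "operators": [
        {"type": "parse", "udf": "MessageUnifiedParse"},
        {"type": "window", "kind": "tumbling", "size": "60s",
         "group_by": ["game_id", "pay_channel"], "agg": {"amount": "sum"}},
    ],
    "sink": {"type": "tidb", "table": "pay_stats_minutely"},
}

def validate(config):
    """Minimal structural check: a job must declare source, operators, sink."""
    return all(k in config for k in ("source", "operators", "sink"))
```

A platform front end can validate such a description before submission, so a malformed job is rejected without ever touching the cluster.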

img

To cooperate with the SDK's unified job grammar, we also built a one-stop processing platform so that data operation and maintenance personnel can construct their own data businesses in a one-stop, convenient, and visual way.

img

Even intricate DAGs like these can still be generated from the parsed configuration grammar.

img

The SDK-based strategy realizes functional modularization, job configuration, data visualization, and stream-batch unification. It makes module reuse routine, lets everyone understand each other's jobs, allows heterogeneous data to be processed by already-written UDF modules, and, more importantly, lets historical jobs be transitioned to Flink.

SDK-ization also gives us the ability to quickly follow the community in upgrading Flink versions. The SDK isolates business logic from the Stream API, and most extension functions are extensions of Flink's native classes on the SDK side. When following Flink through major version upgrades, the business side can upgrade with almost no changes, and we also avoid the huge cost of continuously merging internally extended Flink functionality from each version's internal branch into the new version.

The other track of the dual-track development plan is NetEase Interactive Entertainment's one-stop platform, which is built entirely on Kubernetes so that each job runs as its own independent cluster.

img

The figure above is the platform's technical architecture diagram. It builds on big data components such as Nexus and HDFS as infrastructure and maintains a versioned software repository hosting the SDK and other business jar packages. At the job level, Flink adopts the notion of per-job standalone clusters on Kubernetes: each job runs in its own independent Kubernetes namespace with its own resource quota and dependency set, achieving complete isolation between business jobs and fine-grained resource allocation.

To support platform functions such as business iteration, job operation, and log collection and analysis, the JFlink platform also encapsulates various operation and maintenance interfaces, exposed externally through stateless REST service nodes. The platform also lets operation and maintenance personnel create real-time jobs visually, an excellent result of the cooperation between the platform and the SDK.

On the one-stop platform, users can monitor the real-time status of their jobs, check running logs, roll back to historical versions, and even inspect historical exceptions, records and statistics, risk control, and the detailed management of job life cycles.

img

In addition to the capabilities mentioned above, there are quite a few other functions on our one-stop platform, all of which cooperate with the SDK to form our real-time computing system.

3. Towards real-time full correlation

Next, from the perspective of the data business, we analyze NetEase Interactive Entertainment's experience and practice in developing real-time business in the key field of billing.

Our earliest practice was statistical analysis of the logs generated on billing nodes. Logs from different sources often come in wildly different forms; the callbacks from external channel providers in particular are hard to standardize. How to deal with these messy formats and turn them into data that can be processed uniformly was our first exploration goal.

img

To this end, the SDK encapsulates a "Message Unified Parse" UDF, which processes semi-structured data by defining an abstract syntax tree, along with UDFs that handle Group By and aggregation functions. With these, the statistical business is realized through the configuration grammar and written into our self-developed TSDB through an encapsulated sink.
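The essence of such a unified-parse step can be sketched as follows: per-source parsers normalize raw lines into one record shape so downstream operators see a single schema. The parser functions and field names here are illustrative; the real SDK drives parsing from an abstract-syntax-tree description rather than hand-written functions.

```python
import json

# Hypothetical per-source parsers standing in for the SDK's AST-driven
# "Message Unified Parse" UDF.
def parse_channel_callback(line):
    # e.g. "orderId=123&status=OK&ts=1634567890"
    return dict(kv.split("=", 1) for kv in line.split("&"))

def parse_billing_json(line):
    return json.loads(line)

PARSERS = {"channel": parse_channel_callback, "billing": parse_billing_json}

def unify(source, raw):
    """Normalize one raw log line into a unified record with fixed fields."""
    rec = PARSERS[source](raw)
    return {
        "order_id": str(rec.get("orderId") or rec.get("order_id")),
        "status": str(rec.get("status", "UNKNOWN")).upper(),
        "source": source,
    }
```

Once every source emits the same record shape, Group By and aggregation UDFs can run over the mixed stream without caring where each record came from.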

Log analysis and monitoring observes the billing business from the perspective of interfaces, module access volume, and latency, realizing non-intrusive real-time monitoring of the business. Micro-batch processing shortens the original discovery time and improves the monitoring detection rate, making the business more reliable.

Next, we turned our attention to making a general ETL framework.

img

Heterogeneous data is converted by a Parser into a unified view and a unified flow object, after which it can be processed and transformed by built-in UDFs that conform to the protocol. We also implemented a JavaScript UDF, so data transformations can be handled easily and flexibly by embedding JS scripts.
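The shape of such an ETL chain can be sketched as below: a chain of UDFs applied to each unified record, with a scriptable UDF compiled once from a small expression. The real SDK embeds JavaScript for this; plain Python expressions are used here purely for illustration, and all names are hypothetical.

```python
def script_udf(expr):
    """Stand-in for the SDK's embedded-script UDF: compile a small
    expression once, then evaluate it per record with the record bound
    to `r` (the real SDK embeds JavaScript instead of Python)."""
    code = compile(expr, "<udf>", "eval")
    return lambda record: eval(code, {"r": dict(record)})

def run_etl(records, udfs):
    """Apply a chain of UDFs to every record; a UDF returning None drops it."""
    out = []
    for rec in records:
        for udf in udfs:
            rec = udf(rec)
            if rec is None:
                break
        else:
            out.append(rec)
    return out

to_cents = script_udf("{**r, 'amount': int(r['amount'] * 100)}")
drop_nonpositive = script_udf("r if r['amount'] > 0 else None")
```

For example, `run_etl([{"amount": 1.5}], [to_cents])` yields `[{"amount": 150}]`, while records filtered to `None` by any UDF simply disappear from the output stream.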

The data processed by Flink flows into our self-developed heterogeneous data warehouse, where business parties can use it conveniently, even querying and aggregating logs generated in real time directly with SQL. These services consume, in real time, the data generated by the interface modules across the payment environment, processing on the order of 30 billion records per day, and provide a powerful guarantee for further real-time data business development.

img

Around 2019, we began to think about how to connect these isolated points into organic lines: for a payment occurring in the payment environment, can a full-link correlation analysis from beginning to end tell us whether any of the services involved have problems?

img

The logs of these services come from widely varying sources: client logs, billing logs, gateway logs, and so on. For linking such logs through context analysis, Flink provides a very convenient API: keyed streams plus session windows.
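The behavior of keyed streams plus session windows can be illustrated with a small batch analogue: key events by, say, order ID, then close a session whenever the gap between consecutive events exceeds the session timeout. This is a self-contained sketch of the semantics, not Flink code.

```python
from collections import defaultdict

def session_windows(events, gap):
    """Group (key, timestamp, payload) events into per-key sessions, closing
    a session when consecutive events are more than `gap` apart -- a batch
    simulation of Flink's keyed stream + session window semantics."""
    by_key = defaultdict(list)
    for key, ts, payload in sorted(events, key=lambda e: (e[0], e[1])):
        by_key[key].append((ts, payload))
    sessions = {}
    for key, items in by_key.items():
        current, out = [items[0]], []
        for prev, cur in zip(items, items[1:]):
            if cur[0] - prev[0] > gap:
                out.append(current)
                current = []
            current.append(cur)
        out.append(current)
        sessions[key] = out
    return sessions
```

With a gap of 10, events for order "A" at timestamps 0, 5, and 100 form two sessions: the first two events link into one payment attempt, and the late event opens a new one. In Flink, the same grouping happens incrementally as events arrive, with watermarks deciding when a session is complete.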

img

The figure above is the architecture diagram of full-link monitoring. Knowledge about link analysis is encapsulated into a model and loaded into the Flink real-time analysis program, which stitches together the data on each payment link in real time and writes it into our self-developed graph database for downstream use. A daily Spark job additionally processes abnormal links and completes missing segments, which is also a practice of the Lambda architecture.

img

The figure above shows the effect of full-link stitching. A payment order can be displayed as a chain, which is very helpful for DBAs and product staff when locating payment problems.

Around 2020, NetEase Interactive Entertainment began to explore real-time data warehouses, an important application of which is the user profile system.

img

Previously, data reports were presented as T+1 and timeliness was relatively low. After the real-time upgrade, reports can now be queried in real time through an interface. This improvement in timeliness allows products to do refined operations, respond to marketing needs more promptly, and increase revenue.

img

These various computations are likewise realized in the form of configuration plus the SDK.

img

Especially for widening streaming data, using Flink's Async I/O to perform lookup joins against external stores is a powerful assistant for real-time data processing.
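The core idea of an async lookup join, issuing dimension lookups concurrently instead of blocking per record, can be sketched with `asyncio`. The in-memory dimension table and field names below are invented for illustration; in a real job the lookup would hit an external store through Flink's Async I/O operator.

```python
import asyncio

# Toy dimension table; in production this would be an external store
# queried non-blockingly (e.g. via Flink's Async I/O).
USER_DIM = {"u1": {"region": "cn"}, "u2": {"region": "eu"}}

async def lookup(user_id):
    await asyncio.sleep(0)  # stands in for a non-blocking remote call
    return USER_DIM.get(user_id, {})

async def widen(records):
    """Issue all dimension lookups concurrently, then merge each result
    into its record -- the essence of a streaming lookup join."""
    dims = await asyncio.gather(*(lookup(r["user_id"]) for r in records))
    return [{**r, **d} for r, d in zip(records, dims)]
```

Running `asyncio.run(widen([{"user_id": "u1", "amount": 30}]))` attaches the user's region to the payment record; records whose key misses the dimension table pass through unchanged, a design choice that keeps the stream flowing even when dimension data is incomplete.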

The real-time user data warehouse and its real-time indicators provide the product with player-level micro queries and report-level macro queries. This user data can be fed into visualization tools and displayed intuitively, so that product operations can discover patterns that raw numbers alone cannot reveal and further tap the value of the data.

With the above practice in hand, we began to consider whether the various data in the entire payment environment could be correlated at the level of a link and a user, to achieve macro monitoring of the payment environment.

img

We use Flink's interval join to perform correlation analysis across the heterogeneous items in a payment-environment session, such as the payment database TiDB and the various log data generated by the payment middleware.
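The semantics of an interval join can be shown with a small batch analogue: for each left event, match right events with the same key whose timestamp falls within a bounded interval relative to the left event's timestamp. The tuple layout is invented for illustration.

```python
def interval_join(left, right, lower, upper):
    """For each left event (key, ts, data), emit pairs with same-key right
    events whose timestamp lies in [ts + lower, ts + upper] -- a batch
    analogue of Flink's interval join on two keyed streams."""
    out = []
    for lk, lts, ld in left:
        for rk, rts, rd in right:
            if lk == rk and lts + lower <= rts <= lts + upper:
                out.append((lk, ld, rd))
    return out
```

So a TiDB order row at timestamp 10 joins a middleware log at timestamp 12 under bounds `[0, 5]`, while a log arriving 30 units later falls outside the interval and is excluded; in streaming Flink, those bounds also tell the engine when buffered state can be safely cleared.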

img

For example, in TiDB there are some 40 database rows covering order placement and payment, while the logs contain records of the payment process, from the user's order placement on the client through to the channel callback. By correlating each of these, the situation of the corresponding service module can be analyzed.

img

Going further, the links generated by each module can be associated and merged, finally yielding correlation analysis results for the entire payment environment.

For example, consider a possible abnormality: after shipment, the volume of data logs plummets or error codes multiply, so operation and maintenance personnel can quickly determine that the delivery service is abnormal. The figure above shows this type of correlation analysis. In some complex production-environment scenarios, this full-correlation analysis framework processes data from nearly ten heterogeneous sources and correlates sessions across dozens of business scenarios. On top of this correlation capability, many real-time reports on the payment environment are built to help operations fix problems, guide products in formulating strategies, and ultimately increase revenue.

img

The improvements in resource efficiency and data efficiency brought by real-time data services are plain to see, and the high timeliness has sparked new inspiration for how data can be used, which is the new big data future that Flink brings.


