
Abstract: This article is compiled from a talk by Lin Xiaobo, senior development engineer at NetEase Games, in the platform construction track of Flink Forward Asia 2021. The main contents include:

  1. The development history of Flink SQL at NetEase Games
  2. StreamflySQL v1, based on a template jar
  3. StreamflySQL v2, based on SQL Gateway
  4. Future work


1. Development History of Flink SQL at NetEase Games


NetEase Games' real-time computing platform is called Streamfly, named after Stormfly, the dragon in the movie "How to Train Your Dragon". Since we were already migrating from Storm to Flink, we replaced the "Storm" in Stormfly with the more general "Stream".

The predecessor of Streamfly was a subsystem named Lambda under the offline job platform Omega, responsible for scheduling all real-time jobs. Lambda initially supported Storm and Spark Streaming, and was later changed to support only Flink. In 2019 we spun Lambda out and built the Streamfly computing platform on top of it. At the end of 2019 we developed and launched the first version of the Flink SQL platform, StreamflySQL. This version provided basic Flink SQL functionality based on a template jar, but the user experience left much to be desired, so in early 2021 we rebuilt it from scratch as the second version of StreamflySQL, which is based on SQL Gateway.

To understand the difference between the two versions, we need to review the basic workflow of Flink SQL first.


The SQL submitted by the user is first parsed by the Parser into a logical execution plan; the logical plan is then optimized by the Planner's Optimizer into a physical execution plan; the physical plan is turned, via the Planner's code generation, into the Transformations common to the DataStream API; finally, the StreamGraphGenerator converts these Transformations into the final representation of a Flink job, the JobGraph, which is submitted to the Flink cluster.

This whole series of steps takes place inside the TableEnvironment. Depending on the deployment mode, the TableEnvironment may run in the Flink Client or in the JobManager. Flink now supports three cluster deployment modes: Application, Per-Job, and Session. In Application mode the TableEnvironment runs on the JobManager side, while in the other two modes it runs on the client side. All three modes share one trait, however: the TableEnvironment is one-shot and exits automatically after submitting the JobGraph.
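To make this concrete, here is a minimal sketch of a Flink SQL program on the public Table API (the table names and connectors are illustrative, not from the talk); every step described above happens inside the TableEnvironment's executeSql calls:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlPipelineDemo {
    public static void main(String[] args) {
        // The TableEnvironment hosts the whole pipeline described above:
        // Parser -> Optimizer -> CodeGen -> Transformations -> JobGraph.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Illustrative tables using the built-in datagen and print connectors.
        tEnv.executeSql(
                "CREATE TABLE src (id BIGINT, msg STRING) WITH ('connector' = 'datagen')");
        tEnv.executeSql(
                "CREATE TABLE dst (id BIGINT, msg STRING) WITH ('connector' = 'print')");

        // executeSql drives parsing, optimization, and code generation, then
        // builds the JobGraph and submits it to the configured cluster.
        tEnv.executeSql("INSERT INTO dst SELECT id, msg FROM src");
    }
}
```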


To better reuse the TableEnvironment for efficiency, and to provide stateful operations, some projects run the TableEnvironment in a separate, long-lived server-side process. This yields a different architecture, which we call server-side SQL compilation; its counterpart is client-side SQL compilation.

Some students may ask why there is no JobManager-side SQL compilation. This is because the JobManager is a relatively closed component that is not well suited to extension, and even if we did it, the result would be essentially the same as client-side compilation. So overall there are two common Flink SQL platform architectures: client-side and server-side.

Client-side SQL compilation, as the name implies, performs SQL parsing, translation, and optimization on the client side (a generalized client, not necessarily the Flink Client). Typical examples are the generic template jar and Flink's SQL Client. The advantages of this architecture are that it works out of the box, has a low development cost, and uses Flink's public APIs, which makes version upgrades easier. The disadvantages are that it is hard to support advanced features and that a relatively heavy TableEnvironment must be started every time, so performance is relatively poor.

Then there is server-side SQL compilation. This architecture puts the SQL parsing, translation, and optimization logic into a separate server process, making the client very light and bringing the design closer to a traditional database architecture. A typical example is Ververica's SQL Gateway. The advantages are good extensibility, support for many customized features, and good performance. The disadvantage is that there is no mature open-source solution: as mentioned above, SQL Gateway is only a relatively early prototype lacking many enterprise-grade features, so using it in production requires considerable modification, and those modifications touch many Flink internal APIs and demand more Flink background knowledge. In general, the development cost is higher, and the workload of subsequent version upgrades is larger.

Editor's note: The Apache Flink community is currently developing the SQL Gateway component, which will natively provide Flink SQL service capabilities and be compatible with the HiveServer2 protocol. It is planned to be released in version 1.16, so stay tuned. Interested students can follow FLIP-91 [1] and FLIP-223 [2] to learn more, and everyone is very welcome to contribute.

Back to our Flink SQL platform, our StreamflySQL v1 is based on client-side SQL compilation, while v2 is based on server-side SQL compilation. Let me introduce them one by one.

2. StreamflySQL v1 based on template jar

There are three main reasons why StreamflySQL v1 chose client-side SQL compilation:


  • The first is platform integration. Unlike many companies whose job schedulers are written in Java, the mainstream language in big data, our Lambda scheduler is developed in Go. Lambda was designed from the start to support a variety of real-time computing frameworks, so for loose coupling and alignment with the company's technology stack, it uses Go and, much like YARN, invokes each framework's command-line interface via dynamically generated shell scripts. Such a loosely coupled interface gives us a lot of flexibility; for example, we can easily support multiple Flink versions without forcing users to upgrade in lockstep with the platform. But it also makes it impossible to call Flink's native Java API directly.
  • The second reason is loose coupling. When development started, the Flink version was 1.9. The Client API at the time was relatively complex and ill-suited to platform integration, and the community was pushing a refactoring of the Client, so we tried to avoid building the Flink SQL platform on the Client API.
  • The third reason is practical experience. The template jar + configuration center pattern was already widely used in NetEase Games, so we had accumulated a lot of hands-on experience with it. All in all, we naturally adopted the template jar + configuration center architecture for the v1 version.


The picture above is the overall architecture of the v1 version. On top of the Lambda job platform, we added the StreamflySQL backend as a configuration center, which generates a Lambda job from the user-submitted SQL and run configuration plus the common template jar.

The overall job submission process is as follows:

  1. The user submits the SQL and its run configuration in the front-end SQL editor.
  2. The StreamflySQL backend receives the request, generates a Lambda job, and passes the configuration ID to it.
  3. Lambda then starts the job, which behind the scenes executes the Flink CLI run command to submit it.
  4. The Flink CLI run command starts the Flink Client, which loads and executes the main function of the template jar; the main function reads the SQL and configuration and initializes the TableEnvironment (see the sketch after this list).
  5. The TableEnvironment reads the necessary metadata, such as Databases and Tables, from the Catalog. By the way, at NetEase Games we do not use a unified catalog to maintain the metadata of different components; instead, each component has its own metadata center, corresponding to its own catalog.
  6. Finally, the TableEnvironment compiles the JobGraph and deploys the job as a Per-Job Cluster.
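As a rough illustration of steps 4 to 6, here is a hedged sketch of what such a template jar entry point might look like; the configuration-center lookup and table names are hypothetical stand-ins, since the real template jar is NetEase-internal:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Hypothetical sketch of a template jar's entry point; the real template jar
// and its configuration-center client are NetEase-internal.
public class SqlTemplateJob {
    public static void main(String[] args) throws Exception {
        String configId = args[0];
        // Hypothetical call: fetch the SQL and job settings by the ID Lambda passed in.
        String sql = fetchSqlFromConfigCenter(configId);

        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Each component's metadata center would be registered here as its own
        // catalog, e.g. tEnv.registerCatalog(name, catalog), before execution.

        // Compile the single INSERT statement and submit it; under Per-Job
        // deployment this spins up a dedicated cluster for the job.
        tEnv.executeSql(sql);
    }

    private static String fetchSqlFromConfigCenter(String configId) {
        // Placeholder for the configuration-center lookup.
        return "INSERT INTO dst SELECT * FROM src";
    }
}
```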

StreamflySQL v1 took the Flink SQL platform from zero to one and met some business needs, but many pain points remained.

The first pain point is slow response.


For a typical SQL job started via the template jar, preparing the TableEnvironment may take 5 seconds; SQL compilation and optimization, including interacting with the Catalog to fetch metadata, may take another 5 seconds; after the JobGraph is compiled, preparing the Per-Job Cluster generally takes more than 20 seconds; finally, we have to wait for the Flink job to be scheduled, that is, to go from scheduled to running, which may take another 10 seconds.

In total, the v1 version takes at least 40 seconds to start a Flink SQL job, which is fairly long. Careful analysis of these steps shows that only SQL compilation/optimization and job scheduling are unavoidable; the others, such as the TableEnvironment and the Flink cluster, could be prepared in advance. The slowness comes from resources being lazily initialized and almost never reused.

The second pain point is the difficulty of debugging.


Our needs for SQL debugging are as follows:

  • The first point is that the debugged SQL should be basically the same as the online SQL.
  • The second point is that debugging SQL cannot affect online data. It can read online data, but cannot write.
  • Third, because debugging SQL usually only needs a small sample of data to verify correctness, we want to limit the resources available to debug jobs, partly for cost reasons and partly to prevent debug SQL from competing with online jobs for resources.
  • Fourth, because the amount of data processed by debugging SQL is relatively small, we want to get the results in a faster and more convenient way.

In the v1 version, we designed the following solutions for the above requirements:

  1. First, for debug SQL, the system replaces the original Sink with a dedicated PrintSink during SQL translation, which addresses the first two requirements.
  2. The PrintSink is then rate-limited, with overall throttling achieved through Flink's backpressure mechanism (see the sketch after this list), and the job's maximum execution time is capped; on timeout the system automatically terminates the job. This addresses the resource-limit requirement.
  3. Finally, for faster response, a debug job is not submitted to the YARN cluster; instead, a MiniCluster is started locally on the Lambda server to execute it, which also makes it easy to extract the PrintSink results from standard output. This addresses the last requirement.
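The PrintSink rate limiting mentioned in step 2 can be pictured with the following sketch (our illustration, not NetEase's actual implementation): a sink that blocks in invoke() naturally propagates backpressure upstream and so throttles the whole debug job.

```java
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

// A minimal sketch of a self-throttling print sink. Because invoke() blocks,
// Flink's backpressure mechanism slows the entire upstream pipeline down.
public class ThrottledPrintSink extends RichSinkFunction<Row> {
    private final long pauseMillisPerRecord;

    public ThrottledPrintSink(long pauseMillisPerRecord) {
        this.pauseMillisPerRecord = pauseMillisPerRecord;
    }

    @Override
    public void invoke(Row value, Context context) throws Exception {
        Thread.sleep(pauseMillisPerRecord); // crude throttle; backpressures upstream
        System.out.println(value);          // results are scraped from stdout
    }
}
```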


The architecture of the debug mode is shown in the figure above. Compared with the normal SQL submission process, the main difference is that the job is not submitted to YARN but executed locally on the Lambda server, which saves the cost of preparing a Flink cluster and makes it easier to control resources and fetch results.

The above debugging solutions are basically available, but there are still many problems in the actual use process.

  • First, if the submitted SQL is complex, its compilation and optimization may take a long time, so the job easily hits the timeout and may be terminated by the system before any result is output; such SQL also puts a lot of pressure on the server.
  • Second, this architecture cannot debug jobs with long time windows or jobs that need to bootstrap state.
  • Third, because the execution results are returned as a batch after the job ends rather than streamed during execution, users have to wait until the job finishes (usually more than 10 minutes) before seeing any results.
  • Fourth, replacing the Sink of debug SQL during the SQL translation stage is implemented by modifying Flink's Planner, which effectively intrudes business logic into the Planner and is not elegant.

The third pain point is that the v1 version only allows a single DML.


Compared with traditional databases, the SQL statements we support are very limited. For example, MySQL's SQL can be divided into DML, DQL, DDL and DCL.

  • DML is used to manipulate data; common statements are INSERT / UPDATE / DELETE. StreamflySQL v1 only supports INSERT, which is consistent with Flink SQL: Flink SQL uses Retract mode (similar to a changelog) to represent UPDATE/DELETE internally, so supporting only INSERT is not a problem.
  • DQL is used to query data; the common statement is SELECT. Flink SQL supports it, but StreamflySQL v1 does not, because without a sink no meaningful Flink job can be generated.
  • DDL is used to define metadata; common statements are CREATE / ALTER / DROP. StreamflySQL v1 does not support it, because the template jar invokes SQL through sqlUpdate (see the sketch after this list), which does not support pure metadata operations, and starting a whole TableEnvironment for a pure metadata operation is completely uneconomical.
  • Finally, DCL is used to manage data permissions, for example the GRANT and REVOKE statements. Flink SQL does not support DCL, because Flink is currently only a user of data rather than a manager, so DCL is meaningless there.
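For illustration, the following hedged sketch shows the Flink 1.9-era pattern a template jar like v1's would use: a single INSERT submitted through sqlUpdate (the table names are made up):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;

// Flink 1.9-era pattern (the version v1 was built on); table names illustrative.
public class V1StyleJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // sqlUpdate buffers the INSERT; as used in v1 it handles exactly one DML
        // per job, with no support for pure metadata operations.
        tEnv.sqlUpdate("INSERT INTO sink_table SELECT id, name FROM source_table");

        env.execute("streamflysql-v1-job"); // compiles and submits the buffered INSERT
    }
}
```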

Taken together, the v1 version supports only a single DML statement, which leaves our beautiful SQL editor rather underused. Based on the above pain points, we researched and developed StreamflySQL v2 this year. v2 adopts the server-side SQL compilation architecture.

3. StreamflySQL v2 based on SQL Gateway


Our core requirement was to address the pain points of the v1 version, including improving the user experience and providing more complete SQL support. The general idea was to adopt the server-side SQL compilation architecture to improve extensibility and performance. In addition, we changed the cluster deployment mode to Session Cluster, which prepares cluster resources in advance and saves the time of starting a YARN application.

There are two key issues here.

  • First, should we build it entirely ourselves or base it on an open-source project? During our research we found that Ververica's SQL Gateway project fit our needs well: it is easy to extend, it is a basic implementation of the community's FLIP-91 SQL Gateway, and it will be easy to align with the community's future direction.
  • Second, SQL Gateway itself can submit jobs, which overlaps with our existing Lambda platform and would cause duplicated construction and fragmented management: authentication and authorization, resource management, and monitoring and alerting would each have two entrances. So how should responsibilities be divided between the two? Our final solution is the two-phase scheduling of Session Clusters, separating resource initialization from job execution: Lambda is responsible for managing Session Clusters, while StreamflySQL is responsible for managing SQL jobs, so StreamflySQL can reuse most of Lambda's basic capabilities.


This is the architecture diagram of StreamflySQL v2. We embedded SQL Gateway into a SpringBoot application and developed a new backend. Overall it looks more complicated than v1, because the original one-level scheduling has become two-level scheduling of sessions and jobs.

First, the user creates a SQL session, and the StreamflySQL backend generates a session job. From Lambda's point of view, a session job is a special job that, at startup, uses the yarn-session script to launch a Flink Session Cluster. After the Session Cluster is initialized, the user can submit SQL within the session. The StreamflySQL backend opens a TableEnvironment for each session, which is responsible for executing SQL statements. SQL that only involves metadata is handled directly through the Catalog interface; job-type SQL is compiled into a JobGraph and submitted to the Session Cluster for execution (see the sketch below).
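A hedged sketch of this two-level model follows; it uses only the public Table API and omits the configuration needed to target a specific YARN session (for example the cluster ID from the session metadata), which the real backend would set:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.TableResult;

// Sketch of the two-level model: one long-lived TableEnvironment per session.
// DDL touches only the Catalog, while DML/DQL compiles into a JobGraph that
// executeSql submits to the session's Flink Session Cluster.
public class SessionManager {
    private final Map<String, TableEnvironment> sessions = new ConcurrentHashMap<>();

    public TableEnvironment openSession(String sessionId) {
        return sessions.computeIfAbsent(sessionId, id ->
                TableEnvironment.create(
                        EnvironmentSettings.newInstance().inStreamingMode().build()));
    }

    public TableResult executeStatement(String sessionId, String sql) {
        // executeSql internally distinguishes metadata-only statements (pure
        // Catalog calls) from job statements (compiled and submitted).
        return sessions.get(sessionId).executeSql(sql);
    }
}
```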


The v2 version largely solves several pain points of the v1 version:

  • In terms of response time, v1 often takes around 1 minute, while v2 usually completes within 10 seconds.
  • In terms of debug preview, v2 does not need to wait for the job to finish; instead, the results are streamed back over a socket while the job is running. This builds on a clever design in SQL Gateway: for a SELECT statement, SQL Gateway automatically registers a socket-based temporary table and writes the SELECT results to it (a public-API analogue is sketched after this list).
  • In terms of SQL support, v1 only supports DML, while v2, with the help of SQL Gateway, supports DML, DQL, and DDL.
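SQL Gateway's socket-based temporary table is its own internal mechanism; as an illustration, the public-API analogue is TableResult.collect(), which likewise streams rows back while the job is still running (the table definition below is illustrative):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.TableResult;
import org.apache.flink.types.Row;
import org.apache.flink.util.CloseableIterator;

public class StreamingPreview {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
        tEnv.executeSql("CREATE TABLE clicks (user_id BIGINT) "
                + "WITH ('connector' = 'datagen', 'number-of-rows' = '10')");

        // collect() returns rows incrementally while the job is still running,
        // the same experience the socket-based temporary table provides.
        TableResult result = tEnv.executeSql("SELECT user_id FROM clicks");
        try (CloseableIterator<Row> it = result.collect()) {
            while (it.hasNext()) {
                System.out.println(it.next()); // rows arrive as they are produced
            }
        }
    }
}
```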

However, although SQL Gateway's core functionality is good, adopting it was not entirely smooth, and we ran into several challenges.

First and foremost is the persistence of metadata.


SQL Gateway keeps its metadata only in memory; if the process restarts or crashes, the metadata is lost, which is unacceptable in an enterprise production environment. So after integrating SQL Gateway into our SpringBoot application, we naturally persisted the metadata to a database.

The metadata is mainly session metadata, including a session's Catalogs, Functions, Tables, and jobs. By scope, this metadata can be divided into four layers. The bottom two layers are global configurations that exist as configuration files; the top two layers are generated dynamically at runtime and stored in the database. Configuration items in higher layers have higher priority and can override those in lower layers.

We look at these metadata from the bottom up:

  • The bottom layer is the global default Flink configuration, i.e., the flink-conf.yaml under FLINK_HOME.
  • Above it is the configuration of the Gateway itself, such as the deployment mode (for example YARN or Kubernetes) and the Catalogs and Functions to register by default.
  • The third layer is the session-level Session Configuration, such as the cluster ID of the session's Session Cluster or the resource configuration of its TaskManagers.
  • The top layer is the job-level configuration, including metadata dynamically generated by the job, such as the job ID and the user-set checkpoint interval (a sketch of the layered override follows this list).
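As an illustration of the layered override, here is a hedged sketch using Flink's Configuration class (the layer variable names are ours):

```java
import org.apache.flink.configuration.Configuration;

// Hedged sketch of the four-layer override: later addAll calls win on key
// conflicts, so higher layers take priority over lower ones.
public class LayeredConfig {
    public static Configuration effectiveConfig(Configuration flinkConf,    // flink-conf.yaml
                                                Configuration gatewayConf,  // gateway defaults
                                                Configuration sessionConf,  // from the database
                                                Configuration jobConf) {    // from the database
        Configuration effective = new Configuration();
        effective.addAll(flinkConf);
        effective.addAll(gatewayConf);
        effective.addAll(sessionConf);
        effective.addAll(jobConf); // highest priority
        return effective;
    }
}
```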

This flexible layered design not only solves the metadata persistence problem, but also lays the foundation for our multi-tenancy feature.

The second challenge is multi-tenancy.


Multi-tenancy covers two aspects, resources and authentication:

  • In terms of resources, StreamflySQL uses the Lambda job platform to start Session Clusters in different queues, so their Master nodes and resources are naturally isolated. There is no problem of different users sharing one Master node and mixing resources, as there is with Spark Thrift Server.
  • In terms of authentication, because Session Clusters belong to different users, the StreamflySQL backend needs to impersonate tenants. At NetEase Games, components generally use Kerberos authentication. We implement multi-tenancy with Hadoop's Proxy User mechanism: first log in as a super user, then impersonate the project user to obtain delegation tokens from the various components, mainly Hive MetaStore and HDFS; finally, these tokens are stored in the UGI, and jobs are submitted inside doAs (see the sketch below).
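A hedged sketch of this impersonation flow with Hadoop's UserGroupInformation API (the principal, keytab path, and user name are illustrative):

```java
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class TenantImpersonation {
    public static void main(String[] args) throws Exception {
        // Log in once as the platform super user (principal/keytab are illustrative).
        UserGroupInformation.loginUserFromKeytab(
                "streamflysql/admin@EXAMPLE.COM", "/etc/security/streamflysql.keytab");

        // Impersonate the project user via Hadoop's proxy-user mechanism.
        UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                "project_user", UserGroupInformation.getLoginUser());

        proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
            // Calls made here to HDFS / Hive MetaStore act as project_user, and
            // the delegation tokens they return live in this UGI's credentials.
            // The actual job submission would happen inside this block.
            return null;
        });
    }
}
```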

The third challenge is horizontal scaling.


To be highly available and to scale service capacity, StreamflySQL naturally needs a multi-instance deployment. Because the main state, the metadata, is stored in the database, a new TableEnvironment can be rebuilt from the database at any time, so a StreamflySQL instance is as light as an ordinary web service and can easily be scaled up and down.

But not all state can be persisted, and some state we deliberately do not persist. For example, a user may change TableEnvironment properties with the SET command, such as enabling Table Hints; these are temporary properties that are reset when the TableEnvironment is rebuilt, which is as expected. As another example, when a user submits a SELECT query for a debug preview, the TaskManager establishes a socket connection with the StreamflySQL backend, and that connection is obviously not persistent. Therefore, we put affinity load balancing in front of the StreamflySQL instances and route traffic by session ID (as sketched below), so that under normal circumstances the same user's requests all land on the same instance, preserving the continuity of the user experience.
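As a toy illustration of the affinity idea (in practice this logic lives in the load balancer in front of the instances, not in application code):

```java
import java.util.List;

// Hedged sketch of session affinity: hash the session ID to choose a backend
// instance, so one user's requests normally land on the same StreamflySQL node.
public class AffinityRouter {
    private final List<String> backends;

    public AffinityRouter(List<String> backends) {
        this.backends = backends;
    }

    public String route(String sessionId) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return backends.get(Math.floorMod(sessionId.hashCode(), backends.size()));
    }
}
```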

The fourth challenge is job status management.


In fact, the word "state" here is a pun with two meanings:

  • The first meaning is the running status of the job. SQL Gateway currently only submits SQL and does not monitor the subsequent running status, so StreamflySQL sets up a monitoring thread pool to periodically poll and update job statuses. Because StreamflySQL has multiple instances, their monitoring threads might operate on the same job concurrently and cause lost updates, so we use CAS-style optimistic locking to ensure that outdated updates do not take effect (see the sketch after this list). We then raise alerts when a job exits abnormally or its status cannot be obtained; for example, during a JobManager failover we cannot know the Flink job's status, and the system issues a "disconnected" abnormal-status alert.
  • The second meaning is Flink's persistent state, i.e., Flink State. The native SQL Gateway does not manage Flink's Savepoints and Checkpoints, so we added stop and stop-with-savepoint functions and force-enabled retained checkpoints, so that when a job is restarted after an abnormal termination or a plain stop, the system can automatically find the latest checkpoint.
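A hedged sketch of the CAS-style status update over JDBC (the table and column names are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of the optimistic-locking update: the status row carries a version
// column, so a poller holding a stale version matches zero rows, and an
// outdated status can never overwrite a newer one.
public class JobStatusStore {
    public boolean compareAndSetStatus(Connection conn, String jobId,
                                       String newStatus, long expectedVersion)
            throws SQLException {
        String sql = "UPDATE job_status SET status = ?, version = version + 1 "
                   + "WHERE job_id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, newStatus);
            ps.setString(2, jobId);
            ps.setLong(3, expectedVersion);
            return ps.executeUpdate() == 1; // false: lost the race, re-read and retry
        }
    }
}
```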

Here I can share our algorithm. Lambda itself can automatically find the latest checkpoint, but it assumes every job runs in a Per-Job Cluster, so it simply searches the cluster's checkpoint directory for the newest checkpoint. That algorithm does not work for StreamflySQL: a Session Cluster hosts multiple jobs, so the newest checkpoint in the directory is not necessarily from the target job. We therefore switched to a lookup similar to JobManager HA: first read the metadata in the job archive directory, then extract the latest checkpoint from it.

4. Future work


  • The first problem we plan to solve is state migration, that is, how to restore from the original Savepoint after the user changes the SQL. At present we can only warn users of the risk based on the kind of change: for example, adding or removing fields does not break Savepoint compatibility, but the impact of adding a JOIN table is hard to predict. In the future, we plan to analyze the execution plans before and after a SQL change and tell users whether the state remains compatible.
  • The second problem is fine-grained resource management. At present we cannot specify resources for the SQL at compile time; for example, the CPU and memory of the TaskManagers are fixed once the Session Cluster starts and apply at the session level. Currently resources can only be tuned via job parallelism, which is inflexible and prone to waste. Flink 1.14 already supports fine-grained resource management for the DataStream API, where resources can be set at the operator level (see the sketch below), but there is no plan for the SQL API yet; we may participate in and push forward the related proposals in the future.
  • Finally, there are community contributions. We have accumulated experience with SQL Gateway and made many improvements to it, and we hope to contribute these improvements back to the Flink community to push forward FLIP-91 SQL Gateway.
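For reference, a hedged sketch of the operator-level resource API that Flink 1.14 introduced for the DataStream API (the numbers are illustrative; there is no SQL-level equivalent yet):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.operators.SlotSharingGroup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FineGrainedResourcesDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Declare per-group resources; the numbers are illustrative.
        SlotSharingGroup heavy = SlotSharingGroup.newBuilder("heavy")
                .setCpuCores(2.0)
                .setTaskHeapMemoryMB(1024)
                .build();

        env.fromSequence(0, 1_000)
           .map(new MapFunction<Long, Long>() {
               @Override
               public Long map(Long value) {
                   return value * 2;
               }
           })
           .slotSharingGroup(heavy) // operators in this group get the declared resources
           .print();

        env.execute("fine-grained-resources-demo");
    }
}
```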


