Abstract: This article is based on the talk given by Zhang Ying, head of real-time computing at Homework Help, at Flink Forward Asia 2021. Flink has played an important role in the evolution of real-time computing at Homework Help; in particular, Flink SQL has greatly improved the development efficiency of real-time tasks. This article shares the usage and practical experience of Flink SQL at Homework Help, as well as the problems encountered, and the solutions adopted, while building a real-time computing platform from 0 to 1 as the task scale grew. The contents include:
- Development history
- Flink SQL application practice
- Platform construction
- Summary and outlook
1. Development history
Homework Help mainly uses artificial intelligence, big data, and other technologies to provide students with more efficient learning solutions, so the business data mainly covers student attendance and mastery of knowledge points. In the overall architecture, both binlogs and ordinary logs are collected and written to Kafka, then written to the storage layer by real-time and offline computing respectively, and finally served through OLAP as productized services such as workbenches and BI analysis tools.
At present, real-time computing at Homework Help is mainly based on Flink, and its development has gone through three stages:
- In 2019, real-time computing consisted of a small number of Spark Streaming jobs serving tutors and lecturers. While meeting these real-time requirements, we found that development efficiency was very low and the data could hardly be reused;
- The usual next step would have been to gradually apply Flink JAR jobs in production and only start building a platform and adopting Flink SQL after accumulating experience. However, in 2020 the business raised a large number of real-time computing requirements, and our development manpower was insufficient. Flink SQL 1.9 had just been released with major changes to the SQL functionality, so our approach was to apply Flink SQL directly in the real-time data warehouse. At present, more than 90% of the tasks in the entire real-time data warehouse are implemented with Flink SQL;
- In November 2020, the number of Flink jobs quickly grew to several hundred. We started to build a real-time computing platform from 0 to 1, which now supports all the important business lines of the company, with computing deployed on multiple clusters across multiple clouds.
The following introduces two aspects:
- Typical problems and solutions encountered in Flink SQL practice;
- Some thoughts on the construction of the real-time computing platform.
2. Flink SQL application practice
Here is the complete dataflow architecture based on Flink SQL:
After binlogs/logs are collected and written to Kafka, each topic is automatically registered as a metadata table, which is the starting point of all subsequent real-time SQL jobs. Users can reference this table in SQL jobs without defining complex DDL.
At the same time, in practical applications it is also necessary to add or replace table properties on top of the metadata table:
- Adding: the metadata records table-level properties, but a SQL job may need task-level properties as well. For example, a Kafka source table needs the job's own group.id to record offsets;
- Replacing: for offline testing, you reference the metadata table and only override properties such as the broker and topic, which quickly gives you an offline test source table (see the sketch after this list).
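As a rough illustration only (not the platform's actual syntax), standard Flink SQL already offers two mechanisms that map to these two cases: dynamic table option hints for adding task-level properties, and the CREATE TABLE ... LIKE clause for overriding properties of an already registered table. The catalog, table, and option values below are hypothetical.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class MetadataTableUsageSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Older Flink versions require this switch before OPTIONS hints can be used.
        tEnv.getConfig().getConfiguration()
                .setBoolean("table.dynamic-table-options.enabled", true);

        // "Adding": attach a task-level property (the job's own group.id) to a
        // metadata-registered Kafka table via a dynamic table options hint.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW user_binlog_src AS "
                        + "SELECT * FROM meta_db.user_binlog "
                        + "/*+ OPTIONS('properties.group.id' = 'job_attendance_v1') */");

        // "Replacing": derive an offline test table that overrides only the broker and
        // topic; all other properties are inherited from the registered table via LIKE.
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE user_binlog_test WITH ("
                        + "  'properties.bootstrap.servers' = 'test-broker:9092',"
                        + "  'topic' = 'user_binlog_test'"
                        + ") LIKE meta_db.user_binlog (OVERWRITING OPTIONS)");
    }
}
```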
The framework also needs to let users' SQL jobs easily emit metrics and logs, so that full-link monitoring and tracing can be achieved.
Here we mainly introduce the DAG optimization work done when adding the Trace function to SQL, as well as our selection and encapsulation of the Table's underlying physical storage.
2.1 Adding a Trace function to SQL
SQL improves development efficiency, but the complexity of the business logic is still there, and the DML for complex business logic can become very long. In this case we recommend views to improve readability: just as a single function in a coding standard should not be too long, each view keeps its SQL short.
The left side of the figure below is a partial DAG of an example task, and you can see that there are many SQL nodes. In this situation it is difficult to troubleshoot a case: with code written against the DataStream API you can add logs, but with SQL you cannot; there are few places where users can intervene, and they can only see the input and output of the entire job.
Similar to printing logs inside functions, we want to support adding a Trace to views to make case tracing easier.
But we ran into some problems when trying to add Trace to SQL. Here is a simplified example:
The SQL in the upper right corner creates source_table as the source table; the prepare_data view reads this table and calls the foo UDF in its SQL; a StatementSet then inserts into two downstream sinks, while at the same time the view is converted to a DataStream so the TraceSDK can write to the trace system.
Note: we developed this on Flink 1.9 at the time; for clarity, the example also uses some features that were added in later versions ( https://issues.apache.org/jira/browse/FLINK-16361 , https://issues.apache.org/jira/browse/FLINK-18840 ).
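For readers who want to reproduce the shape of this example, here is a hedged, self-contained reconstruction. It is not the original job: datagen and print connectors stand in for Kafka and the real sinks, foo is a placeholder UDF, the TraceSDK call is replaced by print(), and the APIs shown are newer than the 1.9 setup described above.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;
import org.apache.flink.types.Row;

public class TraceDagSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // source_table: datagen stands in for the Kafka source so the sketch runs standalone.
        tEnv.executeSql(
                "CREATE TABLE source_table (uid BIGINT, lesson_id BIGINT) "
                        + "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");

        // foo: placeholder for the business UDF.
        tEnv.createTemporarySystemFunction("foo", FooFunction.class);

        // prepare_data view: reads source_table and calls foo.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW prepare_data AS "
                        + "SELECT uid, lesson_id, foo(uid) AS f0 FROM source_table");

        tEnv.executeSql("CREATE TABLE sink_a (uid BIGINT, f0 STRING) WITH ('connector' = 'print')");
        tEnv.executeSql("CREATE TABLE sink_b (lesson_id BIGINT, f0 STRING) WITH ('connector' = 'print')");

        // Two SQL inserts go through a StatementSet ...
        StatementSet set = tEnv.createStatementSet();
        set.addInsertSql("INSERT INTO sink_a SELECT uid, f0 FROM prepare_data");
        set.addInsertSql("INSERT INTO sink_b SELECT lesson_id, f0 FROM prepare_data");

        // ... while the same view is also converted to a DataStream for the TraceSDK.
        DataStream<Row> traceStream = tEnv.toDataStream(tEnv.from("prepare_data"));
        traceStream.print();   // stand-in for TraceSDK.write(...)

        // Submitting the SQL part and the DataStream part separately reproduces the issue
        // described below: the source is scanned more than once and foo is evaluated repeatedly.
        set.execute();
        env.execute("trace-datastream-part");
    }

    /** Placeholder UDF. */
    public static class FooFunction extends ScalarFunction {
        public String eval(Long uid) {
            return "foo-" + uid;
        }
    }
}
```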
The actual DAG at the bottom of the figure above does not look as expected:
- The DAG is split into two unconnected parts: the Kafka source table (the DataSource part) is read twice;
- The foo method is called three times.
Both the pressure on the data source and the computing performance need to be optimized.
To solve this problem we optimized from several angles; here we mainly introduce the idea of DAG merging. Both the Table environment and the Stream environment generate their own transformations. Our approach is to merge them all into the Stream environment, so that a complete list of transformations is available there, from which a single StreamGraph is generated and submitted.
The bottom left shows our optimized DAG: the source table is read and the foo method is called only once.
The optimized DAG closely matches the logical diagram we have in mind when writing the SQL, and the performance is accordingly in line with expectations.
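As a point of reference (not the authors' implementation, which was an in-house modification on Flink 1.9), newer Flink versions expose StreamStatementSet.attachAsDataStream(), which attaches the translated SQL inserts to the DataStream topology so that the SQL part and the trace stream are submitted as a single job graph. A minimal sketch, continuing the previous one:

```java
// Requires Flink 1.14+ and
// import org.apache.flink.table.api.bridge.java.StreamStatementSet;

StreamStatementSet set = tEnv.createStatementSet();   // Stream variant, not the plain StatementSet
set.addInsertSql("INSERT INTO sink_a SELECT uid, f0 FROM prepare_data");
set.addInsertSql("INSERT INTO sink_b SELECT lesson_id, f0 FROM prepare_data");

// Instead of set.execute(), attach the translated SQL pipelines to the
// StreamExecutionEnvironment so they live in the same job as the trace stream.
set.attachAsDataStream();

tEnv.toDataStream(tEnv.from("prepare_data")).print();   // TraceSDK stand-in, as before

// One submission instead of two; whether common sub-plans are fully reused may still
// differ from the in-house DAG merge described above.
env.execute("merged-dag");
```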
Back to the problem itself: in the business, you can add a trace to selected fields of a view with a single statement, for example prepare_data.trace.fields=f0,f1. Since SQL naturally carries field names, the trace data is even more readable than ordinary logs.
2.2 Selection and design of Table
As mentioned earlier, our primary requirement is to improve human efficiency. Therefore, the Table needs good layering and reuse capabilities and must support templated development, so that many end-to-end Flink jobs can be quickly chained together.
Our solution is implemented on top of Redis, which has several advantages:
- High QPS and low latency: a concern for all real-time computing;
- TTL: users do not need to care about how data is evicted; they only need to set a reasonable TTL;
- With compact, high-performance serialization such as protobuf, combined with TTL, the overall storage stays under 200 GB, so the memory pressure on Redis is acceptable;
- It fits the computing model: to guarantee ordering, the computation itself performs a keyBy to shuffle the data that must be processed together onto the same parallelism, so we do not need to rely heavily on the storage layer for lock optimization.
On top of this, our scenario mainly needs to solve two problems: multiple indexes and triggered messages.
The figure above shows an example table recording whether students attended a certain lesson chapter:
- Multi-index: data is first stored as a string, e.g. key=(uid, lesson_id), value=serialize(is_attend, ...), so that SQL can JOIN ON uid AND lesson_id. What about joining on another field, such as lesson_id alone? Our approach is to also write a set keyed by lesson_id whose elements are the corresponding (uid, lesson_id) pairs. When looking up lesson_id = 123, we first fetch all elements of the set, then retrieve all the values through a pipeline and return them (see the sketch after this list);
- Triggered message: when writing to Redis, an update message is written to Kafka at the same time. Consistency, ordering, and no data loss between the two stores are guaranteed inside the Redis Connector implementation.
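A minimal sketch of this read/write layout using plain Jedis; key names, TTL, and serialization are illustrative, and the real logic lives inside the in-house Redis Connector with protobuf-serialized values:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AttendIndexSketch {

    private static final int TTL_SECONDS = 7 * 24 * 3600;   // example TTL only

    /** Write the primary record keyed by (uid, lesson_id) plus a secondary index set keyed by lesson_id. */
    public static void write(Jedis jedis, long uid, long lessonId, String serializedValue) {
        String primaryKey = "attend:" + uid + ":" + lessonId;
        String indexKey = "attend:idx:lesson:" + lessonId;

        Pipeline p = jedis.pipelined();
        p.setex(primaryKey, TTL_SECONDS, serializedValue);   // string: the row itself
        p.sadd(indexKey, uid + ":" + lessonId);              // set: members point back to primary keys
        p.expire(indexKey, TTL_SECONDS);
        p.sync();
    }

    /** Lookup by the secondary index (e.g. JOIN ... ON lesson_id): read the set, then fetch all values via one pipeline. */
    public static List<String> lookupByLesson(Jedis jedis, long lessonId) {
        Set<String> members = jedis.smembers("attend:idx:lesson:" + lessonId);

        Pipeline p = jedis.pipelined();
        List<Response<String>> responses = new ArrayList<>();
        for (String member : members) {
            responses.add(p.get("attend:" + member));
        }
        p.sync();

        List<String> rows = new ArrayList<>();
        for (Response<String> r : responses) {
            if (r.get() != null) {
                rows.add(r.get());
            }
        }
        return rows;
    }
}
```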
These functions are encapsulated in the Redis Connector, so the business can define such a Table with a simple DDL.
Several important properties in DDL:
- primary defines the primary key, which maps to the string data structure, e.g. uid + lesson_id in the example;
- index.fields defines the index fields used for auxiliary lookups, such as lesson_id in the example; multiple indexes can also be defined;
- poster.kafka defines the Kafka table that receives the triggered messages. This table is also registered in the metadata, so users can read it directly in subsequent SQL jobs without defining it. A hedged DDL sketch follows this list.
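For illustration, a DDL of this shape might look roughly as follows on an existing StreamTableEnvironment tEnv; the 'redis' connector and the exact option keys belong to the in-house connector described above, so all identifiers and values here are assumptions rather than a published API.

```java
// Hypothetical DDL for the in-house Redis connector; option keys mirror the
// properties described above (primary, index.fields, poster.kafka).
tEnv.executeSql(
        "CREATE TABLE lesson_attend ("
                + "  uid BIGINT,"
                + "  lesson_id BIGINT,"
                + "  is_attend BOOLEAN"
                + ") WITH ("
                + "  'connector' = 'redis',"                          // in-house connector
                + "  'primary' = 'uid,lesson_id',"                    // string key: (uid, lesson_id) -> serialized row
                + "  'index.fields' = 'lesson_id',"                   // secondary index set keyed by lesson_id
                + "  'poster.kafka' = 'meta.lesson_attend_update',"   // update messages go to this Kafka table
                + "  'ttl' = '604800'"                                // illustrative TTL in seconds
                + ")");
```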
As a result, the entire development mode is highly reusable, and users can easily develop many end-to-end SQL jobs without worrying about how to trace a case.
3. Platform Construction
After the above dataflow architecture was built, the number of real-time jobs quickly grew to several hundred by November 2020, much faster than in 2019. At that point we started building a real-time computing platform from 0 to 1; the following shares some thoughts from the construction process.
There are three main starting points for the functions supported by the platform:
- Unification: unify the different cluster environments, Flink versions, and submission methods across cloud vendors. Previously, Hadoop clients were scattered across users' submission machines, which posed hidden risks to cluster data and task security and increased the cost of later cluster upgrades and migrations. We want the platform to unify the entry point and method of task submission;
- Ease of use: provide more user-friendly capabilities through platform interaction, such as debugging and semantic checks to improve the efficiency of task testing, and record the version history of tasks to support convenient release and rollback;
- Standardization: permission control, process approval, and so on; similar to the release process of online services, the platform standardizes the R&D process of real-time tasks.
3.1 Standardization - real-time task process management
Flink SQL makes development simple and efficient, but the simpler development becomes, the harder it is to standardize: writing a piece of SQL may take only two hours, while going through the full process may take half a day.
However, standardization still has to be enforced. Some of the problems are similar to those of online services and also occur in real-time computing:
- Can't remember: a task has been running online for a year, and the original requirement may only have been passed on by word of mouth; at best it was recorded in a wiki or an email, but it is easily lost during task handover;
- Not standardized: UDF or DataStream code does not follow the specification and is hard to read, so whoever takes it over later struggles to upgrade it or dares not change it, and the task cannot be maintained in the long run. There should also be a specification for how real-time tasks, including SQL, are written;
- Can't be found: the task running online depends on a jar, but which commit of which Git module does it correspond to? When a problem occurs, how can the corresponding code be located immediately?
- Modified blindly: a task that had been running normally suddenly fires alarms over the weekend because the SQL of the online task was modified without authorization.
The standardized process mainly covers three parts:
- Development: engineers can quickly create a UDF module from the UDF archetype project, which takes the Flink quickstart as a reference. The generated module compiles out of the box and contains UDF examples such as WordCount, plus defaults such as a README and a VersionHelper. After modifying it for the business requirement, the code is submitted to Git through code review;
- Requirements management and compilation: the change is associated with a requirement ticket, and after compilation on the cluster and QA testing, a release order can be created;
- Going online: select which jobs to update or create based on the module and the compiled artifact, and redeploy after approval by the job owner or leader.
Throughout the R&D process nothing can be modified out of band, such as swapping a jar package or changing which task it takes effect on. Even if a real-time task has been running for years, we can still find out who released it, who approved the current version, the test records at the time, the corresponding Git code, and who originally raised the real-time metric requirement, so the task remains maintainable in the long term.
3.2 Ease of use - monitoring
Our Flink jobs currently run on YARN. After a job starts, we expect Prometheus to scrape the containers allocated by YARN and feed the alerting system, on which users can configure alarms for Kafka lag and checkpoint failures. We ran into two main problems when building this path:
- After the PrometheusReporter starts its HTTPServer, how can Prometheus discover it dynamically? We also need to control the number of metrics to avoid collecting a lot of useless data;
- Our SQL source tables are almost all based on Kafka. Compared with third-party tools, it is more convenient to configure Kafka lag alarms on the computing platform, because the platform naturally knows the topic and group.id a task reads and can reuse the same alert group as task-failure alarms; combined with alarm templates, configuring an alarm becomes very simple.
Regarding the solution:
- A discovery mechanism was added on top of the official PrometheusReporter: after a container's HTTPServer starts, the corresponding ip:port is registered on ZooKeeper as an ephemeral node, and Prometheus discovers targets by watching changes of the ZooKeeper nodes. Since the node is ephemeral, it disappears when the container is destroyed, so Prometheus also knows to stop scraping it. This makes the Prometheus scrape path very easy to build (a hedged sketch follows this list);
- KafkaConsumer.records-lag is one of the most practical and important lag indicators, and we did two things for it. First, we modified the Kafka connector to expose the metric after KafkaConsumer.poll, ensuring that records-lag is visible. Second, in the process we found that different Kafka versions report this metric in different formats ( https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74686649 ), so we normalize them into a single format registered in Flink's metrics, making the indicator consistent across versions.
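A minimal sketch of the ephemeral-node registration idea, using Apache Curator; the ZooKeeper path, payload, and the place where this hooks into the reporter are assumptions, not the in-house implementation:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

import java.nio.charset.StandardCharsets;

public class ReporterZkRegistrar {

    /** Register host:port as an ephemeral znode after the reporter's HTTPServer has started. */
    public static CuratorFramework register(String zkQuorum, String host, int port) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkQuorum, new ExponentialBackoffRetry(1000, 3));
        client.start();

        String path = "/flink/prometheus/targets/" + host + ":" + port;   // hypothetical path layout
        client.create()
                .creatingParentsIfNeeded()                 // parents are persistent, the leaf is ephemeral
                .withMode(CreateMode.EPHEMERAL)            // the node disappears with the container
                .forPath(path, (host + ":" + port).getBytes(StandardCharsets.UTF_8));

        // Keep the client (and thus the ZooKeeper session) alive for the container's lifetime;
        // a discovery component watching this path can then feed Prometheus with live targets.
        return client;
    }
}
```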
4. Summary and Outlook
In the previous stage, we used Flink SQL to support rapid development of real-time jobs and built a real-time computing platform that supports thousands of Flink jobs.
One of the bigger insights is that SQL does simplify development, but it also hides more technical details. The requirements for real-time job operations tooling such as Trace, and for task standardization, have not gone away; if anything they become stricter, because once the details are hidden and a problem occurs, users do not know how to deal with it. It is like an iceberg: the less that shows above the water, the more lies beneath it, and the more effort must go into building the surrounding systems.
The other insight is to adapt to the current situation: first meet today's needs as quickly as possible, for example by improving human efficiency and lowering the development threshold, while continuing to explore more business scenarios, such as replacing the Redis Connector with HBase or RPC services. The advantage of the current design is that the underlying storage can be changed with almost no impact on users' SQL jobs, because SQL jobs are basically business logic and the DDL defines the metadata.
The next plan is mainly divided into three parts:
- Support elastic scaling of resources to balance the cost and timeliness of real-time jobs;
- We have been applying Flink SQL at scale since 1.9; newer versions have changed a lot, and we need to work out how the business can upgrade at low cost and use the features of the new versions;
- Explore the implementation of unified stream-batch processing in real business scenarios.