Abstract: This article is compiled from the presentations given by Liu Chenglong, head of the financial real-time data warehouse project at China Securities Co., Ltd. (CSC), and Cai Yue, financial information data R&D engineer, at the Flink Forward Asia 2021 industry practice session. The main contents include:
- CSC's Flink framework
- Flink stream processing scenarios
- Real-time transformation of financial information
- Future outlook
China Securities Co., Ltd. was established in 2005, listed on the Hong Kong Stock Exchange in 2016, and listed on the main board of the Shanghai Stock Exchange in 2018. Its investment banking business has stayed in the industry's top 3 for 8 consecutive years, the scale of securities under custody ranks second in the industry, and its main operating indicators currently rank in the industry's top 10. As the business advances rapidly, technology cannot fall behind, so digital transformation has become the focus of our development in recent years.
Because the financial industry spans so many business fields, the company has accumulated a large amount of complex, highly business-specific base data over the years. In the process of discovering, analyzing, and solving problems, how to coordinate the business front, middle, and back offices with the technology departments to sort out business definitions and develop processing logic has become a key issue that urgently needs to be solved.
1. CSC's Flink Framework
The overall data platform architecture is shown in the figure. It is mainly divided into the following sections: the data center section, composed of a Greenplum data warehouse and a Hadoop big data platform; the data development section, consisting mainly of offline development, real-time development, and data exchange; and the data portal, data gateway, data governance, operation management, and other sections.
At present, the tasks of the data development section are concentrated on offline data processing, namely offline development and data exchange. However, as the business demands more timely data, the T+1 business model based on offline batch processing can no longer fully meet the timeliness requirements of the current market environment. This is why we are vigorously developing real-time development, striving to provide customers with data services of higher timeliness.
Take the full link of real-time development as an example to illustrate how the various sections of the data platform interact.
Starting from the unified entry of the data portal, the real-time development module first pulls the real-time incremental data of businesses such as centralized trading and margin trading into Kafka message queues. Flink consumes the Kafka real-time streams and processes them together with dimension-table data. When the dimension tables involved in the processing logic are relatively large, offline development and data exchange are needed, and the dimension-table data is prepared by offline batch jobs. Finally, the result data is written to a relational or NoSQL database, and the data gateway generates API interfaces by reading the result data to provide data services to downstream systems.
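As a hedged illustration of this pattern (joining a Kafka stream with an offline-prepared dimension table and writing results for the data gateway), the following Flink SQL sketch uses a processing-time lookup join. All topic, table, field, and connection names are assumptions for the example, not the actual production objects:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TradeEnrichJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Real-time trade stream from Kafka (illustrative topic and fields)
        tEnv.executeSql(
            "CREATE TABLE trade_stream (" +
            "  account_id STRING, trade_amt DECIMAL(18,2), trade_time TIMESTAMP(3)," +
            "  proc_time AS PROCTIME()" +
            ") WITH ('connector' = 'kafka', 'topic' = 'ods_trade'," +
            "        'properties.bootstrap.servers' = 'kafka:9092'," +
            "        'format' = 'json', 'scan.startup.mode' = 'latest-offset')");

        // Dimension table prepared by the nightly batch job, served via JDBC
        tEnv.executeSql(
            "CREATE TABLE dim_customer (" +
            "  account_id STRING, branch_no STRING, customer_level STRING" +
            ") WITH ('connector' = 'jdbc', 'url' = 'jdbc:mysql://db:3306/dw'," +
            "        'table-name' = 'dim_customer'," +
            "        'lookup.cache.max-rows' = '10000', 'lookup.cache.ttl' = '10min')");

        // Result table in the serving database read by the data gateway
        tEnv.executeSql(
            "CREATE TABLE dws_trade_result (" +
            "  branch_no STRING, trade_amt DECIMAL(18,2), trade_time TIMESTAMP(3)" +
            ") WITH ('connector' = 'jdbc', 'url' = 'jdbc:mysql://db:3306/serve'," +
            "        'table-name' = 'dws_trade_result')");

        // Lookup join: enrich each trade with dimension attributes at processing time
        tEnv.executeSql(
            "INSERT INTO dws_trade_result " +
            "SELECT d.branch_no, t.trade_amt, t.trade_time " +
            "FROM trade_stream AS t " +
            "JOIN dim_customer FOR SYSTEM_TIME AS OF t.proc_time AS d " +
            "ON t.account_id = d.account_id");
    }
}
```

The lookup cache options keep hot dimension rows in memory, so the join does not query the dimension database for every record.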
The data management module in the data governance section mainly manages the database tables in the data center and the metadata of business-related database tables. Users can subscribe in the data portal to change notifications for the tables they care about. When a subscribed table changes, the operation center notifies subscribers through the unified alerting module over multiple channels, so that developers can adjust their data processing tasks in time.
The Flink real-time stream processing architecture first collects the CDC logs of the business databases with the Attunity tool and writes the table changes of one source system into a single Kafka topic, which means each Kafka topic contains changes from multiple tables. Therefore, in Flink's Kafka source, the stream must first be filtered on the schema and tablename fields to obtain the CDC data of the table we actually want, before the subsequent processing logic and dimension-table joins are applied. The processed data is then written to result tables, stored in different databases according to different needs.
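A minimal DataStream sketch of that filtering step, assuming the CDC messages are JSON carrying schema and tablename fields; the topic, group id, field values, and table names are illustrative assumptions:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CdcTableFilterJob {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One Kafka topic per source system; each message carries schema/tablename fields
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("cdc_trade_system")            // illustrative topic name
                .setGroupId("rt-dw-trade")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> cdc =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "attunity-cdc");

        // Keep only the change records of the table this job cares about
        DataStream<String> targetTable = cdc.filter(value -> {
            JsonNode node = MAPPER.readTree(value);
            return "trade".equals(node.path("schema").asText())
                    && "t_order_flow".equals(node.path("tablename").asText());
        });

        // Downstream: dimension-table join and result sink would follow (omitted)
        targetTable.print();

        env.execute("cdc-table-filter");
    }
}
```

In practice the parsed JSON would be forwarded as a typed record rather than printed, and the dimension-table join and sink would follow.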
Database selection generally follows the following principles:
- When the data volume is small and high concurrency is not required, a relational database is usually chosen for storage;
- When the data volume is large and high concurrency is required, HBase is usually chosen as the storage medium;
- When the data volume is small but high concurrency is required, Redis is chosen as a cache;
- When a large amount of data needs to be retrieved, ES is generally chosen as the storage component.
There are two distinct features of securities industry data:
- The first is that trading hours are fixed. For many businesses, the data volume drops sharply after the market closes, and some businesses produce no new data at all after the close. To save resources, we therefore set start and stop times for these tasks according to the actual situation;
- The second is the importance of financial data: deviations are not acceptable in a great many scenarios. To meet this requirement for extremely high data reliability, we have set up nightly offline data-correction jobs for a large number of real-time tasks to guarantee the correctness of the data.
2. Flink stream processing scenarios
The following describes the application of Flink stream processing in several practical scenarios, mainly three: real-time indicator statistics for the retail business, real-time indicator statistics for fund investment advisory, and detailed queries of capital flows.
2.1 Retail Business Scenario
The real-time indicators of the retail business line are an important part of the management cockpit; by analyzing these operational indicators, decision makers can make sound decisions about the company's operation and development.
Building a real-time data warehouse for the retail business requires statistical indicators for account opening, customer service, and APP operations. Following the real-time data processing architecture and the layered data warehouse design, the construction of the retail real-time data warehouse can be divided into the following steps (a Flink SQL sketch of the DWD step follows this list):
- The first is building the ODS layer: collect in real time the CDC logs of related base tables such as the customer information table, the business transaction flow table, and the channel table. Each table of each business database corresponds to one Kafka topic, forming the ODS layer of the real-time data warehouse;
- The second is data modeling of the DWD layer: create Flink jobs that consume the ODS-layer Kafka messages and perform cleaning, filtering, desensitization, association, conversion, and other processing. At the same time, the data is merged at customer-account granularity and widened with the help of offline dimension tables, producing account-granularity detail tables and establishing the DWD layer;
- Next, the DWS layer is modeled on top of the DWD data. Based on the business requirements, the DWD data is split by subject and aggregated into wide tables of common indicators, such as the channel-service subject wide table, the business-department operation subject wide table, and the trading-product subject wide table, establishing the DWS layer;
- Finally, business indicators are calculated according to actual business needs to build the ADS layer. Some account-granularity business indicators can be calculated directly from the DWD detail tables, while coarser-grained indicators, such as the number of customers served by the APP channel or the number of readers of investment advisory products, are calculated from the DWS layer. The final results are connected to the data gateway, which provides the data to downstream systems or displays it through the BI system.
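A hedged sketch of the DWD step described above: consuming an ODS Kafka topic, filtering out invalid records, and desensitizing a sensitive field before writing to the DWD topic. All topic, table, and field names are invented for illustration; the account-granularity merge and dimension widening are omitted:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RetailDwdJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // ODS layer: one Kafka topic per business table (illustrative names)
        tEnv.executeSql(
            "CREATE TABLE ods_account_flow (" +
            "  account_id STRING, branch_no STRING, mobile STRING," +
            "  biz_type STRING, occur_time TIMESTAMP(3)" +
            ") WITH ('connector' = 'kafka', 'topic' = 'ods_account_flow'," +
            "        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // DWD layer: cleaned, filtered, and desensitized detail at account granularity
        tEnv.executeSql(
            "CREATE TABLE dwd_account_flow (" +
            "  account_id STRING, branch_no STRING, mobile_masked STRING," +
            "  biz_type STRING, occur_time TIMESTAMP(3)" +
            ") WITH ('connector' = 'kafka', 'topic' = 'dwd_account_flow'," +
            "        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // Cleaning + filtering + desensitization (mask the middle of the mobile number)
        tEnv.executeSql(
            "INSERT INTO dwd_account_flow " +
            "SELECT account_id, branch_no, " +
            "       CONCAT(SUBSTR(mobile, 1, 3), '****', SUBSTR(mobile, 8)) AS mobile_masked, " +
            "       biz_type, occur_time " +
            "FROM ods_account_flow " +
            "WHERE account_id IS NOT NULL AND biz_type <> 'TEST'");
    }
}
```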
Layered management of the real-time data warehouse brings two benefits:
- First, it avoids chimney-style data development: not every task has to start from consuming the ODS-layer Kafka data, which reduces time overhead, makes data recovery easier, and supports flexible analysis of different business subjects;
- Second, when a data processing error occurs, it is easier to determine which layer's processing logic is at fault, shortening troubleshooting time.
2.2 Real-time indicator statistics of fund investment advisors
The fund business has become increasingly important in the securities industry. Real-time sales information on fund investment advisory products gives fund investment advisers data support for adjusting their strategies in time. The data in the fund investment advisory scenario has three characteristics:
- First, the scale of the data involved is relatively small;
- Second, the data is provided for internal staff to view during trading hours;
- Third, the requirements for data accuracy are particularly high.
Given the small data volume, we write the indicator results to an Oracle relational database. Since the data only needs to be viewed internally during trading hours, we enable the scheduled start-stop strategy for the real-time tasks, leaving more resources for the nightly batch jobs. For the high accuracy requirement, we correct the data with nightly offline batch jobs to ensure its accuracy.
The original solution triggered a stored procedure from the page to read the data; the data read was not source-system data and carried minute-level latency. The real-time processing solution pushes indicators such as new customers, new contracts, retention, contract rate, and scale in real time, allowing the business departments to grasp the core data more efficiently.
2.3 Real-time ETL: Capital Flow Scenario
This scenario mainly lets business staff quickly query a customer's detailed transaction flow within a given period during trading hours. It needs to solve three problems:
- First, the capital flow details amount to billions of rows; how can queries be fast at this data volume?
- Second, queries come from business staff during trading hours, and the data volume outside trading hours is small; should scheduled start and stop be used?
- Third, capital flows must not be wrong; how can data accuracy be guaranteed?
Given the large data volume, we finally chose HBase as the storage component. With a well-designed rowkey and pre-split regions, the capital flow details within a specified time range can be queried quickly (see the sketch below). Since queries happen during trading hours, we enable the scheduled start-stop strategy for the tasks and leave more resources for the nightly batch jobs. For the high accuracy requirement, offline data correction is used to meet the accuracy requirement.
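A minimal sketch of the rowkey idea, assuming the rowkey concatenates the account id with a zero-padded timestamp so that one customer's flow within a time range becomes a single contiguous scan; the table name, account id, and timestamps are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FundFlowQuery {

    // Rowkey = accountId + zero-padded epoch millis, so a time-range query
    // for one account becomes a single contiguous HBase scan.
    static byte[] rowKey(String accountId, long epochMillis) {
        return Bytes.toBytes(accountId + String.format("%013d", epochMillis));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("fund_flow_detail"))) {

            String accountId = "1008860001";                 // illustrative account
            long from = 1640995200000L, to = 1641081600000L; // illustrative time range

            Scan scan = new Scan()
                    .withStartRow(rowKey(accountId, from), true)
                    .withStopRow(rowKey(accountId, to), false);

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```

In production a salted or hashed prefix is often added to the rowkey and used for region pre-splitting to avoid hot regions; that detail is omitted here.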
3. Real-time transformation of financial information
In the financial field, news, announcements, and other information are what every market participant reads and follows most. Our company's definition of information goes beyond this traditional sense: taking into account the complexity of the data itself and the actual flow of collection, management, and application, we redefine information so that all data unrelated to users and transactions falls into the category of financial information.
Our center gathers four categories of financial information data. The most common are news, announcements, and research reports. In addition, there is securities market data related to the trading markets, such as currencies, stocks, bonds, funds, and derivatives, as well as macro and industry data of various dimensions. The last category, serving as a catch-all, is other and derived data, covering data that third-party institutions produce by analyzing the original market data, such as company public opinion and fundamental analysis and forecasts.
If transactions and users are the bones and meridians of the financial market, then information data is its blood, produced by the former and circulating continuously throughout the whole body.
So how do the various types of information data flow? It is quite simple, as shown in the three-layer structure: the bottom layer is the data sources we introduce. At present, most information data has already been collected and organized by data vendors such as Wind and Flush, so we can obtain the various kinds of basic data without spending too much time and cost.
But as more data vendors are introduced, problems follow. If a vendor runs into trouble and the cooperation cannot continue, the data service is affected as well. To solve this, we introduced the concept of a central repository: we built a set of financial data models and connect downstream systems to the data structures of the central repository, while we take on the job of shielding them from the individual data vendors. This is the second layer in the figure.
There is also a small module on the far right of the figure called data pass-through. In practice not every downstream system is suited to connecting through the central repository; some still rely on the original vendor data structures, so this small interface remains and provides data services in parallel with the central repository.
The top layer is the service objects, covering all business lines in the company and continuously transfusing blood into each business system.
Under this three-layer structure, the growing number of data sources and data types improves the overall quality of our data services and lets us serve more customers. At the same time, the central repository, as the core of the architecture, improves the risk resistance of the overall service, and guarding against risk is the top priority of a financial company.
Our earlier work focused mainly on these two points. As those capabilities matured and stabilized, the focus gradually shifted to the transmission of information data and the optimization of its content. The market changes rapidly: the less time data spends propagating along the link, the higher its time value. There is no upper limit on transmission speed, and the faster the better; this is transmission efficiency. However, fast data combined with the uneven quality of upstream vendors means a service that is fast but inaccurate, and the problems are passed straight on to users. How to control the quality of the data content without sacrificing the first three points became a thorny problem.
To optimize points 3 and 4, we carried out an architectural transformation with the Flink engine at its core. Two scenarios are shared below.
3.1 Dragonfly Dianjin APP F10 News Scenario
The Dragonfly Dianjin APP mainly provides financial information and data services for investors to browse. The figure above shows the first version of the solution. The main flow is: news is tagged by the upstream labeling system and flows into Kafka, then enters the central repository described earlier. On the downstream side, the data is extracted and transformed into an interface library and finally served externally through APIs.
To capture database changes in time, we chose Canal, a lightweight and easily integrated tool, from among the many CDC options. Developers write programs that subscribe to and read the Canal data in real time, parse and assemble it into the format the business needs, and then proactively update and write it into Redis. When downstream users call the relevant interfaces, they get the latest information data without having to wait passively for the cache to expire.
After running this solution for a while, we found two problems. First, the link is too long, which costs timeliness. Second, proactively writing the cache has gradually become a critical part of the whole information service, yet Canal, as an open-source tool, is still maturing: for example, monitoring and alerting have to be built separately, and its stability and high availability are also somewhat lacking.
For this reason, we introduced Flink to adjust the data link and the data processing steps. On the processing side, Flink's efficient ETL capability suits information data scenarios with high timeliness requirements. As a stream computing engine, Flink is naturally integrated with Kafka and connects seamlessly to upstream systems such as the news tagging system, and it can also write messages directly back to Kafka. The community keeps enriching the connectors, and the CDC connectors give Flink's ETL capability even more room, while support for a Redis sink allows the original caching program and its business logic to be merged into Flink and implemented in one place. In the end, the whole information data processing flow is managed centrally, shortening the link and saving transmission time.
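A hedged sketch of the cache-writing part after it is merged into Flink: a Kafka source for the tagged news plus a small Jedis-based sink (shown here instead of a specific Redis connector). The host, topic, key scheme, and TTL are illustrative assumptions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class NewsCacheJob {

    /** Writes each tagged news record into Redis so the API layer always reads fresh data. */
    static class RedisNewsSink extends RichSinkFunction<String> {
        private transient Jedis jedis;

        @Override
        public void open(Configuration parameters) {
            jedis = new Jedis("redis-host", 6379);   // illustrative host
        }

        @Override
        public void invoke(String value, Context context) {
            // Illustrative key scheme; a real job would parse a news id from the payload
            jedis.setex("news:" + value.hashCode(), 3600, value);
        }

        @Override
        public void close() {
            if (jedis != null) {
                jedis.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("tagged_news")                 // output of the labeling system
                .setGroupId("news-cache")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "tagged-news")
           .addSink(new RedisNewsSink());

        env.execute("news-to-redis-cache");
    }
}
```

A production job would also handle reconnection and failure; the sketch only shows the shape of the shortened pipeline.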
The powerful ETL capability reduces the complexity of the architecture and eliminates a series of the original components. Overall, a distributed, highly available architecture now runs through upstream, midstream, and downstream, so the information service capability can be delivered stably and efficiently. In the long run, information data is widely used and has many sources and destinations; Flink's continually enriched connectors can support further expansion of both, making it possible for information data to cope with more scenarios.
3.2 Multi-source data cross-check scenario
It would be great if one architecture could solve all the problems, so we tried it on the multi-source data cross-checking scenario. This scenario mainly addresses the control of data content quality: faster data can be achieved by technical means, but more accurate data is not something we, in the middle of the chain, can fully control.
Upstream we rely on many data vendors, who may obtain data through crawlers, manual entry, data feeds, and so on. The variety of data types and links means the quality of the data we receive is uneven, and problems are passed directly downstream and amplified step by step. Since we are far from the source, we cannot provide data services that are more accurate than the vendors'; we can only settle for correcting errors promptly.
Competition among data vendors is fierce and their data coverage largely overlaps, which is fortunate for us: we can obtain multiple copies of most basic data, and that makes it possible to discover differences in the data, cross-check it, obtain the differing records, and remind and correct errors in time.
Across the whole service chain, the earlier a problem is found, the smaller its impact. So how do we catch problems earlier? This is divided into the following three steps:
- The first step is ID alignment. Everyone in the financial market knows stocks: trading and coding standards are unified, and one code runs through all the data. But bonds, funds, and other instruments are not standardized as well as stocks, and data vendors often design their own internal codes for financial entities, which are unique only within that vendor. So before any cross-checking, the ID alignment problem has to be solved first, and that is largely manual work.
- The second step is indicator extraction. Verification requirements are usually specific, such as checking a stock's daily closing price, but the data structures of the different vendors at these check points differ widely. We therefore use Flink SQL to write indicator-generation logic against the multi-source databases, aligning the heterogeneous structures into comparable indicators (a sketch follows this list).
- The third step is the real-time verification window. The initial idea was simple: run a script periodically, fetch the numbers, and compare them. But as the number of indicators to verify and the data volume grew, batch scripts fell short. So, using Flink's window feature, we built a real-time verification window: the indicators to be verified are aggregated, the window calculation is triggered on both the time and the count dimension, and the results are output to Kafka to support real-time message push.
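Before moving on to the window itself, here is a hedged sketch of the step-2 indicator alignment: each vendor's structure is mapped onto the same indicator schema, assuming the ID alignment of step 1 already produced a unified code. All table, topic, and field names are invented for the example:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IndicatorAlignmentJob {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Vendor A's quote table and vendor B's end-of-day table have different layouts
        tEnv.executeSql(
            "CREATE TABLE vendor_a_quote (" +
            "  unified_code STRING, trade_date STRING, close_px DECIMAL(18,4)" +
            ") WITH ('connector' = 'kafka', 'topic' = 'vendor_a_quote'," +
            "        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        tEnv.executeSql(
            "CREATE TABLE vendor_b_eod (" +
            "  unified_code STRING, biz_date STRING, last_close DECIMAL(18,4)" +
            ") WITH ('connector' = 'kafka', 'topic' = 'vendor_b_eod'," +
            "        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // One aligned indicator stream consumed by the verification window
        tEnv.executeSql(
            "CREATE TABLE check_indicator (" +
            "  security_id STRING, indicator STRING, ind_value DECIMAL(18,4)," +
            "  source STRING, biz_date STRING" +
            ") WITH ('connector' = 'kafka', 'topic' = 'check_indicator'," +
            "        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json')");

        // Map both vendors onto the same "close_price" indicator
        tEnv.executeSql(
            "INSERT INTO check_indicator " +
            "SELECT unified_code, 'close_price', close_px, 'vendor_a', trade_date FROM vendor_a_quote " +
            "UNION ALL " +
            "SELECT unified_code, 'close_price', last_close, 'vendor_b', biz_date FROM vendor_b_eod");
    }
}
```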
Flink supports two kinds of windows, time-based and count-based. When both the time and the count dimension need to be controlled, a variety of custom windowing behaviors can be built from global windows plus custom triggers. The figure shows a few lines of pseudocode: on a global window, the trigger makes its decision both when an element arrives and when the timer fires.
In the verification window, maxcount is used to judge whether the data for an indicator has arrived from all sources; if so, the window function is triggered and the indicator values are compared. A maximum window duration is also defined: once it is exceeded, the window stops waiting, the window function is triggered directly, and the indicator of the late data source is marked as delayed. The final output looks like the table in the upper right. Both technical and business staff can respond in time based on the verification results, which indirectly improves the accuracy of the data service, and detecting and handling differences promptly minimizes the impact on downstream systems.
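The figure's pseudocode is not reproduced in the text, so the following is a hedged re-implementation of the idea it describes: a custom trigger on a global window that fires either when maxCount elements (one per source) have arrived or when a maximum wait time has elapsed. Class and parameter names are invented for the sketch:

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

/**
 * Fires when either maxCount elements have arrived or maxWaitMs has elapsed
 * since the first element, whichever comes first.
 */
public class CountOrTimeoutTrigger extends Trigger<Object, GlobalWindow> {

    private final long maxCount;
    private final long maxWaitMs;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", (ReduceFunction<Long>) Long::sum, LongSerializer.INSTANCE);

    public CountOrTimeoutTrigger(long maxCount, long maxWaitMs) {
        this.maxCount = maxCount;
        this.maxWaitMs = maxWaitMs;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, GlobalWindow window, TriggerContext ctx)
            throws Exception {
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        if (count.get() == null) {
            // First element of this key: arm the timeout timer
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + maxWaitMs);
        }
        count.add(1L);
        if (count.get() >= maxCount) {
            // All expected sources have arrived; a production trigger would also
            // delete the pending processing-time timer here
            count.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
        // Timed out: compare what has arrived and mark the missing sources as delayed
        ctx.getPartitionedState(countDesc).clear();
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }
}
```

It would be attached roughly as `stream.keyBy(indicatorKey).window(GlobalWindows.create()).trigger(new CountOrTimeoutTrigger(sourceCount, 30_000L)).process(...)`, with the process function comparing the collected indicator values and flagging missing sources as delayed.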
I believe there are many more scenarios worth exploring for Flink in the financial industry. By catching the express train of the open source community, the financial information services of securities companies like ours can improve qualitatively.
4. Future Outlook
Finally, here are some prospects for real-time stream processing, including scenarios currently under discussion and an exploration of stream-batch unification.
The scenarios under discussion fall into the following areas:
- Account assets, including real-time statistics of asset and position indicators, customer transaction profit and loss, and analysis of transaction records;
- Marketing, including MOT reminders and recall of lost customers, reminders and follow-up for customers whose account opening did not complete, mining potential customers for the margin trading and securities lending business, and content operation for e-commerce APP activities;
- Risk control, including analysis and statistics of position-concentration indicators at the customer level, and indicators such as the company's financing balance as a proportion of its net capital.
On the other hand, our project team is investigating OLAP multi-dimensional analysis components. The current real-time development still uses the Lambda architecture, and the result tables are stored in relational databases such as MySQL, SQL Server, and Oracle as well as NoSQL stores such as HBase, ES, and Redis, so data silos are a serious problem at present. We hope to write real-time and offline data into a unified OLAP component to achieve stream-batch unification, break the data silos, and reach the goals of unified storage, unified external services, and unified analytical processing in a stream-batch unified storage layer.