
Abstract: Most modern applications are built with separated front ends and back ends. In China Construction Bank's (CCB) new-generation system, each transaction corresponds to three messages: the front-end tracking-point (buried-point) data, the outgoing HTTP request, and the returned HTTP response. CCB has a large number of branches and staff around the world and generates a huge volume of financial transactions every day — and with them a huge volume of messages — across hundreds of application scenarios such as operations dispatch, cash distribution, and credit card approval. Financial business is complex and demands high stability, especially in the banking industry.

This article is compiled from a talk given by Zhou Yao, a fintech development engineer at CCB, at the Flink Forward Asia 2021 industry practice session. Drawing on the experience of CCB Jinke's intensive operation service team with the stream computing framework Flink, it focuses on how stateful stream computing is introduced to process and merge the three messages — tracking point, request, and response — and deliver the result back to applications for consumption in a stable, timely, and efficient way, ultimately generating business value. It shows the evolution of the stream computing architecture in CCB's operations big data, the solutions adopted, and the detours taken along the way, in the hope of providing a reference for financial enterprises adopting stream computing. The team also hopes that the products and scenarios CCB Jinke has developed inside CCB can eventually be commercialized and promoted to more peers and industries.

This article is organized in four parts: company introduction, business background and challenges, solution evolution and business effects, and future outlook.

Click to view live replay & speech PDF

1. Company introduction

img

CCB Jinke is the financial technology subsidiary of China Construction Bank, transformed from the former CCB Software Development Center. The company is committed to becoming a technology promoter and ecosystem connector for the "new finance" system, supporting CCB Group's digital transformation, empowering the construction of "Digital China", and using fintech to make society better. It also provides B2B digital transformation consulting and digital construction services.

img

CCB Jinke's intensive operation service team mainly produces four intelligent operation products: process operation, delivery management, operational risk control, and channel management. This article mainly describes the practical application of Flink real-time computing in the process operation product.

2. Business Background and Challenges

Process operation is customer-centric and driven by processes, data, and technology; it digitally manages and controls operational journeys and resources, building a group-wide, "vertically and horizontally integrated" intelligent operation system.

img

2.1 Introduction to Process Operation

Taking the credit card process as an example, a user can apply for a CCB credit card through the bank's mobile APP, WeChat, or other channels. After the user submits the application, it is routed to the credit card approval department, which determines the credit limit based on the user's overall profile; the information then flows to the card issuing and card making departments, and finally to the distribution department, which mails the card to the user. The user can start using the card after activating it. Each key business node — application, review, card issuing, card making, activation — can be called a business action, and multiple business actions chained together form a business process from the business perspective.

For such a credit card application process, different roles want to obtain different information from it, each from their own perspective.

  • As an applying customer, I want to know whether my card application has been approved and when the card will be sent out.
  • As the staff member who recommended the card application offline, I want to know whether any of the customer information collected today is incomplete and has caused the process to be returned.
  • As a bank leader, I may want to know in real time how many credit card applications my branch has processed today and whether the average review time is slower than before.

For critical, high-frequency applications such as credit cards, dedicated process systems already exist in the industry to meet user needs. But some processes are relatively low-frequency, and the data of the systems involved is isolated from one another — for example, door-to-door cash collection, ATM cash replenishment, the cash supply chain process, and account opening. There are hundreds of such process applications. Building a dedicated system for each of them would not only be costly but would also require invasive modifications to each component, introducing more uncertainty.

Therefore, we hope to build a system that can ingest all the log information, connect the data of each system, and restore the business process from the business perspective, so that business users can view the data globally. This allows the data to generate greater value and is in line with the trend of digital transformation in banking in recent years. Our intelligent operation product meets this demand well.

2.2 Process operation goals

In order to meet the needs of various business users and roles for process analysis, process operations are mainly responsible for four things:

  • The first is to present the status quo of the business process completely;
  • The second is to diagnose the problem of the process;
  • The third is to monitor and analyze the process;
  • The fourth is to optimize the business process.

img

So the first thing we need to do is to restore the process. Then who defines the process? After some thought, we concluded that this should be delegated to business users. In the past, business processes were written into the code by developers. For business personnel, credit card approval only exposed whether the final result passed or not; in fact the approval steps may have dependencies — for example, the approval truly passes only when steps A, B, and C have all passed.

If process definition is delegated to business users, then since they do not know the system's internals, they may not know at first that the approval consists of the three steps A, B, and C. Therefore we need to provide a tool that lets business users experiment. The business personnel first configure the process parameters based on intuition; the process operation system then runs real data through the configured process and restores it into real process instances, verifying whether the process matches the business scenario. If it matches, the process configuration goes live; if not, business users can adjust it promptly against the scenario and keep improving the business process iteratively.

Then, with these running processes, applications can be built on top of them, such as indicators, monitoring, and early warning — for the credit card application business, for example, the application pass rate and the post-issuance activation rate.

After that, we carry out a series of process monitoring and operation-and-maintenance work.

Finally, with this indicator data, we can use it to guide and empower the business, improve operational efficiency, and raise service satisfaction.


2.3 Technical challenges

img

In order to achieve this goal, we face several challenges:

  • Business data originates from multiple systems. CCB is also building a data lake, but the data in the lake is just a simple accumulation; if no global view is formed, it easily leads to data silos.
  • Business flexibility is high. We need a mechanism that lets the business configure processes independently and keep improving them iteratively.
  • Real-time requirements are high. Processes must be computed in real time as soon as the business events occur.
  • Data comes from multiple streams. Horizontally, data comes from multiple systems such as review, card issuing, and activation. Most of these systems have separated front ends and back ends, producing both front-end tracking points and back-end request/response logs. We need to collect data from each system and link the corresponding tracking point and request/response together in real time to obtain one business action.
  • Data volume is huge. Tens of billions of records arrive per day, around the clock.

img

In order to solve the above pain points, we have taken a series of measures:

  • Use message queues: production business logs and the data processing system are decoupled through Kafka, minimizing invasive changes to the applications.
  • Define parameters, and configure and manage them through sites and processes.
  • Use Flink, which processes data in real time and scales horizontally.
  • Use unified stream-batch processing to save resources.

Because all our stream computing applications run on CCB's big data cloud platform, let's first introduce the big data cloud platform.

img

The picture above shows the big data cloud platform. The data processing flow is as follows: data is ingested from sources such as network packets, tracking points, logs, and database CDC; Flink processes the data in real time; the results are written to storage systems such as HBase, GP, and Oracle; and finally upper-layer applications visualize the results.

3. Solution evolution and business effects

In CCB, data usually comes from three channels, namely customer channel, employee channel and outreach channel.

img

Transactions in the customer channel are mainly initiated in the mobile banking APP; transactions in the employee channel are mainly initiated by the bank tellers in CCB's branches — for example, when you go to a branch to make a deposit, the teller initiates a deposit transaction in the employee channel; the external channel refers to transactions formed by external systems calling CCB's interfaces.

3.1 Process Analysis Scenario

Each business action initiated through a channel corresponds to three log messages in the data processing system — request, response, and tracking point — all carrying a globally unique tracking number. The difficulty is to extract the unique identifier from the three continuous data streams and join them on it to form a complete business action. In addition, there are the issues of out-of-order arrival, intermediate state storage, and late-arriving messages.
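The extraction-and-keying step can be pictured with a minimal sketch (plain Python, not CCB's actual code; the field names `kind`, `trace_id`, `response_code`, and `uri` are hypothetical stand-ins for the real message schema):

```python
# Normalize the three message types -- request, response, and tracking point --
# and extract the global tracking number used as the join key.

def extract_key(msg: dict) -> tuple:
    """Return (tracking_number, message_kind) for any of the three messages."""
    if msg.get("kind") == "trace":       # front-end tracking point
        return msg["trace_id"], "trace"
    if "response_code" in msg:           # HTTP response log
        return msg["trace_id"], "response"
    return msg["trace_id"], "request"    # HTTP request log

messages = [
    {"kind": "trace", "trace_id": "T001", "page": "card_apply"},
    {"trace_id": "T001", "uri": "/card/apply"},
    {"trace_id": "T001", "response_code": 200},
]
keys = [extract_key(m) for m in messages]
# all three messages share the key "T001", so keying the streams by the
# tracking number routes them to the same parallel task for joining
```

In Flink terms, this corresponds to a `keyBy` on the tracking number so that the three messages of one business action always meet in the same task.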

img

In order to solve these problems, the scheme has also undergone three evolutions.

  1. The first solution used a sliding window, but efficiency problems soon arose;
  2. The second solution used Flink's built-in interval join, but the job ran unstably and hit OOM problems;
  3. The third solution implemented a KeyedProcessFunction ourselves to manage the intermediate state manually, solving both the efficiency and the stability problems.

Before diving into the details, some background. In 80% of cases, the three messages corresponding to a business action arrive within 5 seconds, but due to network jitter or collection delays we need to tolerate a delay of about one hour. Also, a global tracking number corresponds to exactly one request, one response, and one tracking point — that is, the three messages of a business action join successfully exactly once.

img

3.1.1 Sliding window (version 1.0)

To meet these requirements we quickly launched version 1.0, based on a sliding window. When a request or response arrives, we first separate the two types, extract the unique business identifier, and key by it. Because messages arrive out of order — sometimes the request comes first, sometimes the response — we use a 10-second sliding window that slides every 5 seconds. If the response arrives within 5 seconds of the request, the two are joined inside the window and the business action is output directly; if it does not arrive in time, the pending message is extracted and stored in Redis to wait. When the matching response arrives later, it first looks up Redis by the business identifier; if the request is found there, it is taken out and the business processing proceeds.

That is, the request is first joined with the response, and the joined result is then joined with the tracking point — effectively two real-time joins — with Redis serving as the state store that holds the messages not yet matched.
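The request/response half of this version 1.0 logic can be simulated in a single-process sketch (a plain dict stands in for Redis; the real pipeline uses Flink sliding windows, and all names here are illustrative):

```python
# Version 1.0 sketch: try to pair inside the short window; when the window
# slides past, park the unmatched side in "Redis" and match on later arrival.

redis_stub = {}   # (business_id, side) -> parked payload, stands in for Redis
joined = []

def on_message(biz_id, side, payload, window_buffer):
    other = "response" if side == "request" else "request"
    if (biz_id, other) in window_buffer:          # matched inside the window
        joined.append((biz_id, window_buffer.pop((biz_id, other)), payload))
    elif (biz_id, other) in redis_stub:           # matched against parked state
        joined.append((biz_id, redis_stub.pop((biz_id, other)), payload))
    else:                                         # wait inside the window
        window_buffer[(biz_id, side)] = payload

def window_expired(window_buffer):
    # when the window slides past, move everything unmatched into "Redis"
    redis_stub.update(window_buffer)
    window_buffer.clear()

buf = {}
on_message("T001", "request", "req-payload", buf)
window_expired(buf)                       # the response missed the window
on_message("T001", "response", "resp-payload", buf)
# joined == [("T001", "req-payload", "resp-payload")]
```

The sketch also makes the drawback visible: every unmatched message costs a round trip to the external store, which is where the Redis throughput and capacity limits described below come from.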

img

But this approach has several drawbacks:

  • First, low throughput. As more and more data sources were connected, Flink's parallelism grew, and so did the number of Redis connections. Limited by Redis's throughput and connection limits, overall throughput was capped once a threshold was reached;
  • Second, heavy Redis operation and maintenance. As data volume grew, more and more unmatched data accumulated and Redis quickly filled up; to keep it stable, some manual purging was required;
  • Third, extra code had to be written by hand in Flink to interact with Redis;
  • Fourth, as the Redis state backlog grew, parameters or data inside it could expire or be evicted.

So we evolved a second version, the interval join version.

3.1.2 Interval join version (version 2.0)

Interval join is a built-in feature of the Flink framework. It uses RocksDB for state storage, which effectively replaces Redis.

img

The original intention of this solution was, on the one hand, to reduce the operation and maintenance pressure, and on the other hand, to scale horizontally easily as data volume grows.

The first optimization is that, after data arrives, it is filtered according to the configuration, dropping unneeded data early so that the volume to be processed is greatly reduced. The second is to use interval join to join the request with the response once, and then join that result with the tracking point again — logically consistent with the 1.0 scheme. To satisfy the requirement of tolerating a delay of about one hour, we set the join's lower and upper bounds to 30 minutes each.
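Interval-join semantics can be illustrated with a toy, plain-Python stand-in (not the Flink API; keys, timestamps, and payloads are made up): each side is buffered in state, and a pair is emitted whenever the timestamps fall within the ±30-minute bounds of each other.

```python
from collections import defaultdict

BOUND = 30 * 60   # +/- 30 minutes, in seconds

requests = defaultdict(list)    # key -> [(ts, payload)] kept in state
responses = defaultdict(list)

def on_request(key, ts, payload, out):
    requests[key].append((ts, payload))
    for rts, rp in responses[key]:
        if abs(rts - ts) <= BOUND:
            out.append((key, payload, rp))

def on_response(key, ts, payload, out):
    responses[key].append((ts, payload))
    for qts, qp in requests[key]:
        if abs(qts - ts) <= BOUND:
            out.append((key, qp, payload))

out = []
on_request("T001", 1000, "req", out)
on_response("T001", 1010, "resp", out)   # within the bound -> joined
# out == [("T001", "req", "resp")]
# note: both records stay in state until the watermark passes ts + BOUND,
# because an interval join must also support one-to-many matches
```

Seen this way, the state retention is inherent to the operator's contract, not a tuning issue.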

But after running it for a few days, we found that OOM occurred frequently. Since we run Flink on Kubernetes, the analysis was complicated. By reading the source code, we later found that Flink's interval join keeps all records within its time bounds in state and does not delete them until they expire, which led to some new problems.

img

First of all, the checkpoints become very large. Reading the source code of Flink's interval join implementation, we found that it retains all upstream and downstream records within the 30-minute bounds in the RocksDB state backend, in order to handle one-to-many and many-to-many join cases.

Second, the job ran unstably. It uses RocksDB as the state store; RocksDB itself is written in C++ and Flink calls it from Java, which easily leads to OOM. And due to certain constraints, the only way to prevent RocksDB from OOM is to configure its parameters to give it enough space. For our in-bank applications, an OOM means real-time business interruption, which is absolutely unacceptable.

img

We then analyzed the in-bank join scenario and found that requests, responses, and tracking points within the bank always have a one-to-one relationship — there is no one-to-many relationship as in a database join. Given this, we realized that within the one-hour interval, much of the data does not actually need to be kept in the state backend, so we wanted to manage the state ourselves and delete unneeded data from it. This led to the third version.

3.1.3 Manual state management (version 3.0)

img

Because a tracking number joins successfully only once, in version 3.0 we implemented a KeyedProcessFunction ourselves to manage the state.

After the data arrives, it is first filtered and unified regardless of whether it is a request, response, or tracking point, and then keyed by the extracted unique identifier, so that messages with the same identifier land in the same slot. Whenever a message arrives, the function checks whether the other messages have already arrived and whether the join condition is met. If it is, the joined result is output and the corresponding data in the state backend is cleared immediately. If not, the message is kept waiting in the RocksDB state backend. Managing state manually this way reduced state storage by 90%, which brought great benefits.
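The core of the version 3.0 idea can be condensed into a few lines (a plain-Python stand-in: the real implementation is a Flink KeyedProcessFunction with RocksDB state and a cleanup timer for the one-hour tolerance, which this sketch omits):

```python
# Per-key state holds at most one request, one response, and one tracking
# point; the moment all three are present, the merged record is emitted and
# the state is cleared immediately, instead of waiting for an interval to
# expire as the built-in interval join does.

state = {}   # key -> {"request": ..., "response": ..., "trace": ...}

def process(key, kind, payload, out):
    slot = state.setdefault(key, {})
    slot[kind] = payload
    if all(k in slot for k in ("request", "response", "trace")):
        out.append((key, slot["request"], slot["response"], slot["trace"]))
        del state[key]        # clear state as soon as the join succeeds

out = []
process("T001", "response", "resp", out)   # arrival order does not matter
process("T001", "trace", "pt", out)
process("T001", "request", "req", out)
# out == [("T001", "req", "resp", "pt")] and state is empty again
```

Because the one-to-one property guarantees each key joins exactly once, deleting state eagerly is safe, which is where the 90% state reduction comes from.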

img

First, with RocksDB as the state backend, throughput improved greatly compared with version 1.0.

Second, it reduces the difficulty of development, operation, and maintenance. Third, real-time processing capability is improved, and when data volume grows later, it can be scaled horizontally by adding nodes. In addition, Flink's built-in join solutions provide good interfaces that made it easier for us to implement our own logic.

img

3.2 Process Indicator Scenario

With the basic process data in place, we built indicator calculations on top of it, and the real-time process indicator computation also went through two iterations.

3.2.1 Real-time indicators version 1.0

img

The version 1.0 real-time indicators used stream computing and offline computing together. Limited by our proficiency with the technology stack and tools at the time, we produced quasi-real-time, minute-level indicators. The data source is Kafka; Flink performs pre-aggregation and sinks the results back to Kafka; Spark then periodically transfers the data from Kafka into the GP database. With GP as the core, we compute the indicators in SQL, write the results back to Oracle, and finally the applications consume them. This was our version 1.0.

3.2.2 Real-time indicators version 2.0

img

As we grew more familiar with Flink and its tooling, the team started thinking about how to achieve second-level real-time. Version 2.0 receives data directly from Kafka, computes the real-time indicators with Flink, and writes the results straight to Oracle, achieving end-to-end second-level latency — truly real-time indicators.
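A second-level indicator boils down to windowed aggregation on the event stream. Here is a minimal sketch (plain Python standing in for a Flink windowed count; event times and channel names are made up) of counting completed business actions per channel in 1-second tumbling windows:

```python
from collections import Counter

def tumbling_count(events, window_s=1):
    """events: iterable of (event_time_seconds, channel).
    Returns {window_start: Counter(channel -> count)}."""
    windows = {}
    for ts, channel in events:
        start = int(ts // window_s) * window_s   # align to window boundary
        windows.setdefault(start, Counter())[channel] += 1
    return windows

events = [(0.2, "mobile"), (0.7, "teller"), (1.1, "mobile"), (1.3, "mobile")]
result = tumbling_count(events)
# result == {0: Counter({"mobile": 1, "teller": 1}), 1: Counter({"mobile": 2})}
```

In the production pipeline the equivalent windowed aggregation runs in Flink (or Flink SQL) and each window's result is written to Oracle as it closes.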

img

In a bank, channel, product, and institution are three very important dimensions. We compute the distribution of production business processes across channels, products, and institutions — for example, for each outlet, what proportion of business processes come through online versus offline employee channels; whether a process is rolled back midway because some materials were not ready; and the average processing time of each link.
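One of the indicators just mentioned, the average processing time of each link, is a simple per-step aggregation. A small sketch (the step records are illustrative, not real CCB data):

```python
def avg_step_duration(records):
    """records: iterable of (step_name, start_ts, end_ts) in seconds.
    Returns {step_name: average duration in seconds}."""
    totals, counts = {}, {}
    for step, start, end in records:
        totals[step] = totals.get(step, 0.0) + (end - start)
        counts[step] = counts.get(step, 0) + 1
    return {step: totals[step] / counts[step] for step in totals}

records = [
    ("review", 0, 120), ("review", 0, 180),   # two reviews: 2 and 3 minutes
    ("card_issuing", 0, 60),
]
print(avg_step_duration(records))  # {'review': 150.0, 'card_issuing': 60.0}
```

Grouping the same computation by channel, product, or institution yields the dimensional breakdowns described above.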

img

So in general, Flink has indeed brought relatively large benefits to the project.

  • First, Flink SQL makes the processing easier. Since Flink 1.11 the SQL features have been steadily improved, making it more and more convenient for developers;
  • Second, it provides true end-to-end, second-level real-time;
  • Third, using Flink reduces the number of interacting data components, shortens the whole data link, and improves data stability.

3.3 Business Results

img

In the middle of the above picture is a business process of cash reservation and door-to-door collection on the mobile APP of China Construction Bank.

First the cash is entered into the account, then counted, then handed over, and finally the business acceptance is completed. The processes and indicators we analyze can be viewed not only in the mobile APP but also in the employee channel. For example, each green dot on the left represents a site; as sites complete, they are linked together to form a complete process.

img

For business people, the first value gained is process reshaping. From indicator access to indicator visualization to data mining, indicators are finally derived along the process and used to optimize it, forming a complete business closed loop.

img

With these basic data, we can carry out risk interventions for business processes. For example, if a customer wants to handle a large amount of cash withdrawal business, the system will notify the branch manager in real time to retain and intervene in the customer.

Second, process analysis enables optimal allocation of resources. Process-based applications can monitor how business resources are used. For example, if the number of applications in a process suddenly surges, we consider whether insufficient manpower is making processing times too long; the system raises early warnings and more staff can be allocated. In this way resource allocation optimizes utilization and improves service satisfaction. Similar process scenarios have been promoted to many in-bank applications and have been well received.

The processing of real-time data by Flink provides strong data support for CCB's digital transformation.

4. Future Outlook

img

At present, this process operation runs only within CCB. In the future we hope to productize and platformize the methodology of intelligent process operation and promote it to more industries, so that more industries can benefit from financial-grade process operation practices.





