User Case | Farewell to Traditional Financial Messaging Architecture: Apache Pulsar in Practice at Ping An Securities

This article was first published on InfoQ as "Farewell to Traditional Financial Message Architecture: The Practice of Apache Pulsar in Ping An Securities".
Authors: Wang Dongsong, Chen Xiang

In financial services, application systems take on new scenarios as the business expands. These scenarios place additional demands on the messaging system, and our original architecture faced a series of challenges as a result. After evaluating Apache Pulsar, Ping An Securities decided to deploy it in production. This article explains why Ping An Securities chose Apache Pulsar, the scenarios in which we use it, the problems we encountered in practice, and our future plans.

Background

Traditional financial companies and brokerage firms generally use a unified access service or component to handle external business. When a user request arrives, it is forwarded to the corresponding business system or module according to business rules. Some requests are forwarded to a message queue: once the request is written, the downstream business system reads it from the queue, processes it, and returns the result to the customer through the message queue. The entire request flow is closed and its functionality is limited.

Challenges of the traditional message queue architecture

Ping An Securities uses the traditional architecture described above, which currently supports only the message queue model. Although we have some in-house development capability, it is difficult to obtain detailed information about the message queue. And because it is a custom-developed system, it supports relatively few languages. For business development and innovation, the existing message queue has the following shortcomings:

Black-box system, difficult to observe: the message queue is a black box, and it is difficult for us to observe its internals;
Direct exchange, no routing: it currently supports only message queues and cannot support scenarios that require routing;
Weak access checks, high security risk: the existing system's password authentication and verification are weak, so the security risk is high;
Customized system, limited language support: the custom system supports few client languages, which narrows our choices and makes it difficult to evolve the original system.

As the business expands and the architecture evolves, the company's existing message queue system and components face a series of challenges, and several of the problems, security in particular, are urgent in financial scenarios.

Business needs of financial scenarios

Our business needs fall into three main categories: identity and security control, routing distribution, and auditing.

Identity & Security Control

Identity recognition determines the identity of the clients and accessors that connect to the message queue, applies the corresponding security rules, and rejects illegal access, so that the expected security requirements are met. At the most basic level, we need to identify and control which systems and IP addresses may connect, and restrict permissions according to the business scenario and specific needs.
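The basic check described above (which system, from which IP, with which permission) can be sketched in a few lines. This is a minimal illustration only, not Apache Pulsar's authentication or authorization API; the system names, network ranges, and action names are hypothetical.

```python
import ipaddress

# Hypothetical access rules: system name -> (allowed network, permitted actions).
ACCESS_RULES = {
    "trading-gateway": (ipaddress.ip_network("10.1.0.0/16"), {"produce", "consume"}),
    "report-service": (ipaddress.ip_network("10.2.3.0/24"), {"consume"}),
}


def is_allowed(system: str, client_ip: str, action: str) -> bool:
    """Return True only if the system is known, the IP is in range, and the action is permitted."""
    rule = ACCESS_RULES.get(system)
    if rule is None:
        return False  # unknown systems are rejected at the entrance
    network, actions = rule
    return ipaddress.ip_address(client_ip) in network and action in actions


print(is_allowed("trading-gateway", "10.1.5.9", "produce"))  # known system, IP in range
print(is_allowed("report-service", "10.2.3.7", "produce"))   # action not permitted
```

Rejecting unknown systems by default keeps the rule table an allow-list, which matches the goal of turning illegal access away before it reaches internal systems.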

Routing distribution

Routing distribution means that messages are routed from the queue they are written to into the corresponding destination queues according to rules. The existing message queue supports a limited set of scenarios. Supporting more would require a large investment of development time and effort (including changes to upstream and downstream systems) and would introduce other problems. A better solution is a message queue system that natively supports more modes and features, such as topic mode and streaming message processing. If the message queue system supports routing, the system's access complexity drops considerably and the access layer can operate more efficiently: each system only needs to connect to one set of topics while routing handles distribution, and performance can be optimized in a targeted way (routing, forwarding, and protocol conversion are all performance-consuming operations).

The original architecture communicates point to point and operates as a closed system: request messages cannot be shared and can only be distributed indirectly through adapters or log collection. Such methods struggle to meet real-time requirements.

Audit

The publishers and receivers of messages are the participants of the entire system, and they are the top priority: the main factor affecting system security is the system's participants. From a security perspective, the audit requirements for messages are therefore relatively high. Another, more urgent need is to control the flow of messages. If identity recognition and security control are in place, security information can be refined during audits, so that invalid and illegal requests are rejected at the business entrance and the internal systems stay robust. In addition, information about the publishers and recipients of messages can be used for anomaly monitoring and auditing.

System requirements for new business

The new business places higher requirements on the messaging system, covering availability, message delivery latency, scaling out and in, and message backtracking.

Requirement 1: High availability, low latency

For the Internet industry, high availability and low latency are basic system requirements. Deployments evolve from a single site to disaster recovery, then across data centers in the same city, and then across multiple centers in different cities; alternatively, they go cross-city first with disaster recovery and then to cross-city multi-center. This "two places, three centers" model is becoming the norm, and many companies' business systems are developing, or will develop, in this direction. Such systems place relatively high demands on availability and latency, so we need to consider how to keep latency to a minimum as system complexity grows (for example, with disaster recovery or cross-city deployment).

Requirement 2: Rapid scaling and recovery

One of the main characteristics of business in the financial industry is that requests may surge during a certain time window, after which traffic gradually returns to normal. This requires the system to scale out and in horizontally, and quickly. For cost reasons, it is clearly unreasonable to provision the entire architecture for peak traffic. A better approach is to size the architecture and deployment for normal traffic and expand quickly to support the business when traffic suddenly increases. Ideally, every component of the system can scale out, scale in, and recover quickly.

Requirement 3: Message ordering and deduplication

Some business scenarios require that messages be ordered or free of duplicates. We often have to make some interfaces idempotent to cope with repeated messages. If we can ensure that upstream messages are not duplicated, we can reduce the pressure on downstream systems.
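One common way to keep duplicates out upstream is to have each producer number its messages and to drop anything whose sequence number has already been accepted. The sketch below shows that general idea in memory; it is an assumption-laden illustration, not Apache Pulsar's implementation, and the class and producer names are hypothetical.

```python
class DedupLog:
    """In-memory log that accepts each (producer, sequence ID) at most once."""

    def __init__(self):
        self.last_seq = {}   # producer name -> highest sequence ID accepted so far
        self.messages = []   # accepted payloads, in arrival order

    def append(self, producer: str, seq: int, payload: str) -> bool:
        # A sequence ID at or below the last accepted one is treated as a retry.
        if seq <= self.last_seq.get(producer, -1):
            return False
        self.last_seq[producer] = seq
        self.messages.append(payload)
        return True


log = DedupLog()
log.append("order-service", 0, "buy 100")
log.append("order-service", 1, "sell 50")
log.append("order-service", 1, "sell 50")  # retried send, dropped
print(log.messages)
```

This scheme assumes each producer sends its sequence numbers in increasing order; downstream consumers then see every accepted message once and do not need their own duplicate checks for this case.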

Requirement 4: Message backtracking and serialization

If a problem occurs in a business system but is difficult to reproduce in the test environment, message backtracking becomes necessary. Message backtracking means replaying all requests from the time window in which the problem occurred, to verify whether the problem can be reproduced and to troubleshoot it, which greatly reduces the troubleshooting workload. We can also use this capability for canary (grayscale) verification and parallel verification.
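Replaying a time window can be pictured as follows. This is a minimal in-memory sketch that assumes messages are retained together with their publish time; `StoredMessage`, `replay`, and the sample data are hypothetical names for illustration, not part of any messaging API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class StoredMessage:
    publish_time: float  # seconds since some epoch
    payload: str


def replay(log: Iterable[StoredMessage], start: float, end: float,
           handler: Callable[[str], None]) -> int:
    """Feed every message published in [start, end) to handler; return how many were replayed."""
    count = 0
    for msg in log:
        if start <= msg.publish_time < end:
            handler(msg.payload)
            count += 1
    return count


log = [StoredMessage(10.0, "req-1"), StoredMessage(20.0, "req-2"),
       StoredMessage(30.0, "req-3"), StoredMessage(40.0, "req-4")]
replayed = []
n = replay(log, start=20.0, end=40.0, handler=replayed.append)
print(n, replayed)
```

Pointing `handler` at a test or canary instance instead of a list gives the parallel-verification use mentioned above.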

Choosing Apache Pulsar

Based on the business and system requirements above, we found that many Apache Pulsar features fit our needs well:

Cross-cluster synchronization in cluster mode. For an active-active deployment, cross-cluster geo-replication synchronizes messages transparently to the client.
Separation of compute and storage. Storage and compute scale horizontally according to usage, and clients are unaware of the scaling. Tiered (secondary) storage broadens how messages can be used and makes data analysis and message auditing possible.
Pluggable client authentication that supports custom development. Our business requires authentication and authorization when a client connects, to ensure that the source of each message is reliable and controllable.
Complete REST API for viewing queue status. Our previous messaging system performed well but lacked observability, which made troubleshooting difficult. Its management tooling was also relatively primitive and hard to adapt to large-scale system management. Apache Pulsar's REST API not only exposes operational metrics but also helps manage the cluster efficiently.
Message routing, filtering, and statistics can be developed with Pulsar Functions.
The persistence mode and expiration time of messages are configurable, which allows messages to be replayed.
Multi-language support, so access is fast and convenient.

Apache Pulsar's business scenarios at Ping An Securities

Ping An Securities uses Apache Pulsar to build a unified messaging platform that we hope will integrate four major data streams (customers, transactions, market quotations, and funds) and apply them to market data distribution and real-time risk control. This article focuses on three business scenarios for Apache Pulsar (request routing, data broadcasting, and message notification), the advantages and disadvantages of the new architecture, and its impact on the development and operations teams.

Scenario 1: Request routing - simplifying the system

Our message routing process is shown in the accompanying figure. A request sent by component A is written to Topic A; the routing module then routes the messages in that topic and distributes them to the corresponding topics. Downstream components that subscribe to those topics process the messages. Component A only needs to write messages to a fixed queue and does not need to know about Topics B, C, and D. Each downstream system only needs to know the queue it receives messages from and does not need to know about Topic A, which simplifies the structure of the whole network.
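The routing step described above can be sketched as a rule table applied to everything read from Topic A. This is an in-memory illustration under stated assumptions, not the routing module itself and not Pulsar Functions code: the rule table, the message shape, and the `topic-unrouted` fallback are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rule table: request type -> destination topic.
ROUTES = {"order": "topic-b", "quote": "topic-c", "fund": "topic-d"}


def route(topic_a_messages, routes=ROUTES, fallback="topic-unrouted"):
    """Distribute messages read from Topic A into destination topics by request type."""
    topics = defaultdict(list)
    for msg in topic_a_messages:
        destination = routes.get(msg["type"], fallback)
        topics[destination].append(msg)
    return dict(topics)


topic_a = [
    {"type": "order", "body": "buy 100"},
    {"type": "quote", "body": "price?"},
    {"type": "order", "body": "sell 50"},
    {"type": "other", "body": "unknown request"},
]
routed = route(topic_a)
print(sorted(routed))
```

Component A only ever appends to `topic_a`, and each downstream consumer only reads its own destination list, which is the decoupling the scenario relies on. Sending unmatched messages to a fallback topic is one possible design choice so that nothing is dropped silently.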

This message routing mode simplifies the overall system architecture. However, our routing system still needs optimization:

Although the workload of routing distribution has decreased, troubleshooting now takes more steps. For example, if component B does not receive a message that component A sent, we first check whether component A wrote the message to Topic A, then whether the routing module routed it successfully, and finally whether component B subscribed to the message correctly.
Current test results show that latency increases as the message path gets longer.
Because the messages in every queue are persisted, data is stored redundantly across storage and queues.
The routing module is newly added, so the learning cost for operations is relatively high.

Scenario 2: Data broadcasting - reducing latency

Data broadcasting is another business scenario in which we use Apache Pulsar. It uses the publish/subscribe mode and is mainly used to synchronize messages. In the past, we either did not need to synchronize market quotations to the business systems or did so by other means (such as database synchronization). As the business has grown, however, the tension between synchronization timeliness and user experience has become increasingly acute: how can users see information faster? Take market data synchronization as an example. Synchronizing the database first and then querying it takes relatively long, whereas in broadcast mode the business system only needs to subscribe to the topics it needs and can read the data directly when queried, which effectively reduces latency.
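The broadcast pattern can be illustrated with a toy publish/subscribe loop in which every subscribed business system receives every quote and answers queries from its own local copy. This is a single-process sketch, not a Pulsar client; `QuoteTopic`, `BusinessSystem`, and the symbols and prices are hypothetical.

```python
class QuoteTopic:
    """Toy topic that delivers every published quote to every subscriber."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, symbol: str, price: float):
        for callback in self.subscribers:
            callback(symbol, price)


class BusinessSystem:
    """Keeps the latest quote per symbol locally, so queries need no database round trip."""

    def __init__(self, topic: QuoteTopic):
        self.latest = {}
        topic.subscribe(self.on_quote)

    def on_quote(self, symbol: str, price: float):
        self.latest[symbol] = price

    def query(self, symbol: str):
        return self.latest.get(symbol)


topic = QuoteTopic()
app, risk = BusinessSystem(topic), BusinessSystem(topic)
topic.publish("SYM-A", 12.30)
topic.publish("SYM-A", 12.35)
topic.publish("SYM-B", 7.10)
print(app.query("SYM-A"), risk.query("SYM-B"))
```

The query path touches only local memory, which is where the latency saving over "synchronize the database, then query" comes from in this model.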

Scenario 3: Message notification - security control

The third scenario in which we use Apache Pulsar is message notification. Although it involves relatively little business, this scenario is very important. The overall business flow is shown in the flow chart. Because there is more than one signal source, after a message is published to the calculation engine, the engine performs logic and security calculations based on the signal source's information. When the calculation completes, a Task is invoked, and the activated Task sends a service request to the relevant business system. After execution, the result is returned to the service that initiated the signal, and that service triggers the next signal based on the returned result.

The business in this scenario has very strict security and control requirements. We need to restrict the messages or signals sent by the signal source, cutting off or filtering some signals, and we also need to process the returned results: deciding which can be returned as-is and which must be filtered out or converted to other content. Without a message queue, the signal source would send messages directly to the calculation engine; the calculation engine would apply the security or control policy and then send the message to the Task; and after the Task executes, the result would need another round of security control processing. These repeated operations have a significant impact on performance, and neither policy updates nor views of signal status are truly real-time.

After introducing Apache Pulsar, we split out a control and audit module dedicated to filtering, auditing, and collecting statistics on the signal queues and result queues, and to outputting the results to the management side in real time. After seeing this information, operations or audit staff can control and update the corresponding policies. This model streamlines the data flow, adds channels for supplementing data, and defines the boundaries of each service module more clearly.

Problems and solutions

So far we have mainly explored Apache Pulsar in the three scenarios above and are gradually putting it into production. Along the way we found several problems, and we share our solutions here for reference.

1. Implementing the REQ-REP mode

The first problem we encountered was how to implement the request-response (REQ-REP) mode. Our solution was to stay compatible with it through a bus mode.

The common calling pattern today is that the client initiates a request and the server returns a response after processing it. Once a bus is introduced (turning synchronous calls into asynchronous ones), a multi-node deployment works differently: node 1 sends a request, the server processes it and returns the result, and all nodes have to listen for results. What should node 2 do when it receives a response that belongs to node 1? Node 2 must first subscribe and receive the response message, determine whether it answers a request initiated by node 2 itself, and discard it if not. Under this mode, each node caches the message IDs of the requests it sends; the server, after processing, includes the request's message ID in the response according to the protocol; and each node subscribes to all responses, checks whether the message ID exists in its cache, and discards the message if it does not.
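The cache-and-filter behavior just described can be sketched as follows. This is a single-process illustration assuming the response carries the originating request's message ID; the class, field names, and the upper-casing "server" are hypothetical stand-ins, and no real message bus is involved.

```python
import uuid


class Node:
    """A client node that caches the IDs of its own requests and filters shared responses."""

    def __init__(self, name: str):
        self.name = name
        self.pending = {}    # message ID -> request payload awaiting a response
        self.results = {}    # message ID -> response payload
        self.discarded = 0   # responses that belonged to other nodes

    def send_request(self, payload: str) -> dict:
        msg_id = uuid.uuid4().hex
        self.pending[msg_id] = payload
        return {"id": msg_id, "payload": payload}

    def on_response(self, response: dict) -> bool:
        msg_id = response["request_id"]
        if msg_id not in self.pending:
            self.discarded += 1  # not ours: drop it
            return False
        del self.pending[msg_id]
        self.results[msg_id] = response["payload"]
        return True


def server_handle(request: dict) -> dict:
    # The server echoes the request's message ID back in the response.
    return {"request_id": request["id"], "payload": request["payload"].upper()}


node1, node2 = Node("node-1"), Node("node-2")
request = node1.send_request("query positions")
response = server_handle(request)
for node in (node1, node2):  # every node sees every response on the shared queue
    node.on_response(response)
print(node1.results[request["id"]], node2.discarded)
```

Node 2 still has to receive and inspect the response before discarding it, which is exactly the wasted traffic examined next in the article.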

This implementation has a serious problem that needs to be solved urgently. Suppose a node queries a large amount of data, with Apache Pulsar configured for a message size of 8 MB and a throughput of 1000 TPS: does every node have to receive all of that response traffic? With 5 nodes, each node should only need to receive the responses to its own 200 requests, but the current model requires every node to receive the responses to all 1000 requests, only to filter most of them out. If a node reaches its load limit, we have to add nodes, which multiplies the network bandwidth consumed. Because Apache Pulsar supports a large number of topics, this problem could be solved by configuring a separate response queue for each node, but we would like to try solving it through the broker's FILTER function.
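The arithmetic behind this comparison is small enough to write out. The sketch below uses the figures from the text (1000 TPS, 5 nodes, messages of up to 8 MB) and assumes requests are spread evenly across nodes; the function name is hypothetical and the bandwidth figures are worst-case upper bounds, not measurements.

```python
def responses_per_node(total_tps: int, nodes: int, shared_reply_queue: bool) -> int:
    """Responses each node must receive per second, assuming an even spread of requests."""
    if shared_reply_queue:
        return total_tps       # every node receives every response, then filters
    return total_tps // nodes  # each node receives only its own responses


TOTAL_TPS = 1000
NODES = 5
MAX_MESSAGE_MB = 8

shared = responses_per_node(TOTAL_TPS, NODES, shared_reply_queue=True)
dedicated = responses_per_node(TOTAL_TPS, NODES, shared_reply_queue=False)

print(shared, dedicated)                                    # responses per node per second
print(shared * NODES, dedicated * NODES)                    # responses delivered cluster-wide per second
print(shared * MAX_MESSAGE_MB, dedicated * MAX_MESSAGE_MB)  # worst-case MB/s per node
```

With a shared response queue, cluster-wide delivery grows in proportion to the number of nodes, while with per-node queues (or broker-side filtering) it stays at the request rate however many nodes are added.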

2. Implementing read-write separation

The message broadcast scenario involves read-write separation. When a large number of subscriber nodes are added, it is best to avoid concentrating all of their connections on the broker that owns the topic. A feasible solution is to allocate topics and partitions sensibly. Apache Pulsar 2.7.2, which we currently use, does not support read-write separation. We plan to upgrade to Apache Pulsar 2.8 to achieve read-write separation easily and meet the needs of the message broadcast scenario.

3. Handling multiple network cards

For network security reasons, the company's internal network is divided into multiple partitions and segments, each using different IP addresses. Servers have multiple network cards for communication between systems across partitions. Currently, if a broker is registered by IP, only the IP of one network segment can be registered; if it is registered by domain name, DNS resolution has to be configured differently in each network zone. These problems would not exist if the broker supported multi-NIC communication. Our current solution is to use a proxy to relay client requests, so that external systems connect only to the proxy. We will also add high-availability configuration to the proxy.

Future plans

At present, we run Apache Pulsar in production on a small scale, with a single cluster in a single data center; we did not consider an active-active deployment in the initial launch. As infrastructure for business systems, Apache Pulsar's own availability is extremely important. We therefore plan to build active-active on top of a single cluster spanning two data centers in the same city, as shown in the figure:

In the course of testing and using Apache Pulsar, we encountered some problems, and we thank the Apache Pulsar community for its active responses. We look forward to participating more in the development of Apache Pulsar and to contributing to Apache Pulsar and its community.

About the authors:
Wang Dongsong, R&D engineer of Ping An Securities Brokerage Business Division.
Chen Xiang, Architect of Ping An Securities Brokerage Division.
