About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native, distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates computing from storage and supports multi-tenancy, persistent storage, and multi-datacenter, cross-region data replication. It offers the characteristics expected of streaming data storage: strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/
Introduction to Chuanzhi Education
Chuanzhi Education (formerly Chuanzhi Podcast) is an IT training company dedicated to training high-quality software development talent. Its sub-brands include Dark Horse Programmers, Boxuegu, Chuan Zhi Hui, Kudingyu Children's Programming, Chuan Zhi Special Training College, and its academies, among others.
Chuanzhi Education is the first education company to achieve an A-share IPO. The company is committed to cultivating highly skilled digital talent, mainly digital professionals in fields such as artificial intelligence, big data, intelligent manufacturing, software, Internet, and blockchain, as well as digital application talent in data analysis, network marketing, new media, and related areas.
To bring better educational resources to more students, Chuanzhi Education has opened 19 branch schools across the country and trained 300,000 IT practitioners; published 111 books, reaching 2 million college students nationwide; released 120,000 video tutorial episodes, which are downloaded and played more than 40 million times a year on average; and held more than 1,500 free live open classes, with nearly one million attendees a year on average.
Boxuegu (Boxue Valley) was formally established in July 2016. Building on Chuanzhi Education's 15 years of experience in IT education, it takes employment-oriented courses as its core and adopts a personalized, self-paced, adaptive learning model, providing students with online IT learning services that cover beginner introductions, skill improvement, and career planning. It focuses on integrating high-quality IT teaching resources to create teaching products and services better suited to online learning.
The problems we faced
In 2020, the epidemic brought tremendous changes to our lives and work. Because of epidemic prevention and control measures, many offline courses could not be held as usual, and more users turned to online learning to build their knowledge and expand their professional skills. Boxuegu provides online teaching services and became the first choice for many of these users. With the sharp rise in user consultations and learning activity, the load on Boxuegu's online systems increased, and the original system faced new challenges:
- The original system only supports offline synchronization, and the response is slow.
- We need to synchronize the historical data collected by the original system, collect new data both offline and in real time, and run chained data cleaning and aggregation analysis over all of the data.
- Currently, the Alibaba Cloud DTS (Data Transmission Service) synchronization method is used to synchronize business tables, which is expensive and cannot perform operations such as data cleaning and conversion during the synchronization process.
Faced with growth in scale and changes to the business model, Boxuegu needed a more flexible and efficient system to process the growing volume of business data, keep the business systems running normally, support adjustments to the business model, and make more data available for decision analysis.
Why choose Pulsar?
We hoped to solve these challenges with message middleware. Our team members have experience with RabbitMQ and Kafka: RabbitMQ is better suited to lightweight scenarios, and Apache Kafka is suited to scenarios with large volumes of logs. We needed a solution that was more well-rounded in terms of both application scenarios and source code readability. During our research, we learned of another popular messaging system, Apache Pulsar. For the operations team, learning all three message middleware systems carries a real cost, and infrastructure is not easy to change once it is in place. We therefore conducted a thorough evaluation of message middleware for Chuanzhi Education. The evaluation criteria mainly included:
- Support for message streaming with guaranteed message processing order
- Support for exactly-once message processing semantics
- Support for permanent message persistence, with storage that is easy to scale
- Cloud-native deployment friendly, low operation and maintenance costs
- The source code is of good quality and the community is highly active
We found that Pulsar is a cloud-native messaging and event streaming platform with many built-in features that meet our needs. For example, Pulsar separates computing from storage: data is stored in Apache BookKeeper, while Pub/Sub-related computation runs on the brokers, which provides IO isolation. Compared with traditional messaging platforms (such as Kafka), Pulsar has clear advantages:
- Brokers and bookies are independent of each other; they can be scaled and can tolerate failures independently, which improves system availability.
- Partitioned storage is not limited by the storage capacity of a single node, and data distribution is more even.
- BookKeeper storage is safe and reliable, ensuring that messages are not lost, and it supports batch flushing for higher throughput.
- Read peaks do not affect write performance. Reads and writes use different physical storage, and data persistence becomes more convenient and cheaper.
From April to September 2020, we ran functional tests on Pulsar, covering ordered message consumption, data consistency, and message loss rate. The results showed that Pulsar consumes messages in order, keeps data consistent, and does not lose messages. In scenarios where ordering does not matter, Pulsar can be used directly as a message queue. Because subscription modes are set per subscription and do not affect the topic itself, multiple consumers can consume the same topic at the same time, in order or out of order.
For operations, we can use Kubernetes (Helm) to deploy Pulsar, Pulsar IO, and Pulsar Functions, and use pulsar-admin to simplify deployment and management for the operations team.
In a commercial company, adopting any new technology (including open source technology) has certain risks, even if the technology has significant advantages. After careful consideration and thorough investigation, we finally decided to introduce Apache Pulsar.
Pulsar's practical application in Chuanzhi Education
As an online education platform, we need to exchange large amounts of data with external parties. We use the third-party messaging system Ronglian Qimo to collect online customer service data, and the Zhuge IO system to collect user behavior data for analysis. We therefore need a system that aggregates external data, applies secondary processing, persists the results in the data warehouse, and finally produces a data set that meets the needs of business analysis.
Based on Apache Pulsar, we built a data processing system for Boxuegu. We isolate the data and configuration of each application through multiple namespaces, and implement data collection and processing through Pulsar IO and Pulsar Functions. According to business needs, some namespaces are configured so that messages never expire and are retained permanently. Thanks to Pulsar's separation of computing and storage, the system can expand its storage capacity flexibly. The Pulsar version currently deployed in our production environment is a modified build based on the official v2.6.1. All of our fixes have been shared with the community through GitHub, and the corresponding issues will be fixed in a future release.
We collect data along multiple dimensions through a Source cluster and use Pulsar Functions for operations such as real-time cleaning of the collected data. Pulsar topics use persistent storage across the entire pipeline, and Pulsar SQL [1] makes it convenient to trace data back at each stage. The Sink cluster persists the cleaned data.
In this pipeline, we use Pulsar's Delay Topic to identify when a session has completed, and a Dead Letter Topic to record messages that the sink side failed to consume.
During development, we found that in the real-time streaming (ordered) scenario of Pulsar Functions, processing was not interrupted after a Receive Fail response. We contacted the Pulsar community, submitted an issue and a PR, and received quick responses and support from the StreamNative team. The fix is currently targeted for Pulsar 2.8.0, and we have patched it internally on top of Pulsar 2.6.1.
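The dead-letter behavior can be illustrated without a broker. The following is a minimal in-memory simulation, not the Pulsar client API: `consume_with_dead_letter`, `max_redeliver`, and `flaky_sink` are hypothetical names chosen for this sketch, and the redelivery limit of 3 is an assumption rather than our production setting.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Message:
    payload: str
    redeliveries: int = 0  # how many times this message has been redelivered


def consume_with_dead_letter(
    payloads: Iterable[str],
    sink: Callable[[str], None],
    max_redeliver: int = 3,  # assumed limit, for illustration only
) -> Tuple[List[str], List[str]]:
    """Hand each payload to `sink`; a payload that still fails after
    `max_redeliver` redeliveries is recorded in the dead-letter list."""
    pending = [Message(p) for p in payloads]
    delivered: List[str] = []
    dead_letters: List[str] = []
    while pending:
        msg = pending.pop(0)
        try:
            sink(msg.payload)
            delivered.append(msg.payload)
        except Exception:
            if msg.redeliveries >= max_redeliver:
                # Give up and record the message for later inspection.
                dead_letters.append(msg.payload)
            else:
                # Requeue for another attempt (this simple queue does not
                # preserve ordering across retries).
                msg.redeliveries += 1
                pending.append(msg)
    return delivered, dead_letters


def flaky_sink(payload: str) -> None:
    """Hypothetical sink that can never persist the payload "bad"."""
    if payload == "bad":
        raise ValueError("cannot persist payload")


delivered, dead_letters = consume_with_dead_letter(["a", "bad", "c"], flaky_sink)
print(delivered, dead_letters)
```

Keeping failed messages in a separate list (a separate topic in Pulsar) means one unprocessable record does not block the rest of the stream and can be examined or replayed later.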
Online consultation clue analysis
The Boxuegu system uses a third-party online customer service system to provide online consultation on the web and on mobile. Previously, our use of online consultation session data was limited by the restrictions of the third-party service interfaces. As the business grew and the model changed, the team wanted to combine this data with the customer management system (CMS) to better understand customer needs and improve the efficiency of consultation and feedback.
The third-party system exposes a data query interface to integrators through an HTTP API and restricts access to that interface, which limits how the CMS can use session data.
After analysis and discussion, we designed and developed the HTTP Polling Source and Common JDBC Sink components based on Pulsar IO. They efficiently capture session data into an internal MySQL database for persistent storage and support data cleaning and conversion during collection, which greatly improves how efficiently session data can be used and broadens the scenarios in which it can be used.
HTTP Polling Source is a data collection message source based on the HTTP polling mechanism. It cyclically executes HTTP requests based on the configuration template, updates the synchronization state (Offset State) to State Storage after each request, and writes the request result to the downstream Pulsar Topic.
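The polling loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual HTTP Polling Source code: the request template, the `since`/`limit` query parameters, the record `id` field, and the `fake_fetch` stand-in for the HTTP call are all hypothetical, and a plain dict and list stand in for State Storage and the Pulsar topic.

```python
import json
from typing import Callable, Dict, List
from urllib.parse import parse_qs, urlparse

# Hypothetical request template; {offset} is filled in from the saved state.
REQUEST_TEMPLATE = "https://api.example.invalid/sessions?since={offset}&limit=2"


def poll_once(
    template: str,
    state: Dict[str, int],
    fetch: Callable[[str], List[dict]],
    publish: Callable[[bytes], None],
) -> int:
    """Run one polling cycle and return the number of records published."""
    url = template.format(offset=state.get("offset", 0))
    records = fetch(url)
    for record in records:
        publish(json.dumps(record, sort_keys=True).encode("utf-8"))
    if records:
        # Save the offset only after publishing, so that a crash in between
        # causes a re-fetch (at-least-once) rather than data loss.
        state["offset"] = max(r["id"] for r in records)
    return len(records)


# In-memory stand-ins for the third-party API, State Storage, and the topic.
SESSIONS = [{"id": i, "text": f"session-{i}"} for i in range(1, 6)]


def fake_fetch(url: str) -> List[dict]:
    query = parse_qs(urlparse(url).query)
    since, limit = int(query["since"][0]), int(query["limit"][0])
    return [s for s in SESSIONS if s["id"] > since][:limit]


state: Dict[str, int] = {}
topic: List[bytes] = []
while poll_once(REQUEST_TEMPLATE, state, fake_fetch, topic.append):
    pass  # keep polling until a cycle returns no new records
print(len(topic), state)
```

Because the offset lives outside the polling function, a restarted source can resume from the last saved position instead of re-reading the whole history.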
Common JDBC Sink uses the JDBC interface to persist structured object data. It supports general-purpose storage and processing of structured data across a variety of JDBC drivers: it covers all data types of the H2, MySQL, MariaDB, and PostgreSQL databases, and supports INSERT, UPDATE, UPSERT, DELETE, and Schema Migration operations.
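To make the UPSERT and DELETE operations concrete, here is a small sketch that uses Python's built-in `sqlite3` module in place of JDBC. It is not the Common JDBC Sink implementation: the `apply_record` function, the `sessions` table, and the record layout are assumptions for illustration. SQLite's `INSERT OR REPLACE` stands in for the upsert; MySQL and PostgreSQL use different syntax (`ON DUPLICATE KEY UPDATE` and `ON CONFLICT ... DO UPDATE`, respectively).

```python
import sqlite3
from typing import Dict


def apply_record(
    conn: sqlite3.Connection, table: str, key: str, op: str, record: Dict[str, object]
) -> None:
    """Apply one sink record to `table`.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted configuration; values are always bound as parameters.
    """
    if op == "UPSERT":
        columns = list(record)
        placeholders = ", ".join("?" for _ in columns)
        sql = (
            f"INSERT OR REPLACE INTO {table} ({', '.join(columns)}) "
            f"VALUES ({placeholders})"
        )
        conn.execute(sql, [record[c] for c in columns])
    elif op == "DELETE":
        conn.execute(f"DELETE FROM {table} WHERE {key} = ?", (record[key],))
    else:
        raise ValueError(f"unsupported operation: {op}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, status TEXT)")

apply_record(conn, "sessions", "id", "UPSERT", {"id": 1, "status": "open"})
apply_record(conn, "sessions", "id", "UPSERT", {"id": 1, "status": "closed"})
apply_record(conn, "sessions", "id", "UPSERT", {"id": 2, "status": "open"})
apply_record(conn, "sessions", "id", "DELETE", {"id": 2})

rows = conn.execute("SELECT id, status FROM sessions ORDER BY id").fetchall()
print(rows)
```

Upserts make the sink idempotent for replayed messages: writing the same record twice leaves the table in the same state as writing it once.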
User interaction behavior collection
The Boxuegu system uses a third-party commercial system for client-side user behavior analysis. The commercial system's analysis functions are limited, and it is not easy to relate its analysis dimensions to the concepts in our business system. Boxuegu needs to extract more value from user behavior data in order to provide customers with better services.
The commercial system provides a data subscription service based on an early version of Apache Kafka (v0.8), which the Kafka Source built into Pulsar does not support. After evaluating our options, we wrapped our existing subscription program, which supports Kafka v0.8, behind the Pulsar IO Source interface as the Legacy Kafka Source. This source consumes log messages from Kafka v0.8 and efficiently saves the subscribed data to a Pulsar topic, which supports flexible downstream data processing and functions such as abnormal behavior detection and learning effect evaluation.
Data change log collection
As business systems evolve, collecting business change logs has gradually become a burden on the R&D team. At present, the R&D team records the change history of business data in additional database tables, such as order change records and process flow records. Developers need to be familiar with the design of these tables and must carefully adjust the logging logic whenever a table structure changes. To ensure the integrity of key data, data changes and their logs must be written in the same transaction, which has some impact on system performance.
Through the MySQL Binlog Connector, which is based on the MySQL Replication protocol, data change events in the business system database can be synchronized to a Pulsar topic in real time, and Pulsar's streaming message processing ensures that messages are processed downstream once and in order. In this way, we can generate data change logs automatically, support automatic migration of DDL changes, and let downstream consumers persist business logs with multiple storage mechanisms (MySQL, ElasticSearch, etc.), which reduces intrusion into business system code and reduces the impact on business system performance.
MySQL Binlog Connector has two components: MySQL Binlog Source and MySQL Binlog Sink. MySQL Binlog Source collects raw Binlog Event data, sends messages downstream in units of transactions, and saves the Binlog Filename/Position or GTID Set to State Storage as the offset of the synchronized data. MySQL Binlog Sink replays the Binlog Event messages (in units of transactions) against downstream databases, synchronizing DML and DDL changes to downstream database instances.
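The "one message per transaction, offset saved per transaction" idea can be sketched as below. The event model is deliberately simplified and hypothetical: real binlog streams carry more event types than the `BEGIN`/`ROW`/`COMMIT` dictionaries used here, and `group_by_transaction` and `run_source` are illustrative names rather than the connector's code.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Offset = Tuple[str, int]  # (binlog filename, position)


def group_by_transaction(events: Iterable[dict]) -> Iterator[Tuple[List[dict], Offset]]:
    """Yield (row changes, offset) once for every committed transaction."""
    current: List[dict] = []
    for event in events:
        if event["type"] == "BEGIN":
            current = []
        elif event["type"] == "ROW":
            current.append(event["change"])
        elif event["type"] == "COMMIT":
            yield current, (event["file"], event["pos"])
            current = []
    # Rows of a transaction with no COMMIT yet are intentionally not emitted.


def run_source(events: Iterable[dict], topic: List[List[dict]], state: Dict[str, Offset]) -> None:
    """Publish one message per transaction, then record the offset reached."""
    for rows, offset in group_by_transaction(events):
        topic.append(rows)        # stand-in for writing to a Pulsar topic
        state["offset"] = offset  # stand-in for State Storage


events = [
    {"type": "BEGIN"},
    {"type": "ROW", "change": {"op": "INSERT", "table": "orders", "id": 1}},
    {"type": "ROW", "change": {"op": "UPDATE", "table": "orders", "id": 1}},
    {"type": "COMMIT", "file": "mysql-bin.000001", "pos": 300},
    {"type": "BEGIN"},
    {"type": "ROW", "change": {"op": "DELETE", "table": "orders", "id": 1}},
    {"type": "COMMIT", "file": "mysql-bin.000001", "pos": 520},
    {"type": "BEGIN"},
    {"type": "ROW", "change": {"op": "INSERT", "table": "orders", "id": 2}},
]

topic: List[List[dict]] = []
state: Dict[str, Offset] = {}
run_source(events, topic, state)
print(len(topic), state)
```

Emitting whole transactions lets the sink replay them atomically, and the saved offset always points at a transaction boundary, so a restart never resumes in the middle of a transaction.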
Real-time data desensitization and synchronization
When developing data processing systems, data security has always been a focus of the R&D team. How to better mine the value of data while ensuring that sensitive information is not accessed illegally has become an urgent issue. At present, our team uses Alibaba Cloud DTS or internal ETL tools to synchronize business data to an analytical database (OLAP) for data analysis, but this type of solution cannot adequately desensitize sensitive information during synchronization.
Building on our work on the data change log collection module, we designed and implemented a real-time data desensitization and synchronization solution based on MySQL Binlog Source. The solution takes the Binlog Event information saved in a Pulsar topic, applies desensitization functions developed with Pulsar Functions, selects the desensitization method through a rule engine, and then persists the desensitized data to the analytical database through Common JDBC Sink. This improves the scalability and flexibility of data synchronization.
We use Pulsar to solve the low collection efficiency and high latency of the original collection system, and to accommodate the different collection methods of multiple data sources. For synchronizing the production business database, we use Pulsar to replace the original, costly DTS solution and perform data desensitization in a chained manner, which ensures data security and lets the data analysis team use data better and more efficiently.
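A rule-driven desensitization step might look like the following sketch. It assumes a much simpler rule model than a full rule engine: `RULES` is a hypothetical mapping from (table, column) to a masking function, and the table, columns, and sample values are invented for illustration. In a real deployment the function body would run inside a Pulsar Function between the Binlog topic and the sink.

```python
import hashlib
from typing import Callable, Dict, Tuple


def mask_phone(value: str) -> str:
    """Keep the first 3 and last 4 characters and mask the rest."""
    if len(value) <= 7:
        return "*" * len(value)
    return value[:3] + "*" * (len(value) - 7) + value[-4:]


def hash_value(value: str) -> str:
    """Replace a value with a truncated SHA-256 digest. An unsalted hash is
    shown only for brevity; it is not strong anonymization on its own."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]


# Hypothetical rule table: (table, column) -> desensitization method.
RULES: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("users", "phone"): mask_phone,
    ("users", "email"): hash_value,
}


def desensitize(event: dict) -> dict:
    """Return a copy of a row-change event with matching columns masked."""
    table = event["table"]
    row = {}
    for column, value in event["row"].items():
        rule = RULES.get((table, column))
        row[column] = rule(value) if rule else value
    return {**event, "row": row}


event = {
    "table": "users",
    "op": "INSERT",
    "row": {"id": 42, "phone": "13812345678", "email": "user@example.com"},
}
masked = desensitize(event)
print(masked["row"]["phone"])
```

Keeping the rules in data rather than in code means a new sensitive column can be covered by adding an entry to the rule table, without changing the function that processes the stream.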
Future plans
Based on the overall plan for Chuanzhi Education's informatization, and taking into account the actual needs of Boxuegu, we will continue to explore the value of the data processing system and make better use of Apache Pulsar, an excellent messaging system, to support system operation and business development.
- Simplify the development of business log functions through data change log collection solutions
- Replace Alibaba Cloud DTS with a real-time data desensitization synchronization solution
- Implement abnormal user behavior detection, learning effect evaluation, and operation history playback
- Build a cross-departmental data exchange system
Thanks
Thanks to the Apache Pulsar community and the StreamNative team for their support. The construction and future development of the Boxuegu data processing system are inseparable from the outstanding contributions of the open source community. The Boxuegu R&D team will continue to promote the use of Apache Pulsar in the company's business systems, and will encourage team members to participate more in open source community activities and grow together with everyone.
Summary
In the process of evaluating and using Pulsar, we have made full use of Pulsar Functions, Pulsar IO, and many other native Pulsar features, and have made some optimizations of our own where needed. As a next-generation, cloud-native, distributed messaging and streaming platform, Pulsar has a very active and growing community. In the future, we plan to build a multi-dimensional data flow rule engine based on Pulsar, use Pulsar to build basic middleware services for the group's e-commerce platform, and expand the application scenarios of Pulsar in Chuanzhi Education.
About the authors
Sun Changyu, Head of R&D, Boxuegu, Chuanzhi Education
Liu Zilin, Infrastructure R&D Engineer, Boxuegu
Related Reading
- Apache Pulsar's landing practice in the field of energy Internet
- Apache Pulsar on Tencent Angel PowerFL Federal Learning Platform
- The performance tuning of Apache Pulsar in BIGO (Part 2)
References
[1] Pulsar SQL: https://pulsar.apache.org/docs/en/sql-overview/
[2] The official website of Chuanzhi Education: https://pulsar.apache.org/docs/en/standalone/
[4] Debezium official website: https://trino.io/
[6] Binlog Connector: https://github.com/shyiko/mysql-binlog-connector-java
[9] DTS: https://help.aliyun.com/product/26590.html