1. Overview
Since its release, Apache Kafka has matured into a reliable message queue and an indispensable member of the big data ecosystem. The Apache Kafka community is very active; through its members' continuous code contributions and project iterations, Kafka has gained richer functionality and more stable performance, and has become an important part of enterprise big data architectures.
As a popular message queue middleware, Apache Kafka offers efficient, reliable message processing and has a very wide range of applications. So today, let's talk about the practical application of a Kafka-based real-time data warehouse in search.
2. Why do you need Kafka
Before designing a big data architecture, some technical research is usually done. We ask ourselves: why do we need Kafka, and how do we judge whether Kafka can meet the current technical requirements?
2.1 Early Data Architecture
Early data types were relatively simple, and the business architecture was equally simple: just store the required data. For example, game data was stored in a database (MySQL, Oracle). However, as the business grew, so did the variety of stored data, and we needed a big data cluster with a data warehouse to classify and store it, as shown in the following figure:
Storing data in a data warehouse, however, introduces latency, usually T+1. Today's data consumers, such as the Internet of Things, microservices, and mobile apps, have strict latency requirements and need this data processed in real time.
2.2 The emergence of Kafka
The emergence of Kafka provides a new storage option for increasingly complex businesses: store all kinds of business data uniformly in Kafka, and then distribute it from Kafka. As shown below:
Here, different types of data such as videos, games, and music can all be stored in Kafka, and the data in Kafka can then be consumed by stream processing, for example to load it into a data warehouse, or to store computed results in a KV store for real-time analysis.
There are usually two common messaging models:
- Message queue : the consumers act as a work group, and each message is delivered to only one worker process, which effectively divides the work;
- Publish & subscribe : the consumers are independent of each other, and each consumer receives a copy of every message.
Both models are effective and practical: a message queue divides work for fault tolerance and scaling, while publish-subscribe supports multiple tenants and decouples systems. One of the advantages of Apache Kafka is that it combines message queuing and publish-subscribe into one powerful messaging system, as the consumer sketch below illustrates.
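In Kafka, the two models collapse into one mechanism: consumer groups. Here is a minimal sketch, assuming a local broker and a hypothetical `game-events` topic; consumers sharing a `group.id` split the partitions between them (queue semantics), while consumers with different `group.id`s each receive every record (publish-subscribe semantics):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupSemanticsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Same group.id across processes = queue semantics (work is divided);
        // distinct group.ids = publish-subscribe (everyone gets a full copy).
        props.put("group.id", "worker-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("game-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```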
At the same time, Kafka has features that make it well suited to message processing, mainly reflected in the following aspects:
- Scalability : when Kafka's performance (storage, throughput, etc.) reaches a bottleneck, it can be scaled horizontally;
- Reliable storage : Kafka persists data to disk and replicates it, so data is not lost when the cluster restarts or a broker fails (see the producer sketch after this list);
- Real-time processing : it integrates with mainstream computing engines (such as Flink and Spark) to process data in real time;
- Sequential writes : the log is written with sequential disk I/O, which avoids disk-head seek time and improves read/write speed;
- Memory mapping : the operating system's page cache is used to improve I/O performance; files are mapped to memory, and flushing to disk can be synchronous or asynchronous;
- Zero copy : data is copied from the disk file into the page cache once, then sent from the page cache directly to the network;
- Efficient storage : each topic partition is split into multiple segment files, and expired segments are cleaned up periodically; the index is sparse, with an entry only every few kilobytes of messages, which keeps the index files from growing too large.
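On the durability point, how much safety you actually get also depends on producer settings. A minimal sketch, assuming a local broker and the same hypothetical `game-events` topic, showing the settings that make acknowledged writes survive a broker failure:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader waits for the full in-sync replica set before
        // acknowledging, so an acknowledged record survives a broker failure.
        props.put("acks", "all");
        // Idempotence prevents duplicate records when sends are retried.
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("game-events", "userA", "recharge:10"));
        }
    }
}
```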
2.3 Simple application scenarios
Here, a simple and intuitive scenario can illustrate what Kafka is for.
Scenario: user A is playing a game. One day A takes a liking to an in-game item and decides to buy it, so A recharges 10 yuan at 14:00. While browsing the game store, A likes another item and recharges another 30 yuan at 14:30. At 15:00 A places an order that costs 20 yuan, leaving a balance of 20 yuan. The whole event stream, and the corresponding data details in the database table, should look like the following figure:
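In table form, the event stream described above amounts to the following (a sketch of the records, not the exact table layout):

| Time | Event | Amount (yuan) | Balance (yuan) |
|-------|-------------|---------------|----------------|
| 14:00 | Recharge | +10 | 10 |
| 14:30 | Recharge | +30 | 40 |
| 15:00 | Place order | -20 | 20 |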
3. What problem does Kafka solve
In the early days, to get a project online quickly, a web server was deployed on a physical or cloud server to serve PC and mobile users, with a database behind it to give the web application persistence and queries. The process is shown in the figure below:
However, as users grew rapidly, routing all user access directly through the SQL database overwhelmed it; the pressure on the database kept increasing, and a cache service had to be added to reduce its load.
At the same time, to understand user behavior, logs are collected into a big data cluster such as Hadoop for offline processing, and also fed into a full-text retrieval system (such as Elasticsearch) to locate problems quickly. Because investors need to see the state of the business, the data must also be aggregated into a data warehouse (such as Hive) to provide interactive reports. At this point the system architecture already has real complexity, and real-time modules and external data exchange may be added later.
Essentially, this is a data integration problem. No single system can do everything, so business data is stored in different systems for different purposes: archiving, analysis, search, caching, and so on. Data redundancy is not inherently wrong, but overly complex data synchronization between systems is a real challenge. As shown below:
Kafka lets the right data appear in the right place in the right form. Its approach is to provide a message queue: producers append data to the end of the log, and multiple consumers read from it in turn and process the data themselves. If the complexity of the previous point-to-point connections was O(N^2), it is now reduced to O(N), and the system is much easier to extend. The process is shown in the following figure:
4. Practical application of Kafka
4.1 Why do we need to build a real-time data warehouse?
4.1.1 Purpose
Usually, in big data scenarios, a data warehouse built to store massive data is an offline warehouse (with T+1 latency): scheduled tasks pull incremental data every day, build the different dimensions of each business, and then provide T+1 data services externally. Both computation and data freshness are relatively poor, and business users cannot get data from just a few minutes ago when they need it in real time. The value of data weakens as time passes, so data must reach its users as soon as possible after it is generated; the need to build real-time data warehouses comes from exactly this.
4.1.2 Objectives
To keep up with fast business iteration, analyze user behavior, mine user value, improve user retention, and provide better real-time data availability, scalability, ease of use, and accuracy, it is necessary to build a real-time data warehouse. The main goals include the following:
- A unified, converged data exit: unify data calibers and reduce duplicated data construction;
- Lower data maintenance costs: improve data accuracy and timeliness, and optimize the cost and experience of using data;
- Lower data usage costs: improve the data reuse rate and avoid repeated consumption of real-time data.
4.2 How to build a real-time data warehouse to provide data for search
The current mainstream real-time data warehouse architecture generally includes three large modules: the message queue, the computing engine, and storage. Combining the analysis of Kafka above with the business scenario of search, we introduce Kafka as the message queue and reuse the capabilities of our big data platform (BDSP) as the computing engine and storage. The specific architecture is shown in the following figure:
4.3 Stream Processing Engine Selection
At present there are two common stream processing engines in the industry: Flink and Spark. So how do we choose between them? We can compare the following characteristics to decide.
Flink is an open source big data stream computing engine that also supports unified stream-batch processing. The main reasons for introducing Flink as the stream engine for our real-time data warehouse are as follows:
- High throughput and low latency;
- Flexible streaming windows (see the sketch after this list);
- A lightweight fault tolerance mechanism;
- Stream-batch unification.
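As a taste of those windows, here is a minimal sketch of a Flink job, assuming a local broker, a hypothetical `ods_click_log` topic whose records are `userId,itemId` CSV lines, and processing-time semantics; it counts clicks per user in one-minute tumbling windows:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class ClickCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topic and broker address, for illustration only.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("ods_click_log")
                .setGroupId("rt-dw-demo")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> clicks =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // Assumes each record is a "userId,itemId" CSV line; counts clicks
        // per user in one-minute tumbling windows.
        clicks.map(line -> Tuple2.of(line.split(",")[0], 1L))
              .returns(Types.TUPLE(Types.STRING, Types.LONG))
              .keyBy(t -> t.f0)
              .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
              .sum(1)
              .print();

        env.execute("click-count-demo");
    }
}
```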
4.4 Problems encountered in building real-time data warehouses
In the early stage of construction, the Kafka cluster used for real-time processing was small, while the data volume of a single topic was very large. Different real-time tasks all consumed the same high-volume topic, which put enormous I/O pressure on the Kafka cluster.
As a result, during use we kept seeing very high pressure on Kafka, with frequent latency and I/O performance alarms. We therefore distributed each high-volume topic in real time to solve this problem. Based on Flink, we designed the data distribution process shown in the following figure:
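One way to implement that distribution step is to let the Flink job route each record from the shared topic to a per-business topic, so that downstream jobs only subscribe to the slice they need. A minimal sketch, assuming records carry a business prefix in a `business|payload` layout (both the layout and the topic names are hypothetical):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;

public class TopicFanOut {
    // Attach this sink to the DataStream consuming the big shared topic;
    // each record lands in a per-business topic such as "dw-game" or
    // "dw-video" (record layout "business|payload" is assumed here).
    static void sinkByBusiness(DataStream<String> records) {
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092") // assumed broker address
                .setRecordSerializer(KafkaRecordSerializationSchema.<String>builder()
                        .setTopicSelector(record -> "dw-" + record.split("\\|")[0])
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();
        records.sinkTo(sink);
    }
}
```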
As business types and data volume grow, this process faces new problems:
- The data volume increases; as consumption tasks multiply, consumption is affected whenever the Kafka cluster's I/O load is high;
- If the topics consumed across businesses are not persisted (to HDFS, HBase, etc.), the same data is consumed repeatedly;
- Data coupling is too high, making it hard to migrate data and tasks.
4.5 Advanced real-time data warehouse solution
At present there are two mainstream real-time data warehouse architectures: Lambda and Kappa.
4.5.1 Lambda
As real-time requirements appeared, in order to quickly compute some real-time indicators (such as real-time clicks and impressions), a real-time computing link is added on top of the offline data warehouse architecture: the data source is written to a message queue, the stream computing engine consumes the data in the queue and computes the indicators incrementally, and pushes them to the downstream data service, where the offline and real-time results are merged. The specific process is as follows:
4.5.2 Kappa
The Kappa architecture cares only about stream computing: data is written to Kafka as a stream, and a real-time computing engine such as Flink stores the computed results in the data service layer for query. It can be seen as the Lambda architecture with the offline data warehouse part removed. The specific process is as follows:
In actually building our real-time data warehouse, we combine the ideas of both architectures. The real-time warehouse introduces a layered design similar to the offline warehouse, mainly to improve model reuse, while also considering ease of use, consistency, and computing cost.
4.5.3 Real-time data warehouse layering
In the advanced construction of a real-time data warehouse, the layered design is not as complicated as in offline warehouses; this avoids the unnecessary latency of overly long computation links. The specific flow chart is as follows:
- ODS layer : Kafka is used as the message queue, and all data that needs real-time computation is put into the corresponding topics for processing;
- DW layer : Flink consumes the topic data in real time, then cleans it and joins it (multi-dimensional JOIN) with dimension tables that share the same dimensions and feature attributes across business systems, providing ease of use and reusability and finally producing real-time detail data;
- DIM layer : stores the dimension information for those joins; the storage medium can be chosen as needed, such as HBase, Redis, or MySQL (a lookup sketch follows this list);
- DA layer : aggregates and summarizes data according to the needs of real-time scenarios, serving KV, BI, and other use cases; OLAP analysis can use ClickHouse, and KV can use HBase (or Redis if the data volume is small).
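As a rough sketch of the DW/DIM interaction, here is a Flink map function that enriches each ODS record with user dimension attributes from Redis; the `dim:user:<id>` key layout and the `userId|payload` record format are assumptions for illustration:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import redis.clients.jedis.Jedis;

// DW-layer enrichment sketch: for each ODS record, look up the user's
// dimension attributes in the DIM layer (Redis here) and append them.
public class DimEnrichFunction extends RichMapFunction<String, String> {
    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) {
        jedis = new Jedis("localhost", 6379); // assumed Redis address
    }

    @Override
    public String map(String record) {
        String userId = record.split("\\|")[0];          // "userId|payload" assumed
        String dim = jedis.get("dim:user:" + userId);    // e.g. region/channel
        return record + "|" + (dim == null ? "unknown" : dim);
    }

    @Override
    public void close() {
        if (jedis != null) {
            jedis.close();
        }
    }
}
```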
With this layering in place, tasks with strict real-time computing requirements no longer affect BI reports or KV queries. However, new problems arise:
- How do we inspect the real-time data in Kafka?
- How do we analyze a consumption task when it behaves abnormally?
4.5.4 Kafka Monitoring
To address these problems, we investigated and introduced a Kafka monitoring system, Kafka Eagle (since renamed EFAK), and reuse the monitoring functions for the dimensions that matter most to us.
Besides covering the two monitoring needs above, Kafka Eagle also provides some very practical day-to-day functions, such as viewing topic records, viewing topic capacity, production and consumption task rates, and consumption backlog. We use Kafka Eagle as the task monitor for the real-time data warehouse. Its system design architecture is shown in the following figure:
Kafka Eagle is a fully open source system for comprehensive monitoring of Kafka clusters and applications. Its core consists of the following parts:
- Data acquisition : the core data sources are JMX and API collection (see the JMX sketch after this list);
- Data storage : supports MySQL and SQLite storage;
- Data display : consumer applications and chart trend monitoring (including cluster status, production and consumption rates, consumption backlog, etc.), plus a home-grown distributed KSQL query engine for querying messages with KSQL;
- Data alarms : supports common IM alarms (WeChat, DingTalk, WebHook, etc.) as well as email, SMS, and phone alarms.
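To show what JMX collection looks like, here is a minimal probe, assuming a broker started with JMX enabled on port 9999 (e.g. `JMX_PORT=9999`); it reads the broker's well-known `MessagesInPerSec` metric:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerJmxProbe {
    public static void main(String[] args) throws Exception {
        // Connect to the broker's JMX endpoint (port is an assumption).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName msgIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            // OneMinuteRate is the smoothed messages-in rate Kafka exposes.
            Object rate = mbsc.getAttribute(msgIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1m rate): " + rate);
        }
    }
}
```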
Some preview screenshots are as follows:
1) Topic write volume distribution in the last 7 days
By default, the daily write volume distribution of all topics is displayed; you can also choose the time dimension and the topic aggregation dimension to view the distribution. The preview screenshot is as follows:
2) KSQL query topic message record
You can query the message records in a topic by writing SQL statements (filter conditions are supported). The preview screenshot is as follows:
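As a rough illustration (the topic name is hypothetical, and the syntax follows the Kafka Eagle documentation), a KSQL query looks like this:

```sql
select * from "ods_click_log" where "partition" in (0,1,2) limit 10
```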
3) Consumption Topic Backlog Details
You can monitor the consumption rate, consumption backlog and other details of all consumed topics. The preview screenshot is as follows:
5. References
1. https://kafka.apache.org/documentation/
2. https://github.com/smartloli/kafka-eagle
Author: vivo Internet Server Team - Deng Jie