About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native, distributed messaging and streaming platform that integrates messaging, storage, and lightweight function computing. It supports multi-datacenter and cross-region data replication, and offers streaming data storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/
This article is reposted from the Apache InLong public account. Original address: https://mp.weixin.qq.com/s/WgVJzu77Hncu-okce8_qaQ
Apache InLong adds the ability to ingest data through Apache Pulsar. It takes full advantage of the technical strengths that set Pulsar apart from other MQs, and provides a complete solution for data access scenarios with higher data quality requirements, such as finance and billing. Below, we walk through a complete example of using Apache Pulsar to ingest data via Apache InLong.
Introduction to Apache InLong (incubating)
Apache InLong (https://inlong.apache.org) is a one-stop data stream access service platform donated by Tencent to the Apache community. It provides automatic, secure, reliable, and high-performance data transmission so that businesses can easily build streaming-based data analysis, modeling, and applications. The project was formerly known as TubeMQ, which focused on high-performance, low-cost message queuing services. To further open up the ecosystem around TubeMQ, we upgraded the project to InLong, with a focus on building a one-stop data stream access service platform. Apache InLong is based on TDBank, which is used internally at Tencent. Drawing on trillion-scale data access and processing capabilities, Apache InLong integrates the entire process of data collection, aggregation, storage, sorting, and processing. It is easy to use, flexibly scalable, stable, and reliable.
Apache InLong serves the entire data life cycle, from collection to landing, and provides a processing module for each stage:
- inlong-agent, the data collection agent. It reads regular logs from specified directories or files and reports them record by record. DB collection and HTTP reporting are planned for the future;
- inlong-dataproxy, a Flume-ng-based proxy component. It supports blocking data transmission and retransmission from disk, and can forward received data to different MQs (message queues);
- inlong-tubemq, Tencent's self-developed message queue service. It focuses on high-performance storage and transmission of massive data in big data scenarios, with the core advantages of proven large-scale production use and low cost;
- inlong-sort, which performs ETL processing on the data consumed from different MQs, then aggregates it and writes it to storage systems such as Hive, ClickHouse, HBase, and Iceberg;
- inlong-manager, which provides complete data service management and control capabilities, including metadata, task flow, permissions, and OpenAPI;
- inlong-website, the front-end pages for managing data access, which simplify the use of the entire InLong management and control platform.
About Apache Pulsar
Apache Pulsar is a Pub/Sub messaging system designed to separate storage from compute. This separation, together with its segment-based (sharded) storage design, gives Apache Pulsar several advantages over traditional MQs that use partition-based storage:
- Brokers and Bookies are independent of each other, so each layer can be scaled and can tolerate faults independently;
- Brokers are stateless, so they can be brought online and taken offline quickly, which suits cloud-native scenarios;
- Segment-based storage is not limited by the storage capacity of a single node;
- Segment data is evenly distributed across storage nodes.
Prerequisites
- Install Apache Pulsar, version 2.6+
- Install Apache Hive, version 2.3+
Install InLong
You can deploy InLong with one click using Docker Compose, or deploy it on bare-metal machines from the binary installation package.
- Docker Compose deployment: https://inlong.apache.org/en-US/docs/next/deployment/docker
- Deploy using the installation package: https://inlong.apache.org/zh-CN/docs/next/deployment/bare_metal
Unlike with InLong TubeMQ, when you use Apache Pulsar you need to configure the Pulsar cluster information while installing the Manager component. The format is as follows:
# Pulsar admin URL
pulsar.adminUrl=http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080
# Pulsar broker address
pulsar.serviceUrl=pulsar://127.0.0.1:6650,127.0.0.2:6650,127.0.0.3:6650
# Default tenant of Pulsar
pulsar.defaultTenant=public
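Before starting the Manager, it can help to confirm that each admin endpoint actually answers. The following is a minimal sketch, not part of InLong itself: it assumes curl is installed and reuses the example addresses from pulsar.adminUrl. The split_endpoints helper is a hypothetical name introduced only for this illustration; /admin/v2/clusters is a path in the Pulsar admin REST API.

```shell
# Value copied from the pulsar.adminUrl example; adjust to your cluster.
ADMIN_URL="http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080"

# Hypothetical helper: split "http://h1:p,h2:p,..." into one host:port per line.
split_endpoints() {
  printf '%s\n' "${1#http://}" | tr ',' '\n'
}

for ep in $(split_endpoints "$ADMIN_URL"); do
  # Probe the Pulsar admin REST API on each endpoint.
  if curl -fsS --max-time 3 "http://${ep}/admin/v2/clusters" >/dev/null 2>&1; then
    echo "${ep}: reachable"
  else
    echo "${ep}: unreachable"
  fi
done
```

An endpoint reported as unreachable usually points to a wrong address or port, or to a broker that is not running.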
Create data access
Configure data flow group information
When creating data access, select Pulsar as the message middleware for the data flow group. The other Pulsar-related configuration items are:
- Queue model: parallel or sequential. With parallel, you can set the number of Topic partitions; with sequential, the Topic has a single partition;
- Write quorum: the number of copies written for each message;
- Ack quorum: the number of Bookies that must acknowledge a write;
- Retention time: how long messages that have been acknowledged by consumers are kept;
- TTL: the expiration time of unacknowledged messages;
- Retention size: the total size of acknowledged messages that is kept.
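For orientation, these fields correspond to Pulsar namespace policies, which can be read back with pulsar-admin once the data flow group has been approved and its namespace exists. This is a sketch under several assumptions: that InLong applies the settings at the namespace level, that the namespace is public/b_test_group (the group used in the troubleshooting example later in this article), and that the CLI lives at bin/pulsar-admin.

```shell
# Assumed values; adjust to your deployment.
PULSAR_ADMIN="${PULSAR_ADMIN:-bin/pulsar-admin}"
NAMESPACE="${NAMESPACE:-public/b_test_group}"

if command -v "$PULSAR_ADMIN" >/dev/null 2>&1; then
  # Ensemble size, write quorum, and ack quorum
  "$PULSAR_ADMIN" namespaces get-persistence "$NAMESPACE"
  # Retention time and retention size for acknowledged messages
  "$PULSAR_ADMIN" namespaces get-retention "$NAMESPACE"
  # TTL for unacknowledged messages
  "$PULSAR_ADMIN" namespaces get-message-ttl "$NAMESPACE"
else
  echo "pulsar-admin not found at $PULSAR_ADMIN" >&2
fi
```

If the values differ from what was entered in the form, the policies may have been changed on the cluster after the group was created.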
Configure data flow
When configuring the message source, refer to the File Agent guide in inlong-agent for the file path of the file data source.
Configure the data format
Configure the Hive cluster
Save the Hive sink settings and click Submit for Approval.
Data access approval
Go to the approval management page, click My Approval, and approve the access application submitted above. After the approval is completed, the topics and subscriptions required for the data flow will be created in the Pulsar cluster synchronously.
We can use the command line tool in the Pulsar cluster to check whether the topic was created successfully:
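A minimal sketch of such a check with pulsar-admin is shown here. The namespace and topic names are taken from the troubleshooting example in this article (public/b_test_group and test_stream); the bin/pulsar-admin path is an assumption, so adjust both to your environment.

```shell
# Assumed values; adjust to your deployment.
PULSAR_ADMIN="${PULSAR_ADMIN:-bin/pulsar-admin}"
NAMESPACE="${NAMESPACE:-public/b_test_group}"
TOPIC="persistent://${NAMESPACE}/test_stream"

if command -v "$PULSAR_ADMIN" >/dev/null 2>&1; then
  # Topics in the namespace
  "$PULSAR_ADMIN" topics list "$NAMESPACE"
  # Partitioned topics (relevant when the parallel queue model was chosen)
  "$PULSAR_ADMIN" topics list-partitioned-topics "$NAMESPACE"
  # Subscriptions created for the data flow
  "$PULSAR_ADMIN" topics subscriptions "$TOPIC"
else
  echo "pulsar-admin not found at $PULSAR_ADMIN" >&2
fi
```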
Configure the file Agent
When configuring the file Agent, create the file at the path specified when the data access was created:
touch /data/test_file.txt;
Write data to the file in the data source format defined when the data stream was created (more records can be written in the same format):
echo -e "1|test\n2|test\n" >> /data/test_file.txt
Verify data landing
Finally, we log in to the Hive cluster and use a Hive SQL command to check whether the data has been successfully inserted into the test_stream table.
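As a sketch, the check can be a simple query run through the hive CLI. It assumes the table is in the default database and that hive is on the PATH; if you use another client, run the same statement there.

```shell
# Table name from this walkthrough; the database is assumed to be the default one.
HIVE_TABLE="${HIVE_TABLE:-test_stream}"
QUERY="SELECT * FROM ${HIVE_TABLE} LIMIT 10;"

if command -v hive >/dev/null 2>&1; then
  # Show up to 10 ingested rows
  hive -e "$QUERY"
else
  echo "hive CLI not found; run this in your Hive client instead: $QUERY" >&2
fi
```

With the two sample records written earlier, the query should return rows with the values 1|test and 2|test split into the configured columns.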
Troubleshooting
If the data is not written to the Hive cluster correctly, check whether the DataProxy and Sort information has been synchronized:
- Check whether the Topic information for the data stream has been written correctly to the conf/topics.properties file of InLong DataProxy:
b_test_group/test_stream=persistent://public/b_test_group/test_stream
- Check whether the configuration of the data stream has been pushed to the ZooKeeper instance that InLong Sort watches:
get /inlong_hive/dataflows/{{sink_id}}
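Both checks can be scripted. The sketch below makes several assumptions: lookup_topic is a hypothetical helper written for this illustration, DATAPROXY_HOME and ZK_ADDR are placeholder variables, and zkCli.sh is assumed to be on the PATH. SINK_ID is left empty on purpose, because the sink ID has to come from your own data flow.

```shell
# Hypothetical helper: lookup_topic FILE GROUP_ID STREAM_ID
# Prints the topic mapped to "<group>/<stream>" in a topics.properties file.
lookup_topic() {
  awk -F= -v key="$2/$3" '$1 == key { print substr($0, length(key) + 2); exit }' "$1"
}

# Assumed locations; adjust to your deployment.
TOPICS_FILE="${DATAPROXY_HOME:-.}/conf/topics.properties"
ZK_ADDR="${ZK_ADDR:-127.0.0.1:2181}"
SINK_ID="${SINK_ID:-}"   # set this to the sink ID of your data flow

# Check 1: the DataProxy topic mapping
if [ -f "$TOPICS_FILE" ]; then
  lookup_topic "$TOPICS_FILE" b_test_group test_stream
else
  echo "topics.properties not found at $TOPICS_FILE" >&2
fi

# Check 2: the data flow configuration that InLong Sort reads from ZooKeeper
if [ -n "$SINK_ID" ] && command -v zkCli.sh >/dev/null 2>&1; then
  zkCli.sh -server "$ZK_ADDR" get "/inlong_hive/dataflows/${SINK_ID}"
fi
```

An empty result from the first check, or a missing node in the second, suggests that the Manager has not pushed the configuration for this data stream yet.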