About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. It supports multi-datacenter and cross-region data replication, and offers streaming data storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

This article is reposted from the WeChat official account Apache InLong. Original article: https://mp.weixin.qq.com/s/WgVJzu77Hncu-okce8_qaQ

Apache InLong has added support for ingesting data through Apache Pulsar. It takes full advantage of the technical strengths that set Pulsar apart from other MQs, and provides a complete solution for data access scenarios with high data quality requirements, such as finance and billing. Below, we walk through a complete example of using Apache Pulsar to ingest data via Apache InLong.

Introduction to Apache InLong (incubating)

Apache InLong (https://inlong.apache.org) is a one-stop data stream access service platform donated by Tencent to the Apache community. It provides automatic, secure, reliable, and high-performance data transmission capabilities, so that businesses can conveniently build stream-based data analysis, modeling, and applications. The InLong project was formerly known as TubeMQ, which focused on high-performance, low-cost message queuing services. To further unlock the ecosystem around TubeMQ, we upgraded the project to InLong, focusing on building a one-stop data stream access service platform. Apache InLong is based on TDBank, which is used internally at Tencent. Drawing on trillion-level data access and processing capabilities, Apache InLong integrates the entire process of data collection, aggregation, storage, sorting, and data processing. It is easy to use, flexibly scalable, stable, and reliable.

Apache InLong covers the entire data life cycle, from collection to landing in storage, and provides a processing module for each stage:

  • inlong-agent, a data collection agent that reads regular logs from specified directories or files and reports them one by one. Support for DB collection and HTTP reporting is planned;
  • inlong-dataproxy, a Flume-ng-based proxy component that supports data transmission blocking and on-disk retransmission, and can forward received data to different MQs (message queues);
  • inlong-tubemq, Tencent's self-developed message queue service, which focuses on high-performance storage and transmission of massive data in big data scenarios. Its core advantages are extensive large-scale production practice and low cost;
  • inlong-sort, which performs ETL processing on data consumed from different MQs, then aggregates it and writes it to storage systems such as Hive, ClickHouse, HBase, and Iceberg;
  • inlong-manager, which provides complete data service management and control capabilities, including metadata, task flow, permissions, and OpenAPI;
  • inlong-website, the front-end pages for managing data access, which simplify use of the entire InLong management and control platform.

About Apache Pulsar

Apache Pulsar is a Pub/Sub messaging system designed around the separation of storage and compute. This separation, together with its sharded storage design, gives Apache Pulsar several advantages over traditional MQs that use partition-based storage:

  • Brokers and Bookies are independent of each other, which allows independent scaling and independent fault tolerance;
  • Brokers are stateless, so they can be brought online and taken offline quickly, which suits cloud-native scenarios;
  • A partition's storage is not limited by the storage capacity of a single node;
  • Partition data is evenly distributed.

Prerequisites

  • Install Apache Pulsar, version 2.6+
  • Install Apache Hive, version 2.3+

Install InLong

You can deploy InLong with one click using Docker Compose, or deploy it on regular machines from the binary packages.

Unlike with InLong TubeMQ, when you use Apache Pulsar you need to configure the Pulsar cluster information while installing the Manager component. The format is as follows:

# Pulsar admin URL
pulsar.adminUrl=http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080
# Pulsar broker address
pulsar.serviceUrl=pulsar://127.0.0.1:6650,127.0.0.2:6650,127.0.0.3:6650
# Default tenant of Pulsar
pulsar.defaultTenant=public
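
Note that in both URL lists the scheme appears only once, before the first host. As a minimal sketch (the helper name split_pulsar_urls is our own and is not part of InLong or Pulsar), the following shell function expands such a list into one full URL per line, which can be handy when checking each address individually:

```shell
# Hypothetical helper, not part of InLong or Pulsar.
# Expand "scheme://host1:port,host2:port" into one full URL per line.
split_pulsar_urls() {
  local value="$1"
  local scheme="${value%%://*}"   # text before "://", e.g. http or pulsar
  local hosts="${value#*://}"     # comma-separated host:port pairs
  local host
  local IFS=','                   # split the host list on commas
  for host in $hosts; do
    printf '%s://%s\n' "$scheme" "$host"
  done
}

split_pulsar_urls "http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080"
```

Each admin URL printed this way can then be probed with, for example, curl -s "$url/admin/v2/clusters", which asks the Pulsar admin REST API for its list of clusters.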

Create data access

Configure data flow group information

When creating data access, select Pulsar as the message middleware for the data flow group. The other Pulsar-related configuration items are:

  • Queue model: parallel or sequential. When parallel is selected, the number of topic partitions can be set; sequential uses a single partition;
  • Write quorum: the number of copies written for each message;
  • Ack quorum: the number of Bookies that must acknowledge a write;
  • Retention time: how long messages already acknowledged by consumers are kept;
  • TTL: the expiration time of unacknowledged messages;
  • Retention size: the maximum size of acknowledged messages that are kept.
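
The two quorum settings are related: a write can only be acknowledged by Bookies that received a copy, so the ack quorum must not exceed the write quorum. Here is a minimal sketch of a sanity check to run before submitting the form (check_quorum is a hypothetical helper, not an InLong command):

```shell
# Hypothetical helper: succeed only if the quorum settings are consistent,
# that is, write quorum >= ack quorum >= 1.
check_quorum() {
  local write_quorum="$1" ack_quorum="$2"
  [ "$ack_quorum" -ge 1 ] && [ "$write_quorum" -ge "$ack_quorum" ]
}

if check_quorum 3 2; then
  echo "write quorum 3 / ack quorum 2: ok"
fi
if ! check_quorum 2 3; then
  echo "write quorum 2 / ack quorum 3: invalid"
fi
```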

Configure data flow

When configuring the message source, see the detailed File Agent guide in inlong-agent for how to specify the file path of a file data source.

Configure data format

[Image: data format configuration page]

Configure Hive Cluster

Save the Hive flow destination, then click Submit for Approval.

Data access approval

Go to the approval management page, click My Approval, and approve the access application submitted above. After the approval is completed, the topics and subscriptions required for the data flow will be created in the Pulsar cluster synchronously.

We can use the command-line tool in the Pulsar cluster to check whether the topic was created successfully.
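
As a rough sketch of such a check, the functions below wrap the `topics list` and `topics list-partitioned-topics` subcommands of pulsar-admin. The path bin/pulsar-admin, the namespace public/my_namespace, and the topic name test_stream in the usage comment are assumptions; substitute the tenant, namespace, and topic created for your data flow group.

```shell
# Path to the pulsar-admin CLI (assumed); adjust to your Pulsar installation.
PULSAR_ADMIN="${PULSAR_ADMIN:-bin/pulsar-admin}"

# List every topic in a namespace: partitioned topics first, then the rest.
list_topics() {
  local namespace="$1"   # tenant/namespace
  "$PULSAR_ADMIN" topics list-partitioned-topics "$namespace"
  "$PULSAR_ADMIN" topics list "$namespace"
}

# Succeed if any topic in namespace $1 has a name containing $2.
topic_exists() {
  list_topics "$1" | grep -qF -- "$2"
}

# Usage, from the Pulsar installation directory:
#   topic_exists public/my_namespace test_stream && echo "topic created"
```

Both function names are our own. The match is a plain substring test, so it also accepts the per-partition topics that `topics list` reports for a partitioned topic.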

Configure the File Agent

To configure the File Agent, create a file in the directory specified when the data access was created:

touch /data/test_file.txt;

Write data to the file in the data source format chosen when the data stream was created (you can write more records in the same format):

echo -e "1|test\n2|test\n" >> /data/test_file.txt
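
Before waiting for the agent to pick up the file, it can help to confirm that every line matches the expected format. The sketch below assumes the two-field, '|'-separated layout of the sample records above. check_records is a hypothetical helper, and the field count should be adjusted to match the data format configured for your stream.

```shell
# Hypothetical helper: report lines that do not have exactly two
# non-empty '|'-separated fields. Exits 0 only if all lines are well formed.
check_records() {
  awk -F'|' '
    NF == 0 { next }                        # skip blank lines
    NF != 2 || $1 == "" || $2 == "" {       # wrong field count or empty field
      print "bad line " NR ": " $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Usage:
#   check_records /data/test_file.txt && echo "all records well formed"
```

Blank lines are skipped on purpose: the echo command above ends its string with \n, so it leaves one empty line at the end of the file.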

Data landing check

Finally, we log in to the Hive cluster and use Hive SQL to check whether the data has been inserted into the test_stream table.
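
A minimal sketch of that check uses the -e option of the Hive CLI, which runs a query string and exits. It assumes the hive command is on the PATH and that test_stream is in the current default database; the function names are our own.

```shell
# Hive CLI to use (assumed to be on the PATH); override HIVE_CLI otherwise.
HIVE_CLI="${HIVE_CLI:-hive}"

# Print the number of rows in the given table.
count_rows() {
  "$HIVE_CLI" -S -e "SELECT COUNT(*) FROM $1;"
}

# Print up to $2 rows (default 10) from the given table.
sample_rows() {
  "$HIVE_CLI" -S -e "SELECT * FROM $1 LIMIT ${2:-10};"
}

# Usage, on a machine with access to the Hive cluster:
#   count_rows test_stream
#   sample_rows test_stream 5
```

A non-zero row count after the File Agent has reported the two sample records suggests the whole pipeline is working end to end.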

Troubleshooting

If the data is not correctly written to the Hive cluster, check whether the DataProxy and Sort information has been synchronized.

