Cloud Native - Blog | Apache InLong uses Apache Pulsar to create data ingestion - ApachePulsar

About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native, distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. It supports multi-datacenter and cross-region data replication, and offers streaming data storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

This article is reposted from the WeChat official account Apache InLong. Original address: https://mp.weixin.qq.com/s/WgVJzu77Hncu-okce8_qaQ

Apache InLong adds the ability to ingest data through Apache Pulsar. It makes full use of the technical advantages that set Pulsar apart from other MQs, and provides a complete solution for data access scenarios with higher data quality requirements, such as finance and billing. Below, we walk through a complete example of ingesting data with Apache Pulsar via Apache InLong.

Introduction to Apache InLong (incubating)

Apache InLong ( https://inlong.apache.org ) is a one-stop data stream access service platform donated by Tencent to the Apache community. It provides automatic, secure, reliable, and high-performance data transmission capabilities, so that businesses can conveniently build streaming-based data analysis, modeling, and applications. The InLong project was formerly known as TubeMQ, which focused on high-performance, low-cost message queuing services. To further unlock the ecosystem around TubeMQ, we upgraded the project to InLong, which focuses on building a one-stop data stream access service platform. Apache InLong is based on TDBank, which is used internally at Tencent. Building on trillion-level data access and processing capabilities, Apache InLong integrates the entire process of data collection, aggregation, storage, sorting, and processing, and is easy to use, flexibly scalable, stable, and reliable.

Apache InLong serves the entire data life cycle, from collection to landing in storage, and provides a different processing module for each stage, including:

inlong-agent, the data collection agent, supports reading regular logs from specified directories or files and reporting them one by one. DB collection and HTTP reporting capabilities will be added in the future;
inlong-dataproxy, a Flume-ng-based proxy component, supports data transmission blocking and retransmission from disk, and can forward received data to different MQs (message queues);
inlong-tubemq, Tencent's self-developed message queue service, focuses on high-performance storage and transmission of massive data in big data scenarios, and its core advantages are proven large-scale production use and low cost;
inlong-sort performs ETL processing on data consumed from different MQs, then aggregates it and writes it to storage systems such as Hive, ClickHouse, HBase, and Iceberg;
inlong-manager provides complete data service management and control capabilities, including metadata, task flow, permissions, and OpenAPI;
inlong-website, the front-end pages for managing data access, simplifies use of the entire InLong management and control platform.

About Apache Pulsar

Apache Pulsar is a messaging system based on the Pub/Sub model, designed to separate storage from computing. This separation, together with its segment-based (sharded) storage design, gives Apache Pulsar several advantages over traditional MQs that store data per partition:

Brokers and Bookies are independent of each other, so each layer can be scaled and can tolerate faults independently;
Brokers are stateless, can be brought online and taken offline quickly, and are well suited to cloud-native scenarios;
Partition storage is not limited by the storage capacity of a single node;
Partition data is evenly distributed.
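
The last two points can be illustrated with a toy model. The sketch below is not BookKeeper's actual placement policy; it simply assumes that a partition is cut into segments and that each segment is copied to a rotating subset of Bookies. The function names, the Bookie names, and the round-robin rule are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def place_segments(num_segments, bookies, write_quorum):
    """Assign each segment to `write_quorum` bookies, rotating the start.

    Simplified round-robin placement, used here only to illustrate the idea
    of segment-based storage; real placement policies are more involved.
    """
    if not 1 <= write_quorum <= len(bookies):
        raise ValueError("write_quorum must be between 1 and the number of bookies")
    placement = []
    for seg in range(num_segments):
        replicas = [bookies[(seg + i) % len(bookies)] for i in range(write_quorum)]
        placement.append(replicas)
    return placement


def segments_per_bookie(placement):
    """Count how many segment copies each bookie stores."""
    return Counter(bookie for replicas in placement for bookie in replicas)


# One partition split into 12 segments, stored on 4 bookies with 2 copies each.
bookies = ["bookie-1", "bookie-2", "bookie-3", "bookie-4"]
placement = place_segments(12, bookies, write_quorum=2)
load = segments_per_bookie(placement)

for name in bookies:
    print(name, load[name])
```

In this model no single Bookie stores every segment of the partition, and the copies are spread evenly, which is the property the list above describes.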

Prerequisites

Install Apache Pulsar, version 2.6+
Install Apache Hive, version 2.3+

Install InLong

InLong can be deployed with one click using Docker Compose, or installed on ordinary machines from the binary package.

Docker Compose deployment: https://inlong.apache.org/en-US/docs/next/deployment/docker
Deploy using the installation package: https://inlong.apache.org/zh-CN/docs/next/deployment/bare_metal

Unlike with InLong TubeMQ, if you use Apache Pulsar you need to configure the Pulsar cluster information when installing the Manager component. The format is as follows:

# Pulsar admin URL
pulsar.adminUrl=http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080
# Pulsar broker address
pulsar.serviceUrl=pulsar://127.0.0.1:6650,127.0.0.2:6650,127.0.0.3:6650
# Default tenant of Pulsar
pulsar.defaultTenant=public
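
Because both URLs carry a comma-separated host list, a quick sanity check before starting the Manager can catch malformed or repeated entries. The following is a minimal sketch that assumes the settings live in a plain key=value properties file; the helper names (`parse_properties`, `split_hosts`, `find_duplicates`) and the sample addresses are hypothetical and are not part of InLong.

```python
SAMPLE = """\
# Pulsar admin URL
pulsar.adminUrl=http://127.0.0.1:8080,127.0.0.2:8080,127.0.0.3:8080
# Pulsar broker address
pulsar.serviceUrl=pulsar://127.0.0.1:6650,127.0.0.2:6650,127.0.0.3:6650
# Default tenant of Pulsar
pulsar.defaultTenant=public
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        props[key.strip()] = value.strip()
    return props


def split_hosts(url):
    """Split 'scheme://h1:p1,h2:p2' into (scheme, ['h1:p1', 'h2:p2'])."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"missing scheme in URL: {url!r}")
    hosts = [h.strip() for h in rest.split(",") if h.strip()]
    return scheme, hosts


def find_duplicates(hosts):
    """Return hosts that appear more than once, in first-repeat order."""
    seen, dupes = set(), []
    for host in hosts:
        if host in seen and host not in dupes:
            dupes.append(host)
        seen.add(host)
    return dupes


props = parse_properties(SAMPLE)
for key in ("pulsar.adminUrl", "pulsar.serviceUrl"):
    scheme, hosts = split_hosts(props[key])
    print(key, scheme, hosts, "duplicates:", find_duplicates(hosts))
```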

Create data access

Configure data flow group information

When creating data access, select Pulsar as the message middleware for the data flow group. The other Pulsar-related configuration items are:

Queue model: parallel or sequential. When parallel is selected, the number of Topic partitions can be set; sequential uses a single partition;
Write quorum: the number of copies written for each message;
Ack quorum: the number of Bookies that must confirm a write;
Retention time: how long messages already acknowledged by consumers are kept;
TTL: the expiration time of unacknowledged messages;
Retention size: the maximum total size of consumer-acknowledged messages to keep.
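
Some of these options constrain one another: the sequential model implies exactly one partition, and the ack quorum cannot exceed the write quorum. The sketch below shows one way such a form could be checked before submission. The `PulsarGroupConfig` class, its field names, and its units are hypothetical; they mirror the options listed above and are not InLong's actual data model. Retention and TTL values are carried but deliberately not validated here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PulsarGroupConfig:
    """Hypothetical container for the Pulsar options of a data flow group."""
    queue_model: str          # "parallel" or "sequential"
    partitions: int           # number of Topic partitions
    write_quorum: int         # copies written for each message
    ack_quorum: int           # Bookies that must confirm a write
    retention_time_min: int   # retention time (unit is an assumption)
    retention_size_mb: int    # retention size (unit is an assumption)
    ttl_sec: int              # TTL of unacknowledged messages (unit is an assumption)


def validate(cfg: PulsarGroupConfig) -> List[str]:
    """Return a list of human-readable problems; empty means the config is consistent."""
    errors = []
    if cfg.queue_model not in ("parallel", "sequential"):
        errors.append("queue_model must be 'parallel' or 'sequential'")
    if cfg.partitions < 1:
        errors.append("partitions must be at least 1")
    elif cfg.queue_model == "sequential" and cfg.partitions != 1:
        errors.append("sequential model uses exactly one partition")
    if cfg.write_quorum < 1:
        errors.append("write_quorum must be at least 1")
    if not 1 <= cfg.ack_quorum <= cfg.write_quorum:
        errors.append("ack_quorum must be between 1 and write_quorum")
    return errors


good = PulsarGroupConfig("parallel", 3, 3, 2, 60, 1024, 3600)
bad = PulsarGroupConfig("sequential", 3, 2, 3, 60, 1024, 3600)

print(validate(good))
print(validate(bad))
```

Returning a list of messages rather than raising on the first problem lets a caller report every inconsistency at once.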