Cloud Native - Blog recommendation | Use Pulsar IO to create a streaming data pipeline - ApachePulsar

This article is translated from the StreamNative blog. The original author of the blog: Ioannis Polyzos, StreamNative solution engineer. Original link: https://streamnative.io/blog/engineering/2021-11-10-streaming-data-pipelines-with-pulsar-io/

Background

Building a modern data infrastructure remains a challenge for enterprises. They must manage large volumes of heterogeneous data that is generated and delivered around the clock. Because each enterprise has different requirements for data volume, velocity, and so on, there is no "one size fits all" solution. Instead, companies need to move data between different systems in order to store, process, and serve it.

Historically, companies have used many different tools to move data, such as Apache Kafka for streaming workloads and RabbitMQ for messaging workloads. Apache Pulsar now simplifies this process for enterprises.

Apache Pulsar is a cloud-native, distributed messaging and streaming platform. Pulsar is designed to meet modern data needs, supporting flexible messaging semantics, tiered storage, multi-tenancy, and geo-replication (cross-region data replication). Since graduating as a top-level project of the Apache Software Foundation in 2018, the Pulsar project has seen rapid growth in its community, its surrounding ecosystem, and its global user base. With Pulsar as the backbone of their data infrastructure, companies can move data in a fast and scalable way. In this blog post, we will introduce how to use Pulsar IO to easily import and export data between Pulsar and external systems.

1. Introduction to Pulsar IO

Pulsar IO is a complete toolkit for creating, deploying, and managing Pulsar connectors that integrate with external systems (such as key/value stores, distributed file systems, search indexes, databases, data warehouses, and other messaging systems). Since Pulsar IO is built on Pulsar's computing layer (called Pulsar Functions), writing a Pulsar IO connector is as simple as writing a Pulsar Function.

With Pulsar IO, users can use existing Pulsar connectors or write their own custom connectors to easily move data in and out of Pulsar. Pulsar IO has the following advantages:

Diverse connectors: The Pulsar ecosystem already includes many Pulsar IO connectors for external systems, such as Apache Kafka, Cassandra, and Aerospike. Using these connectors shortens the time to production because all the components needed to create the integration are already in place. Developers only need to provide configuration (such as the connection URL and credentials) to run the connector.
Managed runtime: Pulsar IO comes with a managed runtime that is responsible for execution, scheduling, scaling, and fault tolerance. Developers can focus on configuration and business logic.
Multiple interfaces: The interfaces provided by Pulsar IO reduce the boilerplate code needed to write producer and consumer applications.
High scalability: When more instances are needed to handle incoming traffic, users can scale horizontally by changing a single configuration value; users who run on Kubernetes can scale elastically according to traffic.
Full use of schemas: Pulsar IO lets users take full advantage of schemas by specifying the schema type on the data model. Pulsar IO supports schema types such as JSON, Avro, and Protobuf.

2. Pulsar IO runtime

Since Pulsar IO is built on Pulsar Functions, Pulsar IO and Pulsar Functions have the same runtime options. When deploying a Pulsar IO connector, users have the following options:

thread : The connector runs in the same JVM as the worker. (Usually used for testing and local runs; not recommended for production deployments.)
process : The connector runs in a separate process, and users can use multiple workers to scale horizontally across multiple nodes.
Kubernetes : The connector runs as a Pod in a Kubernetes cluster, and the worker coordinates with Kubernetes. This runtime lets users take full advantage of a cloud-native environment such as Kubernetes, for example easy horizontal scaling.

3. Pulsar IO interface

As mentioned earlier, Pulsar IO reduces the boilerplate code required to write producer and consumer applications. It does this by providing a set of basic interfaces that abstract away the boilerplate and allow us to focus on business logic.

Pulsar IO provides two basic interfaces, Source and Sink. A Source connector brings data from external systems into Pulsar, while a Sink connector moves data out of Pulsar and into external systems, such as databases.

There is also a special type of Source connector called Push Source. A Push Source connector makes it easier to implement integrations in which the external system pushes data. For example, a Push Source can sit in front of a change data capture system: whenever it receives a new change, it pushes that change to Pulsar.
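The push model described above can be sketched without any Pulsar dependency. The following is a minimal illustration, assuming a bounded blocking queue sits between the external system's callback and the read() call made by the runtime. QueueBackedSource, PushSourceDemo, and the sample change strings are hypothetical names for this sketch, and Pulsar's own PushSource class is not used here.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a push-style source: the external system calls
// consume(), and the runtime calls read().
class QueueBackedSource<T> {
    private final BlockingQueue<T> queue;

    QueueBackedSource(int capacity) {
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    // Called from the external system's listener thread for every new change.
    // Blocks when the queue is full, which applies back-pressure to the listener.
    void consume(T change) {
        try {
            queue.put(change);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while enqueueing", e);
        }
    }

    // Called by the runtime; blocks until a change is available.
    T read() throws InterruptedException {
        return queue.take();
    }
}

class PushSourceDemo {
    static String run() {
        QueueBackedSource<String> source = new QueueBackedSource<>(16);

        // Simulates a change data capture listener pushing two changes.
        Thread listener = new Thread(() -> {
            source.consume("row 1 updated");
            source.consume("row 2 deleted");
        });
        listener.start();

        try {
            String first = source.read();
            String second = source.read();
            listener.join();
            return first + "; " + second;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "row 1 updated; row 2 deleted"
    }
}
```

The queue decouples the two sides: the listener never needs to know when the runtime asks for the next record, and read() still blocks when nothing new has arrived.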

Source interface

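package org.apache.pulsar.io.core;

import java.util.Map;
import org.apache.pulsar.functions.api.Record;

public interface Source<T> extends AutoCloseable {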
 
    /**
     * Open connector with configuration.
     *
     * @param config initialization config
     * @param sourceContext environment where the source connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SourceContext sourceContext) throws Exception;
 
    /**
     * Reads the next message from source.
     * If source does not have any new messages, this call should block.
     * @return next message from source.  The return result should never be null
     * @throws Exception
     */
    Record<T> read() throws Exception;
}
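To show how the two methods fit together, here is a minimal sketch of a source that emits consecutive integers. So that it compiles without Pulsar on the classpath, it declares simplified stand-ins for Record, SourceContext, and Source; a real connector would implement the Pulsar interfaces instead. CounterSource, CounterSourceDemo, and the "start" configuration key are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the Pulsar types, declared here only so the
// sketch is self-contained. They are not the real Pulsar interfaces.
interface Record<T> {
    T getValue();
}

interface SourceContext { }

interface Source<T> extends AutoCloseable {
    void open(Map<String, Object> config, SourceContext sourceContext) throws Exception;
    Record<T> read() throws Exception;
}

// Hypothetical source that emits start, start + 1, start + 2, ...
class CounterSource implements Source<Long> {
    private long next;

    @Override
    public void open(Map<String, Object> config, SourceContext sourceContext) {
        // "start" is an assumed configuration key, not a Pulsar setting.
        Object start = config.getOrDefault("start", 0L);
        next = ((Number) start).longValue();
    }

    @Override
    public Record<Long> read() {
        final long value = next++;
        return () -> value; // the stand-in Record has a single method
    }

    @Override
    public void close() { }
}

class CounterSourceDemo {
    // Plays the role of the runtime: open the source, read n records, close it.
    static List<Long> readN(long start, int n) {
        List<Long> values = new ArrayList<>();
        try (CounterSource source = new CounterSource()) {
            source.open(Map.of("start", start), null);
            for (int i = 0; i < n; i++) {
                values.add(source.read().getValue());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(readN(5, 3)); // prints [5, 6, 7]
    }
}
```

In this sketch the connector author writes only open(), read(), and close(); the loop that repeatedly calls read() belongs to the caller, which is the division of labor the interface is designed around.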

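Push Source interface

PushSource itself is an abstract class that implements Source and adds a consume() method for pushing records. The listing below is the closely related BatchSource interface, which first discovers tasks and then reads the records for each task.

package org.apache.pulsar.io.core;

import java.util.Map;
import java.util.function.Consumer;
import org.apache.pulsar.functions.api.Record;

public interface BatchSource<T> extends AutoCloseable {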
 
    /**
     * Open connector with configuration.
     *
     * @param config config that's supplied for source
     * @param context environment where the source connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SourceContext context) throws Exception;
 
    /**
     * Discovery phase of a connector.  This phase will only be run on one instance, i.e. instance 0, of the connector.
     * Implementations use the taskEater consumer to output serialized representation of tasks as they are discovered.
     *
     * @param taskEater function to notify the framework about the new task received.
     * @throws Exception during discover
     */
    void discover(Consumer<byte[]> taskEater) throws Exception;
 
    /**
     * Called when a new task appears for this connector instance.
     *
     * @param task the serialized representation of the task
     */
    void prepare(byte[] task) throws Exception;
 
    /**
     * Read data and return a record
     * Return null if no more records are present for this task
     * @return a record
     */
    Record<T> readNext() throws Exception;
}
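The discover, prepare, and readNext lifecycle in this listing can be illustrated with a small in-memory sketch. It does not use the Pulsar types: tasks are file names serialized as bytes, records are plain strings, and the driver loop stands in for the framework that hands tasks to connector instances. InMemoryBatchSource, BatchSourceDemo, and the file contents are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical batch-style source: each task is a file name, each record a line.
class InMemoryBatchSource {
    private final Map<String, List<String>> files = new LinkedHashMap<>();
    private Deque<String> pending = new ArrayDeque<>();

    InMemoryBatchSource() {
        files.put("a.csv", List.of("a1", "a2"));
        files.put("b.csv", List.of("b1"));
    }

    // Discovery phase: report every task in serialized form.
    void discover(Consumer<byte[]> taskEater) {
        for (String name : files.keySet()) {
            taskEater.accept(name.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Called when a task is assigned: load the records for that task.
    void prepare(byte[] task) {
        String name = new String(task, StandardCharsets.UTF_8);
        pending = new ArrayDeque<>(files.get(name));
    }

    // Returns the next record, or null once the current task is exhausted.
    String readNext() {
        return pending.poll();
    }
}

class BatchSourceDemo {
    // Stands in for the framework: discover tasks, then process them one by one.
    static List<String> run() {
        InMemoryBatchSource source = new InMemoryBatchSource();
        List<byte[]> tasks = new ArrayList<>();
        source.discover(tasks::add);

        List<String> records = new ArrayList<>();
        for (byte[] task : tasks) {
            source.prepare(task);
            String line;
            while ((line = source.readNext()) != null) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [a1, a2, b1]
    }
}
```

Returning null from readNext() is what signals the end of a task here, in contrast to Source.read(), which blocks and should never return null.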

Sink interface

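package org.apache.pulsar.io.core;

import java.util.Map;
import org.apache.pulsar.functions.api.Record;

public interface Sink<T> extends AutoCloseable {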
    /**
     * Open connector with configuration.
     *
     * @param config initialization config
     * @param sinkContext environment where the sink connector is running
     * @throws Exception IO type exceptions when opening a connector
     */
    void open(final Map<String, Object> config, SinkContext sinkContext) throws Exception;
 
    /**
     * Write a message to Sink.
     *
     * @param record record to write to sink
     * @throws Exception
     */
    void write(Record<T> record) throws Exception;
}
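A sink is the mirror image of a source: the runtime calls write() once per record and the connector forwards it to the external system. The sketch below uses an in-memory list in place of a real database and declares simplified stand-ins for Record, SinkContext, and Sink so that it compiles on its own. InMemoryTableSink, SinkDemo, and the "prefix" configuration key are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for the Pulsar types, declared here only so the
// sketch is self-contained. They are not the real Pulsar interfaces.
interface Record<T> {
    T getValue();
}

interface SinkContext { }

interface Sink<T> extends AutoCloseable {
    void open(Map<String, Object> config, SinkContext sinkContext) throws Exception;
    void write(Record<T> record) throws Exception;
}

// Hypothetical sink that appends rows to an in-memory "table".
class InMemoryTableSink implements Sink<String> {
    private final List<String> rows = new ArrayList<>();
    private String prefix = "";

    @Override
    public void open(Map<String, Object> config, SinkContext sinkContext) {
        // "prefix" is an assumed configuration key, not a Pulsar setting.
        prefix = String.valueOf(config.getOrDefault("prefix", ""));
    }

    @Override
    public void write(Record<String> record) {
        rows.add(prefix + record.getValue());
    }

    @Override
    public void close() { }

    List<String> rows() {
        return List.copyOf(rows);
    }
}

class SinkDemo {
    // Plays the role of the runtime: open the sink, deliver two records, close it.
    static List<String> run() {
        InMemoryTableSink sink = new InMemoryTableSink();
        sink.open(Map.of("prefix", "orders/"), null);
        sink.write(() -> "1001");
        sink.write(() -> "1002");
        sink.close();
        return sink.rows();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [orders/1001, orders/1002]
    }
}
```

A production sink would typically open a client connection in open(), release it in close(), and decide in write() how to handle failures from the external system.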

4. Summary

Apache Pulsar can serve as the backbone of modern data infrastructure, which enables companies to move data in a fast and scalable way. Pulsar IO is a connector framework that provides developers with all the necessary tools to create, deploy and manage Pulsar connectors integrated with different systems. Pulsar IO abstracts away all boilerplate code, allowing developers to focus on application logic.

5. Further reading

If you are interested in learning more and building your own connector, please check out the following resources:

Translator Profile

Song Bo, a senior development engineer at Beijing Baiguan Technology Co., Ltd., focuses on the fields of microservices, cloud computing, and big data.
