StreamNative and Cloudera recently announced that they are jointly open-sourcing their integration of Apache NiFi and Apache Pulsar, combining the two into a complete edge-to-cloud data streaming platform.
StreamNative was founded by the original creators of Apache Pulsar. It focuses on the Apache Pulsar community and ecosystem and builds a cloud-native solution around Pulsar that unifies batch and stream processing. The Cloudera team includes some of the original developers of Apache NiFi and uses NiFi to build data flows. By integrating NiFi with Pulsar, enterprises can create a cloud-native, scalable, real-time streaming data platform to ingest, transform, and analyze massive amounts of data.
This article introduces the background behind open-sourcing the processors and explains how Apache NiFi, with simple configuration, can produce messages to and consume messages from Pulsar topics at scale. Cloudera provides the processors out of the box in CDF for Data Hub 7.2.14 and later.
About Apache NiFi
Apache NiFi, originally named Niagara Files, is an open source project contributed by the National Security Agency (NSA) to the Apache Software Foundation. It was designed to automate the flow of data between systems. In July 2015, NiFi graduated from the Apache Incubator and became a top-level project of the Apache Software Foundation.
NiFi provides a visual tool based on flow-based programming, through which users can build data flows that move data from one platform (such as a database, cloud storage, or a messaging system) to another.
NiFi helps users automatically move data between different data sources and systems, making data ingestion fast, easy, and secure. It provides real-time controls for managing data movement between any source and any destination, and it offers event-level data provenance, so users can trace each piece of data back to its source.
The NiFi platform includes a collection of over 100 pre-built processors that can be used to enrich, route, and otherwise process data as it moves from sources to destinations.
About Apache Pulsar
Apache Pulsar is a cloud-native system that unifies message queuing and streaming. It provides a unified consumption model that supports both scenarios: for queuing, it offers enterprise-grade read and write quality of service and strong consistency guarantees; for streaming, it offers high throughput and low latency. Its architecture separates storage from compute and supports enterprise-grade and financial-grade features such as large clusters, multi-tenancy, millions of topics, geo-replication, persistent storage, tiered storage, and high scalability.
At its core, Pulsar uses replicated distributed ledgers to provide persistent streaming storage that can scale to retain petabytes of data. This scalable streaming storage makes Pulsar well suited as a long-term repository for event data. Through its message retention policy, users can retain historical event data indefinitely, which makes it possible to run streaming analysis over that data at any time in the future.
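To make the difference between queue-style and stream-style consumption, and the value of retained history, more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model and not Pulsar's client API: the `ToyTopic` class, the `drain` helper, and the subscription names are hypothetical and exist only for illustration. The sketch assumes a topic is an append-only log and that each subscription simply tracks its own read position.

```python
from collections import defaultdict


class ToyTopic:
    """Toy in-memory stand-in for a topic backed by an append-only log.

    Hypothetical illustration only; this is not the Pulsar client API.
    """

    def __init__(self):
        self.log = []       # retained messages, never deleted in this toy
        self.cursors = {}   # subscription name -> index of next unread message

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, name, from_start=False):
        # A new subscription starts at the end of the log unless it asks
        # to replay retained history from the beginning.
        self.cursors.setdefault(name, 0 if from_start else len(self.log))

    def read(self, name):
        """Return the next unread message for a subscription, or None."""
        position = self.cursors[name]
        if position >= len(self.log):
            return None
        self.cursors[name] = position + 1
        return self.log[position]


def drain(topic, subscription, consumers):
    """Hand out unread messages round-robin across the given consumers.

    One consumer models stream-style (ordered) consumption; several
    consumers on one subscription model queue-style work sharing.
    """
    received = defaultdict(list)
    turn = 0
    while (message := topic.read(subscription)) is not None:
        received[consumers[turn % len(consumers)]].append(message)
        turn += 1
    return dict(received)


topic = ToyTopic()
topic.subscribe("stream-sub")
topic.subscribe("queue-sub")
for i in range(4):
    topic.publish(f"event-{i}")

stream_view = drain(topic, "stream-sub", ["reader"])
queue_view = drain(topic, "queue-sub", ["worker-a", "worker-b"])

# A subscription created later can still replay the retained history.
topic.subscribe("late-analytics", from_start=True)
replay_view = drain(topic, "late-analytics", ["backfill"])
```

In this toy, the single `reader` sees every event in order, the two workers split the same events between them, and the late subscription replays everything that was retained. A real deployment would express the same ideas through Pulsar's subscription types and retention settings rather than through these invented classes.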
Processor: Complementing Apache Pulsar with Apache NiFi
The capabilities of Apache NiFi and Apache Pulsar complement each other in modern streaming data architectures. NiFi provides a data flow solution that automates the movement of data between software systems. As such, it acts as a short-term buffer between different data sources rather than as a long-term data repository.
Pulsar, by contrast, is designed to act as a long-term repository for event data and provides strong integrations with common stream processing frameworks such as Flink and Spark. By combining the two technologies, users can create a powerful real-time data processing and analysis platform.
The benefits of combining these technologies show up across the data platform. NiFi covers users' data flow management needs, including prioritization, back pressure, and edge intelligence.
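The idea of a short-term buffer with back pressure can be sketched in a few lines of plain Python. This is a conceptual model only, not NiFi's API: `ToyConnection`, `run_flow`, and the parameters `threshold` and `sink_every` are hypothetical names. It assumes a single bounded queue between a fast source and a slower sink, and it pauses the source whenever the queue reaches its threshold.

```python
from collections import deque


class ToyConnection:
    """Bounded queue between two flow steps (hypothetical, not NiFi's API)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.queue = deque()

    def is_full(self):
        return len(self.queue) >= self.threshold


def run_flow(source_items, threshold, sink_every):
    """Simulate a fast source feeding a slower sink through one connection.

    Each tick, the source emits one item unless back pressure is applied;
    the sink takes one item every `sink_every` ticks.
    Returns (delivered items, number of ticks the source was paused).
    """
    connection = ToyConnection(threshold)
    pending = deque(source_items)
    delivered = []
    paused_ticks = 0
    tick = 0
    while pending or connection.queue:
        tick += 1
        if pending:
            if connection.is_full():
                paused_ticks += 1  # back pressure: the source is not scheduled
            else:
                connection.queue.append(pending.popleft())
        if tick % sink_every == 0 and connection.queue:
            delivered.append(connection.queue.popleft())
    return delivered, paused_ticks


delivered, paused_ticks = run_flow(range(6), threshold=2, sink_every=2)
```

The source is held back on some ticks, yet every item still arrives in order and the queue never grows past its threshold. That is the sense in which the flow layer is a short-term buffer: it smooths out speed differences but is not meant to hold data for long.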
Users can use NiFi's extensive suite of connectors to stream data automatically into the streaming platform while performing ETL processing. Once the data is transformed, the NiFi processors designed for Apache Pulsar can route it directly to Pulsar's persistent streaming storage for long-term retention.
Once the data is stored in Pulsar, common stream processing engines such as Flink or Spark can use it for more complex stream processing and analytics scenarios.
In short, NiFi's rich set of connectors lets users easily ingest data into the streaming platform, while Pulsar's integration with Flink or Spark provides easy access to real-time insights.
The combination of Apache Pulsar and Apache NiFi creates a complete edge-to-cloud data streaming platform that provides real-time insights across multiple applications. The integration suits many industries and scenarios. In cybersecurity, users need to identify and detect threats as quickly as possible, which requires a system that can ingest and parse log data. Manufacturing, mining, oil and gas, and many other industries need to ingest massive amounts of IoT sensor data from different locations and analyze it in near real time to prevent catastrophic equipment failures and the sudden disruptions to operations that could result. In financial services, time-sensitive applications such as algorithmic trading or cryptocurrency arbitrage require systems that can ingest and process data in near real time.
Video demo
Let's take a look at these processors in action next. This video demonstrates configuring and using these processors to send and receive data to and from an Apache Pulsar cluster.
As you can see from the video demo, there are four processors: PublishPulsar and PublishPulsarRecord publish data to Pulsar, and ConsumePulsar and ConsumePulsarRecord consume data from Pulsar. The bundle also contains two controller services: one for creating Pulsar clients and another for authenticating against a secured Pulsar cluster.
Using the processors
These processors are available in CDF 7.2.14 and later on the public cloud; refer to the documentation for details. If you wish to use these processors in other Apache NiFi clusters, you can download the artifacts directly from the Maven Central repository, or build them from source.
Related Reading
- Pulsar Summit talk video: the technology stack (Flink, NiFi, Pulsar) in edge AI scenarios
- Download the demo code to start running the processors.
About StreamNative
StreamNative is an open source infrastructure software company formed by the founding team of Apache Pulsar, a top-level project of the Apache Software Foundation. It is building a next-generation cloud-native data platform around Pulsar that unifies batch and stream processing. As the commercial company behind Apache Pulsar, StreamNative focuses on the open source ecosystem and community building and is committed to innovation in cutting-edge technology. Its founding team members have worked at well-known companies such as Yahoo, Twitter, Splunk, and EMC.