This article was translated by Apache Pulsar Chinese community volunteers organized by StreamNative. The original post, "Building Edge Applications With Apache Pulsar" by Tim Spann, StreamNative evangelist, appeared on the StreamNative English blog. Translator: YOLO, of the bomc team at BSC BOMC ORP. Original link: https://streamnative.io/blog/engineering/2021-11-17-building-edge-applications-with-apache-pulsar/
The explosive growth of remotely connected devices in recent years has created challenges for the centralized computing paradigm. Constrained by network and infrastructure, it is increasingly difficult for businesses to move and process the data generated by all of their devices in the data center or cloud without latency or performance issues. As a result, edge applications are on the rise. Gartner predicts that by 2025, enterprises will create and process 75% of their data outside the data center or cloud.
So what are edge applications? Edge applications run on or near data sources, such as IoT devices and local edge servers. Edge computing enables computing, storage, caching, management, alerting, machine learning, and routing to take place outside the data center and cloud. Industries such as retail, agriculture, manufacturing, transportation, healthcare, and telecommunications are adopting edge applications for their lower latency, better bandwidth utilization, lower infrastructure costs, and faster decision-making.
This article will introduce you to some of the challenges faced in developing edge applications, and Apache Pulsar's solutions for edge applications. This article will also share an example that shows step-by-step how to build an edge application with Pulsar.
Key challenges
While the decentralized nature of edge computing brings many benefits, it also brings challenges, including:
- Edge applications often need to support a variety of devices, protocols, languages, and data formats.
- Communication from edge applications needs to be asynchronous, because events stream in from sensors, logs, and applications at a fast but uneven rate.
- Edge data producers need to deploy different messaging clusters according to their design requirements.
- By design, edge applications are geographically dispersed and diverse.
Solution
An open source solution that is adaptable, hybrid, geo-replicated, and scalable is needed to solve the problems of building edge applications. Open source projects with many users can provide broad community support and a rich ecosystem of adapters, connectors, and extensions needed for edge applications. Based on my experience working with different technologies and open source projects over the past two decades, I believe that Apache Pulsar meets the needs of edge applications.
Apache Pulsar is an open-source, cloud-native, distributed messaging and streaming platform. Since Pulsar became a top-level project of the Apache Software Foundation in 2018, its community participation, surrounding ecosystem, and global adoption have all grown rapidly. Pulsar is able to solve many of the challenges in edge computing thanks to the following:
- Apache Pulsar supports fast messaging, metadata, and multiple data formats under multiple schemas.
- Pulsar supports multilingual clients such as Go, C++, Java, Node.js, Websockets and Python. In addition, there are open source clients for Haskell, Scala, Rust, and .Net from community developers, as well as stream processing libraries for Apache Flink and Apache Spark.
- Pulsar supports multiple messaging protocols, including MQTT, Kafka, AMQP, and JMS.
- Pulsar's geo-replication capability solves the problem of geographically dispersed devices (see the example after this list).
- Pulsar's cloud-native architecture allows it to run in multi-cloud, on-premises, or Kubernetes environments. It can also run on small edge gateways as well as more powerful devices like the NVIDIA Jetson Xavier NX.
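As a minimal sketch of enabling geo-replication, a namespace can be assigned to multiple clusters with the admin CLI; the cluster names below are placeholders for clusters that have already been configured for replication:
bin/pulsar-admin namespaces set-clusters public/default --clusters us-west,us-east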
In this example, we are building an edge application on an NVIDIA Jetson Xavier NX, which gives us enough power to run a standalone Apache Pulsar broker, multiple web cameras, and a deep learning edge application at the edge. My edge device contains 384 NVIDIA CUDA® cores and 48 Tensor Cores, six 64-bit ARM cores, and 8 GB of 128-bit LPDDR4x RAM. In a follow-up blog, I'll show that running Pulsar even on simpler devices like the Raspberry Pi 4 and the NVIDIA Jetson Nano can still meet the need for fast edge event streaming.
Architecture
The physical structure of the solution has been introduced above, so the question now is how to logically build the application architecture for incoming data. For those unfamiliar with Pulsar, the first thing to understand is that each topic belongs to a tenant and a namespace, as shown in the image below.
These logical structures allow us to group data according to various criteria, such as the original source of the data and different lines of business. Once we have decided on the tenant, namespace, and topic, we need to determine the fields that will hold the data needed for analysis.
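As an example, this logical structure could be created with the admin CLI; the tenant and namespace names here are hypothetical:
bin/pulsar-admin tenants create iot
bin/pulsar-admin namespaces create iot/jetson
bin/pulsar-admin topics create persistent://iot/jetson/iotjetsonjson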
Next, we need to determine the format of the data. Depending on the architecture, it can be the same as the original format, or it can be converted according to specific requirements for transmission, processing or storage. Additionally, in many cases, our devices, facilities, sensors, operating systems, or transmission modes require us to select specific data formats.
In this article, we'll use the JSON data format, which is readable from almost any language and by most people. Apache Avro is also a good choice as a binary format, but this blog series will stick with the simplest option.
Once the data format has been chosen, we may need to enrich the raw data from sensors, machine learning classifications, logs, or other sources with additional fields. I like to add the IP address, MAC address, hostname, creation timestamp, execution time, and some fields about device health such as disk space, memory, and CPU usage. You can drop any of these if you don't think they're necessary, or if your device already broadcasts its health. These fields help us debug our programs, especially once we have thousands of devices, so I'm in the habit of adding this data when bandwidth allows.
We also need a primary key or unique identifier for each event record; IoT data usually doesn't come with one, but we can synthesize one with a UUID generator when creating the record.
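Below is a minimal Python sketch of this enrichment; it assumes the psutil library is installed, and the field names simply mirror the event example later in this article:
import json
import socket
import uuid
from datetime import datetime
import psutil

# Build an enriched event with a synthesized UUID key and device-health fields
row = {
    'uuid': str(uuid.uuid4()),
    'host': socket.gethostname(),
    'systemtime': datetime.now().strftime('%m/%d/%Y %H:%M:%S'),
    'cpu': psutil.cpu_percent(interval=1),
    'memory': psutil.virtual_memory().percent,
    'diskusage': '{0:.1f} MB'.format(psutil.disk_usage('/').free / (1024 * 1024)),
}
print(json.dumps(row))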
After we have a list of fields, we need to define a schema for the data: each field's name, type, default value, and whether it is nullable. Once a schema is defined, we can use a JSON schema or build a class with those fields, and then query the data in the topic using Pulsar SQL. IoT applications often land such events in a time-series data store; I recommend Aerospike, InfluxDB, or ScyllaDB. We can use a Pulsar IO sink connector or another mechanism depending on the scenario and requirements. If necessary, we can also use the Spark, Flink, or NiFi connectors.
Our final event will look like the JSON example shown below.
{"uuid": "xav_uuid_video0_lmj_20211027011044", "camera": "/dev/video0", "ipaddress": "192.168.1.70", "networktime": 4.284832000732422, "top1pct": 47.265625, "top1": "spotlight, spot", "cputemp": "29.0", "gputemp": "28.5", "gputempf": "83", "cputempf": "84", "runtime": "4", "host": "nvidia-desktop", "filename": "/home/nvidia/nvme/images/out_video0_tje_20211027011044.jpg", "imageinput": "/home/nvidia/nvme/images/img_video0_eqi_20211027011044.jpg", "host_name": "nvidia-desktop", "macaddress": "70:66:55:15:b4:a5", "te": "4.1648781299591064", "systemtime": "10/26/2021 21:10:48", "cpu": 11.7, "diskusage": "32367.5 MB", "memory": 82.1}
Edge producers
Next, we test some libraries, languages, and clients on the NVIDIA Jetson Xavier NX to see which work best for our scenario. After prototyping several libraries on the Xavier NX, which runs Ubuntu on ARM, I found the following options for producing the messages my application needs. These are not the only paths for this edge platform, but they are all very good choices.
- Go Lang Pulsar Producer
- Python 3.x Websocket Producer
- Python 3.x MQTT Producer
- Java 8 Pulsar Producer
Go language Pulsar producer
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
	"github.com/hpcloud/tail"
	"github.com/streamnative/pulsar-examples/cloud/go/ccloud"
)

func main() {
	// Create the Pulsar client (connection settings live in the ccloud helper)
	client := ccloud.CreateClient()
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{
		Topic: "jetson-iot",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Tail the log file and publish each new line as a message
	t, err := tail.TailFile("demo1.log", tail.Config{Follow: true})
	if err != nil {
		log.Fatal(err)
	}
	for line := range t.Lines {
		if msgId, err := producer.Send(context.Background(),
			&pulsar.ProducerMessage{
				Payload: []byte(line.Text),
			}); err != nil {
			log.Fatal(err)
		} else {
			fmt.Printf("jetson:Published message: %v-%s\n", msgId, line.Text)
		}
	}
}
Python3 Websocket producer
import uuid, websocket, base64, json

# Unique key for the message
uuid2 = uuid.uuid4()

# Build the event payload ('server' is the Pulsar host)
row = {}
row['host'] = 'nvidia-desktop'

ws = websocket.create_connection(
    'ws://server:8080/ws/v2/producer/persistent/public/default/energy')

# The Pulsar WebSocket API expects a base64-encoded payload
message = str(json.dumps(row))
message_bytes = message.encode('ascii')
base64_bytes = base64.b64encode(message_bytes)
base64_message = base64_bytes.decode('ascii')

ws.send(json.dumps({'payload': base64_message,
                    'properties': {'device': 'jetson2gb',
                                   'protocol': 'websockets'},
                    'key': str(uuid2),
                    'context': 5}))

response = json.loads(ws.recv())
if response['result'] == 'ok':
    print('Message published successfully')
else:
    print('Failed to publish message:', response)
ws.close()
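Python3 MQTT producer
The list above also mentions an MQTT producer. Below is a minimal sketch, assuming the MoP (MQTT-on-Pulsar) protocol handler is enabled on the broker's port 1883 and the paho-mqtt 1.x library is installed; 'server' is a placeholder host:
import json
import uuid
import paho.mqtt.client as mqtt

# Event payload keyed with a synthesized UUID
record = {'uuid': str(uuid.uuid4()), 'host': 'nvidia-desktop'}

client = mqtt.Client()
client.connect('server', 1883, keepalive=60)
# MoP maps the MQTT topic name onto a Pulsar topic
client.publish('persistent://public/default/energy',
               payload=json.dumps(record), qos=0)
client.disconnect()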
Java Pulsar Producer with Schema
public static void main(String[] args) throws Exception {
    JCommanderPulsar jct = new JCommanderPulsar();
    JCommander jCommander = new JCommander(jct, args);
    if (jct.help) {
        jCommander.usage();
        return;
    }

    // Build the client, using OAuth2 credentials if an issuer URL was supplied
    PulsarClient client = null;
    if (jct.issuerUrl != null && jct.issuerUrl.trim().length() > 0) {
        try {
            client = PulsarClient.builder()
                    .serviceUrl(jct.serviceUrl.toString())
                    .authentication(AuthenticationFactoryOAuth2.clientCredentials(
                            new URL(jct.issuerUrl.toString()),
                            new URL(jct.credentialsUrl.toString()),
                            jct.audience.toString()))
                    .build();
        } catch (PulsarClientException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    } else {
        try {
            client = PulsarClient.builder()
                    .serviceUrl(jct.serviceUrl.toString())
                    .build();
        } catch (PulsarClientException e) {
            e.printStackTrace();
        }
    }

    UUID uuidKey = UUID.randomUUID();
    String pulsarKey = uuidKey.toString();
    String OS = System.getProperty("os.name").toLowerCase();

    // Parse the JSON message passed on the command line
    IoTMessage iotMessage = parseMessage("" + jct.message);

    String topic = DEFAULT_TOPIC;
    if (jct.topic != null && jct.topic.trim().length() > 0) {
        topic = jct.topic.trim();
    }

    // Producer with a JSON schema derived from the IoTMessage class
    ProducerBuilder<IoTMessage> producerBuilder = client
            .newProducer(JSONSchema.of(IoTMessage.class))
            .topic(topic)
            .producerName("jetson")
            .sendTimeout(5, TimeUnit.SECONDS);
    Producer<IoTMessage> producer = producerBuilder.create();

    MessageId msgID = producer.newMessage()
            .key(iotMessage.getUuid())
            .value(iotMessage)
            .property("device", OS)
            .property("uuid2", pulsarKey)
            .send();

    producer.close();
    client.close();
    producer = null;
    client = null;
}

private static IoTMessage parseMessage(String message) {
    IoTMessage iotMessage = null;
    try {
        if (message != null && message.trim().length() > 0) {
            ObjectMapper mapper = new ObjectMapper();
            iotMessage = mapper.readValue(message, IoTMessage.class);
            mapper = null;
        }
    } catch (Throwable t) {
        t.printStackTrace();
    }
    if (iotMessage == null) {
        iotMessage = new IoTMessage();
    }
    return iotMessage;
}
java -jar target/IoTProducer-1.0-jar-with-dependencies.jar --serviceUrl pulsar://nvidia-desktop:6650 --topic 'iotjetsonjson' --message "...JSON…"
You can find all the source code here.
Now we need to decide how to execute the application on the device. You can use the scheduler that comes with your system, such as cron, or an add-on. For reference, I often use cron, a MiNiFi agent, a shell script, or run the application continuously as a service. You will need to configure your own devices and sensors for optimal scheduling; a sample cron entry is sketched below.
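For instance, a hypothetical crontab entry could run a producer script every minute and append its output to a log (the paths are placeholders):
* * * * * /usr/bin/python3 /home/nvidia/producer.py >> /home/nvidia/producer.log 2>&1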
Validate data and monitor
Now that we have a steady stream of events coming into our Pulsar cluster, we can validate the data and monitor progress. The StreamNative Cloud Manager interface is one option, as shown in the following figure. We can also look at the Pulsar metrics endpoint, documented here.
Check Statistics via REST
- http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/stats
- http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/internalStats
Check Statistics via Admin CLI
bin/pulsar-admin topics stats-internal persistent://public/default/mqtt-2
List the subscriptions on the topic
http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/subscriptions
Consume from subscription via REST
http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/subscription/mqtt2/position/10
Consume messages via the CLI
bin/pulsar-client consume "persistent://public/default/mqtt-2" -s "mqtt2" -n 5
Query topics via Pulsar SQL
select * from pulsar."public/default".iotjetsonjson;
Next steps
We have now built an edge application that streams data as events occur, and we can connect streaming data from thousands of other applications into an Apache Pulsar cluster. Next, we can add rich real-time analytics with Flink SQL, which gives us advanced stream processing, event stream integration, and large-scale data processing; a sketch follows.
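As a minimal sketch, assuming the StreamNative pulsar-flink connector is installed in the Flink SQL environment (connector option names may vary by connector version), we could map the topic to a table and aggregate classification confidence per camera:
CREATE TABLE iotjetsonjson (
  uuid STRING,
  camera STRING,
  top1pct DOUBLE,
  top1 STRING,
  cpu DOUBLE,
  memory DOUBLE
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/iotjetsonjson',
  'service-url' = 'pulsar://nvidia-desktop:6650',
  'admin-url' = 'http://nvidia-desktop:8080',
  'format' = 'json'
);

SELECT camera, AVG(top1pct) AS avg_confidence
FROM iotjetsonjson
GROUP BY camera;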
Further reading
If you're interested in learning more about edge applications and building your own connectors, see the following resources:
- Using the FLiPN Stack for Edge AI (Flink, NiFi, Pulsar)
- Slide deck download
- Pulsar client library
- Example source data
- InfluxDB Pulsar IO sink connector