This article was translated by Apache Pulsar Chinese community volunteers organized by StreamNative. The original post, "Building Edge Applications With Apache Pulsar" by Tim Spann, StreamNative evangelist, appeared on the StreamNative English blog. Translator: YOLO, of the bomc team at BSC BOMC ORP. Original link: https://streamnative.io/blog/engineering/2021-11-17-building-edge-applications-with-apache-pulsar/
The explosive growth of remotely connected devices in recent years has created challenges for the centralized computing paradigm. Constrained by network and infrastructure, it is increasingly difficult for businesses to move and process the data generated by all of their devices in the data center or cloud without latency or performance issues. As a result, edge applications are on the rise. Gartner predicts that by 2025, enterprises will create and process 75% of their data outside the data center or cloud.
So what are edge applications? Edge applications run on or near data sources, such as IoT devices and local edge servers. Edge computing enables computing, storage, caching, management, alerting, machine learning, and routing to take place outside the data center and cloud. Industries such as retail, agriculture, manufacturing, transportation, healthcare, and telecommunications are adopting edge applications for their lower latency, better bandwidth utilization, lower infrastructure costs, and faster decision-making.
This article will introduce you to some of the challenges faced in developing edge applications, and Apache Pulsar's solutions for edge applications. This article will also share an example that shows step-by-step how to build an edge application with Pulsar.
Key challenges
While the decentralized nature of edge computing brings many benefits, it also brings challenges, including:
- Edge applications often need to support a variety of devices, protocols, languages, and data formats.
- Communication from edge applications needs to be asynchronous, because events stream in from sensors, logs, and applications at a fast but uneven rate.
- Edge data producers need to deploy different messaging clusters according to their design requirements.
- By design, edge applications are geographically dispersed and diverse.
Solution
An open source solution that is adaptable, hybrid, geo-replicated, and scalable is needed to solve the problems of building edge applications. Open source projects with many users can provide broad community support and a rich ecosystem of adapters, connectors, and extensions needed for edge applications. Based on my experience working with different technologies and open source projects over the past two decades, I believe that Apache Pulsar meets the needs of edge applications.
Apache Pulsar is an open-source, cloud-native, distributed messaging and streaming platform. Since Pulsar became a top-level project of the Apache Software Foundation in 2018, its community participation, surrounding ecosystem, and global adoption have all grown rapidly. Pulsar is able to solve many of the challenges in edge computing thanks to the following:
- Apache Pulsar supports fast messaging, metadata, and multiple data formats under multiple schemas.
- Pulsar supports multilingual clients such as Go, C++, Java, Node.js, Websockets and Python. In addition, there are open source clients for Haskell, Scala, Rust, and .Net from community developers, as well as stream processing libraries for Apache Flink and Apache Spark.
- Pulsar supports multiple messaging protocols, including MQTT, Kafka, AMQP, and JMS.
- Pulsar's geo-replication capability solves the problem of geographically dispersed devices (see the example after this list).
- Pulsar's cloud-native architecture allows it to run in multi-cloud, on-premises, or Kubernetes environments. It can also run on small edge gateways as well as more powerful devices like the NVIDIA Jetson Xavier NX.
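As a minimal sketch of enabling geo-replication, a namespace can be assigned to multiple clusters with the admin CLI; the cluster names below are placeholders for clusters that have already been configured for replication:
bin/pulsar-admin namespaces set-clusters public/default --clusters us-west,us-east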
In this example, we are building an edge application on an NVIDIA Jetson Xavier NX, which gives us enough power to run a standalone Apache Pulsar broker, multiple web cameras, and a deep learning edge application at the edge. My edge device contains 384 NVIDIA CUDA® cores and 48 Tensor Cores, six 64-bit ARM cores, and 8 GB of 128-bit LPDDR4x RAM. In a follow-up blog, I'll show that running Pulsar even on simpler devices like the Raspberry Pi 4 and the NVIDIA Jetson Nano can still meet the need for fast edge event streaming.
Architecture
The physical structure of the solution has been introduced above, so the question now is how to logically build the application architecture for incoming data. For those unfamiliar with Pulsar, the first thing to understand is that each topic belongs to a tenant and a namespace, as shown in the image below.
These logical structures allow us to group data according to various criteria, such as the original source of the data and different lines of business. Once we have decided on the tenant, namespace, and topic, we need to determine the fields that will hold the data needed for analysis.
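As an example, this logical structure could be created with the admin CLI; the tenant and namespace names here are hypothetical:
bin/pulsar-admin tenants create iot
bin/pulsar-admin namespaces create iot/jetson
bin/pulsar-admin topics create persistent://iot/jetson/iotjetsonjson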
Next, we need to determine the format of the data. Depending on the architecture, it can be the same as the original format, or it can be converted according to specific requirements for transmission, processing or storage. Additionally, in many cases, our devices, facilities, sensors, operating systems, or transmission modes require us to select specific data formats.
In this article, we'll use the JSON data format, which is readable from almost any language and by most people. Apache Avro is also a good choice as a binary format, but this blog series will stick with the simplest option.
Once the data format has been chosen, we may need to enrich the raw data from sensors, machine learning classifications, logs, or other sources with additional fields. I like to add the IP address, MAC address, hostname, creation timestamp, execution time, and some fields about device health such as disk space, memory, and CPU usage. You can drop any of these if you don't think they're necessary, or if your device already broadcasts its health. These fields help us debug our programs, especially once we have thousands of devices, so I'm in the habit of adding this data when bandwidth allows.
We also need a primary key or unique identifier for each event record; IoT data usually doesn't come with one, but we can synthesize one with a UUID generator when creating the record.
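Below is a minimal Python sketch of this enrichment; it assumes the psutil library is installed, and the field names simply mirror the event example later in this article:
import json
import socket
import uuid
from datetime import datetime
import psutil

# Build an enriched event with a synthesized UUID key and device-health fields
row = {
    'uuid': str(uuid.uuid4()),
    'host': socket.gethostname(),
    'systemtime': datetime.now().strftime('%m/%d/%Y %H:%M:%S'),
    'cpu': psutil.cpu_percent(interval=1),
    'memory': psutil.virtual_memory().percent,
    'diskusage': '{0:.1f} MB'.format(psutil.disk_usage('/').free / (1024 * 1024)),
}
print(json.dumps(row))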
After we have a list of fields, we need to define a schema for the data: each field's name, type, default value, and whether it is nullable. Once a schema is defined, we can use a JSON schema or build a class with those fields, and then query the data in the topic using Pulsar SQL. IoT applications often land such events in a time-series data store; I recommend Aerospike, InfluxDB, or ScyllaDB. We can use a Pulsar IO sink connector or another mechanism depending on the scenario and requirements. If necessary, we can also use the Spark, Flink, or NiFi connectors.
Our final event will look like the JSON example shown below.
{"uuid": "xav_uuid_video0_lmj_20211027011044", "camera": "/dev/video0", "ipaddress": "192.168.1.70", "networktime": 4.284832000732422, "top1pct": 47.265625, "top1": "spotlight, spot", "cputemp": "29.0", "gputemp": "28.5", "gputempf": "83", "cputempf": "84", "runtime": "4", "host": "nvidia-desktop", "filename": "/home/nvidia/nvme/images/out_video0_tje_20211027011044.jpg", "imageinput": "/home/nvidia/nvme/images/img_video0_eqi_20211027011044.jpg", "host_name": "nvidia-desktop", "macaddress": "70:66:55:15:b4:a5", "te": "4.1648781299591064", "systemtime": "10/26/2021 21:10:48", "cpu": 11.7, "diskusage": "32367.5 MB", "memory": 82.1}
Edge producers
Next, we test some libraries, languages, and clients on the NVIDIA Jetson Xavier NX to see which work best for our scenario. After prototyping several libraries on the Xavier NX, which runs Ubuntu on ARM, I found the following options for producing the messages my application needs. These are not the only paths for this edge platform, but they are all very good choices.
- Go Lang Pulsar Producer
- Python 3.x Websocket Producer
- Python 3.x MQTT Producer
- Java 8 Pulsar Producer
Go language Pulsar producer
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
	"github.com/hpcloud/tail"
	"github.com/streamnative/pulsar-examples/cloud/go/ccloud"
)

func main() {
	// Create the Pulsar client (connection settings live in the ccloud helper)
	client := ccloud.CreateClient()
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{
		Topic: "jetson-iot",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Tail the log file and publish each new line as a message
	t, err := tail.TailFile("demo1.log", tail.Config{Follow: true})
	if err != nil {
		log.Fatal(err)
	}
	for line := range t.Lines {
		if msgId, err := producer.Send(context.Background(),
			&pulsar.ProducerMessage{
				Payload: []byte(line.Text),
			}); err != nil {
			log.Fatal(err)
		} else {
			fmt.Printf("jetson:Published message: %v-%s\n", msgId, line.Text)
		}
	}
}
Python3 Websocket producer
import uuid, websocket, base64, json

# Unique key for the message
uuid2 = uuid.uuid4()

# Build the event payload ('server' is the Pulsar host)
row = {}
row['host'] = 'nvidia-desktop'

ws = websocket.create_connection(
    'ws://server:8080/ws/v2/producer/persistent/public/default/energy')

# The Pulsar WebSocket API expects a base64-encoded payload
message = str(json.dumps(row))
message_bytes = message.encode('ascii')
base64_bytes = base64.b64encode(message_bytes)
base64_message = base64_bytes.decode('ascii')

ws.send(json.dumps({'payload': base64_message,
                    'properties': {'device': 'jetson2gb',
                                   'protocol': 'websockets'},
                    'key': str(uuid2),
                    'context': 5}))

response = json.loads(ws.recv())
if response['result'] == 'ok':
    print('Message published successfully')
else:
    print('Failed to publish message:', response)
ws.close()
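Python3 MQTT producer
The list above also mentions an MQTT producer. Below is a minimal sketch, assuming the MoP (MQTT-on-Pulsar) protocol handler is enabled on the broker's port 1883 and the paho-mqtt 1.x library is installed; 'server' is a placeholder host:
import json
import uuid
import paho.mqtt.client as mqtt

# Event payload keyed with a synthesized UUID
record = {'uuid': str(uuid.uuid4()), 'host': 'nvidia-desktop'}

client = mqtt.Client()
client.connect('server', 1883, keepalive=60)
# MoP maps the MQTT topic name onto a Pulsar topic
client.publish('persistent://public/default/energy',
               payload=json.dumps(record), qos=0)
client.disconnect()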
Java Pulsar Producer with Schema
public static void main(String[] args) throws Exception {
    JCommanderPulsar jct = new JCommanderPulsar();
    JCommander jCommander = new JCommander(jct, args);
    if (jct.help) {
        jCommander.usage();
        return;
    }

    // Build the client, using OAuth2 credentials if an issuer URL was supplied
    PulsarClient client = null;
    if (jct.issuerUrl != null && jct.issuerUrl.trim().length() > 0) {
        try {
            client = PulsarClient.builder()
                    .serviceUrl(jct.serviceUrl.toString())
                    .authentication(AuthenticationFactoryOAuth2.clientCredentials(
                            new URL(jct.issuerUrl.toString()),
                            new URL(jct.credentialsUrl.toString()),
                            jct.audience.toString()))
                    .build();
        } catch (PulsarClientException e) {
            e.printStackTrace();
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
    } else {
        try {
            client = PulsarClient.builder()
                    .serviceUrl(jct.serviceUrl.toString())
                    .build();
        } catch (PulsarClientException e) {
            e.printStackTrace();
        }
    }

    UUID uuidKey = UUID.randomUUID();
    String pulsarKey = uuidKey.toString();
    String OS = System.getProperty("os.name").toLowerCase();

    // Parse the JSON message passed on the command line
    IoTMessage iotMessage = parseMessage("" + jct.message);

    String topic = DEFAULT_TOPIC;
    if (jct.topic != null && jct.topic.trim().length() > 0) {
        topic = jct.topic.trim();
    }

    // Producer with a JSON schema derived from the IoTMessage class
    ProducerBuilder<IoTMessage> producerBuilder = client
            .newProducer(JSONSchema.of(IoTMessage.class))
            .topic(topic)
            .producerName("jetson")
            .sendTimeout(5, TimeUnit.SECONDS);
    Producer<IoTMessage> producer = producerBuilder.create();

    MessageId msgID = producer.newMessage()
            .key(iotMessage.getUuid())
            .value(iotMessage)
            .property("device", OS)
            .property("uuid2", pulsarKey)
            .send();

    producer.close();
    client.close();
    producer = null;
    client = null;
}

private static IoTMessage parseMessage(String message) {
    IoTMessage iotMessage = null;
    try {
        if (message != null && message.trim().length() > 0) {
            ObjectMapper mapper = new ObjectMapper();
            iotMessage = mapper.readValue(message, IoTMessage.class);
            mapper = null;
        }
    } catch (Throwable t) {
        t.printStackTrace();
    }
    if (iotMessage == null) {
        iotMessage = new IoTMessage();
    }
    return iotMessage;
}
java -jar target/IoTProducer-1.0-jar-with-dependencies.jar --serviceUrl pulsar://nvidia-desktop:6650 --topic 'iotjetsonjson' --message "...JSON…"
You can find all the source code here.
Now we need to decide how to execute the application on the device. You can use the scheduler that comes with your system, such as cron, or an add-on. For reference, I often use cron, a MiNiFi agent, a shell script, or run the application continuously as a service. You will need to configure your own devices and sensors for optimal scheduling; a sample cron entry is sketched below.
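For instance, a hypothetical crontab entry could run a producer script every minute and append its output to a log (the paths are placeholders):
* * * * * /usr/bin/python3 /home/nvidia/producer.py >> /home/nvidia/producer.log 2>&1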
Validate data and monitor
Now that we have a steady stream of events coming into our Pulsar cluster, we can validate the data and monitor progress. The StreamNative Cloud Manager interface is one option, as shown in the following figure. We can also look at the Pulsar metrics endpoint, documented here.
Check Statistics via REST
- http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/stats
- http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/internalStats
Check Statistics via Admin CLI
bin/pulsar-admin topics stats-internal persistent://public/default/mqtt-2
List the subscriptions on the topic
http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/subscriptions
Consume from subscription via REST
http://nvidia-desktop:8080/admin/v2/persistent/public/default/mqtt-2/subscription/mqtt2/position/10
Consume messages via the CLI
bin/pulsar-client consume "persistent://public/default/mqtt-2" -s "mqtt2" -n 5
Query topics via Pulsar SQL
select * from pulsar."public/default".iotjetsonjson;
Next steps
We have now built an edge application that streams data as events occur, and we can connect streaming data from thousands of other applications into an Apache Pulsar cluster. Next, we can add rich real-time analytics with Flink SQL, which gives us advanced stream processing, event stream integration, and large-scale data processing; a sketch follows.
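As a minimal sketch, assuming the StreamNative pulsar-flink connector is installed in the Flink SQL environment (connector option names may vary by connector version), we could map the topic to a table and aggregate classification confidence per camera:
CREATE TABLE iotjetsonjson (
  uuid STRING,
  camera STRING,
  top1pct DOUBLE,
  top1 STRING,
  cpu DOUBLE,
  memory DOUBLE
) WITH (
  'connector' = 'pulsar',
  'topic' = 'persistent://public/default/iotjetsonjson',
  'service-url' = 'pulsar://nvidia-desktop:6650',
  'admin-url' = 'http://nvidia-desktop:8080',
  'format' = 'json'
);

SELECT camera, AVG(top1pct) AS avg_confidence
FROM iotjetsonjson
GROUP BY camera;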
Further reading
If you're interested in learning more about edge applications and building your own connectors, see the following resources:
- Using the FLiPN Stack for Edge AI (Flink, NiFi, Pulsar)
- Slide deck download
- Pulsar client library
- Example source data
- InfluxDB Pulsar IO sink connector