This article is translated from "Event Streaming with Apache Pulsar and Scala" by Giannis Polyzos.

Translator information: Yao Yuqian@Shenzhen Juexing Technology Co., Ltd., dedicated to the field of medical big data. Passionate about open source and active in the Apache Pulsar community.

The author of this article, Giannis Polyzos, is a senior engineer at StreamNative, focusing on Apache Pulsar. Apache Pulsar is a promising cloud-native message streaming platform. In this article, he introduces what Pulsar is and what makes it stand out, then walks through a quick tutorial to help readers get started running Pulsar with Scala.

Article Summary

In the modern data age, there is an ever-increasing need to deliver data insights as fast as possible. What is happening "now" can become irrelevant within minutes or even seconds, so there is a growing need to ingest and process events as quickly as possible - whether to stay competitive in demanding markets or to let a system grow and adapt in response to the environmental stimuli it is exposed to.

As containers and cloud infrastructure evolve, companies are looking to leverage and adopt cloud-native approaches. Moving to the cloud and adopting containers in the system means we will likely be leveraging technologies like Kubernetes for all its amazing capabilities. Placing infrastructure in the cloud and adopting cloud-native solutions means that many users also expect their messaging and streaming solutions to conform to these principles.

In this post, we'll cover how to implement cloud-native event stream processing using Apache Pulsar and Scala. We'll review what Apache Pulsar has to offer in this modern data age, what makes it stand out, and how to run it by creating some simple producers and consumers using Scala and the pulsar4s library.

1. What is Apache Pulsar

As stated in the documentation,

Apache Pulsar is a cloud-native, distributed messaging and streaming platform that manages hundreds of billions of events every day.

It was originally created by Yahoo in 2013 to meet the company's huge scaling needs. The engineering team also reviewed solutions like Apache Kafka at the time, but those didn't quite meet its demands (although these systems have evolved considerably since).

Other systems lacked features such as cross-region replication, multi-tenancy, offset management, and performance in dealing with message backlogs, so Apache Pulsar was born.

Let's take a closer look at what makes it stand out:

  1. Unified messaging and streaming scenarios: The first thing to notice about Apache Pulsar is that it is a unified platform for messaging and streaming. The terms message and stream are often confused, but there are fundamental differences. For example, in a messaging scenario, the user might want to consume a message as soon as it arrives and then delete it; however, in a streaming scenario, the user might want to keep the messages and be able to replay them.
  2. Multi-tenancy: Apache Pulsar was designed from the beginning to be a multi-tenant system. You can think of multi-tenancy as different groups of users, each running in its own isolated environment. Pulsar's logical architecture consists of tenants, namespaces, and topics; a namespace is a logical grouping of topics within a tenant. You can easily map your organization's needs onto this hierarchy and provide isolation, authentication, authorization, quotas, and different policies at the namespace and topic level. As an example of multi-tenancy for an e-commerce business, different departments such as WebBanking and Marketing act as tenants, and members of those departments operate within their tenant.

  3. Cross-region replication: Cross-region replication ensures disaster tolerance by keeping copies of the data in data centers of different clusters distributed across regions. Apache Pulsar provides geo-replication out of the box, without the need for external tools. Alternatives like Apache Kafka rely on a third-party solution - namely MirrorMaker - which is known to be problematic. With Pulsar, you can overcome these issues with powerful built-in geo-replication and design a disaster recovery solution that meets your needs.
  4. Horizontal scaling: The architecture of Apache Pulsar consists of three components: the Pulsar broker, a stateless serving layer; Apache BookKeeper (bookie servers), the storage layer; and Apache ZooKeeper, the metadata layer - although version 2.8.0 introduced an alternative, pluggable metadata layer (see PIP 45). All layers are decoupled from each other, which means you can scale each component independently as needed. Apache BookKeeper uses the concept of distributed ledgers rather than a log-based abstraction, which makes it easy to scale without rebalancing. This also makes Apache Pulsar a great fit for cloud-native environments.
  5. Tiered storage: When you're dealing with a lot of data, topics can grow infinitely large and, over time, storage costs can become very expensive. Apache Pulsar provides tiered storage, so as a topic grows you can offload older data to cheaper storage (e.g. Amazon S3) while your clients can still access the data and continue serving as if nothing had changed.
  6. Pulsar Functions: Pulsar Functions is a lightweight serverless computing framework that lets you deploy your own stream processing logic in a very simple way. Its lightweight nature also makes it an excellent choice for IoT edge analytics use cases.
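
To make the tenant/namespace/topic hierarchy from the multi-tenancy point concrete, here is a small self-contained Scala sketch that builds fully qualified Pulsar topic names. The tenant and namespace names (ecommerce, marketing, webbanking) are illustrative only, not from the original article:

```scala
// Sketch of Pulsar's logical naming hierarchy:
//   {persistent|non-persistent}://tenant/namespace/topic
object TopicNaming {
  def fullyQualified(tenant: String,
                     namespace: String,
                     topic: String,
                     persistent: Boolean = true): String = {
    val domain = if (persistent) "persistent" else "non-persistent"
    s"$domain://$tenant/$namespace/$topic"
  }
}
```

For example, `TopicNaming.fullyQualified("ecommerce", "marketing", "clicks")` yields `persistent://ecommerce/marketing/clicks`, which is the form Pulsar uses to address a topic inside a tenant and namespace.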

Apache Pulsar has many more features like built-in Schema Registry, support for transactions and Pulsar SQL, but now let's see how to actually get Pulsar up and running and create our first producers and consumers in Scala.

2. Scenario and cluster setup example

Taking a simple scenario as an example, we create a producer that reads events from a simulated sensor and sends them to a topic, and on the other side a consumer that subscribes to the topic and reads the incoming events. We will use the pulsar4s client library for the implementation and Docker to run the Pulsar cluster. To start a Pulsar cluster in standalone mode, run the following command in a terminal:

docker run -it \ 
    -p 6650:6650 \ 
    -p 8080:8080 \ 
    --name pulsar \ 
    apachepulsar/pulsar:2.8.0 \ 
    bin/pulsar standalone

This command will start Pulsar and bind the necessary ports to the local computer. With the cluster up and running, you can start creating producers and consumers.

3. Apache Pulsar Producer

First, we need the pulsar4s-core and pulsar4s-circe dependencies, so we add the following to the build.sbt file:

val pulsar4sVersion = "2.7.3"

lazy val pulsar4s       = "com.sksamuel.pulsar4s" %% "pulsar4s-core"  % pulsar4sVersion
lazy val pulsar4sCirce  = "com.sksamuel.pulsar4s" %% "pulsar4s-circe" % pulsar4sVersion

libraryDependencies ++= Seq(
  pulsar4s, pulsar4sCirce
)
Then we define the message payload for a sensor event as follows:


case class SensorEvent(sensorId: String,
                       status: String,
                       startupTime: Long,
                       eventTime: Long,
                       reading: Double)
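
The producer loop below calls SensorDomain.generate() to fabricate events; the real generator lives in the article's repository. As a rough, self-contained stand-in, such a generator might look like this - the field ranges, status values, and sensor-id scheme are assumptions, not the article's actual values:

```scala
import scala.util.Random

// SensorEvent as defined above
case class SensorEvent(sensorId: String,
                       status: String,
                       startupTime: Long,
                       eventTime: Long,
                       reading: Double)

// Hypothetical stand-in for the article's SensorDomain.generate() helper
object SensorDomain {
  private val statuses = Vector("RUNNING", "IDLE", "WARNING")

  def generate(): SensorEvent = {
    val now = System.currentTimeMillis()
    SensorEvent(
      sensorId    = s"sensor-${Random.nextInt(10)}",         // one of ten simulated sensors
      status      = statuses(Random.nextInt(statuses.size)), // random operational status
      startupTime = now - 60000,                             // pretend the sensor booted a minute ago
      eventTime   = now,                                     // event produced "now"
      reading     = 20.0 + Random.nextDouble() * 10.0        // reading in the [20.0, 30.0) range
    )
  }
}
```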

We also need the following imports in scope:

import com.sksamuel.pulsar4s.{DefaultProducerMessage, EventTime, ProducerConfig, PulsarClient, Topic}
import io.ipolyzos.models.SensorDomain
import io.ipolyzos.models.SensorDomain.SensorEvent
import io.circe.generic.auto._
import com.sksamuel.pulsar4s.circe._
import scala.concurrent.ExecutionContext.Implicits.global

The main entry point for both producing and consuming applications is the Pulsar client, which handles the connection to the broker. On the client you can also set up authentication for the cluster and tune other important configurations such as timeouts and connection pooling. You can instantiate the client simply by providing the service URL to connect to.

val pulsarClient = PulsarClient("pulsar://localhost:6650")

With the client in hand, let's look at the producer's initialization and its send loop.

val topic = Topic("sensor-events")

// create the producer
val eventProducer = pulsarClient.producer[SensorEvent](ProducerConfig(
  topic, 
  producerName = Some("sensor-producer"), 
  enableBatching = Some(true),
  blockIfQueueFull = Some(true))
)

// send 100 messages
(0 until 100) foreach { _ =>
   val sensorEvent = SensorDomain.generate()
   val message = DefaultProducerMessage(
      Some(sensorEvent.sensorId), 
      sensorEvent, 
      eventTime = Some(EventTime(sensorEvent.eventTime)))
  
   eventProducer.sendAsync(message) // use the async method to send the message
}

There are a few things to note here:

  • We create the producer by providing the necessary configuration - both the producer and the consumer are highly configurable and can be tuned to the scenario at hand.
  • Here we provide the producer's topic name, enable batching, and make the producer block rather than fail when its queue is full.
  • With batching enabled, Pulsar holds messages in an internal queue (1000 messages by default) and sends them to the broker as a batch once the queue fills up.
  • As you can see in the sample code, we use the .sendAsync() method to send messages to Pulsar. This sends a message without waiting for an acknowledgment, and since we're buffering messages into a queue, it can overwhelm the client.
  • The blockIfQueueFull option applies backpressure, telling the producer to wait before sending more messages.
  • Finally, we create the message to send. We specify sensorId as the key of the message, sensorEvent as the value, and we also provide eventTime, the time the event was generated.
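
The batching behaviour described above can be sketched in plain Scala, independent of Pulsar: messages accumulate in a bounded buffer and are handed off as one batch when the buffer fills. The BatchingBuffer name and the batch size of 3 used below are illustrative only (Pulsar's default batch size is 1000 messages):

```scala
import scala.collection.mutable.ArrayBuffer

// Plain-Scala sketch of producer-side batching: hold messages in a bounded
// buffer and hand the whole batch to `send` once the buffer is full.
class BatchingBuffer[T](maxBatchSize: Int, send: Seq[T] => Unit) {
  private val buffer = ArrayBuffer.empty[T]

  def add(message: T): Unit = {
    buffer += message
    if (buffer.size >= maxBatchSize) flush()
  }

  // also called when closing, to push out any partial batch
  def flush(): Unit =
    if (buffer.nonEmpty) {
      send(buffer.toList)
      buffer.clear()
    }
}
```

A real producer additionally flushes a partial batch after a publish-delay timeout or on close, which the explicit flush() call stands in for here.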

At this point, our producer is in place and starts sending messages to Pulsar. The full implementation can be found here.

4. Apache Pulsar Consumer

Now let's turn our attention to the consuming side. As with the producing side, the consumer needs to be set up with a Pulsar client.

import com.sksamuel.pulsar4s.{ConsumerConfig, Subscription, Topic}
import org.apache.pulsar.client.api.{SubscriptionInitialPosition, SubscriptionType}
import scala.util.{Failure, Success}

val consumerConfig = ConsumerConfig(
  Subscription("sensor-event-subscription"),
  Seq(Topic("sensor-events")),
  consumerName = Some("sensor-event-consumer"),
  subscriptionInitialPosition = Some(SubscriptionInitialPosition.Earliest),
  subscriptionType = Some(SubscriptionType.Exclusive)
)

// create the consumer from the config above
val consumerFn = pulsarClient.consumer[SensorEvent](consumerConfig)

var totalMessageCount = 0
while (true) {
  consumerFn.receive match {
    case Success(message) =>
      consumerFn.acknowledge(message.messageId)
      totalMessageCount += 1
      println(s"Total Messages '$totalMessageCount' - Acked Message: ${message.messageId}")
    case Failure(exception) =>
      println(s"Failed to receive message: ${exception.getMessage}")
  }
}

The steps are as follows:

  • Again, we first create the consumer configuration. Here we specify a subscription name, the topic to subscribe to, the name of the consumer, and where we want the consumer to start consuming messages - here we specify Earliest, which means the subscription starts reading from the earliest available message in the topic.
  • Finally, we specify the subscriptionType - in this case Exclusive, which is also the default subscription type (more on subscription types below).
  • Once the configuration is in place, we set up a new consumer with it and run a simple consumption loop: we read a new message using the receive method, which blocks until a message is available, then acknowledge the message and print the total number of messages received so far along with the acknowledged messageId.
  • Please note: when a new message is received, you need to acknowledge it if all goes well; otherwise you need to negatively acknowledge it using the negativeAcknowledge() method so that it can be redelivered.
  • With the consumer implementation in place, we have a running publish-subscribe application: a producer writes sensor events to a Pulsar topic, and a consumer subscribes to that topic and consumes the messages.
  • The full implementation of the consumer can be found here.

5. Apache Pulsar subscription types

As mentioned earlier, Apache Pulsar provides a unified messaging and streaming model through its different subscription types.

Pulsar has the following subscription types:

  • Exclusive subscription: only one consumer is allowed to read messages from the subscription at any point in time.
  • Failover subscription: only one consumer is allowed to read messages from the subscription at any point in time, but you can have multiple standby consumers that take over in case the active consumer fails.
  • Shared subscription: multiple consumers can attach to the subscription, and work is shared among them in a round-robin fashion.
  • Key_Shared subscription: multiple consumers can attach to the subscription, and each consumer is assigned a unique set of keys. That consumer is responsible for processing the keys assigned to it; if it fails, its key set is reassigned to another consumer.

Different subscription types suit different scenarios. For example, to implement a typical fan-out messaging pattern, you can use an exclusive or failover subscription type. For message queues and work queues, a shared subscription lets you distribute work among multiple consumers; for streaming scenarios or key-based stream processing, failover and Key_Shared subscriptions are good choices, as they allow in-order consumption or scaling out processing based on keys.
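
The Key_Shared idea of pinning every key to one consumer can be sketched in plain Scala. The modulo-on-hash assignment below is a simplification of Pulsar's actual hash-range mechanism, and the consumer names are made up:

```scala
// Plain-Scala sketch of the Key_Shared idea: hash each message key and pick a
// consumer by modulo, so the same key always lands on the same consumer.
// Pulsar's real implementation assigns hash *ranges* to consumers; this only
// illustrates the per-key stickiness.
object KeySharedAssignment {
  def consumerFor(key: String, consumers: Vector[String]): String = {
    val idx = math.abs(key.hashCode % consumers.size)
    consumers(idx)
  }
}
```

Every event keyed by the same sensorId would therefore be processed by the same consumer, preserving per-key ordering while still spreading keys across the consumer group.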

6. Summary and reading extension

In this post, we briefly covered what Apache Pulsar is, how it stands out as a unified messaging and streaming platform, how to create some simple producer and consumer applications, and finally how Pulsar unifies messaging and streaming through its different subscription types.

Further reading:

  • Pulsar IO: move data in and out of Pulsar with ease.
  • Pulsar Functions (Pulsar's serverless and lightweight computing framework): apply processing logic to Pulsar topics without all the boilerplate of standalone producer and consumer applications.
  • Function Mesh: make your event streams truly cloud-native by leveraging Kubernetes-native features like deployment and autoscaling.
