About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates compute from storage and supports multi-tenancy, persistent storage, and multi-datacenter and cross-region data replication, while offering strong consistency, high throughput, low latency, and high scalability for streaming data storage.

GitHub address: http://github.com/apache/pulsar/

This article was written by Pavels Sisojevs and translated by teng-da. It was first published on InfoQ. The original article: https://scala.monster/train-station/

It takes about 10 minutes to read this article.

This photo was taken at the Landwasser Viaduct in Switzerland. Switzerland is famous for its railway network; according to Wikipedia, it has the densest railway network in the world. In this article, we simulate the Swiss railway network.

We will use Apache Pulsar and Neutron. Apache Pulsar is an open-source distributed pub-sub messaging system, originally developed by Yahoo! and now part of the Apache Software Foundation. Data architects, data analysts, and programmers often compare Apache Pulsar with Apache Kafka, and many articles weigh the strengths and weaknesses of the two.

Neutron is a Pulsar library built on FS2, a concurrent stream processing library. Neutron is already used at Chatroulette, but it is still under active development.

Owning a toy railway network has always been my childhood dream. Now, I can build a virtual railway network by myself.

Next, we will develop an event-driven railway network simulator together.

Ideas

We are going to build a railway network with three stations: Geneva, Bern, and Zurich. Geneva and Zurich are both connected to Bern, but they are not connected to each other.

Each station is a node, and connected nodes communicate through a message broker, Apache Pulsar. A node consumes events published by the nodes it is connected to. Consumers filter incoming events and only process events related to their own city.

One way to control the behavior of the simulator is an HTTP endpoint for manual intervention. The user adds a new train to the system by sending an HTTP request.

We don't store any data persistently or use a database or caching system; all data is kept in memory. This lets us use Ref, a high-level concurrency mechanism, to hold the state.

Apache Pulsar is the core of the system and handles communication between nodes. Whenever the state changes, the system should publish a new event describing the change. Every event should carry a timestamp and a train ID that identifies a specific train. Initially, there are two events:

  • Departed: published when a train departs.
  • Arrived: published when a train arrives.

These two events contain basic information about the train: train identification number, departure city, destination city, estimated time of arrival, and event timestamp.

Each city consumes events from connected cities. For example, Zurich consumes events from Bern but ignores events from Geneva. The event consumer in Zurich should capture trains departing from Bern and destined for Zurich. Each city corresponds to a topic, so 3 cities correspond to 3 topics. If optimization is needed, the general per-city topic can be split into several more specific topics.

The business logic is connected to Apache Pulsar through Neutron.

Each consumed topic is converted to an fs2 stream. If you are not familiar with fs2 streams, refer to the fs2 guide; this article does not cover that part.

I wrote this application in the tagless final style using the cats library, with ZIO as the runtime effect.

Introduction to Pulsar

Apache Pulsar is a distributed messaging and streaming platform for building highly scalable systems. Systems communicate through messages, and the number of topics can reach millions. From a developer's point of view, Apache Pulsar can be treated as a black box, but I recommend learning how it works under the hood. To make the rest of this article easier to follow, here are a few concepts:

  • topic: the medium through which messages are transmitted. There are two types of topics:

    1. persistent topic: message data is stored persistently.
    2. non-persistent topic: message data is not stored persistently but kept in memory. If the Pulsar broker goes down, all messages in transit are lost.
  • producer: connects to a topic and publishes messages.
  • consumer: connects to a topic through a subscription and receives messages.
  • subscription: a configuration rule that determines how messages are delivered to consumers. Pulsar supports four subscription types:

    1. exclusive: a single consumer; if multiple consumers subscribe at the same time, an error is raised;
    2. failover: multiple consumers, but only one of them receives messages at a time;
    3. shared: multiple consumers receive messages in round-robin fashion;
    4. key_shared: multiple consumers; messages are distributed by key, and messages with the same key go to the same consumer.

In a messaging system, when one system emits events, a producer handles them and publishes them to a topic. In another system, a consumer connected to that topic through a subscription receives them.

Learn more about Apache Pulsar.

Business logic

As mentioned above, two events happen in the railway network: the departure and the arrival of a train. The code defining these two events is as follows:

case class Departed(id: EventId, trainId: TrainId, from: From, to: To, expected: Expected, created: Timestamp) extends Event
case class Arrived(id: EventId, trainId: TrainId, from: From, to: To, expected: Expected, created: Timestamp)  extends Event

The event needs to contain the basic information of the actions that have taken place in the system: unique event id, train id, departure city, destination city, estimated time of arrival, and actual event timestamp. We can also add information such as the station number in the future.

To ensure that this article is simple and easy to understand, we limit the amount of data required for this system to work. In order to distinguish the fields in the event (such as destination and departure city), all fields are strongly typed.
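To make "strongly typed" concrete, here is a minimal sketch of how such wrapper types could look. The definitions below are assumptions for illustration (the article does not show them); only the field names of Departed and the `.city` and `.value` accessors are taken from the snippets in this article.

```scala
import java.time.Instant

object StrongTypesDemo {
  // Hypothetical wrapper types: one small case class per concept, so that
  // values with the same underlying type cannot be swapped by accident.
  final case class City(value: String)
  final case class TrainId(value: String)
  final case class EventId(value: String)
  final case class From(city: City)
  final case class To(city: City)
  final case class Expected(value: Instant)
  final case class Timestamp(value: Instant)

  sealed trait Event
  final case class Departed(
      id: EventId, trainId: TrainId, from: From, to: To,
      expected: Expected, created: Timestamp
  ) extends Event

  // Renders a short, human-readable summary of a departure.
  def describe(e: Departed): String =
    s"Train ${e.trainId.value}: ${e.from.city.value} -> ${e.to.city.value}"

  def main(args: Array[String]): Unit = {
    val now = Instant.parse("2020-12-03T10:15:30.00Z")
    val event = Departed(
      EventId("e-1"), TrainId("153"),
      From(City("Bern")), To(City("Zurich")),
      Expected(now), Timestamp(now)
    )
    // Passing To(...) where From(...) is expected is rejected by the compiler.
    println(describe(event))
  }
}
```

Because From and To are distinct types, mixing up the departure and destination city becomes a compile-time error rather than a runtime bug.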

Since there is no system that can automatically detect the arrival or departure of trains, we need to manually control the railway network. Suppose a train dispatcher controls the railway network through buttons and dashboards. Although we don't have a cool UI, we can build an API. The core of the API is two simple commands to trigger the business logic of the station:

case class Arrival(trainId: TrainId, time: Actual)
case class Departure(id: TrainId, to: To, time: Expected, actual: Actual)

Train departure

Let's start by creating a train! This command is relatively simple and can be sent via cURL:

curl --request POST \
  --url http://localhost:8081/departure \
  --header 'content-type: application/json' \
  --data '{
    "id": "153",
    "to": "Zurich",
    "time": "2020-12-03T10:15:30.00Z",
    "actual": "2020-12-03T10:15:30.00Z"
}'

The above command assumes that the Bern service node is running on port 8081. Each node runs an HTTP server that can handle this request. We use the Http4s library for the HTTP server, and the route is defined as follows:

case req @ POST -> Root / "departure" =>
  req
    .asJsonDecode[Departure]
    .flatMap(departures.register)
    .flatMap(_.fold(handleDepartureErrors, _ => Ok()))

The route calls the Departures service, which has a single method, register, for registering departing trains:

trait Departures[F[_]] {
  def register(departure: Departure): F[Either[DepartureError, Departed]]
}

Scala supports multiple ways of validating data. I chose the most straightforward one: returning Either with a custom error. If the train is registered successfully, the Departed event is returned; otherwise, an error is returned.
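The Either approach can be sketched without any effect type. In the sketch below, the pure register function, the String-based cities, and the status codes 200 and 400 are assumptions for illustration; the article's handleDepartureErrors is not shown, so the real mapping may differ.

```scala
object EitherValidationDemo {
  sealed trait DepartureError
  object DepartureError {
    final case class UnexpectedDestination(city: String) extends DepartureError
  }

  // Hypothetical pure analogue of register: Right on success, Left with a
  // custom error when the destination is not a connected city.
  def register(connectedTo: List[String], destination: String): Either[DepartureError, String] =
    if (connectedTo.contains(destination)) Right(s"Departed to $destination")
    else Left(DepartureError.UnexpectedDestination(destination))

  // Mirrors the fold in the route: one branch per outcome.
  // The concrete status codes are an assumption.
  def toStatus(result: Either[DepartureError, String]): Int =
    result.fold(_ => 400, _ => 200)

  def main(args: Array[String]): Unit = {
    val bernConnections = List("Geneva", "Zurich")
    println(toStatus(register(bernConnections, "Zurich")))
    println(toStatus(register(bernConnections, "Basel")))
  }
}
```

The caller is forced to handle both branches, which is what makes Either a convenient carrier for validation results.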

To keep this article simple, we call the message producer from within the Departures implementation. An implementation of the Departures service is created by the make function in the Departures companion object:

object Departures {
  def make[F[_]: Monad: UUIDGen: Logger](
      city: City,
      connectedTo: List[City],
      producer: Producer[F, Event]
  ): Departures[F] = new Departures[F] {
    def register(departure: Departure): F[Either[DepartureError, Departed]] = ???
  }
}

To implement the Departures interface, we need to place constraints on the effect F: it requires UUIDGen and Logger instances. I have created simple UUIDGen and Logger interfaces in the program.

F should also have a Monad instance, which is used to chain function calls.

First, we run the validation logic to check that the departure is valid. We only need to check whether the destination city is in the list of connected cities:

def validated(departure: Departure)(f: F[Departed]): F[Either[DepartureError, Departed]] = {
  val destination = departure.to.city
  connectedTo.find(_ === destination) match {
    case None =>
      val e: DepartureError = DepartureError.UnexpectedDestination(destination)
      F.error(s"Tried to departure to an unexpected destination: $departure")
       .as(e.asLeft)
    case _ =>
      f.map(_.asRight)
  }
}

If the destination city is not in the list, an error message log is generated and an error is returned. Otherwise, create a Departed event and return it as a result.

Next, we implement the register function. The sample code is as follows:

def register(departure: Departure): F[Either[DepartureError, Departed]] =
  validated(departure) {
    F.newEventId
      .map {
        Departed(
          _,
          departure.id,
          From(city),
          departure.to,
          departure.time,
          departure.actual.toTimestamp
        )
      }
      .flatTap(producer.send_)
  }

The code first validates the destination city. If it is valid, it generates a new event ID with newEventId and uses it to create a Departed event, which the producer passed to the make function then publishes to the city's topic in Pulsar. See the final version of the Departures service.

Expected departures

We have seen how to generate trains. If a train goes from Zurich to Bern, Bern will be notified accordingly.

Bern listens to events from Zurich. Once a Departed event with Bern as the destination arrives, the train is added to the list of expected trains. For now we focus only on the business logic and will discuss message consumption later. We define a DepartureTracker service for expected departures. The sample code is as follows:

trait DepartureTracker[F[_]] {
  def save(e: Departed): F[Unit]
}

This service will act as the sink for the Departed event stream, so we don't care about the return type and don't expect any validation errors. As with the Departures service above, we first create the companion object and define the make function:

def make[F[_]: Applicative: Logger](
    city: City,
    expectedTrains: ExpectedTrains[F]
  ): DepartureTracker[F] = new DepartureTracker[F] {
    def save(e: Departed): F[Unit] = {
      // Record the train and log it, but only if it is heading to this city
      val updateExpectedTrains =
        expectedTrains.update(e.trainId, ExpectedTrain(e.from, e.expected)) *>
          F.info(s"$city is expecting ${e.trainId} from ${e.from} at ${e.expected}")
      updateExpectedTrains.whenA(e.to.city === city)
    }
  }

We rely on the ExpectedTrains service, which stores incoming trains; we will implement it shortly. The save function only takes effect when the destination city of the incoming train matches the city of this node. For example, Geneva and Zurich both consume events from Bern. When Bern publishes a Departed event, one of the cities ignores the message, while the other, the destination city, updates its expected-train table and writes a log message.

The expected-train store provides at least the following functions:

trait ExpectedTrains[F[_]] {
  def get(id: TrainId): F[Option[ExpectedTrain]]
  def remove(id: TrainId): F[Unit]
  def update(id: TrainId, expectedTrain: ExpectedTrain): F[Unit]
}

Trying to remove a train that does not exist in the system does not fail. In some business scenarios this should result in an error, but in this case we ignore it. In this test implementation, the data is kept in memory and is not persisted.

def make[F[_]: Functor](
    ref: Ref[F, Map[TrainId, ExpectedTrain]]
  ): ExpectedTrains[F] = new ExpectedTrains[F] {
    def get(id: TrainId): F[Option[ExpectedTrain]] = 
      ref.get.map(_.get(id))
    def remove(id: TrainId): F[Unit] = 
      ref.update(_.removed(id))
    def update(id: TrainId, train: ExpectedTrain): F[Unit] = 
      ref.update(_.updated(id, train))
  }

We use Ref as the high-level concurrency mechanism in this application.
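For readers who have not used Ref, the idea can be approximated with the JDK's AtomicReference: the state is an immutable Map, and every change is an atomic "apply this pure function" step. This is only a standard-library analogue under simplified assumptions (String IDs and cities, no effect type), not the cats-effect Ref used in the application.

```scala
import java.util.concurrent.atomic.AtomicReference

object InMemoryStoreDemo {
  // Simplified, hypothetical stand-in: train IDs and origin cities are Strings.
  final class ExpectedTrainStore {
    private val ref = new AtomicReference[Map[String, String]](Map.empty)

    def get(id: String): Option[String] = ref.get().get(id)

    // updateAndGet re-applies the pure function until the swap succeeds,
    // so concurrent updates are not lost.
    def update(id: String, from: String): Unit = {
      ref.updateAndGet(m => m.updated(id, from))
      ()
    }

    // Removing an unknown ID leaves the map unchanged and does not fail.
    def remove(id: String): Unit = {
      ref.updateAndGet(m => m - id)
      ()
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new ExpectedTrainStore
    store.update("153", "Bern")
    println(store.get("153"))
    store.remove("153")
    store.remove("does-not-exist")
    println(store.get("153"))
  }
}
```

The main difference is that Ref wraps each of these operations in the effect F, so reads and updates are described as values and only run when the program is executed.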

Train arrival

The last part of the business logic trilogy is the arrival of the train. Similar to the train departure, first create an HTTP endpoint, which can be called with a simple cURL POST request:

curl --request POST \
  --url http://localhost:8081/arrival \
  --header 'Content-Type: application/json' \
  --data '{
    "trainId": "123",
    "time": "2020-12-03T10:15:30.00Z"
}'

Then the request is processed by the Http4s route:

case req @ POST -> Root / "arrival" =>
  req
    .asJsonDecode[Arrival]
    .flatMap(arrivals.register)
    .flatMap(_.fold(handleArrivalErrors, _ => Ok()))

The Arrivals service is similar to the Departures service described above. It has only one method, register:

trait Arrivals[F[_]] {
  def register(arrival: Arrival): F[Either[ArrivalError, Arrived]]
}

Then we need to validate the request. The sample code is as follows:

def validated(arrival: Arrival)(f: ExpectedTrain => F[Arrived]): F[Either[ArrivalError, Arrived]] =
  expectedTrains
    .get(arrival.trainId)
    .flatMap {
      case None =>
        val e: ArrivalError = ArrivalError.UnexpectedTrain(arrival.trainId)
        F.error(s"Tried to create arrival of an unexpected train: $arrival")
         .as(e.asLeft)
      case Some(train) =>
        f(train).map(_.asRight)
    }

We check whether the arriving train is expected. If it is, we create an Arrived event; otherwise, we log an error and return it. The register method for arrivals is similar to the previous register method:

def register(arrival: Arrival): F[Either[ArrivalError, Arrived]] =
  validated(arrival) { train =>
    F.newEventId
      .map {
        Arrived(
          _,
          arrival.trainId,
          train.from,
          To(city),
          train.time,
          arrival.time.toTimestamp
        )
      }
      .flatTap(a => expectedTrains.remove(a.trainId))
      .flatTap(producer.send_)
  }

Compared with Departures, registering an arrival not only publishes a new event but also removes the arrived train from the list of expected trains.

That is all of the business logic. The code is covered by unit tests (using ZIO Test); please refer to the GitHub file.

Message consumption

This section mainly talks about message consumption, and will also connect all logical services together.

Create resources

First, we create the required resources. A city node contains four components: configuration, an event producer, event consumers, and the ExpectedTrains Ref. We can combine these four resources in a case class and create them outside the Main class:

final case class Resources[F[_], E](
  config: Config,
  producer: Producer[F, E],
  consumers: List[Consumer[F, E]],
  trainRef: Ref[F, Map[TrainId, ExpectedTrain]]
)

We use the ciris library to read Config from environment variables. For the configuration, please refer to the GitHub file. We use the Neutron library developed by Chatroulette to create producers and consumers.

First, create a Pulsar object instance to establish a connection with the Apache Pulsar cluster:

Pulsar.create[F](config.pulsar.serviceUrl)

The above operation only needs serviceUrl and gives us a Resource[F, PulsarClient], which can be used to create producers and consumers. Before creating a producer, we need to create a topic object:

def topic(config: PulsarConfig, city: City) =
  Topic(
    Topic.Name(city.value.toLowerCase),
    config
  ).withType(Topic.Type.Persistent)

The name of the topic is the name of the city, and it is a persistent topic, so that unacknowledged messages are not lost. In addition, as part of the configuration, we pass the namespace and the tenant. For more information about namespaces and tenants, please see the Pulsar documentation.
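Tenant, namespace, and topic name together form the full Pulsar topic name, which has the shape persistent://tenant/namespace/topic. The helper below is a hypothetical illustration of that naming scheme, not Neutron's API; the tenant "public" and namespace "default" are the defaults of a standalone Pulsar installation and are used here only as example values.

```scala
object TopicNameDemo {
  // Hypothetical helper that assembles a full Pulsar topic name.
  // The city is lower-cased, as in the topic function above.
  def topicUrl(tenant: String, namespace: String, city: String, persistent: Boolean = true): String = {
    val domain = if (persistent) "persistent" else "non-persistent"
    s"$domain://$tenant/$namespace/${city.toLowerCase}"
  }

  def main(args: Array[String]): Unit = {
    // One topic per city, all in the same tenant and namespace.
    List("Geneva", "Bern", "Zurich")
      .map(topicUrl("public", "default", _))
      .foreach(println)
  }
}
```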

Creating a producer takes just one line:

def producer(client: Pulsar.T, config: Config): Resource[F, Producer[F, E]] =
  Producer.create[F, E](client, topic(config.pulsar, config.city))

There are many ways to create a producer; we chose the simplest one, which only needs the Pulsar client created earlier and a topic.

It takes slightly more operations to create a consumer, because you also need to create a subscription:

def consumer(client: PulsarClient, config: Config, city: City): Resource[F, Consumer[F, E]] = {
  val name         = s"${city.value}-${config.city.value}"
  val subscription =
          Subscription
            .Builder
            .withName(Subscription.Name(name))
            .withType(Subscription.Type.Failover)
            .build
  Consumer.create[F, E](client, topic(config.pulsar, city), subscription)
}

We create a subscription whose name combines the name of the connected city with the name of the city this node represents. We use the Failover subscription type, assuming that 2 instances of a service run in parallel (in case one instance goes down).
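To see which subscriptions this naming rule produces for the three-station network, here is a small sketch. The network map and the subscriptionNames helper are hypothetical and use plain Strings; only the "connected city, dash, own city" format comes from the consumer function above.

```scala
object SubscriptionNameDemo {
  // The network from this article: Geneva and Zurich are both connected to Bern.
  val network: Map[String, List[String]] = Map(
    "Geneva" -> List("Bern"),
    "Bern"   -> List("Geneva", "Zurich"),
    "Zurich" -> List("Bern")
  )

  // One subscription per connected city, named "<connected city>-<this node's city>".
  def subscriptionNames(node: String): List[String] =
    network.getOrElse(node, Nil).map(connected => s"$connected-$node")

  def main(args: Array[String]): Unit =
    network.keys.toList.sorted.foreach { node =>
      println(s"$node: ${subscriptionNames(node).mkString(", ")}")
    }
}
```

Because all instances of the same node use the same subscription name, a second instance of a city attaches to the existing Failover subscription instead of creating a new one.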

With the required Ref added, we can finally create Resources:

for {
  config    <- Resource.liftF(Config.load[F])
  client    <- Pulsar.create[F](config.pulsar.serviceUrl)
  producer  <- producer(client, config)
  consumers <- config.connectedTo.traverse(consumer(client, config, _))
  trainRef  <- Resource.liftF(Ref.of[F, Map[TrainId, ExpectedTrain]](Map.empty))
} yield Resources(config, producer, consumers, trainRef)

Note that we use the traverse method on the connectedTo list to create the list of consumers. See the GitHub file for the final result.

Start engine

We use zio.Task as the effect type in the application. zio.Task has the fewest type parameters, so it is easier to understand for those who are not familiar with ZIO. To learn more about the other type parameters, refer to the ZIO introduction.

First, we create the Resources defined earlier:

Resources
  .make[Task, Event]
  .use {
    case Resources(config, producer, consumers, trainRef) => ???
  }

This gives us the four resources. First, we initialize the services and create the routes for the HTTP server:

val expectedTrains   = ExpectedTrains.make[Task](trainRef)
val arrivals         = Arrivals.make[Task](config.city, producer, expectedTrains)
val departures       = Departures.make[Task](config.city, config.connectedTo, producer)
val departureTracker = DepartureTracker.make[Task](config.city, expectedTrains)
val routes = new StationRoutes[Task](arrivals, departures).routes.orNotFound

Create an HTTP server:

val httpServer = Task.concurrentEffectWith { implicit CE =>
  BlazeServerBuilder[Task](ec)
    .bindHttp(config.httpPort.value, "0.0.0.0")
    .withHttpApp(routes)
    .serve
    .compile
    .drain
}

If you know Http4s well, the above should be easy to follow. If not, check the related documentation. Next, we start consuming incoming messages by creating a stream:

val departureListener =
  Stream
    .emits(consumers)
    .map(_.autoSubscribe)
    .parJoinUnbounded
    .collect { case e: Departed => e }
    .evalMap(departureTracker.save)
    .compile
    .drain

In short, we created an event stream using the FS2 library. First, we create a stream of consumers and call the autoSubscribe method on each consumer to subscribe to its topic. Then we merge all the streams into one with parJoinUnbounded and use collect to drop all messages except Departed. Finally, we call the save method of departureTracker on each event, then compile and drain the stream.
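The shape of this pipeline can be illustrated with plain collections. The sketch below is an analogue under simplified assumptions: events are in-memory lists rather than Pulsar messages, fields are Strings, and flatten merges the topics in order, whereas parJoinUnbounded interleaves them nondeterministically. The expectedAt function is hypothetical and folds the filtering done by save into the same pipeline.

```scala
object DepartureListenerDemo {
  sealed trait Event
  final case class Departed(trainId: String, from: String, to: String) extends Event
  final case class Arrived(trainId: String, from: String, to: String) extends Event

  // perTopic holds the events read from each consumed topic.
  def expectedAt(city: String, perTopic: List[List[Event]]): Map[String, String] =
    perTopic.flatten                          // merge all topics into one sequence
      .collect { case e: Departed => e }      // keep only Departed events
      .filter(_.to == city)                   // same check as whenA inside save
      .foldLeft(Map.empty[String, String]) {  // record trainId -> origin city
        (expected, e) => expected.updated(e.trainId, e.from)
      }

  def main(args: Array[String]): Unit = {
    val fromBern: List[Event] = List(
      Departed("153", "Bern", "Zurich"),
      Departed("154", "Bern", "Geneva"),
      Arrived("99", "Zurich", "Bern")
    )
    println(expectedAt("Zurich", List(fromBern)))
  }
}
```

The collect step with a partial function is the same operation in both worlds: it filters by type and narrows the element type from Event to Departed in one pass.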

There are now two final streams: the HTTP server and the incoming messages from Pulsar. Everything is wired up, so all that remains is to run them, that is, zip them in parallel and discard the results:

departureListener
  .zipPar(httpServer)
  .unit

The code blocks that make up the Main class are relatively simple, and relatively easy to read and maintain.

Conclusion

This article gives an example of an event-driven system, sorts out the business logic step by step, and simulates the Swiss railway network. You can modify and expand on the basis of the sample code in this article.

This article uses only some of the features of Apache Pulsar; Pulsar offers much more, and it is easy to operate and powerful. We built a simple distributed system consisting of several nodes that communicate by message passing over Apache Pulsar. The application is written with the cats library, with ZIO Task as the main effect type.

In addition, we tried Neutron. Although Neutron is already used at Chatroulette, it is still under development.

The final version of this program is available on GitHub, and instructions for running it are in the README.
