Introduction: At the TGIP-CN live event on March 6, we invited Lu Neng, a senior engineer at StreamNative, to present the functions and features of Pulsar Function Mesh. The following is a condensed write-up of his talk for your reference.
I am very happy to share StreamNative's new work built on Pulsar Function today: Function Mesh. Its core idea is to manage functions that are complex, separate, and individually managed in a unified way, and to integrate natively with Kubernetes so that it can make full use of Kubernetes features and scheduling algorithms.
Data Processing in Pulsar
First, let's look at the data processing modules and methods that Pulsar supports, which fall into three main categories. The first is interactive querying based on Presto. Pulsar has its own Pulsar SQL, which uses Presto to query the entire Pulsar cluster; a Presto connector lets the Presto cluster query topics directly.
Second, as a core message queue and message processing system, Pulsar can connect to various stream and batch processing frameworks, such as Flink, Spark, and Hive. We will release a complete solution for integrating Pulsar with Flink SQL in the future.
Finally, Pulsar has the built-in Pulsar Function. Its core idea is to provide the simplest possible API so that users can easily process the data flowing through Pulsar. In short, Pulsar Function is a lightweight data processing component that mainly performs the following operations:
- Consume messages from one or more Pulsar topics;
- Apply user-supplied processing logic to each message;
- Publish the results to a Pulsar topic.
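The three operations above can be sketched without a Pulsar cluster. What follows is a minimal in-memory sketch, not Pulsar's implementation: the class name `ConsumeProcessPublishSketch`, the method `run`, and the use of queues and a list as stand-ins for topics are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

public class ConsumeProcessPublishSketch {
    // Drains every input "topic", applies the user logic to each message,
    // and appends each result to the single output "topic".
    public static <I, O> List<O> run(List<Queue<I>> inputTopics, Function<I, O> userLogic) {
        List<O> outputTopic = new ArrayList<>();
        for (Queue<I> topic : inputTopics) {
            I message;
            while ((message = topic.poll()) != null) {      // consume
                outputTopic.add(userLogic.apply(message));  // process, then publish
            }
        }
        return outputTopic;
    }

    public static void main(String[] args) {
        Queue<String> topicA = new ArrayDeque<>(Arrays.asList("hello", "pulsar"));
        Queue<String> topicB = new ArrayDeque<>(Arrays.asList("function"));
        List<Queue<String>> inputs = Arrays.asList(topicA, topicB);
        List<String> output = run(inputs, message -> message + "!");
        System.out.println(output); // prints [hello!, pulsar!, function!]
    }
}
```

In a real deployment the broker delivers messages continuously and the function runtime handles acknowledgement; the sketch only shows the shape of the data flow.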
Pulsar Function
What is Pulsar Function
The figure shows the lightweight data processing flow of Pulsar Function mentioned above. Users can specify multiple input topics, and each input topic sends its data to the user-defined Pulsar Function. When the function finishes processing a message, the result is sent to a single output Pulsar topic; auxiliary topics can also be used for logs or for collecting messages.
Pulsar Function is not a complete stream processing framework. It does not provide as many guarantees as Flink, nor is it a computing abstraction layer. It is tightly integrated with Pulsar to handle computing tasks. Deployment is very simple, and there is no need to build or manage an additional cluster: you only need to enable Function support in the Pulsar configuration file, and then you can submit functions to the existing Pulsar cluster. Users can process data directly in the cluster without maintaining a separate cluster for integration and processing.
Common application scenarios for Pulsar Function include ETL-style data cleaning and real-time data aggregation. Because a function is a very general abstraction (it is simply a function applied to messages), it can also be used in microservice scenarios. Inside a function, users can call any API, for example for event routing, where Pulsar Function distributes data to different clusters.
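As a rough illustration of the event routing scenario, the sketch below picks a destination for each event and collects events per destination. It is plain Java rather than Pulsar code; the class `EventRoutingSketch`, the prefixes `order:` and `payment:`, and the topic names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EventRoutingSketch {
    // Chooses a destination topic from the event's prefix (hypothetical rule).
    public static String route(String event) {
        if (event.startsWith("order:")) {
            return "orders-topic";
        }
        if (event.startsWith("payment:")) {
            return "payments-topic";
        }
        return "other-topic";
    }

    // Distributes events into per-destination lists that stand in for topics.
    public static Map<String, List<String>> dispatch(List<String> events) {
        Map<String, List<String>> topics = new LinkedHashMap<>();
        for (String event : events) {
            topics.computeIfAbsent(route(event), name -> new ArrayList<>()).add(event);
        }
        return topics;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("order:1", "payment:7", "order:2", "ping");
        System.out.println(dispatch(events));
    }
}
```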
How to implement Pulsar Function
The picture above shows the API of Pulsar Function. Pulsar Function supports three languages for data processing: Java, Python, and Golang.
Pulsar Function supports three processing semantics:
- At-most-once: the sender does not care whether the message is delivered successfully and does not wait for a return value;
- At-least-once: a message is resent if no return value is received, which ensures that the message is not lost but may produce duplicates, so consumers need to handle messages idempotently;
- Exactly-once: messages are neither lost nor duplicated.
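To make the at-least-once case concrete, here is a small simulation of resending until an acknowledgement arrives, paired with an idempotent consumer that ignores duplicates. This is only a sketch under stated assumptions: `AtLeastOnceSketch`, `IdempotentConsumer`, and the `lostAcks` parameter are invented for illustration and are not part of Pulsar.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AtLeastOnceSketch {
    // Consumer that processes each message id once, even if it is delivered repeatedly.
    public static class IdempotentConsumer {
        private final Set<Long> seenIds = new HashSet<>();
        private final List<String> processed = new ArrayList<>();

        public void receive(long messageId, String payload) {
            if (seenIds.add(messageId)) { // add() returns false if the id was already seen
                processed.add(payload);
            }
        }

        public List<String> processed() {
            return processed;
        }
    }

    // Resends the message until an acknowledgement arrives and returns the attempt count.
    // lostAcks simulates how many acknowledgements are dropped before one gets through.
    public static int sendAtLeastOnce(long messageId, String payload,
                                      IdempotentConsumer consumer, int lostAcks) {
        int attempts = 0;
        boolean acknowledged = false;
        while (!acknowledged) {
            attempts++;
            consumer.receive(messageId, payload); // the message itself always arrives
            acknowledged = attempts > lostAcks;   // the ack only gets through afterwards
        }
        return attempts;
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        int attempts = sendAtLeastOnce(1L, "hello", consumer, 2);
        System.out.println(attempts + " deliveries, processed " + consumer.processed());
    }
}
```

With two lost acknowledgements the message is delivered three times but processed once, which is the behavior the idempotency requirement is meant to guarantee.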
Pulsar Function comes with simple built-in state management, which has three aspects:
- A Context object gives users access to state;
- State is stored in BookKeeper;
- Server-side operations (such as counters) are supported.
In addition to the input, the API introduced earlier also takes a Context parameter, and most state management goes through Context.
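import java.util.Arrays;
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

public class WordCountFunction implements Function<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // Split the input on "." and increment a per-word counter in the function state.
        Arrays.asList(input.split("\\.")).forEach(word -> context.incrCounter(word, 1));
        return null;
    }
}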
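To see what the counters look like after a few inputs without deploying anything, the word-count logic can be run against an in-memory stand-in for the state. This is a sketch only: `InMemoryCounters` is a hypothetical replacement for the counter state that Context exposes, and nothing here is stored in BookKeeper.

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountStateSketch {
    // Hypothetical in-memory stand-in for the counter state reached through Context.
    public static class InMemoryCounters {
        private final Map<String, Long> counters = new HashMap<>();

        public void incrCounter(String key, long amount) {
            counters.merge(key, amount, Long::sum);
        }

        public long getCounter(String key) {
            return counters.getOrDefault(key, 0L);
        }
    }

    // Same idea as the word-count function: split on "." and bump one counter per word.
    public static void process(String input, InMemoryCounters state) {
        for (String word : input.split("\\.")) {
            state.incrCounter(word, 1);
        }
    }

    public static void main(String[] args) {
        InMemoryCounters state = new InMemoryCounters();
        process("pulsar.function.pulsar", state);
        process("mesh.pulsar", state);
        System.out.println(state.getCounter("pulsar")); // prints 3
    }
}
```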
How to deploy Function
The Pulsar Function CLI supports a series of operations such as create, delete, update, get, restart, stop, and start.
$ ./pulsar-admin functions
Usage: pulsar-admin functions [options] [command] [command options]
  Commands:
    localrun      Run a Pulsar Function locally, rather than deploy to a Pulsar cluster
    create        Create a Pulsar Function in cluster mode (deploy it on a Pulsar cluster)
    delete        Delete a Pulsar Function that is running on a Pulsar cluster
    update        Update a Pulsar Function that has been deployed to a Pulsar cluster
    get           Fetch information about a Pulsar Function
    restart       Restart function instance
    stop          Stops function instance
    start         Starts a stopped function instance
    status        Check the current status of a Pulsar Function
    stats         Get the current stats of a Pulsar Function
    list          List all Pulsar Functions running under a specific tenant and namespace
    querystate    Fetch the current state associated with a Pulsar Function
    putstate      Put the state associated with a Pulsar Function
    trigger       Trigger the specified Pulsar Function with a supplied value
Features of Pulsar Function
Pulsar Function has the following features:
- Efficient development: simple API, does not require a lot of effort to learn, and supports multiple languages;
- Convenient operation and maintenance: fully integrated with Pulsar, no additional system/service settings are required;
- Easy to troubleshoot: functions can be run locally, and log topics are easy to use.
Function Mesh in detail
The previous part introduced Pulsar Function. This part introduces Function Mesh in depth.
What is Function Mesh
Function Mesh is a collection of functions. It lets multiple functions work together toward a data processing goal, with each function responsible for its own specific task and defined stage. We want to stress that Function Mesh is not intended to replace Flink or compete with it, but to supplement and support existing stream processing engines.
The following figure is the classic view of Function Mesh:
As shown in the figure above, before Function Mesh we used single, standalone Pulsar Functions. With Function Mesh, multiple functions are associated and connected by data, and together they produce the desired result; this maps onto various microservice scenarios.
Function Mesh implementation scheme
Function Mesh Design and Implementation Scheme 1: Based on Pulsar
At present, Pulsar provides command-line tools for managing a single function. In the example above, you would need to start function 1 through function 6 one by one with the Pulsar command-line tool, which makes management repetitive and complex. At the same time, Pulsar treats each of them as an independent function, so it is difficult to track the functions as a combination and to understand the upstream, downstream, and processing order of each function.
To address these problems, we proposed a targeted solution; for details, see PIP-66[[1]](#). The main idea of PIP-66 is to support Function Mesh natively in Pulsar: a Function Mesh is submitted through the Pulsar command line, and a YAML configuration file defines the parameters of each function in the mesh, how the functions are organized, their inputs and outputs, and so on.
bin/pulsar-admin function-mesh create -f mesh.yaml   # example command to create a Function Mesh
# Example YAML configuration
# Metadata
name: PIP_Mesh
namespace: PIP_Namespace
tenant: PIP_Tenant
# Function Mesh configs
jarFile: /local/jar/files/example.jar
# Functions
functionInfos:
  - name: Func1
    classname: org.apache.pulsar.functions.api.examples.ExclamationFunction
    replicas: 1
    inputs:
      - pulsar_topic_sourcce
    output:
      - pulsar_topic_1
  - name: Func2
    classname: org.apache.pulsar.functions.api.examples.ExclamationFunction
    replicas: 1
    inputs:
      - pulsar_topic_1
    output:
      - pulsar_topic_result
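The wiring in this configuration, where Func1 writes to the topic that Func2 reads, can be simulated in memory to show how a message travels through the mesh. The sketch below is an illustration only: `FunctionMeshSketch` and `Stage` are hypothetical names, topics are plain lists, stages are assumed to be listed from upstream to downstream, and the exclamation logic is assumed to append "!" to its input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class FunctionMeshSketch {
    // One function in the mesh, wired to an input topic and an output topic.
    public static class Stage {
        final String name;
        final String inputTopic;
        final String outputTopic;
        final Function<String, String> logic;

        public Stage(String name, String inputTopic, String outputTopic,
                     Function<String, String> logic) {
            this.name = name;
            this.inputTopic = inputTopic;
            this.outputTopic = outputTopic;
            this.logic = logic;
        }
    }

    // Runs the stages in the order given (assumed upstream to downstream) and
    // returns the contents of every in-memory topic afterwards.
    public static Map<String, List<String>> run(List<Stage> stages, String sourceTopic,
                                                List<String> messages) {
        Map<String, List<String>> topics = new HashMap<>();
        topics.put(sourceTopic, new ArrayList<>(messages));
        for (Stage stage : stages) {
            // Copy the input so a stage never iterates a list it is also writing to.
            List<String> input = new ArrayList<>(
                    topics.getOrDefault(stage.inputTopic, Collections.<String>emptyList()));
            List<String> output = topics.computeIfAbsent(stage.outputTopic, t -> new ArrayList<>());
            for (String message : input) {
                output.add(stage.logic.apply(message));
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        Function<String, String> exclamation = input -> input + "!";
        List<Stage> mesh = Arrays.asList(
                new Stage("Func1", "pulsar_topic_sourcce", "pulsar_topic_1", exclamation),
                new Stage("Func2", "pulsar_topic_1", "pulsar_topic_result", exclamation));
        Map<String, List<String>> topics =
                run(mesh, "pulsar_topic_sourcce", Arrays.asList("hello"));
        System.out.println(topics.get("pulsar_topic_result")); // prints [hello!!]
    }
}
```

Describing the stages as data, as the YAML does, is what lets a manager track the mesh as one unit instead of as unrelated functions.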
The following figure shows a Function Mesh scheduling scheme based on this idea. It reuses the existing Pulsar Function scheduling mechanism as much as possible to achieve the design goals of Function Mesh, and it introduces FunctionMeshManager to manage Function Mesh metadata.
Function Mesh Design and Implementation Scheme 2: Based on Kubernetes
As the overall project and related cloud projects progressed, we found it very meaningful and valuable to implement Function Mesh on Kubernetes. Users can create a Function Mesh directly with the Kubernetes command-line tool (see the demo command below). What we need to do is develop a CRD of type FunctionMesh, whose relationships are basically the same as those configured in Scheme 1:
$ kubectl apply -f function-mesh.yaml
…
apiVersion: cloud.streamnative.io/v1alpha1
kind: FunctionMesh
metadata:
  name: functionmesh-sample
spec:
  functions:
    - name: f1
      …
    - name: f2
      …
    - name: f3
      …
    - name: f4
      …
    - name: f5
      …
    - name: f6
      …
In this mode, Function Mesh no longer runs on Pulsar but on Kubernetes. We can define a series of resources, such as a single Function, a Mesh (a complete combination of functions), Source, and Sink. Source and Sink are concepts from Pulsar connectors: a source imports data from a third-party system into a Pulsar topic, and a sink does the opposite. This also makes scenarios such as data lakes easier to handle.
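The direction of data flow for Source and Sink can be sketched as follows. This is a simplified illustration and not the Pulsar connector API: the interfaces `SketchSource` and `SketchSink`, the method `runPipeline`, and the in-memory lists standing in for topics and external systems are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SourceSinkSketch {
    // Simplified stand-in for a source connector: external system -> topic.
    public interface SketchSource {
        List<String> readAll();
    }

    // Simplified stand-in for a sink connector: topic -> external system.
    public interface SketchSink {
        void write(String record);
    }

    public static void runPipeline(SketchSource source, Function<String, String> function,
                                   SketchSink sink) {
        List<String> inputTopic = new ArrayList<>(source.readAll()); // source fills the topic
        List<String> outputTopic = new ArrayList<>();
        for (String message : inputTopic) {
            outputTopic.add(function.apply(message));                // function processes it
        }
        for (String message : outputTopic) {
            sink.write(message);                                     // sink exports the result
        }
    }

    public static void main(String[] args) {
        List<String> externalStore = new ArrayList<>();
        runPipeline(() -> Arrays.asList("a", "b"), String::toUpperCase, externalStore::add);
        System.out.println(externalStore); // prints [A, B]
    }
}
```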
The following figure shows the Kubernetes-based Function Mesh scheduling scheme:
Scheme comparison: Pulsar vs Kubernetes
Comparing the Pulsar-based and Kubernetes-based Function Mesh implementations, we have a few thoughts:
- Being able to use Kubernetes scheduling is very beneficial. Scheduling different kinds of tasks is a core strength of Kubernetes, and it also provides high availability and fault tolerance.
- In a cloud environment, Function becomes a first-class citizen with the same status as the services that Pulsar provides.
- If we can decouple Function from Pulsar, it has the potential to connect to other messaging systems and process their data. As with the Lambda functions provided by AWS, this also makes it easier to design event-driven patterns.
Demo video
(Click the link to view the video in the original post.)
Future plans
So far we have done a lot of preliminary work on Function Mesh, and we have further plans for it:
- Provide more cloud-native support;
- Customize Function Runtime according to different languages;
- Function registry: make it convenient for users to package and manage functions;
- ...
We also plan to open source it in the near future. If you are interested, you can contact Pulsar Bot and reply "Mesh" at the bottom to try it out, and we welcome your feedback.