Pulsar Functions is a lightweight, function-style computing framework provided by Apache Pulsar. With Pulsar Functions, you can build complex per-message processing logic without deploying a separate system, simplify event flows, and bring in a serverless model that reduces the burden of operations and maintenance. This article is compiled from the talk "Function Mesh: Serverless Innovative Practice in Message and Streaming Data Scenarios", given by Zhai Jia, co-founder of StreamNative and Tencent Cloud TVP, at the Techo TVP Developer Summit ServerlessDays China 2021.
1. What is Apache Pulsar?
Function Mesh is StreamNative's latest open source project. Serverless and Kubernetes (K8s) are closely tied together, and Function Mesh was started with the same idea in mind. We built Pulsar Functions earlier to integrate Pulsar with serverless more closely. Function Mesh makes it easier to use cloud resources to manage Functions.
Today's talk covers four parts: an introduction to Pulsar, Pulsar Functions, Function Mesh, and the Pulsar community.
Why was Pulsar created, and what was it meant to do? Pulsar began as a messaging system inside Yahoo. What problem was it solving at the time? Those of you who work on infrastructure will recognize that, in messaging, architectural constraints naturally split the requirements into two directions depending on the scenario. One is peak shaving and load leveling, where an internal MQ lets services interact with each other. The other is big data, where the data engine needs a data transport pipeline. The usage patterns, data consistency requirements, and technical architectures of these two scenarios are completely different. Yahoo's main problem in 2012 was that each department maintained its own systems. There were three or four systems internally, so operations and maintenance had become a severe bottleneck across the organization, and the data silos between departments had become particularly serious.
Yahoo's original intention for Pulsar was therefore twofold. For users, the goal was a unified data platform with unified operations and management, lower operational pressure, and better resource utilization. For business departments, 2012 was also when stream computing was just emerging, and the hope was to connect more data so that real-time computing could reach more data sources, produce more accurate results, and extract more value from the data. Seen from these two angles, Pulsar's main contribution is to unify the two scenarios, using one platform to serve what previously took two sets of messaging applications.
Given this demand, why is Pulsar able to meet it? Two aspects matter:
First, the cloud-native architecture. Several points lie behind it. The serving layer (the compute layer) and the storage layer are completely separated. The serving layer stores no data, and all data is handed to the underlying storage layer. What is exposed to users is the concept of a logical partition. Unlike in other systems, this partition is not bound directly to the file system, that is, to a folder on a single node. Instead, the partition is split into a series of segments by a user-specified size or time, and segmenting ensures that one partition's data can be spread evenly across multiple storage nodes. With this strategy, every partition is stored in a distributed way, and scaling out or in does not require any data migration. That is the advantage of the cloud-native architecture.
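To make the segment idea concrete, here is a minimal sketch of spreading one partition's segments across storage nodes. It assumes a simple round-robin choice; the function name `segment_owners` and the node names are hypothetical, and this is not BookKeeper's actual placement policy.

```python
from collections import Counter

def segment_owners(segment_id, nodes, replicas=2):
    """Pick `replicas` distinct storage nodes for one segment, round-robin.

    Assumes replicas <= len(nodes), so the chosen nodes are distinct.
    """
    return [nodes[(segment_id + r) % len(nodes)] for r in range(replicas)]

# One logical partition, split into six segments over three storage nodes.
nodes = ["store-1", "store-2", "store-3"]
placement = {seg: segment_owners(seg, nodes) for seg in range(6)}

# Count how many segment copies each node holds: the load is even.
load = Counter(node for owners in placement.values() for node in owners)
print(load)

# Scaling out: a new node takes part in NEW segments only.
# The existing entries in `placement` are untouched, so no data has to move.
nodes_after_scale_out = nodes + ["store-4"]
next_segment = segment_owners(6, nodes_after_scale_out)
print(next_segment)
```

The point of the sketch is the last step: because placement is decided per segment rather than per partition, adding a node changes only where future segments land.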
Second, the peer-to-peer architecture. This is inseparable from Yahoo's need for large clusters and multi-tenancy: only when the state between nodes, and the maintenance of that state, is simple enough can a relatively large cluster be maintained. Inside Twitter, for example, the underlying storage layer runs in two data centers with 1,500 nodes each. The upper-layer brokers are easy to understand as peers: they store no data, so the design is leaderless and there is no master-slave distinction. When multiple replicas land on the underlying storage nodes, those storage nodes are peers as well. To write one piece of data, multiple nodes are written concurrently. Integrity within a single node is maintained through CRC, and because the copies are written to multiple nodes concurrently, the storage nodes also form a peer-to-peer architecture. Consistency is maintained through this mechanism and its own CAP design. Together with the separation of storage and compute described above, node peering brings a better experience in scaling, operations, and fault tolerance.
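The write path described above can be sketched roughly as follows. This is an illustration under stated assumptions only: each storage node is modeled as a plain Python list, the helper names `write_entry` and `read_entry` are hypothetical, and the loop is sequential where a real system would write to the nodes in parallel.

```python
import zlib

def write_entry(payload, stores):
    """Append the same entry, with its CRC32 checksum, to every peer store."""
    record = (payload, zlib.crc32(payload))
    for store in stores:  # a real system would issue these writes concurrently
        store.append(record)

def read_entry(store, index):
    """Read one entry from one store and verify its checksum."""
    payload, checksum = store[index]
    if zlib.crc32(payload) != checksum:
        raise ValueError("entry is corrupt on this node")
    return payload

# Three peer storage nodes; none of them is a leader.
stores = [[], [], []]
write_entry(b"order-created", stores)

# Any node can serve the read, because every node holds the same entry.
copies = [read_entry(store, 0) for store in stores]
print(copies)
```

Because every node holds an identical, self-checking copy, a reader that hits a corrupt entry on one node can simply fall back to another.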
Another feature of Pulsar is Apache BookKeeper, a storage engine built specifically for message streams. BookKeeper is an older system, born around 2008 to 2009 and open-sourced by Yahoo. It was created to solve HA for the HDFS NameNode layer by saving every change to the NameNode, so at the beginning it stored metadata about metadata. It therefore has particularly high requirements for consistency, low latency, throughput, and reliability. Its model, however, is very simple: it is an abstraction of a write-ahead log. This matches messaging very well, because the main pattern for messages is also append-only writes at the tail. As time passes, the value of old data may decline, and the old data is then deleted as a whole.
With BookKeeper providing stable service quality and strong consistency, Pulsar can support the MQ scenario mentioned earlier. At the same time, thanks to the simple log abstraction and append-only writes, data bandwidth improves, which supports the streaming mode. Both the MQ scenario and the Kafka-style scenario are well supported and guaranteed by the underlying storage layer.
With this foundation, it becomes especially easy to build on Pulsar's lower layers the enterprise-grade features that users find particularly useful. Pulsar was born from the need for large clusters and multi-tenancy. At this level, a topic is no longer a flat, single-level concept. It is managed like folders in a file system, with first-level and second-level directories. The first-level directory is the tenant, which mainly gives users more isolation policies. Each tenant can be assigned different permissions, and each tenant's administrator manages permissions with respect to other tenants and the tenant's internal users, for example whether the first tenant may access the second tenant's data, and similar authentication settings.
One level down, the namespace layer carries various policies, which makes many enterprise-grade controls convenient, such as flow control. The bottom layer is what we call the topic. Through this hierarchy and the support for large clusters, it is easier to connect data across the organizations and departments of a user.
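The hierarchy shows up directly in Pulsar topic names, which take the form `persistent://tenant/namespace/topic`. The parser below is a small sketch of that structure; the `TopicName` type, the `parse_topic` helper, and the example tenant and namespace are hypothetical and not part of any Pulsar client library.

```python
from typing import NamedTuple

class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str      # first-level directory: isolation and permissions
    namespace: str   # second-level directory: policies such as flow control
    topic: str       # the topic itself

def parse_topic(full_name: str) -> TopicName:
    """Split a fully qualified topic name into its hierarchy levels."""
    domain, separator, rest = full_name.partition("://")
    if not separator:
        raise ValueError("expected a name like domain://tenant/namespace/topic")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected tenant/namespace/topic after the domain")
    return TopicName(domain, *parts)

name = parse_topic("persistent://billing/payments/orders")
print(name.tenant, name.namespace, name.topic)
```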
In addition, because of its good data consistency, many users adopt Pulsar for cross-cluster replication. Pulsar's cross-region replication is built into the broker. A user who needs it only has to run a simple Pulsar command to set up replication across clusters. A built-in producer synchronizes data that has just been persisted locally directly to the remote data center, so replication is very timely. For users, it is simple to configure, efficient to use, and low in latency, while still giving a good guarantee of data consistency. It is therefore widely applied in many scenarios, including at Tencent and Suning, and many users choose Pulsar for this one scenario alone.
On these foundations, Pulsar's community growth has also been significant. The community now has 403 contributors, and the number of GitHub stars is close to 9,000. Many thanks to the many colleagues at Tencent Cloud, who have done a great deal of useful work for Pulsar and validated it in a rich set of scenarios.
2. Pulsar Functions
Pulsar started in the messaging field and connects with the wider ecosystem through the cloud. Today's discussion focuses on Functions, which sit in the compute layer, so I will expand on the compute layer in detail. Big data computing commonly falls into roughly three types. The first is interactive query, where Presto is a commonly used engine. Next come batch processing and stream processing, where users commonly use Spark, Flink, and similar engines. For these two types, Pulsar provides the corresponding connectors, so that the engines understand Pulsar's schema and can read and use a Pulsar topic directly as a table. The third type, Functions, is what I want to focus on today. It is lightweight computing, a different concept from the complex computing scenarios above. The diagram makes this fairly intuitive: Functions abstract the simple computations that users frequently perform in messaging scenarios. The built-in consumer on the left of a Function subscribes to the incoming messages, the function submitted by the user in the middle does the computation, and the producer on the right writes the results of the user's function back to the destination topic. In this model, the consumers and producers that users would otherwise create and manage themselves, and the scheduling of the number of replicas, are provided as unified infrastructure.
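The consumer, user function, and producer arrangement can be pictured with a few lines of Python. This is only a toy model under stated assumptions: topics are in-memory queues, and `run_function` is a hypothetical stand-in for the runtime that Pulsar Functions provides, not its implementation.

```python
from collections import deque

def run_function(user_function, input_topic, output_topic):
    """Drain the input topic, apply the user function, publish the results."""
    while input_topic:
        message = input_topic.popleft()      # role of the built-in consumer
        result = user_function(message)      # the only part the user writes
        if result is not None:
            output_topic.append(result)      # role of the built-in producer

# In-memory stand-ins for the source and destination topics.
input_topic = deque(["pulsar", "functions"])
output_topic = deque()

run_function(str.upper, input_topic, output_topic)
print(list(output_topic))
```

The user supplies only `user_function`; everything else in the loop is the infrastructure the paragraph above describes.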
Some attendees asked what a topic is. A topic is a concept from the messaging field: an abstraction of a pipeline and a carrier. All data is buffered through topics. Producers generate messages and deliver them to a topic, and consumers consume from the topic in the order of production. It is a buffering carrier and a pipeline.
Pulsar Functions is not meant to be a complex computing engine. The main idea is to integrate the serverless concept with the messaging system, so that Pulsar itself can handle a large share of lightweight processing close to the messages and data. Common, very simple scenarios such as ETL and aggregation account for roughly 60% to 70% of all data processing, and in IoT scenarios as much as 80% to 90%. Such simple scenarios can be handled with simple Functions, without building complex clusters. Simple computations are processed on the messaging side, which saves transmission and computing resources.
Let me give a simple demonstration of what Functions feel like for users. What this function does is very simple: for a topic, it appends an exclamation mark to the data, such as "hello", that the user sends in. For a user function like this, whatever the language, Functions provides a corresponding runtime. Users do not need to learn new logic or new APIs and can write in a language they are already familiar with. Once written, the function is handed to Pulsar Functions, which subscribes to all incoming data and applies the function to it.
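The exclamation function from the demonstration could look like this in Python. It is a sketch that assumes the plain-function style, where the runtime calls a function named `process` with each incoming message; nothing here depends on a Pulsar library, which is the point the talk makes about not learning a new API.

```python
def process(input):
    """Append an exclamation mark to each incoming message."""
    return "{}!".format(input)

# Locally, the function can be exercised like any other Python function.
print(process("hello"))
```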
Functions and serverless share the same philosophy: Functions integrate closely with messaging and use a serverless approach to process messages and run computations. The difference from general serverless is that Functions are more focused on data processing and support several processing semantics. Pulsar Functions flexibly supports three of them.
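The talk does not spell the semantics out here, so the following sketch only illustrates the general idea using two commonly cited guarantees, at-most-once and at-least-once. It assumes the difference comes down to whether a message is acknowledged before or after it is processed; `run_until_crash` and `crash_on_b` are hypothetical helpers, not Pulsar APIs.

```python
def run_until_crash(messages, handler, ack_before_processing):
    """Process messages until the handler crashes.

    Returns the messages that were never acknowledged, which is what a
    broker would redeliver once the function restarts.
    """
    acked = set()
    for index, message in enumerate(messages):
        if ack_before_processing:
            acked.add(index)          # at-most-once: ack first, may lose work
        try:
            handler(message)
        except RuntimeError:
            break                     # simulated crash of the function
        acked.add(index)              # at-least-once: ack only after success
    return [m for i, m in enumerate(messages) if i not in acked]

def crash_on_b(message):
    if message == "b":
        raise RuntimeError("simulated crash")

at_most_once = run_until_crash(["a", "b", "c"], crash_on_b, True)
at_least_once = run_until_crash(["a", "b", "c"], crash_on_b, False)
print(at_most_once, at_least_once)
```

With acknowledgment first, the failed message is not redelivered and is lost; with acknowledgment afterwards, it is redelivered and may be processed more than once.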
State storage is also built into Pulsar Functions, and intermediate results are saved in BookKeeper itself. Take word counting: the user passes in a sentence, the sentence is split into words, and the number of occurrences of each word is counted, with the counts recorded in Pulsar itself. In this way a simple Function can count occurrences in a topic and update the counts in real time. Pulsar also provides a REST-based admin interface that makes it easier to use, schedule, and manage Pulsar Functions. Because a REST API sits behind it, you can call the interface directly from your own programs and integrate it with your applications.
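The word-count example can be sketched as below. The `InMemoryContext` class is a hypothetical stand-in that keeps counters in a dictionary so the example runs on its own; in a real deployment the counters would be held by the Functions state store rather than in process memory.

```python
class InMemoryContext:
    """Hypothetical stand-in for a function context with counter state."""

    def __init__(self):
        self._counters = {}

    def incr_counter(self, key, amount):
        self._counters[key] = self._counters.get(key, 0) + amount

    def get_counter(self, key):
        return self._counters.get(key, 0)

def count_words(sentence, context):
    """Split a sentence into words and bump one counter per word."""
    for word in sentence.split():
        context.incr_counter(word, 1)

context = InMemoryContext()
count_words("hello pulsar hello functions", context)
count_words("hello again", context)
print(context.get_counter("hello"))
```

Because the counts live in the context rather than in local variables, they survive across messages, which is what lets the totals update in real time.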
In short, Pulsar Functions aims to give a better experience to everyone across your applications and ecosystem. For developers, it supports a variety of languages, and we have recently been working on web support as well. Different runtime models are supported: the simplest is to put the function on the broker, and it can also run in process mode. Deployment is flexible too. If resources are limited, deploy on the broker and let Functions run alongside it. If you need better isolation, move them out into a separate cluster and run your Functions there. Before Function Mesh, we provided only very simple Kubernetes support.
The benefit is that Functions are easy for users: users may be big data experts who know various languages, and they can write this logic in the language they know best. Operation is simple as well. Big data processing already needs Pulsar in many places, so if you are familiar with Pulsar, Functions integrate with it naturally, running together with the broker without requiring another server. For development and deployment, a local run mode is also provided, so users can debug Functions easily. For each user along the whole computation path, Pulsar Functions provides a good experience and rich tool support.
3. Function Mesh
However, although Kubernetes was supported, the support was not native. How did users run Functions before? Functions can be deployed with brokers: each broker has a Functions worker, which exposes all the management and operations interfaces for Functions. Users submit a Function to the Functions worker, which saves the Function's metadata to a topic inside Pulsar. At scheduling time, Kubernetes is told to fetch the metadata from the topic, read how many replicas are required, and start the corresponding number of Function instances.
This process has some unfriendly aspects. First, the metadata is stored in a Pulsar topic, which causes a problem many users have reported: the Functions worker reads the topic to obtain the metadata, and if the broker serving that topic is not yet up, the worker goes into a crash loop until that broker comes up. Second, the metadata is managed in two places. One copy is submitted to the Functions worker and stored in Pulsar itself, and at the same time Kubernetes is called and handed another copy. This makes metadata management troublesome. As a very simple example, once a Function has been handed to Kubernetes, there is no coordination mechanism between the two sides. Third, scaling, dynamic management, and elasticity are among Kubernetes' great strengths. Doing the same work again in Functions may duplicate what Kubernetes already does.
The second problem has also been raised by many Pulsar Functions users. A Pulsar Function runs inside one cluster, but many scenarios are not confined to a single cluster and need to span several, and then the interaction becomes more complicated. Take federated learning: usually the user hands data to Functions for training and the model is written back to the cluster, but in many scenarios a federated learning Function trains on user A's data and writes the training results to user B. That requires cross-cluster operation. Previously everything was bound to a single cluster, and it was difficult to share Functions across clusters.
Another problem was the most direct and main reason we built Function Mesh. We found that users rarely solve a problem with a single function. They often need to chain several functions together and want to operate and control the chain as a whole. With the previous model, this means writing many separate commands. Managing each command is complicated, the relationship between each command's subscribed topics and output topics is hard to keep under control, and there is no intuitive way to describe it, so management and operations become especially troublesome.
The main purpose of Function Mesh is not to build a more complex, full-scale, universal framework for all computation, but to provide better management and give users a more convenient tool. For example, the multiple Functions just mentioned need to be chained together to serve users as a whole. Starting from this simple requirement, we made a proposal around August and September 2020. The idea is very simple: have one unified place that describes the relationship between inputs and outputs so that it can be seen at a glance. The output of the first Function is the input of the second, and the logic between them can be described clearly in a YAML file. Users can see the relationship between the two Functions at a glance.
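The "one place that describes inputs and outputs" idea can be illustrated with a small data structure. The sketch below is an assumption-laden illustration, not the Function Mesh YAML schema: the field names, the function and topic names, and the `describe_chain` helper are all hypothetical.

```python
# A hypothetical, simplified description of two chained functions.
# A real description would live in a YAML file; a list of dicts is used here
# so the example stays self-contained.
mesh = [
    {"name": "split-sentences", "input": "sentences", "output": "words"},
    {"name": "count-words", "input": "words", "output": "counts"},
]

def describe_chain(functions):
    """Return (upstream, topic, downstream) links between functions."""
    producer_of = {f["output"]: f["name"] for f in functions}
    links = []
    for f in functions:
        upstream = producer_of.get(f["input"])
        if upstream is not None:
            links.append((upstream, f["input"], f["name"]))
    return links

links = describe_chain(mesh)
for upstream, topic, downstream in links:
    print(upstream, "->", topic, "->", downstream)
```

With every function declared in one description, the wiring between them can be derived and checked mechanically instead of being spread across separate commands.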
If this logic is integrated more closely with K8s, it can draw on Kubernetes' native scheduling and elasticity policies to give users a better management and usage experience. For Pulsar Functions, Function Mesh takes the Kubernetes CRD as its core, with one CRD for each type of function. These include the common Function, which outputs the data from its subscribed topics to a specified place, as well as Source and Sink, which are special cases of a Function, for example reading data from a specified source such as a database.
The CRD describes what parallelism the Function should have, how it should run, and how the upstream and downstream topics are chained. Besides the CRDs, there is a Function Mesh controller responsible for the actual scheduling and execution. From the user's side, starting at the left of the diagram, the user submits to K8s a description of the relationships between Functions, together with the minimum and maximum parallelism and the resources required, all expressed in a YAML file. After the YAML file is handed to K8s, the API server schedules the internal resources and also watches for changes. If the CRD description changes, the pods are changed accordingly, that is, scaled out or in. The division between the pods and the Pulsar cluster also becomes clearer: the Pulsar cluster no longer stores any Function information. It acts as the source or the destination of data, purely as a data pipeline, and is not involved in managing any metadata. The defining aim is to combine with K8s to give users a better experience. With the help of K8s, CPU-based elastic scaling can be achieved well.
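The two behaviors described above, reconciling pods against a declared description and scaling on CPU between a minimum and maximum, can be sketched as follows. The scaling rule follows the proportional formula used by the Kubernetes Horizontal Pod Autoscaler; the `reconcile` function, the spec fields, and the pod names are hypothetical simplifications rather than the Function Mesh controller's code.

```python
import math

def desired_replicas(current, cpu_percent, target_percent, min_replicas, max_replicas):
    """Scale proportionally to CPU usage, clamped to the declared bounds."""
    wanted = math.ceil(current * cpu_percent / target_percent)
    return max(min_replicas, min(max_replicas, wanted))

def reconcile(spec, running_pods):
    """Compare the declared spec with the running pods and list the actions."""
    wanted = spec["replicas"]
    have = len(running_pods)
    if wanted > have:
        return [("create", "{}-{}".format(spec["name"], i)) for i in range(have, wanted)]
    if wanted < have:
        return [("delete", pod) for pod in running_pods[wanted:]]
    return []  # already in the declared state, nothing to do

# Two pods at 90% CPU against a 60% target: scale out to three.
replicas = desired_replicas(2, 90, 60, min_replicas=1, max_replicas=5)
actions = reconcile({"name": "count-words", "replicas": replicas},
                    ["count-words-0", "count-words-1"])
print(replicas, actions)
```

The controller never needs to remember what it did before: each pass compares the declared state with the observed state and acts on the difference.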
K8s' elastic scheduling brings a better experience to operating Functions. Once a CRD changes, pods are added, deleted, or modified according to the CRD description. In this model Functions run on K8s and are fully decoupled from any single Pulsar cluster, so Functions can be shared and connected across multiple clusters.
We are currently working on a Function package management tool to ease operations for users; it should be available in version 2.8. Our original intention with Function Mesh was mainly to make Pulsar Functions more convenient to use. Since the previous interface was accessed through REST, we have also kept compatibility with it. On top of the current K8s-based implementation and API, the Function admin interface has been connected, so users can still control Functions through the previous interface. Existing users who are not used to submitting CRDs and applying changes directly can keep the same experience as before.
4. The Pulsar community
Next is the state of the community. Tencent is a very important contributor to the Pulsar community. The first business-critical scenario is Tencent's billing platform: all of its business goes through Pulsar, including WeChat red envelopes and the billing for many Tencent games. Tencent evaluated other systems internally at the time and finally settled on Pulsar because of its good consistency, its handling of data backlog, and its operability, especially the cloud-native architecture, which reduces the pain points of operating large clusters.
Another typical scenario is big data, where Kafka is commonly required. With Kafka, the coupling of storage and compute is a very common problem for users running large clusters. A previous article described some cases and drew some conclusions: coupling storage and compute makes operations inconvenient, and scaling out or in degrades cluster performance. One of the headaches in Kafka is rebalancing. Whenever the cluster needs to scale, a rebalance is triggered to move topics from one node to another so that data is balanced again. Moving the data can affect online business, because it occupies bandwidth between clusters or network bandwidth, and responses to external business may be delayed. Other issues listed include data loss and the performance and stability of MirrorMaker. The main problem, in fact, is the scaling issue just mentioned: Bigo found scaling extremely labor-intensive. For these reasons, they switched from Kafka clusters to Pulsar clusters. A very important feature contributed here is KoP (Kafka on Pulsar), which parses the Kafka protocol on the server side, so that users can migrate at zero cost.
This slide introduces some of the users of Functions. Many of their scenarios are lightweight, especially in IoT. EMQ, for example, was a very early user of Pulsar Functions. Tuya Smart, Toyota Smart, and others are also IoT scenarios, and they use many Functions in their applications.
The growth of the community is worth noting. It has accelerated since 2019, which follows a very common pattern in open source: behind every open source community there tends to be a commercial company. Our company was founded in 2019, and the goal of a commercial company differs from that of Yahoo, which originally open-sourced Pulsar. Yahoo's goal was to have more users use Pulsar and help polish it, but it had no strong motivation to maintain the community or to invest in more features that would attract community users. That is exactly the goal of our commercial company, which relies on the community for its own commercialization and growth. So since the company was founded, we have done a lot of communication and coordination with developers, helping them use Pulsar more conveniently and providing more features to meet users' needs.
Finally, here is the community information. If you want to learn more, you can find more Pulsar resources through these channels. They include a rich collection of videos on Bilibili and the Apache mailing lists. There are more than 4,000 users on Slack, roughly half in China and half in the United States. On the right are the two WeChat official accounts we maintain, one for the Pulsar community and one for our company. If you are interested in Pulsar or in job opportunities in the community, you are welcome to scan the QR code for more information. That is the main content I wanted to share today. Thank you all for your time.
Zhai Jia is co-founder of StreamNative and a Tencent Cloud TVP. Before that, he worked at EMC as technical lead of the real-time processing platform at EMC Beijing. He works mainly on real-time computing and distributed storage systems and continues to contribute code to open source projects such as Apache BookKeeper and Apache Pulsar. He is a PMC member and committer of both Apache Pulsar and Apache BookKeeper.