About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. It adopts a storage-compute separated architecture and supports multi-tenancy, persistent storage, and cross-region (multi-datacenter) replication, offering streaming storage features such as strong consistency, high throughput, low latency, and high scalability.
GitHub address: http://github.com/apache/pulsar/

Before managing schemas, make sure Pulsar itself is already working normally for producing and consuming messages. First, let's clarify: what is a schema?

A schema describes how data is organized and structured, just as it does in a database. If we compare Pulsar to a relational database, then a topic stores raw bytes, like the database's files on disk, while the schema gives those bytes a concrete structure, playing the same role as a table definition: it is the metadata of the data. So why do we need schema management in a message queue? Let's walk through how Pulsar Schema is used with that question in mind.

Problem background

The overall availability of our message queue system has become fairly stable, but in day-to-day use the safety of the data flowing between upstream and downstream services is not effectively guaranteed. For example:

type TestCodeGenMsg struct {
-    Orderid     int64     `json:"orderid"`
+    Orderid     string    `json:"orderid"`
     Uid         int64     `json:"uid"`
     Flowid      string    `json:"flowid"`
}

This "incompatible" format will break most downstream services because they expect a numeric type but now get a string. It is impossible for us to know in advance how much damage will be caused. In the example, it is easy for people to blame "miscommunication" or "lack of proper processes."

In development, APIs are treated as first-class citizens in a microservice architecture: an API is a contract, it is strongly binding, and any protocol change can be detected early. Event consumption through a message queue, by contrast, often cannot be tested and reacted to as quickly. When a schema changes on a large scale, especially when the data ends up being written to a database, the damage can be just as bad as a broken API. Here I recommend an article Gwen Shapira wrote that introduces data contracts and schema management. We want to manage schema changes with simple compatibility strategies, let data evolve safely, and decouple teams so they can develop independently and quickly. That is why we need schema management.

Desired goal

Manage the schema according to a compatibility strategy so that the data can evolve safely. For example, a change like the following should pass validation:

type TestCodeGenMsg struct {
    Orderid     int64     `json:"orderid"`
    Uid         int64     `json:"uid"`
    Flowid      string    `json:"flowid"`
+   Username    string    `json:"username"`
}

while a change like the following should be rejected:

// fails validation
type TestCodeGenMsg struct {
-    Orderid     int64     `json:"orderid"`
+    Orderid     string    `json:"orderid"`
     Uid         int64     `json:"uid"`
     Flowid      string    `json:"flowid"`
}
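
Jumping ahead a little to the Avro definitions used later in this article, here is a minimal sketch of what these two changes look like as schema definition strings in Go (the constant names and the default value are illustrative, not taken from the original):

// v1: the original definition (int64 maps to Avro "long")
const TestCodeGenMsgV1 = `{"type":"record","name":"TestCodeGenMsg","namespace":"test","fields":[{"name":"orderid","type":"long"},{"name":"uid","type":"long"},{"name":"flowid","type":"string"}]}`

// v2: adds "username" with a default value -- an additive, compatible change
const TestCodeGenMsgV2 = `{"type":"record","name":"TestCodeGenMsg","namespace":"test","fields":[{"name":"orderid","type":"long"},{"name":"uid","type":"long"},{"name":"flowid","type":"string"},{"name":"username","type":"string","default":""}]}`

// changing "orderid" from "long" to "string", by contrast, is the kind of
// change the compatibility check is meant to reject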

How we use it

The main difference between the messaging model and an API is how long events and their schemas are stored. Once every application calling an API has been upgraded from v1 to v2, you can safely assume the services that used v1 are gone; that may take a while, but it is usually measured in weeks rather than years. This is not the case for events: a message queue can keep old versions of messages forever. So we have to consider: who do we upgrade first, consumers or producers? Can new consumers handle old events still stored in Pulsar? Do we need to wait before upgrading consumers? Can old consumers handle events written by new producers?

Pulsar Schema defines compatibility rules that govern which changes we can make to a schema without breaking consumers, and how upgrades should be handled for different kinds of schema changes. How do we do this? On the broker we need to confirm whether the current namespace allows automatic schema evolution, and which schema compatibility strategy it uses. The available strategies are described in the Pulsar documentation, or refer to the following table:

(image: table of Pulsar schema compatibility strategies)

We can operate via the CLI:

// check whether the current namespace allows automatic schema evolution
./pulsar-admin namespaces get-is-allow-auto-update-schema tenant/namespace

// enable it if it is not allowed
./pulsar-admin namespaces set-is-allow-auto-update-schema --enable tenant/namespace

// check the schema compatibility strategy of the current namespace
./pulsar-admin namespaces get-schema-compatibility-strategy tenant/namespace

// with this many strategies, there is always one that fits
./pulsar-admin namespaces set-schema-compatibility-strategy -c FORWARD_TRANSITIVE tenant/namespace
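
As a small aside (not part of the original workflow above): once a producer with a schema has connected, the schema registered for a topic can be inspected with the schemas sub-command; the topic name here is illustrative.

// inspect the schema currently registered on a topic
./pulsar-admin schemas get persistent://tenant/namespace/topic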

Producer

Now let's wire up the producer. First look at the following example:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/apache/pulsar-client-go/pulsar"
)

type TestSchema struct {
    Age  int    `json:"age"`
    Name string `json:"name"`
    Addr string `json:"addr"`
}

const AvroSchemaDef = `{"type":"record","name":"test","namespace":"CodeGenTest","fields":[{"name":"age","type":"int"},{"name":"name","type":"string"},{"name":"addr","type":"string"}]}`

var client pulsar.Client

func main() {
    // create the client
    cp := pulsar.ClientOptions{
        URL:              "pulsar://xxx.xxx.xxx.xxx:6650",
        OperationTimeout: 30 * time.Second,
    }

    var err error
    client, err = pulsar.NewClient(cp)
    if err != nil {
        fmt.Println("NewClient error:", err.Error())
        return
    }
    defer client.Close()

    if err := Produce(); err != nil {
        fmt.Println("Produce error:", err.Error())
        return
    }

    if err := Consume(context.Background()); err != nil {
        fmt.Println("Consume error:", err.Error())
        return
    }
}

func Produce() error {
    // create the schema
    properties := make(map[string]string)
    pas := pulsar.NewAvroSchema(AvroSchemaDef, properties)
    po := pulsar.ProducerOptions{
        Topic:       "persistent://test/schema/topic",
        Name:        "test_group",
        SendTimeout: 30 * time.Second,
        Schema:      pas,
    }

    // create the producer
    producer, err := client.CreateProducer(po)
    if err != nil {
        fmt.Println("CreateProducer error:", err.Error())
        return err
    }
    defer producer.Close()

    // write a message
    t := TestSchema{
        Age:  10,
        Name: "test",
        Addr: "test_addr",
    }

    id, err := producer.Send(context.Background(), &pulsar.ProducerMessage{
        Key:       fmt.Sprintf("%d", t.Age), // the message key must be a string
        Value:     t,
        EventTime: time.Now(),
    })
    if err != nil {
        fmt.Println("Send error:", err.Error())
        return err
    }

    fmt.Println("msgId:", id)
    return nil
}

The demo above completes a producer that carries a schema. Looking through the ProducerOptions struct we find a Schema field, so we know we need to pass a schema object into it. We then construct the schema object and pass it in:

properties := make(map[string]string)
jas := pulsar.NewAvroSchema(jsonAvroSchemaDef, properties)

Besides the Avro schema, the client provides several other schema types, such as JSON, protobuf, and so on; choose one according to your needs (a minimal sketch follows this paragraph). If you are interested in reading more about this, Martin Kleppmann wrote a good blog post comparing schema evolution across different data formats.
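
As a minimal sketch (a fragment, reusing the client and the AvroSchemaDef constant from the producer example above, with the same illustrative topic), swapping in a JSON schema only changes the constructor:

// a JSON schema built from the same Avro-style definition string;
// pulsar.NewProtoSchema and pulsar.NewStringSchema exist as well
jsonSchema := pulsar.NewJSONSchema(AvroSchemaDef, nil)

producer, err := client.CreateProducer(pulsar.ProducerOptions{
    Topic:  "persistent://test/schema/topic",
    Schema: jsonSchema,
})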

Now let's take a look at what constrains the data structure. One of the constants is as follows:

  const jsonAvroSchemaDef = `{"type":"record","name":"test","namespace":"CodeGenTest","fields":[{"name":"age","type":"int"},{"name":"name","type":"string"},{"name":"addr","type":"string"}]}`

Expanded, it looks like this:

{
    "type":"record",
    "name":"test",
    "namespace":"CodeGenTest",
    "fields":[
        {
            "name":"age",
            "type":"int"
        },
        {
            "name":"name",
            "type":["null","string"]   // an optional field
        },
        {
            "name":"addr",
            "type":"string",
            "default":"beijing"        // a field with a default value
        }
    ]
}

This is an Avro schema (all the validated types are declared this way). The fields list gives the required field names and types, and the schema's name and namespace must be set for the compatibility strategy to work. For an introduction to the Avro grammar, please refer to the Avro documentation and the table of types below:

(image: table of Avro field types)

Consumer

First look at the code:

func Consume(ctx context.Context) error {
    // create the schema (it must be compatible with the topic's schema)
    properties := make(map[string]string)
    cas := pulsar.NewAvroSchema(AvroSchemaDef, properties)

    consumer, err := client.Subscribe(pulsar.ConsumerOptions{
        Topic:            "persistent://test/schema/topic",
        SubscriptionName: "test",
        Type:             pulsar.Failover,
        Schema:           cas,
    })
    if err != nil {
        return err
    }
    defer consumer.Close()

    for {
        msg, err := consumer.Receive(ctx)
        if err != nil {
            return err
        }

        // deserialize the payload into our struct via the schema
        t := TestSchema{}
        if err := msg.GetSchemaValue(&t); err != nil {
            continue
        }

        consumer.Ack(msg)
        fmt.Println("msgId:", msg.ID(), " Payload:", string(msg.Payload()), " t:", t)
    }
}

As we can see, when we use a schema we ultimately need the GetSchemaValue() method to deserialize the message; that is what really guarantees safety, and this is how the whole produce/consume flow works. This brings us to another concept, schema evolution. The workflow of Pulsar's schema mechanism is shown in the figure:

(image: Pulsar schema workflow)

For Kafka, Confluent developed a Schema Registry, a server independent of the brokers. Its workflow is:

  • When we send data to Kafka, we first register the schema with the Schema Registry, then serialize the data and send it to Kafka;
  • The Schema Registry assigns each registered schema a globally unique ID; the IDs are guaranteed to be monotonically increasing, but not necessarily contiguous;
  • When we consume data from Kafka, the consumer first checks whether the schema is already in local memory before deserializing; only if it is not cached locally does it fetch the schema from the Schema Registry.

Pulsar is different:

  • Pulsar has schema evolution management built in, and stores the schema information in bookies;
  • The schema information is not carried in Pulsar's message protocol;
  • Consumers need to pass in the schema themselves.

So while the principle is similar to Kafka's, Pulsar does not separate the schema server from the broker: schema information is stored in bookies, which solves the schema server's high-availability problem, and the compatibility check for schema evolution is performed on the broker side. (I am not talking about serialization and deserialization here.)


What does the client do? From the discussion above, we know that the final guarantee of schema safety is the type check performed when encoding and decoding. Looking at the source code, the schema passed in is checked while producers and consumers are created, using a separate message structure:

(image: client source code for the schema check)

The method used on the consumer side is the Decode() method we just mentioned.

(image: consumer-side Decode() source code)

The corresponding type only needs to implement the schema interface:

type Schema interface {
    Encode(v interface{}) ([]byte, error)
    Decode(data []byte, v interface{}) error
    Validate(message []byte) error
    GetSchemaInfo() *SchemaInfo
}

For concrete implementations, refer to the relevant files in the Pulsar Go client, which contain implementations for multiple serialization formats.
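
As an illustration only (not the client's built-in code), here is a minimal sketch of a custom schema that satisfies this interface by encoding and decoding values as JSON; it assumes the pulsar import from the earlier examples plus the standard library encoding/json, and the name "my-json" is made up:

type MyJSONSchema struct {
    info pulsar.SchemaInfo
}

func NewMyJSONSchema(def string) *MyJSONSchema {
    return &MyJSONSchema{info: pulsar.SchemaInfo{
        Name:   "my-json",
        Type:   pulsar.JSON,
        Schema: def,
    }}
}

// Encode serializes the value that will become the message payload
func (s *MyJSONSchema) Encode(v interface{}) ([]byte, error) { return json.Marshal(v) }

// Decode fills v from a received payload
func (s *MyJSONSchema) Decode(data []byte, v interface{}) error { return json.Unmarshal(data, v) }

// Validate checks that a payload can be decoded at all
func (s *MyJSONSchema) Validate(message []byte) error {
    var v interface{}
    return json.Unmarshal(message, &v)
}

// GetSchemaInfo reports the schema definition registered with the broker
func (s *MyJSONSchema) GetSchemaInfo() *pulsar.SchemaInfo { return &s.info }

A schema like this can then be passed to ProducerOptions.Schema or ConsumerOptions.Schema just like the built-in ones.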

Additional notes

As the metadata of a Pulsar topic, the schema can also be used by Pulsar SQL. The Pulsar SQL storage layer implements the Presto connector interface, and the schema is exposed at the SQL layer as the metadata of the Presto payload, which makes it much easier to inspect messages, run data analysis, and so on. This is one more reason why we need schema management. Thanks for reading.

About the Author

My name is Hou Shengxin, also known as Dayun. I currently work on infrastructure at Banyu, responsible for maintaining and developing the message queue. I am a member of the Rust Daily team and like to study storage and service governance. When I first came into contact with Pulsar, I was attracted by its storage-compute separation architecture; the smooth producer and consumer on-boarding and the high throughput made me curious about how the project is implemented, and I hope to contribute to Pulsar's features in the future.
