Efficient encoding in Protocol Buffers

Introduction

How does Protocol Buffers (protobuf), an efficient serialization format, work under the hood? Why can it transmit data so compactly and quickly? It all starts with its encoding.

Define a simple message

The core unit of protobuf is the message. We will start with a simple message and explain protobuf's encoding in detail.

For example, the following is a very simple message object:

message Student {
  optional int32 age = 1;
}

In the above example, we defined a Student message with a single field named age. Now set age to 22 and serialize the message with protobuf. An encoder writes 22 in its shortest form, giving just the two bytes 08 16. To show how a value can span several bytes, this article works with a padded three-byte form that decodes to the same message:

08 96 00

Either way, a handful of bytes is enough to represent a message object, so the amount of data is very small.

So what exactly do these three bytes mean? Let's take a look.

Base 128 Varints

Before explaining the meaning of the above three bytes, we need to understand the concept of varints.

What are varints? They are a way of serializing integers in which the space occupied depends on the value: small integers take up fewer bytes, and large integers take up more. Because no fixed length is imposed, the data gets shorter, at the cost of slightly more complex parsing.

So how does a parser know how many bytes a number occupies? In protobuf, the highest bit of each byte is a continuation bit. If this bit is 1, the next byte belongs to the same number. If it is 0, this byte is the last one of the number. The remaining seven bits of each byte carry the payload, hence the name base 128.

For example, a byte is 8 bits. If it represents the integer 1, then it can be represented by the following byte:

0000 0001

If an integer does not fit in one byte, several bytes are chained together. For example, the following two bytes represent 300:

1010 1100 0000 0010

Why is it 300? Look at the first byte: its highest bit is 1, which means another byte follows. Look at the second byte: its highest bit is 0, which means the number ends here. Removing the continuation bit from each byte leaves the following two 7-bit groups:

010 1100 000 0010

The value cannot be read off yet, because protobuf stores the least significant 7-bit group first. To get the usual most-significant-first order, we swap the two groups:

000 0010 010 1100

That is:

10 010 1100

=256 + 32 + 8 + 4 = 300
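
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any protobuf library; the function names encode_varint and decode_varint are made up for this example, and the sketch assumes non-negative integers only.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow: set the continuation bit
        else:
            out.append(group)         # last group: continuation bit stays 0
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # groups arrive least significant first
        if not byte & 0x80:               # continuation bit is 0: number ends here
            return result, pos
        shift += 7


print(encode_varint(1).hex())              # 01
print(encode_varint(300).hex())            # ac02
print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)
```

Note that the decoder never swaps bytes explicitly: shifting each 7-bit group by 0, 7, 14, and so on puts it in the right place.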

The structure of the message body

From the definition of a message, we can see that the message body in protobuf has a key=value structure, where the key is derived from the field number (1, 2, 3, 4, and so on) assigned to the field in the message definition, and the value is the value actually set on that field.

When a message is encoded, these keys and values are concatenated into a byte stream. To parse the stream, the parser has to know where each value ends, so the key carries two parts. The first part is the field number from the proto file, and the second part describes how the value is laid out, which tells the parser how long it is.

Only by combining these two parts can the parser correctly parse the field.

This second part of the key is called the wire type. What wire types are there? Let's take a look:

Type	Meaning	Used for
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	groups (deprecated)
4	End group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

Apart from the deprecated types 3 and 4, the wire types fall into three categories. The first is the fixed-length category, types 1 and 5, which hold 64-bit and 32-bit values respectively.

The second category is type 0, Varint, a variable-length encoding used for general integer types, bool, and enums. The third category is type 2, length-delimited, which is typically used for strings, byte arrays, embedded messages, and so on.

Every key is itself encoded as a varint whose value is (field_number << 3) | wire_type, which means that the lowest three bits of the key store the wire type.

The key in our example above is 08, which in binary is:

0000 1000

The lowest three bits are 000, so the wire type is 0 (Varint). Shifting 08 right by three bits gives 1, so the key refers to field number 1, which is age.
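
As a small sketch of the key layout, the following Python snippet builds and splits keys. The helper names make_key and split_key are hypothetical, and the snippet assumes the key fits in a single byte, which holds for field numbers up to 15.

```python
def make_key(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into a key value."""
    return (field_number << 3) | wire_type


def split_key(key: int):
    """Split a key value into (field_number, wire_type)."""
    return key >> 3, key & 0x07  # the lowest three bits hold the wire type


print(hex(make_key(1, 0)))  # 0x8
print(split_key(0x08))      # (1, 0)
print(split_key(0x12))      # (2, 2)
print(split_key(0x1A))      # (3, 2)
```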

Then we look at the remaining part, 96 00, written in binary:

96 00 = 1001 0110  0000 0000

According to the definition of a varint, the highest bit of the first byte is the continuation bit, and since it is 1, the second byte belongs to the same number. For a varint, the 7-bit groups are stored least significant first, so after dropping the continuation bits we reverse their order:

1001 0110  0000 0000  (drop the highest bit of each byte)
→ 001 0110  000 0000  (swap the low-order and high-order groups)
→ 000 0000  001 0110

The above value is 16 + 4 + 2 = 22. The second group is all zeros and adds nothing, which is why the same value also fits in the single byte 16.

In this way, we get a key with field number 1, and the corresponding value is 22.
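
Putting the key and the value together, a varint field can be decoded as in the sketch below. The function names are invented for this illustration and the code only handles wire type 0. It also shows that the three-byte sequence used in this article and the shorter sequence 08 16 decode to the same field.

```python
def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_varint_field(data: bytes):
    """Decode a single field of wire type 0; return (field_number, value)."""
    key, pos = decode_varint(data, 0)
    field_number, wire_type = key >> 3, key & 0x07
    if wire_type != 0:
        raise ValueError("this sketch only handles varint fields")
    value, pos = decode_varint(data, pos)
    return field_number, value


print(decode_varint_field(bytes.fromhex("089600")))  # (1, 22)
print(decode_varint_field(bytes.fromhex("0816")))    # (1, 22)
```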

Signed integer

There are two families of signed integer types in protobuf: the standard int types, int32 and int64, and the signed int types, sint32 and sint64.

The difference between them lies in how negative integers are encoded. With int32 and int64, every negative integer is sign-extended to 64 bits and therefore always takes ten bytes as a varint, so these types are a poor fit for fields that often hold negative values.
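
The ten-byte cost can be checked with a short Python sketch. The names encode_varint and encode_int64 are made up for this example; the sketch assumes a negative value is written as its 64-bit two's complement bit pattern.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_int64(n: int) -> bytes:
    """Encode a signed integer the way a plain int field does (no ZigZag)."""
    # Masking to 64 bits turns a negative n into its two's complement pattern.
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


print(len(encode_int64(1)), encode_int64(1).hex())    # 1 01
print(len(encode_int64(-1)), encode_int64(-1).hex())  # 10 ffffffffffffffffff01
```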

If you use sint32 and sint64, the values are encoded with ZigZag, which is far more compact for negative integers.

ZigZag maps signed integers to unsigned integers so that numbers with a small absolute value get a small encoded value. For each sint32 value n, the following formula is used, where >> is an arithmetic (sign-extending) shift:

(n << 1) ^ (n >> 31)

For sint64 it is:

(n << 1) ^ (n >> 63)

For example:

Signed integer	Encoding result
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295
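
The table can be reproduced with a few lines of Python. This is a sketch for 32-bit values only, with hypothetical function names. It relies on Python's >> being an arithmetic shift for negative integers, and it masks the result to 32 bits because Python integers are unbounded.

```python
def zigzag32_encode(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned 32-bit integer."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode(z: int) -> int:
    """Inverse mapping: recover the signed integer from its ZigZag value."""
    return (z >> 1) ^ -(z & 1)


for n in (0, -1, 1, -2, 2147483647, -2147483648):
    z = zigzag32_encode(n)
    assert zigzag_decode(z) == n  # the mapping round-trips
    print(n, "->", z)
```

After this mapping the result is written as an ordinary varint, so -1 takes one byte instead of ten.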

String

The wire type of a string is 2, which means its value consists of a varint-encoded length followed by that many bytes of data. For example:

message Student {
  optional string name = 2;
}

Above we defined a second field, name, for Student. If the value "testing" is assigned to name, the resulting encoding is:

12 07 [74 65 73 74 69 6e 67]

The bytes in brackets are the UTF-8 representation of "testing".

0x12 can be parsed like this:

 0x12
→ 0001 0010  (binary representation)
→ 00010 010  (regroup bits)
→ field_number = 2, wire_type = 2

0x12 indicates that this is field 2 with wire type 2, and the 07 that follows is the length in bytes of the data that comes next.
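
The same bytes can be produced with a short Python sketch. The function names are invented for this example and are not part of any protobuf library.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field as key, length, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2  # wire type 2: length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


encoded = encode_string_field(2, "testing")
print(" ".join(f"{b:02x}" for b in encoded))  # 12 07 74 65 73 74 69 6e 67
```

The length counts bytes, not characters, so a string with non-ASCII characters has a length larger than its character count.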

Nested message

Messages can be nested in messages. Let's look at an example:

message Teacher {
  optional Student s = 3;
}

If we set the age field of s to 22, using the same three bytes as in the first example, then the encoding is:

1a 03 08 96 00

You can see that the last three bytes are the same as in the first example. The first two bytes are read the same way as for a string: 1a is the key (field number 3, wire type 2), and 03 is the length in bytes of the embedded message.
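
To see the nesting in action, here is a minimal Python sketch of a field walker. It is an illustration under simplifying assumptions: the name parse_fields is made up, and only wire types 0 and 2 are handled. The bytes alone do not say whether a length-delimited field is a string or a message, so the caller, who knows the schema, decides to parse field 3 again.

```python
def decode_varint(data: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(data: bytes):
    """Return a list of (field_number, wire_type, value) tuples."""
    fields = []
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint: the value is the decoded integer
            value, pos = decode_varint(data, pos)
        elif wire_type == 2:  # length-delimited: length, then raw bytes
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise NotImplementedError("wire type not covered by this sketch")
        fields.append((field_number, wire_type, value))
    return fields


teacher = bytes.fromhex("1a03089600")
outer = parse_fields(teacher)
print(outer[0][0], outer[0][1], outer[0][2].hex())  # 3 2 089600
print(parse_fields(outer[0][2]))                    # [(1, 0, 22)]
```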

Summary

That covers the basic encoding rules of protobuf and how they work at the byte level. Pretty neat, isn't it?

This article has been included in http://www.flydean.com/03-protobuf-encoding/


flydean
