This article mainly explains serialization and deserialization.
Serialization is a very important mechanism in network communication. A good serialization method can directly affect the performance of data transmission.
Serialization
Serialization converts an object into a form that can be transmitted as a data stream.
For example, an object can be converted directly into a binary data stream for transmission. It can also be converted into another format first and then into a data stream.
XML and JSON are examples of such formats. They express the state of an object in another data format, and that data is then converted into a binary data stream for network transmission.
Deserialization
Deserialization is the reverse of serialization: the process of restoring a byte sequence (for example, a byte array) to an object.
High-level understanding of serialization
The JDK provides built-in serialization for Java objects, mainly through the object output stream java.io.ObjectOutputStream and the object input stream java.io.ObjectInputStream.
java.io.ObjectOutputStream: represents the object output stream. Its writeObject(Object obj) method serializes the obj object passed as the parameter and writes the resulting byte sequence to a target output stream.
java.io.ObjectInputStream: represents the object input stream. Its readObject() method reads a byte sequence from the source input stream, deserializes it into an object, and returns it.
Note that the object to be serialized must implement the java.io.Serializable interface.
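As a minimal sketch of the two stream classes described above, the following round trip serializes an object to a byte array in memory and reads it back. The Point class and the helper names toBytes and fromBytes are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class InMemoryRoundTrip {
    // Hypothetical value class used only for this illustration.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Serializes an object into a byte array with ObjectOutputStream.writeObject.
    static byte[] toBytes(Serializable obj) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Restores an object from a byte array with ObjectInputStream.readObject.
    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point copy = (Point) fromBytes(toBytes(new Point(3, 4)));
        System.out.println(copy.x + "," + copy.y);
    }
}
```

Writing to a ByteArrayOutputStream instead of a file keeps the sketch self-contained; the same two calls work with any OutputStream or InputStream, including socket streams.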
The role of serialVersionUID
serialVersionUID literally means the serialization version number. Every class that implements the Serializable interface has a static variable that holds this version identifier.
A serialVersionUID can be generated in IDEA through the settings shown in Figure 5-1.
<center>Figure 5-1</center>
Let's demonstrate the role of serialVersionUID. First, create an ordinary Spring Boot project, and then follow the steps below.
Create User object
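import java.io.Serializable;

public class User implements Serializable {
    private static final long serialVersionUID = -8826770719841981391L;
    private String name;
    private int age;
    // Getters, setters, and toString() are omitted here; the examples below call setName, setAge, and getName.
}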
Write Java serialization code
import java.io.*;

public class JavaSerializer {
    public static void main(String[] args) {
        User user = new User();
        user.setAge(18);
        user.setName("Mic");
        serialToFile(user);
        System.out.println("Serialization succeeded, starting deserialization");
        User nuser = deserialFromFile();
        System.out.println(nuser);
    }

    // Writes the object to the file "user"; try-with-resources flushes and closes the stream.
    private static void serialToFile(User user) {
        try (ObjectOutputStream objectOutputStream =
                 new ObjectOutputStream(new FileOutputStream(new File("user")))) {
            objectOutputStream.writeObject(user);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Reads the object back from the file "user"; returns null if reading fails.
    @SuppressWarnings("unchecked")
    private static <T> T deserialFromFile() {
        try (ObjectInputStream objectInputStream =
                 new ObjectInputStream(new FileInputStream(new File("user")))) {
            return (T) objectInputStream.readObject();
        } catch (IOException | ClassNotFoundException e) {
            e.printStackTrace();
        }
        return null;
    }
}
UID verification demo steps
- First, serialize the user object to a file
- Then modify the User class by changing the value of its serialVersionUID field
- Then read the object back through deserialization
- Expected result: an error reports that the object cannot be deserialized
Conclusion
Java's serialization mechanism verifies version consistency by checking the serialVersionUID of the class. When deserializing, the JVM compares the serialVersionUID in the incoming byte stream with the serialVersionUID of the corresponding local class. If they are the same, the versions are considered consistent and the object can be deserialized. Otherwise a version mismatch exception, InvalidClassException, is thrown.
The result shows that the class recorded in the file stream and the class on the classpath (the modified class) are incompatible, so for safety the program throws an error and refuses to load the object. When no serialVersionUID is declared for a class, the Java serialization runtime computes one by applying a digest algorithm, similar to a fingerprint, to the details of the class. A change to the class structure then generally produces a completely different UID, and collisions between different classes are very unlikely. In that situation the UID computed for the modified class does not match the one saved in the file earlier, which produces the version mismatch error. Therefore, as long as we declare serialVersionUID explicitly and keep it unchanged, we can add a field or method after serialization without affecting later restoration. The restored object can still be used, and it gains the additional methods or properties.
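The mismatch can also be reproduced in a single program without editing and recompiling a class. The sketch below serializes an object, flips one bit of the serialVersionUID stored in the byte stream, and then tries to deserialize it. The Point class and the method name are hypothetical, and the offset calculation assumes the standard stream layout for a simple top-level object (4-byte header, two type markers, a length-prefixed class name, then the 8-byte UID).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class UidMismatchDemo {
    // Hypothetical class used only for this illustration.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x = 7;
    }

    // Returns the simple name of the exception raised when the UID in the stream
    // differs from the UID of the local class, or "none" if reading succeeded.
    static String deserializeWithTamperedUid() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(new Point());
            }
            byte[] data = bos.toByteArray();
            // Assumed layout: 4-byte stream header, TC_OBJECT, TC_CLASSDESC,
            // 2-byte name length, class name bytes, then the 8-byte serialVersionUID.
            int uidOffset = 4 + 1 + 1 + 2 + Point.class.getName().length();
            data[uidOffset + 7] ^= 0x01; // the stored UID changes from 1L to 0L
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                in.readObject();
            }
            return "none";
        } catch (IOException | ClassNotFoundException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(deserializeWithTamperedUid());
    }
}
```

Tampering with the bytes stands in for the "modify the class, then read the old file" steps: in both cases the UID in the stream no longer matches the local class.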
Tip: there are two ways to declare serialVersionUID explicitly:
One is the default 1L, for example: private static final long serialVersionUID = 1L;
The other is a 64-bit hash value generated from the class name, interface names, member methods, attributes, and so on.
When a class that implements the java.io.Serializable interface does not explicitly define a serialVersionUID variable, the Java serialization mechanism automatically generates a serialVersionUID based on the compiled class for version comparison. In this case, as long as the class itself (class name, method signatures, and so on) has not changed, the serialVersionUID stays the same however many times you compile. Changes such as adding spaces, line breaks, or comments do not affect it.
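To inspect the value the serialization mechanism will actually use, the JDK exposes java.io.ObjectStreamClass. The following sketch compares a class with an explicit serialVersionUID against one that relies on the generated value. The class names Explicit and Implicit and the helper uidOf are made up for this example.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

public class SerialVersionUidLookup {
    // Declares its UID explicitly, so the value never depends on the class structure.
    static class Explicit implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
    }

    // Declares no UID, so the serialization runtime computes one from the class details.
    static class Implicit implements Serializable {
        String name;
    }

    // Returns the serialVersionUID that serialization uses for the given class.
    static long uidOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("explicit: " + uidOf(Explicit.class));
        System.out.println("implicit: " + uidOf(Implicit.class));
    }
}
```

The implicit value is a hash, so it is not shown here; adding a field to Implicit and running the program again is a quick way to watch it change.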
The transient keyword
The transient keyword controls whether a variable is serialized. Adding it before a variable declaration prevents the variable from being serialized. After deserialization, a transient variable holds the default value of its type: 0 for int and null for object types.
If we do not want the name field in the User class to be serialized, we can modify the class as follows.
Modify the User class
public class User implements Serializable {
private static final long serialVersionUID = -8826770719841981391L;
private transient String name;
private int age;
}
Test effect
public class JavaSerializer {
public static void main(String[] args) {
User user=new User();
user.setAge(18);
user.setName("Mic");
serialToFile(user);
System.out.println("Serialization succeeded, starting deserialization");
User nuser=deserialFromFile();
System.out.println(nuser.getName()); // Prints the deserialized name, which is null.
}
}
Bypass the transient mechanism
Define private writeObject and readObject methods in the User class.
public class User implements Serializable {
private static final long serialVersionUID = -8826770719841981391L;
private transient String name;
private int age;
private void writeObject(ObjectOutputStream out) throws IOException {
out.defaultWriteObject();
out.writeObject(name); // Additionally write the transient name field
}
private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
in.defaultReadObject();
name=(String)in.readObject();
}
}
These two methods are not called directly. When ObjectOutputStream serializes the object and ObjectInputStream deserializes it, they look up these two methods in the target class and invoke them by reflection.
Summary of serialization
- Java serialization saves only the state of the object; it does not care about the object's methods
- When a parent class implements Serializable, its subclasses are automatically serializable and do not need to implement the interface explicitly
- When an instance variable of an object references other objects, the referenced objects are serialized automatically along with it (which can be used to implement deep cloning)
- When a field is declared transient, the default serialization mechanism ignores it
- If a field declared transient still needs to be serialized, add two private methods: writeObject and readObject
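Two of the points above, inherited serializability and automatic serialization of referenced objects, can be seen in one short sketch. The classes Address, Person, and Employee and the helper deepCopy are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class DeepCloneDemo {
    static class Address implements Serializable {
        private static final long serialVersionUID = 1L;
        String city;
        Address(String city) { this.city = city; }
    }

    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        Address address; // referenced object, serialized together with Person
        Person(String name, Address address) { this.name = name; this.address = address; }
    }

    // Does not mention Serializable itself; it inherits it from Person.
    static class Employee extends Person {
        private static final long serialVersionUID = 1L;
        String title;
        Employee(String name, Address address, String title) {
            super(name, address);
            this.title = title;
        }
    }

    // Deep-copies an object graph by serializing it and reading it back.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T original) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("deep copy failed", e);
        }
    }

    public static void main(String[] args) {
        Employee original = new Employee("Mic", new Address("Hangzhou"), "architect");
        Employee copy = deepCopy(original);
        // The copy has its own Address instance with the same content.
        System.out.println(copy.address != original.address);
        System.out.println(copy.address.city + " " + copy.title);
    }
}
```

Cloning through serialization is simple but comparatively slow, so it is usually kept for occasional copies rather than hot paths.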
Common serialization technology and pros and cons analysis
With the spread of distributed and microservice architectures, communication between services has become a basic requirement. We need to consider not only communication performance but also language diversity.
Therefore, improving serialization performance and solving cross-language problems have become key considerations.
The serialization mechanism provided by Java itself has two problems:
- The serialized data is relatively large, so transmission efficiency is low
- Other languages cannot recognize or interoperate with it
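The size problem is easy to observe. The sketch below compares the number of bytes produced by Java serialization with the bytes needed to write the same two fields by hand with DataOutputStream. The Member class and both method names are hypothetical, and the hand-written format is only a baseline for comparison, not a recommended protocol.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SizeComparison {
    // Hypothetical class with the same two fields as the User example.
    static class Member implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "Mic";
        int age = 18;
    }

    // Size of the stream produced by standard Java serialization,
    // which includes the class description as well as the field values.
    static int javaSerializedSize(Member m) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(m);
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Size when only the field values are written: a length-prefixed string and a 4-byte int.
    static int handWrittenSize(Member m) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bos)) {
                out.writeUTF(m.name);
                out.writeInt(m.age);
            }
            return bos.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Member m = new Member();
        System.out.println("Java serialization: " + javaSerializedSize(m) + " bytes");
        System.out.println("Fields only: " + handWrittenSize(m) + " bytes");
    }
}
```

Most of the difference comes from metadata such as the class name and field descriptors, which Java serialization writes into the stream alongside the values.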
As a result, object serialization based on XML encoding was mainstream for a long time. It solved the multi-language compatibility problem and was easier to understand than binary serialization.
Consequently, the XML-based SOAP protocol and the corresponding WebService frameworks were for a long time an essential technology in all mainstream development languages.
Later, HTTP REST interfaces using JSON, a simple text format, largely replaced the complex Web Service interfaces and became the first choice for remote communication in distributed architectures.
However, JSON takes up a lot of space and is comparatively slow to process. At the same time, mobile client applications need to transmit data more efficiently to improve the user experience. In this situation, language-independent and efficient binary encoding protocols became one of the most sought-after technologies.
MessagePack, the first open source binary serialization framework, was born. It appeared earlier than Google's Protocol Buffers.
Introduction to XML serialization framework
The advantage of XML serialization is that it is readable and easy to debug. However, the serialized output is relatively large and efficiency is not high, so it suits data exchange between internal enterprise systems with modest performance requirements and low QPS. XML is also language-independent, so it can be used for data exchange and protocols between heterogeneous systems. For example, the well-known Webservice uses XML to serialize data. There are many ways to implement XML serialization/deserialization; well-known ones include XStream and Java's own XML serialization and deserialization.
Add the dependency
<dependency>
<groupId>com.thoughtworks.xstream</groupId>
<artifactId>xstream</artifactId>
<version>1.4.12</version>
</dependency>
Write test program
public class XMLSerializer {
public static void main(String[] args) {
User user=new User();
user.setName("Mic");
user.setAge(18);
String xml=serialize(user);
System.out.println("Serialization complete: "+xml);
User nuser=deserialize(xml);
System.out.println(nuser);
}
private static String serialize(User user){
return new XStream(new DomDriver()).toXML(user);
}
private static User deserialize(String xml){
return (User)new XStream(new DomDriver()).fromXML(xml);
}
}
JSON serialization framework
JSON (JavaScript Object Notation) is a lightweight data exchange format. Compared with XML, JSON produces a smaller byte stream and is very readable. JSON is now the most common data format in enterprise use.
Many open source tools are commonly used for JSON serialization:
- Jackson (https://github.com/FasterXML/jackson)
- Alibaba's open source Fastjson (https://github.com/alibaba/fastjson)
- Google's GSON (https://github.com/google/gson)
Among these JSON serialization tools, Jackson and Fastjson perform better than GSON, while Jackson and GSON are more stable than Fastjson. The advantage of Fastjson is that its API is very easy to use.
Add the dependency
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.72</version>
</dependency>
Write test program
public class JsonSerializer{
public static void main(String[] args) {
User user=new User();
user.setName("Mic");
user.setAge(18);
String json=serializer(user);
System.out.println("Serialization complete: "+json);
User nuser=deserializer(json);
System.out.println(nuser);
}
private static String serializer(User user){
return JSON.toJSONString(user);
}
private static User deserializer(String json){
return (User)JSON.parseObject(json,User.class);
}
}
Hessian serialization
Hessian is a binary serialization protocol that supports cross-language transmission. Compared with Java's default serialization mechanism, Hessian has better performance and ease of use, and it supports a variety of languages.
In fact, Dubbo uses Hessian serialization, but Dubbo has refactored Hessian to achieve higher performance.
Add the dependency
<dependency>
<groupId>com.caucho</groupId>
<artifactId>hessian</artifactId>
<version>4.0.63</version>
</dependency>
Write test program
public class HessianSerializer {
public static void main(String[] args) throws IOException {
User user=new User();
user.setName("Mic");
user.setAge(18);
byte[] bytes=serializer(user);
System.out.println("Serialization complete");
User nuser=deserializer(bytes);
System.out.println(nuser);
}
private static byte[] serializer(User user) throws IOException {
ByteArrayOutputStream bos=new ByteArrayOutputStream(); // An output stream that writes to memory
HessianOutput ho=new HessianOutput(bos);
ho.writeObject(user);
return bos.toByteArray();
}
private static User deserializer(byte[] data) throws IOException {
ByteArrayInputStream bis=new ByteArrayInputStream(data);
HessianInput hi=new HessianInput(bis);
return (User)hi.readObject();
}
}
Avro serialization
Avro is a data serialization system designed to support applications that exchange large volumes of data. Its main features are support for binary serialization, which handles large amounts of data conveniently and quickly, and friendliness to dynamic languages: the mechanisms Avro provides let dynamic languages process Avro data easily.
Avro is an Apache sub-project of Hadoop that provides serialization, deserialization, and RPC. Its serialization is more efficient than the JDK's, comparable to Google's Protobuf, and better than Thrift, which Facebook open-sourced (and Apache later took over).
Because Avro uses a schema, serializing a large number of objects of the same type requires storing only one copy of the class's structure information plus the data, which greatly reduces the amount of network traffic or storage.
Add the dependency
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.8.2</version>
</dependency>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro-ipc</artifactId>
<version>1.8.2</version>
</dependency>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.8.2</version>
<executions>
<execution>
<id>schemas</id>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<sourceDirectory>${project.basedir}/src/main/avro</sourceDirectory>
<outputDirectory>${project.basedir}/src/main/java</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
Write avsc file
Create a /src/main/avro directory to store the Avro schema definition files.
{
"namespace":"com.gupao.example",
"type":"record",
"name":"Person",
"fields":[
{"name":"name","type":"string"},
{"name":"age","type":"int"},
{"name":"sex","type":"string"}
]
}
The entries in the avsc file are defined as follows:
- namespace: the namespace; when the plug-in generates code, it becomes the package name of the Person class
- type: one of record, enum, array, map, union, or fixed; a record is equivalent to an ordinary class
- name: the class name; the full name of the class is namespace + name
- doc: a comment
- aliases: alternative names by which the type can be referenced elsewhere
- fields: the attributes, each of which has:
  - name: the attribute name
  - type: the attribute type; a union such as ["int","null"] lets the field hold either an int or null
  - default: the default value of the attribute
  - doc: a comment
Generate code
Execute maven install. The Person class will be generated in the main/java directory.
Write test program
public class AvroSerializer {
public static void main(String[] args) throws IOException {
Person person=Person.newBuilder().setName("Mic").setAge(18).setSex("male").build();
ByteBuffer byteBuffer=person.toByteBuffer(); // Serialize
System.out.println("Serialized size: "+byteBuffer.array().length);
Person nperson=Person.fromByteBuffer(byteBuffer);
System.out.println("Deserialized: "+nperson);
}
}
The following version serializes to a file and deserializes from it.
public class AvroSerializer {
public static void main(String[] args) throws IOException {
Person person=Person.newBuilder().setName("Mic").setAge(18).setSex("male").build();
// Serialize: write the record to person.avro together with its schema
DatumWriter<Person> personDatumWriter=new SpecificDatumWriter<>(Person.class);
DataFileWriter<Person> dataFileWriter=new DataFileWriter<>(personDatumWriter);
dataFileWriter.create(person.getSchema(),new File("person.avro"));
dataFileWriter.append(person);
dataFileWriter.close();
System.out.println("Serialization succeeded.....");
// Deserialize: read the first record back from the file
DatumReader<Person> personDatumReader=new SpecificDatumReader<>(Person.class);
DataFileReader<Person> dataFileReader=new DataFileReader<Person>(new File("person.avro"),personDatumReader);
Person nper=dataFileReader.next();
System.out.println(nper);
dataFileReader.close();
}
}
Kryo serialization framework
Kryo is a very mature serialization implementation that is widely used in projects such as Hive and Storm, but it is not cross-language. Dubbo has supported Kryo serialization since version 2.6, and its performance is better than that of the earlier hessian2.
ZooKeeper uses Jute as its serialization mechanism.
Protobuf serialization
Protobuf is a data exchange format from Google that is independent of language and platform. Google provides implementations for a variety of languages, such as Java, C++, Go, and Python, and each implementation includes the corresponding compiler and library files. Protobuf is purely a presentation-layer protocol and can be used together with various transport-layer protocols.
Protobuf is widely used mainly because of its low space overhead and good performance. It is well suited to internal RPC calls that require high performance. In addition, because parsing is fast and the serialized data is small, it can also be used for object persistence.
However, Protobuf is relatively troublesome to use: it has its own syntax and its own compiler, so using it requires an investment in learning the technology.
Another disadvantage of Protobuf is that every class to be transmitted must have a corresponding proto file. If a class is modified, its proto file must be updated and the code regenerated.
The general steps for developing with protobuf are:
- Configure the development environment and install the protoc compiler
- Write a .proto file to define the data structure of the serialized object
- Use the protoc compiler to generate the corresponding serialization/deserialization classes from the .proto file
- Write your own serialization application on top of the generated code
Install the protobuf compiler
- Download protoc-3.5.1-win32.zip from https://github.com/google/protobuf/releases
Write proto file
syntax="proto2";
package com.gupao.example;
option java_outer_classname="UserProtos";
message User {
    required string name=1;
    required int32 age=2;
}
The data types are described as follows:
- string / bytes / bool / int32 (4 bytes)/int64/float/double
- enum enumeration class
- message custom class
Modifiers
- required means a required field
- optional means an optional field
- repeated means the field can be repeated, representing a collection
- 1, 2, 3, 4 are field numbers; they must be unique within the message and identify each field in the encoded data
Generate the Java class by running the following command in cmd
protoc.exe --java_out=./ ./User.proto
Implement serialization
<dependency>
<groupId>com.google.protobuf</groupId>
<artifactId>protobuf-java</artifactId>
<version>3.12.2</version>
</dependency>
Write test code.
public class ProtobufSerializer {
public static void main(String[] args) throws InvalidProtocolBufferException {
UserProtos.User user=UserProtos.User.newBuilder().setName("Mic").setAge(18).build();
ByteString bytes=user.toByteString();
System.out.println(bytes.toByteArray().length);
UserProtos.User nUser=UserProtos.User.parseFrom(bytes);
System.out.println(nUser);
}
}
Protobuf serialization principle analysis
We can print out the serialized data to see the result
public static void main(String[] args) {
UserProtos.User user=UserProtos.User.newBuilder().
setAge(300).setName("Mic").build();
byte[] bytes=user.toByteArray();
for(byte bt:bytes){
System.out.print(bt+" ");
}
}
10 3 77 105 99 16 -84 2
The serialized bytes are hard to read at a glance, but the serialized data is indeed small, so let's look at the underlying principles.
Normally, making the serialized result as small as possible requires some form of compression. protobuf uses two compact encodings for this: varint and zigzag.
varint
Let's start with varint by looking at how "Mic" is encoded.
The characters of "Mic" are converted to numbers according to the ASCII table.
M=77, i=105, c=99
So the result is 77 105 99
You may wonder why the result here is simply the ASCII values, with no compression.
The reason is that a string field is written as its raw bytes, and each ASCII character already fits in a single byte, so there is nothing to shrink. varint matters for integers: a value that fits in 7 bits is written as one unchanged byte, while a larger value is split across several bytes.
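For example, if we set age=300, the value needs 2 bytes. Let's look at how it is encoded.
300 in binary is 1 0010 1100. varint takes 7 bits at a time starting from the low end and sets the high bit of a byte to 1 when more bytes follow: the low 7 bits 010 1100 become 1010 1100, and the remaining bits 10 become 0000 0010.
The resulting two bytes are: -84, 2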
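The encoding of 300 can be checked with a few lines of code. This is a minimal sketch of varint encoding written for this article, not the code protobuf itself uses; the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class VarintDemo {
    // Encodes a value as a varint: 7 bits per byte, low-order group first,
    // with the high bit set on every byte except the last.
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // more bytes follow
            value >>>= 7;
        }
        out.write((int) value); // last byte, high bit is 0
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeVarint(300))); // prints [-84, 2]
        System.out.println(Arrays.toString(encodeVarint(18)));  // prints [18]
    }
}
```

A value below 128, such as 18, comes out as a single unchanged byte, which matches the observation that small numbers are not altered by the encoding.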
How is -84 obtained? Java prints each byte as a signed value. A byte whose high bit is 1 is negative and is stored in two's complement, which is the bitwise inverse of the magnitude plus 1.
So, working backwards from 1010 1100:
- [Two's complement] 10101100 - 1 gives 10101011
- [Inverted] 01010100 is 84; since the high bit was 1, the value is negative, so the result is -84
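The same relationship can be confirmed in code. In this small sketch (the class and method names are hypothetical), the bit pattern 1010 1100 is printed both as a signed Java byte and as an unsigned value.

```java
public class SignedByteDemo {
    // Interprets the 8 bits of a Java byte as an unsigned value between 0 and 255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0b1010_1100;
        System.out.println(b);                                   // prints -84
        System.out.println(toUnsigned(b));                       // prints 172
        System.out.println(Integer.toBinaryString(toUnsigned(b))); // prints 10101100
    }
}
```

So the -84 in the output is only a display effect of Java's signed byte type; the byte on the wire is 0xAC.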
Storage format
protobuf uses TLV (Tag-Length-Value) as its storage format
The tag is calculated as field_number (the number of the current field) << 3 | wire_type
For example, the field number of name ("Mic") is 1 and its wire_type is 2, so: 1<<3|2 = 10
The field number of age=300 is 2 and its wire_type is 0, so: 2<<3|0 = 16
So in TLV form, the first field, name, is {10} {3} {77 105 99}. The second field, age, is {16} {-84 2}: a varint field (wire_type 0) has no length part, so the tag is followed directly by the value.
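To make the layout concrete, the following sketch decodes the eight bytes printed earlier by hand. It is a simplified reader written for this article, not the protobuf library: it handles only wire types 0 and 2, and it assumes that every wire type 2 field holds UTF-8 text. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class WireFormatDemo {
    // Computes a tag from a field number and a wire type.
    static int tag(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    // Minimal cursor over the encoded bytes.
    static final class Reader {
        final byte[] data;
        int pos;
        Reader(byte[] data) { this.data = data; }

        // Reads one varint: 7 payload bits per byte, low-order group first.
        long readVarint() {
            long result = 0;
            int shift = 0;
            while (true) {
                byte b = data[pos++];
                result |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    return result;
                }
                shift += 7;
            }
        }
    }

    // Decodes the message into "fieldNumber=value" strings.
    static List<String> decode(byte[] data) {
        Reader r = new Reader(data);
        List<String> fields = new ArrayList<>();
        while (r.pos < data.length) {
            long tag = r.readVarint();
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 0x7);
            if (wireType == 0) {          // varint: tag, then value
                fields.add(fieldNumber + "=" + r.readVarint());
            } else if (wireType == 2) {   // length-delimited: tag, length, value
                int length = (int) r.readVarint();
                // Assumption for this sketch: the payload is UTF-8 text.
                fields.add(fieldNumber + "=" + new String(data, r.pos, length, StandardCharsets.UTF_8));
                r.pos += length;
            } else {
                throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        byte[] message = {10, 3, 77, 105, 99, 16, -84, 2};
        System.out.println(tag(1, 2) + " " + tag(2, 0)); // prints 10 16
        System.out.println(decode(message));             // prints [1=Mic, 2=300]
    }
}
```

Note that the decoder recovers only field numbers, not field names: the names live in the .proto file, which is one reason the encoded data is so small.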
5.5.3 How to store negative numbers
In a computer, a negative number has its highest bit set as the sign bit, so it looks like a very large unsigned integer. If a negative number is written with plain varint encoding it therefore cannot be shortened: a negative int32 or int64 value always occupies 10 bytes. For this reason protobuf provides the sint32/sint64 types for negative numbers. A negative number is first zigzag-encoded (which converts a signed number into an unsigned one) and then varint-encoded.
sint32:(n << 1) ^ (n >> 31)
sint64:(n << 1) ^ (n >> 63)
For example, store a value of (-300).
Modify the original proto file
message User {
    required string name=1;
    required int32 age=2;
    required sint32 status=3; // Add a sint32 field
}
Set a value
UserProtos.User user=UserProtos.User.newBuilder().setAge(300).setName("Mic").setStatus(-300).build();
The output at this time:
10 3 77 105 99 16 -84 2 24 -41 4
We can see that the negative value is encoded differently: zigzag encoding is applied first, and then the result is varint-encoded.
Taking -300 as the example (only the low 12 bits are shown):
300: 0001 0010 1100
inverted: 1110 1101 0011
plus 1: 1110 1101 0100 (this is -300)
n<<1: shift left by one bit, filling the right with 0 -> 1101 1010 1000
n>>31: shift right by 31 bits, filling the left with 1 -> 1111 1111 1111
(n<<1) ^ (n>>31):
1101 1010 1000 ^ 1111 1111 1111 = 0010 0101 0111
Decimal: 0010 0101 0111 = 599
The purpose is to eliminate the high-order 1 bits so that the value can be stored compactly. 599 is then varint-encoded.
varint algorithm: starting from the right, take 7 bits at a time and set the high bit of each byte to 1 if more bytes follow, otherwise 0.
This gives two bytes:
1101 0111 0000 0100
-41, 4
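The whole calculation for -300 can be reproduced in code. The sketch below applies the sint32 zigzag formula and then varint-encodes the result; for comparison it also varint-encodes -300 directly after sign extension to 64 bits. It is an illustration written for this article with made-up names, not the protobuf implementation.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class ZigZagDemo {
    // sint32 zigzag encoding: maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
    static int encodeZigZag32(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Inverse of encodeZigZag32.
    static int decodeZigZag32(int n) {
        return (n >>> 1) ^ -(n & 1);
    }

    // Plain varint encoding of an unsigned 64-bit pattern.
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int zigzag = encodeZigZag32(-300);
        System.out.println(zigzag);                                // prints 599
        System.out.println(Arrays.toString(encodeVarint(zigzag))); // prints [-41, 4]
        System.out.println(decodeZigZag32(zigzag));                // prints -300
        // Without zigzag, the sign-extended value needs the maximum varint length.
        System.out.println(encodeVarint(-300L).length);            // prints 10
    }
}
```

The last line shows why sint32 is worth using for fields that are often negative: two bytes instead of ten for the same value.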
5.5.4 Summary
Protocol Buffers performs well mainly because the serialized data is small and serialization is fast, which makes transmission efficient. The reasons are as follows:
Reasons for fast serialization:
a. The encoding/decoding method is simple (only simple arithmetic such as bit shifts)
b. Protocol Buffers' own framework code and compiler do the work together
Reasons for the small size of the serialized data (that is, the data is stored very compactly):
a. It uses dedicated encodings such as Varint and Zigzag
b. It uses the T-L-V storage format, which avoids separators and stores data compactly
Serialization technology selection
Technical considerations
- Space overhead: the size of the serialized result, which affects transmission performance
- Time overhead: if serialization takes too long, it affects the response time of the business
- Whether the serialization protocol supports cross-platform and cross-language use. Architectures are now more flexible, so this must be considered if heterogeneous systems need to communicate
- Scalability/compatibility. In real business development, systems are updated rapidly as requirements iterate, so the serialization protocol must have good scalability/compatibility. For example, adding a new business field to an existing serialized data structure should not affect existing services
- Popularity. The more popular a technology is, the more companies use it, so more pitfalls have already been found and resolved and the solution is relatively mature
- Learning difficulty and ease of use
Selection suggestions
- For scenarios with low performance requirements, the XML-based SOAP protocol can be used
- For scenarios with relatively high requirements for performance and compactness, Hessian, Protobuf, Thrift, and Avro are all suitable
- For front-end/back-end separation or standalone external API services, JSON is a better choice because it is easy to debug and read
- Avro's design philosophy leans towards dynamically typed languages, so Avro can be considered in such scenarios
Copyright statement: All articles in this blog, except for special statements, adopt the CC BY-NC-SA 4.0 license agreement. Please indicate the reprint from Mic takes you to learn architecture!