Protocol Buffers (Protobuf) is a portable, efficient format for serializing structured data. It is language-independent, platform-independent, and extensible, which makes it well suited to data storage, communication protocols, and RPC data exchange. The official Protobuf developer guide is at developers.google.cn/protocol-bu…

What a serialization protocol needs to consider

Why do many RPC frameworks use Protobuf as their underlying serialization protocol? For an RPC framework, the ideal protocol conveys more information with less data on the wire, serializes and deserializes as fast as possible, and works across languages. To sum up:

1. Size of the serialized byte stream (network bandwidth)
2. Speed of serialization and deserialization (CPU resources)
3. Cross-language support

Compared with other formats, Protobuf parses faster (that is, serialization and deserialization are faster), takes up less space, and has better compatibility. It is therefore well suited to data storage and to data transmission over a network.

Other serialization protocols

  • JSON
  • XML
  • Hessian
  • Thrift
  • Kryo
  • protostuff
  • Protobuf (open-sourced by Google)
  • …

Java native serialization operations

Let’s start with an example of how Java’s native serialization is used:

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class Teacher implements Serializable {
    private static final long serialVersionUID = 8619259453444471644L;

    private long teacherId;
    private String name;
    private int age;
    private List<String> courses = new ArrayList<>();

    public Teacher(long teacherId, String name, int age) {
        this.teacherId = teacherId;
        this.name = name;
        this.age = age;
    }

    // Other getters and setters omitted; getCourses() is used by the test below.
    public List<String> getCourses() {
        return courses;
    }

    @Override
    public String toString() {
        return "Teacher{" +
                "teacherId=" + teacherId +
                ", name='" + name + '\'' +
                ", age=" + age +
                ", courses=" + courses +
                '}';
    }
}

Now let’s test Java’s native serialization:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

public class SerialTest {
    public static void main(String[] args) throws Exception {
        Teacher tim = new Teacher(1L, "Tim", 34);
        tim.getCourses().add("Java");

        // serialize
        byte[] byteArray = serialize(tim);
        System.out.println(Arrays.toString(byteArray));

        // deserialize
        Teacher teacher = deserialize(byteArray);
        System.out.println(teacher);
    }

    // serialize
    private static byte[] serialize(Teacher tim) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(tim);
        oos.flush(); // make sure buffered data reaches the byte array
        return bos.toByteArray();
    }

    // deserialize
    private static Teacher deserialize(byte[] bytes) throws Exception {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
        return (Teacher) ois.readObject();
    }
}

Running the test shows that Java native serialization turns this Teacher object into the following byte array:

[-84, -19, 0, 5, 115, 114, 0, 23, 100, 97, 121, 95, 48, 53, 46, 112, 114, 111, 116, 111, 98, 117, 102, 46, 84, 101, 97, 99, 104, 101, 114, 119, -99, -62, -50, 93, 124, 59, 92, 2, 0, 4, 73, 0, 3, 97, 103, 101, 74, 0, 9, 116, 101, 97, 99, 104, 101, 114, 73, 100, 76, 0, 7, 99, 111, 117, 114, 115, 101, 115, 116, 0, 16, 76, 106, 97, 118, 97, 47, 117, 116, 105, 108, 47, 76, 105, 115, 116, 59, 76, 0, 4, 110, 97, 109, 101, 116, 0, 18, 76, 106, 97, 118, 97, 47, 108, 97, 110, 103, 47, 83, 116, 114, 105, 110, 103, 59, 120, 112, 0, 0, 34, 0, 0, 0, 0, 0, 0, 1, 115, 114, 0, 19, 106, 97, 118, 97, 46, 117, 116, 105, 108, 46, 65, 114, 114, 97, 121, 76, 105, 115, 116, 120, -127, -46, 29, -103, -57, 97, -99, 3, 0, 1, 73, 0, 4, 115, 105, 122, 101, 120, 112, 0, 0, 0, 1, 119, 4, 0, 0, 0, 1, 116, 0, 4, 74, 97, 118, 97, 120, 116, 0, 3, 84, 105, 109]

Serialization via Protobuf

First download:

https://github.com/protocolbuffers/protobuf/releases/download/v3.7.0/protobuf-java-3.7.0.zip
https://github.com/protocolbuffers/protobuf/releases/download/v3.7.0/protoc-3.7.0-win64.zip

Next, define a teacher.proto file:

syntax = "proto2";
option java_package = "edu.xpu";
option java_outer_classname = "TeacherSerializer";
message Teacher{
	required int64 teacherId = 1;
	required int32 age = 2;
	required string name = 3;
	repeated string courses = 4;
}

The above fields are interpreted as follows:

message XXX {
    // Field rule: required -> the field must appear exactly once
    // Field rule: optional -> the field may appear 0 or 1 times
    // Field rule: repeated -> the field may appear any number of times (including 0)
    // Type: int32, int64, sint32, sint64, string, 32-bit ...
    // Field number: 1 to 536870911 (excluding 19000 to 19999)
    <field rule> <type> <name> = <field number>;
}

The Java file generated from teacher.proto (compiled with protoc, for example protoc --java_out=. teacher.proto) uses classes from the Protobuf runtime, so the project also needs the Protobuf dependency:

<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.13.0</version>
</dependency>

Copy the generated Java files into the project and test serialization and deserialization:

import java.util.Arrays;

// Assumes this class lives in the same package as the generated TeacherSerializer (edu.xpu).
public class ProtobufTest {
    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize();
        System.out.println(Arrays.toString(bytes));

        TeacherSerializer.Teacher teacher = deserialize(bytes);
        System.out.println(teacher);
    }

    // serialize
    private static byte[] serialize() {
        // build the Teacher
        TeacherSerializer.Teacher.Builder builder = TeacherSerializer.Teacher.newBuilder();
        builder.setName("Tim")
               .setAge(34)
               .setTeacherId(1L)
               .addCourses("Java");
        TeacherSerializer.Teacher teacher = builder.build();
        return teacher.toByteArray();
    }

    // deserialize
    private static TeacherSerializer.Teacher deserialize(byte[] bytes) throws Exception {
        return TeacherSerializer.Teacher.parseFrom(bytes);
    }
}

For a JavaBean with the same properties, Protobuf serialization and deserialization cost far less, and the output is dramatically smaller: by the wire format, this Teacher message needs only 15 bytes, in stark contrast to the size of the Java native byte array above.

So why are Java serialized objects so large? Java native serialization writes metadata about the Teacher class (package and class name, field names and types, and so on) into the stream alongside the data itself. How does Protobuf store the same object? Protobuf keeps the class information in the helper class that we generate on the command line, so the serialized bytes only need to carry field numbers and values.

Protobuf features and basic principles

1. The generated serializer (helper class) holds the class information of the object to be serialized.

2. Dynamic sizing: an int takes 1 to 5 bytes and a long takes 1 to 10 bytes. For example, age = 34 occupies only 1 byte, because only as many bytes as the value needs are written.

Let’s take a look at how Protobuf sizes values dynamically, using an unsigned int as an example:

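public void writeVarint32(int value) throws IOException {
    while (true) {
        if ((value & ~0x7F) == 0) {
            // The remaining value fits in 7 bits: write the last byte with the high bit clear.
            writeRawByte(value);
            return;
        } else {
            // Write the low 7 bits with the high (continuation) bit set, then move on.
            writeRawByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
    }
}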

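So what does this code mean? It is the essence of Protobuf's dynamic sizing: each pass emits the low 7 bits of the value in one byte and sets that byte's high bit if more bits remain, so a small value is finished after a single byte.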
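To make the loop concrete, here is a minimal, self-contained sketch of the same idea. The class name VarintEncodeSketch and the method encodeVarint32 are hypothetical names chosen for illustration; they are not part of the Protobuf library, and a ByteArrayOutputStream stands in for writeRawByte.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the Protobuf library.
final class VarintEncodeSketch {

    // Encodes an int as an unsigned base-128 varint, least significant 7-bit group first.
    static byte[] encodeVarint32(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            // More than 7 significant bits remain: emit 7 bits with the continuation bit set.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative inputs also terminate
        }
        out.write(value); // final group, continuation bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeVarint32(34)));  // prints [34]
        System.out.println(Arrays.toString(encodeVarint32(300))); // prints [-84, 2]
    }
}
```

The value 34 fits in one byte, while 300 needs two; -84 is simply how Java prints the byte 0xAC.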

In-depth analysis of the Protobuf principle

Here is how Protobuf squeezes as much performance and efficiency out of its encoding as possible. Varint is a compact way to represent numbers: it uses one or more bytes per number, and smaller numbers use fewer bytes, which reduces the total size. This is the dynamic sizing of unsigned ints analyzed above. Now let’s look at Protobuf’s encoding structure:

Many binary protocols use a Tag-Length-Value (TLV) structure: Tag is the unique identifier of the field, Length is the length of the Value, and Value is the data itself. Protobuf's encoding is similar but differs in important ways. The Tag is a Varint that packs the field number together with a 3-bit wire_type, computed as (field_number << 3) | wire_type, and a Length is written only for length-delimited values such as strings, so the layout is Tag-[Length]-Value.

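The wire_type part of the Tag takes one of six values: 0 (Varint), 1 (64-bit), 2 (Length-delimited), 3 (Start group), 4 (End group), and 5 (32-bit). The Start group and End group types are deprecated. The mapping between .proto types and the types of the major programming languages can be found on the official website.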
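As a rough illustration of the Tag and wire type layout, the sketch below hand-encodes the Teacher message from teacher.proto without the Protobuf library. WireFormatSketch and all of its methods are hypothetical names for this example; it assumes fields are written in field-number order, which should match what the generated toByteArray() produces for this message.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical hand-rolled encoder for illustration; real code should use the generated classes.
final class WireFormatSketch {
    static final int WIRETYPE_VARINT = 0;
    static final int WIRETYPE_LENGTH_DELIMITED = 2;

    // Tag = field number in the upper bits, wire type in the low 3 bits.
    static int makeTag(int fieldNumber, int wireType) {
        return (fieldNumber << 3) | wireType;
    }

    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Varint field: Tag followed directly by the value (no Length).
    static void writeVarintField(ByteArrayOutputStream out, int fieldNumber, long value) {
        writeVarint(out, makeTag(fieldNumber, WIRETYPE_VARINT));
        writeVarint(out, value);
    }

    // String field: Tag, then Length, then the UTF-8 bytes.
    static void writeStringField(ByteArrayOutputStream out, int fieldNumber, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeVarint(out, makeTag(fieldNumber, WIRETYPE_LENGTH_DELIMITED));
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Field numbers follow teacher.proto: teacherId = 1, age = 2, name = 3, courses = 4.
    static byte[] encodeTeacher(long teacherId, int age, String name, String... courses) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarintField(out, 1, teacherId);
        writeVarintField(out, 2, age);
        writeStringField(out, 3, name);
        for (String course : courses) {
            writeStringField(out, 4, course); // a repeated field is written once per element
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // prints [8, 1, 16, 34, 26, 3, 84, 105, 109, 34, 4, 74, 97, 118, 97]
        System.out.println(Arrays.toString(encodeTeacher(1L, 34, "Tim", "Java")));
    }
}
```

The result is 15 bytes: every field costs one Tag byte, the two strings add one Length byte each, and no class or field names appear in the stream.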

A fixed-width int32 always takes four bytes. With Varint, a small int32 can be represented in 1 byte. The trade-off is that large numbers need 5 bytes. Statistically, though, most numbers in a message are not large, so in most cases Varint represents numeric information in fewer bytes. The highest bit of each byte in a Varint has a special meaning: if the bit is 1, the following byte is also part of the number; if it is 0, this is the last byte. The other seven bits carry the number itself, least significant group first. Therefore any number less than 128 fits in one byte. Numbers of 128 or more need more; 300, for example, takes two bytes: 1010 1100 0000 0010.
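Reading the two bytes for 300 back can be sketched as follows. VarintDecodeSketch and decodeVarint32 are hypothetical names for this illustration, not Protobuf API; the method assumes the array starts with a single 32-bit Varint.

```java
// Hypothetical helper for illustration only; not part of the Protobuf library.
final class VarintDecodeSketch {

    // Decodes one unsigned 32-bit varint from the start of the array.
    static int decodeVarint32(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            // The low 7 bits carry data; groups arrive least significant first.
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result; // high bit clear: this was the last byte
            }
            shift += 7;
            if (shift >= 35) {
                break; // a 32-bit varint never needs more than 5 bytes
            }
        }
        throw new IllegalArgumentException("truncated or over-long varint");
    }

    public static void main(String[] args) {
        // 1010 1100 0000 0010 -> drop the high bits: 010 1100 and 000 0010
        // -> reverse the group order: 000 0010 010 1100 = 256 + 44
        byte[] encoded = {(byte) 0b1010_1100, (byte) 0b0000_0010};
        System.out.println(decodeVarint32(encoded)); // prints 300
    }
}
```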

Wire type 0 covers, among others, int32 and sint32, two very similar data types. Protocol Buffers distinguishes them in order to reduce the number of encoded bytes. In two's complement, a negative number has its highest bit set, so it looks like a very large unsigned integer. Encoding a negative int32 as a Varint therefore always takes the maximum length: ten bytes, because Protobuf sign-extends the value to 64 bits first. To avoid this, Protocol Buffers defines sint32, which uses ZigZag encoding. ZigZag represents signed numbers as unsigned ones with positive and negative values interleaved (0 → 0, -1 → 1, 1 → 2, -2 → 3, and so on), so numbers with a small absolute value stay small. For details on ZigZag, see “Zigzag: Small and Clever Number Compression Algorithms”.
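The ZigZag mapping is small enough to sketch directly. ZigZagSketch, its methods, and the varintSize helper are hypothetical names for this illustration; the shift-and-xor formulas are the standard ones for 32-bit ZigZag.

```java
// Hypothetical helper for illustration only; not part of the Protobuf library.
final class ZigZagSketch {

    // Maps signed to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    static int encodeZigZag32(int n) {
        // (n >> 31) is all ones for negative n and all zeros otherwise.
        return (n << 1) ^ (n >> 31);
    }

    // Inverse mapping.
    static int decodeZigZag32(int n) {
        return (n >>> 1) ^ -(n & 1);
    }

    // Number of bytes a value occupies as a varint (value treated as unsigned 64-bit).
    static int varintSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    public static void main(String[] args) {
        int n = -1;
        // int32: the value is sign-extended to 64 bits before varint encoding.
        System.out.println(varintSize((long) n));                         // prints 10
        // sint32: ZigZag first, then varint.
        System.out.println(varintSize(encodeZigZag32(n) & 0xFFFFFFFFL));  // prints 1
    }
}
```

This is why sint32 is the better choice for fields that often hold negative values, while int32 remains fine for values that are almost always non-negative.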

Pros and cons of Protobuf

The advantages of Protobuf

Protobuf is like XML, but smaller, faster, and simpler. You define your own data structure, then use the code produced by the code generator to read and write it. You can even update the data structure without redeploying programs. From a single description of the data structure, you can read and write your structured data in a variety of languages and across a variety of data streams.

It also has the nice feature of “backward” compatibility: you can upgrade a data structure without breaking already deployed programs that rely on the “old” data format. Your program therefore does not face large-scale refactoring or migration whenever a message structure changes, because adding a field to a new message requires no change to programs that have already been released.
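One way to see why added fields do not break old programs: a reader can skip any field number it does not recognize, using only the wire type in the Tag to know how many bytes to jump over. The sketch below is a simplified, hypothetical "old" reader (OldReaderSketch is not Protobuf API) that knows only teacherId (field 1) and name (field 3); the extra fields 5 and 6 in the example are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical minimal reader for illustration; real code should use the generated parseFrom().
final class OldReaderSketch {
    private final byte[] buf;
    private int pos = 0;

    private OldReaderSketch(byte[] buf) {
        this.buf = buf;
    }

    private long readVarint() {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalArgumentException("malformed varint");
    }

    // Skips one value of the given wire type without interpreting it.
    private void skip(int wireType) {
        switch (wireType) {
            case 0: readVarint(); break;                                // Varint
            case 1: pos += 8; break;                                    // 64-bit
            case 2: { int len = (int) readVarint(); pos += len; break; } // Length-delimited
            case 5: pos += 4; break;                                    // 32-bit
            default: throw new IllegalArgumentException("unsupported wire type " + wireType);
        }
    }

    // Parses a message, keeping only the fields this reader knows about.
    static String parse(byte[] message) {
        OldReaderSketch in = new OldReaderSketch(message);
        long teacherId = 0;
        String name = "";
        while (in.pos < in.buf.length) {
            int tag = (int) in.readVarint();
            int fieldNumber = tag >>> 3;
            int wireType = tag & 0x7;
            if (fieldNumber == 1 && wireType == 0) {
                teacherId = in.readVarint();
            } else if (fieldNumber == 3 && wireType == 2) {
                int len = (int) in.readVarint();
                name = new String(in.buf, in.pos, len, StandardCharsets.UTF_8);
                in.pos += len;
            } else {
                in.skip(wireType); // unknown field: ignore it and carry on
            }
        }
        return "teacherId=" + teacherId + ", name=" + name;
    }

    public static void main(String[] args) {
        // The Teacher message, followed by two invented newer fields:
        // field 5 (string "a@b") and field 6 (a 32-bit value).
        byte[] newer = {8, 1, 16, 34, 26, 3, 84, 105, 109, 34, 4, 74, 97, 118, 97,
                        42, 3, 97, 64, 98, 53, 1, 0, 0, 0};
        System.out.println(parse(newer)); // prints teacherId=1, name=Tim
    }
}
```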

Protobuf is semantically cleaner and needs nothing like an XML parser, because the Protobuf compiler compiles .proto files into data access classes that serialize and deserialize Protobuf data.

With Protobuf there is no complex document object model to learn. Its programming model is friendly and easy to pick up, and it has good documentation and examples, which makes it more attractive than other technologies to people who like simple things.

The disadvantages of Protobuf

Protobuf also has disadvantages compared with XML. Its feature set is deliberately simple, so it cannot represent complex concepts.

XML has become a standard authoring tool in many industries. Protobuf originated as an internal Google tool and, despite being open source, is far from being as universal.

Because structure cannot easily be interleaved with free text in Protobuf, it is not suitable for modeling text-based markup documents such as HTML. In addition, XML is to some extent self-describing and can be read and edited directly, whereas Protobuf is stored in binary: you cannot make sense of Protobuf data directly unless you have the .proto definition.

Reference data: www.cnblogs.com/onlysun/p/4…