Why Protobuf is better than JSON or XML?

In Depth microservices

Hi, I’m Handsome Brother Miao.

Today, I take you see at a deeper level of Protobuf, if you are not familiar with usage of Protobuf, directly to: developers.google.com/protocol-bu… .

With a Protobuf serialized data, you will understand that it is more efficient to transfer data than JSON or XML.

So why is it high? This article takes a look at that question.

Look at the surface

For JSON and XML, the structured information of the data is preserved for easy readability during data transmission. For example, the following is an example of JSON:

{
  "name": "laomiao"."age": 18
}
Copy the code

When the message is sent, the recipient will understand that it is in the form of a “key/value” and that “name” is followed by the name and “age” is followed by the age.

So how do you compress that data?

We can remove “curly braces”, “name”, “age”, and other “colons”, “commas”, “quotes” and other structural data.

laomiao18
Copy the code

So how does the recipient know which one is the name if it’s deleted? Which is age?

Delete “structure”

Both the sender and the receiver need to retain the “structure” of the data. The sender only sends the data, and the receiver parses the data according to the locally retained “structure” after receiving the data.

Let’s say that the structure is as follows, and this is not real, but it’s just for the sake of description.

{
  name string 7
  age int 1
}
Copy the code

From this “structure”, we can know:

The name data precedes the age data.
The value of name is string, and the value of age is int.
The length of the name data byte is 7; the length of the age data byte is 1.

All the recipient needs to do is hold this structure and see how the laomiAO18 data is parsed.

Since the description

But there are still these problems:

What if the name data exceeds 7 bytes?
What if age data exceeds 1 byte?
The order in the structure can not be adjusted, too dead, how to do?

Of course, both the sender and the receiver can update their “structured” data, but that’s not practical because you can’t guarantee that the data is of fixed length.

For age data, we can define it as four bytes or eight bytes, as long as we can handle our business. But there’s a problem with that. Waste of space?

Let’s say age is defined as 4 bytes, 18 data is transferred, and for 18, only 1 byte is enough, and the remaining 3 bytes are wasted. But we can’t define it as one byte, because there might be a large number.

So how do you compress age data?

With a Protobuf, information is added to the data to solve this problem, that is, the data describes itself, or “self-description.”

What does Protobuf do? As follows:

Add “field” order information to data.
Add type information to data.
Minimize compression shaping data.

Protobuf

When a Protobuf serializes data, it divides a Protobuf data type into six categories, called “wire types “.

wire type	Proto type	meaning
0	int32, int64, uint32, uint64, sint32, sint64, bool, enum	Varint
1	fixed64, sfixed64, double	64-bit
2	string, bytes, embedded messages, packed repeated fields	Length-delimited
3	Groups (obsolete)	Start group
4	Groups (obsolete)	End group
5	fixed32, sfixed32, float	32-bit

The “3” and “4” types in “wire type” are deprecated.

Next, a message expands the description as follows:

message HelloRequest {
  string name = 1;
  int32 num = 2;
  float height = 3;
  repeated int32 hobbies= 4;
}
Copy the code

This is like the “structure” I mentioned above, through which the sender and receiver parse the data. Now let’s go over the questions left above.

1. Type and order

How to store “data type” and “order” in the transmitted data?

The data type corresponds to “Wire type” and the sequence corresponds to” field number “. Int32 num = 2

Wire type: 0, according to the table above.
Field number: 2, the unique code after the field.

Assemble the two pieces of information as follows:

(field_number << 3) | wire_type
Copy the code

To:

(2 < < 3) | 0 to 16Copy the code

2. Varint

How to compress the data stored in num field? Suppose num stores 300. The 4-byte storage is as follows:

00000000 00000000 00000001 00101100
Copy the code

As you can see from the result, the actual valid data is only 2 bytes. For compression, different data sizes will occupy different bytes.

So how do you record the data length? We can add another byte to record the actual number of bytes taken up by the real data. For 300 data, adding one byte to the record length adds up to 3 bytes with the data. Is there any way to reduce the number of bytes?

Of course there will be, or I’ll just say a bunch of crap. Let’s move on.

Please find Varint algorithm, the process is as follows:

The data is segmented by a group of 7 bits;
Reverse the order of the group, that is, change the “high → low” rule to “low → high”;
Identify each group and precede it with a “1” if there is data behind it, or a “0” otherwise.

Put data 300 into the algorithm, and the process is as follows:

300: 00000000 00000000 00000001 00101100 → 7 bits split: 00000000000 000000010 0101100 → reverse order: 0101100 0000010 0000000 0000 → Add 1/0:10 to group 101100 00000010 → Decimal: 172 2Copy the code

According to this algorithm, the data is compressed into two bytes of storage. After receiving the byte data, the receiver only needs to identify it according to the high order. If it is 0, it indicates that there is no data after that.

Finally, for int32 num = 2 and data 300, the compressed result is:

16, 172, 2Copy the code

3. Length-delimited

Now say string name = 1, which corresponds to “wire type” 2 and “field number” 2. Log “order” and “type” as described above.

This type is much simpler than using Varint to record the length of the data in bytes.

If the value of name is “Miao”, the final result is:

10 4 109 105 97 111
Copy the code

Explanation:

10:(2 < < 3) | 2 。
4: indicates the length of the string.
After: Save according to “UTF-8” code.

For message nesting, repeated (array or slice), byte array, this algorithm is also used to obtain.

For example, when repeated int32 hobbies= 4, suppose hobbies= [10, 20]

34 2 10 20
Copy the code

4. The floating point number

For floating-point types, it’s even simpler. Floating-point data is stored in fixed bytes, and the “order” and “type” records are the same as above.

If float height = 3 and the corresponding “wire type” is 5, the data assumption is 52.1, the final result is:

29 102 102 80 66
Copy the code

Explanation:

29:(3 < < 3) | 5 。
After: use a fixed number of bytes 4.

If a double is used, the corresponding “wire type” is 1 and the number of bytes consumed is 8.

5. sint32/sint64

I don’t know if you use these two types when writing proto files, but it’s important to understand this, otherwise sometimes the data will not be compressed.

In the Varint algorithm mentioned above, we know that a group of 7 bits is added with an “identification bit” to compress the data. But there’s a problem. If there’s a negative number, this kind of compression doesn’t work.

Why? How is it solved?

I’ll start with the results. If you write a proto file and set the data type to sint32 or sint64, the ZigZag algorithm will be used for data compression.

The ZigZag algorithm, I’m not going to repeat it, I’m going to read it.

summary

By the end of this article we know how Protobuf can compress data. To put it simply, it is to delete some useless information and record “type”, “order” and “data” in a self-describing way.

For types, only “wire type” is recorded, which determines roughly how the data will be processed.

Is it better than JSON or XML? Also is not.

Because to transmit data using a Protobuf, the sender and receiver must use the same set of structural rules, or protocols. So, if you want to make your data more readable and reduce this rule alignment, you can use JSON, XML.

Later I will use Go to implement the core algorithm for Protobuf serialization and deserialization, as long as I feel that I really understand the true meaning of the algorithm.

Sustainability Concerns: github.com/miaogaolin/… All the code in this series will be added later.

reference

Explanation: website algorithm developers.google.com/protocol-bu…