Protobuf is mainly known for small data encoding, mainly used for data interaction and data storage, reduce bandwidth, disk, poor mobile network environment to reduce packet size and other scenarios, about the serialization speed mainly depends on the SDK you use, so this article will not care about the serialization speed! This article will be introduced in proto3 syntax!

Agreement on

  1. The difference between PB3 and PB2 is basically,
  • Pb3 introduces default values for all primitive types, so the default values are not serialized during data transfer, which makes it difficult to tell whether the values are set or not! So then PB3 supports it againoptional!
  • If your business needs itrequiredKeyword, but PB3 is notrequiredKeyword!
  • And PB3 does not support default Settings, and does not supportgroup messageSo I still recommend if you are doing business development or PB2 is better!
  1. Basically the syntax is as follows, specific to see the official documentation: developers.google.com/protocol-bu…
syntax = "proto3";

message TestData {
  enum EnumType {
    UnknownType = 0; // Must start with 0!
    Test1Type = 1;
    Test2Type = 2;
  }
  message TestObj {
    int64 t_int64 = 1;
  }
  string t_string = 1;
  int64 t_int64 = 2;
  bool t_bool = 3;
  fixed64 t_fix64 = 4;
  repeated int64 t_list_i64 = 5;
  map<int64.string> t_map = 6;
  EnumType t_enum = 7;
  TestObj t_obj = 8;
  repeated TestObj t_list_obj = 9;
  map<string, TestData> t_map_obj = 10;
}
Copy the code

Let’s not say one of them

Compile:

/ Users/fanhaodong/software/protoc - 3.17.3 - osx - x86_64 / bin/protoc \ - experimental_allow_proto3_optional \ --proto_path=/Users/fanhaodong/go/src/code.xxxxxxxxx/fanhaodong.516/tool/pb_idl/test \ --plugin=protoc-gen-go=/Users/fanhaodong/go/bin/protoc-gen-go \ --go_opt=Mtest.proto=github.com/fanhaodong.516/go-tool/pb_gen/test \ --go_out=/Users/fanhaodong/go/src \ --plugin=protoc-gen-go-grpc=/Users/fanhaodong/go/bin/protoc-gen-go-grpc \ --go-grpc_opt=Mtest.proto=github.com/fanhaodong.516/go-tool/pb_gen/test \ --go-grpc_out=/Users/fanhaodong/go/src \ test.protoCopy the code
  1. Pb serialization used core thought is varint, specific can see official articles: developers.google.com/protocol-bu…

  2. The goal of this article is to make it easy to serialize and deserialize messages!

Encoding + decoding

About the encoding logic for each type of message: github.com/protocolbuf…

A simple example

Here is a test case, and you can see the output below

func Test_Marshal_Data(t *testing.T) {
   var request = test.TestData{
      TString: "hello",
      TInt64:  520,
   }
   marshal, err := proto.Marshal(&request)
   iferr ! =nil {
      t.Fatal(err)
   }
   t.Log(hex.Dump(marshal))
   // 00000000 0a 05 68 65 6c 6c 6f 10 88 04 |.. hello... |
}
Copy the code

Message format introduction (Key encoding introduction)

In particular, how do we analyze the code

The message itself is a series of KV pairs. In PB, the key is the ID of the field and the value is the message body! But in fact, key is the field ID+ field type, each key is applicable to varint encoding, generally read pb source code will be called this tag to represent key!

Pb, provide many types it at the time of transmission for type and made a map classification, probably is the 6 classes, specific can see official articles: developers.google.com/protocol-bu… ! So you can use the three bits, says so tag = (field_number < < 3) | wire_type

The first three bits of the tag represent the type, and the rest represent the ID!

  1. Here to talk about a little skill, actually see the agreement this source code, a lot of bit operations, but in general | said set bit operation, & said get bit operation!

  2. Varint contains only 7 fields, but tag contains only 3 fields, so the remaining 4 bytes store number, i.e. 1<<4 -1 field ID; Official: article introduces developers.google.com/protocol-bu…

  3. Second, the reason why the largest field is 2^29-1 is because nuber is ui32 at most, and there are only 29 bits left for 32-3 types, so it’s 2^29-1

So the code for this concrete implementation is as follows:

type pbCodec struct {
	buffer []byte
}

func (p *pbCodec) EncodeTag(fieldNumber uint32, writeType uint8) error {
	tag := fieldNumber<<3 | uint32(writeType)
	return p.EncodeVarInt(int64(tag))
}

// Will be implemented later
func (*pbCodec) EncodeVarInt(data int64) error {
	return nil
}

func (p *pbCodec) DecodeTag(a) (fieldNumber uint32, writeType uint8, err error) {
	result, err := p.DecodeVarInt()
	iferr ! =nil {
		return 0.0, err
	}
	// select * from 1
	writeType = uint8(result&1<<3 - 1)
	// 2 removes the lower 3 bits
	fieldNumber = uint32(result >> 3)
	return
}
Copy the code

Varint coding

The wiki introduce en.wikipedia.org/wiki/Variab…

The minimum value is 1 byte, and the maximum value is int64

  1. Case1:

For example: data=15 -> 0000 1111,

Encoding logic:

Varint is 0000 1111 because it can be represented in 7 bytes! So there is no need to set up MSB!

Analytic logic:

For example, 0000 1111&1 <<7 == 0 indicates that the MSB is not set. Then, the lower 7 bits are the true data. We can ignore this operation because 8 bits are also 0!

  1. case2:

For example, data=520 -> 0000 0010 0000 1000

Encoding logic:

Data & (1<<7) -1) = 000 1000, then set MSB 1000 1000, this is the first round;

The remaining bytes in the second round are 0000 0010 0= 4, which can be put down with 7 bytes, so 0000 0100

So the final result is 1000, 1000, 0000, 0100, which is [136,4], and you’ll see that it actually outputs a small endian representation!

Analytic logic:

First, varint actually prints a small endian representation, so we need to start with the low one!

The first byte is fetched 1000 1000, MSB is found, and the result is 000 1000 = 8

Then fetch the second byte, 0000 0100, find it is not MSB, and get the result 000 0100, we need to put it after 000 1000! How to do, actually with simple 000 0100 < < 7 | 000 1000 can get the result is 000, 0100, 000, 1000 = 0000, 0010, 0000, 1000

  1. Code implementation
func (p *pbCodec) EncodeVarInt(data int64) error {
	// 1. Fetch the lower 7 bits (if 7 bytes cannot be dropped!)
	// 2. Then set the high 8-bit identifier
	// 3. Then move right
	for data > (1<<7 - 1) {
		p.buffer = append(p.buffer, byte(data&(1<<7- 1) | (1<<7)))
		data >>= 7
	}
	p.buffer = append(p.buffer, byte(data))
	return nil
}

func (p *pbCodec) DecodeVarInt(a) (int64, error) {
	var (
		x int64
		n = 0
	)
	defer func(a) {
		p.buffer = p.buffer[n:]
	}()
	for shift := uint(0); shift < 64; shift += 7 { // The offset starts at 0 and +7 each time
		if n >= len(p.buffer) {
			return 0, fmt.Errorf("not enough buffer")}// 1. Remove the first self
		// 2. Then extract the lower 7 bits
		// 3. Since the data is small endian, the retrieved data needs to move the offset
		// 4. Then set it to the original data!
		b := int64(p.buffer[n])
		n++
		x |= (b & 0x7F) << shift
		if (b & 0x80) = =0 {
			return x, nil}}return 0, fmt.Errorf("not support")}Copy the code

Non-varint encoding type

  1. Fixed type 64

Encoding: Encodes bytes from the lowest to the highest, output the lowest endian

Decoding: Reverse as opposed to encoding

Such as:

520 INT64 00 00 00 00 00 02 08 => 08 02 00 00 00 00 00 00 00 00 00 00 00 00 00Copy the code
  1. Fixed 32
  2. string

The encoding, since string is variable length, is len + content, len is varint, and content is UTF8

Field_number + field_type (proto.wirevarint) content_len (varint) contentCopy the code
  1. embed message

We need to know how big the message body is (like string), so it’s len+payload, and len is also varint.

Field_number + field_type (proto.wirevarint) payload_len (Varint encoding) payloadCopy the code

Zigzag encoding

In fact, we can find that the high bit of varint is set to MBS, and then we know that the highest bit of the negative number must be 1 (because of the inverse code), and then we can find a problem, the varint of the negative number can be very large, at least 10 bytes, so pb proposed sint32, sint64 codes, to solve this problem. The core is zigZag coding

For example, this code:

func Test_Marshal_Data(t *testing.T) {
	t.Run("number".func(t *testing.T) {
		var request = test.TestData{
			TInt64: 520,
		}
		marshal, err := proto.Marshal(&request)
		iferr ! =nil {
			t.Fatal(err)
		}
		t.Log(hex.Dump(marshal))
		t.Logf("len: %d\n".len(marshal))
	})
	t.Run("negative number".func(t *testing.T) {
		var request = test.TestData{
			TInt64: - 520.,
		}
		marshal, err := proto.Marshal(&request)
		iferr ! =nil {
			t.Fatal(err)
		}
		t.Log(hex.Dump(marshal))
		t.Logf("len: %d\n".len(Marshal))})} == output, you can find a large gap!! === RUN Test_Marshal_Data === RUN Test_Marshal_Data/number demo_test.go:45: 00000000  10 88 04|... | demo_test.go:46: len: 3
=== RUN   Test_Marshal_Data/negative_number
    demo_test.go:56: 00000000  10 f8 fb ff ff ff ff ff  ff ff 01|... | demo_test.go:57: len: 11
Copy the code

It’s actually quite simple

And we can see that this is actually the unsigned half for the positive and half for the negative!

For example, 32-bit: (n << 1) ^ (n >> 31)

For example, 64-bit: (n << 1) ^ (n >> 63)

Unique or: The value is 0 for the same and 1 for different

Such as

#- 11111 1111 1111 1111 1111 1111 1111 1111 1111
# d1=uint32(n) << 11111 1111 1111 1111 1111 1111 1111 1111 1110#D2 =uint32(N >> 31) (If negative numbers are moved to the left, 1 is added)1111 1111 1111 1111 1111 1111 1111 1111 1111# d1 ^ d20000 0000 0000 0000 0000 0001

# 10000 0000 0000 0000 0000 0001#n<<10000 0000 0000 0000 0000 10#n>>310000 0000 0000 0000 0000 0000#The output0000 0000 0000 0000 0000 10Copy the code

Code implementation (the 32-bit implementation may be important to note here)

// EncodeZigZag64 does zig-zag encoding to convert the given
// signed 64-bit integer into a form that can be expressed
// efficiently as a varint, even for negative values.
func EncodeZigZag64(v int64) uint64 {
	return (uint64(v) << 1) ^ uint64(v>>63)}// EncodeZigZag32 does zig-zag encoding to convert the given
// signed 32-bit integer into a form that can be expressed
// efficiently as a varint, even for negative values.
func EncodeZigZag32(v int32) uint64 {
	return uint64((uint32(v) << 1) ^ uint32((v >> 31)))}// DecodeZigZag32 decodes a signed 32-bit integer from the given
// zig-zag encoded value.
func DecodeZigZag32(v uint64) int32 {
	return int32((uint32(v) >> 1) ^ uint32((int32(v&1) < <31) > >31))}// DecodeZigZag64 decodes a signed 64-bit integer from the given
// zig-zag encoded value.
func DecodeZigZag64(v uint64) int64 {
	return int64((v >> 1) ^ uint64((int64(v&1) < <63) > >63))}Copy the code

Repeated (list)

Protbuf provides a repeated keyword to provide a list of types.

There are two specific encoding implementations of repeated

  • Packed (Pb3 default, PB2 v2.1.0 introduced)
  • unpacked

The official document: developers.google.com/protocol-bu…

packed

You can use the demo provided on the official website as an example:

message Test4 {
  repeated int32 d = 4 [packed=true];
}
Copy the code

Coded output

22        // key (field number 4, wire type 2)
06        // payload size (6 bytes)
03        // first element (varint 3)
8E 02     // second element (varint 270)
9E A7 05  // third element (varint 86942)
Copy the code

In fact, this is our normal idea, ID-type-data! However, there are few supported types, only the following types are supported!

	WireVarint     = 0
	WireFixed32    = 5
	WireFixed64    = 1
Copy the code

unpacked

Again, the above example, if unpacked

message Test5 {
  repeated int32 d = 4 [packed = false];
}
Copy the code

Output:

00000000  20 03 20 8e 02 20 9e a7  05|... . |20  // key (field number 4, wire type=proto.WireVarint)
03 // (varint 3)
20 // key (field number 4, wire type=proto.WireVarint)
8e 02 // (varint 270)
20 // key (field number 4, wire type=proto.WireVarint)
9e a7  05 // (varint 86942)
Copy the code

map

As can be seen in file desc, a MAP is a repeated KV message.

{
    "name": "TestData"."field": [{"name": "t_map"."number": 6."label": 3."type": 11."type_name": ".TestData.TMapEntry"."json_name": "tMap"},]."nested_type": [{"name": "TMapEntry"."field": [{"name": "key"."number": 1."label": 1."type": 3."json_name": "key"
                },
                {
                    "name": "value"."number": 2."label": 1."type": 9."json_name": "value"}]."options": {
                "map_entry": true}}]."enum_type": []}Copy the code

So it’s easy to code, for example

message TestMapData1 {
// You can't define TMapEntry.
  map<int64.string> t_map = 6; } ==> actually generates this code!message TestMapData2 {
  message TMapEntry {
    int64 key = 1;
    string value = 2;
  }
  repeated TMapEntry t_map = 6;
}
Copy the code

So the code can be roughly:

t_map= {1:"1".2:"2"} = >32 05 08 01 12 01 31 32  05 08 02 12 01 32= >32 // field_number=6 and write_type=proto.WireBytes
05 // entry data length=5
08 // entry data key field_number=1 and write_type=proto.WireVarint
01 // entry data key_value=varint(1)
12 // entry data value field_number=2 and write_type=proto.WireBytes
01 // entry data value len= varint(1)
31 // entry data value="1"

32  // field_number=6 and write_type=proto.WireBytes
05  // entry data length=5
08 02 12 01 32 / / same as above!
Copy the code

field order

Pb coding does not care about the coding sequence of the field, that is, encode when the order is not the same will lead to the output of data is not the same!

The key order of the map is also affected!

So most apis specify whether they support Deterministic or not, and if set to true, the result is always guaranteed to be the same, otherwise it might be different!

But you know, the actual effect is that it must be slower than not to open it, because it needs to order!

Pb protocol overview

This is a similar to BNF paradigm, specific can consult: developers.google.com/protocol-bu…

message: = (tagValue)* You can think of this as "key value" tag := (field << 3) BIT_OR wire_type, encoded as varint value := (varint|zigzag) for wire_type==0 | fixed32bit for wire_type==5 | fixed64bit for wire_type==1 2 | | delimited for wire_type = = group_start for wire_type = = 3 | This is like "open parenthesis" group_end for wire_type = = 4 This is like "close parenthesis" varint: = int32 | int64 | uint32 | uint64 | bool | enum. encoded as varints zigzag := sint32 | sint64, encoded as zig-zag varints fixed32bit := sfixed32 | fixed32 | float, encoded as 4-byte little-endian; memcpy of the equivalent C types (u? int32_t, float) fixed64bit := sfixed64 | fixed64 | double, encoded as 8-byte little-endian; memcpy of the equivalent C types (u? int64_t, double) delimited := size (message | string | bytes | packed), size encoded as varint message := valid protobuf sub-message string := valid UTF-8 string (often simply ASCII); max 2GB of bytes bytes := any sequence of 8-bit bytes; max 2GB packed := varint* | fixed32bit* | fixed64bit*, consecutive values of the type described in the protocol definition varint encoding: Sets MSB of 8-bit byte to indicate "No more bytes" zigzag encoding: SINt32 and SINt64 types use zigzag encoding.Copy the code

Protoc command description

Here is a description of the protoc architecture diagram, generate architecture diagram

The first node protoc is actually written in c++, github.com/protocolbuf…

The second node is the protoc output binary CodeGeneratorRequest, see github.com/protocolbuf…

The third node is the business logic, the back end, we can according to the request to output products to the PB file

The fourth node outputs the CodeGeneratorResponse

At present, although there are a lot of projects in the resolution of protoc AST, are their own implementation of lexical parsing and syntax parsing, so if you can put protoc encapsulation lib libraries directly use go/ Java docking on the line!

Pb some other terms

  • Package, the same message and enumeration and enumeration fields cannot exist under the same package
  • Include (import), relative pata based on include_path as the root directory
  • Options, which can be interpreted as annotations, can mark fields, messages, enumerations, methods!

Refer to the article

  • Protobuf coding introduction
  • Varint/Zigzag protobuf encoding