As a programmer, I have a special liking for building wheels. Almost all programmers harbor a dream of implementing every open source technology themselves. So starting with this article, I will begin a series on building wheels.

Preface

First of all, take a look at this. You’ve probably seen resumes like this a lot, right?

  1. Proficient in Java, Python, and C++
  2. Proficient in Redis, Memcached, MySQL
  3. Proficient in Nginx configuration and module development
  4. Proficient in Kafka, ActiveMQ, and other message queues
  5. Proficient in common data structures and algorithms
  6. Proficient in network programming, multi-threaded programming, and high-performance server technology
  7. Proficient in the TCP/IP protocol stack, familiar with kernel network subsystem code
  8. Proficient in Nginx source code and module development

Each of these items involves a lot of wheels, and anyone who has truly mastered them all is the fighter jet of the coding world.
So our goal now is to become this fighter, and the way to get there is to build the wheels ourselves. The purpose of building a wheel is not to use it in a project, but to understand how the wheel is structured and to experience the process of building it firsthand.



Backend wheels

Speaking of backend wheels, everyone can name a long list; let’s roughly count them off.

  • Top of the list: LVS, F5, and HAProxy load balancing
  • Then there are HTTP servers like Nginx, Apache, and Lighttpd
  • Behind the HTTP servers are the various containers where our business logic is deployed
  • For storage, there are in-memory KV stores and caching systems like Redis and Memcached
  • For multi-machine deployment, there must be message queues such as Kafka and ActiveMQ responsible for decoupling
  • For cluster communication, the Thrift RPC framework and Protobuf serialization technology are indispensable
  • At the high end, in the distributed realm, there are even more wheels: ZooKeeper, Raft, etc.
  • There is also the big data series: Hadoop, Spark…


This article starts with our first wheel, the data serialization/deserialization technology required for server communication: Protobuf.

Base wheel: Protobuf

Before going into the basics, I have attached a picture from Geek Time of the perspectives from which a technology should be examined; this article will try to cover those perspectives as much as possible.




  • Application point of view
    • Purpose: “What is it for?”
    • Technical specification: “How to use it”
    • Best practice: “How to use it well”
    • Market trends: “Who uses it, and where?”
  • Design point of view
    • Goal: “What does it set out to do?”
    • How it works: “How does it do it?”
    • Pros and cons: “How well does it do it?”
    • Evolution trend: “What is its future?”




Main text

Protocol Buffers


Application point of view


What is it for?
Serializing data. And when do we need serialization? When data needs to be stored, or transferred over the network. Why?
Because what is actually stored or transmitted is binary data, i.e. 010101… bits.
Suppose we look at an object like:


    type myData struct {
        a int32
        b int32
    }

    data := myData{a: 1, b: 2}

Suppose we receive a byte stream over the network. In order to recover data from the byte stream, we need to do the following:



  1. Correctly identify where the data starts and ends in the byte stream
  2. Identify the value of a and the value of b
One possible byte stream protocol is:

    [ 8-bit struct ID ][ 4-byte a ][ 4-byte b ]

It starts with 8 bits indicating which structure the following data corresponds to, followed by two 4-byte values for a and b.
Note!!! For the above scheme to work, there are a few things we take for granted:
  1. We assume the byte stream starts with 8 bits identifying the data structure, in this case myData (ps: different structures get different numbers)
  2. At most 2^8 structures are supported
  3. Both communicating parties need the myData definition file


Here is an example on GitHub of the code for the scheme above.
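As a rough sketch of what that naive protocol might look like in Go (this is illustrative only, not the actual GitHub code; the struct ID 0x01 is an assumed value):

    package main

    import (
        "encoding/binary"
        "errors"
        "fmt"
    )

    const myDataID = 0x01 // assumed 8-bit struct ID for myData

    type myData struct {
        a int32
        b int32
    }

    // marshal writes 1 byte of struct ID, then a and b as big-endian 4-byte values.
    func (d myData) marshal() []byte {
        buf := make([]byte, 9)
        buf[0] = myDataID
        binary.BigEndian.PutUint32(buf[1:5], uint32(d.a))
        binary.BigEndian.PutUint32(buf[5:9], uint32(d.b))
        return buf
    }

    // unmarshal checks the struct ID and recovers a and b.
    func unmarshal(buf []byte) (myData, error) {
        if len(buf) != 9 || buf[0] != myDataID {
            return myData{}, errors.New("not a myData byte stream")
        }
        return myData{
            a: int32(binary.BigEndian.Uint32(buf[1:5])),
            b: int32(binary.BigEndian.Uint32(buf[5:9])),
        }, nil
    }

    func main() {
        data := myData{a: 1, b: 2}
        b := data.marshal()
        got, _ := unmarshal(b)
        fmt.Println(b, got) // [1 0 0 0 0 1 0 0 0 2] {1 2}
    }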


You can see how easy it is to serialize and deserialize one of our data structures in Go.


Design point of view
What does it do?
We’ve just implemented the simplest possible serialization method, so now let’s look at what a serialization protocol needs to achieve in a production environment.


  1. Versatility: language- and platform-independent
  2. High performance: fast serialization and deserialization
  3. High compression: data should be as small as possible after serialization; smaller data means less data transmitted over the network
  4. Compatibility: when data structures change, both old and new versions should still be supported


Now let’s look at how to use it from an application perspective with these goals in mind.
The official documentation is at developers.google.com/protocol-bu… , which has detailed instructions; in addition, here is a usage sample of my own: github.com/zhuanxuhit/…
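As a minimal sketch of what usage looks like, here is a hypothetical mydata.proto mirroring the myData struct from earlier (proto3 syntax; the package name is an assumption):

    syntax = "proto3";

    package example;

    // myData from the earlier example, as a protobuf message.
    // The numbers 1 and 2 are the field tags discussed below.
    message MyData {
      int32 a = 1;
      int32 b = 2;
    }

Running protoc with a language plugin (e.g. the Go plugin) then generates the serialization and deserialization code for you.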


Next comes the design perspective: how does it do it?


First, let’s look back at our own simple serialization and deserialization method: we number each structure, write in the header which structure it is, and then write the value of each field of the structure.
With simplicity, efficiency, and compatibility in mind, let’s see how Protobuf improves on this.


A brief aside: protobuf stands for Protocol Buffers, and “buffers” points to a very important aspect of its use: when we deserialize a piece of binary data, we first read it into a buffer, then identify the beginning and end of a single data structure, and only then can we deserialize it correctly.


In the previous design, we encoded the data structure ID in the header. To make the encoding more efficient and the data smaller, can we remove this header?
Of course we can. Our encoding then shrinks to just the field values, but with one premise: we must already know which data structure the binary data corresponds to!!!


Now that we’ve removed the structure description, how can we make the data smaller? Take int64 as an example: values like 1 or 1<<32 do not need the full 8 bytes to be represented. So we could, for example, encode the data type first, then the number of valid bytes that follow, then the actual data.
How many bits does each part need?


  • Data type: encoded according to the number of supported types; if 16 types are supported in total, 4 bits suffice
  • Number of valid bytes that follow: this is tricky, since we don’t know the data size in advance, we cannot represent it with a fixed number of bits.
The solution: drop the valid-byte-count field entirely and encode “is there more data?” into the data itself, as shown below:
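This is exactly the varint (variable-length integer) scheme protobuf uses: the highest bit of each byte marks whether another byte follows, and the lower 7 bits carry the payload. A minimal sketch in Go:

    // encodeVarint appends x in base-128 varint form: each byte carries
    // 7 bits of payload, and the high bit means "more bytes follow".
    func encodeVarint(buf []byte, x uint64) []byte {
        for x >= 0x80 {
            buf = append(buf, byte(x)|0x80)
            x >>= 7
        }
        return append(buf, byte(x))
    }

    // decodeVarint reads one varint from buf, returning the value and
    // the number of bytes consumed (n == 0 means truncated input).
    func decodeVarint(buf []byte) (x uint64, n int) {
        var shift uint
        for i, b := range buf {
            x |= uint64(b&0x7f) << shift
            if b < 0x80 {
                return x, i + 1
            }
            shift += 7
        }
        return 0, 0
    }

For example, encodeVarint(nil, 300) yields the two bytes AC 02, the example used in the official docs.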



Once the encoding of an int field is solved, what about a string? Here, too, we write the data type first, followed by the number of valid bytes, which is itself an int and can be varint-encoded as above, followed by the actual data.
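A sketch of that length-delimited encoding, reusing encodeVarint from above (this mirrors protobuf’s actual wire format for strings and bytes):

    // encodeString appends a length-delimited value: a varint length
    // followed by the raw bytes.
    func encodeString(buf []byte, s string) []byte {
        buf = encodeVarint(buf, uint64(len(s)))
        return append(buf, s...)
    }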
The above is the main idea of the encoding protobuf uses; for details, see developers.google.com/protocol-bu… .


The wire types currently supported by protobuf:

  • 0: Varint (int32, int64, uint32, uint64, sint32, sint64, bool, enum)
  • 1: 64-bit (fixed64, sfixed64, double)
  • 2: Length-delimited (string, bytes, embedded messages, packed repeated fields)
  • 3: Start group (deprecated)
  • 4: End group (deprecated)
  • 5: 32-bit (fixed32, sfixed32, float)
Here’s a subtlety: signed numbers need separate handling. In two’s-complement representation, a negative number has all of its high bits set, so even -1 would occupy the maximum number of varint bytes. Protobuf therefore uses ZigZag encoding to convert signed numbers to unsigned ones first.


The principle is very simple; the following mapping is used:

  • 0 → 0
  • -1 → 1
  • 1 → 2
  • -2 → 3
  • 2147483647 → 4294967294
  • -2147483648 → 4294967295

That is, sint32 is encoded as (n << 1) ^ (n >> 31) and sint64 as (n << 1) ^ (n >> 63), so values of small magnitude stay small.
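A minimal Go sketch of the mapping and its inverse:

    // zigzag64 interleaves negative and non-negative values:
    // 0→0, -1→1, 1→2, -2→3, ...
    func zigzag64(n int64) uint64 {
        return uint64(n<<1) ^ uint64(n>>63)
    }

    // unzigzag64 is the inverse mapping.
    func unzigzag64(u uint64) int64 {
        return int64(u>>1) ^ -int64(u&1)
    }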






With coding out of the way, we come to the final issue: compatibility.


What if we change the data structure: add or delete fields?


For protobuf, the design is as follows: each field is encoded as a tag-value pair, where the tag packs the field number together with the wire type: tag = (field_number << 3) | wire_type. During decoding, a field number that is not in our proto file can simply be skipped, which is what makes adding and deleting fields safe.


Image from: Efficient data compression encoding Protobuf


If field_number > 15, the tag is encoded using more than 1 byte. For fields we use frequently, it is recommended to keep the field number between 1 and 15, so the tag fits in a single byte.
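A minimal sketch of the tag computation, reusing encodeVarint from above:

    // encodeTag appends a field's tag: (field_number << 3) | wire_type,
    // itself stored as a varint. Field numbers 1-15 fit in a single byte.
    func encodeTag(buf []byte, fieldNum int, wireType byte) []byte {
        return encodeVarint(buf, uint64(fieldNum)<<3|uint64(wireType))
    }

For example, field number 1 with wire type 0 (varint) and value 150 encodes to the bytes 08 96 01, the classic example from the official docs.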




Summary of implementation principles
Let’s summarize the implementation principles described above:


Efficient: variable-length encoding, and no self-describing information
Compatible: each field is identified by an encoded number (tag)


Next, let’s take a look at protobuf best practices and market trends from the application perspective.
Protobuf was originally designed to solve interface compatibility problems and is currently used mainly for RPC calls and data transfer between internal services. The most common setup uses protobuf as the serialization format of the gRPC framework, which we will cover later.


Finally, let’s look at protobuf’s strengths, weaknesses, and evolution trends from a design perspective.


Advantages
The biggest advantage of Protobuf is compatibility: already-deployed services that use the old data format can continue to work even after the interface is upgraded. Then there is performance: serialization and deserialization are, of course, fast.


Disadvantages
Compared with JSON, readability is poor: especially in the debugging phase, we cannot directly inspect the input and output the way we can with JSON.




Finally
To summarize this article:
  1. Protobuf is designed primarily to address compatibility issues: each field is given a number, and fields that no longer exist are simply ignored during decoding.
  2. To achieve high performance, Protobuf uses tag-value (tag-(length)-value) encoding, which makes serialized data more compact
  3. Also for high performance, Protobuf discards self-describing information: with only the data and no proto file, we cannot deserialize it
  4. Protobuf provides a set of compiler tools that generate serialization and deserialization code in different languages, greatly improving ease of use


Coming next
In the next article we will cover gRPC and see how Protobuf is used in RPC frameworks.