Still use JSON? Google Protocol Buffers are faster and smaller

This article comes from the WeChat public account "Conveniently Record Technical Team"; follow it for more technical articles from the team. Please credit the source when reprinting. Author: Ding Tong boat. Original link: mp.weixin.qq.com/s/cyOHe1LS-…

Background

Some of the interactions between the Conveniently Record client and server place high demands on the size and efficiency of data transmission, and common data formats such as JSON or XML cannot meet them. We therefore decided to use Protocol Buffers, introduced by Google, to achieve efficient data transmission.

Introduction

Protocol Buffers (Protobuf) is a cross-platform, language-neutral, open source format for serializing structured data, proposed by Google. Compared with XML and JSON it is smaller, faster, and simpler. The syntax currently comes in two versions, proto2 and proto3.

The main advantage of Protocol Buffers over traditional XML and JSON is that the encoded data is smaller and encoding and decoding are faster. For custom data structures, Protobuf can use code generators to produce source files in different languages, which makes the data easy to read and write.

Suppose you now have the following JSON data:

{
	"id": 1,
	"name": "jojo",
	"email": "123@qq.com"
}

Encoded as compact JSON (no whitespace), this takes 43 bytes:

7b226964 223a312c 226e616d 65223a22 6a6f6a6f 222c2265 6d61696c 223a2231 32334071 712e636f 6d227d

With Protobuf, the encoded data is only 20 bytes:

0a046a6f 6a6f1001 1a0a3132 33407171 2e636f6d
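The byte counts above can be checked with a short Python sketch. It re-serializes the record as compact JSON (the email value is read back from the JSON hex dump) and measures the Protobuf hex dump quoted above. The Protobuf bytes are copied from the article, not produced by a Protobuf library.

```python
import json

# The record from the example; key order and compact separators are chosen
# to match the JSON hex dump quoted in the article.
record = {"id": 1, "name": "jojo", "email": "123@qq.com"}
json_bytes = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Protobuf bytes copied verbatim from the article's hex dump.
protobuf_bytes = bytes.fromhex("0a046a6f 6a6f1001 1a0a3132 33407171 2e636f6d")

print(len(json_bytes), len(protobuf_bytes))
```

For this record the Protobuf encoding is less than half the size of the JSON text, mainly because field names are replaced by one-byte tags.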

Encoding

The main reason Protobuf is so small and fast compared with text-based formats such as JSON and XML is its encoding. The article "Usage and principles of Google Protocol Buffer" gives a good analysis of the Protobuf encoding.

For example, a small int32 value can be represented in just 1 byte, because Protobuf uses Varint.

Varint

In a Varint, the highest bit of each byte indicates whether more bytes follow: 1 means the next byte also belongs to this number, and 0 means this is the last byte. The remaining 7 bits of each byte carry the value, with the least significant group first.

For example, the number 300 is represented as a Varint by the two bytes 1010 1100 0000 0010.

Image from Google Protocol Buffer usage and Principles

Note

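Note that the two 7-bit groups must be swapped during parsing, because the least significant group is stored first (little-endian order): dropping the highest bit of each byte gives 010 1100 and 000 0010, and swapping them gives 000 0010 010 1100, that is 1 0010 1100 in binary, or 300.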
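The Varint rules above can be sketched in a few lines of Python. This is a minimal illustration for non-negative integers only; the function names `encode_varint` and `decode_varint` are made up for this example and are not part of any Protobuf library.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the highest bit
        else:
            out.append(low7)         # last byte: highest bit stays 0
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one Varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # later groups are more significant
        if not byte & 0x80:               # highest bit 0: this was the last byte
            return result, pos
        shift += 7


encoded = encode_varint(300)
print(" ".join(f"{b:08b}" for b in encoded))  # 10101100 00000010
```

Decoding shifts each later group further left, which is the "swap" described in the note above.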

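However, Varint encoding is inefficient for negative numbers. Signed numbers keep the sign in the highest bit, so a negative number always has its high bits set, and a negative 32-bit value needs the full 5 bytes as a Varint no matter how small its absolute value is. For example, -1 is ff ff ff ff in two's complement, which becomes the 5-byte Varint ff ff ff ff 0f. (Protobuf's int32 and int64 types sign-extend negative values to 64 bits, so on the wire they take 10 bytes.)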

Protobuf addresses this problem well by introducing ZigZag coding.

ZigZag

Image from “Integer Compression Coding ZigZag”

The cnblogs (Blog Park) post "Integer compression encoding ZigZag" explains ZigZag encoding in detail.

ZigZag encoding orders integers by ascending absolute value and maps each signed integer to an unsigned one with h(n) = (n << 1) ^ (n >> 31) for sint32 (h(n) = (n << 1) ^ (n >> 63) for sint64), where >> is an arithmetic shift. The values 0, -1, 1, -2, ... become 0, 1, 2, 3, ..., and the result is then written as a Varint.

n	two's complement	h(n)	ZigZag (hex)
0	00 00 00 00	00 00 00 00	00
-1	ff ff ff ff	00 00 00 01	01
1	00 00 00 01	00 00 00 02	02
...	...	...	...
-64	ff ff ff c0	00 00 00 7f	7f
64	00 00 00 40	00 00 00 80	80 01
...	...	...	...

ZigZag 64 is 80 01 because h(64) = 128 no longer fits in 7 bits, so its Varint needs two bytes. The article above also explains why the encoding can be decoded unambiguously.

With ZigZag encoding, any number with a small absolute value can be represented in only a few bytes. This fixes the problem that negative numbers produce long Varints.
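The mapping h(n) and the table above can be reproduced with a small Python sketch. The names `zigzag32`, `unzigzag32` and `varint` are chosen for this example only. Python integers are unbounded, so the result is masked to 32 bits to mimic a sint32.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    if not -(1 << 31) <= n < (1 << 31):
        raise ValueError("n must fit in a signed 32-bit integer")
    # Python's >> on ints is an arithmetic shift, so n >> 31 is 0 or -1 here.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def unzigzag32(z: int) -> int:
    """Inverse of zigzag32."""
    return (z >> 1) ^ -(z & 1)


def varint(value: int) -> bytes:
    """Varint encoding of a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # 7 payload bits, continuation bit set
        value >>= 7
    out.append(value)                      # final byte, continuation bit clear
    return bytes(out)


# Reproduce the rows of the table: n, h(n), and the Varint bytes of h(n).
for n in (0, -1, 1, -64, 64):
    print(n, zigzag32(n), varint(zigzag32(n)).hex(" "))
```

The last row shows the step from one byte to two: h(-64) = 127 still fits in 7 bits, while h(64) = 128 does not.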

T-V and T-L-V

A Protobuf message is a series of serialized Tag-Value pairs. The Tag consists of the field number and the wire type, and the Value is the binary data obtained by encoding the source data.

Suppose you have a message like this:

message Person {
  int32 id = 1;
  string name = 2;
}

Here the field number of id is 1, and its wire type is the number that corresponds to int32, which is 0 (Varint). The encoded Tag for id is (field_number << 3) | wire_type = 0000 1000: the low three bits identify the wire type, and the remaining bits identify the field number.

The number for each wire type can be looked up in this table:

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	~~Start group~~	~~groups (deprecated)~~
4	~~End group~~	~~groups (deprecated)~~
5	32-bit	fixed32, sfixed32, float

Note that for length-delimited data such as strings (wire type 2, the third row above), the length is variable, so the T-V structure is not enough. A Length field is added between the Tag and the Value, which gives the T-L-V structure.
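The T-V and T-L-V layouts can be illustrated by encoding a message by hand. The Python sketch below uses made-up helper names (`tag`, `encode_int32_field`, `encode_string_field`) and only handles non-negative int32 values and strings. It first encodes the Person message of this section (id = 1, name = 2), then reproduces the 20-byte example from the introduction, whose bytes imply the field numbering name = 1, id = 2, email = 3.

```python
WIRE_VARINT = 0
WIRE_LENGTH_DELIMITED = 2


def varint(value: int) -> bytes:
    """Varint encoding of a non-negative integer."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Tag = (field_number << 3) | wire_type, itself written as a Varint."""
    return varint((field_number << 3) | wire_type)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """T-V: tag followed by the Varint value (non-negative values only here)."""
    return tag(field_number, WIRE_VARINT) + varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """T-L-V: tag, then the payload length as a Varint, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    return tag(field_number, WIRE_LENGTH_DELIMITED) + varint(len(payload)) + payload


# Person { int32 id = 1; string name = 2; } from this section.
person = encode_int32_field(1, 1) + encode_string_field(2, "jojo")
print(person.hex(" "))  # 08 01 12 04 6a 6f 6a 6f

# The 20-byte example from the introduction (name = 1, id = 2, email = 3).
intro = (encode_string_field(1, "jojo")
         + encode_int32_field(2, 1)
         + encode_string_field(3, "123@qq.com"))
print(intro.hex(), len(intro))
```

In the first output, 08 is the tag 0000 1000 derived above, and 12 04 is the tag and length that precede the four bytes of "jojo".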

Reflection mechanism

Protobuf supports reflection: it can construct a concrete Message object from its type name. Chen Shuo's article "A Google Protobuf network transport scheme that automatically reflects message types" analyzes the GPB reflection mechanism in detail and walks through the source code. This section then uses the source code of the Protobuf-objectivec version to analyze how reflection works there.

Image from “A Google Protobuf Network Transport Scheme for Automatically Reflecting Message Types”

Chen Shuo analyzes the class structure of Protobuf in detail. The key class of its reflection mechanism is the Descriptor class.

Each concrete Message type corresponds to one Descriptor object. Although we do not call its functions directly, Descriptor plays an important role as the bridge for creating a Message object of a specific type from a type name.

Chen Shuo also analyzes the concrete reflection mechanism based on the source code of the C++ version of GPB: the DescriptorPool class looks up the Descriptor object pointer by type name, and the MessageFactory class constructs the Message object from that Descriptor instance. Example code is as follows:

Message* createMessage(const std::string& typeName)
{
  Message* message = NULL;
  const Descriptor* descriptor = DescriptorPool::generated_pool()->FindMessageTypeByName(typeName);
  if (descriptor)
  {
    const Message* prototype = MessageFactory::generated_factory()->GetPrototype(descriptor);
    if (prototype)
    {
      message = prototype->New();
    }
  }
  return message;
}

Note

DescriptorPool contains all the Protobuf Message types that were linked into the program at compile time. MessageFactory can create all the Protobuf Message types that were linked into the program at compile time.
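The lookup chain described above (type name, then Descriptor, then prototype, then new instance) can be modeled with a toy Python sketch. The classes below only borrow the names from the C++ code. They are simplified stand-ins written for this illustration and do not reflect the real Protobuf API.

```python
from typing import Dict, List, Optional


class Descriptor:
    """Toy description of a message type: its name and field names."""
    def __init__(self, full_name: str, field_names: List[str]) -> None:
        self.full_name = full_name
        self.field_names = field_names


class Message:
    """Toy message whose fields live in a dict keyed by field name."""
    def __init__(self, descriptor: Descriptor) -> None:
        self.descriptor = descriptor
        self.fields = {name: None for name in descriptor.field_names}

    def new(self) -> "Message":
        """Create a fresh, empty message of the same type as this prototype."""
        return Message(self.descriptor)


class DescriptorPool:
    """Maps type names to descriptors."""
    def __init__(self) -> None:
        self._by_name: Dict[str, Descriptor] = {}

    def add(self, descriptor: Descriptor) -> None:
        self._by_name[descriptor.full_name] = descriptor

    def find_message_type_by_name(self, name: str) -> Optional[Descriptor]:
        return self._by_name.get(name)


class MessageFactory:
    """Keeps one prototype message per descriptor, created on first use."""
    def __init__(self) -> None:
        self._prototypes: Dict[str, Message] = {}

    def get_prototype(self, descriptor: Descriptor) -> Message:
        if descriptor.full_name not in self._prototypes:
            self._prototypes[descriptor.full_name] = Message(descriptor)
        return self._prototypes[descriptor.full_name]


def create_message(type_name: str, pool: DescriptorPool,
                   factory: MessageFactory) -> Optional[Message]:
    """Mirror of createMessage above: name -> descriptor -> prototype -> new()."""
    descriptor = pool.find_message_type_by_name(type_name)
    if descriptor is None:
        return None
    return factory.get_prototype(descriptor).new()


pool = DescriptorPool()
pool.add(Descriptor("Person", ["name", "id", "email"]))
factory = MessageFactory()
person = create_message("Person", pool, factory)
print(person.descriptor.full_name, sorted(person.fields))
```

As in the C++ version, an unknown type name yields no message, and every call returns a new object rather than the shared prototype.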

Protobuf-objectivec

In Objective-C (OC), suppose there is a Message defined as follows:

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

Decode binary data for this type of message:

Person *newP = [[Person alloc] initWithData:data error:nil];

This calls:

- (instancetype)initWithData:(NSData *)data error:(NSError **)errorPtr {
    return [self initWithData:data extensionRegistry:nil error:errorPtr];
}

Another constructor is called internally:

- (instancetype)initWithData:(NSData *)data
           extensionRegistry:(GPBExtensionRegistry *)extensionRegistry
                       error:(NSError **)errorPtr {
  if ((self = [self init])) {
    @try {
      [self mergeFromData:data extensionRegistry:extensionRegistry];
      // ...
    }
    @catch (NSException *exception) {
      // ...
    }
  }
  return self;
}

With the defensive code and error handling stripped away, you can see that the actual construction work is done by the mergeFromData: method:

- (void)mergeFromData:(NSData *)data extensionRegistry:(GPBExtensionRegistry *)extensionRegistry {
  GPBCodedInputStream *input = [[GPBCodedInputStream alloc] initWithData:data]; // Construct the input stream object from the passed-in data
  [self mergeFromCodedInputStream:input extensionRegistry:extensionRegistry]; // Merge the input stream into this message
  [input checkLastTagWas:0]; // Check that the last tag read was 0
  [input release];
}

This method does two main things:

It constructs a GPBCodedInputStream instance from the data passed in
It merges the input stream constructed above into the message

GPBCodedInputStream has a simple job: it caches the source data and keeps a few pieces of state, such as the size and lastTag. Its data structure is equally simple:

typedef struct GPBCodedInputStreamState {
  const uint8_t *bytes;
  size_t bufferSize;
  size_t bufferPos;

  // For parsing subsections of an input stream you can put a hard limit on
  // how much should be read. Normally the limit is the end of the stream,
  // but you can adjust it to anywhere, and if you hit it you will be at the
  // end of the stream, until you adjust the limit.
  size_t currentLimit;
  int32_t lastTag;
  NSUInteger recursionDepth;
} GPBCodedInputStreamState;

@interface GPBCodedInputStream () {
 @package
  struct GPBCodedInputStreamState state_;
  NSData *buffer_;
}
@end

The merge itself is more involved. It first gets the Descriptor of the current Message object, which stores the descriptor of the Message's source file and the descriptor of each field, and then assigns a value to each field of the Message in a loop.

A simplified GPBDescriptor looks like this:

@interface GPBDescriptor : NSObject<NSCopying>
@property(nonatomic, readonly, strong, nullable) NSArray<GPBFieldDescriptor*> *fields;
@property(nonatomic, readonly, strong, nullable) NSArray<GPBOneofDescriptor*> *oneofs; // Descriptors of the oneof groups
@property(nonatomic, readonly, assign) GPBFileDescriptor *file;
@end

Where GPBFieldDescriptor is defined as follows:

@interface GPBFieldDescriptor () {
 @package
  GPBMessageFieldDescription *description_;
  GPB_UNSAFE_UNRETAINED GPBOneofDescriptor *containingOneof_;

  SEL getSel_;
  SEL setSel_;
  SEL hasOrCountSel_;  // *Count for map<>/repeated fields, has* otherwise.
  SEL setHasSel_;
}
@end
GPBMessageFieldDescription holds the information about a field, such as its data type, field type, and field number. In addition, getSel_ and setSel_ are the selectors of the getter and setter for this field's property on the corresponding class.

A simplified version of the mergeFromCodedInputStream: method follows:

- (void)mergeFromCodedInputStream:(GPBCodedInputStream *)input
               extensionRegistry:(GPBExtensionRegistry *)extensionRegistry {
 GPBDescriptor *descriptor = [self descriptor]; // Get the Descriptor of the current Message
 GPBFileSyntax syntax = descriptor.file.syntax; // Syntax version of the .proto file (proto2/proto3)
 NSUInteger startingIndex = 0; // Current position in the field list
 NSArray *fields = descriptor->fields_; // All fields of the current Message

 // Decode loop
 for (NSUInteger i = 0; i < fields.count; ++i) {
   // Get the FieldDescriptor at the current position
   GPBFieldDescriptor *fieldDescriptor = fields[startingIndex];
   // Determine the type of the current field
   GPBFieldType fieldType = fieldDescriptor.fieldType;
   if (fieldType == GPBFieldTypeSingle) {
     // MergeSingleFieldFromCodedInputStream decodes the data of a Single field
     MergeSingleFieldFromCodedInputStream(self, fieldDescriptor, syntax, input, extensionRegistry);
     // Advance the current position by 1
     startingIndex += 1;
   } else if (fieldType == GPBFieldTypeRepeated) {
     // ...
     // Decoding of Repeated fields
   } else {
     // ...
     // Decoding of other field types
   }
 }  // for (i < numFields)
}

As you can see, the descriptor here is obtained directly from a method on the Message object, rather than constructed through a factory:

GPBDescriptor *descriptor = [self descriptor];

// Definition of the 'descriptor' method
- (GPBDescriptor *)descriptor {
 return [[self class] descriptor];
}

The descriptor class methods here are actually implemented by subclasses of GPBMessage. For example, in the Person message structure, the descriptor method is defined as follows:

+ (GPBDescriptor *)descriptor {
 static GPBDescriptor *descriptor = nil;
 if (!descriptor) {
   static GPBMessageFieldDescription fields[] = {
     {
       .name = "name",
       .dataTypeSpecific.className = NULL,
       .number = Person_FieldNumber_Name,
       .hasIndex = 0,
       .offset = (uint32_t)offsetof(Person__storage_, name),
       .flags = GPBFieldOptional,
       .dataType = GPBDataTypeString,
     },
     // ...
     // A GPBMessageFieldDescription is defined here for every field
   };
   GPBDescriptor *localDescriptor = /* a Descriptor object constructed from fields and other parameters (elided) */;
   descriptor = localDescriptor;
 }
 return descriptor;
}

Once the Message's Descriptor is available, all the fields are traversed and decoded. Decoding calls a different function depending on the fieldType. For example, when fieldType == GPBFieldTypeSingle, the decoding function for Single fields is called:

MergeSingleFieldFromCodedInputStream(self, fieldDescriptor, syntax, input, extensionRegistry);

Internally, MergeSingleFieldFromCodedInputStream defines a series of macros that decode the data according to its data type.

#define CASE_SINGLE_POD(NAME, TYPE, FUNC_TYPE) \
    case GPBDataType##NAME: { \
      TYPE val = GPBCodedInputStreamRead##NAME(&input->state_); \
      GPBSet##FUNC_TYPE##IvarWithFieldInternal(self, field, val, syntax); \
      break; \
    }
#define CASE_SINGLE_OBJECT(NAME) \
    case GPBDataType##NAME: { \
      id val = GPBCodedInputStreamReadRetained##NAME(&input->state_); \
      GPBSetRetainedObjectIvarWithFieldInternal(self, field, val, syntax); \
      break; \
    }

      CASE_SINGLE_POD(Int32, int32_t, Int32)
      ...

#undef CASE_SINGLE_POD
#undef CASE_SINGLE_OBJECT

For int32 data, for example, this ends up calling int32_t GPBCodedInputStreamReadInt32(GPBCodedInputStreamState *state); to read the value, which is then assigned. Internally it performs Varint decoding:

int32_t GPBCodedInputStreamReadInt32(GPBCodedInputStreamState *state) {
 int32_t value = ReadRawVarint32(state);
 return value;
}

Once the data has been decoded into an int32_t, GPBSetInt32IvarWithFieldInternal is called to perform the assignment. A simplified implementation is as follows:

void GPBSetInt32IvarWithFieldInternal(GPBMessage *self,
                                     GPBFieldDescriptor *field,
                                     int32_t value,
                                     GPBFileSyntax syntax) {

 // The final assignment
 // where 'self' is an instance of 'GPBMessage'
 uint8_t *storage = (uint8_t *)self->messageStorage_;
 int32_t *typePtr = (int32_t *)&storage[field->description_->offset];
 *typePtr = value;

}

Here typePtr points to the variable being assigned: the field's slot inside the message's storage, located by the offset recorded in the field description. At this point, the assignment of the Single field is complete.
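The offset-based assignment can be mimicked in Python with a byte buffer standing in for the message storage. The layout below (a 4-byte word followed by a 4-byte int32 slot at offset 4) is a hypothetical example, as are the helper names. The little-endian format is also an assumption, since the C code simply uses the platform's native byte order.

```python
import struct

# Hypothetical storage layout: 4 bytes of bookkeeping, then the int32 `id` slot.
ID_OFFSET = 4
storage = bytearray(8)


def set_int32_at(buf: bytearray, offset: int, value: int) -> None:
    """Write a little-endian int32 into the buffer at the given byte offset."""
    struct.pack_into("<i", buf, offset, value)


def get_int32_at(buf: bytearray, offset: int) -> int:
    """Read the little-endian int32 stored at the given byte offset."""
    return struct.unpack_from("<i", buf, offset)[0]


set_int32_at(storage, ID_OFFSET, 300)
print(storage.hex(" "), get_int32_at(storage, ID_OFFSET))
```

Only the four bytes at the recorded offset change, which is why a single generic setter can serve every int32 field once the descriptor supplies the offset.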

To summarize, in the Protobuf-objectivec version, building a Message object through reflection roughly works as follows:

A concrete subclass of Message constructs its Descriptor, which contains all of the field descriptors
A loop then assigns a value to each field of the current Message object through its FieldDescriptor
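The two steps above can be sketched end to end in Python. This is a toy model, not the Objective-C implementation: the "descriptor" is a plain table from field number to attribute name and kind, the loop is driven by the tags read from the data, and only int32 and string fields are handled. All names are made up for this example.

```python
from typing import Dict, Tuple


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one Varint at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Toy descriptor for Person: field number -> (attribute name, kind).
PERSON_FIELDS: Dict[int, Tuple[str, str]] = {
    1: ("name", "string"),
    2: ("id", "int32"),
    3: ("email", "string"),
}


class Person:
    def __init__(self) -> None:
        self.name = ""
        self.id = 0
        self.email = ""


def merge_from_bytes(message, fields: Dict[int, Tuple[str, str]], data: bytes):
    """Read tag after tag and assign each decoded value through the descriptor."""
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if field_number not in fields:
            raise ValueError(f"unknown field {field_number} (not handled in this sketch)")
        name, kind = fields[field_number]
        if kind == "int32" and wire_type == 0:
            raw, pos = read_varint(data, pos)
            raw &= 0xFFFFFFFF                      # keep the low 32 bits
            value = raw - (1 << 32) if raw >= (1 << 31) else raw
        elif kind == "string" and wire_type == 2:
            length, pos = read_varint(data, pos)   # T-L-V: length comes first
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unexpected wire type {wire_type} for field {name}")
        setattr(message, name, value)              # assignment via the descriptor
    return message


data = bytes.fromhex("0a046a6f 6a6f1001 1a0a3132 33407171 2e636f6d")
person = merge_from_bytes(Person(), PERSON_FIELDS, data)
print(person.name, person.id, person.email)
```

Feeding it the 20 bytes from the introduction recovers the original record, with the descriptor table playing the role of the field descriptors.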

Links

Google Protocol Buffers Docs
A Google Protobuf network transport scheme that automatically reflects message types
Integer compression encoding ZigZag
Usage and principles of Google Protocol Buffer