Serialization and deserialization

Definitions

Serialization: The process of converting data structures or objects into a byte sequence
Deserialization: The process of converting the byte sequence produced by serialization back into data structures or objects

Reasons for serialization:

To persist objects by saving their byte sequence to a local file or database;
To send and receive objects over the network as a byte stream;
To pass objects between processes.

Typical C/S serialization and deserialization

Interface Description Language (IDL): The parties involved in communication need to agree on the content they exchange. These agreements are described in a language that is independent of any specific programming language or platform, called the Interface Description Language (IDL).
IDL Compiler: Converts IDL files into language-specific code (libraries) for each target language.
Stub/Skeleton Lib: The working code responsible for serialization and deserialization. The Stub is deployed on the client side of a distributed system. On the one hand, it receives parameters from the application layer, serializes them, and sends them to the server through the underlying protocol stack. On the other hand, it receives the serialized result data from the server, deserializes it, and delivers it to the client application layer. The Skeleton is deployed on the server side and does the opposite: it receives serialized parameters from the transport layer, deserializes them, and passes them to the server application layer; the application layer's results are then serialized and sent back to the client Stub.

Note: In general, the serializing and deserializing sides need to share the same IDL files

Typical serialization solution

Examples are XML, JSON, Protobuf, Thrift, and Avro

JSON

Note: Parsing JSON is similar to parsing XML, so JSON is used as the example here

JSON data types:

String
Number: integer and floating-point values
Boolean: true or false
null

JSON structures:

Key/value pairs, analogous to an object, struct, dictionary, or hash table in other languages
Ordered lists of values, analogous to an array, vector, list, or sequence in other languages

JSON example:

{
    "key1": 1,
    "key2": ["value2"]
}

For efficiency's sake, streaming is almost the only option: the parser simply scans the JSON string from the beginning to parse out the entire data structure.

Parsing steps:

Step 1: Lexical analysis, which splits the character stream into tokens. For example, the JSON string {"name": "Mary", "age": 18} yields the token stream: { "name" : "Mary" , "age" : 18 }
Step 2: Build the JSON object/array from the token stream

Token stream

token	meaning
NULL	null
NUMBER	number
STRING	string
BOOLEAN	true/false
SEP_COLON	:
SEP_COMMA	,
BEGIN_OBJECT	{
END_OBJECT	}
BEGIN_ARRAY	[
END_ARRAY	]
END_DOCUMENT	JSON Document End
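The token table above can be turned into a small lexer. The following is a minimal Python sketch rather than a complete JSON lexer: the token names come from the table, while the regular expressions and the `tokenize` function are assumptions made for illustration (for example, string escapes are matched but not decoded).

```python
import re

# Token names follow the table above; the patterns are illustrative assumptions.
TOKEN_SPEC = [
    ("BEGIN_OBJECT", r"\{"),
    ("END_OBJECT", r"\}"),
    ("BEGIN_ARRAY", r"\["),
    ("END_ARRAY", r"\]"),
    ("SEP_COLON", r":"),
    ("SEP_COMMA", r","),
    ("STRING", r'"(?:[^"\\]|\\.)*"'),
    ("NUMBER", r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("BOOLEAN", r"true|false"),
    ("NULL", r"null"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, lexeme) pairs, ending with an END_DOCUMENT token."""
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup != "WS":  # whitespace separates tokens but is not one
            yield match.lastgroup, match.group()
    yield "END_DOCUMENT", ""


tokens = list(tokenize('{"name": "Mary", "age": 18}'))
```

Here `tokens` starts with `("BEGIN_OBJECT", "{")` and ends with `("END_DOCUMENT", "")`; string lexemes keep their surrounding quotes.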

JSON state machine

The JSON parser is essentially a state machine.

The state machine decides what to expect next from the character it has just read:

'{': expects a JSON object;
':': expects the value of a key in a JSON object;
',': expects the next key/value pair of a JSON object, or the next element of a JSON array;
'[': expects a JSON array;
't': expects true;
'f': expects false;
'n': expects null;
'"': expects a string;
0 to 9: expects a number.
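One way to realize this kind of dispatch is a small recursive-descent parser that consumes a token stream and decides what to expect from the kind of token it has just read. The sketch below is an illustration under assumptions, not a production parser: the `parse` function and its helpers are hypothetical names, tokens are `(kind, lexeme)` pairs using the token names from the table earlier, and string escapes are not decoded.

```python
def parse(tokens):
    """Build Python values from a list of (kind, lexeme) tokens."""
    pos = 0

    def take(expected=None):
        nonlocal pos
        kind, lexeme = tokens[pos]
        if expected is not None and kind != expected:
            raise ValueError(f"expected {expected}, got {kind}")
        pos += 1
        return kind, lexeme

    def parse_members(end_kind, parse_one):
        """Parse comma-separated members until end_kind has been consumed."""
        items = []
        if tokens[pos][0] == end_kind:  # empty object or array
            take()
            return items
        while True:
            items.append(parse_one())
            kind, _ = take()
            if kind == end_kind:
                return items
            if kind != "SEP_COMMA":
                raise ValueError(f"expected SEP_COMMA or {end_kind}, got {kind}")

    def parse_pair():
        _, key = take("STRING")
        take("SEP_COLON")
        return key[1:-1], parse_value()

    def parse_value():
        kind, lexeme = take()
        if kind == "BEGIN_OBJECT":
            return dict(parse_members("END_OBJECT", parse_pair))
        if kind == "BEGIN_ARRAY":
            return parse_members("END_ARRAY", parse_value)
        if kind == "STRING":
            return lexeme[1:-1]  # escape sequences are not decoded in this sketch
        if kind == "NUMBER":
            return float(lexeme) if any(c in lexeme for c in ".eE") else int(lexeme)
        if kind == "BOOLEAN":
            return lexeme == "true"
        if kind == "NULL":
            return None
        raise ValueError(f"unexpected token {kind}")

    result = parse_value()
    take("END_DOCUMENT")
    return result


# Token stream for {"name": "Mary", "age": 18}
example = [
    ("BEGIN_OBJECT", "{"), ("STRING", '"name"'), ("SEP_COLON", ":"),
    ("STRING", '"Mary"'), ("SEP_COMMA", ","), ("STRING", '"age"'),
    ("SEP_COLON", ":"), ("NUMBER", "18"), ("END_OBJECT", "}"),
    ("END_DOCUMENT", ""),
]
person = parse(example)
```

Each branch of `parse_value` corresponds to one expectation in the list above: an opening brace leads to an object, an opening bracket to an array, and so on.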

Protobuf

Official website: developers.google.com/protocol-bu…

Example:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

T-L-V data storage mode

Definition: Tag-Length-Value, i.e. each field is stored as an identifier, a length (optional), and the field value

Advantages:

No delimiters are needed to separate fields
Storage is compact
A field that has not been set is not encoded at all; when decoding, it is given its default value

Analytical principle

Protocol Buffer encodes each field in the message and stores the data in T-L-V mode, producing a binary byte stream
Protocol Buffer uses different serialization methods for different data types.

Note: When data is Varint-encoded, no Length is stored, so such fields are in fact stored as T-V

Varint coding

Definition: a variable-length encoding for integers

Encoding steps:

Take the lowest 7 bits of the number

- If these are the last remaining bits, set the highest bit to 0 to form one byte
- Otherwise, set the highest bit to 1 to form one byte

Shift the number right by 7 bits and repeat, taking the next lowest 7 bits each time, until no bits remain
Concatenate the bytes formed above in the order they were produced

When decoding, bytes are read until one whose highest bit is 0 is found; that byte is the last byte of the Varint

Effect: The smaller the value, the fewer bytes it takes

Eg: Any number less than 128 fits in 1 byte, whereas a fixed-width int32 always takes 4 bytes.
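The encoding and decoding steps above can be sketched in Python as follows. The function names `encode_varint` and `decode_varint` are illustrative, and the sketch handles non-negative integers only.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Varint, lowest 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: highest bit set to 1
        else:
            out.append(low7)         # last byte: highest bit left at 0
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0):
    """Decode one Varint starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # highest bit 0 marks the last byte
            return result, offset
        shift += 7


encoded_296 = encode_varint(296)  # two bytes: 0xA8 0x02
encoded_104 = encode_varint(104)  # one byte: 0x68
```

Returning the next offset from `decode_varint` makes it easy to keep reading further fields from the same buffer.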

Illustration: encoding of 296 (left) and 104 (right), followed by decoding.

Disadvantage: Varint treats a negative number as a very large unsigned integer (its highest bit is 1), so it takes the maximum number of bytes

Solution: Protocol Buffer defines the sint32 / sint64 types for values that may be negative. They are first Zigzag-encoded (converting the signed number to an unsigned one) and then Varint-encoded, which reduces the number of encoded bytes

Zigzag encoding

Definition: a mapping from signed integers to unsigned integers, used together with Varint

Principle: Use unsigned numbers to represent signed numbers

Effect: Numbers with a small absolute value can be represented in fewer bytes

sint32 encoding:

(n << 1) ^ (n >> 31)

Shift the binary representation of n left by 1 bit (the low bit is filled with 0)
Shift the binary representation of n right by 31 bits. For a signed number this is an arithmetic right shift: a negative number (sign bit 1) is filled with 1s from the left and becomes all 1s, while a non-negative number (sign bit 0) is filled with 0s and becomes all 0s
XOR the two results

For sint64, the right shift is 63 bits instead of 31

Note: the binary form of a negative number is its two's complement: the sign bit is 1, and the remaining bits are obtained by inverting the bits of the absolute value and then adding 1 to the whole number

Decode: (n >>> 1) ^ -(n & 1)

Note: >>> is the unsigned (logical) right shift
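The two formulas can be tried out with a short Python sketch. Python integers have arbitrary precision, so the encoder masks the result to 32 bits to imitate a fixed-width sint32; the function names are illustrative assumptions.

```python
MASK32 = 0xFFFFFFFF


def zigzag_encode32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one: (n << 1) ^ (n >> 31)."""
    # Python's >> on a negative int is an arithmetic shift, as the formula requires.
    return ((n << 1) ^ (n >> 31)) & MASK32


def zigzag_decode32(z: int) -> int:
    """Inverse mapping: (z >>> 1) ^ -(z & 1)."""
    # z is non-negative here, so Python's >> behaves like an unsigned shift.
    return (z >> 1) ^ -(z & 1)


pairs = [(n, zigzag_encode32(n)) for n in (0, -1, 1, -2, 2)]
```

The list `pairs` shows how values close to zero, whether positive or negative, map to small unsigned numbers that Varint can then store in a single byte.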

Illustration: Example number -2

T-V storage

Protocol Buffer uses Varint and Zigzag encoding and stores the data in T-V mode.

Tag: identifies the message field

It stores the field's identification number (field_number) and data type (wire_type), that is:
```
Tag = (field_number << 3) | wire_type
```
field_number: The number assigned to the field in the .proto file, which identifies the field within the message

wire_type: takes a value from 0 to 5, so only three bits are required
```
enum WireType { 
      WIRETYPE_VARINT = 0, 
      WIRETYPE_FIXED64 = 1, 
      WIRETYPE_LENGTH_DELIMITED = 2, 
      WIRETYPE_START_GROUP = 3, 
      WIRETYPE_END_GROUP = 4, 
      WIRETYPE_FIXED32 = 5
   };
```
The Tag is itself Varint-encoded: it occupies one byte when the field number is at most 15, and more bytes when the field number is 16 or greater
When decoding, Protocol Buffer uses the Tag to match each value to its field in the message

eg:

message person
{
    required int32 id = 1;      // wire_type = 0, field_number = 1
    required string name = 2;   // wire_type = 2, field_number = 2
}

// If a Tag's binary value is 0001 0010:
// field_number = Tag >> 3 (shift right by 3 bits) = 0000 0010 = 2
// wire_type = lowest three bits = 010 = 2
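The Tag arithmetic can be checked with a few lines of Python. This is a small sketch; `make_tag` and `split_tag` are illustrative names, not part of any Protocol Buffer library.

```python
def make_tag(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into a Tag value."""
    return (field_number << 3) | wire_type


def split_tag(tag: int):
    """Recover (field_number, wire_type) from a Tag value."""
    return tag >> 3, tag & 0x07


name_tag = make_tag(2, 2)          # 0b00010010, the Tag from the example
field_number, wire_type = split_tag(name_tag)

# A Tag value below 128 fits in a single Varint byte.
largest_one_byte_tag = make_tag(15, 0)   # 120
first_two_byte_tag = make_tag(16, 0)     # 128
```

Because three bits are taken by the wire type, only four bits of field number fit in the first Varint byte, which is why field numbers 1 to 15 are the cheapest to encode.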

Value:

The value of the message field, encoded by Protocol Buffer using Varint or Zigzag.

Example:

message Test {
    required int32 id1 = 1;
    required int32 id2 = 2;
}

// Set id1 to 300
test.setId1(300);
// Set id2 to 296
test.setId2(296);

// Binary byte stream (as signed bytes) = [8, -84, 2, 16, -88, 2]

The coding process is as follows:
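As a rough sketch of that process, the following Python code builds the same byte stream by hand: each field is written as a Varint Tag followed by a Varint Value. The helper names are illustrative, only non-negative values are handled, and no Protocol Buffer library is involved.

```python
WIRETYPE_VARINT = 0


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Encode one field in T-V form: Tag, then the Varint value."""
    tag = (field_number << 3) | WIRETYPE_VARINT
    return encode_varint(tag) + encode_varint(value)


# id1 = 300 (field 1), id2 = 296 (field 2)
message = encode_int32_field(1, 300) + encode_int32_field(2, 296)

unsigned_view = list(message)
# Java prints bytes as signed values, which is where -84 and -88 come from.
signed_view = [b - 256 if b > 127 else b for b in message]
```

The unsigned bytes 172 and 168 are the same bit patterns as the signed values -84 and -88 shown in the example.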

Encoding of floating point numbers

Encoding 64-bit (32-bit) floating-point numbers is simple: the encoded data has a fixed size of 64 bits (8 bytes) or 32 bits (4 bytes)

Data is stored in T-V mode, as above.
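A short Python sketch of this fixed-size encoding is shown below. It relies on the wire types WIRETYPE_FIXED64 = 1 and WIRETYPE_FIXED32 = 5 from the enum above and on the little-endian byte order Protocol Buffer uses for fixed-width values; the function names are illustrative, and the single-byte Tag assumes a field number of at most 15.

```python
import struct

WIRETYPE_FIXED64 = 1
WIRETYPE_FIXED32 = 5


def encode_double_field(field_number: int, value: float) -> bytes:
    """Tag byte followed by 8 little-endian bytes (field_number <= 15 assumed)."""
    tag = (field_number << 3) | WIRETYPE_FIXED64
    return bytes([tag]) + struct.pack("<d", value)


def encode_float_field(field_number: int, value: float) -> bytes:
    """Tag byte followed by 4 little-endian bytes (field_number <= 15 assumed)."""
    tag = (field_number << 3) | WIRETYPE_FIXED32
    return bytes([tag]) + struct.pack("<f", value)


double_bytes = encode_double_field(1, 1.0)  # always 1 + 8 bytes
float_bytes = encode_float_field(2, 1.0)    # always 1 + 4 bytes
```

Unlike Varint fields, the size here does not depend on the value, so the decoder knows from the wire type alone how many bytes to read.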

Wire Type = 2

Data storage mode: T-L-V

The Tag is encoded in the same way as above

The Value can be one of three data types:

String
Nested message (Message): the Value holds the encoded fields of the nested message
Packed repeated field: the elements of a repeated field are packed together under a single Tag to avoid repeating the Tag
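The T-L-V layout for wire type 2 can be sketched in Python as follows. The helper names are illustrative; the payload is any byte string, so the same function covers a UTF-8 string, an already encoded nested message, or the packed elements of a repeated field.

```python
WIRETYPE_LENGTH_DELIMITED = 2


def encode_varint(value: int) -> bytes:
    """Varint-encode a non-negative integer."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def encode_length_delimited(field_number: int, payload: bytes) -> bytes:
    """Tag, then the payload length as a Varint, then the payload itself."""
    tag = (field_number << 3) | WIRETYPE_LENGTH_DELIMITED
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# A string field with field number 2, holding the hypothetical value "yano".
name_field = encode_length_delimited(2, "yano".encode("utf-8"))

# Packed repeated field (field number 4, chosen for illustration): one Tag, many values.
packed = encode_length_delimited(4, b"".join(encode_varint(v) for v in (3, 270)))
```

In the packed case the Tag and Length appear once, followed by the Varint values back to back.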

Summary

Application scenarios: cases where the amount of data transferred is small and the network environment is unstable, such as data storage or RPC data exchange, for example instant messaging (IM)

Note: Protobuf is not well suited to storing big data, mainly because repeating the Tag in every record is unnecessary at that scale. See Avro below for a solution

Advantages:

Serialized data is very compact, about 1/3 to 1/10 the size of the equivalent XML
Parsing is very fast, about 20 to 100 times faster than the equivalent XML
Standard IDL and IDL compiler, which is very engineer-friendly
Cross-platform and cross-language
Hard to read on the wire: an HTTP packet capture shows only raw bytes (although this is obscurity rather than encryption)
Provides a validation mechanism and is easy to extend

Disadvantages:

Not human-readable
Poor generality; mainly used for internal transmission
Not self-describing; the .proto file is needed to understand the data structure

Thrift

Website: thrift.apache.org

Thrift request response model:

Messages and Structs can be likened to headers and payloads in TCP. A Message is the transmitted metadata, and a Struct is the transmitted data payload.

Message:

Name: the name of the invoked method
Message Type: there are four types: Call, OneWay, Reply and Exception. The Type ID is what is actually transmitted. The Type IDs of the four types are as follows:
```
Call      ---> 1
Reply     ---> 2
Exception ---> 3
OneWay    ---> 4
```
Call and OneWay are used in Request, Reply and Exception are used in Response.

The meanings of the four are as follows:

- Call: Invokes a remote method and expects a response.
- OneWay: Calls a remote method without expecting a response. There are no steps 3 and 4.
- Reply: Indicates that the processing is complete and the response is returned normally.
- Exception: indicates a processing error.

Sequence ID: the sequence number, a signed four-byte integer. All outstanding requests on a transport-layer connection must have unique sequence numbers, which the client uses to match responses to requests when responses arrive out of order. The server does not need to check the sequence number and has no logical dependence on it; it simply returns it unchanged in the response. Note that the Thrift sequence number is different from the unique ID commonly used to prevent duplicate submission of non-idempotent requests.

Struct:

Example:

struct Person {
    1: required i32 age;
    2: required string name;
 }

Thrift supports multiple serialization protocols, such as Binary, Compact, and JSON.

Binary serialization

Message

A Message can be encoded in two ways:

The first: strict encoding

Binary protocol Message, strict encoding, 12+ bytes:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+
|1vvvvvvv|vvvvvvvv|unused  |00000mmm| name length                       | name                | seq id                            |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+

vvvvvvvvvvvvvvv is the version, an unsigned 15 bit number fixed to 1 (in binary: 000 0000 0000 0001). The leading bit is 1.
unused is an ignored byte.
mmm is the message type, an unsigned 3 bit integer. The 5 leading bits must be 0 as some clients (checked for java in 0.9.1) take the whole byte.
name length is the byte length of the name field, a signed 32 bit integer encoded in network (big endian) order (must be >= 0).
name is the method name, a UTF-8 encoded string.
seq id is the sequence id, a signed 32 bit integer encoded in network (big endian) order.
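The strict layout described above can be assembled with Python's `struct` module. This is a minimal sketch under assumptions: the function name `encode_strict_message_header` and the method name `ping` are hypothetical, and no Thrift library is used.

```python
import struct

CALL, REPLY, EXCEPTION, ONEWAY = 1, 2, 3, 4

# Leading bit 1 followed by the 15-bit version number 1.
VERSION_1 = 0x8001


def encode_strict_message_header(name: str, message_type: int, seq_id: int) -> bytes:
    """Build the strict Binary protocol message header, all big endian."""
    name_bytes = name.encode("utf-8")
    return (
        struct.pack(">HBB", VERSION_1, 0, message_type)  # version, unused byte, 00000mmm
        + struct.pack(">i", len(name_bytes))             # name length, signed 32 bit
        + name_bytes                                     # method name, UTF-8
        + struct.pack(">i", seq_id)                      # sequence id, signed 32 bit
    )


header = encode_strict_message_header("ping", CALL, 7)
```

With an empty method name the header is exactly 12 bytes, which matches the "12+ bytes" in the diagram.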

The second: old (non-strict) encoding

Binary protocol Message, old encoding, 9+ bytes:
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+
| name length                       | name                |00000mmm| seq id                            |
+--------+--------+--------+--------+--------+...+--------+--------+--------+--------+--------+--------+

Where name length, name, mmm, seq id are as above.

Because name length must be positive (therefore the first bit is always 0), the first bit allows the receiver to see whether the strict format or the old format is used. Therefore a server and client using the different variants of the binary protocol can transparently talk with each other. However, when strict mode is enforced, the old format is rejected.

There are four types of Message types:

Call: 1
Reply: 2
Exception: 3
Oneway: 4

Struct

Type name	IDL type name	Bytes occupied	Type ID
byte	byte	1	3
short	i16	2	6
int	i32	4	8
bool	bool	1	2
long	i64	8	10
double	double	8	4
string	string	4+N	11
[]byte	binary	4+N
list	list	1+4+N	15
set	set	1+4+N	14
map	map	1+1+4+NX+NY	13
field		1+2+X
struct	struct	N*X	12
enum
union
exception

Fixed length encoding: bool, byte, short, int, long, double are all fixed byte encoding

struct        ::= ( field-header field-value )* stop-field
field-header  ::= field-type field-id

field-type is the type of the field, a signed 8 bit integer.
field-id is the field id, a signed 16 bit integer in big endian order.
field-value is the encoded field value.
stop-field is a single byte 00000000, which marks the end of a struct.

Length prefix encoding (4+N) :

+--------+----------+
|size(4) |content(N)|
+--------+----------+

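Map encoding (1+1+4+NX+NY): 1 byte for the key type, 1 byte for the value type, 4 bytes for the number of entries N, followed by the N key/value pairs

List and set encoding (1+4+N*X): 1 byte for the element type, 4 bytes for the number of elements N, followed by the N elements

Note: All keys share one type and all values share one type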

Compact serialization

Compact serialization is similar to Binary serialization, but uses Zigzag and Varint to compress integer types. Zigzag and Varint are described in the Protobuf section.

Data example (the bytes below use the fixed-width Binary encoding): Person(age:18, name:yano)

Generated bytes: [8, 0, 1, 0, 0, 0, 18, 11, 0, 2, 0, 0, 0, 4, 121, 97, 110, 111, 0]

Explanation:

8                   // Data type is i32
0, 1                // Field ID is 1
0, 0, 0, 18         // Value of field 1 (age), 4 bytes
11                  // Data type is string
0, 2                // Field ID is 2 (name)
0, 0, 0, 4          // Length of the string name: 4 bytes
121, 97, 110, 111   // "yano": 4 ASCII codes (UTF-8 encoding)
0                   // End of the struct
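The byte sequence in this example can be reproduced by hand with Python's `struct` module, following the struct grammar and the type IDs from the Binary serialization table. The sketch below uses fixed-width big-endian integers, as in the Binary protocol; `encode_person` and `field_header` are illustrative names and no Thrift library is involved.

```python
import struct

# Type IDs taken from the table in the Binary serialization section.
T_STOP, T_I32, T_STRING = 0, 8, 11


def field_header(field_type: int, field_id: int) -> bytes:
    """1-byte field type followed by a big-endian signed 16-bit field id."""
    return struct.pack(">bh", field_type, field_id)


def encode_person(age: int, name: str) -> bytes:
    """Encode Person {1: i32 age, 2: string name} field by field."""
    name_bytes = name.encode("utf-8")
    return (
        field_header(T_I32, 1) + struct.pack(">i", age)
        + field_header(T_STRING, 2) + struct.pack(">i", len(name_bytes)) + name_bytes
        + bytes([T_STOP])  # stop-field ends the struct
    )


person_bytes = list(encode_person(18, "yano"))
```

Field names never appear on the wire; only the field IDs and type IDs do, which is why both sides need the same IDL definition.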

Avro

Official website: avro.apache.org/docs/curren…

Overview: Avro started as a subproject of Hadoop and is now an independent Apache project. Avro is a high-performance middleware based on binary data transfer. It was designed to support data-intensive applications and is suitable for remote or local storage and exchange of large-scale data.

Features:

Rich data structure types;
A fast, compact binary data format; binary serialization saves storage space and network bandwidth;
A container file for persistent data;
Remote procedure call (RPC) support;
Simple integration with dynamic languages.

Avro relies on schemas. The schema used to write the data is always available when the data is read, so values can be written without per-value overhead, which makes serialization fast and compact. Because the data is kept together with its schema, it is self-describing, which makes it convenient to use from dynamic scripting languages. When Avro data is stored in a file, its schema is stored with it, so that any program can process the file.

Data structure:

For the concept of a schema, compare the schemas used by Mongoose in Node.js.

Storage mode:

Container file structure:

Comparison and application scenarios

JSON is suitable for HTTP-based projects that have no extreme performance requirements and value easy debugging, eg: Web platforms;
Protobuf is cross-platform, fast to parse, compact when serialized, highly extensible and easy to use. It suits scenarios with small amounts of transferred data and strict latency and speed requirements, eg: real-time communication;
Avro is suitable for dynamic-language scenarios and for big data transmission and storage;
Thrift is a framework, not just a serialization solution; its advantages are broad language support and relative maturity.

Parsing performance:

Serialization space overhead: