Heads-up: this article is about 4,000 words long!

Background

Is JSON/XML bad?

Well, no serialization scheme is as popular, free, convenient, expressive, and cross-platform as JSON and XML; they are the default choice for general-purpose data transfer. However, as data volumes and performance requirements grow, the performance cost of all that freedom and generality can no longer be ignored.

JSON and XML represent all data as text. For non-character data, literal representations take up considerable extra space, and the overhead depends on the magnitude and precision of the numbers. The 32-bit floating-point number 1234.5678 occupies 4 bytes in memory, 9 bytes when stored as UTF-8 text, and 18 bytes in environments such as JS where strings are UTF-16. Text parsing is also inefficient for non-character data: considerable computation goes into parsing the data structure out of the string and converting literals back into their corresponding data types.
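The byte counts above can be checked directly in JS (a quick sketch; `TextEncoder` is globally available in modern browsers and in Node):

```javascript
const value = 1234.5678

// In memory as a 32-bit float: 4 bytes
const binarySize = Float32Array.BYTES_PER_ELEMENT

// As UTF-8 text ("1234.5678" is 9 ASCII characters): 9 bytes
const utf8Size = new TextEncoder().encode(String(value)).length

// As a JS (UTF-16) string: 2 bytes per character -> 18 bytes
const utf16Size = String(value).length * 2

console.log(binarySize, utf8Size, utf16Size) // 4 9 18
```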

With massive data, the format itself can become the I/O and computing bottleneck of the whole system, or even cause memory overflow outright.

What’s out there besides JSON/XML?

Serialization schemes can be divided by storage format into string-based and binary. String formats are human-readable, but because of the problems above, only binary formats are considered here. Binary formats can be further split by whether they require an IDL, or equivalently whether they are self-describing (that is, whether deserialization needs a schema).

Workflow when an IDL is required:

  • Write the schema using the IDL syntax defined by the solution
  • Compile the schema into code (classes or modules) in both producer and consumer languages using the compiler provided by the solution
  • The data producer references this code, builds the data based on its interface, and serializes it
  • The consumer references this code and reads the data according to its interface

Workflow when no IDL is needed:

  • Producers and consumers agree on data structures through documentation
  • Producer serialization
  • Consumer deserialization

The main options:

  • Protocol Buffers

    • The wire format used by gRPC; binary storage; requires an IDL; not self-describing
    • High compression rate and strong expressiveness; widely used in Google products


  • FlatBuffers

    • Another serialization solution from Google; binary storage; requires an IDL; not self-describing (its self-describing variant is not cross-platform)
    • High performance and small size; supports string, number, and Boolean types


  • Avro

    • Hadoop's serialization scheme, combining the advantages of binary and string schemes; only serialization requires an IDL; self-describing
    • However, its use cases are limited and it has no mature JS implementation, so it is not suitable for the Web environment and is excluded from this comparison


  • Thrift

    • Facebook's solution; binary storage; requires an IDL; not self-describing
    • It is basically only used as part of RPC integrations, so it is excluded from this comparison


  • DIMBIN

    • A serialization scheme designed for multi-dimensional arrays; binary storage; no IDL required; self-describing
    • High performance and small size; supports string, number, and Boolean types


Optimization

Space optimization principle

Using numeric types instead of literals to hold values already saves a considerable amount of space. To push compression further, Protocol Buffers encodes integers as varints. (The tests below show, however, that this brings no benefit in environments where gzip is available.)
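The idea behind varint can be sketched in a few lines (an illustrative unsigned-integer encoder, not the actual Protocol Buffers implementation):

```javascript
// Encode an unsigned integer as a varint: 7 bits per byte,
// high bit set on every byte except the last.
function encodeVarint(n) {
	const bytes = []
	while (n > 0x7f) {
		bytes.push((n & 0x7f) | 0x80) // low 7 bits + continuation flag
		n >>>= 7
	}
	bytes.push(n) // last byte, continuation flag clear
	return Uint8Array.from(bytes)
}

console.log(encodeVarint(1).length) // 1 -- small values shrink to a single byte
console.log(Array.from(encodeVarint(300))) // [ 172, 2 ]
console.log(encodeVarint(1e9).length) // 5 -- large values can exceed the fixed 4 bytes
```

This is why varint helps for data dominated by small integers but can backfire on large values, and why gzip, which compresses repeated byte patterns anyway, can erase its advantage.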

Time optimization principle

A binary format records the data structure and the offset of each node's data at known locations. This saves the time spent parsing the structure out of a string, avoids the performance problems caused by very long strings, and greatly reduces the intermediate garbage generated in GC languages.

In environments that allow direct memory access (including JS), data can also be read directly at its memory offset, avoiding copy operations and extra allocations. Both DIMBIN and FlatBuffers use this idea to optimize data-access performance. In JS, the cost of creating a DataView or TypedArray to pull data out of a memory segment is negligible.
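In JS this looks like the following (a minimal sketch of offset-based reading; the writer side stands in for data that arrived over the network):

```javascript
// Producer side: write two values at fixed offsets in a buffer.
const buffer = new ArrayBuffer(8)
const writer = new DataView(buffer)
writer.setFloat32(0, 1234.5678, true) // little-endian float at offset 0
writer.setInt32(4, 42, true) // little-endian int at offset 4

// Consumer side: wrap the same memory and read at known offsets,
// with no copy and no text parsing.
const view = new DataView(buffer)
const f = view.getFloat32(0, true)
const i = view.getInt32(4, true)

// A TypedArray view over a sub-range is equally zero-copy.
const floats = new Float32Array(buffer, 0, 1)
console.log(i, floats[0] === f) // 42 true
```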

Storing strings in a binary scheme requires extra UTF-8 encoding and decoding logic, so on string-heavy data, binary formats are neither faster nor smaller than string formats such as JSON.
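For example, every string in a binary payload must round-trip through UTF-8 (a sketch using the standard `TextEncoder`/`TextDecoder` APIs):

```javascript
const encoder = new TextEncoder()
const decoder = new TextDecoder('utf-8')

// JS string (UTF-16) -> UTF-8 bytes on serialization...
const bytes = encoder.encode('héllo') // 6 bytes: 'é' takes 2 bytes in UTF-8

// ...and UTF-8 bytes -> JS string on deserialization.
const text = decoder.decode(bytes)

console.log(bytes.length, text) // 6 héllo
```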

What is DIMBIN?

Our data visualization scenarios often involve real-time updates of millions or even tens of millions of records. To solve the performance problems of JSON, we developed DIMBIN, a serialization scheme built on the idea of direct memory-offset access, and designed many Web data-transfer formats on top of it.

As a straightforward optimization, DIMBIN has become our standard solution for data transfer, staying absolutely simple and efficient.

We have now open-sourced DIMBIN to the community, hoping to bring the Web a lighter, faster, and friendlier alternative to JSON, Protocol Buffers, and FlatBuffers.

Scheme comparison

For use in the Web/JS environment, we compare JSON, Protocol Buffers, FlatBuffers, and DIMBIN across seven aspects.

Engineering

Protocol Buffers and FlatBuffers represent the complete workflow pioneered by Google: strict, standardized, unified, IDL-centric, designed for multi-endpoint collaboration, with Python/Java/C++ as first-class targets. Code is generated from the IDL, giving a consistent development process across platforms and languages. If your team adopts this workflow, the solution is easier to manage, and multi-endpoint collaboration and interface changes are better controlled.

Outside of that engineering structure, however, it feels relatively cumbersome.

JSON/XML and DIMBIN are neutral: they require no IDL and neither dictate nor limit engineering practices or technology choices. You can specify interfaces through documentation alone, or layer your own schema constraints on top.

Deployment/coding complexity

Protocol Buffers and FlatBuffers must be brought in early in project design, as a key link in the workflow. If they are added later purely as a performance optimization, the impact on the project architecture is considerable.

JSON is basically infrastructure for all platforms, with no deployment costs.

DIMBIN requires installing only one package, but it requires the data structure to be flattened; it brings no benefit if the data cannot be flattened.

When used in JS:

  • Serialization and deserialization with JSON is usually under 5 lines of code
  • About 10 lines with DIMBIN
  • With Protocol Buffers, you need to write a separate schema (.proto) file and pull in hundreds of lines of generated code; serialization and deserialization manipulate each node through an object-oriented interface (every node in the data structure is an object).
  • FlatBuffers likewise needs a separate schema (.fbs) file and hundreds of lines of generated code. Serialization pushes every node through a state-machine-style interface, manually converting and writing each node's data, which makes for a fairly tedious authoring experience; deserialization reads each node back through an object-style interface.

Performance (JS environment)

The Protocol Buffers website claims better performance than JSON, but its test data clearly does not come from the JS side. Our tests show that in JS it is slower than JSON (much slower when the data is large).

All binary schemes handle strings in roughly the same way: in JS, UTF-16 is decoded to code points, re-encoded as UTF-8, written into the buffer, and the address of each string is recorded. This process is expensive, and without varint (Protocol Buffers) there is no size advantage either.

When processing string data, JSON always performs best. Serialization: JSON > DIMBIN > FlatBuffers > Proto; deserialization: JSON > Proto > DIMBIN > FlatBuffers.

Flatbuffers and DIMBIN have obvious advantages in processing numerical data.

For flat numeric data, serialization: DIMBIN > FlatBuffers > JSON > Proto;

deserialization: DIMBIN > FlatBuffers > JSON > Proto, with DIMBIN ahead by orders of magnitude (up to ~100,000x).

Volume

Protocol < DIMBIN < FlatBuffers < JSON

After gzip, DIMBIN and FlatBuffers are the smallest and essentially identical in size, while Protocol Buffers loses its advantage, possibly a side effect of varint.

Expressiveness

Protocol Buffers is designed for strongly typed languages; it supports far richer types than JSON and allows complex data structures. FlatBuffers supports three basic types (number, Boolean, string) with JSON-like structures. DIMBIN supports the same three basic types but currently only multi-dimensional array structures (key-value pairs are neither supported nor encouraged); more complex structures must be built on top of it.

Degrees of freedom

JSON and DIMBIN are self-describing: no schema is required (in weakly typed languages), and users can shape data structures and types on the fly.

With Protocol Buffers and FlatBuffers, the IDL must be written and the code generated before any coding starts. Whenever the interface changes, the IDL has to be modified, the code regenerated, and the result redeployed on both the producer and the consumer.

  • The C++ and Java implementations of Protocol Buffers offer a self-describing feature that embeds the .proto file in the data, but a top-level interface still has to be compiled in to describe the "self-describing embedded data". The documentation also notes that this feature has never been used inside Google (it goes against the IDL's design principles).
  • FlatBuffers has an experimental self-describing branch (FlexBuffers), with no JS support and no documentation.

Multilingual support

Protocol Buffers and FlatBuffers both have complete server-side and client-side language support; both were developed for C++/Java (Android)/Python first. The JS side lacks some advanced features and complete documentation, so you have to study the examples and the generated code yourself; fortunately the code is not long and comment coverage is thorough.

Almost all programming languages have tools for JSON.

DIMBIN was developed and optimized for JS/TS and is currently also available in C#, with C++, WASM, Java, and Python support planned.

Use cases (testing only the JS environment)

We generate typical data in both flat and non-flat structures, implement the same functionality with JSON, DIMBIN, Protocol Buffers, and FlatBuffers, and compare the performance, size, and convenience of each scheme.

The test data

We generate two versions of the test data: non-flattened (multi-level key-value pair structure) data and equivalent flattened (multi-dimensional array) data

Because string handling is a special case, we test mixed string/numeric data, pure string data, and pure numeric data separately.

// Non-flat data
export const data = {
	items: [
		{
			position: [0, 0, 0],
			index: 0,
			info: {
				a: 'text text text... ',
				b: 10.12,
			},
		},
		// ... * 200,000
	],
}

// Equivalent flat data
export const flattedData = {
	positions: [0, 0, 0, 0, 0, 1, ...],
	indices: [0, 1, ...],
	info_a: ['text text text...', 'text...'],
	info_b: [10.12, 12.04, ...],
}
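The two shapes are equivalent; a simple pass converts one into the other (a hypothetical helper for illustration, not part of any of the libraries tested):

```javascript
// Flatten the tree-shaped data into parallel arrays (struct-of-arrays layout).
function flatten(data) {
	const flat = { positions: [], indices: [], info_a: [], info_b: [] }
	for (const item of data.items) {
		flat.positions.push(...item.position)
		flat.indices.push(item.index)
		flat.info_a.push(item.info.a)
		flat.info_b.push(item.info.b)
	}
	return flat
}

const sample = {
	items: [
		{ position: [0, 0, 0], index: 0, info: { a: 'text', b: 10.12 } },
		{ position: [0, 0, 1], index: 1, info: { a: 'text', b: 12.04 } },
	],
}
console.log(flatten(sample).indices) // [ 0, 1 ]
```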

JSON

serialization

const jsonSerialize = () => {
	return JSON.stringify(data)
}

deserialization

const jsonParse = str => {
	const _data = JSON.parse(str)

	let _read = null
	const len = _data.items.length
	for (let i = 0; i < len; i++) {
		const item = _data.items[i]
		_read = item.info.a
		_read = item.info.b
		_read = item.index
		_read = item.position
	}
}

DIMBIN

serialization

import DIMBIN from 'src/dimbin'

const dimbinSerialize = () => {
	return DIMBIN.serialize([
		new Float32Array(flattedData.positions),
		new Int32Array(flattedData.indices),
		DIMBIN.stringsSerialize(flattedData.info_a),
		new Float32Array(flattedData.info_b),
	])
}

deserialization

const dimbinParse = buffer => {
	const dim = DIMBIN.parse(buffer)

	const result = {
		positions: dim[0],
		indices: dim[1],
		info_a: DIMBIN.stringsParse(dim[2]),
		info_b: dim[3],
	}
}

DIMBIN currently only supports multidimensional arrays and cannot handle tree data structures, so comparisons are not done here.

Protocol Buffers

schema

First, write the schema using the proto3 syntax:

syntax = "proto3";

message Info {
    string a = 1;
    float b = 2;
}

message Item {
    repeated float position = 1;
    int32 index = 2;
    Info info = 3;
}

message Data {
    repeated Item items = 1;
}

message FlattedData {
    repeated float positions = 1;
    repeated int32 indices = 2;
    repeated string info_a = 3;
    repeated float info_b = 4;
}

Compile to JS

Compile the schema into JS modules using the Protoc compiler

./lib/protoc-3.8.0-osx-x86_64/bin/protoc ./src/data.proto --js_out=import_style=commonjs,binary:./src/generated

serialization

const messages = require('src/generated/src/data_pb.js')

const protoSerialize = () => {
	// Top-level node
	const pbData = new messages.Data()

	data.items.forEach(item => {
		// Leaf node
		const pbInfo = new messages.Info()
		// Write data into the node
		pbInfo.setA(item.info.a)
		pbInfo.setB(item.info.b)

		// Child node
		const pbItem = new messages.Item()
		pbItem.setInfo(pbInfo)
		pbItem.setIndex(item.index)
		pbItem.setPositionList(item.position)

		// Attach to the top-level node
		pbData.addItems(pbItem)
	})

	// Serialize
	const buffer = pbData.serializeBinary()
	return buffer

	// Flattened version:
	// const pbData = new messages.FlattedData()
	// pbData.setPositionsList(flattedData.positions)
	// pbData.setIndicesList(flattedData.indices)
	// pbData.setInfoAList(flattedData.info_a)
	// pbData.setInfoBList(flattedData.info_b)
	// const buffer = pbData.serializeBinary()
	// return buffer
}

deserialization

const messages = require('src/generated/src/data_pb.js')

const protoParse = buffer => {
	const _data = messages.Data.deserializeBinary(buffer)

	let _read = null
	const items = _data.getItemsList()
	for (let i = 0; i < items.length; i++) {
		const item = items[i]
		const info = item.getInfo()
		_read = info.getA()
		_read = info.getB()
		_read = item.getIndex()
		_read = item.getPositionList()
	}

	// Flattened version:
	// const _data = messages.FlattedData.deserializeBinary(buffer)
	// // Actively read the data (to keep lazy reads from skewing the measurement)
	// let _read = null

	// _read = _data.getPositionsList()
	// _read = _data.getIndicesList()
	// _read = _data.getInfoAList()
	// _read = _data.getInfoBList()
}

FlatBuffers

schema

First, write the schema using the FlatBuffers schema (fbs) syntax:

 table Info {
     a: string;
     b: float;
 }
 ​
 table Item {
     position: [float];
     index: int;
     info: Info;
 }
 ​
 table Data {
     items: [Item];
 }
 ​
 table FlattedData {
     positions:[float];
     indices:[int];
     info_a:[string];
     info_b:[float];
 }

Compile to JS

./lib/flatbuffers-1.11.0/flatc -o ./src/generated/ --js --binary ./src/data.fbs

serialization

const flatbuffers = require('flatbuffers')
const tables = require('src/generated/data_generated.js')

const flatbufferSerialize = () => {
	const builder = new flatbuffers.Builder(0)

	const items = []

	data.items.forEach(item => {
		// String handling
		let a = null
		if (item.info.a) {
			a = builder.createString(item.info.a)
		}

		// Start operating on the info node
		tables.Info.startInfo(builder)
		// Write values
		item.info.a && tables.Info.addA(builder, a)
		tables.Info.addB(builder, item.info.b)
		// Finish the info node
		const fbInfo = tables.Info.endInfo(builder)

		// Array handling
		let position = null
		if (item.position) {
			position = tables.Item.createPositionVector(builder, item.position)
		}

		// Start operating on the item node
		tables.Item.startItem(builder)
		// Write data
		item.position && tables.Item.addPosition(builder, position)
		item.index && tables.Item.addIndex(builder, item.index)
		tables.Item.addInfo(builder, fbInfo)
		// Finish the item node
		const fbItem = tables.Item.endItem(builder)

		items.push(fbItem)
	})

	// Array handling
	const pbItems = tables.Data.createItemsVector(builder, items)

	// Start operating on the data node
	tables.Data.startData(builder)
	// Write data
	tables.Data.addItems(builder, pbItems)
	// Finish the data node
	const fbData = tables.Data.endData(builder)

	// Finish all operations
	builder.finish(fbData)

	// Output
	// @note the raw buffer carries an offset
	// return builder.asUint8Array().buffer
	return builder.asUint8Array().slice().buffer

	// Flattened version:
	// const builder = new flatbuffers.Builder(0)
	// const pbPositions = tables.FlattedData.createPositionsVector(builder, flattedData.positions)
	// const pbIndices = tables.FlattedData.createIndicesVector(builder, flattedData.indices)
	// const pbInfoB = tables.FlattedData.createInfoBVector(builder, flattedData.info_b)
	// const infoAs = []
	// for (let i = 0; i < flattedData.info_a.length; i++) {
	// 	const str = flattedData.info_a[i]
	// 	if (str) {
	// 		const a = builder.createString(str)
	// 		infoAs.push(a)
	// 	}
	// }
	// const pbInfoA = tables.FlattedData.createInfoAVector(builder, infoAs)
	// tables.FlattedData.startFlattedData(builder)
	// tables.FlattedData.addPositions(builder, pbPositions)
	// tables.FlattedData.addIndices(builder, pbIndices)
	// tables.FlattedData.addInfoA(builder, pbInfoA)
	// tables.FlattedData.addInfoB(builder, pbInfoB)
	// const fbData = tables.FlattedData.endFlattedData(builder)
	// builder.finish(fbData)
	// // @note the raw buffer carries an offset
	// return builder.asUint8Array().slice().buffer
	// // return builder.asUint8Array().buffer
}

deserialization

const flatbuffers = require('flatbuffers')
const tables = require('src/generated/data_generated.js')

const flatbufferParse = buffer => {
	buffer = new Uint8Array(buffer)
	buffer = new flatbuffers.ByteBuffer(buffer)
	const _data = tables.Data.getRootAsData(buffer)

	// Actively read the data (FlatBuffers does not read data while parsing,
	// so the reads must be done explicitly here)
	let _read = null

	const len = _data.itemsLength()
	for (let i = 0; i < len; i++) {
		const item = _data.items(i)
		const info = item.info()
		_read = info.a()
		_read = info.b()
		_read = item.index()
		_read = item.positionArray()
	}

	// Flattened version:
	// buffer = new Uint8Array(buffer)
	// buffer = new flatbuffers.ByteBuffer(buffer)
	// const _data = tables.FlattedData.getRootAsFlattedData(buffer)
	// // Actively read the data (FlatBuffers reads lazily through getters,
	// // so the reads must be done explicitly here)
	// let _read = null

	// _read = _data.positionsArray()
	// _read = _data.indicesArray()
	// _read = _data.infoBArray()

	// const len = _data.infoALength()
	// for (let i = 0; i < len; i++) {
	// 	_read = _data.infoA(i)
	// }
}

FlatBuffers parses strings poorly: when data contains many strings, its overall serialization performance, parsing performance, and output size are all worse than JSON's. For purely numeric data, however, FlatBuffers has a clear advantage over JSON. Its generic state-machine interface is cumbersome for building complex data structures.

Performance indicators

Test environment: 15" MBP mid-2015, 2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3, macOS 10.14.3, Chrome 75

Test data: the data from the example above; the 200,000 strings are each two concatenated UUIDs

Test method: run each case 10 times and take the average; gzip uses its default configuration

Units: time in ms, size in MB

  • The proportion of strings in the data, the length of individual strings, and the Unicode range of their characters all affect the results.
  • Because DIMBIN is designed for flattened data, the non-flattened tests cover only JSON, Protocol Buffers, and FlatBuffers.

Serialization performance

Deserialization performance

Space occupied

Selection Suggestions

Based on the test results: if your scenario has high performance requirements, flattening your data is always worthwhile.

  • Small data volume, fast iteration, lots of string data: use JSON; it is convenient and fast.
  • Small data volume, stable interfaces, static languages dominant, multi-language collaboration, IDL integrated into the workflow, dependency on gRPC: consider Protocol Buffers.
  • Large data volume, stable interfaces, static languages dominant, IDL integrated, data that cannot be flattened: consider FlatBuffers.
  • Large data volume, fast iteration, high performance requirements, data that can be flattened, no appetite for heavyweight tools or changes to the engineering structure: consider DIMBIN.

The original link

This article is original content from Alibaba Cloud's Yunqi Community and may not be reproduced without permission.