With the release of version 2.0, Nebula Graph has undergone a series of changes. The most significant one on the storage side is a new underlying encoding format. Nebula Graph stores its data as key-value (KV) pairs in RocksDB. This article explains the differences between the old and new encoding formats and why the storage format was changed.

Version 1.0 format

For a quick review of the version 1.0 encoding format, see the earlier post in the Nebula Architecture Anatomy series: Storage Design for a Graph Database. Because vertex IDs could only be integers in version 1.0, every VertexID was stored as an INT64.

  • Vertex format

  • Edge format

Given any VertexID, the corresponding PartID can be computed by hashing, so a vertex and all of its edges (edges are hashed by their source vertex) map to the same partition. Note that in version 1.0 the first byte, Type, was the same for vertices and edges. As a result, the tags of a vertex are not necessarily physically contiguous. For a source vertex SRC, for example, the three tags (tag1, tag2, tag3) may be separated by edges in the stored order.

This format meets the needs of most 1.0 interfaces. For example, FETCH and GO only need to specify the corresponding prefix to read the data they need.
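The partitioning and prefix lookups described above can be sketched in Python. This is a simplified model, not the exact 1.0 byte layout: the modulo hash, the field widths, the big-endian encoding, the Type value, and every function name here are assumptions made for illustration.

```python
import struct

NUM_PARTS = 10    # hypothetical number of partitions in the space
DATA_TYPE = 0x01  # one Type byte shared by vertices and edges (illustrative value)

def part_id(vertex_id: int, num_parts: int = NUM_PARTS) -> int:
    """Assumed hash: modulo over the partition count, 1-based."""
    return vertex_id % num_parts + 1

def v1_prefix(vertex_id: int) -> bytes:
    """Type + PartID + VertexID: the prefix shared by a vertex and its out-edges."""
    return (bytes([DATA_TYPE])
            + part_id(vertex_id).to_bytes(3, "big")
            + struct.pack(">q", vertex_id))

def v1_vertex_key(vertex_id: int, tag_id: int, ts: int = 0) -> bytes:
    return v1_prefix(vertex_id) + struct.pack(">iq", tag_id, ts)

def v1_edge_key(src: int, edge_type: int, rank: int, dst: int, ts: int = 0) -> bytes:
    return v1_prefix(src) + struct.pack(">iqqq", edge_type, rank, dst, ts)

def prefix_scan(store: dict, prefix: bytes) -> list:
    """Return the values whose keys start with prefix, in sorted key order."""
    return [store[k] for k in sorted(store) if k.startswith(prefix)]

store = {
    v1_vertex_key(100, 1): "tag(id=1)",
    v1_vertex_key(100, 3): "tag(id=3)",
    v1_vertex_key(100, 5): "tag(id=5)",
    v1_edge_key(100, 2, 0, 200): "edge(type=2) 100->200",
    v1_edge_key(100, 4, 0, 300): "edge(type=4) 100->300",
    v1_vertex_key(101, 1): "tag(id=1) of vertex 101",
}

# Everything stored for vertex 100: tags and edges come back interleaved.
everything = prefix_scan(store, v1_prefix(100))
# A FETCH-style lookup of one tag and a GO-style lookup of one edge type.
fetched = prefix_scan(store, v1_prefix(100) + struct.pack(">i", 3))
walked = prefix_scan(store, v1_prefix(100) + struct.pack(">i", 2))
print(everything)
```

In this model the three tags of vertex 100 are separated by its edges in key order, which is the situation the shared Type byte allows.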

Version 2.0 format

Before the GA release, the 2.0 storage format was essentially the same as in 1.0. If the VertexID is an integer, the format is exactly the same as in 1.0. If the VertexID is a string, the 8-byte INT64 is replaced by a FIXED_STRING whose length the user must specify when creating the space. A VertexID shorter than the specified length is automatically padded with \0, while one that exceeds the specified length results in an error.
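The padding rule can be illustrated with a few lines of Python. The function name and the UTF-8 encoding are assumptions for this sketch; only the pad-with-\0 and reject-if-too-long behavior comes from the description above.

```python
def encode_fixed_string_vid(vid: str, length: int) -> bytes:
    """Encode a string VertexID into exactly `length` bytes.

    Short IDs are padded on the right with \\0; IDs longer than the
    length declared for the space are rejected.
    """
    raw = vid.encode("utf-8")  # assumed encoding
    if len(raw) > length:
        raise ValueError(f"VertexID {vid!r} exceeds the fixed length {length}")
    return raw.ljust(length, b"\x00")

padded = encode_fixed_string_vid("player100", 12)
print(padded, len(padded))
```

Because the check is on the encoded byte length, a multi-byte character counts for more than one position in this sketch.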

In the GA release, we changed the underlying storage format, so upgrading requires the upgrade tool to convert data from the original format to the new one. The following is the storage format used in the 2.0 GA version.

Version 2.0 storage format

  • Vertex format

  • Edge format

Comparison with the 1.0 storage format

There are a few major changes:

  1. The length of VertexID changes from a fixed 8 bytes to n bytes. If the VertexID type is integer, n is 8. If the VertexID type is string, n is the length specified for the space.
  2. The vertex key no longer carries the 1.0 timestamp. The edge key replaces the 1.0 timestamp with a one-byte placeholder.
  3. Vertices and edges no longer share the same Type in the first byte, so they are physically separated.
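As a rough sketch of what these three changes mean for key construction, the following Python builds 2.0-style keys. The n-byte VertexID, the missing timestamp, the one-byte placeholder, and the distinct Type bytes follow the list above; the Type values, the 3-byte PartID, the 4-byte tag and edge type, the 8-byte rank, the field order, and the function names are assumptions.

```python
VERTEX_TYPE = 0x01  # illustrative values; the text only says the two differ
EDGE_TYPE = 0x02

def int_vid(value: int) -> bytes:
    """Integer VertexIDs keep n = 8 bytes."""
    return value.to_bytes(8, "big", signed=True)

def v2_vertex_key(part: int, vid: bytes, tag_id: int) -> bytes:
    """Type + PartID + VertexID (n bytes) + TagID, with no timestamp."""
    return (bytes([VERTEX_TYPE]) + part.to_bytes(3, "big")
            + vid + tag_id.to_bytes(4, "big"))

def v2_edge_key(part: int, src: bytes, edge_type: int, rank: int,
                dst: bytes, placeholder: int = 1) -> bytes:
    """Type + PartID + src + EdgeType + Rank + dst + one-byte placeholder."""
    return (bytes([EDGE_TYPE]) + part.to_bytes(3, "big")
            + src + edge_type.to_bytes(4, "big", signed=True)
            + rank.to_bytes(8, "big", signed=True)
            + dst + bytes([placeholder]))

# Integer IDs: n = 8.
vk_int = v2_vertex_key(1, int_vid(100), 7)
ek_int = v2_edge_key(1, int_vid(100), 3, 0, int_vid(200))

# String IDs already padded to n = 16, the length chosen for the space.
src = b"player100".ljust(16, b"\x00")
dst = b"team204".ljust(16, b"\x00")
vk_str = v2_vertex_key(1, src, 7)
ek_str = v2_edge_key(1, src, 3, 0, dst)

print(len(vk_int), len(ek_int), len(vk_str), len(ek_str))
```

The key length depends only on n, so all keys in one space share the same layout regardless of how long the original string IDs were.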

These changes were made mainly for the following reasons:

  1. VertexID was changed mainly to support string IDs while staying compatible with the integer IDs of version 1.0. Inside storage, all VertexIDs are handled as raw bytes; they are converted to the type configured for the space only when results are returned.

    Why use FIXED_STRING for string IDs? Without a fixed length, prefix scans would not work. Padding short IDs to the specified length gives all vertex and edge keys a prefix of the same length, so the corresponding prefix queries can be carried out.

  2. The main reason for removing the timestamp is that keeping multiple versions of data hurts performance, and MVCC-related work is not planned for now.

    The one-byte placeholder in the edge key is reserved for TOSS (Transaction on Storage Side). It mainly indicates whether both the outgoing and the incoming edge of an edge have been fully inserted. The details are left to a later article.

  3. The main benefit of separating vertices and edges is that all tags of a vertex can be fetched quickly, an operation the Cypher MATCH statement uses heavily. If the scan still used the shared Type + VertexID prefix, vertices and edges would be interleaved, which can greatly hurt performance. With separate types, a prefix scan on VertexType + VertexID quickly returns all tags.

    In version 1.0, there was no requirement to fetch all tags of a vertex, so vertices and edges could be stored under the same prefix. Even so, the shared prefix had a significant impact at the code level. For example, the fetch interface in 1.0 scans by the VertexID prefix, so retrieving tags performs poorly for super vertices (vertices with a very large number of edges). In addition, using the scan interface provided by storage to obtain all vertices in the graph actually scans the entire RocksDB instance.
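Two of the reasons above, fixed-length IDs and the separate Type byte, can be made concrete with a toy prefix scan over sorted keys. This is only a sketch: PartID is omitted for brevity, and the key layout, Type values, and helper names are assumptions rather than the actual storage code.

```python
import bisect

def prefix_scan(sorted_keys, prefix):
    """Seek to the first key >= prefix, then iterate while the prefix matches."""
    i = bisect.bisect_left(sorted_keys, prefix)
    hits = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        hits.append(sorted_keys[i])
        i += 1
    return hits

def tag_key(type_byte, vid, tag_id):
    return bytes([type_byte]) + vid + tag_id.to_bytes(4, "big")

def edge_key(type_byte, src, edge_type, rank, dst):
    return (bytes([type_byte]) + src + edge_type.to_bytes(4, "big")
            + rank.to_bytes(8, "big") + dst)

VID_LEN = 4  # hypothetical FIXED_STRING length

def pad(vid):
    return vid.ljust(VID_LEN, b"\x00")

# 1) Variable-length IDs: the prefix for "ab" also matches the keys of "abc".
variable_keys = sorted([tag_key(1, b"ab", 1), tag_key(1, b"abc", 1)])
variable_hits = prefix_scan(variable_keys, b"\x01" + b"ab")

#    Fixed-length IDs: the padded prefix matches only vertex "ab".
fixed_keys = sorted([tag_key(1, pad(b"ab"), 1), tag_key(1, pad(b"abc"), 1)])
fixed_hits = prefix_scan(fixed_keys, b"\x01" + pad(b"ab"))

# 2) A super vertex with three tags and 1000 out-edges.
hub, dst = pad(b"hub"), pad(b"dst")
tags, n_edges = [1, 3, 5], 1000
SHARED, VERTEX, EDGE = 1, 1, 2  # illustrative Type values

#    Shared Type byte: the edges (type 2) sort between tag 1 and tag 3.
shared_keys = sorted(
    [tag_key(SHARED, hub, t) for t in tags]
    + [edge_key(SHARED, hub, 2, r, dst) for r in range(n_edges)])
shared_visited = prefix_scan(shared_keys, bytes([SHARED]) + hub)

#    Separate Type bytes: the vertex prefix never touches an edge key.
separated_keys = sorted(
    [tag_key(VERTEX, hub, t) for t in tags]
    + [edge_key(EDGE, hub, 2, r, dst) for r in range(n_edges)])
separated_visited = prefix_scan(separated_keys, bytes([VERTEX]) + hub)

print(len(variable_hits), len(fixed_hits))
print(len(shared_visited), len(separated_visited))
```

With the shared Type byte, collecting the three tags means walking past every edge of the vertex; with separate Type bytes the scan visits only the tag keys.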

In addition to the vertex and edge format changes, the index format has also changed.

First, 2.0 supports NULL, so indexes need to be able to represent the corresponding semantics. Second, in version 1.0, string fields in an index were treated as variable-length strings, so a LOOKUP statement that used a string index could only perform equality queries. In version 2.0, string fields in an index use FIXED_STRING, like the VertexID field in the data, which makes other comparisons possible, for example LOOKUP ON index1 WHERE col > "aaa". Index functionality and its changes will be covered in other articles.
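One way to see why a fixed-width string field helps with range queries is to compare how index keys sort. The sketch below assumes an index key made of the column value followed by the VertexID; that layout, the widths, and the helper names are illustrative assumptions, and NULL handling is not modeled.

```python
import bisect

COL_WIDTH = 4  # hypothetical FIXED_STRING width of the indexed column
VID_LEN = 2    # hypothetical VertexID length

def index_key_variable(col: bytes, vid: bytes) -> bytes:
    return col + vid

def index_key_fixed(col: bytes, vid: bytes) -> bytes:
    return col.ljust(COL_WIDTH, b"\x00") + vid

# Variable width: the VertexID bytes leak into the comparison, so the row
# with col "ab" sorts before the row with col "a".
variable_order = sorted([index_key_variable(b"a", b"zz"),
                         index_key_variable(b"ab", b"aa")])

# Fixed width: keys sort by the column value first.
fixed_order = sorted([index_key_fixed(b"a", b"zz"),
                      index_key_fixed(b"ab", b"aa")])

def lookup_greater_than(sorted_keys, bound: bytes):
    """Rows with col > bound form a contiguous tail of the sorted keys."""
    sentinel = bound.ljust(COL_WIDTH, b"\x00") + b"\xff" * VID_LEN
    start = bisect.bisect_right(sorted_keys, sentinel)
    return [k[:COL_WIDTH].rstrip(b"\x00").decode() for k in sorted_keys[start:]]

rows = [(b"aa", b"v1"), (b"aaa", b"v2"), (b"aab", b"v3"), (b"b", b"v4")]
keys = sorted(index_key_fixed(col, vid) for col, vid in rows)
result = lookup_greater_than(keys, b"aaa")  # models WHERE col > "aaa"
print(result)
```

The sentinel trick relies on VertexID bytes being smaller than 0xff, which holds for the ASCII IDs used here.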

Recommended reading

  • Nebula Architecture Anatomy: Storage Design for a Graph Database