Column-oriented storage is not a new technology; it can be traced back to work on Cantor as early as 1983. However, constrained by the hardware and workloads of the time, mainstream transactional (OLTP) databases mostly adopted row storage. Only with the rise of analytical (OLAP) databases in recent years has the idea of column storage become popular again.

On the one hand, columnar storage saves space and reduces IO; on the other hand, column-oriented data structures open up opportunities for optimizing computation. This article focuses on how column stores organize data, including data layout, encoding, and compression. The next article will cover the compute layer and the overall DBMS architecture.

What is column storage

Traditional OLTP databases typically use row storage. As shown in the figure below, all columns of a record are laid out one after another to form a row, and data is stored row by row. With a B+ tree or SSTable as the index, the row for a given primary key can be located quickly.


Row storage is natural for OLTP scenarios: most operations work on whole entities, that is, they insert, delete, update, or look up an entire row, so storing a row's data in physically adjacent locations is clearly a good choice.

For OLAP scenarios, however, a typical query traverses the whole table to group, sort, and aggregate, which erases the advantage of row-wise storage. Worse, an analytical SQL query usually touches only a few columns of interest, yet the irrelevant columns of every row still have to be scanned.

Column storage is designed for such needs. As shown in the figure below, data from the same column is stored side by side, with each column of the table forming a long array.


Obviously, column storage is not OLTP-friendly, since writing a single row requires modifying multiple columns at once. But it has big advantages in OLAP scenarios (a small illustration in code follows the list):

  • When a query involves only some of the columns, only those columns need to be scanned
  • The data in each column has the same type, and adjacent values tend to be similar, so column data compresses very efficiently
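
As a tiny illustration (with a made-up three-column table, not taken from the article's figures), here is the same data laid out row-wise and column-wise:

```python
# The same toy table, laid out both ways (illustrative only).
rows = [  # row storage: one tuple per row, stored contiguously
    (1, "Alice", 30),
    (2, "Bob",   25),
    (3, "Carol", 41),
]

columns = {  # column storage: each column becomes one long array
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [30, 25, 41],
}

# A query like SELECT avg(age) only needs to touch one array here,
# instead of scanning every field of every row.
avg_age = sum(columns["age"]) / len(columns["age"])
print(avg_age)
```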

Is BigTable (HBase) column storage?





Many articles classify BigTable as a column store. Strictly speaking, however, it is not. Although the BigTable paper cites column-store designs such as C-Store, BigTable itself stores data as key-value pairs, which has nothing to do with column storage.





What does cause confusion is BigTable's notion of a column family. A column family can be assigned to a locality group, which determines the physical placement of the data in that family, so that the column families of the same primary key can each be stored on the most suitable physical node. And because the data within a column family is usually similar, it compresses better than the table as a whole.





It is also worth emphasizing that whether a database is columnar is orthogonal to whether it is relational or NoSQL. The C-Store discussed in this article uses a relational model.

Column Families in BigTable

Origin: DSM paging mode

As we know, mechanical disks are limited by head seek, so reads and writes usually happen in blocks; the operating system therefore abstracts disks as block devices, as opposed to stream devices. This helps upper-layer applications manage storage space and improve read/write efficiency. The same constraint directly shapes database storage formats: a database page corresponds to one or more physical sectors, so that pages and sectors are aligned and IO stays efficient.

So how do you lay the data out on a page?

Most DBMSs that serve online queries use NSM (N-ary Storage Model): complete rows (tuples of the relation) are stored one after another starting from the page header, and an index at the end of the page records the starting offset of each row. Since rows are not necessarily fixed-length, this index lets us locate a row quickly without scanning the page record by record.

The drawback of NSM is that when a query touches only a few columns, the unneeded columns still occupy precious memory and CPU cache and cause extra IO. To avoid this, many analytical databases use DSM (Decomposition Storage Model), i.e., column-wise paging: the relation is decomposed by column into multiple sub-relations. As before, an index is stored at the end of each page. A sketch of both layouts follows.
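
A minimal sketch of the two paging models, under the simplifying assumption that records are just delimiter-joined strings (real pages also carry headers, free-space maps, checksums, and so on):

```python
# NSM ("slotted page"): whole rows packed from the front of the page,
# with an offset index (slot array) kept at the tail.
def build_nsm_page(rows):
    payload, offsets = b"", []
    for row in rows:
        record = "|".join(map(str, row)).encode()
        offsets.append(len(payload))          # where this row starts
        payload += record
    return payload, offsets                   # offsets play the role of the tail index

# DSM: the relation is decomposed by column; each sub-relation
# (column chunk) gets its own page with its own tail index.
def build_dsm_pages(rows, column_names):
    pages = {}
    for i, name in enumerate(column_names):
        payload, offsets = b"", []
        for row in rows:
            value = str(row[i]).encode()
            offsets.append(len(payload))
            payload += value
        pages[name] = (payload, offsets)
    return pages

rows = [(1, "Alice", 30), (2, "Bob", 25)]
print(build_nsm_page(rows))
print(build_dsm_pages(rows, ["id", "name", "age"]))
```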


Incidentally, in 2001 Ailamaki et al. proposed the PAX (Partition Attributes Across) format, which tries to bring some of DSM's benefits into NSM and combine the strengths of both. Specifically, NSM retrieves a single row faster, because the row's data sits adjacent within one page, while DSM makes better use of the CPU cache and allows more compact compression. PAX divides each page into multiple minipages stored column-wise, and the minipages within one page together hold a group of complete rows.
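
A rough sketch of the PAX idea: rows are assigned to a page as in NSM, but within the page each attribute is regrouped into its own minipage (again a simplification of the real format):

```python
# PAX (illustrative): one page holds a group of whole rows, but inside the
# page the values are regrouped column by column into "minipages",
# which is friendlier to the CPU cache.
def build_pax_page(rows, column_names):
    return {
        name: [row[i] for row in rows]      # one minipage per attribute
        for i, name in enumerate(column_names)
    }

page = build_pax_page([(1, "Alice", 30), (2, "Bob", 25)], ["id", "name", "age"])

# Reassembling row 0 only needs lookups inside this single page:
row0 = tuple(page[name][0] for name in ["id", "name", "age"])
print(row0)   # (1, 'Alice', 30)
```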

Nowadays, with the spread of distributed file systems and improvements in disk performance, many modern DBMSs have abandoned page-at-a-time storage, but some of its ideas, such as data partitioning, intra-partition indexes, and mixed row/column layouts, can still be seen everywhere in these systems.

Although a distributed storage system has no notion of pages, it still splits files into blocks for storage, with a block granularity much larger than ordinary sectors (for example, the default HDFS block size is 128 MB). The larger read/write granularity accommodates the relatively low bandwidth of network IO in exchange for higher throughput, but it also sacrifices fine-grained random reads and writes.

Column data encoding and compression

For both disk-based and in-memory databases, IO is usually the performance bottleneck relative to CPU. Proper compression not only saves space but also reduces IO and improves read performance. Column storage has a natural advantage in data encoding and compression.

The following classification of data encodings, from C-Store, is representative. Depending on 1) whether the data itself is sorted and 2) how many distinct values it has, the data falls into four cases (the first three are sketched in code after the list):

  • Sorted, few distinct values. Use a sequence of triples (v, f, n), meaning that value v first appears at row f and occurs n times (i.e., rows f through f+n−1). For example, if the value 4 appears in rows 12-18, it is encoded as (4, 12, 7).
  • Unsorted, few distinct values. Construct a bit string B for each distinct value v, i.e., a bitmap of v's positions. For example, if a column contains 0,0,1,1,2,1,0,2,1, the encoding is (0, 110000100), (1, 001101001), (2, 000010010). Since the bitmaps are sparse, they can be further run-length encoded.
  • Sorted, many distinct values. Each value is represented as the previous value plus a delta, except of course the first one. For example, the column 1,4,7,7,8,12 becomes the sequence 1,3,3,0,1,4. The encoded data is clearly easier to pack densely and achieves a higher compression ratio.
  • Unsorted, many distinct values. There is no good encoding for this case.
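
A minimal sketch of the first three cases (my own illustrative code, not C-Store's):

```python
# Case 1: sorted, few distinct values -> (value, first_row, run_length) triples.
def rle_encode(sorted_column):
    out = []
    for i, v in enumerate(sorted_column):
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1], out[-1][2] + 1)
        else:
            out.append((v, i + 1, 1))          # rows numbered from 1
    return out

# Case 2: unsorted, few distinct values -> one bitmap per distinct value.
def bitmap_encode(column):
    bitmaps = {}
    for i, v in enumerate(column):
        bitmaps.setdefault(v, ["0"] * len(column))[i] = "1"
    return {v: "".join(bits) for v, bits in bitmaps.items()}

# Case 3: sorted, many distinct values -> first value plus deltas.
def delta_encode(sorted_column):
    return [sorted_column[0]] + [
        b - a for a, b in zip(sorted_column, sorted_column[1:])
    ]

print(rle_encode([4] * 7))                     # [(4, 1, 7)]
print(bitmap_encode([0, 0, 1, 1, 2, 1, 0, 2, 1]))
print(delta_encode([1, 4, 7, 7, 8, 12]))       # [1, 3, 3, 0, 1, 4]
```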

After encoding, the data can additionally be compressed. Because values within a column are similar, fairly good compression can be achieved even without any special encoding. General-purpose compression algorithms such as Snappy support streaming and offer high throughput.
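
A quick way to observe the effect, using zlib from the Python standard library instead of Snappy (purely illustrative data):

```python
import zlib

ages = list(range(20, 60)) * 250                 # 10,000 similar integers
names = [f"user-{i}" for i in range(10_000)]

# Row-major: fields of different columns interleaved.
row_major = "".join(f"{n},{a};" for n, a in zip(names, ages)).encode()
# Column-major: all names together, then all ages together.
col_major = (",".join(names) + "|" + ",".join(map(str, ages))).encode()

print(len(zlib.compress(row_major)))
print(len(zlib.compress(col_major)))             # grouping similar values tends to compress better
```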

Finally, encoding and compression are not just space savers; more often they are ways of organizing data. In systems such as PowerDrill and Dremel we will see many encodings that double as indexes, for example letting scans skip unneeded partitions or even completely changing how table queries are executed.

Column storage and distributed file systems

In modern big data architectures, distributed file systems such as GFS and HDFS have become the mainstream way to store large-scale data sets. Compared with the disks of a single machine, distributed file systems offer replication, high availability, large capacity, and low cost, but they also introduce problems that do not exist on a single machine:

  1. Reads and writes go over the network. Throughput can match or even exceed that of local disks, but latency is much higher and is heavily affected by the network environment.
  2. Large sequential reads and writes achieve high throughput, but random access performs poorly, and random writes are mostly not supported at all. To amortize the network overhead, writes are usually batched in chunks of tens of megabytes.

These drawbacks are fatal for OLTP scenarios that rely heavily on random reads and writes. As a result, many OLAP-oriented column stores choose to forgo OLTP capabilities and build directly on top of distributed file systems.

There are several ways to get the most out of a distributed file system: reading data by block (shard), streaming reads, append-only writes, and so on. The popular open-source column storage formats examined below incorporate these optimizations into their storage format designs.


Examples of column storage systems

C-Store (2005) / Vertica

Most DBMSs are write-optimized; C-Store was among the first database systems designed to be read-optimized, although from today's point of view it would count as HTAP. In read-mostly scenarios such as ad-hoc analytical queries and ORM online queries, where most operations are reads rather than writes, columnar storage achieves better performance. Like mainstream DBMSs, C-Store supports a standard relational model.

As mentioned at the beginning of this article, column storage itself is nothing new. C-Store's main contributions are: maintaining multiple copies and multiple sort orders of column data simultaneously through carefully designed projections, and a layered read/write design that still accommodates (a modest amount of) writes. In addition, C-Store was probably the first modern implementation of a column-oriented database, and its design inspired countless later commercial and open-source databases, such as Vertica.

The data model

C-Store is a relational database, and its logical tables are no different from those in any other database. Internally, however, logical tables are split vertically into projections; each projection can contain one or more columns, and even columns from other logical tables (which then form an index). Of course, each column appears in at least one projection.

In the example below, the EMP table is stored as three projections and DEPT as one. Each projection is sorted by its own sort key, underlined in the figure.


Each projection is stored column-wise: every column is kept in a separate data structure. A projection can also be horizontally partitioned by ranges of its sort key, to avoid over-long columns.

To answer a query, C-Store selects a set of projections that together cover all columns of the result (a covering set) and joins them to reconstruct the original rows. To make joining projections more efficient (that is, reordering one projection by another's sort key), a join index is introduced as an aid: it stores the mapping from row positions in proj1 to row positions in proj2.
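
A hedged sketch of the idea (my own illustrative structures, not C-Store's actual implementation): two projections over the same logical EMP table with different sort keys, plus a join index mapping positions in one projection to positions in the other, so that full rows can be reconstructed from a covering set.

```python
# Logical table EMP(name, age, salary), stored as two projections (illustrative).
proj_by_age = {            # projection sorted by age: (name, age | age)
    "name": ["Bob", "Alice", "Carol"],
    "age":  [25, 30, 41],
}
proj_by_salary = {         # projection sorted by salary: (salary | salary)
    "salary": [900, 1200, 2000],
}

# Join index: for row i of proj_by_age, where does the same entity
# sit inside proj_by_salary?  (positions are made up for illustration)
join_index = [2, 0, 1]

# Reconstruct full rows from the covering set {proj_by_age, proj_by_salary}.
rows = [
    (proj_by_age["name"][i],
     proj_by_age["age"][i],
     proj_by_salary["salary"][join_index[i]])
    for i in range(len(join_index))
]
print(rows)   # [('Bob', 25, 2000), ('Alice', 30, 900), ('Carol', 41, 1200)]
```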

Projections are redundant: it is common for the same column to appear in multiple projections, each with a different sort key, so C-Store can pick the set of projections that minimizes the cost of executing a given query.

Cleverly, C-Store's projection redundancy is also used for K-safe high availability (tolerating up to K machine failures). When some nodes are down, C-Store can still execute a query as long as a covering set can be found, though not necessarily the best one.

Seen from another angle, a C-Store projection can be regarded as a kind of materialized query result: the result is precomputed before queries are executed. And since every column appears in at least one projection, there is no need to keep the original logical table at all.





Precomputing the result of every possible query is obviously impractical, but materializing frequently used intermediate views strikes a balance between precomputation cost and query cost. What C-Store materializes is column data sorted by a sort key (possibly even joined with other tables), together with the corresponding join indexes.

How C-Store handles writes will be covered in the next article.

Apache ORC

Apache ORC was originally developed as a file format to support OLAP queries on Hive and is now widely used in the Hadoop ecosystem. ORC supports fields of various types, including common ones such as int and string as well as composite fields such as struct, list, and map. The metadata describing the fields is placed at the end of the ORC file, which is why the format is called self-describing.

Data structures and indexes

Building indexes over partitions is a common optimization. ORC's data structure is divided into three levels, each carrying index information to speed up queries (a pruning sketch in code follows the list).


  • File level: an entire ORC file. The file footer holds metadata about the data and file-level index information, such as the minimum and maximum value (range) of each column, the NULL distribution, and Bloom filters. This information can quickly determine whether the file contains the data a query needs at all. Each ORC file contains multiple stripes.
  • Stripe level: a range partition of the original table, containing the values of every column in that partition. Each stripe also has its own index in its footer, similar to the file-level one.
  • Row-group level: every 10,000 rows of a column form a row group, and each row group carries its own row-group-level index.
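
A much-simplified sketch of how such min/max statistics enable pruning (hypothetical metadata and predicate; the real ORC reader keeps far richer statistics):

```python
# Illustrative stripe metadata: min/max of the "age" column, per stripe and per row group.
stripes = [
    {"age": {"min": 18, "max": 35},
     "row_groups": [{"age": {"min": 18, "max": 24}},
                    {"age": {"min": 25, "max": 35}}]},
    {"age": {"min": 36, "max": 70},
     "row_groups": [{"age": {"min": 36, "max": 50}},
                    {"age": {"min": 51, "max": 70}}]},
]

def overlaps(stats, lo, hi):
    """Can this partition possibly contain values in [lo, hi]?"""
    return stats["max"] >= lo and stats["min"] <= hi

# Predicate: WHERE age BETWEEN 40 AND 45 -- skip whole stripes / row groups.
for si, stripe in enumerate(stripes):
    if not overlaps(stripe["age"], 40, 45):
        continue                      # entire stripe skipped without reading its data
    for gi, rg in enumerate(stripe["row_groups"]):
        if overlaps(rg["age"], 40, 45):
            print(f"read stripe {si}, row group {gi}")
```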

A stripe in ORC plays the role of a page in a traditional database: it is the basic unit for batch reads and writes of ORC files. Because a distributed storage system has high read/write latency, an IO is only cost-effective when a reasonably large batch of data is read at once, which echoes the idea of reading and writing disks by page.

Like many other storage formats, ORC chooses to place statistics and metadata at the end of the file and of each stripe rather than at the head.





However, ORC further optimizes stripe reads and writes: the indexes of structures finer-grained than a stripe (columns and row groups) are pulled out and placed together at the head of the stripe. Because batch computation reads whole stripes at a time, separating out these indexes reduces the IO needed in batch scenarios (a batch read can simply skip over the index section).

ACID support

Apache ORC provides limited ACID transaction support. Since a distributed file system does not allow files to be written randomly, how are changes persisted?

In a manner reminiscent of LSM-trees and MVCC, the writer does not modify data in place; instead, each transaction produces a delta file recording the changes to be superimposed on the original data. When delta files accumulate, a minor compaction merges multiple delta files into one; when a delta grows large, a major compaction merges the delta with the base data.

This approach of keeping the baseline data immutable and layering delta data on top is very common in column storage systems.
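
A toy sketch of this merge-on-read idea (not ORC's actual delta format, which tracks changes by transaction and row id; this only shows the layering):

```python
# Base data is immutable; each transaction writes a small delta file.
base = {1: "alice@old.com", 2: "bob@old.com", 3: "carol@old.com"}
delta_files = [
    {"update": {2: "bob@new.com"}, "delete": set()},
    {"update": {},                 "delete": {3}},
]

def read_with_deltas(base, deltas):
    """Apply delta files on top of the base data at read time."""
    view = dict(base)
    for d in deltas:                   # later deltas override earlier ones
        view.update(d["update"])
        for row_id in d["delete"]:
            view.pop(row_id, None)
    return view

print(read_with_deltas(base, delta_files))
# A compaction would simply materialize this merged view as the new base.
```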

Remember that ORC delta files are also written to distributed storage, so each delta file should not be too small. This explains why ORC supports transactions but is mainly friendly to bulk-write transactions and is unsuitable for frequent, small writes.

Dremel (2010) / Apache Parquet

Dremel is a query system developed by Google for large-scale read-only data; it serves fast ad-hoc queries and makes up for MapReduce's lack of interactive querying. To avoid keeping a second copy of the data, Dremel reads data in place, usually from a distributed file system such as GFS, which required designing a general-purpose file format.

Dremel's system design is not very different from most OLAP column databases, but its elegant storage format caught on, and Apache Parquet is its open-source counterpart. Note that Parquet, like ORC, is a storage format, not a complete system.

Nested data models

Protobuf is used heavily inside Google as a cross-platform, cross-language data serialization format that is more compact and expressive than JSON. With Protobuf you can define not only required and optional fields but also repeated fields, which may appear 0 to N times, similar to a variable-length array.

The Dremel format is designed to store Protobuf data column by column. This is harder than columnarizing relational data because of the repeated fields. A naive approach would mark the end of each repetition with a terminator, but since the data can be sparse, Dremel introduced a more compact encoding.

As an example, the left half of the figure shows the schema and two Document instances, and the right half shows the serialized columns. After serialization, each column carries two extra values, R and D, standing for Repetition Level and Definition Level, which guarantee that the original data can be deserialized unambiguously.
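
Since the figure is not reproduced here, the following illustrative documents, modeled on the example in the Dremel paper but keeping only the Name/Language fields, are what the discussion below refers to:

```python
# Schema (Protobuf-like, shown as a comment):
#   message Document {
#     repeated group Name {
#       repeated group Language {
#         required string Code;
#         optional string Country;
#       }
#     }
#   }
doc1 = {"Name": [
    {"Language": [{"Code": "en-us", "Country": "us"},
                  {"Code": "en"}]},            # second Language: no Country
    {},                                        # second Name: no Language at all
    {"Language": [{"Code": "en-gb", "Country": "gb"}]},
]}
doc2 = {"Name": [{}]}                          # one Name, no Language
```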


The Repetition Level indicates at which level of the path the repetition happens for the current value. For a field that never repeats, simply fill in the trivial value 0; otherwise, whenever the field can repeat (either itself or through an enclosing repeated structure), R records the level of the element that actually repeated for this value.

For example, the name.language.code column has three non-null values:

  1. The first value, en-us, appears in the first Code of the first Language of the first Name. None of these three elements has repeated yet; each occurs for the first time. So its R = 0.
  2. The second value, en, appears in the next Language; that is, Language is the repeating element. In the path name.language.code, Language is at level 2, so its R = 2.
  3. The third value, en-gb, appears in the next Name; Name is the repeating element, at level 1, so its R = 1.

Note that en-gb belongs to the third Name, not the second. To express this fact, a NULL with R=1 is placed between en and en-gb.

The Definition Level records down to which level of the path the (possibly NULL) value is actually defined, and thus indicates where the nesting stops. For non-NULL values it is simply the trivial value: the level of the field itself.

Take the name.language.country column as another example (the fully encoded columns are sketched after the list below):

  1. us is a non-null value, so D is simply the level of the Country field itself (D=3).
  2. NULL (still inside r1): the current Language (the second Language of the first Name) has no Country; Name and Language are defined, so D=2.
  3. NULL (still inside r1): the second Name has no Language at all, so only Name is defined and D=1.
  4. gb is a non-null value, so again D=3 (the level of the Country field).
  5. NULL (in r2): the Name of r2 contains no Language and hence no Country, so only Name is defined and D=1.
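
Putting R and D together, the two columns of the sample documents shown earlier would be striped roughly as follows (each entry is a value with its R and D; NULL is shown as None):

```python
# name.language.code (required field, so D only depends on its ancestors)
code_column = [
    ("en-us", 0, 2),   # first Code of doc1
    ("en",    2, 2),   # Language repeated inside the same Name
    (None,    1, 1),   # second Name of doc1 has no Language
    ("en-gb", 1, 2),   # Name repeated
    (None,    0, 1),   # doc2: a Name with no Language
]

# name.language.country (optional field)
country_column = [
    ("us", 0, 3),
    (None, 2, 2),      # Language defined, Country missing
    (None, 1, 1),      # only Name defined
    ("gb", 1, 3),
    (None, 0, 1),      # doc2: only Name defined
]
```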

It can be shown that the original data can be reconstructed uniquely from R and D together. To encode and decode efficiently, Dremel builds a state machine at execution time and then runs the column data through it; moreover, by combining the query's requirements with the data structure, the state machine can skip irrelevant data directly.

The state machine is arguably the biggest contribution of the Dremel paper, but for reasons of space, interested readers should consult the original paper.

Conclusion

This article has walked through the storage design of column stores. Despite the many details, the following ideas and designs come up again and again.

  1. Skip irrelevant data. Moving from row storage to column storage eliminates scans of irrelevant columns; ORC's three levels of index information let it quickly skip over irrelevant data fragments.
  2. Encoding doubles as compression and as an index. Dremel's clever nested encoding avoids storing large numbers of NULLs; C-Store's encoding of distinct values is at the same time an index over those values; PowerDrill pushes dictionary encoding to the extreme (see the next article).
  3. Assume the data is immutable. Whether in C-Store, Dremel, or ORC, the encoding and compression schemes do not account for data updates at all. If updates are unavoidable, they are written somewhere else temporarily and merged in at read time.
  4. Partition the data. Handling data at scale requires both vertical and horizontal partitioning; this goes without saying.

The next article will bring together C-Store, MonetDB, Apache Kudu, PowerDrill, and other modern column database systems, focusing on the overall architecture of a columnar DBMS and its distinctive query execution process. Stay tuned!

References

  1. Distinguishing Two Major Types of Column-Stores – Daniel Abadi
  2. Columnar Storage – Amazon Redshift
  3. Weaving Relations for Cache Performance – A Ailamaki, DJ DeWitt, MD Hill, M Skounakis
  4. C-Store and Google BigTable – Greg Linden
  5. The Design and Implementation of Modern Column-oriented Database Systems – D Abadi, P Boncz, S Harizopoulos…
  6. C-Store: A Column-oriented DBMS – M Stonebraker, DJ Abadi, A Batkin, X Chen…
  7. Apache ORC Docs
  8. Dremel: Interactive Analysis of Web-Scale Datasets – S Melnik, A Gubarev, JJ Long, G Romer…

Finally, special thanks to @zhang eggplant for the many suggestions and comments on this article!

This article is licensed under CC BY-NC-SA 3.0. Please indicate the source when reposting!






Original link: ericfu.me/columnar-st…