Recently I have been busy with my graduation project and other work, and haven't written up any learning notes or work records for a long time. I'm starting a new series here, and I hope to keep writing down what I learn about distributed storage. Why call it a "small perspective"? Because the content will be somewhat scattered: the idea is to examine the storage and computation of massive data from one narrow angle at a time, splitting each topic into two or three posts. As the saying goes, you can glimpse the whole leopard through a single spot; I hope the process sharpens my own understanding, and corrections from readers are very welcome. This first post is based on a Facebook paper, "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems". Let's take a look at how massive amounts of data can be laid out to fit real computing needs. Let's get started.

The layout of data profoundly affects the efficiency of data processing: how data is organized in the underlying storage system directly shapes the design and implementation of the query engine, and also determines how efficiently storage space is used. A good storage layout makes better use of space and matches the query patterns of the application. Next, let's look at how storage formats evolve as data-processing needs change.

In the traditional database system, the following data layout structures are derived:

·(1) Horizontal row storage structure

·(2) Vertical column storage structure

·(3) Hybrid PAX storage structure

Each of these data layout methods has its advantages and disadvantages, so let’s sort it out step by step:

Row storage dominates traditional databases: MySQL's MyISAM .MYD files, InnoDB's .ibd files, and Hive's SequenceFile are all row-oriented. As shown in the figure below, each record is organized under the N-ary storage model (NSM), with records laid out sequentially one after another:

[Figure: row-oriented layout, with whole records stored contiguously]


The advantage of this layout is that, since all fields of a row are stored together, a single row can be loaded quickly, which makes it ideal for the insert, update, delete, and point-lookup operations typical of OLTP databases.

On the other hand, the disadvantage is just as obvious: row storage is a poor fit for OLAP workloads over massive data:

·(1) When a query touches only a single column or a few columns, a lot of unnecessary data must still be read, causing a large performance loss. Because the unneeded columns are loaded on every access, the cache fills with useless data, and the cost multiplies as data volume grows.

·(2) Values stored together in a row have low similarity, so it is hard to achieve a high compression ratio; row storage therefore takes up comparatively more space.

Row storage is therefore unsuited to analytical queries over massive data, which motivated a new storage mode: column storage.

The column storage structure avoids these drawbacks: unnecessary columns are simply never read during a query, and because values of the same column are stored together, high compression ratios come easily, saving space.
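The two properties above can be sketched in a few lines of Python. This is only an illustration, not any real storage engine; the table and field names are made up:

```python
import json
import zlib

# A hypothetical table of 1000 records (names are illustrative).
rows = [{"id": i, "country": "US", "score": 3.5} for i in range(1000)]

# Row layout: every field of every record serialized together.
row_bytes = json.dumps(rows).encode()

# Column layout: values of the same column stored contiguously.
cols = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes = {k: json.dumps(v).encode() for k, v in cols.items()}

# Reading only "score" touches just that column's bytes,
# far fewer than scanning the whole row-oriented table.
print(len(col_bytes["score"]) < len(row_bytes))  # → True

# A homogeneous column (1000 identical strings) compresses extremely well.
country_packed = zlib.compress(col_bytes["country"])
print(len(country_packed) < len(col_bytes["country"]))  # → True
```

The second print hints at why columnar formats compress so well: values in one column share a type and often a narrow value range, so generic compressors find far more redundancy than in interleaved rows.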

[Figure: column-oriented layout, with one record's fields spread across multiple HDFS blocks]


There is no such thing as a free lunch. For all its excellent properties, column storage brings its own disadvantage, as shown in the figure above: when a single row must be reconstructed, column storage cannot guarantee that all of its fields reside on the same DataNode, and for a large table keeping them together is essentially impossible. In the figure, the four fields of one record fall into three different HDFS blocks, very likely distributed across different DataNodes. Reconstructing rows therefore consumes a large amount of the cluster's network bandwidth.

Worse, when data is deleted, the columns of a record scattered across multiple DataNodes must be updated in sync, creating a very thorny consistency problem.

So although column storage suits analytical queries on a single machine, it becomes expensive once massive data is spread across a distributed storage system.

· Row storage accesses whole data records flexibly, but its cache utilization and compression are poor.

· Column storage clearly wins on I/O performance and compression, but handling a single row's data in a distributed environment is unsatisfying.

Since each approach has its strengths, why not combine them? That leads to the third option: the hybrid PAX storage structure.

PAX (Partition Attributes Across) was originally a hybrid layout designed to improve CPU cache performance for records with many fields. PAX keeps all the fields of each record on the same page, just as row storage does, but within the page it groups the values of each field into its own "minipage", so scanning a single field touches contiguous memory.
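A toy PAX-style page can make this concrete. This is a sketch under simplifying assumptions (no real page headers or offset arrays; the class, capacity, and field names are all illustrative):

```python
PAGE_CAPACITY = 4  # records per page, arbitrary for this sketch

class PaxPage:
    def __init__(self, attrs):
        # one minipage (list of values) per attribute
        self.minipages = {a: [] for a in attrs}
        self.num_records = 0

    def insert(self, record):
        # a whole record lands on one page, like row storage...
        if self.num_records >= PAGE_CAPACITY:
            raise ValueError("page full")
        for attr, value in record.items():
            self.minipages[attr].append(value)
        self.num_records += 1

    def scan_attr(self, attr):
        # ...but one attribute is read from contiguous storage,
        # which is what improves CPU cache behavior
        return self.minipages[attr]

page = PaxPage(["id", "name"])
for i in range(3):
    page.insert({"id": i, "name": "u%d" % i})
print(page.scan_attr("id"))  # → [0, 1, 2]
```

Note the trade-off PAX makes: it changes layout only *within* a page, so it improves cache behavior without affecting where a record lives on disk.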

Facebook's Record Columnar File (RCFile) borrows the PAX model: it partitions data horizontally first, guaranteeing that all fields of a row reside on the same DataNode, and then partitions vertically within each group, using column storage to optimize query and compression performance on that single DataNode.

[Figure: RCFile layout: row groups within an HDFS block, each stored column by column]


As shown in the figure above, an RCFile arranges data as a sequence of row groups within each HDFS block. Each row group contains three parts:

· A sync marker that separates adjacent row groups

· Metadata (a header recording how many records the row group holds, how many bytes each column occupies, and how many bytes each field occupies within its column)

· The column-stored data itself (the actual content; different columns can use different compression algorithms to squeeze the most out of storage space)
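The horizontal-then-vertical partitioning above can be sketched as follows. This is a simplified illustration, not the real on-disk format: groups here are sized by row count rather than bytes, the field delimiter is arbitrary, and all names are made up:

```python
import zlib

GROUP_SIZE = 4  # rows per group; real RCFile sizes groups in bytes

def build_row_groups(rows, columns):
    groups = []
    # horizontal partition: split the table into row groups
    for start in range(0, len(rows), GROUP_SIZE):
        chunk = rows[start:start + GROUP_SIZE]
        col_blobs = {}
        meta = {"num_rows": len(chunk), "col_bytes": {}}
        for col in columns:
            # vertical partition: one independently compressed blob per column
            raw = "\x01".join(str(r[col]) for r in chunk).encode()
            col_blobs[col] = zlib.compress(raw)
            meta["col_bytes"][col] = len(col_blobs[col])
        groups.append({"meta": meta, "columns": col_blobs})
    return groups

rows = [{"a": i, "b": i * 10} for i in range(10)]
groups = build_row_groups(rows, ["a", "b"])
print([g["meta"]["num_rows"] for g in groups])  # → [4, 4, 2]
```

Because each group holds complete rows, a group (and hence every field of its rows) can live inside a single HDFS block on one DataNode, while the per-column blobs inside it still enjoy columnar compression.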

With that picture of RCFile in mind, let's dig into two details from the paper:

· Lazy decompression:

For example, suppose we run a query that filters on a column, say selecting rows where column a is greater than 1.

Lazy decompression means a column is not decompressed in memory until the execution engine determines that its data actually needs to be processed. It pays off in conditional queries: if no record in a row group satisfies the condition, the remaining columns of that group never need to be decompressed, which greatly reduces memory and CPU usage.

In the query above, if every value of a in a row group is less than or equal to 1, the rest of the group's contents need not be decompressed and the group can be skipped. This, of course, depends on the group's metadata.
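A minimal sketch of this skip logic, continuing the hypothetical compressed-column layout (helper names and the "\x01" delimiter are made up, not RCFile's actual encoding):

```python
import zlib

def pack(vals):
    return zlib.compress("\x01".join(map(str, vals)).encode())

def unpack(blob):
    return [int(v) for v in zlib.decompress(blob).decode().split("\x01")]

# Two hypothetical row groups, each with columns a and b.
groups = [
    {"a": pack([0, 1, 1]), "b": pack([10, 20, 30])},  # no a > 1: skippable
    {"a": pack([0, 2, 5]), "b": pack([40, 50, 60])},  # some a > 1
]

def scan_b_where_a_gt_1(groups):
    out, skipped = [], 0
    for g in groups:
        a_vals = unpack(g["a"])      # predicate column: always decompressed
        mask = [a > 1 for a in a_vals]
        if not any(mask):
            skipped += 1             # lazy win: column b stays compressed
            continue
        b_vals = unpack(g["b"])      # decompressed only when some row qualifies
        out += [b for b, keep in zip(b_vals, mask) if keep]
    return out, skipped

print(scan_b_where_a_gt_1(groups))  # → ([50, 60], 1)
```

The first group is filtered out after decompressing only its predicate column; its b column is never touched, which is exactly the saving lazy decompression is after.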

·Row Group size:

Obviously, a larger row group is more conducive to data compression, but an oversized group hurts the lazy decompression described above, since each skip-or-decompress decision then covers much more data. Bigger is not always better, so Facebook settled on a 4 MB row group size. (Keep that number in mind; we'll come back to it later.)

This post reviewed the evolution from row storage to RCFile from the perspective of data layout, and analyzed the scenarios each storage layout suits. In the next post we will continue exploring this question and see how ORCFile and Parquet push data storage formats for large-scale OLAP applications even further.