This article comes from the Alibaba Cloud community: https://yq.aliyun.com/articles/743514 (author: xy_xin).

What they have in common

Broadly speaking, all three are data-storage middle layers for a data lake, and their data-management functions are all built on a set of meta files. These meta files play a role similar to a database catalog and WAL, handling schema management, transaction management, and data management. Unlike in a database, however, the meta files are stored in the storage system together with the data files and are directly visible to users. This inherits the big-data tradition of keeping data visible to users, but it also increases the risk of accidental damage: once a user deletes the meta directory by mistake, the table is broken and very hard to recover.

The meta files contain the table's schema information, so the system controls schema changes and can support schema evolution. The meta files also act as a transaction log (relying on the file system for atomicity and consistency): every change to the table produces a new meta file, which gives the system ACID semantics and multi-versioning, along with access to historical versions. In these respects, the three are the same.

Let’s talk about the differences.

Hudi

Let's start with Hudi. Hudi's design goal is exactly what its name implies: Hadoop Upserts Deletes and Incrementals, with the emphasis on upserts, deletes, and incremental data processing. The main write paths it provides are the Spark Hudi DataSource API and DeltaStreamer, both of which support three write operations: UPSERT, INSERT, and BULK_INSERT. Deletes are supported by setting certain options at write time; there is no standalone delete interface.
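
To make the write path concrete, here is a minimal sketch of an upsert through the Spark DataSource API, assuming df is a DataFrame with id, ts, and dt columns; the option keys are Hudi's documented config names, but the table path and field names are hypothetical and details vary across Hudi versions.

```scala
df.write
  .format("hudi")                                            // "org.apache.hudi" in older releases
  .option("hoodie.table.name", "events")
  .option("hoodie.datasource.write.recordkey.field", "id")   // forms the HoodieKey
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")  // keeps the latest version when keys collide
  .option("hoodie.datasource.write.operation", "upsert")     // or "insert" / "bulk_insert"
  .mode("append")
  .save("/data/hudi/events")
```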

A typical usage scenario is to pull upstream data from Kafka or Sqoop and write it into Hudi through DeltaStreamer. DeltaStreamer is a resident service that continuously pulls data from upstream and writes it to Hudi. It writes in batches, and the scheduling interval between batches is configurable; the default interval is 0, similar to Spark Streaming's as-soon-as-possible policy. As data keeps arriving, small files accumulate, and DeltaStreamer can automatically trigger compaction tasks to merge them.

On the query side, Hudi supports Hive, Spark, and Presto.

On the performance side, Hudi designed a HoodieKey, which is something like a primary key. For HoodieKeys it maintains min/max statistics and a BloomFilter, used to quickly locate the file that contains a given record. During an upsert, if the BloomFilter indicates a HoodieKey does not exist, the record is inserted; otherwise the file is checked, and if the key really is present the record is updated. This HoodieKey + BloomFilter approach to upserts is fairly efficient; without it, a join against the entire table would be needed to implement upserts. As for query performance, the usual requirement is to push filters derived from query predicates down to the data source. Hudi does not do much extra work here, so its performance depends entirely on the engine's built-in predicate push-down and partition pruning.
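
The following is an illustrative sketch of the idea only, not Hudi's actual classes: each data file keeps a bloom filter over the record keys it stores, so an incoming key only needs to be checked against the few files whose filters might contain it.

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

// One index entry per data file; Hudi itself stores the bloom filter in the file footer.
case class FileIndex(path: String, bloom: BloomFilter[CharSequence], keys: Set[String])

def indexFile(path: String, keys: Set[String]): FileIndex = {
  val bloom = BloomFilter.create[CharSequence](
    Funnels.stringFunnel(StandardCharsets.UTF_8), math.max(keys.size, 1))
  keys.foreach(bloom.put(_))
  FileIndex(path, bloom, keys)
}

// Some(file) => the key already exists in that file, so the record is an update;
// None       => no file can contain the key, so the record is an insert.
def tagLocation(recordKey: String, files: Seq[FileIndex]): Option[String] =
  files.find(f => f.bloom.mightContain(recordKey) && f.keys.contains(recordKey))
       .map(_.path)
```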

Hudi also supports both Copy On Write and Merge On Read tables. The former merges data at write time, so writes are slightly slower but reads are faster; the latter merges at read time, so reads pay some cost but data becomes queryable soon after it is written, which makes Merge On Read suitable for near-real-time analytics.
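
The table type is chosen through a write option; a minimal sketch (the remaining options are as in the earlier example, and exact defaults may differ across Hudi versions):

```scala
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  // default is COPY_ON_WRITE
  // ... record key / precombine / table name options as in the earlier sketch ...
  .mode("append")
  .save("/data/hudi/events")
```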

Finally, Hudi provides a run_sync_tool script for synchronizing table schemas to Hive, as well as a command-line tool (Hudi CLI) for managing Hudi tables.

[Figure: Hudi]

Iceberg

Iceberg has no HoodieKey-like design and does not emphasize primary keys. As mentioned above, without primary keys, update/delete/merge operations have to be implemented through joins, which requires an SQL-like execution engine. Iceberg neither binds to a specific engine nor has its own, so update/delete/merge are not supported by Iceberg itself. If users need to update data, the practical approach is to figure out which partitions are affected and overwrite them. The quickstart and Spark documentation on Iceberg's official website only describe writing data with the Spark DataFrame API and do not mention other ingestion paths. As for streaming writes such as Spark Streaming, they inevitably produce small files, and the website does not say how small files are merged. My guess is that Iceberg's streaming writes and small-file compaction are perhaps not yet production-ready, which is why they are not mentioned (purely a personal guess).
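
For reference, the DataFrame write path mentioned on the website looks roughly like the following, assuming df is a Spark DataFrame and a path-based (Hadoop tables) layout; the path is hypothetical and the exact API differs across Iceberg versions.

```scala
df.write
  .format("iceberg")
  .mode("append")
  .save("hdfs://namenode/warehouse/db/events")
```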

For queries, Iceberg supports Spark and Presto.

Iceberg has done a lot of work on query performance, and the hidden partition feature is particularly worth mentioning. Hidden partitioning means the user can pick some columns of the incoming data and apply an appropriate transform to derive a new column that serves as the partition column. This partition column is only used to partition the data and does not appear directly in the table schema. For example, from a TIMESTAMP column a partition column such as timestamp_hour can be derived via hour(timestamp); timestamp_hour is invisible to the user and is only used to organize the data. Iceberg also keeps statistics about partition columns, such as the data range each partition covers, and at query time these statistics are used for partition pruning.

In addition to hidden partitions, Iceberg also collects statistics for ordinary columns. These statistics are quite complete, including column size, value count, null-value count, and column min/max values. All of this information can be used to filter data at query time.

Iceberg also provides an API for building tables. With it, users can specify the table name, schema, partition information, and so on, and then create the table in a Hive catalog.
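
A minimal sketch of that API, combining the hidden-partition transform described above with table creation in a Hive catalog; the names are hypothetical, and constructor and API details vary across Iceberg versions.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.types.Types

// Table schema as the user sees it -- note there is no hour column in it.
val schema = new Schema(
  Types.NestedField.required(1, "id", Types.LongType.get()),
  Types.NestedField.required(2, "ts", Types.TimestampType.withZone()),
  Types.NestedField.optional(3, "payload", Types.StringType.get()))

// hour(ts) becomes a hidden partition: it organizes and prunes files
// but never shows up in the queried schema.
val spec = PartitionSpec.builderFor(schema).hour("ts").build()

// Register the table in the Hive Metastore; Iceberg manages its own metadata files underneath.
val catalog = new HiveCatalog(new Configuration())
catalog.createTable(TableIdentifier.of("db", "events"), schema, spec)
```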

Delta

Finally, Delta. Delta positions itself as a data lake storage layer that unifies streaming and batch, and it supports update, delete, and merge. Since it comes from Databricks, all of Spark's write paths are supported on the ingest side, including DataFrame-based batch and streaming writes as well as SQL INSERT and INSERT OVERWRITE (the open-source version does not support SQL writes; the EMR version does). Like Iceberg, Delta does not emphasize primary keys, so its update/delete/merge are implemented through Spark joins. On the write side, Delta is strongly bound to Spark, which differs from Hudi: Hudi's data writes are not bound to Spark (data can be written with Spark or with Hudi's own write tool).
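
For illustration, here are the two DataFrame write paths side by side, assuming df is a batch DataFrame and streamingDf a streaming one; the paths are hypothetical.

```scala
// Batch write
df.write.format("delta").mode("append").save("/data/delta/events")

// Structured Streaming write to the same table
streamingDf.writeStream
  .format("delta")
  .option("checkpointLocation", "/data/delta/events/_checkpoints")
  .outputMode("append")
  .start("/data/delta/events")
```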

For queries, open-source Delta currently supports Spark and Presto, but Spark is indispensable because the Delta log can only be processed with Spark. That means querying Delta with Presto still requires running a Spark job. Worse, Presto queries go through SymlinkTextInputFormat: before querying, a Spark job must be run to generate the symlink files. If the table is updated in real time, every query means running Spark SQL first and then Presto, so why not just do everything in Spark SQL? It is a rather painful design. EMR improves on this by supporting DeltaInputFormat, which lets users query Delta data directly with Presto without launching a Spark task first.
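
Generating those symlink manifests is itself done from Spark; a minimal sketch using the open-source Delta Lake API (the path is hypothetical), which has to be re-run whenever the data changes (newer versions can also auto-generate it via a table property):

```scala
import io.delta.tables.DeltaTable

// Writes _symlink_format_manifest files that Presto's SymlinkTextInputFormat can read.
DeltaTable.forPath(spark, "/data/delta/events")
  .generate("symlink_format_manifest")
```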

Open-source Delta has almost no query-performance optimizations: there is neither anything like Iceberg's hidden partitioning nor statistics on ordinary columns. Databricks has kept its prized Data Skipping technology out of the open-source version. Frankly, this does not help Delta's adoption. The EMR team is doing some work in this area, trying to fill the gap.

So Delta is not as good as Hudi at data ingestion and merging, and not as good as Iceberg at queries. Does that make Delta useless? Not at all. One of Delta's core strengths is its integration with Spark (still imperfect, but it will get much better after Spark 3.0), especially its unified streaming-and-batch design: with a multi-hop data pipeline it can support analytics, machine learning, CDC, and other scenarios. Flexible usage and broad scenario coverage are its biggest advantages over Hudi and Iceberg. In addition, Delta claims to be an improvement on the Lambda and Kappa architectures, freeing users from worrying about the streaming/batch split and the surrounding architecture. This is where Hudi and Iceberg cannot keep up.
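
One hop of such a multi-hop pipeline might look like the sketch below: a raw ("bronze") Delta table is read as a stream, lightly refined, and continuously written to a downstream ("silver") Delta table. The paths and the cleansing step are hypothetical; the point is that the same silver table can afterwards be queried as an ordinary batch table, which is the sense in which streaming and batch are unified.

```scala
val bronze = spark.readStream.format("delta").load("/pipeline/bronze")

bronze
  .filter("payload IS NOT NULL")                 // placeholder cleansing step
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/pipeline/silver/_checkpoints")
  .outputMode("append")
  .start("/pipeline/silver")
```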

[Figure: Delta]

Conclusion

From the analysis above, it is clear that the three engines were not designed for exactly the same scenarios: Hudi targets incremental upserts, Iceberg targets high-performance analytics and reliable data management, and Delta targets unified streaming-and-batch data processing. These differences in target scenarios lead to differences in design, with Hudi's design in particular standing apart from the other two. Over time, all three are filling in their missing capabilities, and they may well converge and move into each other's territory in the future; who wins and who loses is, of course, still unclear.
