Problem description

In the test environment, using Spark SQL3.1.1 to query the Iceberg tables stored on Hive Metastore and OSS, many tasks with very small data volumes were found.

The query schema is as follows:

SELECT * FROM hive_prod.iceberg_db.store_sales WHERE ss_customer_sk = 10702517;

View the query details in Spark’s Dashboard as follows:

Among the 12 tasks we divided out here, the data of 3 tasks is about 140MB, and the data of the other 9 tasks is about 13KB. The question is, why are 9 13KB tasks assigned?

In the following text, we will analyze this problem in detail.

Reasons for investigation

First of all, through Debug we can determine the stage of the click Plan Task, and push the filter SS_Customer_SK = 10702517 down to the level of the Parquet file. Since there are nearly 2000 files in the entire table and only 12 tasks are allocated, there must have been a push down at the file level, otherwise there would have been at least thousands of tasks.

Why are there 12 tasks?

This is because the value ss_customer_sk=10702517 falls into three different files on three different partitions through max-min analysis. All three files are around 512MB in size. The default value of Iceberg’s is 128MB. This means that the three files will be divided into tasks with a size of 128MB. This will add up to 12 tasks. Each task has a size of 128MB and the minimum unit allocated is the RowGroup size.

Because the user writes the data, according to the SS_CUSTOMER_SK sort and then write to the Iceberg table. So for each of the three files, only one row-group has the value ss_customer_sk. In other words, only three of the 12 tasks have the value to search for.

Analysis found that iceberg code again iceberg of parquet reader is already put filter on the row – level group (code reference:…

This means that the remaining 9 tasks simply check the statistics and conclude that the key must not be in this task. Thus, we can see that only 3 of the 12 tasks scanned 128MB of data, while the remaining 9 tasks only scanned 13KB of data.

Summary: Essentially, because Iceberg supports filtering down to row-group level, a large amount of row-group data can skip the scan directly. Those empty tasks are the skipped row-groups.

For more information about Apache Iceberg and Data Lake technology, you can apply to join the Data Lake Technology Exchange Group.

Or follow the Chinese community official account to get the latest information and share: