Automatic Partitioning Inference (1)

Table partitioning is a common optimization technique; Hive, for example, supports partitioned tables. In a partitioned table, data for different partitions is usually stored in separate directories, and the value of each partitioning column is encoded in the directory name. The Parquet data source in Spark SQL can automatically infer partition information from these directory names. For example, if you store population data in a partitioned table with gender and country as partitioning columns, the directory structure might look like this:

tableName
|- gender=male
   |- country=US
      ...
   |- country=CN
      ...
|- gender=female
   |- country=US
      ...
   |- country=CH
      ...

Automatic Partitioning Inference (2)

If the path /tableName is passed to sqlContext.read().parquet() or sqlContext.read().load(), Spark SQL automatically infers the partition information, gender and country, from the directory structure. Even though the data files contain only two columns, name and age, the DataFrame returned by Spark SQL reports four columns when printSchema() is called: name, age, country, and gender. This is automatic partition inference at work.
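For illustration, here is a minimal, self-contained sketch of that read, using the same Spark 1.x Java API as the code later in this article. The path /tableName and the directory layout above are assumptions taken from the example:

package com.etc;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PartitionInferenceExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("PartitionInferenceExample")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // /tableName is the root of the partitioned layout shown above (an assumption).
        // The data files themselves contain only name and age; gender and country
        // are inferred from the gender=.../country=... directory names.
        DataFrame df = sqlContext.read().parquet("/tableName");
        df.printSchema();
        // Prints four columns: name and age from the data files,
        // plus the inferred partition columns gender and country.

        sc.stop();
    }
}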

In addition, the data types of the partitioning columns are inferred automatically. Currently, Spark SQL supports automatic inference only for numeric and string types. Sometimes, however, you may not want Spark SQL to infer the types of partitioning columns. In that case, set the configuration property spark.sql.sources.partitionColumnTypeInference.enabled. It defaults to true, meaning partitioning column types are inferred automatically; setting it to false disables type inference. When automatic type inference is disabled, every partitioning column defaults to the string type.
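As a short sketch, this property can be set on the SQLContext before reading; the configuration key is taken from the text, and the sqlContext variable is assumed to be set up as in the example below:

// Assumes an existing SQLContext, as in the example code below.
// After this call, partition columns such as gender and country are
// always read as string; no numeric type inference is attempted.
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false");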

Code:

package com.etc;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

/**
 * @author: fengze
 * @description:
 * Automatic partition inference for Parquet data sources.
 * Sometimes we may not want Spark SQL to infer the types of partition
 * columns automatically; in that case, set
 * spark.sql.sources.partitionColumnTypeInference.enabled to false
 * (its default value is true).
 */
public class ParquetPartitionDiscovery {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("ParquetPartitionDiscovery")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame json = sqlContext.read().json("D:\\Documents\\Tencent Files\\1433214538\\FileRecv\\stage 1 code\\Lecture 76 - Spark SQL: Common load and save operations for data sources\\Documents\\people.json");

        json.printSchema();
        json.show();
        //root
        // |-- age: long (nullable = true)
        // |-- name: string (nullable = true)
    }
}
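To round out the example, here is a hedged sketch of how such a partitioned layout can be produced in the first place with DataFrameWriter.partitionBy (available since Spark 1.4), and then read back so that the partition columns reappear in the schema. The input path and the presence of gender and country columns in the source data are assumptions, not part of the original example:

        // Assumption: this source file contains name, age, gender and country columns.
        DataFrame people = sqlContext.read().json("people_with_partitions.json");

        // Writing with partitionBy creates the gender=.../country=... directory layout;
        // the partition column values are stored in the directory names, not the files.
        people.write()
                .partitionBy("gender", "country")
                .parquet("/tmp/tableName");

        // Reading the root directory back: Spark SQL infers gender and country
        // from the directory names and adds them to the schema alongside name and age.
        DataFrame restored = sqlContext.read().parquet("/tmp/tableName");
        restored.printSchema();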