Why convert an RDD to a DataFrame? Once converted, Spark SQL can run SQL queries directly against any data that can be loaded as an RDD, such as files in HDFS. This is incredibly powerful: imagine being able to query data in HDFS directly using SQL.

Spark SQL supports two ways to convert an RDD to a DataFrame.

  • The first approach uses reflection to infer the metadata of an RDD that contains objects of a particular type. This reflection-based approach leads to concise code and works well when you already know the metadata of your RDD.

  • The second way to create a DataFrame is through a programmatic interface: you build the metadata dynamically while the program is running and then apply it to an existing RDD. The code is more verbose, but if you don’t know the metadata of the RDD when you write the program and only learn it at runtime, this is the only way to construct it.

Using reflection to infer metadata

  • Java version: Spark SQL supports converting an RDD of JavaBeans to a DataFrame; the JavaBean definition supplies the metadata. Spark SQL does not support JavaBeans containing complex data, such as nested JavaBeans or Lists. Only JavaBeans whose fields are simple data types are supported.

  • Scala version: Thanks to Scala’s implicit conversions, Spark SQL’s Scala interface automatically converts an RDD containing case classes to a DataFrame; the case class defines the metadata. Spark SQL reads the names of the case class’s constructor parameters through reflection and uses them as column names. Unlike the Java version, Spark SQL supports case classes that contain nested data structures, such as arrays, as metadata.
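The reflection-based Scala approach can be sketched as follows. This is a minimal example assuming Spark 2.x with a local master; the `Person` class and the `people.txt` path (comma-separated "name,age" lines) are illustrative, not from the original text.

```scala
import org.apache.spark.sql.SparkSession

// The case class defines the metadata: its parameter names become the column names.
case class Person(name: String, age: Int)

object ReflectionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReflectionExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables the implicit RDD-to-DataFrame conversion

    // Build an RDD of case-class instances from a text file of "name,age" lines.
    val peopleRDD = spark.sparkContext
      .textFile("people.txt")
      .map(_.split(","))
      .map(attrs => Person(attrs(0), attrs(1).trim.toInt))

    // toDF() infers the schema (name: string, age: int) via reflection.
    val peopleDF = peopleRDD.toDF()
    peopleDF.createOrReplaceTempView("people")

    // Query the RDD-backed data directly with SQL.
    spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").show()

    spark.stop()
  }
}
```

The single `import spark.implicits._` line is what brings the implicit conversion into scope; without it, `toDF()` is not available on the RDD.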

Specifying metadata programmatically

  • Java version: When the JavaBean cannot be defined and known up front, for example when the data structure is read dynamically from a file, the metadata can only be specified programmatically. First, create an RDD of Rows from the original RDD. Next, create a StructType that describes the structure of those Rows. Finally, apply the dynamically defined metadata to the RDD of Rows.

  • Scala version: Scala is implemented in the same way as Java.
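The three steps above can be sketched in Scala as follows. This is a hedged example assuming Spark 2.x; the `people.txt` input and the hard-coded `name`/`age` fields stand in for metadata that, in a real program, would only become known at runtime.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ProgrammaticSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ProgrammaticSchema")
      .master("local[*]")
      .getOrCreate()

    // Step 1: convert the original RDD into an RDD of Rows.
    val rowRDD = spark.sparkContext
      .textFile("people.txt")
      .map(_.split(","))
      .map(attrs => Row(attrs(0), attrs(1).trim.toInt))

    // Step 2: build a StructType describing the structure of those Rows.
    // In practice these field names and types would be determined at runtime.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)
    ))

    // Step 3: apply the dynamically defined metadata to the RDD of Rows.
    val peopleDF = spark.createDataFrame(rowRDD, schema)
    peopleDF.createOrReplaceTempView("people")
    spark.sql("SELECT * FROM people").show()

    spark.stop()
  }
}
```

The Java version follows the same three steps, using `JavaRDD<Row>` and the same `StructType` builder classes before calling `createDataFrame`.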