Sedona Sql introduction - Intersection analysis of ten million points

The project has always been in need of spatial big data. Before, we used the spatial-Framework-for-Hadoop project to process spatial data. Due to the lack of spatial index, the processing speed has not been ideal, so WE have been looking for a suitable framework to process spatial data. Apache Sedona™(Incubating) is a Spark framework for large spatial data processing. It was named GeoSpark, and was incubated by Apache and renamed Sedona. Compared to traditional analysis tools such as ArcGIS and QGIS, Sedona can provide better distributed spatial analysis. GraalVM is the next generation of Java Virtual machines released by Oracle, available in community and enterprise versions. The Facebook team used GraalVM as an alternative to OpenJDK. In this scenario, migrating to GraalVM is very simple — you just need to switch running environments, and no application code changes are required. This transformation makes the application run faster thanks to GraalVM’s advanced performance optimization without any manual tuning. In this paper, GraalVM and Apache Sedona are used to analyze the intersection of more than 10 million faces and points. The framework uses Spark(3.1.2),Hadoop(3.2),Sedona(1.1.1-Incubating).

Data preparation

Data is still taken from Microsoft open source building data, processing the ID and WKT CSV,(WKT can not be used in single quotes, blood and tears experience)

Processing flow

Ensp initialization code, to set serialization, set partition, set to use rtree for spatial index

SparkSession spark = SparkSession.builder().
                     config("spark.serializer","org.apache.spark.serializer.KryoSerializer").
                     config("sedona.join.numpartition",900).config("sedona.globel.indextype","rtree").
                     config("spark.kryo.registrator","org.apache.sedona.core.serde.SedonaKryoRegistrator").
                     master("local[*]").appName("cn.dev.Learn04").getOrCreate();
SedonaSQLRegistrator.registerAll(spark);

Copy the code

Ensp loads point surface data and generates point surface data set

String inputCSVPath = Learn042.class.getResource("/capnt.csv").toString();
Dataset pointDF = spark.read().format("csv").
        option("delimiter", ",").
        option("header", "true").
        load(inputCSVPath);
pointDF.createOrReplaceTempView("pointdf");

String inputBuildingCSVPath = Learn042.class.getResource("/california20191107.csv").toString();
Dataset buildingDF = spark.read().format("csv").
        option("delimiter", ",").
        option("header", "true").
        load(inputBuildingCSVPath);
buildingDF.createOrReplaceTempView("buildingdf");
Copy the code

Point plane intersection analysis

sqlText = "select a.id,b.id from point as a,building as b where ST_contains(b.shape, a.shape)";
spatialDf = spark.sql(sqlText);
spatialDf.createOrReplaceTempView("re");
Copy the code

Results the optimized

The full results take about five minutes, which can be optimized to four minutes with the Graalvm virtual machine and caching of the dataset. Sedona is also a highly efficient spatial big data framework, which deserves more research.

spatialDf.cache();
spatialDf.count();
Copy the code

References:

Github.com/apache/incu… Zhuanlan.zhihu.com/p/106555993 blog.csdn.net/huangmingle… www.jianshu.com/p/e1d262ac3…

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

Sedona Sql introduction – Intersection analysis of ten million points

Data preparation

Processing flow

Results the optimized

References:

Sedona Sql introduction – Intersection analysis of ten million points

Data preparation

Processing flow

Results the optimized

References:

Related Posts

Challenges and solutions for building large-scale clusters with Kubernetes and Istio

Concurrent notes 01- Causes of concurrent problems

Java from the beginning to the grave series learning path directory index (continuously updated ~~~)