This article, contributed by community user Xrfinbj, describes the process of importing data from a Hive warehouse into Nebula Graph with the Exchange tool.

1 Background

Nebula Graph was selected for an in-house scenario, and we needed to validate its query performance against a real business workload, so there was an urgent need to import our data into Nebula Graph for verification. The documentation for importing data from a Hive warehouse into Nebula Graph via the Exchange tool was incomplete, so I documented the pitfalls of the process to give back to the community and spare later users the same detours.

This article is based on my two previous posts in the forum:

  • How to import Hive data with Exchange
  • Exchange failed to import data from Hive

2 Environment Information

  • Nebula Graph version: nightly
  • Deployment mode (distributed, standalone, Docker, DBaaS): Docker on a Mac
  • Hardware information
    • Hard disk (SSD/HDD): SSD (Mac)
    • CPU and memory: 16 GB
  • Data warehouse environment (a local warehouse built on the Mac):
    • Hive 3.1.2
    • Hadoop 3.2.1
  • Exchange tool: github.com/vesoft-inc/…

The JAR package is generated by compiling the Exchange source.

  • Spark

spark-2.4.7-bin-hadoop2.7 (the core-site.xml and hdfs-site.xml for Hadoop 3.2.1 are placed in the conf directory, and spark-env.sh is set accordingly). Scala code runner version 2.13.3 — Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.

3 Configuration

1 Nebula Graph DDL

```ngql
-- Create the graph space; in this example only one replica is needed
CREATE SPACE test_hive(partition_num=10, replica_factor=1);
USE test_hive;
-- Create tag tagA
CREATE TAG tagA(idInt int, idString string, tBoolean bool, tdouble double);
-- Create tag tagB
CREATE TAG tagB(idInt int, idString string, tBoolean bool, tdouble double);
-- Create edge type edgeAB
CREATE EDGE edgeAB(idInt int, idString string, tBoolean bool, tdouble double);
```
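As a sanity check (a minimal sketch using Nebula Graph 1.x nGQL statements), you can confirm the schema landed before running the import:

```ngql
USE test_hive;
SHOW TAGS;          -- should list tagA and tagB
SHOW EDGES;         -- should list edgeAB
DESCRIBE TAG tagA;  -- verify property names and types match the Exchange config
```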

2 Hive DDL

```sql
CREATE TABLE `tagA`(
  `id` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into tagA select 1,1,'str1',true,11.11;
insert into tagA select 2,2,'str2',false,22.22;

CREATE TABLE `tagB`(
  `id` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into tagB select 3,3,'str3',true,33.33;
insert into tagB select 4,4,'str4',false,44.44;

CREATE TABLE `edgeAB`(
  `id_source` bigint,
  `id_dst` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into edgeAB select 1,3,5,'edge1',true,55.55;
insert into edgeAB select 2,4,6,'edge2',false,66.66;
```
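Before wiring up Exchange, it is worth confirming the tables read back correctly (a minimal check; the `nebula` Hive database name is assumed from the `exec` statements in the Exchange config):

```sql
use nebula;
select count(*) from tagA;   -- expect 2 rows
select * from edgeAB;        -- expect the two rows inserted above
```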

3 My latest nebula_application.conf file

Note the `exec`, `fields`, `nebula.fields`, `vertex`, `source`, and `target` field mappings.

```hocon
{
  # Spark relation config
  spark: {
    app: {
      name: Spark Writer
    }
    driver: {
      cores: 1
      maxResultSize: 1G
    }
    cores {
      max: 5
    }
  }

  # Nebula Graph relation config
  nebula: {
    address: {
      graph: ["192.168.1.110:3699"]
      meta: ["192.168.1.110:45500"]
    }
    user: user
    pswd: password
    space: test_hive
    connection {
      timeout: 3000
      retry: 3
    }
    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /tmp/error
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }

  # Processing tags
  tags: [
    # Loading from Hive
    {
      name: tagA
      type: {
        source: hive
        sink: client
      }
      exec: "select id,idint,idstring,tboolean,tdouble from nebula.taga"
      fields: [id,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      vertex: id
      batch: 256
      partition: 10
    }
    {
      name: tagB
      type: {
        source: hive
        sink: client
      }
      exec: "select id,idint,idstring,tboolean,tdouble from nebula.tagb"
      fields: [id,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      vertex: id
      batch: 256
      partition: 10
    }
  ]

  # Processing edges
  edges: [
    # Loading from Hive
    {
      name: edgeAB
      type: {
        source: hive
        sink: client
      }
      exec: "select id_source,id_dst,idint,idstring,tboolean,tdouble from nebula.edgeab"
      fields: [id_source,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      source: id_source
      target: id_dst
      batch: 256
      partition: 10
    }
  ]
}
```
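To make the mapping concrete: `fields` and `nebula.fields` pair up by position, so with the config above the Nebula Graph property `idInt` receives the Hive column `id`, and the vertex ID comes from the `vertex: id` setting. For the first row of the Hive table `tagA`, the client sink would send something roughly equivalent to the following (a sketch, not literal Exchange output):

```ngql
-- Hive row: id=1, idString='str1', tboolean=true, tdouble=11.11
INSERT VERTEX tagA(idInt, idString, tboolean, tdouble) VALUES 1:(1, "str1", true, 11.11);
```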

4 Performing Import

4.1 Ensure the Nebula Graph service is started

4.2 Ensure the Hive tables and data are ready

4.3 Run the spark-sql CLI to check that the Hive tables and data are normal, which also verifies that the Spark environment works

4.4 After all configuration is complete, run spark-submit:

```shell
spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local[4]" /xxx/exchange-1.0.1.jar -c /xxx/nebula_application.conf -h
```

4.5 After the import succeeds, you can check the amount of imported data with the db_dump tool

```shell
./db_dump --mode=stat --space=xxx --db_path=/home/xxx/data/storage0/nebula --limit 20000000
```
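Besides the aggregate counts from db_dump, a quick spot check from the console also works (a sketch using the sample IDs from the Hive data above; syntax per Nebula Graph 1.x):

```ngql
USE test_hive;
FETCH PROP ON tagA 1;   -- should return the row imported from Hive
GO FROM 1 OVER edgeAB YIELD edgeAB._dst, edgeAB.idString;
```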

5 Pitfalls and notes

  • The first pitfall: forgetting to add the -h parameter to the spark-submit command
  • Nebula Graph tag names are case sensitive; the tag name in the tags configuration must exactly match the tag name in Nebula Graph
  • Hive's int and Nebula Graph's int are not the same: Hive's bigint corresponds to Nebula Graph's int
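For example (a sketch with a hypothetical entry): if the tag was created as `tagA` in Nebula Graph, the tags configuration must use exactly that casing:

```hocon
tags: [
  {
    name: tagA   # must match the Nebula Graph tag name exactly; "taga" or "TagA" would fail
    # ... remaining settings as in the full config above
  }
]
```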

Other notes:

  • Because the Nebula Graph underlying store is KV, repeated inserts overwrite existing data, so replacing update operations with inserts performs better
  • Where the documentation is incomplete, you may have to read the source code for now, while also asking in the forum (the developers have it tough, juggling intense development with answering users' questions)
  • Importing data, Compact, and operation suggestions: docs.nebula-graph.com.cn/manual-CN/3…
  • I have verified the following two scenarios:
    • Import data from Hive 2 (Hadoop 2) to Nebula Graph using Spark 2.4
    • Import data from Hive3 (Hadoop 3) to Nebula Graph using Spark 2.4
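The first note above (insert-as-overwrite) is easy to demonstrate from the console (a sketch; the vertex ID and values are hypothetical):

```ngql
INSERT VERTEX tagA(idInt, idString, tboolean, tdouble) VALUES 100:(1, "v1", true, 1.0);
-- inserting the same vertex ID again silently overwrites the previous values
INSERT VERTEX tagA(idInt, idString, tboolean, tdouble) VALUES 100:(2, "v2", false, 2.0);
FETCH PROP ON tagA 100;  -- returns the second set of values
```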

Note: Exchange does not support Spark 3; compilation against it fails, so I could not verify a Spark 3 environment.

Some remaining questions:

  • How should the batch and rate.limit parameters in nebula_application.conf be set? How do you choose good values?
  • How does the Exchange tool import Hive data under the hood (via Spark)?
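For reference, these are the two knobs in question, with the values used in the config above (good defaults presumably depend on cluster capacity and have to be found empirically):

```hocon
nebula: {
  rate: {
    limit: 1024   # throttles the write rate to the Nebula Graph service
    timeout: 1000
  }
}
tags: [
  {
    batch: 256    # number of vertices sent per write request
  }
]
```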

6 Debugging the Exchange source

An introduction to Spark debugging can be found at dzone.com/articles/ho…

Reading and debugging the Exchange source deepens your understanding of how Exchange works, and also exposes places where the documentation is unclear. For example, for importing SST files and for Download and Ingest, only by reading the source code can you see where the documentation is vague and the described logic not rigorous.

Simple configuration mistakes can also be tracked down by debugging the source.

To get down to business:

Step one:

```shell
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000
```

Step 2:

```shell
spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local" /xxx/exchange-1.1.0.jar -c /xxx/nebula_application.conf -h
Listening for transport dt_socket at address: 4000
```

Step 3: Configure IDEA
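Configuring IDEA here means creating a Remote JVM Debug run configuration that attaches to the port exported in step one (a sketch; menu names may vary slightly across IDEA versions):

```text
Run → Edit Configurations → + → Remote JVM Debug
  Debugger mode: Attach to remote JVM
  Host: localhost   # or the machine running spark-submit
  Port: 4000        # must match the address= value in SPARK_SUBMIT_OPTS
```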

Step 4: Click Debug in IDEA

7 Suggestions and thanks

Thanks to vesoft for providing Nebula Graph, "the most powerful graph database in the universe," which has solved many practical problems in our business. Any question I encountered along the way received timely feedback from the community. Thanks again!

We look forward to Exchange supporting Nebula Graph 2.0.

References

  • What is the relationship between Exchange and Spark Writer?
  • The Spark Writer manual


Want to discuss graph database technology? NebulaGraphbot will take you into the Nebula Graph community.

Recommended reading

  • Some practical details in Spark data import
  • Neo4j imports the principles and practices of Nebula Graph