Getting Started with Apache Griffin

A data quality module is an essential component of any big data platform. Apache Griffin (hereinafter referred to as Griffin) is an open-source data quality solution for big data that supports quality detection in both batch and streaming modes. It lets you measure data assets from different dimensions, for example checking whether the row counts on the source and target ends are consistent after an offline task completes, or counting the null values in a source table, and thereby improve data accuracy and reliability.

Installation and deployment

Prerequisites

  • JDK (1.8 or later)
  • MySQL (5.6 or later)
  • Hadoop (2.6.0 or later)
  • Hive (version 2.x)
  • Spark (version 2.2.1)
  • Livy (livy-0.5.0-incubating)
  • Elasticsearch (5.0 or later)

Initialization

For details about initialization, see the Apache Griffin Deployment Guide. The installation of Hadoop and Hive is omitted here; only the steps of copying the configuration files and configuring the Hadoop configuration file directory are covered.

1. MySQL:

Create database Quartz in MySQL, then execute the init_quartz_mysql_innodb.sql script to initialize the table information:

mysql -u <username> -p <password> < Init_quartz_mysql_innodb.sql
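If the database does not exist yet, a minimal sketch of creating it from the command line (the name griffin_quartz is an assumption taken from the spring.datasource.url configured later; adjust it to your environment):

# create the quartz database used by Griffin's scheduler
# (griffin_quartz is a placeholder name; match it to spring.datasource.url)
mysql -u <username> -p -e "CREATE DATABASE IF NOT EXISTS griffin_quartz DEFAULT CHARACTER SET utf8;"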

2. Hadoop and Hive:

Copy the configuration file from the Hadoop server to the Livy server, assuming that the configuration file is stored in the /usr/data/conf directory.
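One way to do this, assuming SSH access between the hosts, placeholder hostnames, and the default Hadoop client configuration path:

# run on the Hadoop server: copy the client configuration to the Livy server
# (/etc/hadoop/conf is an assumed source path; livy-node is a placeholder hostname)
ssh livy-node "mkdir -p /usr/data/conf"
scp /etc/hadoop/conf/*.xml livy-node:/usr/data/conf/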

Create the /home/spark_conf directory on the Hadoop server and upload the Hive configuration file hive-site.xml to that directory:

hadoop fs -mkdir -p /home/spark_conf
hadoop fs -put hive-site.xml /home/spark_conf/

3. Set environment variables:

#!/bin/bash
export JAVA_HOME=/data/jdk1.8.0_192

# spark directory
export SPARK_HOME=/usr/data/spark-2.1.1-bin-2.6.3
# livy command directory
export LIVY_HOME=/usr/data/livy/bin
# hadoop configuration file directory
export HADOOP_CONF_DIR=/usr/data/conf

4. Livy configuration:

Update the livy.conf configuration file in livy/conf:

livy.server.host = 127.0.0.1
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
livy.repl.enable-hive-context = true

Start Livy:

livy-server start
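To check that Livy started correctly, you can query its REST API (assuming the default port 8998):

# list current Livy sessions; a JSON response means the server is reachable
curl http://127.0.0.1:8998/sessions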

5. Elasticsearch

Create the Griffin index in ES:

curl -XPUT http://es:9200/griffin -d '
{
    "aliases": {},
    "mappings": {
        "accuracy": {
            "properties": {
                "name": {
                    "fields": {
                        "keyword": { "ignore_above": 256, "type": "keyword" }
                    },
                    "type": "text"
                },
                "tmst": { "type": "date" }
            }
        }
    },
    "settings": {
        "index": {
            "number_of_replicas": "2",
            "number_of_shards": "5"
        }
    }
}'
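You can verify that the index was created with a simple GET request against the same host:

# show the mappings and settings of the new griffin index
curl -XGET http://es:9200/griffin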

Source code build and deployment

Griffin's source code is available at github.com/apache/grif… The source tree has a clear structure and consists of four main modules: griffin-doc, measure, service, and ui. griffin-doc stores Griffin's documentation; measure interacts with Spark to run the statistical tasks; service uses Spring Boot as the service implementation, providing RESTful APIs for the UI module to interact with, saving statistical tasks, and displaying statistical results.

After importing the source code, you need to modify the following configuration files:

1. service/src/main/resources/application.properties:

spring.datasource.url=jdbc:mysql://10.xxx.xx.xxx:3306/griffin_quartz?useSSL=false
spring.datasource.username=xxxxx
spring.datasource.password=xxxxx
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore configuration information
hive.metastore.uris=thrift://namenode.test01.xxx:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry, configure on demand
kafka.schema.registry.url=http://namenode.test01.xxx:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
# interval time unit s:second m:minute h:hour d:day, only these four units are supported
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# if the login strategy is ldap, configure the following
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# elasticsearch configuration
elasticsearch.host=griffindq02-test1-rgtj1-tj1
elasticsearch.port=9200
# elasticsearch.user = user
# elasticsearch.password = password
# livy uri
livy.uri=http://10.104.xxx.xxx:8998/batches
# yarn url
yarn.uri=http://10.104.xxx.xxx:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook

2. service/src/main/resources/quartz.properties:

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
# 
#   http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database,set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database,set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000

3. service/src/main/resources/sparkProperties.json:

{
  "file": "hdfs:///griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "name": "griffin",
  "queue": "default",
  "numExecutors": 2,
  "executorCores": 1,
  "driverMemory": "1g",
  "executorMemory": "1g",
  "conf": {
    "spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
  },
  "files": []
}

4. service/src/main/resources/env/env_batch.json:

{
  "spark": {
    "log.level": "INFO"
  },
  "sinks": [
    {
      "type": "CONSOLE",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "HDFS",
      "config": {
        "path": "hdfs://namenodetest01.xx.xxxx.com:9001/griffin/persist",
        "max.persist.lines": 10000,
        "max.lines.per.file": 10000
      }
    },
    {
      "type": "ELASTICSEARCH",
      "config": {
        "method": "post",
        "api": "http://10.xxx.xxx.xxx:9200/griffin/accuracy",
        "connection.timeout": "1m",
        "retry": 10
      }
    }
  ],
  "griffin.checkpoint": []
}

After the configuration files are modified, run the following Maven command in the IDEA terminal to compile and package:

mvn -Dmaven.test.skip=true clean install

After the command is executed, two jars are produced: service-0.4.0.jar in the service module's target directory and measure-0.4.0.jar in the measure module's target directory. Copy both jars to a directory on the server.
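A minimal sketch of copying them from the build machine (the target host user@server and the directory /usr/data/griffin are placeholder assumptions):

# copy the built jars to the deployment server
# (user@server and /usr/data/griffin are placeholders; adjust to your environment)
scp service/target/service-0.4.0.jar user@server:/usr/data/griffin/
scp measure/target/measure-0.4.0.jar user@server:/usr/data/griffin/

The two jars are then used as follows: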

1. Upload measure-0.4.0.jar to the /griffin directory in HDFS with the following commands:

# rename the jar
mv measure-0.4.0.jar griffin-measure.jar
# upload griffin-measure.jar to the HDFS /griffin directory
hadoop fs -put griffin-measure.jar /griffin/

This is required because when Spark runs the tasks on the YARN cluster, it loads griffin-measure.jar from the /griffin directory in HDFS; otherwise you will get a ClassNotFoundException for org.apache.griffin.measure.Application.
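You can confirm the jar is in place and that its path matches the "file" entry in sparkProperties.json:

# the path must match "file": "hdfs:///griffin/griffin-measure.jar" in sparkProperties.json
hadoop fs -ls /griffin/griffin-measure.jar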

2. Run service-0.4.0.jar to start the Griffin management service:

nohup java -jar service-0.4.0.jar > service.out 2>&1 &

After a few seconds, you can access Apache Griffin's default UI (by default, Spring Boot listens on port 8080):

http://IP:8080

A link to the UI operation documentation: Apache Griffin User Guide. Through the UI we can create our own statistical tasks. The steps are as follows.

1. Create the demo_src and demo_tgt tables in Hive:

--create hive tables here. hql script
--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_src`(
  `id` bigint,
  `age` int,
  `desc` string) 
PARTITIONED BY ( `dt` string, `hour` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION
  'hdfs:///griffin/data/batch/demo_src';

--Note: replace hdfs location with your own path
CREATE EXTERNAL TABLE `demo_tgt`(
  `id` bigint,
  `age` int,
  `desc` string) 
PARTITIONED BY ( `dt` string, `hour` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
LOCATION
  'hdfs:///griffin/data/batch/demo_tgt';
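To quickly verify that both tables were created, you can run, for example:

# list the demo tables and inspect the definition of demo_src
hive -e "SHOW TABLES LIKE 'demo*'; DESCRIBE FORMATTED demo_src;"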

2. Generate test data:

Download all the files from griffin.apache.org/data/batch/ to the server, and then execute the gen-hive-data.sh script with the following command:

nohup ./gen-hive-data.sh > gen.out 2>&1 &

Watch the gen.out log file and fix any errors as needed. In this test environment, Hadoop and Hive are installed on the same server, so the script is run directly on it.
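Once the script has been running for a while, you can check that partitions and data files are being produced, for example:

# confirm that hourly partitions exist and that data landed in HDFS
hive -e "SHOW PARTITIONS demo_src;"
hadoop fs -ls /griffin/data/batch/demo_src/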

3. Create statistical tasks on the UI, following the Apache Griffin User Guide step by step.

Pitfalls encountered

1. The gen-hive-data.sh script fails to generate data, and the message “No such file or directory” is displayed.

Cause: the dt=<date> partition directories under /griffin/data/batch/demo_src/ and /griffin/data/batch/demo_tgt/ in HDFS do not exist, for example dt=20190113.

Solution: add hadoop fs -mkdir -p commands to the script to create the directories, as follows:

#!/bin/bash

#create table
hive -f create-table.hql
echo "create table done"

#current hour
sudo ./gen_demo_data.sh
cur_date=`date +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"

#last hour
sudo ./gen_demo_data.sh
cur_date=`date -d '1 hour ago' +%Y%m%d%H`
dt=${cur_date:0:8}
hour=${cur_date:8:2}
partition_date="dt='$dt',hour='$hour'"
sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
hive -f insert-data.hql
src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
hadoop fs -touchz ${src_done_path}
hadoop fs -touchz ${tgt_done_path}
echo "insert data [$partition_date] done"

#next hours
set +e
while true
do
  sudo ./gen_demo_data.sh
  cur_date=`date +%Y%m%d%H`
  next_date=`date -d "+1hour" '+%Y%m%d%H'`
  dt=${next_date:0:8}
  hour=${next_date:8:2}
  partition_date="dt='$dt',hour='$hour'"
  sed s/PARTITION_DATE/$partition_date/ ./insert-data.hql.template > insert-data.hql
  hive -f insert-data.hql
  src_done_path=/griffin/data/batch/demo_src/dt=${dt}/hour=${hour}/_DONE
  tgt_done_path=/griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}/_DONE
  hadoop fs -mkdir -p /griffin/data/batch/demo_src/dt=${dt}/hour=${hour}
  hadoop fs -mkdir -p /griffin/data/batch/demo_tgt/dt=${dt}/hour=${hour}
  hadoop fs -touchz ${src_done_path}
  hadoop fs -touchz ${tgt_done_path}
  echo "insert data [$partition_date] done"
  sleep 3600
done
set -e

2. There is no statistical result file in the /griffin/persist directory of HDFS. Check the permissions of the directory and grant appropriate access, as shown below.
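For example, in a test environment you can simply make the directory world-writable (a quick fix; use a narrower permission in production):

# ensure the persist directory exists and is writable by the user running the Spark jobs
hadoop fs -mkdir -p /griffin/persist
hadoop fs -chmod -R 777 /griffin/persist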

3. The metric data in ES is empty. There are two possible causes:

  • The ES configuration in service/src/main/resources/env/env_batch.json is incorrect
  • The hostname of the ES server cannot be resolved from the YARN nodes that run the Spark tasks, so the connection fails (see the sketch below)
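A minimal sketch of the second fix, assuming a placeholder IP and the ES host name used earlier in this guide; run it on every YARN node that executes the Spark tasks (or reference the ES server by IP instead):

# append a hostname mapping on each YARN node (the IP is a placeholder)
echo "10.xxx.xxx.xxx  griffindq02-test1-rgtj1-tj1" | sudo tee -a /etc/hosts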

4. After starting service-0.4.0.jar, the user interface cannot be accessed and no exception appears in the startup log. Check whether the project was built with mvn package; if so, rebuild it with mvn -Dmaven.test.skip=true clean install instead.