In many cases, a big data cluster needs to ingest business data for analysis. There are usually two ways to do this:

  • Having the business write the data out directly or indirectly
  • Synchronizing the business's relational database to the big data cluster

The first approach means writing code in the business application that sends whatever data it decides is needed to a message queue, from which it eventually lands in the big data cluster.

The second approach is to synchronize the relational data to the big data cluster, where it can be stored in HDFS and analyzed with Hive, or written directly to HBase.

Data synchronization can be divided into two types: incremental synchronization and CRUD synchronization.

Incremental synchronization copies only newly inserted data from the relational database; updates and deletes are not synchronized. This approach suits data that never changes once generated. CRUD synchronization propagates inserts, updates, and deletes alike, keeping the two databases consistent.

This article is not about the binlog + Canal + message queue + JAR scheme for real-time data synchronization, nor about offline synchronization with Sqoop. Instead, I'll show you how to complete the entire real-time synchronization process with StreamSets and zero code. For more on what StreamSets is and what else it can do, check out the StreamSets website. From what I have seen so far, StreamSets is very handy for data synchronization.


So, let’s get right down to business.

Installation

Download

StreamSets can be downloaded directly from archives.streamsets.com.

You can choose a Full Tarball, a Core Tarball, a Cloudera Parcel, or another format. The advantage of the Core Tarball is its small size; if you need a stage library later, you can install it from the StreamSets web UI. The Full Tarball ships with many libraries by default, but the download is large (over 4 GB) and many of those libraries may never be needed.

Or you can use this link to download directly: archives.streamsets.com/datacollect…

Unpack and start

The StreamSets Core Tarball only needs to be unpacked before it is ready to run.

tar xvzf streamsets-datacollector-core-3.7.1.tgz
cd streamsets-datacollector-3.7.1/bin/
./streamsets dc

Startup prints something like the following, including a warning that the open file limit is too low:

Java 1.8 detected; adding $SDC_JAVA8_OPTS of "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -Djdk.nio.maxCachedBufferSize=262144" to $SDC_JAVA_OPTS
Configuration of maximum open file limit is too low: 1024 (expected at least 32768). Please consult https://goo.gl/6dmjXd

Change the operating system's limit on the number of open files:

vi /etc/security/limits.conf

Add these two lines (the leading asterisk applies the limit to all users):

* soft nofile 65536
* hard nofile 65536

Log in again and run `ulimit -n` to confirm the new open files limit has taken effect.
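For example, in a fresh login session (limits.conf is applied at login):

ulimit -n
# expected: 65536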

The web UI

StreamSets has a web UI on port 18630 by default. Enter the server's IP address followed by :18630 in your browser's address bar to open the StreamSets page. The default username and password are both admin.

Pipeline

Preparation

We need to synchronize MySQL data to HBase in real time, but the Core Tarball we downloaded includes neither the MySQL Binary Log nor the HBase stage library, so both must be installed before creating a new pipeline.

Install the MySQL Binary Log library

Install the HBase library. Note that the HBase library is grouped under CDH, so select a CDH version to install.

MySQL Binary Log and HBase now show up under Installed Stage Libraries.
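If you prefer the command line, the tarball also ships a stagelibs utility for listing and installing stage libraries. A minimal sketch (library IDs vary by Data Collector version, so run -list first; the CDH library ID below is an assumption, while the mysql-binlog ID matches the directory name seen later in this article):

bin/streamsets stagelibs -list
# pick the CDH library ID that matches your cluster's CDH version
bin/streamsets stagelibs -install="streamsets-datacollector-mysql-binlog-lib,streamsets-datacollector-cdh_5_14-lib"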

Create a Pipeline

MySQL Binary Log

Create a MySQL Binary Log origin.

Set the MySQL connection parameters (Hostname, Port, and Server ID). Note that the origin connects to MySQL as a replication client, so its Server ID should be unique, i.e. different from the server_id in the MySQL configuration file (usually /etc/my.cnf) and from any other replicas.
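As a prerequisite, MySQL itself must have row-based binary logging enabled. A minimal sketch of the relevant /etc/my.cnf settings (values are illustrative):

[mysqld]
server_id     = 1           # the MySQL server's own replication ID
log-bin       = mysql-bin   # turn on the binary log
binlog_format = ROW         # the MySQL Binary Log origin needs row-based events

Restart MySQL after changing these settings.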

Set the username and password for the MySQL database.

Other settings: in the Include Tables field we list two tables (separated by a comma), meaning we only monitor data changes in these two tables and ignore all others.
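For illustration only (the database name and the second table are hypothetical; the format is database.table, comma separated):

Include Tables: mydb.cartype,mydb.carinfo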

Stream Selector

Create a Stream Selector and connect the MySQL Binary Log origin you just created to it. Give the Stream Selector a condition, in this case: ${record:value('/Table') == 'cartype'}

As you can see, the Stream Selector has two output streams (1 and 2). As we'll see next, stream 1 (records matching the condition) goes to HBase, and stream 2 (the default stream, everything else) goes to Trash.

HBase & Trash

Create the HBase and Trash destinations and connect each to one of the Stream Selector's outputs.

Configure the HBase destination.
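A minimal sketch of the HBase destination settings (all values are illustrative and must match your cluster):

ZooKeeper Quorum:      zk1.example.com,zk2.example.com,zk3.example.com
ZooKeeper Client Port: 2181
Table Name:            cartype        # must already exist in HBase
Row Key:               /id            # record field to use as the row key
Fields:                map record fields to columns, e.g. /name -> cf:name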

Trash does not need any configuration.

Validation & startup

Validation

Click the “glasses” icon in the upper right corner to validate the whole pipeline.

Validation threw an error here: "java.lang.RuntimeException: Unable to get a driver instance for jdbcUrl". This means the MySQL JDBC driver JAR is missing. The fix is as simple as downloading the JAR and dropping it into the directory StreamSets expects. The full path on my machine is: /opt/streamsets/streamsets-datacollector-3.7.1/streamsets-libs/streamsets-datacollector-mysql-binlog-lib/lib/mysql-connector-java-5.1.26-bin.jar
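For example (assuming the JAR has been downloaded to the current directory; the target path matches the layout above):

# drop the MySQL JDBC driver into the mysql-binlog stage library
cp mysql-connector-java-5.1.26-bin.jar /opt/streamsets/streamsets-datacollector-3.7.1/streamsets-libs/streamsets-datacollector-mysql-binlog-lib/lib/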

In addition, the corresponding tables must be created in HBase in advance; otherwise validation will fail with an error.
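For example, in the hbase shell (the column family name cf is illustrative):

create 'cartype', 'cf'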

After the JAR is in place, restart StreamSets (this can be done from the web UI).

Validating again now succeeds.

Click the start (play) button in the upper right corner to kick off the pipeline. With that, the whole flow is complete and data is being synchronized in real time. You can watch how many records flow into and out of each stage, and you can also log in to HBase to confirm that data is arriving.
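For example, a quick check from the hbase shell (table name as above):

# show the first few synchronized rows
scan 'cartype', {LIMIT => 5}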

That's how to use StreamSets to synchronize MySQL data to HBase in real time. As you've surely noticed, the whole process involves writing no code at all. Isn't that more efficient than the binlog + Canal + message queue + JAR scheme? Of course, every solution has its pros and cons, and forming a deeper judgment on StreamSets will take more hands-on experience and observation.