Compared with the current popularity of big data technology, you may feel that there are too few Kettle usage scenarios, or there is no need to use such a device. After checking the Github Kettle, you find that there are some recent updates. In addition, for data users without programming experience, a very simple Kettle is used. Through the graphic interface design to achieve what business, do not need to write code to achieve, you can do some experiments, such as: crawl the website stock data, foreign exchange information and so on.

Kettle supports a variety of input and output formats, including text files, data tables, and database engines. In summary, Kettle’s powerful input, output, and conversion functions make it very easy to manipulate data.

Common input steps

File entry steps

Common text file input steps include CSV file input, Excel file input, and text file input.

The HelloWorld level function, “copying data from CSV files to Excel files,” was described in previous articles. See the steps for more details.

You can select all files in the same directory by selecting the directory and wildcard number wildcard files, or you can choose whether to read files in the subdirectories of the current directory, as shown below:

XML input Steps

XML is an extensible Markup language, which is mainly used to transmit and store data. In some traditional systems, data transmission and docking are still carried out in this way. By means of the input step of “Get data from XML”, data information in XML files is obtained. Using xpath to determine the location of a portion of data in an XML document, xpath provides the ability to find nodes in a tree of data structures based on XML.

The sample

Follow the wizard to complete an example of reading property configuration information from a POM file, as shown below:

Common expressions in xpath

expression instructions
nodename Selects all children of this node
/ From the root node
// Nodes in the document are selected from the current node selected by the match, regardless of their location
. Select the current node
. Selects the parent of the current node
@ Select properties
div Selects all the children of the div element
/div Select the root element div
div/p Select the child p under the div element
//div Select all div elements
div//p Selects all p elements under the div element
//@lang Select all the properties named lang

JSON input steps

Compared to XML, it is a lightweight data interchange format. JSON core concepts: arrays (data in []), objects (data in {}), attributes (data in k:v)

To implement the function of “call RESTful interface to import JSON results into the library”, whether using Java or Python to call RESTful interface to import JSON results into the library, there is a certain complexity. First, you need to load third-party REST component dependencies, and then connect to the database. Write the SQL statement and finally insert it into the target database. However, with the Kettle tool, Spoon can easily call the interface and import the database using the graphical interface.

Through a simple GET request, obtain “Shanghai and Shenzhen ipo notice” information, data situation and operation are as follows

The data sample

Interface information

Xueqiu.com/service/v5/…

Conversion steps:

The output

Configuration Procedure

Table entry step

Adding a database driver

If you want to manipulate the database, you need to download the database driver and place it in /data-integration/lib. The default driver includes the following:

(base) ~ / Documents/apps/data - integration/lib  ls - al | grep SQL - rw - r - r - @ 1 Yezhiwei staff. 1473091 Jun 11 2019 Hsqldb-2.3.2. jar-rw-r --r-- @1 Yezhiwei Staff 825943 Jun 11 2019 Postgresql-42.2.5.jar-rw-r --r-- @1 Yezhiwei Staff 3201133 Jun 11 2019 SQLite-jDBC-3.7.2.jarCopy the code

Add MySQL driver jar package mysql-connector-java-5.1.41.jar to /data-integration/lib:

Configuring database Connections

Get table SQL query statement, data preview

Select the database connection configuration, then click “Get SQL Query Statement”

Select the table, complete the Settings as prompted, automatically generate SQL, “preview” can view the data.

Common output steps

After data is processed, it is usually saved to a place for future use. The output is usually Excel or a database. Excel output has been described in previous examples.

The common database output steps are Insert, update, and delete. The examples focus on the Insert, update, and Delete output steps

Table output

Outputs data from other input sources to the table

update

Update refers to comparing existing records in the database with those in the data stream and updating them if they are different

Insert the updated

To insert an update is to insert redundant data from the data stream on the basis of the update

delete

Delete can be used with custom constant input to define a constant condition and delete all data that does not meet this condition

The sample
Data preparation

Data is inserted or updated from Excel to MySQL

Result of new data insertion

Update the results

conclusion

ETL is an indispensable tool in data warehouse technology. As a classic ETL tool, Kettle can be used to implement services through GRAPHICAL interface design without writing codes. It is also very simple for data users without programming experience.

Of course, there are also some problems, such as: the amount of data processing is relatively small, relatively few documents and so on. Have the opportunity to summarize the problems encountered in the process of use. Welcome to follow the public account HelloTech for more content