Introduction of Kettle

ETL (extract-Transform-Load), for data developers, we often encounter a variety of data processing, conversion, migration, so it is essential to understand and master the use of an ETL tool. The ETL tool we’re going to look at here is Kettle.

What is the Kettle

The Kettle is an open-source ETL tool that has no restrictions on business users. It is written in Pure Java and runs on Windows, Linux, and Unix. It requires no installation and is efficient and stable in data extraction. A Kettle allows you to manage data from different databases by putting different data into a Kettle and then streaming it out in a specified format. There are two types of script files in the Kettle: Transformation and Job. Transformation converts basic data, and Job controls the entire workflow. The start module under Job has a timing function, which can be set daily or weekly.

Core components of the Kettle

The name of the function
Spoon A graphical interface that allows you to design ETL transformations using a graphical interface.
Pan Run the command line tool for the transformation
Kitchen The command line tool that runs the job
Carte Carte is a lightweight Web container for setting up dedicated, remote ETL Servers
  • Jobs and transformations can be performed in a graphical interface, but only during development, testing, and debugging. Once development is complete, Spoon is rarely used when it needs to be deployed to a production environment, and the Kitchen and Pan command-line tools are used for actual production environments.
  • The deployment production phase is usually executed through command lines. You need to put the command lines into a Shell script and schedule the script periodically.
  • The Kitchen and Pan tools are Kettle command-line executants that encapsulate the Kettle execution engine. They simply interpret command line arguments, call them, and pass them to the Kettle engine.
  • Kitchen and Pan are very similar in concept and usage, and the parameters of the two commands are basically the same. The only difference is that Kitchen is used to perform the job and Pan is used to perform the transformation.

Kettle Conceptual Model

The Kettle execution is divided into two levels: Job (.kjb suffix) and Transformation (.ktr suffix).

Simply put, a transformation is an ETL process, and a job is a collection of transformations, jobs in which transformations or jobs can be scheduled, scheduled tasks, and so on.

In the actual process, the writing process should not be very complicated. When data extraction requires multiple steps, it needs to be divided into multiple transformations, which are integrated into a job and placed in sequence, and then executed.

Directory file function description

Download and Installation

Download each version from the official website: HTTPS://sourceforge.net/projects/pentaho/files/Data%20Integration/Kettle Forum: HTTPS://www.kettle.net.cn/
Copy the code

The Kettle is open-source software based on Java programming. You need to install JDK and configure environment variables. After decompressing the software, you do not need to install the software.

Other preparations: Place the database driver, for example, under the bin folder in the Kettle root directory.

To open Kettle, run Spoon. Bat (Win) or Spoon. Sh (Linux or macOS) to open Spoon.

Start the Kettle

In the following figure, run the./spoon.sh command

The welcome page

HelloWorld

Copy data from CSV files to Excel files

CSV file input

Drag “CSV File Input” to the working area on the right, double-click to edit it, browse and select the prepared test file, and click “Get Field” to automatically obtain the table header information in the CSV file. After the input configuration is complete, the next step is to output the configuration.

Excel output

Drag “Excel Output” to the right workspace, double-click to edit, this step is relatively simple, browse, select the output directory and set the file name, complete the configuration.

Convert file

Hold down the Shift + left mouse button to establish the connection and save the transformation configuration

Run the transformation

View the results

conclusion

This topic describes Kettle core components and their usage

  • Jobs and transformations can be performed in a graphical interface, but only during development, testing, and debugging. Once development is complete, Spoon is rarely used when it needs to be deployed to a production environment, and the Kitchen and Pan command-line tools are used for actual production environments.
  • The deployment production phase is usually executed through command lines. You need to put the command lines into a Shell script and schedule the script periodically.
  • The Kitchen and Pan tools are Kettle command-line executants that encapsulate the Kettle execution engine. They simply interpret command line arguments, call them, and pass them to the Kettle engine.
  • Kitchen and Pan are very similar in concept and usage, and the parameters of the two commands are basically the same. The only difference is that Kitchen is used to perform the job and Pan is used to perform the transformation.

Step through a HelloWrold process

Welcome to follow the public account HelloTech for more content