An Overview of Kettle

The previous article gave a brief introduction to Kettle and a quick hands-on taste of it, building the Hello World level example of copying data from a CSV file to an Excel file.

In real-world use, you can define complex ETL programs and workflows graphically in Kettle, as shown in the diagram below, combining a series of transformations to complete a Job.

Kettle Core Concepts

Transformation

The transformation is the most important part of ETL. It handles the extraction, transformation, and loading of rows of data. A transformation consists of one or more steps, such as the “CSV file input” and “Excel output” steps in the figure above, or steps that filter rows, clean data, deduplicate data, load data into a database, and so on. The steps in a transformation are connected by hops; a hop defines a one-way channel through which data flows from one step to another.
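
To make this concrete, here is a minimal sketch of loading and running a saved transformation (a .ktr file) from Java using the Kettle/PDI API. The file name hello.ktr is a placeholder, and the class names assume a PDI 5+ style API (org.pentaho.di.*); treat it as an illustration rather than a definitive recipe.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        // Initialize the Kettle environment (plugins, logging, etc.).
        KettleEnvironment.init();

        // Load a transformation definition created in Spoon (hypothetical file name).
        TransMeta transMeta = new TransMeta("hello.ktr");

        // Create and execute the transformation; each step runs in its own thread.
        Trans trans = new Trans(transMeta);
        trans.execute(null);        // no extra command-line arguments
        trans.waitUntilFinished();  // block until all steps have stopped

        if (trans.getErrors() > 0) {
            System.err.println("Transformation finished with errors");
        }
    }
}
```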

Step

A step is the basic building block of a transformation. The quick-start example in the previous article contained two steps, “CSV file input” and “Excel output”. A step has the following key characteristics:

  • A step must have a name that is unique within the transformation.
  • Every step reads and writes rows of data (the only exception being the “Generate Rows” step, which only writes data).
  • A step writes data to one or more of its output hops, passing the rows to the step at the other end of each hop.
  • Most steps can have multiple output hops. When a step has more than one output hop, a dialog pops up, as shown in the figure below, asking whether to distribute or copy the data: in distribute mode the target steps take turns receiving rows (round-robin); in copy mode every row is sent to all target steps (see the sketch after this list).
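
The difference between distribute and copy can be pictured with a small, self-contained sketch. This is not Kettle code, just an analogy: rows are handed to several target queues either in round-robin fashion (distribute) or broadcast to every queue (copy).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class HopModes {

    /** Distribute: the target steps take turns receiving rows (round-robin). */
    static void distribute(List<Queue<String>> targets, List<String> rows) {
        int next = 0;
        for (String row : rows) {
            targets.get(next).add(row);
            next = (next + 1) % targets.size();
        }
    }

    /** Copy: every row is sent to all target steps. */
    static void copy(List<Queue<String>> targets, List<String> rows) {
        for (String row : rows) {
            for (Queue<String> target : targets) {
                target.add(row);
            }
        }
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row-1", "row-2", "row-3", "row-4");
        List<Queue<String>> targets = new ArrayList<>();
        targets.add(new ArrayDeque<>());
        targets.add(new ArrayDeque<>());

        distribute(targets, rows);
        System.out.println("distribute: " + targets); // rows split between the two targets

        targets.forEach(Queue::clear);
        copy(targets, rows);
        System.out.println("copy: " + targets);       // every target receives all rows
    }
}
```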

Hop

In Kettle, a hop is the line with an arrow between two steps. A hop defines the data path between steps, as shown in the figure above. In Kettle the unit of data is the row, and a data stream is the movement of rows from one step to another. A hop is in fact a cache of rows between two steps, called a row set (its size can be configured in the transformation settings, as shown below). When the row set is full, the step writing to it stops writing until the row set has free space again; when the row set is empty, the step reading from it stops reading until there are more rows to read.
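
This back-pressure behavior can be pictured as a bounded blocking queue between two threads. The sketch below is a plain Java analogy, not Kettle internals: the writer blocks when the buffer is full and the reader blocks when it is empty, which is exactly the behavior described above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class RowSetAnalogy {
    public static void main(String[] args) {
        // A "row set" with room for 10 rows (Kettle's default is larger and configurable).
        BlockingQueue<String> rowSet = new ArrayBlockingQueue<>(10);

        // Writing step: put() blocks when the row set is full.
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    rowSet.put("row-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Reading step: take() blocks when the row set is empty.
        Thread reader = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    System.out.println(rowSet.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        writer.start();
        reader.start();
    }
}
```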

Data Rows

In Kettle, the unit of data is the row, and data moves between steps as rows. A data row is a collection of zero or more fields, which can hold the following data types (see the snippet after the list):

  • String: character data
  • Number: double-precision floating point number
  • Integer: signed long (64-bit) integer
  • BigNumber: arbitrary-precision number
  • Date: date-time value with millisecond precision
  • Boolean: boolean value, true or false
  • Binary: binary data such as images, sounds, and videos
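
For reference, these data types correspond to type constants in Kettle's Java API. The snippet below assumes the ValueMetaInterface class from org.pentaho.di.core.row (PDI 5+) and is meant as an orientation aid rather than exhaustive documentation.

```java
import org.pentaho.di.core.row.ValueMetaInterface;

public class KettleDataTypes {
    public static void main(String[] args) {
        // The row field types listed above map to these ValueMetaInterface constants.
        System.out.println("String    -> TYPE_STRING    = " + ValueMetaInterface.TYPE_STRING);
        System.out.println("Number    -> TYPE_NUMBER    = " + ValueMetaInterface.TYPE_NUMBER);
        System.out.println("Integer   -> TYPE_INTEGER   = " + ValueMetaInterface.TYPE_INTEGER);
        System.out.println("BigNumber -> TYPE_BIGNUMBER = " + ValueMetaInterface.TYPE_BIGNUMBER);
        System.out.println("Date      -> TYPE_DATE      = " + ValueMetaInterface.TYPE_DATE);
        System.out.println("Boolean   -> TYPE_BOOLEAN   = " + ValueMetaInterface.TYPE_BOOLEAN);
        System.out.println("Binary    -> TYPE_BINARY    = " + ValueMetaInterface.TYPE_BINARY);
    }
}
```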

Along with the rows themselves, each step also outputs a description of the fields it produces, i.e. the metadata of the rows. It usually contains the following information (a construction example follows the list):

  • Name: The field name in the row should be unique
  • Data type: The data type of the field
  • Format: the way the data is displayed, for example # or 0.00 for an Integer
  • Length: the length of a String or BigNumber field
  • Precision: the decimal precision of the BigNumber data type
  • Currency symbol: the currency symbol, such as ¥
  • Decimal symbol: the decimal point symbol used for numeric data
  • Grouping symbol: the digit grouping (thousands) separator for numeric data
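
As a rough illustration of row metadata, the sketch below builds a RowMeta describing two fields and sets a few of the properties listed above. It assumes the ValueMetaString/ValueMetaNumber classes from org.pentaho.di.core.row.value (PDI 5+); the field names and format mask are made up for the example.

```java
import org.pentaho.di.core.row.RowMeta;
import org.pentaho.di.core.row.RowMetaInterface;
import org.pentaho.di.core.row.ValueMetaInterface;
import org.pentaho.di.core.row.value.ValueMetaNumber;
import org.pentaho.di.core.row.value.ValueMetaString;

public class RowMetadataExample {
    public static void main(String[] args) {
        RowMetaInterface rowMeta = new RowMeta();

        // A string field: name (unique within the row) and length.
        ValueMetaString name = new ValueMetaString("customer_name");
        name.setLength(50);
        rowMeta.addValueMeta(name);

        // A numeric field: format, length/precision, currency, decimal and grouping symbols.
        ValueMetaNumber amount = new ValueMetaNumber("amount");
        amount.setConversionMask("#,##0.00");
        amount.setLength(12);
        amount.setPrecision(2);
        amount.setCurrencySymbol("¥");
        amount.setDecimalSymbol(".");
        amount.setGroupingSymbol(",");
        rowMeta.addValueMeta(amount);

        // Print each field's name and type description.
        for (ValueMetaInterface field : rowMeta.getValueMetaList()) {
            System.out.println(field.getName() + " : " + field.getTypeDesc());
        }
    }
}
```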

Steps Run in Parallel

The rule-based row set caching described in the “Hop” section allows each step to run in its own thread, giving the highest possible degree of concurrency. It also allows data to be processed as a stream with minimal memory consumption (given a reasonable row set size). Data warehouse construction often involves processing large volumes of data, so this combination of concurrency and low memory usage is a core requirement for an ETL tool.

In a Kettle transformation, all steps execute concurrently: when the transformation starts, every step starts at the same time, reads data from its input hops, and writes the processed data to its output hops, until there is no more data on its input hops and the step finishes. The transformation stops when all of its steps have stopped.

Conclusion

  • Kettle completes a Job through a series of transformations
  • From Kettle's core concepts we can see that Kettle moves data from one step to another through hops. Each step runs in its own thread, which increases concurrency, but moving the data is more expensive than the “move computation to the data” model of the Hadoop ecosystem
  • Kettle itself is developed in Java and requires appropriate JVM parameters to be configured

Follow the official account HelloTech for more content.