Concept

A transformation consists of one or more steps connected by hops. A hop defines a one-way channel that allows data to flow from one step to another. In Kettle, the unit of data is the row, and the data flow is the movement of data rows from one step to another.
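For orientation, here is a minimal sketch of loading and running an existing transformation from Java. It assumes the classic Kettle (PDI) Java API in the org.pentaho.di packages; the file name demo.ktr is only a placeholder.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                        // initialize the Kettle runtime
            TransMeta transMeta = new TransMeta("demo.ktr"); // the definition: steps + hops
            Trans trans = new Trans(transMeta);              // the executable transformation
            trans.execute(null);                             // starts all steps concurrently
            trans.waitUntilFinished();                       // block until every step has stopped
            if (trans.getErrors() > 0) {
                System.err.println("Transformation finished with errors");
            }
        }
    }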

Steps

Steps are the basic components of a transformation and are displayed graphically as icons, for example Table Input or Text File Output. A step writes data to one or more output hops connected to it, and that data is transmitted to the step at the other end of each hop. A hop, drawn as a line with an arrow between two steps, is actually a cache of data rows called a rowset (the rowset size can be defined in the transformation settings).

A step's data sending mode can be set to distribute or copy. Distribute: rows of data are sent to each output hop in turn. Copy: every data row is sent to all output hops. (Shift + left mouse button drag quickly creates a hop between two steps.)
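To make the two sending modes concrete, the following toy sketch (plain Java, not Kettle code) models a step's output hops as queues: distribute deals each row to one hop in turn, copy hands every row to all hops.

    import java.util.List;
    import java.util.Queue;

    // Toy model of one step's output hops: each hop is a queue of rows.
    class OutputHops {
        private final List<Queue<Object[]>> hops;
        private int next = 0; // round-robin pointer used by distribute

        OutputHops(List<Queue<Object[]>> hops) {
            this.hops = hops;
        }

        // Distribute: each row goes to exactly one hop, dealt in turn like playing cards.
        void distribute(Object[] row) {
            hops.get(next).add(row);
            next = (next + 1) % hops.size();
        }

        // Copy: every output hop receives the row.
        void copy(Object[] row) {
            for (Queue<Object[]> hop : hops) {
                hop.add(row);
            }
        }
    }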

In Kettle, all steps are executed concurrently. When a transformation starts, all steps start at the same time; each reads data from its input hops and writes the processed data to its output hops until the input hops have no more data. When all steps have stopped, the whole transformation stops. Data row: a data row is a collection of zero or more fields.
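As a mental model of this concurrency (again plain Java rather than Kettle code), each step can be pictured as a thread that takes rows from an input queue and puts rows on an output queue; all threads start at once, and the pipeline is finished when every thread has run out of input. The END marker used here is purely illustrative.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class PipelineModel {
        // Illustrative end-of-data marker: a step stops once it has read this row.
        private static final Object[] END = new Object[0];

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Object[]> hop1 = new LinkedBlockingQueue<>();
            BlockingQueue<Object[]> hop2 = new LinkedBlockingQueue<>();

            // Step 1: produces rows, then signals that no more data will come.
            Thread input = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) {
                        hop1.put(new Object[] { i });
                    }
                    hop1.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Step 2: transforms each row and passes it on until its input hop is exhausted.
            Thread transform = new Thread(() -> {
                try {
                    Object[] row;
                    while ((row = hop1.take()) != END) {
                        hop2.put(new Object[] { (int) row[0] * 2 });
                    }
                    hop2.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Step 3: consumes rows until its input hop is exhausted.
            Thread output = new Thread(() -> {
                try {
                    while (hop2.take() != END) {
                        // a real step would write the row somewhere here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // All steps start at the same time; the "transformation" ends when all have stopped.
            input.start();
            transform.start();
            output.start();
            input.join();
            transform.join();
            output.join();
        }
    }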

 

Feature 1:

During a transformation, Kettle can send data to multiple data streams in different modes, according to the user's settings. Note: there are two basic sending methods, distribute and copy. Distribute is similar to dealing playing cards: each row is sent to only one data stream, in turn. Copy sends every row of data to all data streams.

Feature 2:

Because the steps of a transformation are executed in parallel, you need a job whenever operations must be handled in a fixed order (the entries of a job are executed sequentially).
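A hedged sketch of running a job from Java, again assuming the classic Kettle API (org.pentaho.di.job); unlike the steps of a transformation, a job's entries run one after another. The file name nightly_load.kjb is a placeholder.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            // The job's entries (for example several transformations) execute sequentially.
            JobMeta jobMeta = new JobMeta("nightly_load.kjb", null); // null = no repository
            Job job = new Job(null, jobMeta);
            job.start();
            job.waitUntilFinished();
            if (job.getErrors() > 0) {
                System.err.println("Job finished with errors");
            }
        }
    }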

Feature 3:

The transformation is the most important part of an ETL solution: it handles the various operations on data rows in the extract, transform, and load stages. A transformation consists of one or more steps, such as reading a file, filtering output rows, cleaning data, or loading data into a database. The steps in a transformation are connected by hops, which define one-way channels that allow data to flow from one step to another.
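The same structure of steps connected by hops can also be built programmatically. The sketch below follows the pattern of Kettle's API samples, so the exact signatures are stated as an assumption; it wires two placeholder Dummy steps together with one hop.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.core.plugins.PluginRegistry;
    import org.pentaho.di.core.plugins.StepPluginType;
    import org.pentaho.di.trans.TransHopMeta;
    import org.pentaho.di.trans.TransMeta;
    import org.pentaho.di.trans.step.StepMeta;
    import org.pentaho.di.trans.steps.dummytrans.DummyTransMeta;

    public class BuildTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            TransMeta transMeta = new TransMeta();
            transMeta.setName("two steps and a hop");

            PluginRegistry registry = PluginRegistry.getInstance();

            // Two placeholder ("Dummy") steps stand in for real input and output steps.
            DummyTransMeta firstMeta = new DummyTransMeta();
            StepMeta first = new StepMeta(
                    registry.getPluginId(StepPluginType.class, firstMeta), "first", firstMeta);
            transMeta.addStep(first);

            DummyTransMeta secondMeta = new DummyTransMeta();
            StepMeta second = new StepMeta(
                    registry.getPluginId(StepPluginType.class, secondMeta), "second", secondMeta);
            transMeta.addStep(second);

            // The hop is the one-way channel through which rows flow from "first" to "second".
            transMeta.addTransHop(new TransHopMeta(first, second));
        }
    }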

In Kettle, the unit of data is the row, and the data flow is the movement of rows from one step to another. Another name for data flow is record flow. Note that a transformation can also contain notes: small text boxes that can be placed anywhere in the transformation diagram. The main purpose of notes is to document the transformation so that you and others can understand it later.

Feature 4:

An important point about transformations is that steps are their basic components, displayed graphically as icons. A step has the following key features: it must have a name that is unique within the transformation, and it writes data to one or more connected output hops, from which the data is transmitted to the step at the other end of each hop.

For the step at the other end, that hop is an input hop from which the step receives data. Most steps can have multiple output hops. A step's data sending mode can be set to distribute or copy: in distribute (round-robin) mode, rows are sent to each output hop in turn; in copy mode, rows are sent to all output hops.

When a transformation runs, each step, and each copy of a step, is run by its own thread. All step threads run almost simultaneously, and rows of data flow continuously through the hops between the steps.
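To my understanding, both the number of step copies and the sending mode are properties of the step definition in the Kettle API; the setters below (setCopies, setDistributes on org.pentaho.di.trans.step.StepMeta) are stated as an assumption and worth verifying against your Kettle version.

    import org.pentaho.di.trans.step.StepMeta;

    public class StepThreading {
        // Assumed Kettle API: each copy of a step is run by its own thread.
        public static void tune(StepMeta step) {
            step.setCopies(3);         // run three parallel copies (threads) of this step
            step.setDistributes(true); // distribute rows round-robin over the output hops
        }
    }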

Feature 5:

Hops in a transformation. A hop is a line with an arrow between two steps; it defines the data path between those steps.

A hop is actually a cache of data rows between two steps, called a rowset (the size of a rowset can be defined in the transformation settings). When the rowset is full, the step writing to it stops writing until there is space again. When the rowset is empty, the step reading from it stops reading until more rows become available. Note that when creating a new hop, hops must not form a loop in the transformation, because each step depends on the previous step for its field values.
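Conceptually the rowset behaves like a bounded blocking queue. The short sketch below (plain Java, not Kettle code) uses java.util.concurrent.ArrayBlockingQueue to show the same full/empty blocking behavior.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class RowSetSketch {
        public static void main(String[] args) throws InterruptedException {
            // The capacity plays the role of the rowset size from the transformation settings.
            BlockingQueue<Object[]> rowSet = new ArrayBlockingQueue<>(1000);

            // The writing step calls put(): it blocks while the rowset is full
            // and resumes as soon as the reading step frees up space.
            rowSet.put(new Object[] { 1, "first row" });

            // The reading step calls take(): it blocks while the rowset is empty
            // and resumes as soon as the writing step adds a row.
            Object[] row = rowSet.take();
            System.out.println(row[1]);
        }
    }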

Feature 6:

Parallelism of transformations

The rowset caching rule allows each step to be run by a separate thread for maximum concurrency.

This rule also allows data to be processed as a stream with minimal memory consumption. In data warehousing we often have to deal with large amounts of data, so this concurrent, low-memory approach is a core requirement of ETL tools.

It is not possible to define an execution order for a transformation, and it is neither necessary nor possible to identify a starting step and an ending step, because all steps are executed concurrently. When the transformation starts, all steps start at the same time, each reading data from its input hops and writing processed data to its output hops until there is no more data in its input hops; when all steps stop, the transformation stops. Nevertheless, from a functional perspective a transformation does have a clear starting point and ending point. Note that because the steps of a transformation all start almost simultaneously, if you want tasks to execute in a specified order, you must use a job.

Feature 7:

Design of the transformation

There are several data type rules to be aware of when designing transformations. All rows in a row stream must have the same structure: when data is written from multiple steps to one step, the rows output by those steps must have the same fields, with the same data types, in the same field order. Field metadata does not change during the transformation.

That is, strings are not automatically truncated to fit a specified length, and floating-point numbers are not automatically rounded to fit a specified precision; these operations must be done explicitly with dedicated steps. Also note that, by default, the empty string “” is considered equal to NULL.
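In Kettle terms, the shared structure of a row stream is its row metadata: an ordered list of field names and types. The sketch below assumes the org.pentaho.di.core.row classes (RowMeta, ValueMetaInteger, ValueMetaString); the field names are made up for illustration. Every step that writes to a given target step would have to produce rows matching one layout like this.

    import org.pentaho.di.core.row.RowMeta;
    import org.pentaho.di.core.row.RowMetaInterface;
    import org.pentaho.di.core.row.value.ValueMetaInteger;
    import org.pentaho.di.core.row.value.ValueMetaString;

    public class RowStructure {
        // One row layout: same fields, same types, same order for every row in the stream.
        public static RowMetaInterface customerLayout() {
            RowMetaInterface rowMeta = new RowMeta();
            rowMeta.addValueMeta(new ValueMetaInteger("id"));     // field 1: Integer
            rowMeta.addValueMeta(new ValueMetaString("name"));    // field 2: String
            rowMeta.addValueMeta(new ValueMetaString("country")); // field 3: String
            return rowMeta;
        }
    }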