Preface

For a detailed look at the concepts and features of transformations, see the previous blog post.

1. Transform is the fourth category of transformation steps.

Transform is the T in ETL: cleaning and transforming data. Of the three parts of ETL, T usually takes the longest, generally about 2/3 of the entire ETL effort.

2. Concat fields

Joins multiple fields together to form a new field.
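
As a rough illustration of what joining fields means, here is a minimal Python sketch. Rows are modelled as plain dicts, and the field names (`first_name`, `last_name`, `full_name`) and the `concat_fields` helper are hypothetical, not part of the tool.

```python
# Rows are modelled as dicts; all names here are illustrative assumptions.
rows = [
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"first_name": "Alan", "last_name": "Turing"},
]

def concat_fields(row, fields, target, separator=""):
    """Return a copy of the row with the given fields joined into a new field."""
    out = dict(row)
    out[target] = separator.join(str(row[f]) for f in fields)
    return out

result = [concat_fields(r, ["first_name", "last_name"], "full_name", " ") for r in rows]
```

In this sketch the source fields stay in the row and the joined value is added alongside them.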

3. Value mapping

Maps one value of a field to another value. It is used a lot in data quality standardization; for example, many systems define the gender field differently.
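
The gender example can be sketched in Python as a simple lookup table. The source codes (`M`, `F`, `1`, `0`), the target values, and the `map_value` helper are assumptions made for illustration.

```python
# Hypothetical mapping from several source-system codes to one standard value.
GENDER_MAP = {"M": "male", "F": "female", "1": "male", "0": "female"}

def map_value(row, field, mapping, target=None, default=None):
    """Map row[field] through the lookup table; write to target (or overwrite field)."""
    out = dict(row)
    out[target or field] = mapping.get(row[field], default)
    return out

rows = [{"gender": "M"}, {"gender": "0"}, {"gender": "x"}]
mapped = [
    map_value(r, "gender", GENDER_MAP, target="gender_std", default="unknown")
    for r in rows
]
```

Values that are not in the table fall back to the default, which keeps unexpected codes visible instead of silently dropping them.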

4. Add constants

Adds a column to the data stream that has the same constant value in every row.

5. Add sequence

Adds a sequence field (an incrementing number) to the data stream.
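
Adding a constant and adding a sequence are both "one new column per row" operations. The following Python sketch shows the idea with generators; the helper names, the `source` and `id` fields, and the start value of 1 are assumptions.

```python
from itertools import count

def add_constant(rows, name, value):
    """Yield each row with an extra field holding the same value."""
    for row in rows:
        yield {**row, name: value}

def add_sequence(rows, name, start=1, step=1):
    """Yield each row with an extra field holding an incrementing number."""
    counter = count(start, step)
    for row in rows:
        yield {**row, name: next(counter)}

rows = [{"city": "Oslo"}, {"city": "Lima"}]
result = list(add_sequence(add_constant(rows, "source", "crm"), "id"))
```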

6. Field selection

Selects fields from the data stream, renames them, and changes their data types.

You can select the fields you want to remove.

You can also select the fields whose metadata you want to change.
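
Selecting, renaming, removing, and retyping fields can be combined in one small function. This is only a sketch under assumed names: `select_fields` and its parameters are hypothetical, and a data type change is modelled as a Python conversion function.

```python
def select_fields(row, keep=None, rename=None, remove=(), cast=None):
    """Keep, drop, rename, and convert fields of a single row."""
    rename = rename or {}
    cast = cast or {}
    names = keep if keep is not None else list(row)
    out = {}
    for name in names:
        if name in remove:
            continue  # field explicitly removed
        value = row[name]
        if name in cast:
            value = cast[name](value)  # change the data type
        out[rename.get(name, name)] = value
    return out

row = {"id": "7", "nm": "Ada", "tmp": "x"}
selected = select_fields(row, rename={"nm": "name"}, remove={"tmp"}, cast={"id": int})
```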

7. Calculator

A collection of functions for creating new fields. You can also set whether a field is removed afterwards (a temporary field).
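
The idea of a temporary field can be shown with a short Python sketch: an intermediate result is computed, used, and then dropped from the output row. The field names and the `calculate` function are made up for this example.

```python
def calculate(row):
    """Compute a total via a temporary field that is removed from the output."""
    out = dict(row)
    out["subtotal"] = out["price"] * out["quantity"]  # temporary field
    out["total"] = out["subtotal"] + out["shipping"]  # field we want to keep
    del out["subtotal"]                               # remove the temporary field
    return out

calculated = calculate({"price": 4, "quantity": 3, "shipping": 5})
```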

8. Cut strings

Specifies the positions at which input stream fields are cut, producing new fields from the extracted substrings.
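
Cutting a string by position corresponds to slicing in Python. The sketch below assumes a zero-based start and an exclusive end position, which is Python's convention and not necessarily the tool's; `cut_string` and the field names are hypothetical.

```python
def cut_string(row, field, target, start, end):
    """Copy the substring [start, end) of row[field] into a new field."""
    return {**row, target: row[field][start:end]}

cut = cut_string({"phone": "0047-12345678"}, "phone", "country_code", 0, 4)
```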

9. String substitution

Specify what to search for and what to replace it with. If a field in the input stream matches the search, the match is replaced and the result is written to a new field.
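
A minimal Python sketch of search-and-replace into a new field follows. The optional regular expression mode, the `replace_in_string` helper, and the field names are assumptions for illustration.

```python
import re

def replace_in_string(row, field, target, search, replacement, use_regex=False):
    """Replace matches of search in row[field] and store the result in target."""
    value = row[field]
    if use_regex:
        new_value = re.sub(search, replacement, value)
    else:
        new_value = value.replace(search, replacement)
    return {**row, target: new_value}

replaced = replace_in_string({"date": "2024/01/31"}, "date", "date_iso", "/", "-")
digits = replace_in_string({"raw": "a1b22"}, "raw", "letters", r"\d+", "", use_regex=True)
```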

10. String operations

Trims whitespace from both ends of a string, converts upper and lower case, and so on, generating new fields.
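
Trimming and case conversion map directly onto Python string methods. In this sketch, `string_ops` and its `trim` and `case` parameters are hypothetical names.

```python
def string_ops(row, field, target, trim=True, case=None):
    """Trim whitespace and optionally change case, writing to a new field."""
    value = row[field]
    if trim:
        value = value.strip()
    if case == "upper":
        value = value.upper()
    elif case == "lower":
        value = value.lower()
    return {**row, target: value}

cleaned = string_ops({"code": "  ab12 "}, "code", "code_clean", case="upper")
```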

11. Remove duplicate records

Removes identical rows from the data stream. Note: The data stream must be sorted first!

12. Sort records

Sorts the data stream by the specified fields, in ascending or descending order.
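
Sorting by several fields with different directions can be sketched in Python by relying on the stability of `list.sort`: sort by the least significant key first, then by the more significant ones. The `sort_records` helper and the sample fields are assumptions.

```python
def sort_records(rows, keys):
    """Sort rows by (field, ascending) pairs, most significant pair first."""
    result = list(rows)
    # Python's sort is stable, so apply the keys from least to most significant.
    for field, ascending in reversed(keys):
        result.sort(key=lambda r: r[field], reverse=not ascending)
    return result

rows = [
    {"dept": "A", "salary": 10},
    {"dept": "B", "salary": 30},
    {"dept": "A", "salary": 20},
]
ordered = sort_records(rows, [("dept", True), ("salary", False)])
```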

13. Unique rows (hash value)

Removes duplicate rows from the data stream.

Note: Unique rows (hash value) has the same effect as sort records + remove duplicate records, but the implementation principle is different!

Unique rows (hash value) executes more efficiently! It compares rows by their hash values, while remove duplicate records checks whether two adjacent rows are identical.
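
The difference between the two approaches can be sketched in Python. Rows are modelled as tuples so they are hashable; both function names are hypothetical, and the hash-based version here uses a Python `set`, which is only an approximation of whatever the tool does internally.

```python
def remove_adjacent_duplicates(rows):
    """Drop a row only if it equals the row directly before it."""
    previous = object()  # sentinel that never equals a real row
    for row in rows:
        if row != previous:
            yield row
        previous = row

def unique_rows_hash(rows):
    """Drop a row if an equal row was seen anywhere earlier (hash lookup)."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

rows = [("b", 2), ("a", 1), ("b", 2), ("a", 1)]
adjacent_unsorted = list(remove_adjacent_duplicates(rows))
adjacent_sorted = list(remove_adjacent_duplicates(sorted(rows)))
hashed = list(unique_rows_hash(rows))
```

On the unsorted input the adjacent comparison removes nothing, which is why sorting is required first. The hash-based version needs no sort, but it has to remember every distinct row it has seen.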

14. Split the fields

Note: After splitting a field, the original field does not exist in the data stream!
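
A small Python sketch of splitting one field into several, with the original field dropped as the note describes. The `split_field` helper, the delimiter, and the field names are assumptions; missing parts are filled with `None` in this sketch.

```python
def split_field(row, field, delimiter, new_fields):
    """Split row[field] into new fields; the original field is not kept."""
    parts = row[field].split(delimiter)
    out = {k: v for k, v in row.items() if k != field}
    for i, name in enumerate(new_fields):
        out[name] = parts[i] if i < len(parts) else None
    return out

split = split_field({"id": 1, "name": "Ada Lovelace"}, "name", " ", ["first", "last"])
```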

15. Split columns into rows

Splits a field on a specified delimiter into multiple rows.
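
Splitting one field into several rows can be sketched as a generator that emits one output row per part. Keeping the original field next to the new one is an assumption of this sketch, as are the helper and field names.

```python
def split_field_to_rows(rows, field, delimiter, new_field):
    """Emit one output row for each delimited part of row[field]."""
    for row in rows:
        for part in row[field].split(delimiter):
            yield {**row, new_field: part}

rows = [{"id": 1, "tags": "a,b"}, {"id": 2, "tags": "c"}]
exploded = list(split_field_to_rows(rows, "tags", ",", "tag"))
```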

16. Change columns to rows

If several rows have the same value in the specified field, converts those rows into a single row. The values of a column are turned into fields, and some of the original columns are removed.

Note: The data stream must be sorted before the columns are changed to rows! You must use the sort records step first!
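
The following Python sketch shows one way to picture this step, and why sorted input matters. It uses `itertools.groupby`, which only groups adjacent rows, so unsorted input would produce several partial rows per group. The `denormalise` helper and the student/subject/score data are made up for illustration.

```python
from itertools import groupby

def denormalise(rows, group_field, key_field, value_field):
    """Collapse adjacent rows of the same group into one row.

    Each value of key_field becomes a field name holding value_field.
    """
    for group_value, group in groupby(rows, key=lambda r: r[group_field]):
        out = {group_field: group_value}
        for r in group:
            out[r[key_field]] = r[value_field]
        yield out

rows = [
    {"student": "Ann", "subject": "math", "score": 90},
    {"student": "Ann", "subject": "art", "score": 85},
    {"student": "Bob", "subject": "math", "score": 70},
]
wide = list(denormalise(rows, "student", "subject", "score"))
```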

17. Row to column

Converts the field names of the data fields into the values of a column, so that the data in one row is turned into the data of one column.
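
This is the reverse direction: field names become values. A minimal Python sketch, where `normalise`, the `subject` and `score` column names, and the sample row are assumptions:

```python
def normalise(rows, id_field, value_fields, key_name, value_name):
    """Turn each listed field of a row into its own (name, value) row."""
    for row in rows:
        for field in value_fields:
            yield {id_field: row[id_field], key_name: field, value_name: row[field]}

rows = [{"student": "Ann", "math": 90, "art": 85}]
tall = list(normalise(rows, "student", ["math", "art"], "subject", "score"))
```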

18. Flatten rows

Combines multiple rows of data from the same group into one row. Note: It can only be used on a data stream whose rows are all of the same type! The data stream must be sorted, or the results will be incorrect!
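
Flattening can be sketched in Python as taking every N consecutive rows and spreading one field across N target fields. The sketch assumes each group has exactly as many consecutive rows as there are target fields, which is why the input must be sorted and uniform; `flatten_rows` and the field names are hypothetical.

```python
def flatten_rows(rows, field, target_fields):
    """Merge every len(target_fields) consecutive rows into one row."""
    n = len(target_fields)
    for i in range(0, len(rows), n):
        chunk = rows[i:i + n]
        # Take the remaining fields from the first row of the chunk.
        out = {k: v for k, v in chunk[0].items() if k != field}
        for name, r in zip(target_fields, chunk):
            out[name] = r[field]
        yield out

rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
    {"id": 2, "v": "d"},
]
flat = list(flatten_rows(rows, "v", ["v1", "v2"]))
```

If the rows were not grouped together, a chunk would mix values from different groups, which is the incorrect result the note warns about.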