026-Kettle table Input table output increase 50 times

This is the 26th article in the Technical Writing Project (including translation). Set a small goal of 999, at least 2 articles per week.

Recent work needs to be done, data needs to be derived from Oracle to Mysql, and it needs to be properly cleaned and converted. The amount of data is about 500 million. The hardware environment is Winserver 2008R2 64-bit, 64GB, 48-core, 1T HDD, kettle 8.2, and the data is pumped from Oracle (11G, Linux server, LAN connection) to mysql(5.7, local machine, Win Server). The speed before optimization is about 1000R /s for read (Oracle) and 1000R /s for write. The optimized speed is about 8Wr/s for read and 4Wr/s for write. Because of the size and type of columns in the table and whether the table has an index or not, the overall increase is about 20-50 times.

Mysql optimization

Mysql is only used to migrate data, actually CSV, or ClickHouse is ok. However, there are concerns that CSV may have problems handling dates, that ClickHouse does not run in Win, that there are no additional Linux resources available, and that mysql is a third-party open source framework (whether imported into HDFS), and that clickhouse is a shell for clickhouse. Or data presentation (Supserset, Metabase, etc.), or migration to TiDB, are very convenient. So we decided to use mysql.

Note: mysql is only used for temporary data migration, so you can restart and modify mysqld parameters at will. If it is used with services, consult the DBA to ensure that other services will not be affected.

Mysql installation and configuration optimization

From dev.mysql.com/downloads/m… Download 64-bit ZIP mysql
Unzip mysql-5.7.26-winx64.zip to a directory such as D:\mysql-5.7.26-winx64
Create the D: \ mysql – 5.7.26 – winx64 \ my ini

[mysqld]
port=3306
basedir=D:\mysql-5.7.26-winx64\
datadir=D:\mysql-5.7.26-winx64\data
net_buffer_length=5242880
max_allowed_packet=104857600
bulk_insert_buffer_size=104857600
max_connections = 1000
innodb_flush_log_at_trx_commit = 2
MyISAM is about 1 times better than InnoDB in this scenarioDefault-storage-engine =MyISAM general_log = 1 general_log_file=D:\mysql-5.7.26-winx64\logs\mysql.log innodb_buffer_pool_size = 36G innodb_log_files_in_group=2 innodb_log_file_size = 500M innodb_log_buffer_size = 50M sync_binlog=1 innodb_lock_wait_timeout = 50 innodb_thread_concurrency = 16 key_buffer_size=82M read_buffer_size=64K read_rnd_buffer_size=256K sort_buffer_size=256K myisam_max_sort_file_size=100G myisam_sort_buffer_size=100M transaction_isolation=READ-COMMITTEDCopy the code

See 2.3.5 Installing MySQL on Microsoft Windows Using a noinstall ZIP Archive
Start the mysql service

Kettle optimization

Optimization of startup parameters

To avoid OOM, increase the memory parameters and create the environment variable PENTAHO_DI_JAVA_OPTIONS = -xMS20480m -XMx30720m -xx :MaxPermSize= 1024M start 20 GB and maximum 30 GB

Table input and table output enable multithreading

Table input can duplicate data if multithreading is enabled. For example, if you select * from test, three times will be checked, and the final data will be three copies. It’s not gonna work. It’s not gonna work. Because the source is oracle, use the character with oracle: rownum and function: the mod, as well as the parameters of the kettle: Internal. Step. Unique. Count, Internal. Step. Unique. Number

select * from (SELECT test.*,rownum rn FROM test ) where mod(rn,${Internal.Step.Unique.Count}) = ${Internal.Step.Unique.Number}
Copy the code

explain

Rownum is the number of rows that are consecutively assigned by Oracle to return from the query. The first row is assigned 1, and the second row is assigned 2. This means that if the sort field or data changes, the RowNum will change. Rowid should be used, but roWID is not a number, but something like AAAR3sAAEAAAACXAAA), so rowNum needs to be solidified, so theSELECT test.*,rownum rn FROM testAs a subquery
Mods are Oracle’s modular operations, such as,The mod (5, 3)meaning5% 3 = 2, it is5/3 when = 1... 2If we can get the total number of threads, and the current number of threads, we can split the result set.Mod (line number, total threads)= current thread number
Kettle built-in function${Internal.Step.Unique.Count} 和 ${Internal.Step.Unique.Number}Respectively represent the total number of threads and the current thread number

It doesn’t matter how many threads I’m going to open, the kettle’s going to total and then I’m going to amortize it.

Change the number of copies started...

Change the number of submissions (default 100, but not larger is better)
Get rid of the clipping table, because it’s multithreaded, and you don’t want to delete something that thread A just inserted.
The database field must be specified because there is an extra row number field when the table is entered. The insertion fails. Of course, if you create a table with a row number field, as the increment id, you do not need to do this step.
Enabling Batch Insert

Note: Speed can be increased more than 5 times by enabling multi-threading.

Enable thread pooling and optimize JDBC parameters

Operational observation

Remaining room for improvement

In the SSD
Continue to tune mysql parameters
Switch engines, for example, tokudb
Replace extraction tools such as Streamsets,datax
In the database, such as clickhouse tidb, Cassandra
Kettle cluster

Want ads

Friends in Jinan, Shandong, welcome to join us and do things together. Long-term recruitment, Java programmer, big data engineer, operation and maintenance engineer, front-end engineer.