I have used Impala + Kudu several times for real-time big data computing. Here is my experience.

  • For the initial full load into Kudu, we use Sqoop to import the relational database data into a temporary Hive table, and then use Impala to load the Kudu target table from that temporary table

Sqoop imports the relational data into Hive as Parquet, while Hive's default table format is text. Each time the temporary table is re-imported, you have to run INVALIDATE METADATA on it; otherwise the data cannot be found when you then load it into Kudu
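Roughly, the two-step load looks like the sketch below; the connection string, table names, and mapper count are illustrative placeholders, not the exact commands we ran.

```sql
-- Step 1 (shell): Sqoop the relational table into a temporary Hive table:
--   sqoop import --connect jdbc:mysql://db-host/mydb --table orders \
--     --hive-import --hive-table tmp_orders -m 4
-- Step 2 (impala-shell): refresh Impala's view of the re-imported temp table,
-- then load the Kudu target table from it.
INVALIDATE METADATA tmp_orders;
INSERT INTO kudu_orders SELECT * FROM tmp_orders;
```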

  • Why not use Sqoop to load the data directly into the Kudu table instead of going through Hive?

Sqoop has no direct connector for Kudu, so you must first Sqoop the data into Hive and then load it into Kudu

  • Apart from queries, it is best to perform all Impala operations in impala-shell rather than in Hue
  • When Impala writes to Kudu concurrently and the data volume is large

Set the Kudu configuration parameter --memory_limit_hard_bytes as large as possible, because Kudu buffers writes in memory and only flushes them to disk once a certain threshold is reached. This is the most direct way to improve write performance.

Of course, not every machine has that many resources, so you can also make the --maintenance_manager_num_threads parameter slightly larger; this requires tuning, and it improves the efficiency of flushing data from memory to disk
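As a rough sketch, both flags go into the tablet server configuration; the values below are placeholders only and have to be tuned against the actual machines:

```
# Kudu tablet server flags (illustrative values; tune per cluster)
--memory_limit_hard_bytes=17179869184      # ~16 GiB hard memory limit for buffered writes
--maintenance_manager_num_threads=4        # more threads flushing/compacting data from memory to disk
```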

  • Querying Kudu with Impala

If you want to run a complete ETL job over all the tables, you must first run COMPUTE STATS <table name>. Otherwise the memory estimate Impala produces when planning the generated SQL is inaccurate, which may cause the actual execution to fail
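For example (table name is illustrative):

```sql
-- Collect statistics so Impala can estimate memory for the generated SQL;
-- SHOW TABLE STATS lets you confirm the row counts were actually populated.
COMPUTE STATS kudu_orders;
SHOW TABLE STATS kudu_orders;
```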

The Kudu table should not be compressed, to keep the best raw scan performance, provided the query-performance requirement outweighs the storage requirement; most enterprises want real-time queries to be fast, and storage is cheap after all

Partition large Kudu tables, preferably with both range and hash: hash on an ID column if the primary key contains one, and make sure the range partitions are designed well; in my experience they are usually based on time. A DDL sketch of both points follows.
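This is a minimal sketch with an illustrative table, columns, bucket count, and date ranges; no column COMPRESSION attributes are specified, in line with the advice above, and the table is hash-partitioned on the ID plus range-partitioned on time:

```sql
CREATE TABLE kudu_orders (
  order_id BIGINT,
  dt       STRING,     -- event date, 'yyyy-MM-dd'
  amount   DOUBLE,
  PRIMARY KEY (order_id, dt)
)
PARTITION BY HASH (order_id) PARTITIONS 8,
  RANGE (dt) (
    PARTITION '2023-01-01' <= VALUES < '2023-02-01',
    PARTITION '2023-02-01' <= VALUES < '2023-03-01'
  )
STORED AS KUDU;
```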

For a slow SQL statement, the usual approach is to take it out on its own; if convenient, EXPLAIN it and check whether Kudu has already filtered part of the data (look for the kudu predicates keyword in the plan). If the SQL itself looks fine, run it in impala-shell and then run the summary command, focus on the individual peaks with large memory or time, and optimize the related tables to resolve the data skew
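In practice that looks roughly like this (query and table are illustrative):

```sql
-- Check the plan for the "kudu predicates" line to confirm the filter
-- was pushed down to Kudu rather than applied in Impala.
EXPLAIN SELECT COUNT(*) FROM kudu_orders WHERE dt = '2023-01-15';

-- Then run the query in impala-shell and inspect the per-operator breakdown;
-- a single operator with outsized time or peak memory usually points to data skew.
SELECT COUNT(*) FROM kudu_orders WHERE dt = '2023-01-15';
SUMMARY;
```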

  • Kudu data deletion

After the data is deleted, the disk space will be released
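Deletes are ordinary Impala statements against the Kudu table (the predicate is illustrative):

```sql
DELETE FROM kudu_orders WHERE dt < '2023-01-01';
```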

  • About Impala + Kudu versus Impala + Parquet

Many people say Impala + Kudu is much better than Impala + Parquet; believe that at your own risk.

First, the two scenarios are different: Kudu usually serves real-time work, while Hive usually serves offline work (typically T+1 or T-1).

Hive is built on the Hadoop Distributed File System (HDFS), which provides a mature storage layer and makes it easy to work with the underlying data and files; its security and scalability are much better than Kudu's. Most importantly, Parquet + Impala is more efficient than Kudu, so it is the first choice for a data warehouse

Kudu's biggest advantage is that it supports the same operations as a relational database, insert, update, and delete, so hot data can be stored in Kudu and updated at any time
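For example, hot rows can be inserted or corrected in place (values are illustrative and assume the kudu_orders table sketched earlier):

```sql
-- Insert the row, or overwrite it if the primary key already exists.
UPSERT INTO kudu_orders VALUES (1001, '2023-01-15', 99.50);
-- Fix a single field of an existing row.
UPDATE kudu_orders SET amount = 88.00 WHERE order_id = 1001 AND dt = '2023-01-15';
```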

  • Finally, real-time synchronization tools

We use StreamSets here, a drag-and-drop tool that is very handy; however, its memory usage is high. Watching it in JConsole, we found that when all the tasks start at the same time, nearly everything in the JVM's young generation is promoted straight to the old generation, GC cannot keep up, and memory overflows. Later we moved this part of the ETL tooling onto a few dedicated servers and configured the G1 garbage collector for the JVM
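A minimal sketch of that JVM change, assuming StreamSets Data Collector picks up its options from SDC_JAVA_OPTS in sdc-env.sh (heap sizes and pause target are illustrative):

```sh
# sdc-env.sh -- give the collector a large fixed heap and switch to G1
export SDC_JAVA_OPTS="-Xms8g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 ${SDC_JAVA_OPTS}"
```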