NetEase Cloud


This post is not a general analysis of Kudu's performance. It is about why Kudu's scan performance turned out to be so poor in our tests, even though Kudu was promoted from the start with a whole list of "black magic" technologies: independent columnar storage, Bloom filters, compression, in-place updates, B+ trees, MVCC, and so on.


Here is a comparison of TPC-DS results for Kudu and Parquet:



No comparison, no harm. The vertical axis is time in seconds. The yellow bars for Kudu are far taller, which means Kudu takes much longer and performs poorly!


Boss: Why is Kudu so bad? Me: I don't know...


At the time I really did not know the reason. I was busy running tests and eager to collect the numbers, and there was no time for analysis, let alone analysis of two large and unfamiliar systems, Impala and Kudu. It was very embarrassing 🙁


Once all the TPC-DS test cases were finished there was a lull, and I spent several days looking for the cause: reading material, going through documentation, and googling. I won't describe that process here and will focus on the cause itself.


Impala has an interactive management tool called impala-shell, and it has a profile command. Run after a SQL statement has executed, it returns the execution plan for that statement and the elapsed-time statistics for each node. Since both the Kudu and the Parquet tests used Impala as the compute engine, could the profiles tell us something?
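As a rough sketch of how the profiles could be collected non-interactively, the snippet below only builds an impala-shell command line. It relies on the shell's `-i` (impalad host), `-q` (query) and `-p` (show the profile after each query) options. The host name, the query text, and the helper name `build_profile_command` are placeholders made up for this illustration, not something from the original test setup.

```python
import shlex


def build_profile_command(query, impalad="impalad-host"):
    """Build an impala-shell command that prints the query profile.

    "impalad-host" is a placeholder; replace it with a real impalad address.
    """
    # -i: impalad to connect to, -p: show the profile after the query,
    # -q: run this single query and exit.
    return ["impala-shell", "-i", impalad, "-p", "-q", query]


# The query is only an illustration, not one of the TPC-DS queries.
cmd = build_profile_command("SELECT count(*) FROM store_sales")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run`, with the output redirected to one file per query and per storage format, which gives files that are easy to compare later.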


So I took query7 and query40, the two queries with the most obvious gap in the figure above, and ran each of them against both Kudu and Parquet. That gave me four profile files to analyze. Believe it or not, a profile is huge: a single file runs close to 10,000 lines. Still feeling confident about analyzing it? (Query40's profile is in the attachment below.) At that point I was completely lost, but I had to find the cause, so I read the files from beginning to end. Then, almost by accident, I opened Beyond Compare, which I had often used to compare code, and diffed the two query40 profiles (Kudu and Parquet). A little way down, in the execution plan section, I struck gold!



The Parquet plan has runtime filters; the Kudu plan does not. Scrolling further down to the disk scan section:





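The result sets returned by the disk scans are different too! No wonder that during the comparison test the Kudu cluster showed heavy disk I/O and network transfer overhead while running queries, whereas the load on the Parquet side stayed low. Without a runtime filter the scan hands back far more rows, and every one of them has to be read and shipped over the network. See it now?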
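To make the effect of a runtime filter concrete, here is a toy sketch of the idea, not Impala's implementation. The join keys of the small side of a join are collected first and pushed down to the scan of the large side, so rows that cannot possibly join are dropped at the scan. A plain Python set stands in for the more compact structures (such as Bloom filters) a real engine would use, and all table contents and function names are made up.

```python
def scan(rows, key, runtime_filter=None):
    """Simulate a table scan; with a filter, drop non-matching rows early."""
    if runtime_filter is None:
        return list(rows)
    return [row for row in rows if row[key] in runtime_filter]


def hash_join(build_rows, probe_rows, key):
    """Plain hash join: index the small side, then probe with the large side."""
    index = {}
    for b in build_rows:
        index.setdefault(b[key], []).append(b)
    return [(b, p) for p in probe_rows for b in index.get(p[key], [])]


# Made-up data: a small dimension table and a large fact table.
dim = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
fact = [{"id": i % 100, "amount": i} for i in range(1000)]

keys = {d["id"] for d in dim}                    # the "runtime filter"
unfiltered = scan(fact, "id")                    # scan without a filter
filtered = scan(fact, "id", runtime_filter=keys)  # scan with the filter

# Same join result, but far fewer rows leave the scan.
assert hash_join(dim, unfiltered, "id") == hash_join(dim, filtered, "id")
print(len(unfiltered), len(filtered))  # prints: 1000 20
```

The join result is identical either way; what changes is how many rows the scan has to return, which is where the extra disk I/O and network traffic come from.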


Why does Kudu not have runtime filters? I searched Kudu's JIRA and, well, found nothing. So I tried Impala's JIRA. There, associated with Matthew Jacobs, an Impala/Kudu development engineer at Cloudera, I found IMPALA-3741 and IMPALA-4252.






At this point the problem was fairly clear and the answer was there, but I was not ready to let it go. So I registered an account and filed an issue in their JIRA, IMPALA-4719, to confirm once more whether this is supported. (Normally this kind of question should go to the user mailing list, so I suppose I helped them test their JIRA permissions =_=)


Then I re-read Kudu's official documentation. The hint had been there between the lines all along; I just had not paid enough attention to it:



This concludes the article. I hope you can learn from this experience, thank you!

