This article is edited and compiled from "Building an Interactive Query System on the Impala Platform", shared by Jiang Hongxiang of NetEase Big Data at DataFun Talk — "Big Data: From Low-level Processing to Data-driven Business".

The following is the outline of today's sharing. First, the characteristics of interactive queries; second, the many query platforms available on big data platforms and how to choose among them; third, a basic introduction to Impala and the improvements we made to it; and finally, Impala application scenarios, the underlying data flow, scenario analysis, and some remaining problems.

The first characteristic of interactive queries is the large data volume. The second is that the relational schema is relatively complex; depending on your design, there can be many kinds of relational schemas. The third is the strict response-time requirement: the vast majority of queries should return within 10 seconds. Storage is chosen according to data volume: for data at the million-row level, MySQL or PostgreSQL is sufficient; from millions up to around ten billion rows, traditional databases can no longer cope, so analytical data warehouses such as Impala, Presto, Greenplum, or Apache Drill are used; above roughly the ten-billion level, interactive analysis becomes difficult and an offline data warehouse such as Hive or Spark is adopted.

For this kind of system, wide tables are used a lot in practice because of the large number of dimensions: as information accumulates, a user may end up with hundreds of dimensions, and filtering on, say, 50 of them is easy to achieve with a wide table combined with special data structures such as inverted indexes. Elasticsearch and Solr are search engines; ClickHouse is a good system developed in Russia, but its join support is limited; Druid is used mostly in advertising. There are also combined approaches, where a search engine such as Elasticsearch or Solr is used together with an analytical engine; these are widely used. Typical analytical MPP engines are Greenplum, Presto, and Impala.

Next, let's talk about what factors determine which platform we choose. The first is familiarity: if the project leader is already familiar with a platform, he will choose that platform. If not, the second factor is endorsement by big companies, that is, using the same stack a large company uses. If neither applies, the candidate systems are evaluated in terms of performance and their respective advantages and disadvantages.

The third point is the amount of data: the platform must at least meet the minimum performance requirements for the data volume of the system. Then there is architectural complexity: the more complex the architecture, the harder it is to operate and to locate problems, so something with a relatively simple architecture is preferred. The last factor is the cost of operation and maintenance. That cost is already high, so frequent changes are impractical; if changing anything requires deep familiarity with the platform, that will also affect your selection.

Let's talk about how we chose among Impala, Presto, and Greenplum. The first thing to consider is the data source. A lot of our data is in HDFS, so Greenplum is definitely not a good fit because it is completely closed and uses its own storage architecture. In terms of community and architecture, all three are similar, with no major architectural differences. Impala is slightly better than Presto in terms of performance. There are other factors such as the implementation language: Impala is written in C++, which runs a bit faster than Java, so we lean toward C++. In the end, Impala was chosen.

All three are MPP architectures, and Impala's execution nodes are stateless, so there is no problem when a node goes down and is restarted. Impala's compatibility with Hive storage, its status as an Apache top-level project with a mature community, its support for multiple data source formats, and its efficient query performance were the specific selection factors we considered.

Impala is compatible with multiple data sources. The catalogd service provides metadata: it can connect directly to various databases, or obtain metadata through the Hive Metastore. Impala is efficient in part because it caches this metadata; catalogd loads it and the daemons read from the cache. There is also a Statestored service, a publish/subscribe service through which all state is distributed and kept up to date. On the left of the figure are Impala's execution nodes; a query can be submitted to any of these nodes, which then distributes the work to the related nodes and collects the results after they finish.

Catalogd is the metadata service. When you run a SELECT, Impala uses its cached metadata and does not go through catalogd; DDL operations, however, are applied through catalogd. Statestored is a pub/sub service with many topics, and all Impala nodes subscribe to the topics they need. The Impala daemon itself handles SQL parsing, execution-plan generation, data querying, aggregation, and result return.

The figure above shows how the nodes coordinate on a query. The node that a query enters acts as the Query Planner: it generates the execution plan, routes plan fragments to the surrounding nodes, and finally collects the results. If there are aggregations, each node aggregates its partial results, which are then returned to the coordinating Query Planner and merged into the final result.

Impala caches metadata, including the block information of the related table data in HDFS, so during a query it first checks whether the data is local and reads it locally whenever possible. The second point is parallel computation: the Query Planner generates execution plans that are sent to the surrounding nodes, and the partial results are then aggregated. The third is CodeGen technology, which generates execution code at runtime according to the execution environment and can greatly improve performance. Another point is heavy use of operator pushdown; high performance is hard to achieve without it. With pushdown, filtering happens in the storage layer rather than the compute layer, the amount of data exchanged between the two layers shrinks, and the overall performance of the platform improves greatly.
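To make the pushdown point concrete, here is a minimal sketch assuming a partitioned Parquet table; all table and column names are hypothetical:

```sql
-- Hypothetical partitioned Parquet table, for illustration only.
CREATE TABLE events (
  user_id BIGINT,
  event_type STRING,
  ts TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- The partition predicate prunes whole partitions and the column predicate
-- is evaluated in the Parquet scanner, so most rows are filtered in the
-- storage layer before they ever reach the compute layer.
SELECT event_type, COUNT(*) AS cnt
FROM events
WHERE dt = '2018-06-01'
  AND event_type = 'click'
GROUP BY event_type;
```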

Broadcast join caches the small table on all nodes when it is joined with a large table, and each node then returns its partial results for aggregation. But if, for example, an event table has accumulated tens of billions of rows and there are 500 million users, the user table cannot be broadcast to all nodes; in that case a partitioned (shuffle) join is performed, with both sides hash-partitioned across the nodes and joined locally. Then there is the cost-based optimizer (CBO), which is currently not very accurate, sometimes badly so. There is also parallel aggregation: partial aggregation is performed where the data is produced, and the partial results are then merged according to the GROUP BY columns.
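When the CBO picks the wrong strategy, the join type can be forced with hints. A minimal sketch, assuming hypothetical table names where dim_channel is small and events and users are both large:

```sql
-- Small dimension table: broadcast it to every node.
SELECT e.event_type, d.channel, COUNT(*) AS cnt
FROM events e
JOIN /* +BROADCAST */ dim_channel d ON e.channel_id = d.channel_id
GROUP BY e.event_type, d.channel;

-- Two large tables: hash-partition both sides instead of broadcasting.
SELECT u.city, COUNT(*) AS cnt
FROM events e
JOIN /* +SHUFFLE */ users u ON e.user_id = u.user_id
GROUP BY u.city;
```

Running COMPUTE STATS on the tables involved also gives the optimizer better row-count estimates, so it can choose between broadcast and shuffle on its own.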

Next, let's look at the storage engines Impala supports. The most common ones are HDFS and Kudu, a product created to fill the gap between HDFS and HBase. HBase mainly serves key-value point queries but performs poorly under large scans; HDFS is strong at mass scanning but cannot serve point queries by key. Alluxio is a caching file system that sits in front of HDFS and supports multi-level caching; we introduced Alluxio mainly to handle hot data, which we previously addressed with HDFS caching.
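A minimal sketch of the middle ground Kudu fills, assuming the Impala-Kudu integration is configured; table and column names are hypothetical:

```sql
-- Hypothetical Kudu-backed table created through Impala.
CREATE TABLE user_profile (
  user_id BIGINT,
  city STRING,
  last_active TIMESTAMP,
  PRIMARY KEY (user_id)
)
PARTITION BY HASH (user_id) PARTITIONS 16
STORED AS KUDU;

-- Point lookup by key (the HBase-style access pattern) ...
SELECT city FROM user_profile WHERE user_id = 123456;

-- ... and analytical scans (the HDFS-style access pattern) on the same table.
SELECT city, COUNT(*) AS cnt FROM user_profile GROUP BY city;
```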

How do we run this on the Impala platform? First of all, there is the whole authentication and authorization mechanism. Kerberos and LDAP are used for authentication, together with an audit log; only authenticated users can access the system. Authorization is performed with Apache Sentry at the granularity of database, table, and column, with SELECT, INSERT, and ALL privileges, enabled via the configuration authorization_policy_provider_class = org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider. These are things you have to sort out before going online.
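A minimal sketch of Sentry-based authorization issued through Impala, assuming Sentry is enabled; the role, group, database, and table names are hypothetical:

```sql
-- Create a role, grant it read access to one database, and map it to a group.
CREATE ROLE analyst_role;
GRANT SELECT ON DATABASE report_db TO ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;

-- Column-level grants are also possible for finer-grained control.
GRANT SELECT (user_id, city) ON TABLE report_db.user_profile TO ROLE analyst_role;
```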

For a platform with many users running tasks on it, resource management is needed. The Admission Control mechanism is used: user-to-queue mappings are configured on each Impala node, and each queue can be given a resource quota as well as a per-SQL resource limit. Note that the configuration applies per node: if a user is given 300 GB on a 100-node cluster, each node only gets 2-3 GB, and queries exceeding that limit are rejected. The Impala nodes synchronize this information through the impalad-statistics topic in Statestored; since Statestored communicates with impalad via heartbeats, the resource accounting is somewhat delayed. In the current configuration only memory limits actually take effect; vcores are not isolated. If a queue name is the same as the authenticated user name, SQL submitted by that user is automatically routed to that queue.
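A minimal sketch of how a session can target a queue and cap per-query memory, assuming admission-control pools are already defined in the cluster configuration; the pool name is hypothetical:

```sql
-- Route subsequent queries of this session to a specific admission-control pool.
SET REQUEST_POOL=report_pool;

-- Cap the memory a single query may use on each node.
SET MEM_LIMIT=2g;

SELECT COUNT(*) FROM events WHERE dt = '2018-06-01';
```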

Impala has a web UI that is simple but useful, and we rely on it heavily for troubleshooting and locating problems. Each component exposes a web endpoint on its own port. The basic information includes cluster nodes, catalog information, memory information, and query information. The web UI lets you view a node's memory consumption (per counter, and per query at a given moment), analyze queries on a node (query analysis, SQL diagnosis, terminating abnormal queries), and view metrics. The figure above shows some of the queues we configured and the resource consumption of each. The UI is also used for join analysis, since it shows the execution plan of each SQL; tabs on the page such as Query, Summary, and Memory can be used for SQL analysis.

I have talked about Impala's advantages, features, and how to use it, but as with any open-source platform there are drawbacks. The first is that the catalogd and statestored services are single points of failure. Fortunately queries are not affected: if catalogd fails, metadata updates are simply not synchronized to the Impala nodes; if statestored fails, updates are not propagated and only the messages during the outage are lost. The second is that the web information is not persisted: what is displayed is in-memory history, and it is lost if Impala restarts. Resource isolation is not precise, and the underlying storage cannot distinguish users. There is also load balancing: every impalad can accept SQL, but with 100 impalads that is not easy to manage, so we put HAProxy in front of Impala. In addition, synchronization with the Hive metastore requires manual operation, because Impala caches metadata and is not aware of operations performed directly through Hive or HDFS.
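The manual synchronization usually comes down to statements like the following; a minimal sketch with hypothetical table names, where REFRESH picks up new files in a known table and INVALIDATE METADATA reloads a table created or changed outside Impala:

```sql
-- New data files were written to an existing table by Hive or an ETL job.
REFRESH report_db.events;

-- The table was created or its schema was changed outside Impala.
INVALIDATE METADATA report_db.new_table;
```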

These are the improvements we made around Impala, which is tied closely to Hive. First, HiveServer-style access over ZooKeeper: the URIs of the Impala nodes to be accessed are written into ZooKeeper nodes, reusing Hive's native ZooKeeper-based multi-node access. Second, a management server: because the Impala web pages do not persist their information, the management server saves it so it can still be consulted during troubleshooting after nodes restart. Third, fine-grained permissions and proxying, so that Impala's access to HDFS is subject to low-level permission control. Fourth, JSON format support, which is a requirement of some applications. Fifth, compatibility with Ranger permission management, because permission management across our whole project is based on Ranger. Sixth, batch metadata refresh, which is a real problem in practice: sometimes dozens of tables change at once, and refreshing them one by one is very troublesome. Seventh, metadata synchronization: both Hive and Impala are modified so that each Hive change is written to an intermediate layer, from which Impala picks up the change and synchronizes. Eighth, metadata filtering: when the amount of data is large, a large portion of the tables is not needed for interactive queries, so Impala filters out the unneeded metadata with regular expressions. Finally, for Elasticsearch queries, the relevant operators are pushed down to ES; for example, multi-dimensional filtering queries aggregate data based on inverted-index attributes faster than hash-based aggregation.

Impala application scenarios. The figure shows the big data platform architecture of one department: data flows from Kafka to HDFS, covering both structured and semi-structured data ingestion. After data cleaning, the data is loaded into the upper layers; one layer uses ES for storage, and the top layer queries directly with Impala. This is essentially the framework of the analysis system.

Above is one of our BI products, called Zho. Its underlying platform also connects to Impala; it is a data analysis and reporting platform that links charts with the underlying data. After structured or unstructured data is written to Hive, Impala detects and synchronizes the metadata, and users can then query the data through Impala. Several issues have to be considered. Metadata synchronization: Impala is not aware of data written by ETL, so it depends on the metadata synchronization described above. Data freshness: to avoid a large number of small files destabilizing the NameNode, each write batch must not be too small. Another solution is to use Kudu to solve the small-file problem: real-time data is written to Kudu, and Kudu and HDFS are queried jointly, so on Impala you can see both the Kudu table and the HDFS table.
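A minimal sketch of such a joint query, with hypothetical table names: recent rows sit in a Kudu table and history in a Parquet table on HDFS, and Impala treats them as one dataset:

```sql
-- Historical data in Parquet on HDFS plus real-time data in Kudu,
-- queried together as one logical dataset.
SELECT event_type, COUNT(*) AS cnt
FROM (
  SELECT event_type FROM report_db.events_history   -- Parquet on HDFS
  UNION ALL
  SELECT event_type FROM report_db.events_realtime  -- Kudu
) t
GROUP BY event_type;
```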

— the END —

This article is published by the DataFun community. WeChat official account ID: DataFunTalk.

If you like this article, please share it.