Impala

Reference:

https://www.cnblogs.com/Rainbow-G/articles/4282444.html

https://www.w3cschool.cn/impala/impala_architecture.html

Official website:

https://impala.apache.org/

Impala is a real-time, interactive SQL query engine for big data, developed by Cloudera and inspired by Google's Dremel. Instead of the slow Hive + MapReduce batch-processing path, Impala uses a distributed query engine similar to those of commercial parallel relational databases, consisting of a Query Planner, a Query Coordinator and a Query Exec Engine. It can query data in HDFS or HBase directly with SELECT, JOIN and statistical (aggregate) functions, which greatly reduces latency.

Impala is mainly composed of Impalad, State Store and CLI.

Impalad: Runs on the same node as the DataNode and is represented by the impalad process. It receives query requests from clients. The Impalad that receives a query becomes the Coordinator for that query: it calls the Java front end through JNI to parse the SQL statement and generate a query plan tree, and then its scheduler distributes the execution plan to the other Impalads that hold the relevant data. Those Impalads read and write data, execute the query in parallel, and stream their results back over the network to the Coordinator, which returns them to the client. Each Impalad also maintains a connection to the State Store so that it knows which Impalads are healthy and can accept new work.
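The Coordinator flow described above can be sketched as a toy simulation. This is only an illustration, not Impala's actual code: the node names, the BLOCKS layout, and the functions execute_fragment and coordinate are all hypothetical, and the "query" is hard-coded as a SUM grouped by key.

```python
from collections import defaultdict

# Hypothetical data layout: which (simulated) Impalad holds which local rows.
BLOCKS = {
    "node1": [("alice", 3), ("bob", 5)],
    "node2": [("alice", 4)],
    "node3": [("carol", 7), ("bob", 1)],
}


def execute_fragment(rows):
    """Work done on one simulated Impalad: a partial SUM per key over local rows."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def coordinate(healthy_nodes):
    """Simulated Coordinator: send a fragment to every healthy node that holds
    data, then merge the partial results before returning them to the client."""
    total = defaultdict(int)
    for node in healthy_nodes:
        rows = BLOCKS.get(node, [])
        if not rows:
            continue  # no local data, so no fragment is scheduled on this node
        for key, value in execute_fragment(rows).items():
            total[key] += value
    return dict(sorted(total.items()))


result = coordinate(["node1", "node2", "node3"])
print(result)
```

In the real system the fragments run in parallel and results are streamed over the network; the loop here runs them one after another only to keep the sketch short.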

Impalad starts three Thrift servers: beeswax_server (for client connections), hs2_server (the HiveServer2-compatible interface, used by JDBC/ODBC clients), and be_server (for internal communication between Impalads), along with an ImpalaServer service.

Impala State Store: Tracks the health and location of every Impalad in the cluster and is represented by the statestored process. It creates multiple threads to handle Impalad registrations and subscriptions and to maintain a heartbeat connection with each Impalad. Each Impalad caches a copy of the State Store's information. When an Impalad finds that the State Store is offline, it enters recovery mode and retries registration repeatedly; when the State Store rejoins the cluster, the Impalad returns to normal automatically and updates its cached data. Because of this cache, Impalads can still work while the State Store is offline. However, the cache cannot be updated during that time, so if some Impalads fail, an execution plan may be assigned to a failed Impalad and the query will fail.
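The stale-cache failure mode can be made concrete with a small simulation. It is a minimal sketch under stated assumptions: the StateStore and Impalad classes and their methods (refresh_cache, schedule) are hypothetical names invented for this example and do not mirror Impala's internal API.

```python
class StateStore:
    """Toy stand-in for statestored: holds the set of registered Impalads."""

    def __init__(self):
        self.online = True
        self.members = set()


class Impalad:
    """Toy Impalad that keeps a local copy of the cluster membership."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.cached_members = set()
        store.members.add(name)  # register with the State Store on startup

    def refresh_cache(self):
        """Copy membership from the State Store; keep the old cache if it is offline."""
        if not self.store.online:
            return False
        self.cached_members = set(self.store.members)
        return True

    def schedule(self, actually_alive):
        """Assign work to every cached member; fail if one of them is really dead."""
        dead = sorted(self.cached_members - actually_alive)
        if dead:
            raise RuntimeError(f"fragment assigned to failed Impalad(s): {dead}")
        return sorted(self.cached_members)


store = StateStore()
coordinator = Impalad("a", store)
Impalad("b", store)
Impalad("c", store)
coordinator.refresh_cache()  # cache now holds a, b and c

# The State Store goes offline, then Impalad "c" dies.
store.online = False
alive = {"a", "b"}
refreshed = coordinator.refresh_cache()  # cannot update: cache stays stale
try:
    coordinator.schedule(alive)
    stale_query_failed = False
except RuntimeError:
    stale_query_failed = True

# The State Store comes back and drops "c" (in reality via missed heartbeats).
store.online = True
store.members.discard("c")
coordinator.refresh_cache()
recovered_plan = coordinator.schedule(alive)
print(stale_query_failed, recovered_plan)
```

Queries that touch only healthy Impalads would still succeed during the outage; the failure appears only when the stale cache points at a node that has died in the meantime.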

CLI: A command-line tool for running queries (the Impala Shell is implemented in Python). Impala also provides interfaces for Hue, JDBC, and ODBC.

Relationship with Hive

Both Impala and Hive are data query tools built on Hadoop, each suited to different workloads. From the client's point of view, however, they have a lot in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, storage resource pools, and so on. The relationship between Impala and Hive in Hadoop is shown in Figure 2. Hive is suited to long-running batch queries and analysis, while Impala is suited to real-time interactive SQL queries. Impala gives data analysts a big data analysis tool for quickly experimenting with and validating ideas. A common pattern is to use Hive for data transformation first, and then use Impala for fast analysis of the result set that Hive produced.
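As a rough illustration of the Hive-then-Impala pattern, the sketch below only builds the SQL statements each engine would run; it does not connect to a cluster. The table names (raw_visits, visits_by_user), the columns, and the helper hive_then_impala are hypothetical. The INVALIDATE METADATA step reflects the fact that Impala needs to reload its metadata before it can see a table created through Hive.

```python
def hive_then_impala(source_table, result_table):
    """Return the statements for a two-stage job: heavy transformation in Hive,
    then fast interactive analysis of the result in Impala."""
    hive_steps = [
        # Batch transformation: aggregate the raw data into a compact result table.
        f"CREATE TABLE {result_table} STORED AS PARQUET AS "
        f"SELECT user_id, COUNT(*) AS visits FROM {source_table} GROUP BY user_id",
    ]
    impala_steps = [
        # Make the table that Hive just created visible to Impala.
        f"INVALIDATE METADATA {result_table}",
        # Interactive query against the much smaller result set.
        f"SELECT user_id, visits FROM {result_table} ORDER BY visits DESC LIMIT 10",
    ]
    return {"hive": hive_steps, "impala": impala_steps}


steps = hive_then_impala("raw_visits", "visits_by_user")
for engine, statements in steps.items():
    for statement in statements:
        print(f"[{engine}] {statement}")
```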
