As shown in the figure below, Hive logically consists of three major parts.

Hive Clients
Hive Services
Hive Storage and Computing

Users operate Hive through three main interfaces: the CLI, the Client, and the WUI (web user interface).

The most commonly used one is the CLI. When the CLI is started, a copy of Hive is started at the same time.

The Client is the Hive client, through which the user connects to Hive Server. To start Client mode, you need to specify the node where Hive Server is located and start Hive Server on that node. There are three kinds of client: Thrift Client, JDBC Client, and ODBC Client.

The web interface (WUI) accesses Hive through a browser.

Hive stores metadata in a relational database such as MySQL or Derby. The metadata includes table names, each table's columns and partitions and their attributes, table-level attributes (for example, whether the table is external), and the directory where the table's data is located.
The interpreter, compiler, and optimizer carry an HQL query statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and later invoked and executed by MapReduce.
Hive data is stored in HDFS, and most queries and computations are carried out by MapReduce (note that queries consisting only of *, such as select * from tbl, do not generate MapReduce tasks).
The Driver in the figure above handles all requests, from the application through the metastore to the file system, for subsequent operations.

Hive components

Driver

This component receives the queries. It implements the notion of session handles and provides execute and fetch APIs modeled on the JDBC/ODBC interfaces.

Compiler

This component parses the query, performs semantic analysis on the different query blocks and query expressions, and finally generates an execution plan based on the table and partition metadata looked up from the metastore.

Execution Engine

This component executes the execution plan created by the compiler. As a data structure, the plan is a DAG of stages. The execution engine manages the dependencies between the different stages of the plan and runs each stage on the appropriate system component.
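To make the "plan as a DAG" idea concrete, here is a minimal sketch of running stages in dependency order. It is not Hive's execution engine: the stage names and the `run_plan` helper are hypothetical, and Python's standard `graphlib` module stands in for the engine's dependency tracking.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "stage-1-scan-orders": set(),
    "stage-2-scan-users": set(),
    "stage-3-join": {"stage-1-scan-orders", "stage-2-scan-users"},
    "stage-4-group-by": {"stage-3-join"},
    "stage-5-move-result": {"stage-4-group-by"},
}


def run_plan(plan):
    """Run every stage only after all of its dependencies have finished."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    executed = []
    while sorter.is_active():
        # Stages returned together have no pending dependencies,
        # so a real engine could launch them in parallel.
        for stage in sorted(sorter.get_ready()):
            executed.append(stage)  # stand-in for actually running the stage
            sorter.done(stage)      # unblocks the stages that depend on it
    return executed


order = run_plan(plan)
```

The two scan stages become ready at the same time, which is the point where a real engine could run independent stages concurrently.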

Metastore

Hive's metastore component is the centralized storage place for Hive metadata. It stores the structural information of the various tables and partitions in the data warehouse, including column and column type information, the serialization and deserialization information needed to read and write data, and the location of the HDFS files where the data is stored.
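As an illustration of the kind of record this implies, the sketch below models one table's metadata as a plain Python dataclass. The `TableMetadata` class, its field names, and the example table are assumptions made for illustration; the real metastore schema is spread over several relational tables and is considerably richer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TableMetadata:
    """Hypothetical record; not the real metastore schema."""
    name: str
    columns: List[Tuple[str, str]]         # (column name, column type)
    partition_keys: List[Tuple[str, str]]  # (partition column, type)
    serde: str                             # class used to read and write rows
    location: str                          # HDFS directory of the table data
    external: bool = False
    partitions: Dict[str, str] = field(default_factory=dict)  # spec -> directory

    def add_partition(self, spec: str) -> str:
        """Register a partition and return the directory derived for it."""
        path = f"{self.location}/{spec}"
        self.partitions[spec] = path
        return path


page_views = TableMetadata(
    name="page_views",
    columns=[("user_id", "bigint"), ("url", "string")],
    partition_keys=[("dt", "string")],
    serde="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
    location="hdfs:///warehouse/page_views",
)
partition_dir = page_views.add_partition("dt=2024-01-01")
```

With this information alone, a reader knows which files to open, how to deserialize each row, and which columns to expect.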

The Metastore component has two parts: the metastore service and the metadata storage database.

The metastore database is a relational database, such as Derby (Hive's default embedded on-disk database) or MySQL.
The metastore service is a service component built on top of the back-end metadata storage; it interacts with the Hive services.
By default, the metastore service and the Hive services are installed together and run in the same process. The metastore service can also be separated from the Hive services, installed independently in the cluster, and called remotely by Hive. This makes it possible to place the metadata layer behind a firewall: clients access the Hive service, which in turn connects to the metadata layer, giving better manageability and security.

Using a remote metastore service lets the metastore service and the Hive services run in different processes, which also improves the stability and efficiency of the Hive services.

Hive execution process

Figure: Hive basic architecture

The general steps of the process are as follows:

Users submit queries and other tasks to the Driver.
The Driver creates a session handle for the query and then sends the query to the Compiler to generate an execution plan.
The Compiler obtains the required Hive metadata from the Metastore according to the user's task information. This metadata is used for type checking and for pruning the abstract syntax tree in subsequent stages.
With the metadata in hand, the Compiler compiles the task: it first converts the HiveQL into an abstract syntax tree, converts the abstract syntax tree into query blocks, converts the query blocks into a logical query plan, rewrites the logical query plan, transforms the logical plan into a physical plan (MapReduce), and finally selects the best strategy.
The Compiler submits the final plan to the Driver.
The Driver passes the plan to the Execution Engine, which submits it together with the obtained metadata to the JobTracker or ResourceManager for execution; the tasks read data directly from HDFS to perform the corresponding operations.
The Execution Engine obtains the result of the execution.
The Driver fetches the result and returns it to the user.
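The flow above can be sketched as a toy pipeline. This is only an illustration of how the pieces hand work to each other, not Hive's actual classes or API: the `Metastore`, `Compiler`, `ExecutionEngine`, and `Driver` classes below are hypothetical, the "query" is already split into a table and a column list so no parsing is needed, and an in-memory dict stands in for HDFS.

```python
class Metastore:
    """Stand-in for the metastore: table name -> list of column names."""

    def __init__(self, tables):
        self._tables = tables

    def get_columns(self, table):
        if table not in self._tables:
            raise KeyError(f"unknown table: {table}")
        return self._tables[table]


class Compiler:
    """Checks the request against the metadata and emits a tiny plan."""

    def __init__(self, metastore):
        self._metastore = metastore

    def compile(self, table, columns):
        known = self._metastore.get_columns(table)  # metadata lookup
        missing = [c for c in columns if c not in known]
        if missing:
            raise ValueError(f"unknown columns: {missing}")
        # The "plan" is an ordered list of (operator, argument) steps.
        return [("scan", table), ("project", tuple(columns))]


class ExecutionEngine:
    """Runs the plan against storage (a dict standing in for HDFS)."""

    def __init__(self, storage):
        self._storage = storage

    def execute(self, plan):
        rows = []
        for op, arg in plan:
            if op == "scan":
                rows = list(self._storage[arg])
            elif op == "project":
                rows = [{c: row[c] for c in arg} for row in rows]
        return rows


class Driver:
    """Receives the request, asks for a plan, runs it, returns the result."""

    def __init__(self, compiler, engine):
        self._compiler = compiler
        self._engine = engine

    def run(self, table, columns):
        plan = self._compiler.compile(table, columns)
        return self._engine.execute(plan)


metastore = Metastore({"users": ["id", "name", "city"]})
storage = {"users": [{"id": 1, "name": "ann", "city": "Oslo"},
                     {"id": 2, "name": "bo", "city": "Rome"}]}
driver = Driver(Compiler(metastore), ExecutionEngine(storage))
result = driver.run("users", ["name"])
```

Note that only the compiler talks to the metastore and only the execution engine touches storage, which mirrors the separation of responsibilities in the steps above.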

Create a table

Analyze the Hive statement submitted by the user -> parse it -> decompose it into Hive objects such as tables, fields, and partitions.

Construct the corresponding table, field, partition, and other objects from the parsed information, obtain the latest ID for each constructed object from SEQUENCE_TABLE, and write it to the metadata database tables through the DAO methods together with the object's information (name, type, etc.). After a successful write, advance the latest ID stored in SEQUENCE_TABLE by 5.
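The ID bookkeeping described above can be sketched with an in-memory SQLite database. This is a simplified illustration under stated assumptions, not the metastore's DAO code: the two-column SEQUENCE_TABLE layout, the reduced TBLS table, and the `register_table` helper are assumptions, and the step of 5 is taken from the description above.

```python
import sqlite3

ID_STEP = 5  # increment applied after each successful write, per the text


def create_schema(conn):
    """Create simplified, assumed versions of the two tables involved."""
    with conn:
        conn.execute(
            "CREATE TABLE SEQUENCE_TABLE ("
            "SEQUENCE_NAME TEXT PRIMARY KEY, NEXT_VAL INTEGER NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE TBLS ("
            "TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT NOT NULL, "
            "TBL_TYPE TEXT NOT NULL)"
        )
        conn.execute("INSERT INTO SEQUENCE_TABLE VALUES ('TBLS', 1)")


def register_table(conn, name, tbl_type):
    """Read the latest ID, write the object, then advance the sequence."""
    with conn:  # commit all three statements together, or roll back on error
        (next_id,) = conn.execute(
            "SELECT NEXT_VAL FROM SEQUENCE_TABLE WHERE SEQUENCE_NAME = 'TBLS'"
        ).fetchone()
        conn.execute(
            "INSERT INTO TBLS VALUES (?, ?, ?)", (next_id, name, tbl_type)
        )
        conn.execute(
            "UPDATE SEQUENCE_TABLE SET NEXT_VAL = ? WHERE SEQUENCE_NAME = 'TBLS'",
            (next_id + ID_STEP,),
        )
    return next_id


conn = sqlite3.connect(":memory:")
create_schema(conn)
first_id = register_table(conn, "page_views", "MANAGED_TABLE")
second_id = register_table(conn, "raw_logs", "EXTERNAL_TABLE")
```

Doing the read, the insert, and the sequence update in one transaction means a failed write leaves the sequence value unchanged.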

In fact, common RDBMSs are organized the same way: like the Hive metadata, the ID information is kept in system tables, and the data can easily be located through this metadata.

Optimizer

The optimizer is a constantly evolving component, and most plan transformations are done by the optimizer.

Combine multiple joins into one multi-way join.
Repartition join, group-by, and custom MapReduce operations.
Prune unnecessary columns.
Push predicates down into table scan operations.
For partitioned tables, prune unnecessary partitions.
In sampling queries, prune unnecessary buckets.
The optimizer also adds a local (map-side) aggregation operation to handle large grouped aggregations and a repartition operation to handle skewed grouped aggregations.
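Two of these ideas, partition pruning and local aggregation, can be sketched in a few lines. The sketch assumes partitions are plain dicts of key values and that each input split is a list of row dicts; the function names and sample data are hypothetical and do not come from Hive.

```python
from collections import Counter


def prune_partitions(partitions, predicate):
    """Keep only the partitions whose key values satisfy the predicate."""
    return [p for p in partitions if predicate(p)]


def local_aggregate(rows, key):
    """Partial count per key within one input split (the map side)."""
    return Counter(row[key] for row in rows)


def merge_partials(partials):
    """Combine the partial counts from all splits (the reduce side)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Partition pruning: only partitions matching dt >= '2024-01-02' are scanned.
partitions = [{"dt": "2024-01-01"}, {"dt": "2024-01-02"}, {"dt": "2024-01-03"}]
kept = prune_partitions(partitions, lambda p: p["dt"] >= "2024-01-02")

# Local aggregation: each split ships one partial count per key
# instead of one record per row.
split_a = [{"url": "/home"}, {"url": "/home"}, {"url": "/cart"}]
split_b = [{"url": "/home"}, {"url": "/cart"}]
counts = merge_partials(
    [local_aggregate(split_a, "url"), local_aggregate(split_b, "url")]
)
```

In both cases the gain comes from reading or moving less data: pruned partitions are never scanned, and locally aggregated splits send far fewer records to the final aggregation.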

Hive Basic Architecture