Hadoop Hive advanced queries
SELECT basics
1.0 General queries
1) select * from table_name;
2) select * from table_name where name='….' limit 1;
1.1 CTEs and nested queries
1) with t as (select….) select * from t;
2) select * from (select….) a; (the alias a is required)
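A minimal sketch of the two forms side by side. It assumes the employee_hr table (columns name and employee_id) that appears later in these notes; the filter value is made up for illustration.

```sql
-- 1) CTE: name the intermediate result t, then query it
WITH t AS (
  SELECT name, employee_id
  FROM employee_hr
  WHERE employee_id > 100        -- hypothetical filter
)
SELECT * FROM t;

-- 2) Nested query: the same result written as a subquery in FROM
SELECT a.name, a.employee_id
FROM (
  SELECT name, employee_id
  FROM employee_hr
  WHERE employee_id > 100        -- hypothetical filter
) a;                             -- the alias a cannot be omitted
```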
1.2 Matching columns with a regular expression
Before running the query: SET hive.support.quoted.identifiers = none;
You can then select the matching columns: SELECT `^o.*` FROM offers;
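As a sketch, the regular expression goes inside backticks in the select list. The second query uses a common idiom for selecting every column except one; the column name id is an assumption, since the offers schema is not given here.

```sql
-- Treat backquoted names in the select list as regular expressions
SET hive.support.quoted.identifiers = none;

-- Every column whose name starts with "o"
SELECT `^o.*` FROM offers;

-- Every column except id (hypothetical column name)
SELECT `(id)?+.+` FROM offers;
```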
1.3 Virtual Columns
Input file name: select INPUT__FILE__NAME from emps;
Global file position: select BLOCK__OFFSET__INSIDE__FILE from emps; (the double underscores are a fixed part of the name)
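A short sketch of how the two virtual columns might be used on the emps table. Note that both names are written with double underscores between the words; the aliases are illustrative only.

```sql
-- Show where each row physically comes from
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
FROM emps
LIMIT 10;

-- Count rows per underlying data file
SELECT INPUT__FILE__NAME AS file_name, count(*) AS row_count
FROM emps
GROUP BY INPUT__FILE__NAME;
```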
When joining, put the small table first; the table that follows is called the base table.
In a correlated subquery, each row of the outer table is checked against the rows of the subquery. Hive has the keywords exists (not exists) and in (not in), equivalent to those in MySQL:
select * from userinfos u where u.userid not in (select b.userid from bankcards b where u.userid = b.userid group by b.userid);
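The same question ("which users have no bank card?") can also be sketched with NOT EXISTS. It uses only the userinfos and bankcards tables and the userid column from the query above; whether it performs better than NOT IN depends on the Hive version and data.

```sql
-- Users that have no matching row in bankcards
SELECT u.*
FROM userinfos u
WHERE NOT EXISTS (
  SELECT 1
  FROM bankcards b
  WHERE b.userid = u.userid      -- correlation with the outer row
);
```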
Hive JOIN and MAPJOIN (inner and outer joins)
First enable automatic map join conversion: set hive.auto.convert.join = true;
join -> equivalent to inner join
left join -> keeps every row of the left table, with matching rows from the right
right join -> keeps every row of the right table, with matching rows from the left
full join -> keeps all rows from both tables
MAPJOIN is not supported:
1) After UNION ALL, LATERAL VIEW, or GROUP BY/JOIN/SORT BY/CLUSTER BY/DISTRIBUTE BY
2) Before UNION, JOIN, or another MAPJOIN
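A rough sketch of the join types, reusing the userinfos and bankcards tables from the subquery example and assuming only the userid column. The last query shows the explicit MAPJOIN hint; depending on configuration (for example hive.ignore.mapjoin.hint), Hive may ignore the hint and rely on automatic conversion instead.

```sql
SET hive.auto.convert.join = true;

-- Inner join: only users that have a card
SELECT u.userid
FROM userinfos u
JOIN bankcards b ON u.userid = b.userid;

-- Left join: every user; card_userid is NULL when there is no card
SELECT u.userid, b.userid AS card_userid
FROM userinfos u
LEFT JOIN bankcards b ON u.userid = b.userid;

-- Full join: all rows from both sides
SELECT u.userid, b.userid AS card_userid
FROM userinfos u
FULL JOIN bankcards b ON u.userid = b.userid;

-- Explicit hint asking Hive to load b into memory on the map side
SELECT /*+ MAPJOIN(b) */ u.userid
FROM userinfos u
JOIN bankcards b ON u.userid = b.userid;
```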
Hive set operations (UNION)
1) UNION ALL: keeps duplicates after merging
2) UNION: removes duplicates after merging
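To illustrate the difference, a small sketch that merges the userid column of the userinfos and bankcards tables used earlier. Plain UNION (distinct) needs a reasonably recent Hive release; older versions only accept UNION ALL.

```sql
-- Keeps duplicates: a userid present in both tables appears more than once
SELECT userid FROM userinfos
UNION ALL
SELECT userid FROM bankcards;

-- Removes duplicates: each userid appears once
SELECT userid FROM userinfos
UNION
SELECT userid FROM bankcards;
```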
Loading data: LOAD (moves the data)
1) load data local inpath '……' overwrite into table …..
2) load data local inpath '…….' overwrite into table … partition (field)
!!! Without LOCAL, the path is an HDFS address
!!! LOCAL means the file is on the local file system; OVERWRITE means existing data is overwritten
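A sketch of both variants. The file paths and the employee_partitioned table are hypothetical placeholders, not names from these notes; substitute your own.

```sql
-- Local file system -> table, replacing existing data
LOAD DATA LOCAL INPATH '/tmp/employee.txt'          -- hypothetical local path
OVERWRITE INTO TABLE employee;

-- HDFS -> one partition of a partitioned table (no LOCAL keyword)
LOAD DATA INPATH '/user/hive/staging/employee.txt'  -- hypothetical HDFS path
OVERWRITE INTO TABLE employee_partitioned           -- hypothetical table
PARTITION (year=2018, month=9);
```

When the source is on HDFS, the file is moved into the table's directory rather than copied, so it no longer exists at the original path afterwards.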
Loading data: inserting with INSERT
1) Single insert (insert into one table from another)
from ctas_employee
insert overwrite table ….. select '….'
!!! The two tables must have the same number of columns with matching types.
2) Multiple inserts (several insert overwrite table clauses, each targeting a different table)
from ctas_employee
insert overwrite table employee select *
insert overwrite table employee_internal select *;
!!! Do not put a semicolon at the end of the first insert clause; that way several inserts run as one statement
3) Insert into a partition
from ctas_patitioned
insert overwrite table employee PARTITION (year, month)
select *, '2018', '09';
!!! For a static insert, specify the values in the PARTITION clause: (year=2018, month=9).
!!! For a dynamic insert, no values are specified there. If the source lacks the partition columns, append their values directly in the select list.
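The two styles can be sketched as follows. This assumes the source columns line up with the non-partition columns of employee. The two SET lines are the usual switches for dynamic partitioning; whether they are needed depends on your Hive defaults.

```sql
-- Static partition insert: the partition values are fixed in the clause
FROM ctas_employee
INSERT OVERWRITE TABLE employee PARTITION (year=2018, month=9)
SELECT *;

-- Dynamic partition insert: values come from the trailing select columns
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

FROM ctas_patitioned
INSERT OVERWRITE TABLE employee PARTITION (year, month)
SELECT *, '2018', '09';          -- last two columns feed year and month
```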
Using INSERT to write or export data to files
Write to a local file, an HDFS file, and a table from the same data source (the key point is the shared source):
from ctas_employee (fixed opening clause)
Local: insert overwrite local directory '/tmp/out1' select *;
HDFS: insert overwrite directory '/tmp/out1' select *;
Table: insert overwrite table employee_internal select *;
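Combined into one statement, the three targets might look like the sketch below. Only the last clause ends with a semicolon, so the source table is scanned once for all three outputs.

```sql
FROM ctas_employee
-- 1) local file system directory
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/out1'
SELECT *
-- 2) HDFS directory (same path string, different file system)
INSERT OVERWRITE DIRECTORY '/tmp/out1'
SELECT *
-- 3) another table
INSERT OVERWRITE TABLE employee_internal
SELECT *;
```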
Hive data exchange: IMPORT/EXPORT
1) Use export to export data
export table table_name to 'hdfs path';
export table table_name partition (year, month) to 'hdfs path'; (year and month data is produced only if the table is partitioned)
import table table_name from 'previously exported data path';
import table old_table from 'previously exported data path'; (partitioned tables can be imported as well)
Delete partition: alter table uu drop partition(year=2017,month=12);
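A minimal round-trip sketch of export and import. The HDFS path and the target table name employee_imported are hypothetical; the export directory should not already contain data.

```sql
-- Write the table's data and metadata to an HDFS directory
EXPORT TABLE employee TO '/tmp/export/employee';          -- hypothetical path

-- Recreate it under a new name from that directory
IMPORT TABLE employee_imported FROM '/tmp/export/employee';
```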
Hive data sorting: ORDER BY
select * from table_name order by column_name;
Hive data sorting: SORT BY/DISTRIBUTE BY
!!! Key point: set the number of reducers, e.g. set mapred.reduce.tasks = 15 (the result of sort by depends on the number of reducers)
1) sort by (sorts the data within each reducer)
Set the number of reducers: set mapred.reduce.tasks = 1
With 1 reducer, the whole table is guaranteed to come out sorted
With 2 reducers, the output is sorted in two separate segments, one per reducer
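A sketch of the effect, borrowing the employee_hr table (name, employee_id) from the CLUSTER BY example in these notes. With two reducers, each reducer's output is sorted on its own, so the combined result is not globally ordered.

```sql
SET mapred.reduce.tasks = 2;

-- Sorted within each reducer only
SELECT name, employee_id
FROM employee_hr
SORT BY employee_id DESC;

-- For comparison: ORDER BY yields one global order
SELECT name, employee_id
FROM employee_hr
ORDER BY employee_id DESC;
```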
2) distribute by (similar to group by)
It works like grouping first: rows with the same key go to the same reducer. It is used together with sort by (for example with desc) and is written before it.
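A possible example of the combination, again assuming the employee_hr table with name and employee_id columns. All rows with the same name land on one reducer and are sorted there.

```sql
SET mapred.reduce.tasks = 2;

SELECT name, employee_id
FROM employee_hr
DISTRIBUTE BY name               -- same name -> same reducer
SORT BY name, employee_id DESC;  -- order inside each reducer
```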
Hive data sorting: CLUSTER BY
3) cluster by = distribute by + sort by
SELECT name, employee_id FROM employee_hr CLUSTER BY name;
To make full use of all reducers when performing a global sort, you can use CLUSTER BY first and then ORDER BY.
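One way to sketch "CLUSTER BY first, then ORDER BY" is to cluster in a subquery and order the result outside it. The alias base is arbitrary; how much this helps depends on the data and the Hive version.

```sql
SELECT base.name, base.employee_id
FROM (
  -- all reducers share the work of distributing and sorting by name
  SELECT name, employee_id
  FROM employee_hr
  CLUSTER BY name
) base
ORDER BY base.name;              -- final global ordering
```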
Example 1: (handling data skew)
Set the number of reducers: set mapred.reduce.tasks = 15
1. Large table and small table
    1. mapreduce
        1. cacheFile
        2. groupCombinapartition
    2. Processing in hive
        1. mapjoin: set hive.auto.convert.join=true (25M)
2. Some data is relatively small
    1. mapreduce
        1. groupCombinapartition
    2. hive
        1. partition by (year, month): partitioned table
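For the large-table/small-table case, the Hive side might be configured roughly as below. The threshold property is shown with a value of about 25 MB to match the "25M" in the outline. Treating bankcards as the small table is purely an assumption for illustration.

```sql
-- Let Hive convert eligible joins into map joins automatically
SET hive.auto.convert.join = true;
-- Tables below this size (in bytes, about 25 MB) count as "small"
SET hive.mapjoin.smalltable.filesize = 25000000;

-- The small table (assumed here: bankcards) is loaded into memory on each
-- mapper, so the join needs no reduce phase and skewed keys do not pile up
SELECT u.userid
FROM bankcards b
JOIN userinfos u ON b.userid = u.userid;
```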