Hive summary

(1) Basic introduction to hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality.

Related terminology:
DML (Data Manipulation Language)
Common statements: SELECT, UPDATE, INSERT, DELETE; mainly used to operate on the data in a database.
DDL (Data Definition Language)
Common statements: CREATE, ALTER, DROP; mainly used to define or change table structure, data types, links and constraints between tables, and other initialization work.
DCL (Data Control Language)
Statements used to set or change the permissions of database users or roles.
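As a quick illustration, one or two statements per category (table and user names here are made up, and the DCL syntax varies between databases; Hive has its own authorization model):

```sql
-- DML: operate on the data itself
INSERT INTO user_table VALUES ('u001', 'Tom');
SELECT userid, username FROM user_table;

-- DDL: define or change structure
CREATE TABLE user_table (userid STRING, username STRING);
ALTER TABLE user_table ADD COLUMNS (sex STRING);

-- DCL: set or change privileges
GRANT SELECT ON TABLE user_table TO USER analyst1;
```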

################################################
############ hive common commands ############

1. Create a table

1.1 Table creation syntax
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

1.2 Parameter description:
① The EXTERNAL keyword lets the user create an external table, specifying the path to the actual data (LOCATION) while creating the table. When Hive creates an internal (managed) table, the data is moved into the path pointed to by the data warehouse; when an external table is created, only the path where the data resides is recorded, and the data itself is not moved. When a table is dropped, an internal table's metadata and data are deleted together, while for an external table only the metadata is deleted, not the data.
② ROW FORMAT: the user can customize the SerDe or use the built-in SerDe when creating the table. If neither ROW FORMAT nor ROW FORMAT DELIMITED is specified, the built-in SerDe is used. When creating a table, the user also needs to specify the columns, and may specify a custom SerDe along with them; Hive reads the data of the table's columns through the SerDe.
③ STORED AS specifies the storage format of the source files. Commonly used formats: TEXTFILE, the default used when no format is given at table creation; SEQUENCEFILE, a binary format provided by the Hadoop API that is easy to use, splittable, and compressible; RCFILE, a storage format combining row and column layouts; ORC, a newer format from Hive that is an upgraded version of RCFILE with greatly improved performance.
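The internal/external distinction described in ① can be seen in a small sketch (table names and the HDFS path are illustrative):

```sql
-- Managed (internal) table: DROP deletes both metadata and data
CREATE TABLE managed_demo (id STRING);
DROP TABLE managed_demo;   -- files under the warehouse directory are removed

-- External table: DROP deletes only the metadata
CREATE EXTERNAL TABLE external_demo (id STRING)
LOCATION 'hdfs://test01/demo/external_demo';
DROP TABLE external_demo;  -- files under the LOCATION path are kept
```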

1.3 Example:

CREATE EXTERNAL TABLE IF NOT EXISTS hive_table_user
(
userid STRING COMMENT 'user id'
,username STRING COMMENT 'user name'
,sex STRING COMMENT 'user gender'
,address STRING COMMENT 'home address'
)
COMMENT 'user table'
PARTITIONED BY (pt_d VARCHAR(8) COMMENT 'day partition', pt_h VARCHAR(2) COMMENT 'hour partition')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION 'hdfs://test01/test02/hive_table_user'
TBLPROPERTIES ('orc.compress'='ZLIB');
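A partitioned table like this is typically populated partition by partition; a minimal sketch (the partition values and `some_staging_table` are illustrative):

```sql
-- Register a partition whose files already sit under the table's LOCATION
ALTER TABLE hive_table_user ADD IF NOT EXISTS PARTITION (pt_d='20190101', pt_h='00');

-- Or write a partition with an insert from another table
INSERT OVERWRITE TABLE hive_table_user PARTITION (pt_d='20190101', pt_h='00')
SELECT userid, username, sex, address FROM some_staging_table;
```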

2. Commonly used sql commands:
(1) show tables;
(2) use db2;
...
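A few more everyday commands worth knowing (all standard Hive; the table name refers to the example above):

```sql
show databases;                      -- list databases
describe formatted hive_table_user;  -- columns, location, storage format
show partitions hive_table_user;     -- list the table's partitions
show create table hive_table_user;   -- print the full DDL of the table
```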

################################################
############ scala operates hive ############
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hive support must be enabled for sql() to see hive tables
val ss = SparkSession.builder().enableHiveSupport().getOrCreate()
val sc = ss.sparkContext
val sqlContext = ss.sqlContext
import sqlContext.implicits._

// Read the hive table, getting a Dataset[Row]
val contentDF = sqlContext.sql("select userid, mathscore, physicscore, address from user_table where pt_d=20190101")
  .select("userid", "mathscore", "physicscore", "address")
  .where("userid is not null")
  .where("mathscore is not null")
  .where("physicscore is not null")

