Hive Compression Type Test

Background:

1) Four tables with different storage formats have been created: hxh1, hxh2, hxh3, and hxh4.

2) The data in hxh2, hxh3, and hxh4 has been cleared; hxh1 keeps its data. The hxh1 table holds 74.1 GB.

3) A fifth table, hxh5, has also been created; like hxh1, it uses the TEXTFILE storage format.

4) Original data size: 74.1 GB
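
For reference, a minimal sketch of how the five test tables might have been declared. Only the storage formats, the column names, and the partition key come from this article; the STRING column types and everything else are assumptions.

-- Hypothetical DDL for the test tables; column types are assumed
create table hxh1 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as textfile;

create table hxh2 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as sequencefile;

create table hxh3 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as rcfile;

create table hxh4 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as orc;

create table hxh5 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as textfile;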

Start the test:

1, TextFile test

  1. TextFile is the default format for Hive tables. Storage layout: row-oriented.
  2. The Gzip compression algorithm can be used with it, but the compressed files do not support splitting.
  3. During deserialization, Hive has to check character by character whether each character is a field delimiter or a line terminator, so the deserialization overhead can be dozens of times higher than with SequenceFile.

Enable compression:

  1. set hive.exec.compress.output=true;   -- enable compressed output
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;   -- set the output compression codec to Gzip
  3. set mapred.output.compress=true;   -- compress the MapReduce job output
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;   -- register the Gzip codec
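
The mapred.* names above are the old Hadoop 1.x property names. On Hadoop 2.x and later they still work but are deprecated; the equivalent current names (listed here for reference, verify against your Hadoop version) are:

set mapreduce.output.fileoutputformat.compress=true;          -- replaces mapred.output.compress
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;   -- replaces mapred.output.compression.codec
set mapreduce.output.fileoutputformat.compress.type=BLOCK;    -- replaces mapred.output.compression.type (used in the SequenceFile test below)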

Insert data into the hxh5 table:

insert into table hxh5 partition(createdate='2019-07-21') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh5 table after compression: 23.8 G; time taken: 81.329 seconds
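
The table sizes quoted throughout this article can be checked from the Hive CLI with a dfs command against the table's warehouse directory; the path below is hypothetical and depends on your warehouse location and database:

dfs -du -s -h /user/hive/warehouse/hxh5;   -- total size of the (hypothetical) hxh5 table directory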

2, Sequence File test

  1. Compressing data files saves disk space, but one shortcoming of Hadoop's native compressed files is that they do not support splitting. Files that support splitting can be processed in parallel by multiple mapper tasks; most compressed files cannot be split because they can only be read from the beginning. SequenceFile is a splittable file format that supports Hadoop's block-level compression.
  2. SequenceFile is a binary format provided by the Hadoop API that serializes data to the file as key-value pairs. Storage layout: row-oriented.
  3. SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD is the default but has a low compression ratio; BLOCK usually compresses noticeably better than RECORD.
  4. Its advantage is that it is compatible with MapFile in the Hadoop API.

Enable compression:

  1. set hive.exec.compress.output=true;   -- enable compressed output
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;   -- set the output compression codec to Gzip
  3. set mapred.output.compression.type=BLOCK;   -- set the SequenceFile compression type to BLOCK
  4. set mapred.output.compress=true;
  5. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Insert data into the hxh2 table:

insert into table hxh2 partition(createdate='2019-07-21') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh2 table after compression: 80.8 G; time taken: 186.495 seconds (compression type not set, so the default RECORD was used)

Size of the hxh2 table after compression: 25.2 G; time taken: 81.67 seconds (compression type set to BLOCK)
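
The only difference between the two runs above is the compression type setting; a condensed sketch of the comparison follows (insert overwrite is used here so the second run replaces the first; the original test may have reloaded the data differently):

-- Run 1: compression type left at the default (RECORD)
insert overwrite table hxh2 partition(createdate='2019-07-21')
select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

-- Run 2: BLOCK-level compression, which produced the much smaller 25.2 G result
set mapred.output.compression.type=BLOCK;
insert overwrite table hxh2 partition(createdate='2019-07-21')
select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;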

3, RCFile test

Storage layout: data is divided into row groups, and within each row group the data is stored by column. This combines the advantages of row storage and column storage:

  1. First, RCFile guarantees that all the data of a row is located on the same node, so the cost of tuple reconstruction is very low.
  2. Second, like a column store, RCFile can compress data column by column and can skip reading columns that are not needed.
  3. Data appending: RCFile does not support arbitrary writes; it only provides an append interface, because the underlying HDFS currently only supports appending data to the end of a file.
  4. Row group size: a larger row group helps improve compression efficiency, but it can hurt read performance because it increases the cost of lazy decompression, and larger row groups also occupy more memory, which affects other MapReduce jobs running concurrently. Weighing storage space against query efficiency, Facebook chose 4 MB as the default row group size, while still letting users configure it (see the sketch below).
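
For illustration, the row-group (record buffer) size that Hive uses when writing RCFile can be tuned through the hive.io.rcfile.record.buffer.size property, given in bytes; treat the exact property name and its 4 MB default as something to verify against your Hive version:

set hive.io.rcfile.record.buffer.size=8388608;   -- write 8 MB row groups instead of the default 4 MB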

Enable compression:

  1. set hive.exec.compress.output=true;
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  3. set mapred.output.compress=true;
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Insert data into the hxh3 table:

insert into table hxh3 partition(createdate='2019-07-01') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh3 table after compression: 22.5 G; time taken: 136.659 seconds

4, ORC test

Storage layout: data is divided into row groups, and within each group the data is stored by column.

Compression is fast and column access is fast. ORC is an improved version of RCFile and is more efficient than RCFile.

Enable compression:

  1. set hive.exec.compress.output=true;
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  3. set mapred.output.compress=true;
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Insert data into the hxh4 table:

insert into table hxh4 partition(createdate='2019-07-01') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh4 table after compression: 21.9 G; time taken: 76.602 seconds
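
Note that ORC applies its own internal compression, controlled by the orc.compress table property (ZLIB by default, with NONE and SNAPPY as alternatives), which is presumably why the ORC column in the comparison tables below stays at 21.9 G no matter which output codec is set. A hedged sketch with a hypothetical table name:

-- Hypothetical ORC table that uses Snappy for ORC's built-in compression
create table hxh4_snappy (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");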

5, What is splittable

  When deciding how to compress data that will be processed by MapReduce, it is important to consider whether the compression format supports splitting. Consider an uncompressed file of 1 GB stored in HDFS with a block size of 64 MB: the file is stored as 16 blocks, and a MapReduce job that uses this file as input creates 16 input splits, each processed as the input of an independent map task.

  Now suppose the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS stores it as 16 blocks. However, creating a split for each block is useless, because it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read the data of its block independently of the other blocks. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores the data as a series of compressed blocks. The problem is that the start of each compressed block is not marked in a way that would let a reader positioned at an arbitrary point in the stream locate the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.

  In this case, MapReduce will not split the gzip file, because it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. A single map task will therefore process all 16 HDFS blocks, most of which will not be local to that map task. At the same time, because there are fewer map tasks, the job is split at a coarser granularity and is likely to take longer to run.

6, Compression mode description

1. Evaluation of compression mode

The following three criteria can be used to evaluate compression methods:

  1. Compression ratio: the higher the compression ratio, the smaller the compressed file, so higher is better.
  2. Compression and decompression time: the faster the better.
  3. Whether the compressed file can still be split: a splittable format allows a single file to be processed by multiple mapper tasks, giving better parallelism.

2. Comparison of compression modes

  1. BZip2 has the highest compression ratio but also the highest CPU overhead; Gzip comes next. If disk utilization and I/O are the main concerns, these two algorithms are the most attractive.
  2. LZO and Snappy both decompress quickly; if compression and decompression speed matter most, either is a good choice. LZO and Snappy compress at roughly the same speed, but Snappy decompresses faster than LZO.
  3. Hadoop splits large files into splits of HDFS block size (64 MB by default), each of which is handled by one mapper. Among these algorithms, BZip2, LZO, and Snappy output is splittable, while Gzip output is not.
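
Because Snappy and LZO trade compression ratio for speed, they are typically used for intermediate (map output / shuffle) data rather than for the final table files; a minimal sketch, assuming the Snappy native libraries are installed on the cluster:

set hive.exec.compress.intermediate=true;     -- compress data passed between stages of a query
set mapreduce.map.output.compress=true;       -- compress map output before the shuffle
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;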

7, Common compression format

Compression method | Compressed size | Compression speed | Splittable
GZIP               | small           | -                 | No
BZIP2              | small           | slow              | Yes
LZO                | large           | fast              | Yes
Snappy             | large           | fast              | Yes

Note:

Splittable here means: after a local file is compressed with a given algorithm and uploaded to HDFS, when MapReduce later processes it, can the compressed file be split at the mapper stage, and is that split effective.

The Hadoop codec (encoder/decoder) class for each compression format is shown in the table below:

Compression format | Encoder/decoder
DEFAULT            | org.apache.hadoop.io.compress.DefaultCodec
Gzip               | org.apache.hadoop.io.compress.GzipCodec
Bzip2              | org.apache.hadoop.io.compress.BZip2Codec
DEFLATE            | org.apache.hadoop.io.compress.DeflateCodec
Snappy             | org.apache.hadoop.io.compress.SnappyCodec (for intermediate output)
Lzo                | org.apache.hadoop.io.compress.Lz4Codec (for intermediate output)

8, Comparison results

Size before compression: 74.1 G. Size of the table directory after compression:

Compressed size | TextFile | Sequence File | RCFile | ORC
GZip            | 23.8 G   | 25.2 G        | 22.5 G | 21.9 G
Snappy          | 39.5 G   | 41.3 G        | 39.1 G | 21.9 G
BZIP            | 17.9 G   | 18.9 G        | 18.7 G | 21.9 G
LZO             | 39.9 G   | 41.8 G        | 40.8 G | 21.9 G

Compressed file name | TextFile | Sequence File | RCFile      | ORC
GZip                 | *.gz     | 000000_0      | 000000_1000 | 000000_0
Snappy               | *.snappy | 000000_0      | 000000_1    | 000000_0
BZIP                 | *.bz2    | 000000_0      | 000000_1000 | 000000_0
LZO                  | *.lz4    | 000000_2      | 000000_0    | 000000_0

Data import time | TextFile | Sequence File | RCFile | ORC
GZip             | 81.329s  | 81.67s        | 136.6s | 76.6s
Snappy           | 226s     | 180s          | 79.8s  | 75s
BZIP             | 138.2s   | 134s          | 145.9s | 98.3s
LZO              | 231.8s   | 234s          | 86.1s  | 248.1s

Query speed (select count(1) from table_name):

        | TextFile | Sequence File | RCFile | ORC
GZip    | 46.2s    | 50.4s         | 44.3s  | 38.3s
Snappy  | 46.3s    | 54.3s         | 42.2s  | 40.3s
BZIP    | 114.3s   | 110.3s        | 40.3s  | 38.2s
LZO     | 60.3s    | 52.2s         | 42.2s  | 50.3s

Summary:

Compression time: Gzip …
Compression ratio: BZip > Gzip > Snappy > LZO, but with ORC storage the compressed size is the same for all codecs
Data statistics time: GZip …

Compression type recommendations:

1) BZip and Gzip both have good compression ratios but bring higher CPU overhead. If disk utilization and I/O are the main concerns, both compression algorithms are worth considering.

2) LZO and Snappy both decompress quickly; if compression and decompression speed matter most, either is a good choice. When the Hive table storage format is RCFile or ORC, Snappy and LZO have comparable decompression efficiency, but Snappy compresses better than LZO.

3) Hadoop splits large files into splits of HDFS block size, each of which is handled by one mapper. Among these compression algorithms, BZip2, LZO, and Snappy output can be split, while Gzip output cannot.
