Hive Compression Type Test

Background:

1) Four tables with different storage formats have been created: hxh1, hxh2, hxh3, and hxh4.

2) The data in hxh2, hxh3, and hxh4 has been cleared; hxh1 keeps its data. The hxh1 table holds 74.1 GB.

3) A fifth table, hxh5, has also been created; like hxh1, it uses the TEXTFILE storage format.

4) Original data size: 74.1 GB
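
For reference, a minimal sketch of how the five test tables might have been declared. Only the storage formats, the column names, and the partition key come from this article; the STRING column types and everything else are assumptions.

-- Hypothetical DDL for the test tables; column types are assumed
create table hxh1 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as textfile;

create table hxh2 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as sequencefile;

create table hxh3 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as rcfile;

create table hxh4 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as orc;

create table hxh5 (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string) stored as textfile;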

Start the test:

1, TextFile test

  1. TextFile is the default format for Hive tables. Storage layout: row-oriented.
  2. The Gzip compression algorithm can be used with it, but the compressed files do not support splitting.
  3. During deserialization, Hive has to check character by character whether each character is a field delimiter or a line terminator, so the deserialization overhead can be dozens of times higher than with SequenceFile.

Enable compression:

  1. set hive.exec.compress.output=true;   -- enable compressed output
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;   -- set the output compression codec to Gzip
  3. set mapred.output.compress=true;   -- compress the MapReduce job output
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;   -- register the Gzip codec
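
The mapred.* names above are the old Hadoop 1.x property names. On Hadoop 2.x and later they still work but are deprecated; the equivalent current names (listed here for reference, verify against your Hadoop version) are:

set mapreduce.output.fileoutputformat.compress=true;          -- replaces mapred.output.compress
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;   -- replaces mapred.output.compression.codec
set mapreduce.output.fileoutputformat.compress.type=BLOCK;    -- replaces mapred.output.compression.type (used in the SequenceFile test below)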

Insert data into the hxh5 table:

insert into table hxh5 partition(createdate='2019-07-21') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh5 table after compression: 23.8 G; time taken: 81.329 seconds
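
The table sizes quoted throughout this article can be checked from the Hive CLI with a dfs command against the table's warehouse directory; the path below is hypothetical and depends on your warehouse location and database:

dfs -du -s -h /user/hive/warehouse/hxh5;   -- total size of the (hypothetical) hxh5 table directory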

2, Sequence File test

  1. Compressing data files saves disk space, but one shortcoming of Hadoop's native compressed files is that they do not support splitting. Files that support splitting can be processed in parallel by multiple mapper tasks; most compressed files cannot be split because they can only be read from the beginning. SequenceFile is a splittable file format that supports Hadoop's block-level compression.
  2. SequenceFile is a binary format provided by the Hadoop API that serializes data to the file as key-value pairs. Storage layout: row-oriented.
  3. SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD is the default but has a low compression ratio; BLOCK usually compresses noticeably better than RECORD.
  4. Its advantage is that it is compatible with MapFile in the Hadoop API.

Enable compression:

  1. set hive.exec.compress.output=true;   -- enable compressed output
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;   -- set the output compression codec to Gzip
  3. set mapred.output.compression.type=BLOCK;   -- set the SequenceFile compression type to BLOCK
  4. set mapred.output.compress=true;
  5. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Insert data into the hxh2 table:

insert into table hxh2 partition(createdate='2019-07-21') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh2 table after compression: 80.8 G; time taken: 186.495 seconds (compression type not set, so the default RECORD was used)

Size of the hxh2 table after compression: 25.2 G; time taken: 81.67 seconds (compression type set to BLOCK)
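
The only difference between the two runs above is the compression type setting; a condensed sketch of the comparison follows (insert overwrite is used here so the second run replaces the first; the original test may have reloaded the data differently):

-- Run 1: compression type left at the default (RECORD)
insert overwrite table hxh2 partition(createdate='2019-07-21')
select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

-- Run 2: BLOCK-level compression, which produced the much smaller 25.2 G result
set mapred.output.compression.type=BLOCK;
insert overwrite table hxh2 partition(createdate='2019-07-21')
select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;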

3, RCFile test

Storage layout: data is divided into row groups, and within each row group the data is stored by column. This combines the advantages of row storage and column storage:

  1. First, RCFile guarantees that all the data of a row is located on the same node, so the cost of tuple reconstruction is very low.
  2. Second, like a column store, RCFile can compress data column by column and can skip reading columns that are not needed.
  3. Data appending: RCFile does not support arbitrary writes; it only provides an append interface, because the underlying HDFS currently only supports appending data to the end of a file.
  4. Row group size: a larger row group helps improve compression efficiency, but it can hurt read performance because it increases the cost of lazy decompression, and larger row groups also occupy more memory, which affects other MapReduce jobs running concurrently. Weighing storage space against query efficiency, Facebook chose 4 MB as the default row group size, while still letting users configure it (see the sketch below).
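
For illustration, the row-group (record buffer) size that Hive uses when writing RCFile can be tuned through the hive.io.rcfile.record.buffer.size property, given in bytes; treat the exact property name and its 4 MB default as something to verify against your Hive version:

set hive.io.rcfile.record.buffer.size=8388608;   -- write 8 MB row groups instead of the default 4 MB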

Enable compression:

  1. set hive.exec.compress.output=true;
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  3. set mapred.output.compress=true;
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Insert data into the hxh3 table:

insert into table hxh3 partition(createdate='2019-07-01') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh3 table after compression: 22.5 G; time taken: 136.659 seconds

4, ORC test

Storage layout: data is divided into row groups, and within each group the data is stored by column.

Compression is fast and column access is fast. ORC is an improved version of RCFile and is more efficient than RCFile.

Enable compression:

  1. set hive.exec.compress.output=true;
  2. set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
  3. set mapred.output.compress=true;
  4. set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;

Insert data into the hxh4 table:

insert into table hxh4 partition(createdate='2019-07-01') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;

Size of the hxh4 table after compression: 21.9 G; time taken: 76.602 seconds
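
Note that ORC applies its own internal compression, controlled by the orc.compress table property (ZLIB by default, with NONE and SNAPPY as alternatives), which is presumably why the ORC column in the comparison tables below stays at 21.9 G no matter which output codec is set. A hedged sketch with a hypothetical table name:

-- Hypothetical ORC table that uses Snappy for ORC's built-in compression
create table hxh4_snappy (pvid string, sgid string, fr string, ffr string, mod string, version string, vendor string)
partitioned by (createdate string)
stored as orc
tblproperties ("orc.compress"="SNAPPY");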

5, What is splittable

  When deciding how to compress data that will be processed by MapReduce, it is important to consider whether the compression format supports splitting. Consider an uncompressed file of 1 GB stored in HDFS with a block size of 64 MB: the file is stored as 16 blocks, and a MapReduce job that uses this file as input creates 16 input splits, each processed as the input of an independent map task.

  Now suppose the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS stores it as 16 blocks. However, creating a split for each block is useless, because it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read the data of its block independently of the other blocks. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores the data as a series of compressed blocks. The problem is that the start of each compressed block is not marked in a way that would let a reader positioned at an arbitrary point in the stream locate the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.

  In this case, MapReduce will not split the gzip file, because it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. A single map task will therefore process all 16 HDFS blocks, most of which will not be local to that map task. At the same time, because there are fewer map tasks, the job is split at a coarser granularity and is likely to take longer to run.

6, Compression mode description

1. Evaluation of compression mode

The following three criteria can be used to evaluate compression methods:

  1. Compression ratio: the higher the compression ratio, the smaller the compressed file, so higher is better.
  2. Compression and decompression time: the faster the better.
  3. Whether the compressed file can still be split: a splittable format allows a single file to be processed by multiple mapper tasks, giving better parallelism.

2. Comparison of compression modes

  1. BZip2 has the highest compression ratio but also the highest CPU overhead; Gzip comes next. If disk utilization and I/O are the main concerns, these two algorithms are the most attractive.
  2. LZO and Snappy both decompress quickly; if compression and decompression speed matter most, either is a good choice. LZO and Snappy compress at roughly the same speed, but Snappy decompresses faster than LZO.
  3. Hadoop splits large files into splits of HDFS block size (64 MB by default), each of which is handled by one mapper. Among these algorithms, BZip2, LZO, and Snappy output is splittable, while Gzip output is not.
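
Because Snappy and LZO trade compression ratio for speed, they are typically used for intermediate (map output / shuffle) data rather than for the final table files; a minimal sketch, assuming the Snappy native libraries are installed on the cluster:

set hive.exec.compress.intermediate=true;     -- compress data passed between stages of a query
set mapreduce.map.output.compress=true;       -- compress map output before the shuffle
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;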

7, Common compression format

Compression method | Compressed size | Compression speed | Splittable
GZIP               | small           | -                 | No
BZIP2              | small           | slow              | Yes
LZO                | large           | fast              | Yes
Snappy             | large           | fast              | Yes

Note:

Splittable here means: after a local file is compressed with a given algorithm and uploaded to HDFS, when MapReduce later processes it, can the compressed file be split at the mapper stage, and is that split effective.

The Hadoop codec (encoder/decoder) class for each compression format is shown in the table below:

Compression format | Encoder/decoder
DEFAULT            | org.apache.hadoop.io.compress.DefaultCodec
Gzip               | org.apache.hadoop.io.compress.GzipCodec
Bzip2              | org.apache.hadoop.io.compress.BZip2Codec
DEFLATE            | org.apache.hadoop.io.compress.DeflateCodec
Snappy             | org.apache.hadoop.io.compress.SnappyCodec (for intermediate output)
Lzo                | org.apache.hadoop.io.compress.Lz4Codec (for intermediate output)

8, Comparison results

Size before compression: 74.1 G. Size of the table directory after compression:

Compressed size | TextFile | Sequence File | RCFile | ORC
GZip            | 23.8 G   | 25.2 G        | 22.5 G | 21.9 G
Snappy          | 39.5 G   | 41.3 G        | 39.1 G | 21.9 G
BZIP            | 17.9 G   | 18.9 G        | 18.7 G | 21.9 G
LZO             | 39.9 G   | 41.8 G        | 40.8 G | 21.9 G

Compressed file name | TextFile | Sequence File | RCFile      | ORC
GZip                 | *.gz     | 000000_0      | 000000_1000 | 000000_0
Snappy               | *.snappy | 000000_0      | 000000_1    | 000000_0
BZIP                 | *.bz2    | 000000_0      | 000000_1000 | 000000_0
LZO                  | *.lz4    | 000000_2      | 000000_0    | 000000_0

Data import time | TextFile | Sequence File | RCFile | ORC
GZip             | 81.329s  | 81.67s        | 136.6s | 76.6s
Snappy           | 226s     | 180s          | 79.8s  | 75s
BZIP             | 138.2s   | 134s          | 145.9s | 98.3s
LZO              | 231.8s   | 234s          | 86.1s  | 248.1s

Query speed (select count(1) from table_name):

        | TextFile | Sequence File | RCFile | ORC
GZip    | 46.2s    | 50.4s         | 44.3s  | 38.3s
Snappy  | 46.3s    | 54.3s         | 42.2s  | 40.3s
BZIP    | 114.3s   | 110.3s        | 40.3s  | 38.2s
LZO     | 60.3s    | 52.2s         | 42.2s  | 50.3s

Summary:

Compression time: Gzip …
Compression ratio: BZip > Gzip > Snappy > LZO, but with ORC storage the compressed size is the same for all codecs
Data statistics time: GZip …

Compression type recommendations:

1) BZip and Gzip both have good compression ratios but bring higher CPU overhead. If disk utilization and I/O are the main concerns, both compression algorithms are worth considering.

2) LZO and Snappy both decompress quickly; if compression and decompression speed matter most, either is a good choice. When the Hive table storage format is RCFile or ORC, Snappy and LZO have comparable decompression efficiency, but Snappy compresses better than LZO.

3) Hadoop splits large files into splits of HDFS block size, each of which is handled by one mapper. Among these compression algorithms, BZip2, LZO, and Snappy output can be split, while Gzip output cannot.
