Background:
1) Four tables with different storage formats were created (a sketch of the assumed DDL follows this list)
2) The data in the hxh2, hxh3, and hxh4 tables was cleared, keeping the data in hxh1; the hxh1 table holds 74.1 GB of data
3) A fifth table, hxh5, was created at the same time; both hxh5 and hxh1 use the TEXTFILE storage format
4) Original data size: 74.1 GB
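The original notes do not include the table DDL. The sketch below shows one plausible way the tables could have been declared, assuming the seven columns selected in the later INSERT statements are all strings and that every table is partitioned by createdate; only the STORED AS clause would differ between them.

-- Hypothetical DDL; column types are assumed rather than taken from the original test
CREATE TABLE hxh1 (
  pvid STRING, sgid STRING, fr STRING, ffr STRING,
  `mod` STRING, version STRING, vendor STRING
)
PARTITIONED BY (createdate STRING)
STORED AS TEXTFILE;
-- hxh2, hxh3, hxh4 and hxh5 would repeat the same column list with
-- STORED AS SEQUENCEFILE, RCFILE, ORC and TEXTFILE respectively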
Start the test:
1, TextFile test
- The default format for Hive tables; storage layout: row-oriented.
- Gzip compression can be used, but the resulting compressed files cannot be split.
- During deserialization, every character must be checked to decide whether it is a field delimiter or a line terminator, so the deserialization overhead is dozens of times higher than for SequenceFile.
Enable compression:
- set hive.exec.compress.output=true; -- enable compressed query output
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; -- set the output compression codec to Gzip
- set mapred.output.compress=true; -- compress the MapReduce job output
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec; -- register the Gzip codec
Insert data into the hxh5 table:
insert into table hxh5 partition(createdate='2019-07-21') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
After compression, the hxh5 table data size is 23.8 G; the insert took 81.329 seconds.
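The 23.8 G figure was presumably measured on the table's directory in HDFS. One way to take that kind of measurement from the Hive CLI is shown below; the warehouse path is an assumption and depends on the actual metastore configuration.

-- Hypothetical size check on the partition written above (adjust the path to the real warehouse location)
dfs -du -s -h /user/hive/warehouse/hxh5/createdate=2019-07-21;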
2, SequenceFile test
- Compressing data files saves disk space, but native Hadoop compressed files have the drawback that they do not support splitting. Splittable files can be processed in parallel by multiple mapper programs, whereas most compressed files cannot, because they can only be read from the beginning. SequenceFile is a splittable file format that supports Hadoop's block-level compression.
- A binary file format provided by the Hadoop API, in which data is serialized to the file as key-value pairs. Storage layout: row-oriented.
- SequenceFile supports three compression options: NONE, RECORD, and BLOCK. RECORD compression has a low compression ratio and is the default; BLOCK usually compresses better than RECORD.
- An advantage is that SequenceFile is compatible with MapFile in the Hadoop API.
Enable compression:
- set hive.exec.compress.output=true; -- enable compressed query output
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; -- set the output compression codec to Gzip
- set mapred.output.compression.type=BLOCK; -- set the SequenceFile compression type to BLOCK
- set mapred.output.compress=true;
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Insert data into the hxh2 table:
insert into table hxh2 partition(createdate='2019-07-21') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
The compressed hxh2 table data size: 80.8 G, taking 186.495 seconds // compression type not set, so the default RECORD is used
The compressed hxh2 table data size: 25.2 G, taking 81.67 seconds // compression type set to BLOCK
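The two timings above differ only in whether mapred.output.compression.type was left at its RECORD default or explicitly set to BLOCK. On Hadoop 2.x the mapred.* names used here are deprecated aliases; the sketch below shows the same session with the newer mapreduce.* property names, which should behave identically (the old names are still honoured).

-- Same SequenceFile + Gzip setup using the non-deprecated Hadoop 2.x property names
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
-- BLOCK compresses batches of records together; RECORD (the default) compresses each record on its own
set mapreduce.output.fileoutputformat.compress.type=BLOCK;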
3, RCFile test
Storage layout: data is divided into row groups, and within each group it is stored by column, combining the advantages of row storage and column storage:
- First, RCFile guarantees that all the data of a single row is located on the same node, so the cost of reconstructing a row (tuple) is very low.
- Second, like a column store, RCFile can apply column-wise data compression and can skip reading columns that are not needed.
- Data append: RCFile does not support arbitrary data writes; it only provides an append interface, because the underlying HDFS currently only supports appending data to the end of a file.
- Row group size: a larger row group helps improve data compression efficiency, but it can hurt read performance because it increases the cost of lazy decompression, and larger row groups occupy more memory, which affects other concurrently running MR jobs. Weighing storage space against query efficiency, Facebook chose 4 MB as the default row group size, while still allowing users to configure it (see the sketch after this list).
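In Hive, the RCFile row-group buffer mentioned above is exposed as a configuration property; the sketch below assumes the hive.io.rcfile.record.buffer.size setting (value in bytes, 4 MB by default) and shows how it could be raised before writing.

-- Hypothetical tuning: grow the RCFile row-group buffer from the 4 MB default to 8 MB before inserting
set hive.io.rcfile.record.buffer.size=8388608;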
Enable compression:
- set hive.exec.compress.output=true;
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
- set mapred.output.compress=true;
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Insert data into the hxh3 table:
insert into table hxh3 partition(createdate='2019-07-01') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
The compressed hxh3 table data size: 22.5 G, taking 136.659 seconds
4, ORC test
Storage layout: data is divided into row groups, and within each group it is stored by column.
Compresses quickly and provides fast column access; ORC is more efficient than RCFile and is an improved version of it.
Enable compression:
- set hive.exec.compress.output=true;
- set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
- set mapred.output.compress=true;
- set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
Insert data into the hxh4 table:
insert into table hxh4 partition(createdate='2019-07-01') select pvid,sgid,fr,ffr,mod,version,vendor from hxh1;
The compressed hxh4 table data size: 21.9 G, taking 76.602 seconds
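Note that ORC applies its own internal compression, controlled by the orc.compress table property (ZLIB by default), which is likely why the ORC sizes in the comparison below stay at 21.9 G no matter which output codec is configured. A sketch of making that property explicit, using a hypothetical table name:

-- Hypothetical ORC table with its built-in compression set explicitly; valid values include NONE, ZLIB and SNAPPY
CREATE TABLE hxh4_snappy (
  pvid STRING, sgid STRING, fr STRING, ffr STRING,
  `mod` STRING, version STRING, vendor STRING
)
PARTITIONED BY (createdate STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");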
5, What is splittable
When deciding how to compress data that will be processed by MapReduce, it is important to consider whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file is stored as 16 blocks, and a MapReduce job using this file as input creates 16 input splits, each processed separately as the input of an independent map task.
Now suppose the file is gzip-compressed and its compressed size is 1 GB. As before, HDFS stores the file as 16 blocks. However, creating a split per block is useless, because it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read the data of its block independently of the other blocks. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each compressed block is not marked in any way that would let a reader positioned at an arbitrary point in the stream advance to the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce does not split the gzip file, because it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. A single map task therefore processes all 16 HDFS blocks, most of which are not local to that map task. At the same time, because there are few map tasks, the job is split at too coarse a granularity and takes longer to run.
6, Compression mode description
1. Evaluation of compression mode
The following three criteria can be used to evaluate compression methods:
- Compression ratio: The higher the compression ratio, the smaller the file after compression, so the higher the compression ratio, the better.
- Compression time: the faster the better.
- Whether files in the compressed format can be split: a splittable format allows a single file to be processed by multiple mapper programs, giving better parallelism.
2. Comparison of compression modes
- BZip2 has the highest compression ratio but also brings higher CPU overhead; Gzip comes next. If disk utilization and I/O are the main considerations, these two algorithms are attractive choices.
- LZO and Snappy decompress quickly; if compression and decompression speed matter most, both are good choices. LZO and Snappy compress data at roughly the same speed, but Snappy decompresses faster than LZO (a sketch of using them for intermediate output follows this list).
- Hadoop splits large files into splits the size of an HDFS block (64 MB by default), each of which is handled by one mapper program. Among these compression algorithms, BZip2, LZO, and Snappy are splittable, while Gzip does not support splitting.
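Because LZO and Snappy trade compression ratio for speed, a common pattern is to use one of them for intermediate (map output) data while keeping a higher-ratio codec for the final output. A sketch of that split, using the same Hadoop 1.x-style property names as the rest of this test (their availability depends on the cluster version):

-- Compress intermediate map output with Snappy for speed
set hive.exec.compress.intermediate=true;
set mapred.compress.map.output=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Keep a higher-ratio codec such as Gzip for the final job output
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;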
7, Common compression format
| Compression format | Compressed size | Compression speed | Splittable |
| --- | --- | --- | --- |
| GZIP | medium | medium | No |
| BZIP2 | small | slow | Yes |
| LZO | large | fast | Yes |
| Snappy | large | fast | Yes |

Note: "splittable" here means: after local files are compressed with a given algorithm and uploaded to HDFS, and MapReduce is then run over them, whether the compressed files can be split in the mapper stage and whether that splitting works correctly.
Hadoop encoding/decoding method, as shown in the table below
| Compression format | Corresponding encoder/decoder |
| --- | --- |
| DEFAULT | org.apache.hadoop.io.compress.DefaultCodec |
| Gzip | org.apache.hadoop.io.compress.GzipCodec |
| Bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| DEFLATE | org.apache.hadoop.io.compress.DeflateCodec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec (for intermediate output) |
| LZO | org.apache.hadoop.io.compress.Lz4Codec (for intermediate output) |
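To see which codecs a session actually has registered, the property can simply be queried without a value; io.compression.codecs also accepts a comma-separated list when several codecs need to be available at once.

-- Print the codec list currently in effect for this session
set io.compression.codecs;
-- Register several codecs at once (comma-separated class names)
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec;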
8, comparison results
Size of the table directories after compression (74.1 G before compression):
|  | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | 23.8 G | 25.2 G | 22.5 G | 21.9 G |
| Snappy | 39.5 G | 41.3 G | 39.1 G | 21.9 G |
| BZIP | 17.9 G | 18.9 G | 18.7 G | 21.9 G |
| LZO | 39.9 G | 41.8 G | 40.8 G | 21.9 G |
Compressed file name
|  | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | *.gz | 000000_0 | 000000_1000 | 000000_0 |
| Snappy | *.snappy | 000000_0 | 000000_1 | 000000_0 |
| BZIP | *.bz2 | 000000_0 | 000000_1000 | 000000_0 |
| LZO | *.lz4 | 000000_2 | 000000_0 | 000000_0 |
Data import time
|  | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | 81.329s | 81.67s | 136.6s | 76.6s |
| Snappy | 226s | 180s | 79.8s | 75s |
| BZIP | 138.2s | 134s | 145.9s | 98.3s |
| LZO | 231.8s | 234s | 86.1s | 248.1s |
Query speed
select count(1) from table_name
|  | TextFile | Sequence File | RCFile | ORC |
| --- | --- | --- | --- | --- |
| GZip | 46.2s | 50.4s | 44.3s | 38.3s |
| Snappy | 46.3s | 54.3s | 42.2s | 40.3s |
| BZIP | 114.3s | 110.3s | 40.3s | 38.2s |
| LZO | 60.3s | 52.2s | 42.2s | 50.3s |
Summary:
- Compression ratio: BZip > Gzip > Snappy > LZO, but for the ORC storage type the compressed size is the same regardless of codec
- Data import time: Gzip < BZip < Snappy < LZO
- Data statistics time: GZip < Snappy < LZO < BZip
Compression type recommendations:
1) BZip and Gzip both have good compression ratios but bring higher CPU overhead; if disk utilization and I/O are the main concerns, these two algorithms can be considered.
2) LZO and Snappy decompress quickly; if compression and decompression speed matter most, they are both good choices. When the Hive table storage type is RCFile or ORC, Snappy and LZO have comparable decompression efficiency, but Snappy compresses better than LZO.
3) Hadoop splits large files into splits the size of an HDFS block, each handled by one mapper program. Among these compression algorithms, BZip2, LZO, and Snappy are splittable, while Gzip does not support splitting.