9, Hadoop-HDFS Overview

1. Background and definition of HDFS

Background

As the amount of data grows, a single operating system can no longer hold all of it, so the data has to be spread across disks managed by more operating systems.

That arrangement is inconvenient to manage and maintain. A system is urgently needed to manage files across multiple machines: this is the distributed file management system.

HDFS is just one kind of distributed file management system.

HDFS definition

HDFS (Hadoop Distributed File System) is a file system used to store files, and the files are located through a directory tree. It is also distributed: the servers in the cluster each have their own role.

HDFS usage scenarios:

HDFS suits scenarios where a file is written once and read many times, and it does not support file modification. It is suitable for data analysis but not for network-drive (web disk) applications.

2. Advantages and disadvantages

Advantages

1. High fault tolerance

2. Suitable for processing big data

Data scale: can handle data at the GB, TB, or even PB level
File scale: can handle more than a million files, which is a considerable number

3. It can be built on cheap machines, with reliability improved through a multi-replica mechanism

Disadvantages

1. Not suitable for low-latency data access; millisecond-level data access, for example, is not achievable.

2. Cannot store a large number of small files efficiently

Storing a large number of small files takes up a lot of NameNode memory for file directory and block information. This is undesirable because NameNode memory is limited.

For small files, the addressing time exceeds the read time, which violates the design goal of HDFS.
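To make the memory cost concrete, here is a rough estimate rather than a measurement. It assumes about 150 bytes of NameNode memory per namespace object (one per file entry and one per block), a commonly quoted rule of thumb that does not come from the text above. The function name and the file sizes are illustrative.

```python
# Rough NameNode memory estimate for the same data stored as small or large files.
# ASSUMPTION: ~150 bytes of NameNode memory per namespace object (one file entry
# plus one entry per block). This is a rule of thumb, not a figure from the text.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size


def namenode_bytes(total_bytes: int, file_size: int) -> int:
    """Estimate NameNode memory needed to track total_bytes stored as files of file_size."""
    num_files = -(-total_bytes // file_size)        # ceiling division
    blocks_per_file = -(-file_size // BLOCK_SIZE)   # every file uses at least one block
    objects = num_files * (1 + blocks_per_file)     # 1 file entry + its blocks
    return objects * BYTES_PER_OBJECT


one_tb = 1024 ** 4
small = namenode_bytes(one_tb, 1024 ** 2)  # 1 TB stored as 1 MB files
large = namenode_bytes(one_tb, 1024 ** 3)  # 1 TB stored as 1 GB files
print(f"1 MB files: {small / 1024**2:.0f} MiB of NameNode memory")
print(f"1 GB files: {large / 1024**2:.2f} MiB of NameNode memory")
```

Under these assumptions the same terabyte costs the NameNode a couple of hundred times more memory when it is stored as 1 MB files instead of 1 GB files.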

3. Does not support concurrent writes or random file modification
A file can have only one writer at a time; multiple threads are not allowed to write to it simultaneously
Only appending data (append) is supported; random modification of a file is not

3. HDFS composition structure

1. NameNode (nn): the Master, acting as supervisor and manager

(1) Manages the HDFS namespace
(2) Configures the replica policy
(3) Manages data block (Block) mapping information
(4) Handles client read and write requests

2. DataNode: the Slave. The NameNode issues commands and the DataNode performs the actual operations
(1) Stores the actual data blocks
(2) Executes read and write operations on data blocks

3. Client: the client
(1) Splits files.
When a file is uploaded to HDFS, the Client divides it into blocks and then uploads them
(2) Interacts with the NameNode to obtain the location information of a file
(3) Interacts with the DataNode to read or write data
(4) Provides commands for managing HDFS, such as formatting the NameNode
(5) Provides commands for accessing HDFS, such as adding, deleting, and modifying files
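The file-splitting step in (1) can be sketched as follows. This is only an illustration of cutting a file into fixed-size blocks, not the real HDFS client code. The function name and the 300 MB file are hypothetical, and the 128 MB block size is the Hadoop 2.x default.

```python
# Illustrative only: compute the block boundaries a client would use when
# cutting a file into fixed-size blocks. This is not the real HDFS client code.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # the last block may be shorter
        blocks.append((offset, length))
        offset += length
    return blocks


blocks = split_into_blocks(300 * 1024 * 1024)  # a hypothetical 300 MB file
for i, (offset, length) in enumerate(blocks):
    print(f"block {i}: offset={offset}, length={length // 1024**2} MB")
```

A 300 MB file yields two full 128 MB blocks and a final 44 MB block; the last block only occupies as much space as the data it holds.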

4. SecondaryNameNode: not a hot standby for the NameNode. When the NameNode goes down,
it cannot immediately take over and provide services
(1) Assists the NameNode and shares its workload, for example by periodically merging Fsimage and Edits and pushing the result to the NameNode
(2) In an emergency, it can assist in recovering the NameNode
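The merge in (1) can be pictured as replaying a log of changes (Edits) on top of a snapshot of the namespace (Fsimage). The toy model below only conveys that idea. The data structures, operation names, and paths are hypothetical, and the real Fsimage and Edits files use Hadoop's own formats.

```python
# Toy model of a checkpoint: replay the edit log on top of the namespace image.
# The set/list structures and the "create"/"delete" operation names are
# hypothetical, chosen only to illustrate the idea.

def merge_checkpoint(fsimage: set, edits: list) -> set:
    """Return a new image with every logged operation applied in order."""
    image = set(fsimage)  # copy, so the old image stays untouched
    for op, path in edits:
        if op == "create":
            image.add(path)
        elif op == "delete":
            image.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return image


fsimage = {"/data/a.log", "/data/b.log"}
edits = [("create", "/data/c.log"), ("delete", "/data/a.log")]

new_fsimage = merge_checkpoint(fsimage, edits)
print(sorted(new_fsimage))
```

Once the merged image exists, the edits that were applied no longer need to be replayed, which keeps the log short.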

4. HDFS file block size

Files in HDFS are physically stored in blocks (Block)
The block size can be specified with the configuration parameter dfs.blocksize
The default is 128 MB in Hadoop 2.x and 64 MB in older versions

Why can't the block size be set too small or too large?
(1) If the HDFS block size is too small, addressing time increases, and the program spends its time looking for the start of each block.
(2) If the block size is too large, the time to transfer the data from disk becomes significantly greater than the time needed to locate the start of the block,
which makes the program very slow when processing that block of data

Summary:
The HDFS block size setting mainly depends on the transfer rate of the disk
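A back-of-the-envelope calculation shows how the disk transfer rate leads to a block size. The numbers are assumptions, not figures from the text above: a seek time of about 10 ms, a target of keeping seek time near 1% of transfer time, and a disk transfer rate of about 100 MB/s. The function name is illustrative.

```python
# Back-of-the-envelope block size estimate.
# ASSUMPTIONS (not from the text above): ~10 ms seek time, seek time kept at
# about 1% of transfer time, and a disk transfer rate of ~100 MB/s.
import math

seek_time_s = 0.010
seek_fraction = 0.01
transfer_rate_mb_s = 100


def suggested_block_mb(seek_s: float, fraction: float, rate_mb_s: float) -> int:
    """Block size in MB, rounded to the nearest power of two on a log scale."""
    transfer_time_s = seek_s / fraction    # time that should be spent transferring one block
    raw_mb = transfer_time_s * rate_mb_s   # amount of data moved in that time
    return 2 ** round(math.log2(raw_mb))


print(suggested_block_mb(seek_time_s, seek_fraction, transfer_rate_mb_s))
```

With these assumed numbers the raw result is about 100 MB, which rounds to 128 MB. A disk twice as fast would suggest 256 MB under the same assumptions.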
