HDFS advantages and disadvantages

Brief description of HDFS architecture

1. Introduction to HDFS

HDFS (Hadoop Distributed File System) is the distributed file system of Hadoop. It was designed around streaming access to large files, and it can run on inexpensive commodity servers. Its fault tolerance, reliability, scalability, availability, and high throughput provide storage for massive data sets that can survive hardware failures, which makes applications on very large data sets much easier to build. Put simply, a very large file is split into blocks of a fixed size and the blocks are placed on multiple servers, so that many servers work on the file at the same time, throughput is high, and the data is safer because it does not depend on a single machine.

2. HDFS components and the functions of each part

HDFS is mainly composed of four parts: Client, NameNode, DataNode, and Secondary NameNode.

2.1 Client

  • File splitting: when a file is uploaded to HDFS, the client splits it into data blocks of the configured block size and then stores them.
  • Each data block has copies on other servers; the client communicates with the NameNode to obtain the locations of a file's blocks and their copies.
  • Communicates with the DataNodes to read or write data.
  • Can manage HDFS, for example starting it up or shutting it down.
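
The file splitting step can be sketched as follows. This is a minimal illustration, not the HDFS client code: the function name `split_into_blocks` is hypothetical, and the 128 MiB block size is an assumption (the real value is a cluster configuration setting).

```python
from typing import List, Tuple

# Assumed block size for this sketch (128 MiB); configurable in a real cluster.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

Under these assumptions, a 300 MiB file yields three blocks: two full 128 MiB blocks and a final 44 MiB block.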

2.2 NameNode (manager)
The NameNode is the manager of HDFS, comparable to a king in the real world. Both the client and the Secondary NameNode communicate with it, and it issues the instructions that the DataNodes carry out. Its main functions are:

  • Manage the HDFS namespace
  • Manage the data block mapping information
  • Configure the replica (copy) strategy
  • Handle client requests
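
The namespace and the block mapping listed above can be pictured as two in-memory tables. The sketch below is a toy model only; the class `ToyNameNode`, its method names, and the block and node identifiers are hypothetical and do not correspond to the Hadoop API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyNameNode:
    # Namespace: file path -> ordered list of block IDs.
    namespace: Dict[str, List[str]] = field(default_factory=dict)
    # Block map: block ID -> DataNodes that hold a copy of the block.
    block_locations: Dict[str, List[str]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: Dict[str, List[str]]) -> None:
        """Register a file together with the DataNodes holding each block."""
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = list(blocks)
        for block_id, nodes in blocks.items():
            self.block_locations[block_id] = list(nodes)

    def get_block_locations(self, path: str) -> List[Tuple[str, List[str]]]:
        """Answer a client request: which blocks form the file, and where are they?"""
        return [(b, self.block_locations[b]) for b in self.namespace[path]]
```

A client that wants to read a file would first ask for its block locations and then contact the listed DataNodes directly, which is why the NameNode only needs to hold metadata and never the file contents.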

2.3 DataNode (executor)
The Hadoop documentation describes the DataNode as a slave node, which makes its status clear: it carries out the instructions of the NameNode. It is mainly used for storing data blocks and for reading and writing them:

  • Store the actual data blocks
  • Perform data block read and write operations

2.4 Secondary NameNode
The Secondary NameNode is the loyal helper of the NameNode: it takes some work off the NameNode and helps to bring it back when the NameNode fails. It is a helper, not a hot standby. Its main functions are as follows:

  • Periodically merge the fsimage (image file) and fsedits (edit log) of the NameNode and push the merged image back to the NameNode, so that the fsedits on the NameNode does not grow too large.
  • In an emergency, it can assist in restoring the NameNode.
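
The periodic merge of fsimage and fsedits amounts to replaying a log of changes on top of a snapshot. The sketch below assumes a simplified model in which the image is a dictionary of path to file size and each log entry is a create or delete operation; the function name `checkpoint` and the entry format are hypothetical, and the real on-disk formats are far richer.

```python
from typing import Dict, List, Tuple

# An edit is (operation, path, size); only "create" and "delete" are modelled here.
Edit = Tuple[str, str, int]


def checkpoint(fsimage: Dict[str, int], edits: List[Edit]) -> Dict[str, int]:
    """Replay the edit log on a copy of the image and return the merged image."""
    merged = dict(fsimage)  # leave the old image untouched
    for op, path, size in edits:
        if op == "create":
            merged[path] = size
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged
```

Once the merged image has been handed back, the edits that were replayed are no longer needed, which is what keeps the log from growing without bound.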

3. Replica (copy) placement strategy

First copy: placed on the DataNode from which the file is uploaded; if the file is submitted from outside the cluster, a node whose disk is not too full and whose CPU is not too busy is selected at random.
Second copy: placed on a node in a different rack from the first copy.
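
The two placement rules above can be sketched as a small selection function. This is an illustration under stated assumptions, not the Hadoop placement code: the function name `choose_replica_nodes` and the rack and node names are hypothetical, and the disk and CPU checks are left out, so "random" here means any node.

```python
import random
from typing import Dict, List, Optional


def choose_replica_nodes(
    racks: Dict[str, List[str]],
    writer: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick the nodes for the first two copies of a block.

    racks maps a rack name to the DataNodes in it. writer is the DataNode the
    client runs on, or None when the upload comes from outside the cluster.
    """
    rng = rng or random.Random()
    node_to_rack = {node: rack for rack, nodes in racks.items() for node in nodes}

    # First copy: the uploading node itself, otherwise a randomly chosen node.
    # (The disk-space and CPU-load checks are omitted in this sketch.)
    if writer in node_to_rack:
        first = writer
    else:
        first = rng.choice(sorted(node_to_rack))

    # Second copy: a node on a different rack from the first copy.
    other_rack_nodes = sorted(
        n for n, r in node_to_rack.items() if r != node_to_rack[first]
    )
    if not other_rack_nodes:
        raise ValueError("need at least two racks to place the second copy")
    second = rng.choice(other_rack_nodes)
    return [first, second]
```

Putting the second copy on another rack means that the loss of a whole rack, for example through a switch failure, still leaves one copy reachable.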


The above is reproduced from https://blog.csdn.net/a15732111571/article/details/89570865


Advantages:

Data redundancy, hardware fault tolerance

Suitable for storing large files

Suitable for streaming data access

Can be built on cheap machines

Disadvantages:

Not suitable for low-latency data access

Not suitable for storing small files. Reason: every stored file has its own metadata entry, so many files mean many metadata entries, which consumes NameNode memory. Solution: Hadoop Archive packs small files into xxx.har, so that Hadoop keeps metadata for the archive only and the name of each file inside it is managed by the har.
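
A rough calculation shows why the metadata of small files adds up. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object; that figure is an assumption for illustration, not a measured value, and the function name `namenode_metadata_bytes` is hypothetical.

```python
# Assumed rule-of-thumb cost per metadata object; not a measured value.
BYTES_PER_OBJECT = 150


def namenode_metadata_bytes(num_files: int, blocks_per_file: int = 1,
                            bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Estimate metadata size: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# Ten million small files, each occupying a single block.
print(namenode_metadata_bytes(10_000_000))  # 3000000000 bytes, about 3 GB
```

Under this assumption the metadata cost depends on the number of files rather than on how much data they contain, which is why packing many small files into one archive helps.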
