Hadoop (1): Alibaba Cloud Hadoop Cluster Configuration

Cluster configuration

Three Alibaba Cloud ECS instances

Configuration steps

1. Preparations

1.1 Create /bigdata directory

mkdir /bigdata
cd /bigdata
mkdir app

1.2 Set the host names to node01, node02, and node03
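
For example, on a systemd-based image such as CentOS 7 (an assumption; the post does not name the OS), the host name can be set with hostnamectl:

hostnamectl set-hostname node01    # run on the first node; use node02 / node03 on the others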

1.3 Modify the hosts file

vim /etc/hosts

Add the intranet IP mappings for node01 through node03:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1       localhost localhost.localdomain localhost6 localhost6.localdomain6
172.16.237.91 node01
172.16.237.90 node02
172.16.221.55 node03

1.4 Install the JDK
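
The post does not detail the JDK installation. A minimal sketch, assuming a JDK tarball (the archive and directory names below are placeholders) and the /usr/local/jdk path that hadoop-env.sh points to later:

tar -zxvf jdk-8u171-linux-x64.tar.gz -C /usr/local    # placeholder archive name
ln -s /usr/local/jdk1.8.0_171 /usr/local/jdk          # so that JAVA_HOME=/usr/local/jdk resolves
echo 'export JAVA_HOME=/usr/local/jdk' >> /etc/profile
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> /etc/profile
source /etc/profile
java -version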

1.5 Configure SSH passwordless login
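
One common way to set this up (a sketch; the root user is an assumption consistent with the scp and fencing steps later in the post) is to generate a key pair on each node and copy the public key to all three nodes:

ssh-keygen -t rsa        # accept the defaults
ssh-copy-id root@node01
ssh-copy-id root@node02
ssh-copy-id root@node03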

1.6 Install ZooKeeper
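
The ZooKeeper setup is not detailed in the post. Below is a minimal sketch consistent with the node01:2181, node02:2181, node03:2181 quorum used in core-site.xml; the /usr/local/zookeeper path is an assumption.

# /usr/local/zookeeper/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/zookeeper/data
clientPort=2181
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888

Write the matching id into the myid file on each node, then start and check the service:

echo 1 > /usr/local/zookeeper/data/myid    # 2 on node02, 3 on node03
/usr/local/zookeeper/bin/zkServer.sh start
/usr/local/zookeeper/bin/zkServer.sh status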

2. Start configuration

2.1 Preparation before configuration

Upload the Hadoop installation package and extract it to /bigdata/app:

tar -zxvf hadoop-2.8.4.tar.gz -C /bigdata/app

Create a soft link

ln -s /bigdata/app/hadoop-2.8.4 /usr/local/hadoop

Add the Hadoop settings to the environment variables.
Note: the Hadoop configuration file path is /usr/local/hadoop/etc/hadoop

vim /etc/profile

Add the following content:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

Reload the environment variables so the configuration takes effect:


source /etc/profile
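
A quick optional check (not in the original post) that the variables took effect:

hadoop version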

2.2 Configure HDFS
2.2.1 Go to the Hadoop configuration directory
cd /usr/local/hadoop/etc/hadoop
2.2.2 Modify hadoop-env.sh
Set the JDK path:

export JAVA_HOME=/usr/local/jdk

2.2.3 Configure core-site.xml
2.2.4 Configure hdfs-site.xml

The full configuration files are listed in the Configuration files section below.

2.3 Configure YARN
2.3.1 Modify yarn-site.xml
2.3.2 Modify mapred-site.xml

The configuration files are listed in the Configuration files section below.
2.3.3 Create hdpdata folder under /usr/local/hadoop path

cd /usr/local/hadoop

mkdir hdpdata

2.4 Edit the slaves file under /usr/local/hadoop/etc/hadoop

This file sets the hosts on which DataNode and NodeManager are started.

Add the worker host names to the slaves file:

node02

node03

2.5 Copy the configured Hadoop to the other nodes
scp -r hadoop-2.8.4 root@node02:/bigdata/app
scp -r hadoop-2.8.4 root@node03:/bigdata/app

Perform the following three steps on each of the other nodes (node02 and node03)
Step 1: Use the root user to create a soft link
ln -s /bigdata/app/hadoop-2.8.4 /usr/local/hadoop
Step 2: Set environment variables

vim /etc/profile

Add content:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

Step 3: Reload the environment variables so the configuration takes effect

source /etc/profile

3. Cluster startup (note: follow this startup order strictly)

3.1 Start journalnode (start on node01, node02, and node03 respectively)

/usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode

Run the jps command to confirm that a JournalNode process is now running on node01, node02, and node03.

3.2 Format HDFS
Execute commands on node01:

hdfs namenode -format

After formatting succeeds, a dfs folder is generated under the path specified by hadoop.tmp.dir in core-site.xml. Copy this folder to the same path on node02:

scp -r hdpdata root@node02:/usr/local/hadoop
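
As an alternative to copying the metadata directory by hand, the standby NameNode can usually be initialized with the standard bootstrap command once the NameNode on node01 is up; this is not what the post does, just a common option:

hdfs namenode -bootstrapStandby    # run on node02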

3.3 Format ZKFC on node01

hdfs zkfc -formatZK

If it succeeds, the log contains the following line:
INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/ns in ZK

3.4 Start HDFS on node01

sbin/start-dfs.sh
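
After start-dfs.sh finishes, jps on each node gives a quick sanity check. With this post's configuration you would roughly expect the processes listed below (ZooKeeper's QuorumPeerMain also appears on every node):

jps
# node01: NameNode, JournalNode, DFSZKFailoverController
# node02: NameNode, DataNode, JournalNode, DFSZKFailoverController
# node03: DataNode, JournalNode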

3.5 Start YARN on node02

sbin/start-yarn.sh

Start a separate ResourceManager on node01 as a standby:

sbin/yarn-daemon.sh start resourcemanager

3.6 Start JobHistoryServer on node02

sbin/mr-jobhistory-daemon.sh start historyserver

After this, a JobHistoryServer process appears on node02.

3.7 Hadoop installation and startup complete
HDFS HTTP addresses
NameNode (active): http://node01:50070
NameNode (standby): http://node02:50070
ResourceManager HTTP address
ResourceManager: http://node02:8088
History log HTTP address
JobHistoryServer: http://node02:19888

4. Cluster verification

4.1 Verify that HDFS works and that HA failover functions

First upload a file to HDFS:

hadoop fs -put /usr/local/hadoop/README.txt /

Manually stop the NameNode on the active node:

sbin/hadoop-daemon.sh stop namenode

Check via HTTP port 50070 whether the standby NameNode has switched to active.
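
The NameNode state can also be queried from the command line, using the nn1/nn2 ids defined in hdfs-site.xml:

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
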
Manually start the namenode that was closed in the previous step

sbin/hadoop-daemon.sh start namenode

4.2 Verify ResourceManager HA
Manually stop the ResourceManager on node02:

sbin/yarn-daemon.sh stop resourcemanager

Access the ResourceManager on node01 via HTTP port 8088 to check its status.
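
The ResourceManager state can also be checked from the command line; the rm1/rm2 ids below assume the yarn-site.xml sketch in the Configuration files section:

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
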
Manually start the ResourceManager of node02

sbin/yarn-daemon.sh start resourcemanager

Startup script
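
The original post does not include the script itself. Below is only a minimal sketch that follows the startup order documented in section 3, assuming it runs as root on node01 with passwordless SSH to node02/node03 and the /usr/local/zookeeper path from the ZooKeeper sketch above:

#!/bin/bash
# Start ZooKeeper and the JournalNodes on all three nodes
for host in node01 node02 node03; do
  ssh $host "/usr/local/zookeeper/bin/zkServer.sh start"
  ssh $host "/usr/local/hadoop/sbin/hadoop-daemon.sh start journalnode"
done

# Start HDFS (NameNodes, DataNodes, ZKFC) from node01
/usr/local/hadoop/sbin/start-dfs.sh

# Start YARN on node02 and a standby ResourceManager on node01
ssh node02 "/usr/local/hadoop/sbin/start-yarn.sh"
/usr/local/hadoop/sbin/yarn-daemon.sh start resourcemanager

# Start the JobHistoryServer on node02
ssh node02 "/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver"

Note that on a first-time setup the formatting steps in sections 3.2 and 3.3 still have to be run by hand before this script is useful.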

Configuration files

core-site.xml

"1.0" encoding="UTF-8"?>

"text/xsl" href="configuration.xsl"?>







fs.defaultFS
hdfs://ns



hadoop.tmp.dir
/usr/local/hadoop/hdpdata/
Need to manually create hdpdata directory



ha.zookeeper.quorum
node01:2181,node02:2181,node03:< span style="color: #800080;">2181
zookeeper address, separated by commas

hdfs-site.xml

"1.0" span> encoding="UTF-8"?>

"text/xsl" href="configuration.xsl"?>







dfs.nameservices
ns
Specify the nameservice of hdfs as ns, which needs to be consistent with that in core-site.xml


dfs.ha.namenodes.ns
nn1,nn2
There are two NameNodes under the ns namespace, the logical code name, and the name is arbitrary, namely nn1, nn2


dfs.namenode.rpc-address.ns.nn1
node01:9000
RPC communication address of nn1


dfs.namenode.http-address.ns.nn1
node01:50070
http communication address of nn1


dfs.namenode.rpc-address.ns.nn2
node02:9000
RPC communication address of nn2


dfs.namenode.http-address.ns.nn2
node02:50070
http communication address of nn2



dfs.namenode.shared.edits.dir
qjournal://node01:8485;node02:8485;node03:8485 /ns


dfs.journalnode.edits.dir
/usr/local/hadoop/journaldata
Specify the location where JournalNode stores data on the local disk



dfs.ha.automatic-failover.enabled
true
Failed to enable NameNode to switch automatically


dfs.client.failover.proxy.provider.ns
org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
Automatically switch the implementation mode if the configuration fails, use the built-in zkfc


dfs.ha.fencing.methods

sshfence
shell(
/bin/true)

Configure the isolation mechanism, multiple mechanisms are separated by line breaks, execute sshfence first, and execute shell(/bin/true) after the execution fails, /bin/true will directly return 0 to indicate success


dfs.ha.fencing.ssh.private-key-files
/root/.ssh/id_rsa
Ssh free login is required when using sshfence isolation mechanism


dfs.ha.fencing.ssh.connect-timeout
30000
Configure the timeout period of sshfence isolation mechanism



dfs.replication
3
The default number of block copies is 3, and the test environment is set to 1. Note that the production environment must have more than 3 copies.



dfs.block.size
134217728
Set the block size to 128M





mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>Run MapReduce on the YARN framework</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node02:10020</value>
        <description>History server port</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node02:19888</value>
        <description>Web UI port of the history server</description>
    </property>
</configuration>
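
yarn-site.xml is referenced in section 2.3.1 but its contents are not included in the original post. The following is only a minimal sketch of a ResourceManager HA configuration consistent with the addresses used above (ResourceManager web UI on node02:8088, standby ResourceManager on node01, ZooKeeper on node01 to node03); the cluster-id value and the rm1/rm2 ids are arbitrary placeholders.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yarncluster</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>node02</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>node01</value>
    </property>
    <property>
        <name>yarn.resourcemanager.zk-address</name>
        <value>node01:2181,node02:2181,node03:2181</value>
    </property>
</configuration>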

Problems

The NameNode could not connect; checking the log revealed:

java.io.IOException: There appears to be a gap in the edit log. We expected txid 1, but got txid 2.

Repair the metadata from the Hadoop bin directory:

hadoop namenode -recover

Select y first and then c

Concept

A daemon is a process that runs in the background and is not controlled by a terminal (it takes no input from and sends no output to one). Most network services run as daemons. There are two main reasons a daemon detaches from its terminal: (1) the terminal used to start the daemon needs to be free for other work once the daemon is running, and if another user later logs in on that terminal, the daemon's error messages should not appear there; (2) signals generated by terminal keys (such as the interrupt signal) must not affect daemons previously started from that terminal. Note the difference between a daemon and a program that is merely run in the background (i.e., started with &).

Daemon and background program

(a) A daemon is completely detached from the terminal console; a background program is not, and keeps writing output to the terminal until the terminal is closed.
(b) A daemon is unaffected when the terminal console is closed; a background program stops when the user logs out, unless it is started in the "nohup command &" form (see the example after this list).
(c) A daemon has its own session group, current directory, and file descriptors; running in the background merely forks from the terminal and lets the program execute in the background, and none of these change.
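
For example, a job started as below (the script name is a placeholder) keeps running after the user logs out and its output goes to a file, although it is still not a true daemon:

nohup ./myjob.sh > myjob.log 2>&1 &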

Hadoop directory structure

1. Files in the $HADOOP_HOME/bin directory and their functions

hadoop: executes Hadoop script commands; it is invoked by hadoop-daemon.sh, can also be run directly, and is the core of all the commands.

2. Files in the $HADOOP_HOME/sbin directory and their functions

hadoop-daemon.sh: starts or stops a single daemon by invoking the hadoop command; it is called by all of the start-* and stop-* scripts in this directory.
hadoop-daemons.sh: also works by calling hadoop-daemon.sh, which in turn carries out its tasks by calling the hadoop command.
start-all.sh: starts everything; calls start-dfs.sh and start-mapred.sh.
start-dfs.sh: starts NameNode, DataNode and SecondaryNameNode.
start-mapred.sh: starts MapReduce.
stop-all.sh: stops everything; calls stop-dfs.sh and stop-mapred.sh.
stop-balancer.sh: stops the balancer.
stop-dfs.sh: stops NameNode, DataNode and SecondaryNameNode.
stop-mapred.sh: stops MapReduce.


3. Files in the $HADOOP_HOME/etc/hadoop directory and their functions

core-site.xml: the Hadoop core global configuration file; properties defined here can be referenced from other configuration files such as hdfs-site.xml and mapred-site.xml. Its template is $HADOOP_HOME/src/core/core-default.xml, which can be copied into the conf directory and then modified.
hadoop-env.sh: Hadoop environment variables.
hdfs-site.xml: the HDFS configuration file, whose properties inherit from core-site.xml. Its template is $HADOOP_HOME/src/hdfs/hdfs-default.xml, which can be copied into the conf directory and then modified.
mapred-site.xml: the MapReduce configuration file, whose properties inherit from core-site.xml. Its template is $HADOOP_HOME/src/mapred/mapred-default.xml, which can be copied into the conf directory and then modified.
slaves: lists the names or IPs of all slave nodes, one per line. If names are used, each slave name must have an IP mapping in /etc/hosts.

4. $HADOOP_HOME/lib directory

This directory stores the jar packages Hadoop depends on at runtime; Hadoop adds every jar under lib to the classpath when it runs.

5. $HADOOP_HOME/logs directory

This directory stores the Hadoop runtime logs; checking them is very helpful when tracking down Hadoop errors.

6. $HADOOP_HOME/include directory

The externally provided programming header files (the corresponding dynamic and static libraries are under lib). The headers are defined in C++ and are typically used by C++ programs that access HDFS or write MapReduce programs.

7. $HADOOP_HOME/libexec directory

The directory containing the shell configuration files used by each service, where basics such as log output and startup parameters (for example JVM options) can be configured.

8. $HADOOP_HOME/share directory

The directory containing the jar packages built from each Hadoop module.
