18.8 Test Heartbeat
A good way to determine whether an HA cluster is working properly is to test it in a simulated environment. Before putting the Heartbeat high-availability cluster into production, you should run the following five tests to confirm that HA behaves as expected.
1. Normally shut down and restart Heartbeat on the master node
First execute "service heartbeat stop" on the master node node1 to shut down the Heartbeat process normally. Then use the ifconfig command to look at the network interfaces on the master node. Under normal circumstances, you should see that the master node has released the cluster service IP address and has also released the mounted shared disk partition. Now check the backup node: it has taken over the cluster service IP and has automatically mounted the shared disk partition.
During this process, use the ping command to test the cluster service IP. You will see that the cluster IP remains reachable the whole time, with no delay or interruption. In other words, when the master node is shut down normally, the switchover between the active and standby nodes is seamless, and the services provided by HA continue without interruption.
Then start Heartbeat on the master node normally. After Heartbeat starts, the backup node automatically releases the cluster service IP and unmounts the shared disk partition, while the master node takes over the cluster service IP and mounts the shared disk partition again. In fact, the backup node releases the resources at the same time as the master node binds them, so this process is also a seamless switchover.
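The whole test can be driven with a handful of commands. The following is only a minimal sketch, assuming the cluster service IP 192.168.60.200 and the shared partition /dev/sdb5 mounted on /webdata that appear in the logs later in this section; adjust the values to your own resource configuration.

# On the master node node1: stop Heartbeat normally
service heartbeat stop
ifconfig                      # the alias eth0:0 carrying 192.168.60.200 should be gone
df -h | grep webdata          # the shared partition should no longer be mounted

# On the backup node node2: confirm that the resources have been taken over
ifconfig                      # eth0:0 should now carry 192.168.60.200
df -h | grep webdata          # /dev/sdb5 should now be mounted on /webdata

# From a third machine: the cluster IP should stay reachable throughout
ping 192.168.60.200

# Finally, start Heartbeat on node1 again and watch the resources move back
service heartbeat start
tail -f /var/log/messages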
2. Unplug the network cable on the master node
Unplug the network cable that connects the master node to the public network. ipfail detects the loss of connectivity, and the cluster resources fail over to the backup node. In the same way, when the master node's network returns to normal, the cluster resources automatically switch back from the standby node to the master node, because the "auto_failback on" option is set.
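This behaviour is governed by the relevant entries in /etc/ha.d/ha.cf. The excerpt below is only a sketch: it assumes the gateway 192.168.60.1 is used as the ping node (it is the ping node reported dead in the log below), and the path to the ipfail program may differ on your system.

# /etc/ha.d/ha.cf (excerpt)
auto_failback on                              # resources return to the master node once it recovers
ping 192.168.60.1                             # ping node used to judge external network connectivity
respawn hacluster /usr/lib/heartbeat/ipfail   # ipfail triggers a failover when the ping node is unreachable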
After the network cable on the master node is unplugged, the master node records log messages like the following; note in particular the lines showing that the ping node 192.168.60.1 has gone dead and that node1 gives up its resources:
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link node2:eth0 dead.
Nov 26 09:04:09 node1 heartbeat: [3689]: info: Link 192.168.60.1:192.168.60.1 dead.
Nov 26 09:04:09 node1 ipfail: [3712]: info: Status update: Node 192.168.60.1 now has status dead
Nov 26 09:04:09 node1 harc[4279]: info: Running /etc/ha.d/rc.d/status status
Nov 26 09:04:10 node1 ipfail: [3712]: info: NS: We are dead. :<
Nov 26 09:04:10 node1 ipfail: [3712]: info: Link Status update: Link node2/eth0 now has status dead
...... (the middle part is omitted) ......
Nov 26 09:04:20 node1 heartbeat: [3689]: info: node1 wants to go standby [all]
Nov 26 09:04:20 node1 heartbeat: [3689]: info: standby: node2 can take our all resources
Nov 26 09:04:20 node1 heartbeat: [4295]: info: give up all HA resources (standby).
Nov 26 09:04:21 node1 ResourceManager[4305]: info: Releasing resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 09:04:21 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 stop
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Running stop for /dev/sdb5 on /webdata
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: Trying to unmount /webdata
Nov 26 09:04:21 node1 Filesystem[4343]: INFO: unmounted /webdata successful
Nov 26 09:04:21 node1 Filesystem[4340]: INFO: Success
Nov 26 09:04:22 node1 ResourceManager[4305]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 stop
Nov 26 09:04:22 node1 IPaddr[4428]: INFO: /sbin/ifconfig eth0:0 192.168.60.200 down
Nov 26 09:04:22 node1 avahi-daemon[1854]: Withdrawing address record for 192.168.60.200 on eth0.
Nov 26 09:04:22 node1 IPaddr[4407]: INFO: Success
The log on the standby node as it takes over the master node's resources is as follows:
Nov 26 09:02:58 node2 heartbeat: [2110]: info: Link node1:eth0 dead.
Nov 26 09:02:58 node2 ipfail: [2134]: info: Link Status update: Link node1/eth0 now has status dead
Nov 26 09:02:59 node2 ipfail: [2134]: info: Asking other side for ping node count.
Nov 26 09:02:59 node2 ipfail: [2134]: info: Checking remote count of ping nodes.
Nov 26 09:03:02 node2 ipfail: [2134]: info: Telling other node that we have more visible ping nodes.
Nov 26 09:03:09 node2 heartbeat: [2110]: info: node1 wants to go standby [all]
Nov 26 09:03:10 node2 heartbeat: [2110]: info: standby: acquire [all] resources from node1
Nov 26 09:03:10 node2 heartbeat: [2281]: info: acquire all HA resources (standby).
Nov 26 09:03:10 node2 ResourceManager[2291]: info: Acquiring resource group: node1 192.168.60.200/24/eth0 Filesystem::/dev/sdb5::/webdata::ext3
Nov 26 09:03:10 node2 IPaddr[2315]: INFO: Resource is stopped
Nov 26 09:03:11 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 09:03:11 node2 IPaddr[2393]: INFO: Using calculated netmask for 192.168.60.200: 255.255.255.0
Nov 26 09:03:11 node2 IPaddr[2393]: DEBUG: Using calculated broadcast for 192.168.60.200: 192.168.60.255
Nov 26 09:03:11 node2 IPaddr[2393]: INFO: eval /sbin/ifconfig eth0:0 192.168.60.200 netmask 255.255.255.0 broadcast 192.168.60.255
Nov 26 09:03:12 node2 avahi-daemon[1844]: Registering new address record for 192.168.60.200 on eth0.
Nov 26 09:03:12 node2 IPaddr[2393]: DEBUG: Sending Gratuitous Arp for 192.168.60.200 on eth0:0 [eth0]
Nov 26 09:03:12 node2 IPaddr[2372]: INFO: Success
Nov 26 09:03:12 node2 Filesystem[2482]: INFO: Resource is stopped
Nov 26 09:03:12 node2 ResourceManager[2291]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
Nov 26 09:03:13 node2 Filesystem[2523]: INFO: Running start for /dev/sdb5 on /webdata
Nov 26 09:03:13 node2 kernel: kjournald starting. Commit interval 5 seconds
Nov 26 09:03:13 node2 kernel: EXT3 FS on sdb5, internal journal
Nov 26 09:03:13 node2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Nov 26 09:03:13 node2 Filesystem[2520]: INFO: Success
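Once the takeover is complete, it can also be verified directly on node2. A quick check, again assuming the resource values shown in the log:

# On the backup node node2, after node1's network cable is unplugged
ifconfig eth0:0               # should show the cluster IP 192.168.60.200
mount | grep webdata          # /dev/sdb5 should be mounted on /webdata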
3. Unplug the power cord of the master node
After the power cord of the master node is pulled out, the Heartbeat process on the standby node will immediately receive the message that the master node has gone down. If a Stonith device is configured in the cluster, the standby node will power off or reset the master node through the Stonith device, and only when the Stonith operation has completed will the backup node acquire ownership of the master node's resources and take them over.
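Our test cluster does not actually contain a Stonith device, but if one were present it would be declared in /etc/ha.d/ha.cf. The sketch below is illustrative only; the device type, address and credentials are the placeholders from the sample ha.cf, not values from this cluster.

# List the Stonith device types supported on this system
stonith -L

# /etc/ha.d/ha.cf (excerpt): reset any node through a Baytech power switch
stonith_host * baytech 10.0.0.3 mylogin mysecretpassword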
After the master node loses power, the backup node produces log output similar to the following:
Nov 26 09:24:54 node2 heartbeat: [2110]: info: Received shutdown notice from 'node1'.
Nov 26 09:24:54 node2 heartbeat: [2110]: info: Resources being acquired from node1.
Nov 26 09:24:54 node2 heartbeat: [2712]: info: acquire local HA resources (standby).
Nov 26 09:24:55 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.60.200/24/eth0 start
Nov 26 09:24:57 node2 ResourceManager[2762]: info: Running /etc/ha.d/resource.d/Filesystem /dev/sdb5 /webdata ext3 start
4. Cut off all network connections of the master node
After the heartbeat line is disconnected, the standby node logs "eth1 dead", but this alone does not trigger a resource switchover. If the network cable connecting the master node to the public network is then unplugged as well, a switchover between the master node and the backup node occurs and the resources move from the master node to the backup node. At this point, reconnect the heartbeat line of the master node and observe the system log: you can see that the Heartbeat process on the standby node is restarted and takes control of the cluster resources again. Finally, reconnect the master node's public network cable, and the cluster resources are transferred from the standby node back to the master node. This is the complete switchover process.
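Instead of physically unplugging cables, the same test can be simulated from the command line. This is only a rough sketch, assuming a Red Hat style network setup in which eth1 carries the heartbeat link and eth0 the public network:

# On the master node node1
ifdown eth1        # heartbeat link down: node2 logs "eth1 dead" but does not fail over
ifdown eth0        # public link down as well: resources move to node2
ifup eth1          # restore the heartbeat link and watch the system log on node2
ifup eth0          # restore the public link: with auto_failback on, resources return to node1

# On the backup node node2, follow the switchover in the log
tail -f /var/log/messages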
5. Abnormally terminate the Heartbeat daemon on the master node
On the master node, you can use the "killall -9 heartbeat" command to kill the Heartbeat process. Because Heartbeat is terminated abnormally, the resources it controls are not released. After the backup node has received no response from the master node for a short period, it considers the master node to have failed and takes over its resources. In this case both nodes hold the same resources at the same time, which leads to resource contention and data conflicts. To deal with this situation, the kernel monitoring module watchdog provided by Linux can be used: integrate watchdog into Heartbeat, and if Heartbeat terminates abnormally or the system hangs, watchdog automatically restarts the machine, releasing the cluster resources and avoiding data conflicts.
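Enabling watchdog only takes a few steps: load the softdog kernel module and point Heartbeat at the watchdog device. The sketch below is one possible way to do it; the device node often exists already, and how the module is loaded at boot varies by distribution.

# Load the software watchdog module now and at every boot
modprobe softdog
echo "modprobe softdog" >> /etc/rc.d/rc.local

# Create the device node if it does not already exist (character device 10,130)
[ -e /dev/watchdog ] || mknod /dev/watchdog c 10 130

# /etc/ha.d/ha.cf (excerpt): let Heartbeat feed the watchdog device
watchdog /dev/watchdog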
In this chapter we did not configure watchdog into the cluster. If watchdog is configured, then when "killall -9 heartbeat" is executed you will see a message like the following in /var/log/messages:
Softdog: WDT device closed unexpectedly. WDT will not stop!
This message tells us that the system has run into a problem and will be restarted.