Official website: http://linux-ha.org/wiki/Main_Page
Introduction to Heartbeat
Heartbeat is open source high-availability (HA) service software. With Heartbeat, resources (IP addresses, program services, and so on) can be moved quickly from a failed machine to another machine that is running normally, which then continues to provide the service. This is what is generally called a highly available service. In production, Heartbeat has much in common with another open source HA tool, Keepalived, but the two are used for different jobs. Keepalived mainly controls the drift of an IP address and is simple to configure and use. Heartbeat can control IP drift too, but it is better at controlling resource services (for example, restarting MySQL), and it is more complicated to configure and use.
How Heartbeat works: Keepalived and Heartbeat provide high availability at the operating system level, not at the software (service) level. High availability at the software level can be added with a simple script.
In Heartbeat's active/standby mode, the Heartbeat configuration file specifies which server is the primary; the other server automatically becomes the hot standby. The Heartbeat daemon on the hot standby then listens for heartbeat messages from the primary server. If the standby does not hear the primary's heartbeat within the specified time, it starts the failover procedure, takes ownership of the primary's resources and services, and continues to provide the service in place of the primary. This is how high availability of resources and services is achieved.
Heartbeat also supports active/active (dual-primary) mode for different services, in which the two servers act as primary and standby for each other. Each sends heartbeat messages to tell the other that it is still alive. If one side does not receive the other's heartbeat within the specified time, it considers the other side failed or down. The healthy host then starts its own resource takeover module and takes over the resources or services that were running on the other host, so users continue to be served. In the normal case, this means business services keep running after a host failure. Note that "uninterrupted" service still involves a switchover time during failover (for example, to stop the database and storage services). Heartbeat's active/standby switchover generally takes about 5-20 seconds after a server goes down, which is faster than switching services manually.
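The "no heartbeat within the specified time" rule can be sketched in a few lines of Python. This is only an illustration of the timing logic, not Heartbeat's actual code: the class name `PeerMonitor`, the field name `deadtime`, and the 30-second value are assumptions made for the example, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks the last heartbeat seen from the peer node (illustrative only)."""
    deadtime: float         # seconds of silence before the peer is treated as down (assumed name)
    last_seen: float = 0.0  # timestamp of the most recent heartbeat received

    def on_heartbeat(self, now: float) -> None:
        # Record the arrival time of a heartbeat from the peer.
        self.last_seen = now

    def peer_is_dead(self, now: float) -> bool:
        # The peer counts as down once the silence exceeds deadtime.
        return (now - self.last_seen) > self.deadtime


monitor = PeerMonitor(deadtime=30.0)
monitor.on_heartbeat(now=100.0)
print(monitor.peer_is_dead(now=120.0))  # 20 s of silence, still within deadtime
print(monitor.peer_is_dead(now=131.0))  # 31 s of silence, failover would start
```

A real node would call `on_heartbeat` from its receive loop and check `peer_is_dead` on a timer; the takeover itself is outside the scope of this sketch.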
Common conditions that trigger a switchover on high-availability servers:
1) The server goes down physically (hardware damage or operating system fault). This is the main problem Heartbeat is meant to solve.
2) The Heartbeat service software itself fails.
3) The heartbeat link between the primary and standby servers fails.
A failure of the service itself does not trigger a switchover. To handle that case, the Heartbeat service can be shut down when the service goes down.
Heartbeat link methods (at least two hosts are required)
1. A serial cable connecting the two servers (optional; commonly used)
The serial signal does not share the Ethernet network, and no IP address or similar settings need to be configured for it, so the link is stable and rarely causes problems. The drawback of a serial link is that the two servers cannot be far apart.
The serial cable corresponds to the device /dev/ttyS0 on the server.
2. One Ethernet cable directly connecting two network cards (optional; commonly used)
Connect the two network cards directly with an Ethernet cable. No special crossover cable is needed; an ordinary network cable works. Configuration is fairly simple: give the two directly connected network cards addresses in their own independent IP segment so they can communicate with each other.
3. An Ethernet cable connected through a switch or other network equipment (second choice)
A networked Ethernet link is the second choice for a heartbeat line. The switch adds a point of failure, and the line is not dedicated to heartbeats, so it is easily affected by other traffic on the Ethernet. As a result, heartbeat messages may be delayed or not delivered.
Choosing a heartbeat link:
Reminder: the connections above can be used at the same time to add redundancy and prevent split brain.
1. Data-related services have higher requirements and can use a serial cable together with a directly connected network cable.
2. Web and HTTP services can use a directly connected network cable or LAN communication.
Heartbeat split brain (the active and standby nodes fail to receive each other's heartbeats, so each node starts the resources and services itself)
For some reason, the two high-availability servers cannot detect each other's heartbeat within the specified time, so each starts the failover function and takes ownership of the resources and services, even though both servers are still alive and running normally. The same IP address or service then starts on both sides at once and causes serious conflicts. The worst case is that both hosts hold the same VIP address: when users write data, it may be written to both sides separately, leaving the data on the two servers inconsistent or lost. This situation is called split brain.
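The defining symptom described above (the same VIP held by both nodes at once) is easy to express as a check. The sketch below assumes some external monitor has already collected, per node, whether the VIP is currently bound there; the function names and the node names `node-a` and `node-b` are hypothetical.

```python
from typing import Dict, List


def vip_holders(vip_status: Dict[str, bool]) -> List[str]:
    """Return the names of the nodes that currently hold the VIP, sorted."""
    return sorted(node for node, holds in vip_status.items() if holds)


def is_split_brain(vip_status: Dict[str, bool]) -> bool:
    """Split brain, in the sense used here: more than one node holds the VIP."""
    return len(vip_holders(vip_status)) > 1


# Hypothetical monitoring results for a two-node pair.
print(is_split_brain({"node-a": True, "node-b": False}))  # one holder: normal
print(is_split_brain({"node-a": True, "node-b": True}))   # two holders: split brain
```

How the per-node status is gathered (for example by inspecting the interface addresses on each host) is left out, since it depends on the environment.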
Summary of the causes of split brain:
2. The iptables firewall enabled on a high-availability server blocks heartbeat messages. (Common)
3. The address of the heartbeat network card on a high-availability server is configured incorrectly, so heartbeats fail to send. (Common)
4. Other causes, such as improper configuration of other services: mismatched heartbeat methods, heartbeat broadcast conflicts, software bugs, and so on.
Summary of methods for preventing split brain
1. Use a serial cable and an Ethernet cable together, so that there are two heartbeat lines at once. (Commonly used)
2. When split brain is detected, forcibly shut down one heartbeat node. In effect, when the standby node finds that the heartbeat line has failed, it sends a shutdown command to the primary node. (Rarely used in general; banks use it more. It needs special equipment support, such as STONITH or a fence device, to kill the other node.)
3. Monitoring and alerting (relies on alerts). One method: check whether the standby node holds the VIP, then check whether the service on the primary is abnormal. Arbitration is then done by manual intervention, which suits ordinary website services (for example, Baidu's alert monitoring of upstream and downstream). Alternatively, leave enough time before the server takes over for staff to respond to the alert.
4. Enable disk locking. (Rarely used.) The serving side enables the disk lock only when it finds that all heartbeat links are down; normally the disk is not locked. This suits shared-storage scenarios, such as Oracle.
5. Add an arbitration mechanism. (The arbiter is generally a gateway, which is unlikely to go down.)
When all heartbeat links are down, both nodes ping the reference IP. A node that cannot reach it gives up the competition or restarts itself, letting the node that can reach it take over the service.
Resources can also be obtained through arbitration by third-party software.
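Method 5 boils down to a small decision rule that each node evaluates on its own. The following Python sketch shows that rule under simple assumptions: the number of working heartbeat links and the result of pinging the reference IP are supplied by the caller, and the names `Action` and `arbitrate` are invented for the example.

```python
from enum import Enum


class Action(Enum):
    STAY = "keep current role"
    TAKE_OVER = "take over the service"
    STAND_DOWN = "give up the competition or restart"


def arbitrate(heartbeat_links_up: int, reference_reachable: bool) -> Action:
    """Decide what one node should do, following the ping-the-reference-IP rule."""
    if heartbeat_links_up > 0:
        # At least one heartbeat line still works, so no arbitration is needed.
        return Action.STAY
    if reference_reachable:
        # All heartbeat lines are down but the gateway answers: this node is connected.
        return Action.TAKE_OVER
    # All heartbeat lines are down and the gateway is unreachable: this node is isolated.
    return Action.STAND_DOWN


print(arbitrate(heartbeat_links_up=0, reference_reachable=True))
print(arbitrate(heartbeat_links_up=0, reference_reachable=False))
```

If both nodes can still reach the reference IP, this rule alone does not pick a winner, which is one reason it is usually combined with the other methods in the list.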
Term: fence
1. Fence is a term from cluster environments.
2. In hardware terms, fence devices are intelligent power management devices (commonly called smart power management devices or remote management cards). They have an Ethernet port and are used to restart network resource services when an HA switchover is triggered. If the term is unfamiliar, search for it on Baidu.
Term: arbitration
RHCS has an arbitration mechanism called the arbitration disk. It is implemented with additional storage, such as a SAN: a special block device created with the mkqdisk command. By default, in a two-node HA architecture, the primary and standby servers each have 1 vote, so the two sides are equal, and a heartbeat problem leads to split brain. With arbitration, the number of votes can be set in RHCS. Both nodes ping the gateway and write their own liveness status to the arbitration disk. Once a node's heartbeat has a problem and the arbitration disk stops receiving liveness information from that node, the node is fenced (powered off or restarted). Note: this applies only when the primary and standby cannot communicate over the heartbeat. For example: the nodes ping the gateway, the primary is connected to the arbitration device, and the arbitration device controls the power of the primary and standby servers.
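The vote arithmetic behind this can be shown with a simplified majority rule. This is a sketch of the idea only, not RHCS's implementation: the function name `has_quorum` is hypothetical, and giving the arbitration disk exactly 1 vote is an assumption made for the example.

```python
def has_quorum(votes_held: int, expected_votes: int) -> bool:
    """Simple majority rule: a partition needs more than half of all expected votes."""
    return votes_held > expected_votes // 2


# Two nodes with 1 vote each and no arbiter: after the heartbeat is lost,
# each side holds 1 of 2 votes, so neither side has a majority.
print(has_quorum(votes_held=1, expected_votes=2))

# Add an arbitration disk worth 1 vote (assumed): the node that can still
# reach the disk holds 2 of 3 votes, the isolated node holds 1 of 3.
print(has_quorum(votes_held=2, expected_votes=3))
print(has_quorum(votes_held=1, expected_votes=3))
```

The tie in the first case is exactly the "two sides are equal" situation; the extra vote breaks it.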
Term: STONITH
STONITH is a component of the Heartbeat software package. It lets a remote or "smart" power device connected to the healthy server automatically power-cycle the failed server. A STONITH device can switch the power off in response to software commands. A server running Heartbeat can send commands over a serial cable or network cable to the STONITH device, which controls the power supply of the other server in the high-availability pair. (In theory there is no limit on the number of servers, but two works best.)
Heartbeat reference blog post: http://blog.chinaunix.net/uid-7921481-id-1617030.html
Heartbeat message types:
Heartbeat messages: packets of about 150 bytes, sent over a serial port or by unicast, broadcast, or multicast. Settings control the heartbeat frequency and how long to wait without a heartbeat before performing failover.
Cluster transition messages: ip_request and ip_request_resp. When the primary server comes back online, it sends an ip_request message asking the standby to release the resources that the standby took over when the primary failed. After the standby has released those resources and services, it sends an ip_request_resp message to notify the primary that it no longer owns them.
Retransmission requests: rexmit-request messages control the retransmission of heartbeat requests (not important).
Heartbeat IP address takeover and failover:
Heartbeat performs failover through IP address takeover and ARP broadcast.
ARP broadcast: when the primary server fails and the standby node has taken over the resources, the standby forces all clients to update their local ARP tables. This clears the cached mapping between the VIP address and the failed server's MAC address on each client, so that clients talk to the new primary server.
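To make the ARP broadcast concrete, here is a sketch that builds one common form of gratuitous ARP frame (a broadcast ARP request in which sender and target IP are both the VIP). It only constructs the bytes and does not send anything. The MAC address is a made-up placeholder, and this is not a claim about the exact packet Heartbeat itself emits.

```python
import socket
import struct


def build_gratuitous_arp(vip: str, mac: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for vip."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(vip)

    # Ethernet header: broadcast destination, our MAC as source, EtherType 0x0806 (ARP).
    ethernet = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += mac_bytes + ip_bytes    # sender MAC and sender IP (the VIP)
    arp += b"\x00" * 6 + ip_bytes  # target MAC unset, target IP is the VIP again

    return ethernet + arp


# 10.0.0.10 is the VIP used in this post; the MAC is a hypothetical placeholder.
frame = build_gratuitous_arp("10.0.0.10", "00:0c:29:aa:bb:cc")
print(len(frame), frame.hex())
```

Clients that receive such a frame associate the VIP with the sender's MAC address, which is the cache update described above. Actually transmitting it requires a raw socket and root privileges, which are deliberately left out here.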
VIP / IP alias / secondary IP
Real IP: the actual IP configured on the physical network card, also called the management IP.
Virtual IP (VIP): an IP temporarily bound to the physical network card.
Common ways to configure a VIP
Alias IP
Add: ifconfig eth0:1 10.0.0.10 netmask 255.255.255.254 up
Delete: ifconfig eth0:1 down
Permanent: write it into a configuration file. Alias IPs are being phased out; use secondary IPs instead.
Secondary IP (note: use secondary IPs from now on)
Add: ip addr add 10.0.0.2/24 dev eth0
View: ip a
Delete: ip addr del 10.0.0.2/24 dev eth0
Note: Heartbeat releases before 2.1.4 used alias IPs; Heartbeat 2.1.4 uses secondary IPs to provide the VIP, while Keepalived has always used secondary IPs.
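The two styles above write the network size differently: ifconfig takes a dotted netmask, while ip addr takes a prefix length. A small helper built on Python's standard `ipaddress` module converts between the two. The helper names `to_cidr` and `to_netmask` are made up for this example; the addresses are the ones from the commands above.

```python
import ipaddress


def to_cidr(address: str, netmask: str) -> str:
    """Convert an address plus dotted netmask into address/prefix form."""
    return ipaddress.ip_interface(f"{address}/{netmask}").with_prefixlen


def to_netmask(cidr: str) -> str:
    """Return the dotted netmask for an address written in address/prefix form."""
    return str(ipaddress.ip_interface(cidr).netmask)


print(to_cidr("10.0.0.10", "255.255.255.254"))  # the alias IP example
print(to_netmask("10.0.0.2/24"))                # the secondary IP example
```

This is handy when moving an old alias definition over to the secondary IP style, since the prefix length has to match the netmask that was used before.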