Heartbeat theory introduction

1. Heartbeat function

With Heartbeat, resources (IP addresses, program services, and other resources) can be transferred quickly from a failed machine to another machine that is running normally, which then continues to provide the service. This is generally called a high-availability service. In real production scenarios, Heartbeat has many similarities to keepalived, another open-source high-availability package, but the two differ in the business applications they are used for.

Official website: http://linux-ha.org/wiki/Main_Page

######################################################

2. The working principle of heartbeat

By editing Heartbeat's configuration file, you specify which server is the primary; the other server automatically becomes the hot standby. The heartbeat daemon on the hot standby is then configured to listen for heartbeat messages from the primary. If the hot standby does not receive a heartbeat from the primary within the specified time, it starts the failover procedure, takes ownership of the resources and services that were running on the primary, and continues to provide service in its place. This is how high availability of resources and services is achieved.
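The timeout rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Heartbeat's actual code: the class name `StandbyMonitor`, the `deadtime` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class StandbyMonitor:
    """Hot-standby side: take over when the primary stays silent too long."""

    def __init__(self, deadtime, clock=time.monotonic):
        self.deadtime = deadtime    # seconds of silence tolerated (assumed value)
        self.clock = clock          # injectable so the logic can be tried without waiting
        self.last_seen = clock()    # time the last heartbeat arrived
        self.active = False         # True once this node has taken over

    def on_heartbeat(self):
        """Record that a heartbeat from the primary has just arrived."""
        self.last_seen = self.clock()

    def check(self):
        """Start failover if no heartbeat arrived within deadtime."""
        if not self.active and self.clock() - self.last_seen > self.deadtime:
            self.active = True      # a real node would now bring up the VIP and services
        return self.active
```

In use, `on_heartbeat()` would be called for every message received on the heartbeat line and `check()` on a periodic timer.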

The above describes Heartbeat's master-backup mode. Heartbeat also supports master-master mode, in which the two servers act as master and backup for each other. In this mode they send messages to each other reporting their current status. If one side does not receive the other's heartbeat message within the specified time, it concludes that the other side has gone down. The healthy host then starts its resource takeover module, takes over the resources and services that were running on the failed host, and continues to serve users. Under normal circumstances this allows the business to keep providing service after a host failure.

Note: "uninterrupted" service still involves a switchover time during failover. Heartbeat's switchover time is generally about 5 to 20 seconds.

In addition: like keepalived, Heartbeat provides high availability at the server level, not at the service level.

Common conditions that trigger a switchover:

a. The server is down

b. The heartbeat service itself fails

c. The heartbeat connection fails

A failure of the service alone does not trigger a switchover, but you can arrange for the heartbeat service to be stopped when the service goes down, which does trigger one.

######################################################

3. Heartbeat connection

From the description so far, it should be clear that deploying the Heartbeat service requires at least two hosts. So how do these two hosts communicate with and monitor each other to provide a high-availability service?

The following are some common and feasible methods for direct communication between two heartbeat hosts:

a. Serial cable. (Disadvantage: the two hosts cannot be far apart.)

b. One Ethernet cable connecting the two network cards directly. (Recommended method)

c. Ethernet cable connected through a switch or other network equipment. (Second choice, because the switch adds a point of failure.)

Option c is the second choice for two reasons: the switch adds a point of failure, and the line is not a dedicated heartbeat line, so it is easily affected by other traffic, which can cause problems sending heartbeat messages.

Tip: The Heartbeat software on the high-availability server pair uses this heartbeat line to check whether the peer machine is alive, and then decides whether to fail over and switch resources to keep the business running.

If conditions permit, several of the connections above can be used at the same time as extra insurance against split brain. In a production environment, one of the first two, or a combination of them, is usually used.

######################################################

4. Heartbeat split brain

a. What is split brain

Split brain occurs when the two high-availability servers cannot detect each other's heartbeat within the specified time, so each starts its failover function and takes ownership of the resources and services, even though both servers are still alive and running normally. The same IP resource or service then starts on both ends at the same time, which causes serious conflicts. The worst case is that both hosts hold the same VIP address: when users write data, the writes may land on either end, which can leave the data on the two servers inconsistent or cause data loss. This situation is called split brain; some people call it a partitioned cluster or a vertical split of the brain.
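A toy simulation makes the failure mode concrete. It is a sketch under simple assumptions (one primary, one standby, a fixed number of timer ticks); the function name `run_pair` and its parameters are invented for the example and do not correspond to anything in Heartbeat.

```python
def run_pair(link_up, deadtime=3, ticks=5):
    """Simulate a primary/standby pair and return the nodes holding the VIP."""
    holders = {"primary"}           # the primary owns the VIP at the start
    silent = 0                      # ticks since the standby last heard a heartbeat
    for _ in range(ticks):
        if link_up:
            silent = 0              # heartbeat delivered over the heartbeat line
        else:
            silent += 1             # primary is alive, but its heartbeat is lost
        if silent > deadtime:
            holders.add("standby")  # standby assumes the primary is dead and takes over
    return holders
```

With a working link only the primary holds the VIP. With a broken link and a healthy primary, both nodes end up holding it, which is exactly the split-brain condition.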

b. Causes of split brain

1. The heartbeat link between the high-availability server pair fails, so the two cannot communicate normally. (For example, the heartbeat cable is broken or aged, the network card or its driver is faulty, there is an IP configuration problem or conflict, or the equipment the heartbeat line passes through, such as a switch, has a problem.)

2. A firewall is enabled on the high-availability server pair and blocks the heartbeat messages.

3. The heartbeat network card address or related settings are configured incorrectly on the high-availability servers, so heartbeats fail to send.

4. There is a problem with the arbitration machine.

5. Other causes, such as improper service configuration: mismatched heartbeat modes, heartbeat broadcast conflicts, software bugs, and so on.

c. Ways to prevent split brain

When split brain occurs, the impact on the business is extremely serious and sometimes fatal. For example, when the two servers of a high-availability pair split, they contend for the same IP resource, just like an ordinary IP address conflict within a LAN. One or both machines will behave abnormally, which affects normal user access to the server. If this happens to a critical high-availability service such as a database or storage service, user data may be written intermittently to two different servers, which makes the data very difficult or even impossible to recover. (Of course, shared storage hardware such as a NAS may make things better.)

In a real production environment, we can prevent split brain in the following ways:

1. Use a serial cable and an Ethernet cable at the same time, so that there are two heartbeat lines. If one of them breaks, the other still works and can still carry heartbeat messages.

2. When split brain is detected, forcibly shut down one heartbeat node. (This requires special equipment support, such as STONITH or a fence device.) In effect, when the standby node finds that the heartbeat line is faulty, it sends a shutdown command to the primary node.

3. Monitor and alert on split brain (for example by email), so that when a problem occurs someone can step in and arbitrate at once to reduce the loss. Of course, when implementing a high-availability solution, you must decide from the actual needs of the business whether such a loss can be tolerated. For a typical website business, the loss is tolerable and controllable.

4. Enable a disk lock. The node that is serving locks the shared disk, so that when split brain occurs the other node cannot grab the shared disk resources. Disk locking has a serious drawback, however: if the node holding the shared disk does not release the lock itself, the other node can never gain use of the shared disk. In practice the serving node may crash suddenly and be unable to run the unlock command, in which case the standby node cannot take over the shared resources and application services. An alternative is to develop a smarter lock.

5. After an alarm is raised and before the standby takes over, allow enough time for staff to deal with the problem.

For example: the alarm is raised within one minute, but the standby does not take over at that point; it takes over only after 5 minutes. The takeover takes longer, but no data is lost. The cost is that users cannot write data in the meantime.

6. After the alarm, the service is not taken over automatically; staff take it over manually.

7. Add an arbitration mechanism to determine which node should get the resources.

Fence is a term from the HA cluster environment. In hardware terms, a fence device is an intelligent power management device; there are internal fences and external fences.
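One way to picture an arbitration check is a rule that lets a node take over only when the peer is silent and a third party is still reachable. The sketch below is an illustration of that idea only; the function name and both parameters are assumptions, and how the arbiter is probed (for example, pinging a gateway) is left out.

```python
def should_take_over(peer_heartbeat_ok, arbiter_reachable):
    """Decide whether this node may claim the resources.

    peer_heartbeat_ok: True if heartbeats from the peer are still arriving.
    arbiter_reachable: True if this node can still reach the arbitration machine.
    """
    if peer_heartbeat_ok:
        return False            # peer is alive and visible: nothing to take over
    # Peer is silent. If the arbiter is also unreachable, this node is probably
    # the isolated one, so it must not grab the resources.
    return arbiter_reachable
```

The design choice here is to treat "cannot see the peer and cannot see the arbiter" as evidence that the local node is cut off, which avoids both sides claiming the VIP when only the heartbeat line has failed.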

######################################################

5. Heartbeat message type

a. Heartbeat message

b. Cluster conversion message (ip-request, ip-request-resp)

c. Retransmission request

Heartbeat message: a heartbeat message is about 150 bytes and may be sent by unicast, broadcast, or multicast. It controls the heartbeat frequency and how long to wait before failing over when a fault occurs.

Cluster conversion message (ip-request and ip-request-resp):

After the failed primary node is repaired, it uses the ip-request message to ask the backup server to release the resources that the backup took over when the primary failed. The backup server then stops and releases those resources and services.

After the backup server has released the resources and services it took over, it uses the ip-request-resp message to inform the primary server that it no longer owns them. When the primary server receives the ip-request-resp message from the standby node, it starts the resources and services it gave up when it failed and begins to provide normal access again.
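The give-back exchange can be modeled as two message handlers. This is a minimal sketch: only the message names `ip-request` and `ip-request-resp` come from the text, while the `Node` class, the dictionary layout, and the `released` field are hypothetical and do not reflect Heartbeat's wire format.

```python
class Node:
    """A cluster node that tracks which resources it currently owns."""

    def __init__(self, name, resources=()):
        self.name = name
        self.resources = set(resources)

    def handle(self, msg):
        """Process one cluster conversion message and return the reply, if any."""
        if msg["type"] == "ip-request":
            # The repaired primary asks for its resources back: release them all.
            released = sorted(self.resources)
            self.resources.clear()
            return {"type": "ip-request-resp", "released": released}
        if msg["type"] == "ip-request-resp":
            # The backup confirms the release: start the resources locally.
            self.resources.update(msg["released"])
            return None
        raise ValueError("unknown message type: %r" % msg["type"])


# Usage: the backup holds the VIP and a service after a failover.
primary = Node("primary")
backup = Node("backup", ["10.0.0.17", "httpd"])
reply = backup.handle({"type": "ip-request"})   # primary -> backup
primary.handle(reply)                           # backup -> primary
```

The ordering matters: the primary starts the resources only after the response arrives, so the two nodes never hold them at the same time.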

Retransmission request:

rexmit-request controls the retransmission of heartbeat requests. This message is not very important.

######################################################

6. Heartbeat IP address takeover and failover (notes)

Heartbeat performs failover through IP address takeover and ARP broadcast.

ARP broadcast: when the primary server fails and the standby node has taken over the resources, the standby immediately forces an update of the local ARP tables of all clients (that is, it clears the cached record on each client that maps the VIP address to the failed server's MAC address). This ensures that clients connect properly to the new primary server.
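To show what such an announcement looks like on the wire, the sketch below builds a gratuitous ARP request as raw bytes: a broadcast frame whose sender and target IP are both the VIP. The function name, the VIP, and the MAC address are made up for the example; actually sending the frame needs a raw socket and root privileges and is not shown, and how Heartbeat itself emits the broadcast is not modeled here.

```python
import socket
import struct


def gratuitous_arp(vip, mac):
    """Build a broadcast ARP request announcing that `vip` is at `mac`."""
    hw = bytes.fromhex(mac.replace(":", ""))              # 6-byte sender MAC
    ip = socket.inet_aton(vip)                            # 4-byte VIP
    broadcast = b"\xff" * 6
    ether = broadcast + hw + struct.pack("!H", 0x0806)    # dst, src, EtherType = ARP
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)       # Ethernet, IPv4, sizes, op = request
    arp += hw + ip                                        # sender MAC, sender IP (the VIP)
    arp += b"\x00" * 6 + ip                               # target MAC unknown, target IP = VIP
    return ether + arp


frame = gratuitous_arp("10.0.0.17", "00:0c:29:aa:bb:cc")  # hypothetical VIP and MAC
```

Hosts that receive this frame and already have an entry for the VIP update it to the new MAC address, which is the cache refresh the paragraph above describes.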

Use arp -a to view the local ARP cache on Windows.

######################################################

7. VIP/IP alias/secondary IP

The real IP, also called the management IP, generally refers to the actual IP configured on the physical network card; you can think of it as the server's own name. In a load-balancing or high-availability environment, the management IP generally does not serve external users. It exists only for managing the server; for example, you connect to the server over SSH through this management IP.

VIP stands for virtual IP. This is only a concept, and the name may be misleading. In practice, a VIP is an alias IP that Heartbeat temporarily binds to the physical network card (Heartbeat 3 and later use secondary IPs instead), for example eth0:x, where x is any number from 0 to 255. You can bind multiple alias IPs to one network card. In a production environment, the website's domain name must be resolved to this VIP address in DNS, and it is this VIP that serves users.

The advantage is that when the server providing the service goes down, the VIP is automatically configured on the server that takes over, which then provides service on the same address. If the management IP were used for this, moving it back and forth would be difficult, and once the management IP had moved away, we could only reach the server by going to the machine room. The essence of the VIP scheme is that each of the two servers keeps a management IP that never moves, so you can connect to the machine at any time, and any other IPs are added and bound on top of it. Even when the VIP moves, the server itself remains reachable because it still has its management IP.

How to configure a VIP manually

How to configure an alias IP:

ifconfig eth0:10 10.0.0.1/24 up (configure the alias IP)

ifconfig eth0:10 down (delete the alias IP)

How to configure a secondary (auxiliary) IP, which you can view with ip addr:

ip addr add 10.0.15.1/24 broadcast 10.0.15.255 dev eth0 (configure the secondary IP)

ip addr del 10.0.15.1/24 broadcast 10.0.15.255 dev eth0 (delete the secondary IP)

ip addr shows both secondary IPs and alias IPs, but ifconfig cannot show secondary IPs.

Heartbeat and keepalived use the commands above to configure the VIP when they start and to delete it when they stop. In a high-availability environment the two ways of configuring and deleting a VIP have the same effect. Which command a tool uses is a matter of history, such as the system environment at the time it was written.

Before version 3, Heartbeat used alias IPs; from Heartbeat 3 on, it uses secondary IPs, the same as keepalived.
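As a small aid for building the secondary-IP commands, the following Python sketch derives the broadcast address from the CIDR notation and returns the add and delete argument lists. The helper name `vip_commands` is hypothetical; it only constructs the commands and does not run them, since running them requires root.

```python
import ipaddress


def vip_commands(cidr, dev):
    """Return the `ip addr add` and `ip addr del` argument lists for a VIP."""
    iface = ipaddress.ip_interface(cidr)            # e.g. 10.0.15.1/24
    bcast = iface.network.broadcast_address         # e.g. 10.0.15.255
    tail = [str(iface), "broadcast", str(bcast), "dev", dev]
    return ["ip", "addr", "add"] + tail, ["ip", "addr", "del"] + tail


add_cmd, del_cmd = vip_commands("10.0.15.1/24", "eth0")
print(" ".join(add_cmd))
print(" ".join(del_cmd))
```

The lists can be passed to `subprocess.run` on a test machine if you want to apply them.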

######################################################

9. Heartbeat script default directory

Start script: /etc/init.d/

Resource directory: /etc/ha.d/resource.d/

/etc/ha.d/resource.d/ is an important resource directory for Heartbeat. If you develop your own scripts later, put them here and then call them directly from the haresources file.

######################################################

10. Heartbeat configuration files

The default configuration directory of Heartbeat is /etc/ha.d.

Heartbeat has three commonly used configuration files: ha.cf, authkeys, and haresources. If you look closely, each name reflects what the file does. Their functions are as follows:

Configuration file   Role                                    Remarks
ha.cf                Heartbeat parameter configuration file  Basic Heartbeat parameters are configured here
authkeys             Heartbeat authentication file           The high-availability server pair authenticate each other based on the peer's authkeys
haresources          Heartbeat resource configuration file   Configures IP resources, script programs, and so on

Important resource directory: /etc/ha.d/resource.d/. If you develop your own programs later, put them here and then call them directly from the haresources file.
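To illustrate how a line in the resource file ties a node to the scripts in /etc/ha.d/resource.d/, here is a small parser sketch. It assumes the common layout of a node name followed by resources, with "::" separating a script name from its arguments; the sample line, the node name `node1`, and the function name are invented for the example and should be checked against your own file.

```python
def parse_haresources_line(line):
    """Split one resource line into (node, [(script, [args...]), ...])."""
    fields = line.split()
    node, resources = fields[0], []
    for item in fields[1:]:
        script, *args = item.split("::")   # "IPaddr::a/b/c" -> "IPaddr", ["a/b/c"]
        resources.append((script, args))
    return node, resources


# Hypothetical line: node1 normally owns a VIP and an httpd service.
node, resources = parse_haresources_line("node1 IPaddr::10.0.0.17/24/eth0 httpd")
```

Each script name in the result would correspond to a program found in the resource directory or the start script directory.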

Where to deploy Heartbeat high availability

Front-end web services generally do not use Heartbeat; they use keepalived, LVS, Nginx, or HAProxy instead.

Heartbeat is generally the better choice for back-end resources that involve data synchronization, such as storage and databases.

Source: Old Boy Architect Video