Analysis of a ZooKeeper cluster timeout problem

The three-node ZooKeeper ensemble installed by CDH runs with essentially the default configuration and had been working normally. Today a problem appeared:
client connections timed out repeatedly (six times), even though the default maximum session timeout is about one minute.
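To make the session-timeout behavior concrete, here is a minimal client sketch. The connect string and the requested timeout below are illustrative values, not taken from this cluster; the point is that the server clamps whatever the client asks for to its configured maxSessionTimeout (about one minute here).

```java
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch (illustrative hosts and timeout, not this cluster's real values).
// The client *requests* a session timeout; the server clamps it to the range
// [minSessionTimeout, maxSessionTimeout], so asking for more than about one
// minute still yields at most the configured maximum.
public class SessionTimeoutCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",  // hypothetical connect string
                90_000,                        // requested session timeout in ms
                event -> System.out.println("event: " + event));
        Thread.sleep(2000);                    // give the session time to establish
        System.out.println("negotiated session timeout: "
                + zk.getSessionTimeout() + " ms");
        zk.close();
    }
}
```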
Cause analysis:
1. First, confirm that the network is healthy and that clocks are synchronized across the nodes.
2. Check the existing configuration: it is basically the defaults, except that the JVM heap differs between nodes (1 GB on some, 2 GB on others).
3. Check the dataDir directory with du -sh: it has grown to more than 500 MB.
The specific root cause is still uncertain; nothing obviously wrong shows up in the logs.
The likely explanation is that the amount of data stored in ZooKeeper has grown over time, so after a restart the volume of data that has to be synchronized, combined with a too-short initial synchronization window (initLimit=10) and similar limits, leaves the cluster in an unhealthy state.
Solution:
1. Increase the JVM heap from 1 GB to 3 GB, and confirm the machines have enough physical memory that they will not swap.
2. Increase tickTime from 2000 to 3000; in general, raise tickTime, or raise initLimit and syncLimit, or both (see the configuration sketch below).
3. Increase the maximum number of client connections to 600, just in case.
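A sketch of what the adjusted server configuration might look like. The tickTime and maxClientCnxns values come from the steps above; the initLimit and syncLimit numbers are illustrative assumptions, and the heap size is set through the JVM options (in CDH, via the ZooKeeper Server role's Java heap setting), not through zoo.cfg.

```
# zoo.cfg (sketch; only the changed settings are shown)

# Raised from 2000 ms, so every tick-based timeout grows by 50%.
tickTime=3000

# Illustrative values: more ticks for the initial snapshot sync and for
# followers to keep up with the leader. Tune to the actual data size.
initLimit=30
syncLimit=10

# Raised "just in case".
maxClientCnxns=600

# The heap increase is not a zoo.cfg setting; it is a JVM option,
# roughly equivalent to: -Xms3g -Xmx3g
```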

Related material found while researching:

1. See this article by Rookie Xiaoxuan: https://www.jianshu.com/p/f30ae8e75d6d

The data each server broadcasts during leader election includes four parts:

The id of the leader this server currently votes for: in the first round of election, this is its own id;

The zxid of the newest data the server has saved: the newer it is, the more that server deserves to be chosen as leader by the others, which keeps the data up to date;

The logical clock, i.e. how many rounds of election have taken place: the larger the value, the more recent the election round;

Status: LOOKING, FOLLOWING, OBSERVING, LEADING;

After each server receives the votes sent by the others, it compares them: the vote with the largest logical clock and the newest saved data (largest zxid) wins, with the server id as a tie-breaker (there are further conditions and the real logic is more involved; they are not repeated here). It then updates its own vote and re-broadcasts. After a few rounds of this, the ensemble converges and the server with the most votes is elected leader.
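The comparison rule above can be sketched in a few lines. This mirrors the ordering described in the text (logical clock first, then zxid, then server id as tie-breaker); it is a simplified illustration, not ZooKeeper's actual FastLeaderElection code, and the class and field names are made up for the example.

```java
// Simplified sketch of the vote-comparison rule: a received vote replaces our
// current one if it comes from a newer election round, or has seen newer data,
// or (everything else being equal) proposes a server with a larger id.
final class Vote {
    final long proposedLeaderId; // id of the server this vote proposes as leader
    final long zxid;             // newest zxid the proposing server has saved
    final long electionEpoch;    // "logical clock": which election round this is

    Vote(long proposedLeaderId, long zxid, long electionEpoch) {
        this.proposedLeaderId = proposedLeaderId;
        this.zxid = zxid;
        this.electionEpoch = electionEpoch;
    }

    /** Should the received vote replace my current vote? */
    static boolean shouldSwitchVote(Vote received, Vote mine) {
        if (received.electionEpoch != mine.electionEpoch) {
            return received.electionEpoch > mine.electionEpoch; // newer election round wins
        }
        if (received.zxid != mine.zxid) {
            return received.zxid > mine.zxid;                   // more up-to-date data wins
        }
        return received.proposedLeaderId > mine.proposedLeaderId; // tie-break on server id
    }
}
```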

A leader is now selected, but its seat is not yet secure: next it has to synchronize its saved data to all followers (to resolve the problem of multiple writes). If an error or timeout occurs during this phase, a new leader election has to be held.

So what usually causes a ZooKeeper cluster to hang? In the final analysis: the data to be synchronized is too big. How big? About 500 MB.

The practical limit for the amount of data the leader can synchronize to followers in a ZooKeeper cluster is around 500 MB, and this 500 MB of data, once loaded into memory, occupies about 3 GB. The data is simply too large: after every election it must be synchronized from the leader to the followers, which easily runs into errors or timeouts and therefore triggers yet another election.

