ZooKeeper knowledge points

This article is reproduced from https://www.cnblogs.com/lanqiu5ge/p/9405601.html#_label2, with minor adjustments.

Extended reading:

http://www.voidcn.com/cata/817705

http://www.cnblogs.com/leesf456/p/6239578.htm

https://www.cnblogs.com/gnivor/p/6264080.html

Contents

1. What is ZooKeeper?
2. What does ZooKeeper provide?
3. Zookeeper file system
4. ZAB protocol?
5. Four types of data nodes Znode
6. Zookeeper Watcher mechanism-data change notification
7. Client registration Watcher implementation
8. Server processing Watcher implementation
9. Client callback Watcher
10. ACL permission control mechanism
- UGO (User/Group/Others )
- ACL (Access Control List) access control list
11. Chroot features
12. Session management
13. Server role
- Leader
- Follower
- Observer
14. Zookeeper server working status
15. Leader election
16. Data synchronization
- Direct differential synchronization (DIFF synchronization)
- Rollback then differential synchronization (TRUNC+DIFF synchronization)
- Only rollback synchronization (TRUNC synchronization)
- Full synchronization (SNAP synchronization)

17. How does zookeeper ensure the sequential consistency of transactions?

18. Why is there a Master in a distributed cluster?
19. How to deal with zk node downtime?
20. The difference between zookeeper load balancing and nginx load balancing
21. How many deployment modes does Zookeeper have?
22. What is the minimum number of machines in a cluster, and what are the cluster rules?
23. Does the cluster support dynamic addition of machines?
24. Is Zookeeper’s watch monitoring notification to the node permanent? Why is it not permanent?
25. What are the Java clients of Zookeeper?
26. What is chubby, and how does it compare with zookeeper?
27. Say a few commonly used commands in zookeeper.
28. The connection and difference between ZAB and Paxos algorithm?
29. Typical application scenarios of Zookeeper
- 1. Data publishing/subscription
- 2. Load balancing


1. What is ZooKeeper?

ZooKeeper is an open source distributed coordination service. It acts as the manager of a cluster: it monitors the state of each node and carries out the next reasonable operation based on the feedback the nodes submit. The end result offered to users is a simple, easy-to-use interface on top of a system with high performance and stable functionality.

Distributed applications can use Zookeeper to implement functions such as data publishing/subscription, load balancing, naming services, distributed coordination/notification, cluster management, Master election, distributed locks, and distributed queues.

Zookeeper guarantees the following distributed consistency features:

Sequential consistency
Atomicity
Single view
Reliability
Real-time (eventual consistency)

A client's read request can be processed by any machine in the cluster. If the read request registers a listener on a node, that listener is also handled by the zookeeper machine the client is connected to. Write requests are sent to the other zookeeper machines at the same time, and a request returns success only after a consensus has been reached. Therefore, as the number of machines in a zookeeper cluster increases, the throughput of read requests increases but the throughput of write requests decreases.

Ordering is a very important feature in zookeeper. All updates are globally ordered, and each update has a unique timestamp called the zxid (Zookeeper Transaction Id). Read requests are ordered only relative to updates: the result returned for a read request carries the latest zxid of the zookeeper machine that served it.

2. What does ZooKeeper provide?

1. File system
2. Notification mechanism

3. Zookeeper file system

Zookeeper provides a multi-level node namespace (nodes are called znodes). Unlike a file system, every one of these nodes can have associated data, whereas in a file system only file nodes can store data and directory nodes cannot.
To ensure high throughput and low latency, Zookeeper maintains this tree-like directory structure in memory. This prevents Zookeeper from being used to store large amounts of data: the upper limit of data storage for each node is 1 MB.

4. ZAB protocol?

The ZAB protocol is an atomic broadcast protocol with crash recovery support, designed specifically for the distributed coordination service Zookeeper.

The ZAB protocol includes two basic modes: crash recovery and message broadcast.

When the whole zookeeper cluster has just started, or when the Leader server goes down, restarts, or a network failure leaves no majority of servers able to communicate normally with the Leader, all processes (servers) enter crash recovery mode. They first elect a new Leader server, and then the Follower servers in the cluster synchronize data with the new Leader. Once more than half of the machines in the cluster have completed data synchronization with the Leader, the cluster exits recovery mode and enters message broadcast mode. The Leader server then starts accepting transaction requests from clients and generates transaction proposals to process them.

5. Four types of data nodes Znode

PERSISTENT - persistent node
Unless it is deleted manually, the node always exists on Zookeeper.
EPHEMERAL - temporary node
The life cycle of a temporary node is bound to the client session. Once the client session becomes invalid (a client disconnecting from zookeeper does not necessarily invalidate the session), all temporary nodes created by this client are removed.
PERSISTENT_SEQUENTIAL - persistent sequential node
The basic characteristics are the same as those of a persistent node, with an added sequence attribute: an auto-incrementing integer maintained by the parent node is appended to the node name.
EPHEMERAL_SEQUENTIAL - temporary sequential node
The basic characteristics are the same as those of a temporary node, with an added sequence attribute: an auto-incrementing integer maintained by the parent node is appended to the node name.
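
The sequence attribute can be illustrated with a small in-memory sketch. It is not ZooKeeper client code: `ParentNode` and its `create` method are hypothetical names, and the only behaviour modelled is that the parent keeps a counter and appends it to the requested name as a 10-digit, zero-padded number.

```python
from itertools import count


class ParentNode:
    """Hypothetical stand-in for a parent znode that hands out sequence numbers."""

    def __init__(self):
        self._counter = count(0)  # increases monotonically per parent
        self.children = []

    def create(self, name, sequential=False):
        if sequential:
            # Append the parent's counter as a 10-digit, zero-padded suffix.
            name = f"{name}{next(self._counter):010d}"
        self.children.append(name)
        return name


parent = ParentNode()
first = parent.create("lock-", sequential=True)
second = parent.create("lock-", sequential=True)
plain = parent.create("config")
print(first, second, plain)
```

Because the suffix only ever grows, sorting the children by name recovers creation order, which is what lock and queue recipes built on sequential nodes rely on.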

6. Zookeeper Watcher mechanism - data change notification

Zookeeper allows a client to register a Watcher on a Znode on the server. When a specified event on the server triggers the Watcher, the server sends an event notification to that client, which implements distributed notification. The client then makes business changes according to the notification status and event type carried by the Watcher notification.

Working mechanism:

Client registration watcher
Server processing watcher
Client callback watcher

Watcher feature summary:

One-time
On both the server and the client, once a Watcher is triggered, Zookeeper removes it from the corresponding storage. This design effectively reduces the pressure on the server. Otherwise, for nodes that are updated very frequently, the server would continuously send event notifications to clients, putting great pressure on both the network and the server.
Client serial execution
The client's Watcher callbacks are executed serially and synchronously.
Lightweight
- A Watcher notification is very simple: it only tells the client that an event has occurred, not the specific content of the event.
- When the client registers a Watcher with the server, it does not pass the actual Watcher object to the server; it only marks the request with a boolean attribute.
Asynchronous delivery
Watcher notification events are sent from the server to the client asynchronously. Clients and servers communicate through sockets, so because of network delay or other factors, different clients may observe an event at different times. Zookeeper itself provides an ordering guarantee: a client perceives a change to a znode it watches only after it has received the corresponding watch event. Therefore we cannot expect to observe every change of a node through Zookeeper; it guarantees only eventual consistency, not strong consistency.
Register watcher: getData, exists, getChildren
Trigger watcher: create, delete, setData
When a client connects to a new server, the watch is triggered for any session event. While the connection to a server is lost, watches cannot be received. When the client reconnects, all previously registered watches are re-registered if necessary, and this is usually completely transparent. There is only one special case in which a watch may be lost: for an exists watch on a znode that has not been created yet, if the znode is created while the client is disconnected and then deleted before the client reconnects, the watch event may be lost.

7. Client registration Watcher implementation

Call one of the three APIs getData()/getChildren()/exists(), passing in the Watcher object
Mark the request and encapsulate the Watcher in a WatchRegistration
Encapsulate the request into a Packet object and send it to the server
After receiving the response from the server, register the Watcher with ZKWatchManager for management
Return from the request; registration is complete.

8. Server processing Watcher implementation

The server receives the Watcher and stores it
On receiving a client request, the server processes it and determines whether a Watcher needs to be registered. If so, the node path of the data node and the ServerCnxn (ServerCnxn represents a connection between a client and the server and implements the process interface of Watcher, so it can be treated as a Watcher object here) are stored in the watchTable and watch2Paths of WatchManager.
Watcher trigger
Take the server receiving a setData() transaction request that triggers the NodeDataChanged event as an example:
- Encapsulate WatchedEvent
  The notification status (SyncConnected), the event type (NodeDataChanged), and the node path are encapsulated into a WatchedEvent object
- Query Watcher
  Find the Watcher in watchTable according to the node path
- Not found: no client has registered a Watcher on this data node
- Found: extract and delete the corresponding Watcher from watchTable and watch2Paths (this shows that a Watcher is one-time on the server and becomes invalid once triggered)
Call the process method to trigger Watcher
The process method here mainly sends the Watcher event notification over the TCP connection corresponding to the ServerCnxn.
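
The store, query, and delete steps above can be sketched in a few lines. This is an in-memory model under stated assumptions, not the server's code: the class and method names are illustrative, a watcher is just a Python callable standing in for ServerCnxn, and an event is a plain tuple rather than a WatchedEvent.

```python
from collections import defaultdict


class WatchManager:
    """Toy model of the two lookup tables described above."""

    def __init__(self):
        self.watch_table = defaultdict(set)   # path -> watchers
        self.watch2paths = defaultdict(set)   # watcher -> paths

    def add_watch(self, path, watcher):
        self.watch_table[path].add(watcher)
        self.watch2paths[watcher].add(path)

    def trigger_watch(self, path, event_type):
        # Extract AND delete: this is what makes a watch one-time.
        watchers = self.watch_table.pop(path, set())
        for watcher in watchers:
            self.watch2paths[watcher].discard(path)
            watcher(("SyncConnected", event_type, path))
        return len(watchers)


received = []


def on_event(event):
    received.append(event)


manager = WatchManager()
manager.add_watch("/config", on_event)
first = manager.trigger_watch("/config", "NodeDataChanged")   # notifies on_event
second = manager.trigger_watch("/config", "NodeDataChanged")  # nothing registered any more
print(first, second, received)
```

The second trigger finds no watcher, so a client that wants further notifications has to register again when it re-reads the node.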

9. Client callback Watcher

The client's SendThread thread receives the event notification and hands it to the EventThread thread, which calls back the Watcher. The client-side Watcher mechanism is also one-time: once triggered, the Watcher becomes invalid.

10. ACL permission control mechanism

UGO (User/Group/Others)

Currently used in Linux/Unix file systems, it is also the most widely used permission control method. It is a coarse-grained file system permission control mode.

ACL (Access Control List) Access Control List

Includes three aspects:

Authorization mode (Scheme)
- IP: access control at IP address granularity
- Digest: the most commonly used mode; permissions are configured with an authorization identifier of the form username:password, which makes it easy to separate permissions for different applications
- World: the most open permission control mode, a special digest mode with only one permission identifier, “world:anyone”
- Super: super user
Authorization object
The authorization object is the user or specified entity to which permission is granted, such as an IP address or a machine.
Permission
- CREATE: child node creation permission, allowing the authorized object to create child nodes under this Znode
- DELETE: child node deletion permission, allowing the authorized object to delete child nodes of this data node
- READ: data node read permission, allowing the authorized object to access the data node and read its data content, child node list, and so on
- WRITE: data node update permission, allowing the authorized object to update the data node
- ADMIN: data node management permission, allowing the authorized object to perform ACL-related setting operations on the data node
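
A compact way to see how scheme, authorization object, and permission fit together is to model an ACL entry as a (scheme, id, permission bits) triple. The sketch below is an illustration, not ZooKeeper's implementation. It assumes the permission bit values used by ZooKeeper's `ZooDefs.Perms` (READ=1, WRITE=2, CREATE=4, DELETE=8, ADMIN=16) and the digest id form `user:base64(sha1(user:password))`; `digest_id`, `allowed`, and the sample credentials are hypothetical names chosen for the example.

```python
import base64
import hashlib
from enum import IntFlag


class Perm(IntFlag):
    READ = 1
    WRITE = 2
    CREATE = 4
    DELETE = 8
    ADMIN = 16


def digest_id(user, password):
    """Build a digest-scheme id: the user name plus a hashed user:password pair."""
    raw = f"{user}:{password}".encode("utf-8")
    hashed = base64.b64encode(hashlib.sha1(raw).digest()).decode("ascii")
    return f"{user}:{hashed}"


def allowed(acl, scheme, ident, wanted):
    """True if some ACL entry for (scheme, ident) grants every wanted permission."""
    return any(
        s == scheme and i == ident and (perms & wanted) == wanted
        for s, i, perms in acl
    )


app_id = digest_id("app1", "secret")
acl = [
    ("digest", app_id, Perm.READ | Perm.WRITE),
    ("world", "anyone", Perm.READ),
]
print(allowed(acl, "digest", app_id, Perm.WRITE))    # the application may update
print(allowed(acl, "world", "anyone", Perm.WRITE))   # everyone else may only read
```

Storing permissions as bit flags keeps each entry small and lets one entry grant any combination of the five permissions.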

11. Chroot feature

The Chroot feature was added after version 3.2.0. It allows each client to set a namespace for itself. If a client has Chroot set, every operation the client performs on the server is restricted to its own namespace.

By setting Chroot, a client can be mapped to a subtree of the Zookeeper server. This is very helpful when multiple applications share one Zookeeper cluster, because the applications can be isolated from each other.
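
The effect of Chroot is a path translation between what the client sees and what the server stores. The following is a minimal sketch of that mapping only; `ChrootClient` and its methods are hypothetical names, and no connection handling is modelled. In practice the chroot is usually given as a suffix of the connection string, for example `host:2181/apps/X`.

```python
class ChrootClient:
    """Toy model: every client path is resolved under a fixed chroot prefix."""

    def __init__(self, chroot):
        self.chroot = chroot.rstrip("/")

    def to_server_path(self, client_path):
        # "/" inside the chroot is the chroot node itself.
        if client_path == "/":
            return self.chroot or "/"
        return self.chroot + client_path

    def to_client_path(self, server_path):
        # Strip the prefix again before handing paths back to the application.
        stripped = server_path[len(self.chroot):]
        return stripped or "/"


client = ChrootClient("/apps/X")
print(client.to_server_path("/config"))          # /apps/X/config
print(client.to_client_path("/apps/X/config"))   # /config
```

Two applications configured with different chroots can both use `/config` without ever touching each other's data.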


12. Session management

Bucketing strategy: similar sessions are placed in the same block for management, so that Zookeeper can isolate sessions in different blocks and process sessions in the same block uniformly.

Assignment principle: the “ExpirationTime” of each session

Calculation formula:
```
ExpirationTime_ = currentTime + sessionTimeout
ExpirationTime = (ExpirationTime_ / ExpirationInterval + 1) * ExpirationInterval
ExpirationInterval is the Zookeeper session timeout check interval; it defaults to tickTime
```
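
The formula rounds each session's raw expiry up to the next multiple of the check interval, so that sessions expiring close together land in the same bucket. A small sketch, assuming millisecond timestamps and integer division (the function and parameter names are illustrative):

```python
def expiration_bucket(current_time_ms, session_timeout_ms, expiration_interval_ms):
    """Round the raw expiry time up to the next multiple of the check interval."""
    raw = current_time_ms + session_timeout_ms
    return (raw // expiration_interval_ms + 1) * expiration_interval_ms


# With a 2000 ms interval, the first two sessions share a bucket, the third does not.
a = expiration_bucket(10_000, 5_000, 2_000)
b = expiration_bucket(10_300, 5_000, 2_000)
c = expiration_bucket(11_200, 5_000, 2_000)
print(a, b, c)
```

The timeout checker then only has to look at one bucket per interval instead of scanning every session.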

13. Server role

Leader
- The only scheduler and processor of transaction requests, guaranteeing the ordering of cluster transaction processing
- Scheduler of the services inside the cluster
Follower
- Processes non-transactional client requests and forwards transaction requests to the Leader server
- Participates in voting on transaction request Proposals
- Participates in Leader election voting
Observer

A server role available since version 3.3.0 that improves the cluster's capacity for non-transactional requests without affecting its transaction processing capacity
- Processes non-transactional client requests and forwards transaction requests to the Leader server
- Does not participate in any form of voting

14. Zookeeper server working status

The server has four states: LOOKING, FOLLOWING, LEADING, and OBSERVING.
- LOOKING: looking for a Leader. A server in this state believes there is no leader in the current cluster, so it needs to enter leader election.
- FOLLOWING: Follower status. Indicates that the current server role is Follower.
- LEADING: Leader status. Indicates that the current server role is Leader.
- OBSERVING: Observer status. Indicates that the current server role is Observer.

15. Leader election

Leader election is the key to ensuring the consistency of distributed data. A server in the Zookeeper cluster needs to enter Leader election when either of the following two situations occurs.

　　(1) The server is initialized and started.

　　(2) The server cannot maintain a connection with the leader while running.

　　These two situations are analyzed below.

　　1. Leader election during server startup

　　If a leader election is to take place, at least two machines are required. Here we use a server cluster of 3 machines as an example. In the cluster initialization phase, when only Server1 has started, it cannot conduct and complete a Leader election on its own. When the second server, Server2, starts, the two machines can communicate with each other, and each tries to find the Leader, so the Leader election process begins. The election process is as follows

　　(1) Each server sends a vote. Because this is the initial situation, Server1 and Server2 each vote for themselves as the Leader server. Each vote contains the myid and ZXID of the recommended server, written as (myid, ZXID). At this point Server1's vote is (1, 0) and Server2's vote is (2, 0), and each server sends its vote to the other machines in the cluster.

　　(2) Accept votes from various servers. After each server in the cluster receives a vote, it first judges the validity of the vote, such as checking whether it is the current round of voting and whether it comes from a server in the LOOKING state.

　　(3) Process voting. For each vote received, the server compares (PK) the other server's vote with its own. The PK rules are as follows

　　　　· Check the ZXID first. The server with the larger ZXID is preferred as the leader.

　　　　· If the ZXIDs are the same, compare the myid. The server with the larger myid becomes the leader server.

　　For Server1, its own vote is (1, 0) and the vote received from Server2 is (2, 0). The ZXIDs are compared first and both are 0, so the myids are compared. Server2's myid is larger, so Server1 updates its vote to (2, 0) and votes again. Server2 does not need to update its vote; it simply sends its previous vote to all machines in the cluster again.

　　(4) Count votes. After each vote, the server counts the voting information to determine whether more than half of the machines have received the same vote. For Server1 and Server2, the count shows that two machines in the cluster have accepted the vote (2, 0), so the Leader is considered elected.

　　(5) Change server status. Once the Leader is determined, each server will update its own status. If it is a Follower, it will be changed to FOLLOWING, and if it is a Leader, it will be changed to LEADING.

　　2. Leader election during server operation

　　While Zookeeper is running, the Leader and the non-Leader servers each perform their own duties. A non-Leader server going down or a new server joining does not affect the Leader. But once the Leader server goes down, the whole cluster suspends external service and enters a new round of Leader election, whose process is basically the same as the election during startup. Assume three servers are running, Server1, Server2, and Server3, and the current leader is Server2. If the leader goes down at some moment, Leader election starts. The election process is as follows

　　(1) Change status. After the leader goes down, the remaining non-Observer servers change their status to LOOKING and then enter the Leader election process.

　　(2) Each server will send a vote. During operation, the ZXID on each server may be different. At this time, assume that the ZXID of Server1 is 123 and the ZXID of Server3 is 122; in the first round of voting, both Server1 and Server3 will vote for themselves and generate votes (1, 123), (3, 122), and then send their votes to all machines in the cluster.

　　(3) Receive votes from various servers. The process is the same as at startup.

　　(4) Process voting. The process is the same as that at startup. At this time, Server1 will become the leader.

　　(5) Count votes. The process is the same as at startup.

　　(6) Change the status of the server. The process is the same as at startup.

　　2.2 Leader election algorithm analysis

　　 The version of Zookeeper after 3.4.0 only retains the TCP version of the FastLeaderElection election algorithm. When a machine enters the Leader election, the current cluster may be in the following two states

　　　　· Leader already exists in the cluster.

　　　　· There is no Leader in the cluster.

　　When a leader already exists in the cluster, the usual case is that some machine started late and the cluster was already working normally before it started. When this machine tries to elect a Leader, it is told the Leader information of the current cluster; it only needs to establish a connection with the Leader machine and synchronize state. When there is no Leader in the cluster, things are more complicated. The steps are as follows

　　(1) The first vote. Whatever caused the leader election, all machines in the cluster are now trying to elect a leader, that is, they are in the LOOKING state. A LOOKING machine sends a message to all other machines; this message is called a vote. A vote contains the SID (the unique identifier of the server) and the ZXID (transaction ID), and is written in the form (SID, ZXID). Suppose the Zookeeper cluster consists of 5 machines with SIDs 1, 2, 3, 4, 5 and ZXIDs 9, 9, 9, 8, 8, and the machine with SID 2 is the leader. At some moment the machines with SIDs 1 and 2 fail, so the cluster starts a leader election. In the first vote each machine votes for itself, so the votes of the machines with SIDs 3, 4, and 5 are (3, 9), (4, 8), and (5, 8) respectively.

　　(2) Change vote. After sending its vote, each machine also receives votes from the other machines. It processes the received votes according to certain rules to decide whether it needs to change its own vote. These rules are the core of the whole Leader election algorithm. The terms are as follows

　　　　· vote_sid: the SID of the Leader server recommended in the received vote.

　　　　· vote_zxid: the ZXID of the Leader server recommended in the received vote.

　　　　· self_sid: the current server’s own SID.

　　　　· self_zxid: The current server’s own ZXID.

　　 Each time the received vote is processed, it is a process of comparing (vote_sid, vote_zxid) and (self_sid, self_zxid).

　　　　 Rule 1: If vote_zxid is greater than self_zxid, the currently received vote is recognized and the vote is sent out again.

　　　　 Rule 2: If vote_zxid is less than self_zxid, then stick to your vote without making any changes.

　　　　 Rule 3: If vote_zxid is equal to self_zxid, then compare the SIDs of the two, if vote_sid is greater than self_sid, then the currently received vote is recognized and the vote is sent out again.

　　　　 Rule 4: If vote_zxid is equal to self_zxid, and vote_sid is less than self_sid, then stick to your vote without making any changes.

　　Applying the above rules to the example gives the following change process: the machine with SID 3 receives (4, 8) and (5, 8), both with a smaller ZXID than its own, and keeps its vote (3, 9); the machines with SIDs 4 and 5 receive (3, 9), whose ZXID is larger than theirs, so they change their votes to (3, 9) and send them out again.

　　(3) Determine the Leader. After the second round of voting, each machine in the cluster again receives votes from the other machines and then starts counting. If a machine receives the same vote from more than half of the machines, the server whose SID is in that vote becomes the leader. Here Server3 becomes the Leader.

　　As the rules show, the newer the data on a server (the larger its ZXID), the more likely it is to become the leader, which also better guarantees data recovery. If the ZXIDs are the same, the server with the larger SID has the better chance.
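
Rules 1 to 4 and the majority count can be played through in a short simulation. This is a simplified sketch, not FastLeaderElection itself: it assumes every surviving server sees every first-round vote, ignores election epochs and networking, and uses hypothetical helper names `better` and `elect`.

```python
from collections import Counter


def better(received, current):
    """Votes are (sid, zxid): the larger ZXID wins; on a tie, the larger SID wins."""
    return (received[1], received[0]) > (current[1], current[0])


def elect(alive, cluster_size):
    """alive maps SID -> ZXID for the LOOKING servers; returns the leader SID or None."""
    quorum = cluster_size // 2 + 1
    # First vote: every server votes for itself.
    votes = {sid: (sid, zxid) for sid, zxid in alive.items()}
    first_round = list(votes.values())
    # Change vote: each server adopts the best vote it received.
    for sid in votes:
        for received in first_round:
            if better(received, votes[sid]):
                votes[sid] = received
    # Determine the Leader: a vote held by more than half of the cluster wins.
    vote, count = Counter(votes.values()).most_common(1)[0]
    return vote[0] if count >= quorum else None


# The 5-machine example: SIDs 1 and 2 are down, 3/4/5 hold ZXIDs 9/8/8.
leader = elect({3: 9, 4: 8, 5: 8}, cluster_size=5)
print(leader)
```

Running the same function on the earlier three-server examples gives Server2 at startup (`{1: 0, 2: 0}`) and Server1 after the leader fails (`{1: 123, 3: 122}`), matching the walkthroughs in the text.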

　　2.3 Leader election implementation details

　　1. Server status

　　 The server has four states, namely LOOKING, FOLLOWING, LEADING, OBSERVING.

　　LOOKING: Look for Leader status. When the server is in this state, it will think that there is no leader in the current cluster, so it needs to enter the leader election state.

　　FOLLOWING: Follower status. Indicates that the current server role is Follower.

　　LEADING: Leader status. Indicates that the current server role is Leader.

　　OBSERVING: Observer status. Indicates that the current server role is Observer.

　　2. Voting data structure

　　Each vote contains two basic pieces of information, the SID and ZXID of the recommended server. In Zookeeper, a vote (Vote) contains the following fields

　　id: the SID of the recommended leader.

　　zxid: the transaction ID of the recommended Leader.

　　electionEpoch: a logical clock used to determine whether several votes belong to the same election round. It is incremented by 1 each time a new round of voting starts.

　　peerEpoch: The epoch of the elected Leader.

　　state: The state of the current server.

　　3. QuorumCnxManager: Network I/O

　　During startup, each server starts a QuorumCnxManager, which is responsible for the underlying network communication between servers during Leader election.

　　(1) Message queue. QuorumCnxManager maintains a series of queues internally to store received messages, messages to be sent, and message senders. Except for the receiving queue, the queues are grouped by SID to form queue sets. For example, if a cluster has 3 machines besides the current one, a separate sending queue is created for each of those three machines, and the queues do not interfere with each other.

　　　　· recvQueue: message receiving queue, used to store messages received from other servers.

　　　　· queueSendMap: message sending queues, used to store the messages to be sent, grouped by SID.

　　　　· senderWorkerMap: sender set. Each SendWorker message sender corresponds to one remote Zookeeper server and is responsible for sending messages; the senders are also grouped by SID.

　　　　· lastMessageSent: the most recently sent message; one is kept for each SID.

　　(2) Establish a connection. To be able to vote with each other, all machines in the Zookeeper cluster need to establish network connections in pairs. When it starts, QuorumCnxManager creates a ServerSocket that listens on the Leader election communication port (3888 by default). Once listening, Zookeeper can continuously receive connection requests from other servers and handles each TCP connection request as it arrives. To avoid creating duplicate TCP connections between two machines, Zookeeper only allows the server with the larger SID to actively establish the connection; otherwise the connection is dropped. After receiving a connection request, a server decides whether to accept it by comparing its own SID with that of the remote server. If the current server finds that its own SID is larger, it drops the current connection and then actively establishes a connection to the remote server. Once the connection is established, the corresponding message sender SendWorker and message receiver RecvWorker are created for the remote server's SID and started.

　　(3) Message receiving and sending. Message receiving is handled by the message receiver RecvWorker. Because Zookeeper allocates a separate RecvWorker for each remote server, each RecvWorker only needs to keep reading messages from its TCP connection and saving them to the recvQueue queue. Message sending works similarly: because Zookeeper allocates a separate SendWorker for each remote server, each SendWorker only needs to keep taking a message from its sending queue and sending it, while also storing the message in lastMessageSent. If a SendWorker finds that the sending queue for its server is empty, it takes the most recently sent message from lastMessageSent and sends it again. This handles the case where the receiving server went down before or just after receiving the message, so that the message was not processed correctly. Zookeeper also ensures that the receiver handles duplicate messages correctly.

　　4. FastLeaderElection: The core of the election algorithm

　　· External voting: specifically refers to votes sent by other servers.

　　· Internal voting: the current voting of the server itself.

　　· Election round: The election round of Zookeeper server Leader, namely logicalclock.

　　·PK: Compare internal voting with external voting to determine whether internal voting needs to be changed.

　　(1) Ballot Management

　　· sendqueue: The ballot sending queue is used to save the ballots to be sent.

　　· recvqueue: The ballot receiving queue is used to save the received external votes.

　　· WorkerReceiver: the ballot receiver. It continuously fetches election messages sent by other servers from QuorumCnxManager, converts each into a ballot, and saves it in recvqueue. While receiving ballots, if it finds that the election round of an external vote is smaller than that of the current server, it ignores the external vote and immediately sends out its own internal vote.

　　·WorkerSender: The vote sender, which continuously obtains the votes to be sent from the sendqueue and passes them to the underlying QuorumCnxManager.

　　(2) Algorithm core

　　 The picture above shows how the FastLeaderElection module interacts with the underlying network I/O. The basic process of leader election is as follows

　　 1. Self-increasing election rounds. Zookeeper stipulates that all valid votes must be in the same round. When starting a new round of voting, the logicalclock will first be incremented.

　　2. Initialize the ballot. Before starting a new round of voting, each server will initialize its own votes, and during the initialization phase, each server will elect itself as a leader.

　　3. Send the initial ballot. After completing the initialization of the ballot, the server will initiate the first ballot. Zookeeper will put the just initialized ballot into sendqueue, and send it out by WorkerSender.

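　　4. Receive external votes. Each server continuously fetches external votes from the recvqueue queue. If a server finds that it cannot obtain any external votes, it immediately checks whether it still has valid connections to the other servers in the cluster. If not, it establishes the connections at once; if the connections already exist, it sends its current internal vote again.

　　5. Determine the election round. After sending the initial ballot, the server starts processing external votes. External votes are handled differently depending on the election round.

　　　　· The election round of the external vote is greater than that of the internal vote. If the server's own election round lags behind that of the server that sent the external vote, it immediately updates its own election round (logicalclock), clears all votes it has already received, and then uses its initialized vote in the PK to decide whether to change its internal vote. Finally it sends out the internal vote.

　　　　· The election round of the external vote is smaller than that of the internal vote. If the external vote's round lags behind the server's own, Zookeeper simply ignores the external vote, does nothing, and returns to step 4.

　　　　· The election round of the external vote equals that of the internal vote. The ballot PK can begin.

　　6. Ballot PK. During the ballot PK, the vote must be changed if any one of the following conditions holds.

　　　　· If the election round of the Leader server recommended in the external vote is greater than that of the internal vote, the vote must be changed.

　　　　· If the election rounds are the same, compare the ZXIDs; if the external vote's ZXID is larger, the vote must be changed.

　　　　· If the ZXIDs are also the same, compare the SIDs; if the external vote's SID is larger, the vote must be changed.

　　7. Change the vote. If the PK determines that the external vote is better than the internal vote, the vote is changed: the internal vote is overwritten with the information from the external vote, and the changed internal vote is then sent out again.

　　8. Archive the ballot. Whether or not the vote was changed, the external vote just received is placed in the ballot set recvset for archiving. recvset records all external votes that the current server has received in this round of the Leader election, keyed by the SID of the sending server, for example {(1, vote1), (2, vote2)...}.

　　9. Count the votes. Once the ballot has been archived, counting can begin. The purpose is to determine whether more than half of the servers in the cluster have approved the current internal vote. If so, voting terminates; otherwise the server returns to step 4.

　　10. Update the server status. Once voting can terminate, the server updates its status. It first checks whether the Leader named in the vote approved by more than half of the servers is itself. If so, it sets its status to LEADING; if not, it becomes FOLLOWING or OBSERVING depending on its role.

　　The 10 steps above are the core of FastLeaderElection. Steps 4-9 loop for several rounds until a Leader is elected.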
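
Steps 6 to 9 (ballot PK, change vote, archive, count) can be sketched as a tuple comparison plus a tally over recvset. This is an illustration under assumptions rather than the real implementation: `Vote`, `external_wins`, `process`, and `has_quorum` are hypothetical names, the election round check of step 5 is assumed to have passed already, and the server's own vote is kept in recvset under its own SID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    leader_sid: int
    zxid: int
    epoch: int


def external_wins(external, internal):
    """Step 6: compare epoch first, then ZXID, then SID."""
    return (external.epoch, external.zxid, external.leader_sid) > (
        internal.epoch, internal.zxid, internal.leader_sid)


def process(self_sid, internal, recvset, sender_sid, external):
    """Steps 7 and 8: change the internal vote if needed, then archive the external vote."""
    if external_wins(external, internal):
        internal = external
        recvset[self_sid] = internal
    recvset[sender_sid] = external
    return internal


def has_quorum(recvset, internal, cluster_size):
    """Step 9: do more than half of the servers agree with the internal vote?"""
    agreeing = sum(1 for vote in recvset.values() if vote == internal)
    return agreeing > cluster_size // 2


# Server 4 in a 5-machine cluster starts by voting for itself.
internal = Vote(leader_sid=4, zxid=8, epoch=1)
recvset = {4: internal}
internal = process(4, internal, recvset, 3, Vote(3, 9, 1))
after_one = has_quorum(recvset, internal, 5)
internal = process(4, internal, recvset, 5, Vote(3, 9, 1))
after_two = has_quorum(recvset, internal, 5)
print(internal, after_one, after_two)
```

After the first external vote, Server 4 has switched to (3, 9) but only two servers agree; the vote arriving from Server 5 brings the count to three of five, which ends the loop.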

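16. Data synchronization

After the whole cluster completes the Leader election, each Learner (the collective name for Followers and Observers) registers with the Leader server. Once a Learner has finished registering with the Leader server, it enters the data synchronization phase.

Data synchronization process (all steps are carried out by message passing):

i. The Learner registers with the Leader

ii. Data synchronization

iii. Synchronization confirmation

Zookeeper data synchronization usually falls into four categories:
- Direct differential synchronization (DIFF synchronization)
- Rollback then differential synchronization (TRUNC+DIFF synchronization)
- Only rollback synchronization (TRUNC synchronization)
- Full synchronization (SNAP synchronization)
Before data synchronization, the Leader server completes the synchronization initialization:
- peerLastZxid: the lastZxid extracted from the ACKEPOCH message the Learner sent when registering (the last ZXID processed by that Learner server)
- minCommittedLog: the smallest ZXID in committedLog, the Leader server's Proposal cache queue
- maxCommittedLog: the largest ZXID in committedLog, the Leader server's Proposal cache queue
Direct differential synchronization (DIFF synchronization)

Scenario: peerLastZxid lies between minCommittedLog and maxCommittedLog

Rollback then differential synchronization (TRUNC+DIFF synchronization)

Scenario: when the new Leader server finds that a Learner server contains a transaction record that the Leader itself does not have, the Learner must roll back its transactions to the ZXID that exists on the Leader server and is closest to peerLastZxid

Only rollback synchronization (TRUNC synchronization)

Scenario: peerLastZxid is greater than maxCommittedLog

Full synchronization (SNAP synchronization)

Scenario 1: peerLastZxid is less than minCommittedLog
Scenario 2: the Leader server has no Proposal cache queue and peerLastZxid is not equal to lastProcessZxid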
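
The four scenarios amount to a decision on where peerLastZxid falls relative to the Leader's committedLog. The function below is a simplified sketch of that decision, not the Leader's actual code. It makes two assumptions that are not in the text: a Learner inside the cached range whose ZXID is missing from the Leader's log is treated as the TRUNC+DIFF case, and a Learner that is already level with a Leader that has no cache needs only an empty DIFF. The name `choose_sync` and its parameters are hypothetical.

```python
def choose_sync(peer_last_zxid, committed_log, leader_last_zxid):
    """Pick a sync mode from the Learner's last ZXID and the Leader's cached proposals."""
    if not committed_log:
        # No proposal cache: full sync unless the Learner is already level (assumption).
        return "DIFF" if peer_last_zxid == leader_last_zxid else "SNAP"
    min_committed, max_committed = min(committed_log), max(committed_log)
    if peer_last_zxid > max_committed:
        return "TRUNC"        # Learner is ahead: roll back only
    if peer_last_zxid < min_committed:
        return "SNAP"         # Learner is too far behind the cache
    if peer_last_zxid in committed_log:
        return "DIFF"         # Leader can replay everything after this ZXID
    return "TRUNC+DIFF"       # Learner holds a transaction the Leader never had


modes = [
    choose_sync(5, [3, 4, 5, 6], 6),
    choose_sync(5, [3, 4, 6], 6),
    choose_sync(9, [3, 4, 6], 6),
    choose_sync(1, [3, 4, 6], 6),
    choose_sync(4, [], 6),
]
print(modes)
```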

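17. How does zookeeper ensure the sequential consistency of transactions?

zookeeper uses a globally increasing transaction id to identify transactions. Every proposal is stamped with a zxid when it is put forward. The zxid is a 64-bit number: the high 32 bits are the epoch, which identifies the leader period and is incremented whenever a new leader is elected, and the low 32 bits are an incrementing counter. When a new proposal is generated, it follows a two-phase process similar to a database two-phase commit: the leader first sends the transaction request to the other servers, and if more than half of the machines can execute it successfully, the transaction is then committed.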
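
The 64-bit layout can be made concrete with two small helpers. This is a sketch of the bit arithmetic only; `make_zxid` and `split_zxid` are illustrative names, not ZooKeeper functions.

```python
def make_zxid(epoch, counter):
    """Pack the epoch into the high 32 bits and the counter into the low 32 bits."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)


def split_zxid(zxid):
    """Return (epoch, counter) for a 64-bit zxid."""
    return zxid >> 32, zxid & 0xFFFFFFFF


zxid = make_zxid(epoch=2, counter=5)
print(hex(zxid), split_zxid(zxid))

# Any proposal from a newer leader period sorts after every proposal of an older one.
newer_epoch_wins = make_zxid(3, 0) > make_zxid(2, 0xFFFFFFFF)
print(newer_epoch_wins)
```

Because the epoch occupies the high bits, comparing two zxids as plain integers orders proposals first by leader period and then by position within that period.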

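18. Why is there a Master in a distributed cluster?

In a distributed environment, some business logic only needs to be executed by one machine in the cluster, and the other machines can share the result. This greatly reduces repeated computation and improves performance, which is why leader election is needed.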

回到顶部

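19. How is ZooKeeper node downtime handled?

ZooKeeper itself runs as a cluster, and at least 3 servers are recommended. ZooKeeper must ensure that when one node goes down, the other nodes continue to provide service.
If a Follower goes down, 2 servers remain available. Because ZooKeeper keeps multiple replicas of its data, no data is lost.
If the Leader goes down, ZooKeeper elects a new Leader.
A ZooKeeper cluster keeps providing service as long as more than half of its nodes are healthy. It fails only when so many nodes are down that half or fewer remain working.
Therefore:
A 3-node cluster can lose 1 node (the Leader can still get 2 votes, and 2 > 1.5).
A 2-node cluster cannot lose any node (the Leader can get only 1 vote, and 1 <= 1).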
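
The same arithmetic works for any cluster size. The helper names below are hypothetical; the only rule used is the "more than half" majority requirement stated above.

```python
def quorum_size(cluster_size):
    """Smallest number of nodes that is strictly more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes can go down while a majority is still alive."""
    return cluster_size - quorum_size(cluster_size)


for n in (2, 3, 4, 5):
    print(n, "nodes: quorum", quorum_size(n), "can lose", tolerated_failures(n))
```

A 4-node cluster tolerates only 1 failure, the same as a 3-node cluster, which is why odd cluster sizes are the usual choice.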

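20. The difference between ZooKeeper load balancing and nginx load balancing

ZooKeeper's load balancing is fully controllable, whereas nginx can only adjust weights, and anything else that needs to be controllable requires writing your own plugin. On the other hand, nginx has much higher throughput than ZooKeeper, so the choice should depend on the business need.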

21. What deployment modes does ZooKeeper have?

Deployment modes: standalone mode, pseudo-cluster mode, and cluster mode.

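22. How many machines does a cluster need at minimum, and what is the cluster rule?

The cluster rule is 2N+1 machines with N > 0, so the minimum is 3 machines.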

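23. Does the cluster support adding machines dynamically?

This is essentially horizontal scaling, which ZooKeeper does not handle very well. There are two approaches:
- Restart all: shut down every ZooKeeper server, modify the configuration, then start them again. Existing client sessions are not affected.
- Rolling restart: because the cluster stays available while more than half of the machines are alive, restarting one machine at a time does not interrupt the service the cluster provides. This is the more common approach.
Dynamic scaling is supported starting with version 3.5.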

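24. Are ZooKeeper watch notifications on a node permanent? Why not?

No. The official documentation states that a watch event is a one-time trigger: when the data on which the watch was set changes, the server sends the change to the clients that set the watch, to notify them.

Why not permanent? Suppose the server changes frequently and many clients are watching. Notifying every client of every change would put heavy pressure on the network and the servers.
Typically a client calls getData("/nodeA", true). If node A is then changed or deleted, the client receives the watch event. If node A changes again afterwards and the client has not set a new watch, no further event is sent to that client.
In practice, clients often do not need to know about every change on the server; they only need the latest data.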
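
The one-time trigger behavior can be modeled without a real server. The `TinyStore` class below is a hypothetical in-memory stand-in written for this example; it is not the ZooKeeper client API, and it only mimics the rule that a registered watch fires once and is then discarded.

```python
class TinyStore:
    """Hypothetical in-memory stand-in for a server that keeps one-shot data watches."""

    def __init__(self):
        self.data = {}
        self.watches = {}  # path -> list of callbacks waiting for the next change

    def get_data(self, path, watcher=None):
        """Read a value and optionally leave a one-shot watch on the path."""
        if watcher is not None:
            self.watches.setdefault(path, []).append(watcher)
        return self.data.get(path)

    def set_data(self, path, value):
        """Write a value, then fire and discard every watch set on the path."""
        self.data[path] = value
        for callback in self.watches.pop(path, []):
            callback(path)


store = TinyStore()
events = []
store.set_data("/nodeA", "v1")
store.get_data("/nodeA", watcher=events.append)  # register one watch
store.set_data("/nodeA", "v2")                   # the watch fires here
store.set_data("/nodeA", "v3")                   # no watch left, nothing is sent
print(events)
```

Only the first change after registration reaches the client; to hear about the change to "v3" the client would have to call `get_data` with a watcher again.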

25. What Java clients does ZooKeeper have?

Java clients: the native client that ships with ZooKeeper, the open-source ZkClient, and the open-source Apache Curator.

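26. What is Chubby, and how does it compare to ZooKeeper?

Chubby is Google's; it fully implements the Paxos algorithm and is not open source. ZooKeeper is an open-source implementation of Chubby that uses the ZAB protocol, a variant of the Paxos algorithm.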

27. Name a few commonly used ZooKeeper commands.

Common commands: ls, get, set, create, delete, and so on.

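28. What are the connections and differences between ZAB and the Paxos algorithm?
- Similarities:
  - Both have a role similar to a Leader process, which coordinates the work of multiple Follower processes
  - In both, the Leader waits for correct responses from more than half of the Followers before committing a proposal
  - In ZAB, every Proposal carries an epoch value that represents the current Leader period; the equivalent in Paxos is called a Ballot
- Differences:
  ZAB is used to build a highly available distributed primary-backup data system (ZooKeeper), while Paxos is used to build a distributed consistent state machine system.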

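29. Typical application scenarios of ZooKeeper

ZooKeeper is a distributed data management and coordination framework built on a typical publish/subscribe model; developers can use it to publish and subscribe to distributed data.

By combining ZooKeeper's rich set of data node types with the Watcher event notification mechanism, it is easy to build the core functions that most distributed applications need, such as:
- Data publish/subscribe
- Load balancing
- Naming service
- Distributed coordination/notification
- Cluster management
- Master election
- Distributed locks
- Distributed queues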
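1. Data publish/subscribe

Introduction

A data publish/subscribe system, the so-called configuration center, is what the name suggests: publishers publish data for subscribers to subscribe to.

Purpose
- Obtain data (configuration information) dynamically
- Manage data (configuration information) centrally and update it dynamically
Design patterns
- Push mode
- Pull mode
Characteristics of the data (configuration information):
- The data volume is usually small
- The content changes dynamically at runtime
- It is shared by every machine in the cluster, with identical configuration
Examples: machine lists, runtime switch settings, database configuration, and so on.

ZooKeeper-based implementation
1. Data storage: store the data (configuration information) in a data node on ZooKeeper
2. Data retrieval: during startup initialization, the application reads the data from the ZooKeeper data node and registers a data-change Watcher on that node
3. Data change: to change the data, update the corresponding node on ZooKeeper. ZooKeeper sends a data-change notification to each client, and each client re-reads the changed data after receiving the notification.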
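
The three steps can be sketched as a small simulation. `ConfigNode` and `Subscriber` are hypothetical classes invented for this example, not part of any ZooKeeper client; the node keeps one-shot watchers, so each subscriber re-registers every time it re-reads the data.

```python
class ConfigNode:
    """Hypothetical stand-in for one ZooKeeper data node with one-shot watchers."""

    def __init__(self, value):
        self.value = value          # step 1: the configuration is stored in the node
        self._watchers = []

    def read(self, watcher):
        """Return the current value and register a one-shot change watcher."""
        self._watchers.append(watcher)
        return self.value

    def write(self, value):
        """Update the value and notify each registered watcher exactly once."""
        self.value = value
        pending, self._watchers = self._watchers, []
        for notify in pending:
            notify()


class Subscriber:
    """An application instance that keeps a local copy of the configuration."""

    def __init__(self, node):
        self.node = node
        self.config = None
        self.refresh()              # step 2: read at startup and register a watcher

    def refresh(self):
        # Passing self.refresh as the watcher re-registers for the next change.
        self.config = self.node.read(self.refresh)


node = ConfigNode("db=primary")
app1, app2 = Subscriber(node), Subscriber(node)
node.write("db=replica")            # step 3: both subscribers re-read the data
node.write("db=standby")
print(app1.config, app2.config)
```

The notification itself carries no data in this sketch; it only tells the subscriber to pull the latest value, which combines the push and pull modes listed above.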
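2. Load balancing

ZooKeeper naming service
A naming service resolves a given name to the address of a resource or service. With ZooKeeper you create a global path and use that path as a name that points to a cluster, the address of a provided service, a remote object, and so on.

Distributed notification and coordination
For system scheduling: when an operator sends a notification, what actually happens is that the operator changes the state of a node through the console, and ZooKeeper then sends that change to all clients that registered a watcher on the node.
For progress reporting: each worker process creates an ephemeral node under a given directory and stores its progress data in it. The aggregating process can then watch the changes to that directory's child nodes to get a real-time, global view of progress.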

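7. ZooKeeper naming service (file system)
A naming service resolves a given name to the address of a resource or service. With ZooKeeper you create a global path, that is, a unique path, and use it as a name that points to a cluster, the address of a provided service, a remote object, and so on.

8. ZooKeeper configuration management (file system, notification mechanism)
Programs are deployed in a distributed fashion on different machines, and their configuration is stored under a znode in ZooKeeper. When the configuration changes, that is, when the content of that znode is changed, the watcher mechanism notifies each client so that it can update its configuration.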

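9. ZooKeeper cluster management (file system, notification mechanism)
Cluster management comes down to two things: detecting machines leaving and joining, and electing a master.
For the first, all machines agree to create ephemeral nodes under a common parent node and then watch for changes to that parent's children. When a machine dies, its connection to ZooKeeper is broken, the ephemeral node it created is deleted, and all the other machines are notified that a sibling node has been removed. Everyone then knows that the machine is gone.
A new machine joining works the same way: all machines are notified that a new sibling node has been added, and the head count goes up again. For the second, we change the scheme slightly: every machine creates an ephemeral sequential node, and the machine with the smallest sequence number is chosen as master each time.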
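
The "smallest sequence number wins" rule is easy to show on a list of child node names. The node names and helper functions below are hypothetical; the sketch assumes that each sequential node name ends in a 10-digit, zero-padded counter.

```python
def sequence_of(node_name):
    """Parse the 10-digit sequence suffix of a sequential node name."""
    return int(node_name[-10:])


def elect_master(children):
    """Return the child node with the smallest sequence number."""
    return min(children, key=sequence_of)


members = ["host-b-0000000007", "host-a-0000000012", "host-c-0000000003"]
print(elect_master(members))

# If the master's ephemeral node disappears, rerunning the rule picks the next one.
members.remove("host-c-0000000003")
print(elect_master(members))
```

Every machine can run the same function on the same child list and reach the same answer, so no extra coordination is needed beyond watching the parent node.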

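10. ZooKeeper distributed locks (file system, notification mechanism)
With ZooKeeper's consistent file system, locking becomes easy. Lock services fall into two categories: one keeps access exclusive, and the other controls ordering.
For the first category, we treat a znode on ZooKeeper as the lock and implement it by creating that znode. All clients try to create the /distribute_lock node, and the client whose create succeeds owns the lock. When finished, it deletes the distribute_lock node it created to release the lock.
For the second category, /distribute_lock already exists, and all clients create ephemeral sequential nodes under it. As in master election, the client with the smallest number gets the lock and deletes its node when finished, so the lock is handed out in order.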

11. The process of acquiring a distributed lock
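To acquire the distributed lock, a client creates an ephemeral sequential node under the locker node; to release the lock, it deletes that node. The client calls createNode to create the ephemeral sequential node under locker, then calls getChildren("locker") to get all children of locker; note that no Watcher needs to be set at this point. Once the client has the paths of all children, it checks whether the node it created has the smallest sequence number among them. If so, the client holds the lock. If not, it has not yet acquired the lock: it must find the node just below its own in sequence order, call exists() on that node, and register an event listener on it at the same time. When the watched node is later deleted, the client's Watcher is notified, and the client checks again whether its own node now has the smallest sequence number among locker's children. If it does, the client holds the lock; if not, it repeats the steps above, finding the next smaller node and registering a listener on it. Of course, the full process needs considerably more logic checks than described here.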
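
The decision each client makes after listing the children can be sketched as a pure function. This is an illustration of the check described above, not the BaseDistributedLock code: `check_lock`, `sequence_of`, and the node names are hypothetical, and sequential names are assumed to end in a 10-digit, zero-padded counter.

```python
def sequence_of(node_name):
    """Parse the 10-digit sequence suffix of a sequential node name."""
    return int(node_name[-10:])


def check_lock(my_node, children):
    """Return (holds_lock, node_to_watch) for the client that created my_node."""
    ordered = sorted(children, key=sequence_of)
    position = ordered.index(my_node)
    if position == 0:
        return True, None                 # smallest sequence number: lock acquired
    return False, ordered[position - 1]   # watch only the node just ahead of ours


children = ["lock-0000000005", "lock-0000000003", "lock-0000000004"]
print(check_lock("lock-0000000003", children))
print(check_lock("lock-0000000005", children))
```

Watching only the immediate predecessor means that releasing the lock wakes a single waiting client instead of all of them.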
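The implementation is mainly based on a mutex; the key logic for acquiring the distributed lock is in BaseDistributedLock, which implements the details of a ZooKeeper-based distributed lock.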

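12. ZooKeeper queue management (file system, notification mechanism)
There are two types of queues:
1. Synchronous queue: the queue becomes usable only when all of its members have gathered; until then it waits for every member to arrive.
2. FIFO queue: elements are enqueued and dequeued in first-in, first-out order.
For the first type, each member creates an ephemeral node under an agreed directory, and a watcher checks whether the number of nodes has reached the required count.
The second type follows the same basic principle as the ordering scenario in the distributed lock service: entries are numbered on the way in and taken out by number. PERSISTENT_SEQUENTIAL nodes are created under a specific directory; when a node is created successfully, a Watcher notifies the waiting consumers, and a consumer removes the node with the smallest sequence number to consume it. In this scenario the znodes store the messages: the data of each znode is the message content and the SEQUENTIAL number is the message number, so messages are simply taken out in order. Because the nodes are persistent, there is no need to worry about losing queued messages.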
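
The FIFO variant can be modeled in memory to show why sequence numbers give the ordering. `SequentialQueue` and the `item-` name prefix are hypothetical; the sketch only imitates a directory of sequential nodes and leaves out the Watcher notification and any real persistence.

```python
class SequentialQueue:
    """Hypothetical in-memory model of a directory of PERSISTENT_SEQUENTIAL nodes."""

    def __init__(self):
        self.nodes = {}      # node name -> message stored in that node
        self.next_seq = 0

    def offer(self, message):
        """Enqueue: create a node whose name carries the next sequence number."""
        name = "item-%010d" % self.next_seq
        self.next_seq += 1
        self.nodes[name] = message
        return name

    def poll(self):
        """Dequeue: remove the node with the smallest sequence number."""
        if not self.nodes:
            return None
        # Zero-padded names sort lexicographically in numeric order.
        name = min(self.nodes)
        return self.nodes.pop(name)


queue = SequentialQueue()
for message in ("first", "second", "third"):
    queue.offer(message)
print(queue.poll(), queue.poll(), queue.poll(), queue.poll())
```

Producers never coordinate with each other in this scheme: the numbering alone fixes the order in which consumers see the messages.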
