
ZooKeeper Learning Notes

2024-09-25 20:31:13

Overview

ZooKeeper is a distributed coordination service originally designed to provide consistency guarantees for distributed software. ZooKeeper exposes a tree-structured namespace similar to the Linux file system: each node can act as both a directory and a data holder, and ZooKeeper provides a watch-and-notify mechanism on every node. On top of these consistency guarantees it is easy to build distributed locks, leader election, service discovery and monitoring, configuration centers, and other facilities.

ZooKeeper runs as a highly available cluster based on primary/backup replication. The roles in a ZooKeeper cluster are Leader, Follower, and Observer.

Leader: a running ZooKeeper cluster has exactly one Leader, with two main responsibilities: handling all writes to the cluster, and initiating and maintaining heartbeats with Followers and Observers to monitor the cluster's health. All write operations must go through the Leader; only after the Leader completes a write does it broadcast it to the Followers, and the write request is considered successful only when more than half of the voting nodes (Observers excluded) have written it successfully.
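The majority rule described above is simple enough to state in code. Below is a minimal Python sketch (the function name `write_committed` is my own, not part of any ZooKeeper API): a write commits once a strict majority of voting members acknowledge it, with Observers excluded from the count.

```python
def write_committed(acks: int, voting_members: int) -> bool:
    """A write commits once a strict majority of voting members
    (Leader + Followers, Observers excluded) have acknowledged it."""
    return acks > voting_members // 2

# In a 5-node ensemble with 2 Observers, only 3 nodes vote:
# 2 acks out of 3 voting members is a strict majority, so the write commits.
```

Note that this is why clusters are usually sized with an odd number of voting members: 4 voting nodes tolerate no more failures than 3, since both require 3 and 2 acks respectively.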

Follower: a ZooKeeper cluster can have multiple Followers, each connected to the Leader by heartbeat. A Follower has two main responsibilities: serving reads and taking part in Leader elections. When a Follower receives a client request, it first determines whether the request is a read or a write: for a read it returns data from its local copy; for a write it forwards the request to the Leader. When the Leader fails, Followers vote in the ensuing election.

Observer: a ZooKeeper cluster can have multiple Observers, whose main job is serving reads. Observers behave much like Followers, with one key difference: they have no vote. To support more concurrent clients a cluster needs more service instances, but too many voting instances make the election phase complex and leader election slow, which hurts recovery time after a failure. ZooKeeper therefore introduces the Observer role, which does not participate in voting: it only accepts client connections, answers read requests, and forwards write requests to the Leader. Adding Observer nodes raises the cluster's read throughput without destabilizing elections.


ZAB Protocol

The ZAB (ZooKeeper Atomic Broadcast) protocol guarantees a single, consistent cluster state by means of a unique transaction id, the Zxid (ZooKeeper Transaction Id).

  1. Epoch: the cycle number of the current cluster. Each Leader change produces a new epoch by incrementing the previous one, so when a crashed Leader recovers it finds that its own epoch is smaller than the current one, knows that a new Leader has been elected, and rejoins the cluster as a Follower.
  2. Zxid: the ZAB transaction id, a 64-bit number. The low 32 bits are a simple monotonically increasing counter, incremented by 1 for each client transaction request; the high 32 bits hold the Leader's epoch. Each time a new Leader is elected, it takes the largest Zxid from its local log, adds 1 to the epoch in the high 32 bits to form the new epoch, and restarts the low 32-bit counter at 0.
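The epoch/counter bit layout can be illustrated with a few lines of Python (helper names such as `make_zxid` are my own for illustration; they are not ZooKeeper APIs):

```python
COUNTER_MASK = (1 << 32) - 1  # low 32 bits of the 64-bit Zxid

def make_zxid(epoch: int, counter: int) -> int:
    """Pack the Leader epoch into the high 32 bits and the
    per-epoch transaction counter into the low 32 bits."""
    return (epoch << 32) | (counter & COUNTER_MASK)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & COUNTER_MASK

def next_epoch_zxid(last_zxid: int) -> int:
    """A newly elected Leader bumps the epoch and resets the counter."""
    return make_zxid(epoch_of(last_zxid) + 1, 0)
```

A useful consequence of this layout is that plain integer comparison of Zxids orders transactions first by epoch, then by counter, so any transaction from a newer Leader compares greater than every transaction from an older one.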

The ZAB protocol has two modes, recovery mode (cluster picks master) and broadcast mode (data synchronization)

  1. Recovery mode: the cluster elects a Leader at startup, after a restart, or after a Leader crash; while this is happening the cluster is in recovery mode.
  2. Broadcast mode: after a Leader is elected, it broadcasts the latest cluster state to the Followers; this is broadcast mode. It ends once more than half of the Followers have finished synchronizing with the Leader.

The four phases of the ZAB protocol

  1. Leader Election: at the start of an election all nodes are in the election phase. A node that receives more than half of the votes becomes the prospective Leader. The goal of this phase is only to produce a prospective Leader; it becomes the actual Leader only upon reaching the Broadcast phase.
  2. Discovery: each Follower connects to the prospective Leader to synchronize the transaction proposals it has recently received. The prospective Leader generates a new epoch and tries to get the other Followers to accept it before updating it locally. A Follower connects to only one Leader in this phase; if node 1 believes node 2 is the Leader and its connection attempt is rejected, the cluster re-enters the election phase.
  3. Synchronization: the Leader synchronizes the latest proposals gathered in the previous phase to all replicas in the cluster; only when more than half of the nodes have completed synchronization does the prospective Leader become the actual Leader. A Follower accepts only proposals whose Zxid is larger than its own lastZxid.
  4. Broadcast: the ZooKeeper cluster begins to formally serve transactions. The Leader broadcasts messages to keep the Followers informed of its state, and when a new node joins later, the Leader synchronizes state to it.
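The acceptance rule in the Synchronization phase, that a Follower only takes proposals newer than its own lastZxid, amounts to a simple filter. A minimal sketch (the function name is my own):

```python
def proposals_to_apply(proposals, last_zxid):
    """During synchronization a Follower accepts only proposals whose
    Zxid is greater than its own lastZxid; older ones were already
    applied locally. proposals is a list of (zxid, operation) pairs."""
    return [(zxid, op) for zxid, op in sorted(proposals) if zxid > last_zxid]
```

Because Zxids are globally ordered, this filter is all a Follower needs to catch up: everything at or below lastZxid is already in its log, and everything above it is applied in Zxid order.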

ZooKeeper election mechanism and process

The ZooKeeper election mechanism works as follows: each server first proposes itself as Leader and votes for itself, then compares its own ballot with the ballots received from the other servers; the stronger ballot wins, and each server updates its own ballot box with any stronger votes it receives.

The specific election process is as follows:

  1. After startup, each server asks the other servers whom they vote for; each replies with its recommended Leader based on its own state, returning that Leader's id and Zxid. On the cluster's first startup, every server recommends itself as Leader.
  2. After receiving replies from all the other servers, each server determines which server has the largest Zxid and makes that server its next vote.
  3. The server with the most votes wins the round; if its vote count exceeds half of the cluster size, it is elected Leader, otherwise voting continues until a Leader emerges.
  4. The Leader waits for the other servers to connect.
  5. Each Follower connects to the Leader and sends it its largest Zxid.
  6. The Leader determines the synchronization point from each Follower's Zxid, and the election phase is complete.
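The voting steps above can be sketched as a single-round simulation. This is a deliberately simplified model (names and tie-break rule are illustrative): in this sketch every server sees all Zxids at once, so the vote is unanimous, whereas the real protocol exchanges and upgrades ballots over several rounds, and also compares epochs before Zxids.

```python
def elect(zxids):
    """One-round sketch of Leader election.

    zxids maps server id -> largest Zxid in that server's log.
    Each server votes for the candidate with the largest Zxid,
    breaking ties by the larger server id."""
    # Every server casts its ballot for the strongest candidate it knows.
    votes = {sid: max(zxids, key=lambda c: (zxids[c], c)) for sid in zxids}
    # Tally the ballots; the winner needs a strict majority.
    winner = max(set(votes.values()),
                 key=lambda c: sum(v == c for v in votes.values()))
    tally = sum(v == winner for v in votes.values())
    return winner if tally > len(zxids) // 2 else None
```

For example, if server 2 holds the most recent transaction, it wins the election; when two servers tie on Zxid, the larger server id breaks the tie.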

Zookeeper's Data Model

ZooKeeper represents its data in a tree-structured namespace, similar to a file system's directory tree. Each node in the tree is called a Znode; every Znode can have child nodes, stores data, and supports watch operations on that data.

Znode consists of three parts:

  1. Stat: status information, used to store the version, permissions, timestamps, etc. of the Znode
  2. Data: the data stored by the Znode.
  3. Children: information description of Znode child nodes

Although Znodes can store data, they are not meant to hold large volumes of data like a database. Znodes were designed to store metadata for distributed applications, such as configuration files and cluster state.
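The three-part Znode structure described above maps naturally onto a small data model. A minimal sketch in Python (field selection is illustrative; the real Stat carries more fields, such as ctime, mtime, and ephemeralOwner):

```python
from dataclasses import dataclass, field

@dataclass
class Stat:
    version: int = 0   # data version, bumped on every write
    czxid: int = 0     # Zxid of the transaction that created the node
    mzxid: int = 0     # Zxid of the last modification

@dataclass
class Znode:
    data: bytes = b""                             # small payload (config, metadata)
    stat: Stat = field(default_factory=Stat)      # status information
    children: dict = field(default_factory=dict)  # child name -> Znode
```

Each Znode acting simultaneously as data holder and directory (via `children`) is what lets the same node carry a configuration value and parent further nodes below it.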

Access control on Znodes:

  1. ACL: each Znode has an Access Control List (ACL) specifying which users may access the node; an application can categorize users as read-only, write-only, or read/write as needed.
  2. Atomic operations: reads and writes on a Znode are atomic: a read fetches all of the data associated with the node, and a write replaces that data entirely.

There are two node types in ZooKeeper, ephemeral nodes and persistent nodes. The type is fixed at creation and cannot be changed:

  1. Ephemeral node: its lifetime is bound to the client session that created it; when the session ends (for example, the client disconnects and its session times out), ZooKeeper deletes the node automatically. Ephemeral nodes are not allowed to have children.
  2. Persistent node: its data remains until a client explicitly calls the interface to delete it; typically used to store long-lived configuration information.

Watches on Znodes: a client can set a watch on any node to monitor changes to its data. When the node's state changes, the watch fires and ZooKeeper sends the client that set it a notification describing the change. A watch fires only once; to keep monitoring the node, the client must set the watch again after each notification.
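The one-shot semantics are the easy part to get wrong, so here is a toy model of them in Python (this is a simulation of the behavior, not the real ZooKeeper client API): registered callbacks fire on the next change only, and are discarded before being invoked.

```python
class WatchedNode:
    """Toy model of a Znode watch: a watch fires once on the next
    change and must be re-registered by the client to keep watching."""

    def __init__(self, data: bytes = b""):
        self.data = data
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def set_data(self, data: bytes):
        self.data = data
        # One-shot semantics: clear the watcher list before notifying,
        # so a second change produces no notification.
        watchers, self._watchers = self._watchers, []
        for cb in watchers:
            cb(data)
```

A client that wants continuous monitoring re-registers from inside its callback, which is exactly the pattern real ZooKeeper clients use.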


Zookeeper Application Scenarios

  1. Unified service naming: in a distributed environment, applications need consistent service names in order to identify different services and quickly obtain service lists. An application can keep service names and address information in ZooKeeper, and clients fetch the list of available services from ZooKeeper.
  2. Configuration management: in a distributed environment, applications can manage their configuration files centrally in ZooKeeper. Configuration can be categorized (system configuration, alarm configuration, service switches, service parameters, and so on) and stored on different Znodes. Each service reads its configuration from ZooKeeper at startup and watches the relevant Znodes; once configuration on a Znode is modified, ZooKeeper notifies each service to update its configuration.
  3. Cluster Management: In a distributed environment, real-time management of the state of each service is the most widely used scenario for ZooKeeper.
  4. Distributed notification and coordination: building on ephemeral nodes and watches, applications can easily implement distributed notification and coordination. For example, each service in the cluster registers an ephemeral node as its health marker, with a session timeout of 30s, and reports its status to ZooKeeper every 10s. If ZooKeeper receives no status report from a service for 30 consecutive seconds, the session expires, the service is considered abnormal and removed from the service list, and the clients watching that node are notified of the change.
  5. Distributed locks: because ZooKeeper is strongly consistent, when multiple clients try to create the same Znode at the same time, only one create succeeds. An application can build an exclusive lock on this mechanism: the client whose create succeeds holds the lock, and the other clients wait, typically watching the node, until the lock is released.
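The lock recipe above rests on one primitive: an atomic, exclusive create. The following Python sketch simulates that primitive with an in-memory namespace (`FakeZooKeeper` is a stand-in I made up for illustration, not the real client; a production lock would also use an ephemeral node so the lock is released automatically if the holder's session dies):

```python
class FakeZooKeeper:
    """In-memory stand-in for a ZooKeeper namespace. create() is
    atomic and fails if the path already exists, which is the
    primitive the exclusive-lock recipe relies on."""

    def __init__(self):
        self._nodes = {}

    def create(self, path: str, owner: str) -> bool:
        if path in self._nodes:
            return False          # someone else already holds the lock
        self._nodes[path] = owner
        return True

    def delete(self, path: str):
        self._nodes.pop(path, None)   # releasing the lock

zk = FakeZooKeeper()
# client-a wins the race to create /app/lock; client-b's create fails,
# so it waits (watching the node) until client-a deletes it.
```

In a real deployment the losing clients would set a watch on the lock node and retry when notified of its deletion, rather than polling.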