The difference between Eureka and ZooKeeper

Background

In this article, we draw on problems we encountered in practice to explain why using ZooKeeper for service discovery is a mistake.

Service deployment environment

Let’s start from the beginning. When we deploy services, we should first consider the deployment platform (the platform environment), and only then the software that will run on it or how to build a system on the chosen platform. On a cloud platform, for example, the first things to consider are hardware redundancy (note: that is, whether the system can quickly fail over to other nodes when a single node fails) and how to handle network failures. When your service runs on a cluster built from a large number of commodity servers, single points of failure are a certainty. Although Knewton is deployed on AWS, we have run into all kinds of failures in past operations; you should therefore design your system “expecting failure”. In fact, many other companies on AWS have hit similar problems (and there is plenty of literature on the subject). You must be able to anticipate the platform’s likely problems in advance, such as box failures (note: individual machines dying unexpectedly), high latency, and network partitions (note: when a network switch fails, communication between the subnets behind it is cut off) — and you must build a system flexible enough to cope when they occur.

Never assume your deployment platform behaves like anyone else’s! Of course, if you operate your own data center, you can spend a great deal of time and money avoiding hardware failures and network partitions — that is a different situation. On a cloud platform such as AWS, the problems are different and so are the solutions. You will understand once you actually use one, but you had better prepare for them in advance (note: the box failures, high latency, and network partition issues mentioned in the previous section).

The problem with using ZooKeeper for service discovery

ZooKeeper (note: ZooKeeper is a sub-project of the well-known Hadoop project. It aims to solve the coordination-service problem in large-scale distributed applications, and can provide services such as unified naming, configuration management, distributed locks, and cluster management to the rest of a distributed system) is a great open source project. It is mature, has a sizable community supporting its development, and is widely used in production; but using it as a service discovery solution is a mistake.

There is a well-known CAP theorem in the field of distributed systems (C: data consistency; A: service availability; P: tolerance of network partition failures; no distributed system can satisfy all three at once, at most two of them). ZooKeeper is CP: at any time, a request to ZooKeeper returns a consistent result, and the system tolerates network partitions, but it cannot guarantee the availability of every request (note: in extreme situations, ZooKeeper may drop some requests, and the consumer must retry to get a result). But don’t forget, ZooKeeper is a distributed coordination service; its job is to guarantee that data (note: configuration data, state data) stays synchronized and consistent across all the services it governs. So it is easy to see why ZooKeeper was designed to be CP rather than AP: if it were AP, the consequences would be terrible (note: ZooKeeper is like the traffic lights at an intersection — can you imagine the lights suddenly malfunctioning on a major artery?). Moreover, Zab, ZooKeeper’s core algorithm, exists precisely to solve the problem of keeping data synchronized across multiple nodes in a distributed system.

As a distributed coordination service, ZooKeeper is excellent, but it is the wrong tool for service discovery. For a service discovery service, even a result containing stale information is better than no result at all; it is better to return the list of servers where a service was available five minutes ago than to return nothing because of a momentary network failure. So using ZooKeeper for service discovery is simply wrong, and if you use it this way you are in for a miserable time!

What’s more, used as a service discovery service, ZooKeeper does not handle network partitions gracefully; and in the cloud, network partitions, like every other kind of failure, really do happen, so you had better be 100% prepared for them in advance. As Jepsen observed in an excellent blog post on ZooKeeper: if the nodes on one side of a partition cannot reach the quorum that elected the ZooKeeper leader, they are disconnected from ZooKeeper — and of course they can no longer serve service discovery requests.
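The quorum arithmetic behind that behavior can be sketched in a few lines (a conceptual sketch, not ZooKeeper’s actual code):

```python
# Sketch: why a minority partition stops serving. A ZooKeeper ensemble needs
# a strict majority ("quorum") to elect a leader and process requests; nodes
# cut off from the quorum side disconnect their clients.

def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def partition_can_serve(nodes_on_this_side: int, ensemble_size: int) -> bool:
    """A partition side keeps serving only if it still holds a quorum."""
    return nodes_on_this_side >= quorum_size(ensemble_size)

# A 5-node ensemble split 3/2: the 3-node side keeps working, while the
# 2-node side disconnects its clients -- including service-discovery reads.
print(partition_can_serve(3, 5))  # True
print(partition_can_serve(2, 5))  # False
```

Note that the minority side may contain perfectly healthy application servers; they are unreachable through ZooKeeper purely because of the quorum rule.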

If you add a client-side cache to ZooKeeper (note: equip each node talking to ZooKeeper with a local cache) or a similar technique, you can alleviate the loss of registry information caused by network failures; Pinterest and Airbnb have both used this approach to ride out ZooKeeper outages. On the surface it solves the problem: when some or all nodes are disconnected from ZooKeeper, each node can still read data from its local cache. Even so, there is no guarantee that every node can cache all of the service registration information at all times. And if all the nodes lose their connection to ZooKeeper, or a network partition occurs in the cluster (note: the subnets behind a failed switch can no longer reach each other), ZooKeeper removes those nodes from its purview and the outside world can no longer find them, even though the nodes themselves are perfectly “healthy” and able to serve; the service requests destined for those nodes are simply lost. (Note: this is also why ZooKeeper fails the A in CAP.)
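A minimal sketch of such a client-side cache might look like this (the `fetch_from_registry` callable stands in for a real ZooKeeper read and is an assumption for illustration):

```python
import time

# Hypothetical sketch of the Pinterest/Airbnb-style mitigation: serve the
# last known-good registry snapshot when the registry is unreachable.

class CachedDiscoveryClient:
    def __init__(self, fetch_from_registry):
        self._fetch = fetch_from_registry
        self._cache = {}  # service name -> (instances, fetched_at)

    def instances_of(self, service):
        try:
            instances = self._fetch(service)      # may raise during a partition
            self._cache[service] = (instances, time.time())
            return instances
        except ConnectionError:
            if service in self._cache:            # possibly stale, but usable
                return self._cache[service][0]
            raise  # never fetched this service before the outage: no fallback
```

The `raise` branch is exactly the limitation described above: the cache only helps for services a client happened to look up before the failure, so it cannot guarantee full registry coverage.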

The deeper reason is that ZooKeeper is built on the CP principle: it guarantees that every node sees consistent data, while the point of adding a cache is to make ZooKeeper more available. But ZooKeeper is designed to keep node data consistent — it is CP. So with this approach you may end up with a service discovery service that is neither consistent (CP) nor highly available (AP); you are in effect bolting an AP system onto an existing CP system, which fundamentally does not work. A service discovery service should be designed to be highly available from the start!

Even setting the CAP theorem aside, standing up and maintaining ZooKeeper correctly is genuinely difficult; errors occur so often that many projects exist purely to make ZooKeeper less painful to operate. These mistakes live not only in clients but in the ZooKeeper server itself. Many failures on the Knewton platform were caused by misuse of ZooKeeper: operations that look simple, such as correctly re-establishing watchers, handling client sessions and exceptions, and managing memory in the ZK window, are all easy ways to break ZooKeeper. We also ran into classic ZooKeeper bugs, ZOOKEEPER-1159 and ZOOKEEPER-1576, and even had leader election fail in production. These problems arise because ZooKeeper must manage and protect the sessions and network connections of the service fleet it governs (note: managing these resources in a distributed environment is extremely difficult), yet it is not responsible for managing service discovery itself; so using ZooKeeper for service discovery costs more than it is worth.

Make the right choice: Eureka’s success

We switched our service discovery from ZooKeeper to Eureka, an open source service discovery solution developed by Netflix. (Note: Eureka consists of two components, the Eureka server and the Eureka client. The Eureka server acts as the service registry; the Eureka client is a Java client that simplifies interaction with the server, acts as a round-robin load balancer, and provides service failover support.) Eureka was designed from the beginning to be a highly available and scalable service discovery service — two qualities shared by every platform Netflix builds. Since the switch, we have had no recorded downtime for any production component that depends on Eureka. We had been told that service migration on a cloud platform was doomed to fail; the lesson we took from this experience is that an excellent service discovery service plays a vital role in it!

First of all, on the Eureka platform, when a server goes down there is no ZooKeeper-style leader election; client requests automatically switch over to the remaining Eureka nodes. When the failed server recovers, Eureka simply takes it back into the cluster, and all it has to do is sync the new service registrations it missed. There is no need to worry that a recovered “left-behind” server will be expelled from the Eureka cluster. Eureka is even designed to survive wide-ranging network partitions and to achieve “zero” downtime maintenance. During a network partition, every Eureka node keeps serving (note: ZooKeeper does not): it keeps accepting new service registrations and keeps answering downstream service discovery requests. So within a given partition (the same side of the split), newly published services can still be discovered and reached.
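The leaderless, peer-to-peer model described above can be sketched roughly as follows (an illustrative toy, not Netflix’s implementation; the node and service names are made up):

```python
# Illustrative sketch of Eureka's peer-to-peer model: every node accepts
# writes and replicates them to its peers, so there is no leader to
# re-elect when a node dies, and a recovered node just resyncs.

class RegistryNode:
    def __init__(self, name):
        self.name = name
        self.registry = {}   # service name -> set of instance addresses
        self.peers = []

    def register(self, service, instance, replicate=True):
        self.registry.setdefault(service, set()).add(instance)
        if replicate:                        # fan out once; don't loop forever
            for peer in self.peers:
                peer.register(service, instance, replicate=False)

    def sync_from(self, peer):
        """A recovered or new node copies the current registry from any peer."""
        for service, instances in peer.registry.items():
            self.registry.setdefault(service, set()).update(instances)

a, b = RegistryNode("a"), RegistryNode("b")
a.peers, b.peers = [b], [a]
a.register("checkout", "10.0.0.5:8080")
print(b.registry["checkout"])  # the registration was replicated to the peer
```

Because any node can take a write, a partition only delays replication; it never makes a side of the cluster refuse to serve.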

However, Eureka does more than that. In its normal configuration, Eureka has a built-in heartbeat service to weed out “dying” servers: if the heartbeats of a service registered with Eureka slow down, Eureka removes it from management (this part is somewhat like ZooKeeper’s behavior). This is a good feature, but it becomes dangerous during a network partition, because the servers evicted for network reasons (note: removed for slow heartbeats) may themselves be perfectly “healthy”; the partition has merely split the Eureka cluster into subnets that cannot reach one another.

Fortunately, Netflix thought of this flaw. If a Eureka node loses a large number of heartbeat connections in a short period (note: a network failure has probably occurred), it enters a “self-preservation mode”: it keeps the registrations of the “heartbeat-dead” services from expiring, while still accepting registrations from new services, so that the “dead” entries remain available in case a client requests them. When the network recovers, the node exits self-preservation mode. Eureka’s philosophy is that it is better to keep both “good data” and “bad data” than to lose any “good data”, and this model has proven very effective in practice.
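Assuming Eureka’s documented default behavior, self-preservation amounts to a simple threshold check, sketched here (the 85% figure matches Eureka’s default renewal-percent threshold; the function names are made up for illustration):

```python
# Sketch of Eureka-style "self-preservation": if fewer than ~85% of the
# expected heartbeat renewals arrived in the last window, assume a network
# problem rather than mass instance death, and pause eviction.

RENEWAL_PERCENT_THRESHOLD = 0.85  # Eureka's documented default

def in_self_preservation(expected_renewals: int, received_renewals: int) -> bool:
    """Too many heartbeats missing at once looks like a partition."""
    return received_renewals < expected_renewals * RENEWAL_PERCENT_THRESHOLD

def evict_expired(registry, expired, expected, received):
    """Drop expired registrations -- unless we are in self-preservation."""
    if in_self_preservation(expected, received):
        return registry  # keep the "possibly dead" entries, just in case
    return {svc: inst for svc, inst in registry.items() if svc not in expired}
```

A node expecting 100 renewals per window that receives only 80 stops evicting entirely; one receiving 95 evicts as usual. This is the “keep bad data rather than lose good data” trade-off in code form.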

Finally, Eureka also has client-side caching (note: Eureka consists of a client program and a server program, and the client is responsible for the registration and discovery interfaces). So even if every node in the Eureka cluster fails, or a network partition leaves the client unable to reach any Eureka server, consumers of a service can still obtain the existing registration information from the Eureka client’s cache. Even in the most extreme case — no healthy Eureka node answering requests, and no better server-side remedy available — the client-side cache lets consumer services keep looking up registration information through the Eureka client, which matters enormously.

Eureka’s architecture is what makes it work as a service discovery service. Compared with ZooKeeper, it does away with leader election and the transaction log mechanism, which makes it easier for users to operate and more robust at runtime. And Eureka is built for service discovery: it ships a dedicated client library and provides heartbeats, service health monitoring, automatic service publication, and automatic cache refresh — all of which you would have to implement yourself on top of ZooKeeper. All of Eureka’s libraries are open source, so everyone can read and use the code, which beats a client library that only one or two people have ever seen or maintained.

Maintaining the Eureka server is also very simple. To swap out a node, for example, you just remove the existing node from under the existing EIP and attach a new one. Eureka provides a web-based operations console in which you can view the running state of the services it manages: health, operating logs, and so on. Eureka even exposes a RESTful API so that third-party programs can integrate with it.
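For instance, registration and heartbeat calls go to URL paths like the following (a sketch based on Eureka’s documented REST operations — register via POST, heartbeat via PUT, query via GET; the host name and application IDs are made-up examples):

```python
# Sketch of the URL layout of Eureka's REST API. The base URL below is a
# placeholder; only the /apps/{appID}[/{instanceID}] path shape is taken
# from Eureka's documented REST operations.

BASE = "http://eureka.example.com:8080/eureka/v2"

def register_url(app_id: str) -> str:
    """POST an instance document here to register it."""
    return f"{BASE}/apps/{app_id}"

def heartbeat_url(app_id: str, instance_id: str) -> str:
    """PUT here periodically to renew the instance's lease."""
    return f"{BASE}/apps/{app_id}/{instance_id}"

print(register_url("CHECKOUT"))
# http://eureka.example.com:8080/eureka/v2/apps/CHECKOUT
print(heartbeat_url("CHECKOUT", "i-12345"))
# http://eureka.example.com:8080/eureka/v2/apps/CHECKOUT/i-12345
```

A GET on the same `/apps/{appID}` path returns the registered instances, which is how third-party tooling can query the registry without the Java client.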

Conclusion

With this article we wanted to make two points about service discovery: 1. pay attention to the hardware platform your services run on; 2. always stay focused on the problem you want to solve, and only then decide what platform to use. It was from these two angles that Knewton decided to replace ZooKeeper with Eureka for service discovery. Cloud platforms are full of unreliability, and Eureka copes with those shortcomings; at the same time, a service discovery service must be both highly reliable and highly resilient — and Eureka is exactly what we wanted!
