Google's three papers have influenced a great many people and systems, and they remain classics of the distributed systems field. MapReduce gave us Hadoop, GFS gave us HDFS, and BigTable gave us HBase. These papers also mention Chubby, a lock service at Google, and from Chubby we got Zookeeper.
With the popularity of big data, people have become familiar with the Hxx family (Hadoop, HDFS, HBase). As a developer today, if you don't know these terms, you would almost be embarrassed to greet people when you go out. But in fact, for those of us who are not big data developers, Zookeeper is a basic service that we are more likely to come into contact with than Hxx. Unfortunately, it has always sat quietly in the background and has never been as dazzling as Hxx. So what exactly is Zookeeper? What can it be used for? How do we use it? How is it implemented?
Two papers are associated with Zookeeper: one describes Zab, the consistency protocol behind Zookeeper (Zookeeper Atomic Broadcast), and the other introduces Zookeeper itself. Both papers describe Zookeeper as a service for coordinating the processes of distributed applications. So what is a distributed coordination service? First, let's look at what "coordination" means.
When it comes to coordination, the first thing I think of is the traffic coordinators at many intersections in Beijing. They hold small red flags and signal to vehicles and pedestrians whether they may pass. If we compare the vehicles and pedestrians to units of execution (threads) running in a computer, what is this coordinator? Many people will think: isn't this just a lock? Yes. In a concurrent environment, to prevent multiple units of execution from modifying shared data at the same time and corrupting it, we must rely on a coordination mechanism such as a lock, so that some threads operate on the resource first while the others wait. For in-process locks, the language platforms we use already offer many choices. Taking Java as an example, the most common are synchronized methods and synchronized blocks:
public synchronized void sharedMethod() {
    // Operate on shared data
}
With this in place, when multiple threads call sharedMethod they take turns, so they cannot corrupt the resources used inside sharedMethod or leave them in an inconsistent state. This is the simplest form of coordination, but sometimes we need something more sophisticated. For example, we often use read-write locks to improve performance: most of the time a resource is read far more often than it is modified, and if we indiscriminately use an exclusive lock for every access, performance may suffer. Again using Java as an example:
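import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SharedSource {
    private final ReadWriteLock rwlock = new ReentrantReadWriteLock();
    private final Lock rlock = rwlock.readLock();   // shared: many readers at once
    private final Lock wlock = rwlock.writeLock();  // exclusive: one writer, no readers

    public void read() {
        rlock.lock();
        try {
            // read the resource
        } finally {
            rlock.unlock();
        }
    }

    public void write() {
        wlock.lock();
        try {
            // write the resource
        } finally {
            wlock.unlock();
        }
    }
}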
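To make the difference between the two locks concrete, here is a minimal sketch. The class ReadWriteLockDemo and its helper canAcquire are hypothetical names made up for this illustration; only the java.util.concurrent classes are real. The helper holds one lock on the calling thread and asks a second thread whether it could take another lock right now without waiting.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadWriteLockDemo {

    // Holds `held` on the calling thread, then reports whether a second
    // thread could take `wanted` at that moment without waiting.
    static boolean canAcquire(Lock held, Lock wanted) {
        ExecutorService other = Executors.newSingleThreadExecutor();
        held.lock();
        try {
            return other.submit(() -> {
                boolean acquired = wanted.tryLock();
                if (acquired) {
                    wanted.unlock();
                }
                return acquired;
            }).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            held.unlock();
            other.shutdown();
        }
    }

    public static void main(String[] args) {
        ReadWriteLock rw = new ReentrantReadWriteLock();
        // A second reader is admitted while a read lock is held: prints true
        System.out.println(canAcquire(rw.readLock(), rw.readLock()));
        // A writer is kept out while a read lock is held: prints false
        System.out.println(canAcquire(rw.readLock(), rw.writeLock()));
        // A reader is kept out while the write lock is held: prints false
        System.out.println(canAcquire(rw.writeLock(), rw.readLock()));
    }
}
```

Readers do not block each other, which is where the performance gain for read-heavy resources comes from; a writer still gets the resource to itself.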
There are various other coordination mechanisms within a process as well (generally we call them synchronization mechanisms). By now we probably understand what coordination is, but everything described above is coordination within a single process. For that, we can use the mechanisms that languages, platforms, and operating systems provide. So what if we are in a distributed environment? That is, our programs run on different machines, which may be located in the same rack, in the same machine room, or in different data centers. How do we achieve coordination in such an environment? This is what a distributed coordination service sets out to do.
OK, some people may say this does not seem difficult: it is nothing more than implementing, over the network, the primitives that already exist within a single process. On the surface, that is true. But in a distributed system, things are often easier said than done, because the assumptions that hold within a single process no longer hold: the network is unreliable.
For example, within a single process, if your call to a method succeeds, then it succeeded (unless, of course, your code has a bug), and if the call fails, for example by throwing an exception, then it failed. Within a single process, the method that is called first is also executed first. But what about a distributed environment? Because the network is unreliable, a failed call to a service does not necessarily mean the operation failed. It may have executed successfully, with the response lost on the way back. Likewise, suppose A and B both call service C, and A makes its call slightly earlier than B. Can we be certain that A's request reaches C before B's? We have to rethink all the assumptions we used to make within a single process, and we have to consider how these issues affect our design and our code.
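The lost-response case above can be sketched in a few lines. Everything here is a hypothetical, in-memory simulation invented for illustration: CounterService stands in for a remote service, FlakyChannel stands in for the network, and no real networking is involved. The client wants exactly one increment and, reasonably enough, retries when a call appears to fail.

```java
class LostResponseDemo {

    // Hypothetical stand-in for a remote service that owns a counter.
    static class CounterService {
        private int value = 0;

        int increment() {
            value++;
            return value;
        }

        int current() {
            return value;
        }
    }

    // Signals that the client never received a reply.
    static class ResponseLostException extends RuntimeException {
        ResponseLostException() {
            super("response lost in transit");
        }
    }

    // Hypothetical network channel: the request always reaches the service,
    // but the first `repliesToDrop` replies never make it back to the client.
    static class FlakyChannel {
        private final CounterService service;
        private int repliesToDrop;

        FlakyChannel(CounterService service, int repliesToDrop) {
            this.service = service;
            this.repliesToDrop = repliesToDrop;
        }

        int increment() {
            int result = service.increment();      // the service did execute the request
            if (repliesToDrop > 0) {
                repliesToDrop--;
                throw new ResponseLostException(); // but the client only sees a failure
            }
            return result;
        }
    }

    // A client that wants exactly one increment and retries on any failure.
    static int incrementOnceWithNaiveRetry(FlakyChannel channel) {
        while (true) {
            try {
                return channel.increment();
            } catch (ResponseLostException e) {
                // The client cannot tell "never executed" from "executed, reply lost".
            }
        }
    }

    public static void main(String[] args) {
        CounterService service = new CounterService();
        FlakyChannel channel = new FlakyChannel(service, 1);
        incrementOnceWithNaiveRetry(channel);
        // The client asked for one increment, but the counter is now 2.
        System.out.println("server counter = " + service.current());
    }
}
```

Within a single process, a thrown exception would have meant the increment did not happen. Here the first attempt did happen, so the retry applies it a second time. This is the kind of assumption that has to be rethought in a distributed design.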
Also, in a distributed environment we often deploy multiple instances of a service to improve reliability. Keeping those instances consistent with one another is easy within a single process, but in a distributed environment it is a genuinely hard problem. Distributed coordination is therefore far more complicated than coordination within a single process, and this is why basic services like Zookeeper came into being. These systems have been battle-tested in many deployments, and their reliability and availability have been verified in both theory and practice. So when we build distributed systems, we can use this kind of service as a starting point, which saves a great deal of cost and leads to fewer bugs. This article has tried to describe, from the outside, what kind of service Zookeeper is and why we need it. In a later article, I will introduce what Zookeeper can do.