Distributed unique ID generation, explained in one article

Many large Internet companies handle huge volumes of data and therefore shard it across multiple databases and tables. After sharding, a unified unique ID is needed for storage; it can be a numeric, incrementing value or a UUID.

If the ID is a numeric increment, records can be distributed evenly across the sharded databases by hashing the ID, and if MySQL uses the incrementing field as the primary key, insert performance improves considerably. But if your order IDs simply count upward, others can easily guess how many orders you have; in that case you may need a scheme whose IDs are not plainly sequential.
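As a rough illustration of routing rows by ID, here is a minimal sketch; the shard count, the class name, and the assumption that IDs are non-negative are illustrative choices, not anything prescribed above:

```java
// Minimal sketch: pick a shard for a record by its numeric ID.
// Assumes non-negative IDs and a fixed shard count (both illustrative choices).
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Taking the ID modulo the shard count spreads rows evenly,
    // provided the IDs themselves are evenly distributed.
    public int shardFor(long id) {
        return (int) (id % shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        System.out.println("order 10001 -> shard " + router.shardFor(10001L)); // shard 1
    }
}
```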

When you think about distributed ID generation, Redis may come to mind: the INCR command atomically increments a counter and returns the new value. That works fine, but the throughput of INCR is around 200,400 requests per second according to benchmark results published on the official site, i.e., roughly 200,000 QPS. If your QPS stays below that, Redis is clearly a reasonable choice.
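A minimal sketch of this approach using the Jedis client is below; the host, port, and key name are assumptions for the example:

```java
import redis.clients.jedis.Jedis;

// Minimal sketch: generate IDs with Redis INCR via the Jedis client.
// Host, port, and key name are illustrative assumptions.
public class RedisIdGenerator {
    private final Jedis jedis;
    private final String key;

    public RedisIdGenerator(String host, int port, String key) {
        this.jedis = new Jedis(host, port);
        this.key = key;
    }

    // INCR is atomic on the Redis server, so concurrent callers never receive the same value.
    public long nextId() {
        return jedis.incr(key);
    }

    public static void main(String[] args) {
        RedisIdGenerator gen = new RedisIdGenerator("localhost", 6379, "order:id");
        System.out.println(gen.nextId()); // 1, 2, 3, ... on successive calls
    }
}
```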

So is there a better way to get high availability, high QPS, and low latency? Let's look at the snowflake algorithm, open-sourced by Twitter.


A snowflake ID is 64 bits in total:

1. The first bit (the sign bit) is not used.

2. The next 41 bits are a millisecond timestamp; 2^41 milliseconds is about 69 years, which is plenty.

3. 10 bits identify the worker machine, allowing 2^10 = 1024 worker nodes. Some companies split this into a 5-bit data-center ID and a 5-bit worker ID.

4. The last 12 bits are a sequence number, giving 4096 IDs per millisecond.

In theory this yields 4096 IDs per millisecond per node, i.e., a QPS of roughly 4.09 million per second.
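A minimal sketch of this 1 + 41 + 10 + 12 layout might look like the following; the custom epoch and the "fail on clock rollback" policy are illustrative assumptions, not Twitter's exact implementation:

```java
// A minimal snowflake sketch following the 1 + 41 + 10 + 12 bit layout described above.
// The custom epoch and the "fail on clock rollback" policy are illustrative assumptions.
public class Snowflake {
    private static final long EPOCH = 1609459200000L; // 2021-01-01 00:00:00 UTC, an assumed start time
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_WORKER_ID = (1L << WORKER_BITS) - 1;   // 1023
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1; // 4095

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public Snowflake(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER_ID) {
            throw new IllegalArgumentException("workerId must be in [0, " + MAX_WORKER_ID + "]");
        }
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock moved backwards: refuse to generate, otherwise duplicate IDs are possible.
            throw new IllegalStateException("clock moved backwards by " + (lastTimestamp - now) + " ms");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // 4096 IDs already issued in this millisecond: spin until the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

Each node must be given a distinct workerId; how those are assigned (configuration, ZooKeeper, etc.) is discussed later in the article.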

Its advantages are:

1. QPS is very high and performance is more than sufficient, so the high-performance requirement is met.

2. It does not depend on third-party middleware such as Redis; fewer dependencies mean higher availability.

3. It can be customized to your own needs, i.e., the middle 10 bits can be allocated however you like.

   Disadvantages:

1. The algorithm depends heavily on the system clock. If the clock is rolled back, duplicate IDs may be generated.

A UUID is a 128-bit value (usually written as 32 hexadecimal characters). It is very fast to generate, but it is derived from the machine's MAC address (in version 1) rather than coordinated across machines, so it falls outside the scope of this discussion.

Now let's look at the distributed ID mechanism used by some large companies. They create a table and use an 8-byte (64-bit) value to record how far ID allocation has progressed. For example, IDs start from 0; once 1,000 of them have been handed out, the maximum recorded in the database is 1,000. There is also a step value, say 1,000, so when the next segment is fetched the recorded maximum advances to 2,000; in other words, the table always stores the largest value that has been allocated, not necessarily used yet.

The concrete implementation steps are as follows:

1. Provide a service that generates distributed IDs. The service reads the current value from the database, uses the step to compute the next value range, and returns it; the service consumer caches the number segment and draws IDs from it (a sketch of the database side follows this list).

2. As soon as a segment is handed to the caller, the database record is updated immediately.
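As a rough sketch of the database side, the following JDBC code advances a per-business counter by its step inside one transaction and returns the newly reserved range. The table and column names (leaf_alloc, biz_tag, max_id, step) are modeled on this style of scheme but are assumptions here, as are the connection details:

```java
import java.sql.*;

// A minimal sketch of fetching a number segment from a table like
//   leaf_alloc(biz_tag VARCHAR, max_id BIGINT, step INT)
// Table/column names and connection details are illustrative assumptions.
public class SegmentAllocator {
    private final String url, user, password;

    public SegmentAllocator(String url, String user, String password) {
        this.url = url;
        this.user = user;
        this.password = password;
    }

    // Advances max_id by step inside one transaction and returns the reserved range [start, end].
    public long[] nextSegment(String bizTag) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url, user, password)) {
            conn.setAutoCommit(false);
            try (PreparedStatement update = conn.prepareStatement(
                         "UPDATE leaf_alloc SET max_id = max_id + step WHERE biz_tag = ?");
                 PreparedStatement select = conn.prepareStatement(
                         "SELECT max_id, step FROM leaf_alloc WHERE biz_tag = ?")) {
                update.setString(1, bizTag);
                update.executeUpdate();
                select.setString(1, bizTag);
                try (ResultSet rs = select.executeQuery()) {
                    rs.next();
                    long maxId = rs.getLong("max_id");
                    long step = rs.getLong("step");
                    conn.commit();
                    // IDs in (maxId - step, maxId] are now owned by this service instance.
                    return new long[]{maxId - step + 1, maxId};
                }
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```

Because the UPDATE and SELECT run inside one transaction, two service instances fetching at the same time receive disjoint ranges.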

The advantages in this case are:

1. Disaster tolerance is good: if the DB has problems, the segments already held in memory can keep the service running for a while.

2. 8 bytes (64 bits) is enough to cover the business's ID-generation needs.

3. The maximum value can be defined by yourself, so a migrated business can set its own starting maximum and continue from there.

  Of course, there are disadvantages:

1. If the database goes down for long enough, the whole system becomes unusable.

2. The number segments keep increasing, so order volumes are easy for others to guess.

3. If many services fetch IDs at the same time, or network fluctuations drive up database I/O, system stability suffers.

The solution to the problems above is a double-buffer mechanism: a number segment is read into memory and consumed from there; when roughly 10% of it has been used, a new thread is started to load the next segment into a second buffer; when the first buffer is exhausted, the service switches to the second; when that one in turn reaches 10% consumed, another background fetch is triggered, and so on in rotation.

The benefit is that the database is not hit by a burst of requests at the same moment, which would drive up I/O, and the two cached segments prevent a single buffer from running out abruptly. The segment length is set to about 600 times the QPS, so that even if the database is briefly unavailable, the cached segments can keep the service running for 10-20 minutes.
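A rough sketch of the double-buffer idea is below. The step size, the 10% trigger, and the in-memory stand-in for the database fetch are simplifications; a real implementation would call an allocator like the JDBC sketch above:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough sketch: serve IDs from the current segment and, once about 10% of it
// has been consumed, pre-load the next segment on a background thread.
// fetchSegment() stands in for the database round trip; the step size is assumed.
public class DoubleBufferIdService {
    private static class Segment {
        long cursor;
        long max;
        Segment(long start, long max) { this.cursor = start; this.max = max; }
    }

    private final ExecutorService loader = Executors.newSingleThreadExecutor();
    private final long step = 1000;          // assumed segment length
    private long dbMax = 0;                  // simulated database max_id
    private Segment current = fetchSegment();
    private volatile Segment next;           // pre-loaded second buffer
    private volatile boolean loading;

    public synchronized long nextId() {
        if (current.cursor > current.max) {
            // Current buffer exhausted: switch to the pre-loaded one, or fetch synchronously.
            current = (next != null) ? next : fetchSegment();
            next = null;
        }
        long id = current.cursor++;
        long used = id - (current.max - step);
        if (used >= step / 10 && next == null && !loading) {
            // About 10% consumed: load the next segment in the background.
            loading = true;
            loader.submit(() -> {
                next = fetchSegment();
                loading = false;
            });
        }
        return id;
    }

    // Stand-in for "UPDATE ... SET max_id = max_id + step" against the database.
    private synchronized Segment fetchSegment() {
        long start = dbMax + 1;
        dbMax += step;
        return new Segment(start, dbMax);
    }
}
```

Successive calls to nextId() hand out 1, 2, 3, ..., and the background fetch keeps the second buffer ready before the first runs dry.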

One problem mentioned above remains: the IDs are still incremental. How do we solve that? Use snowflake, and deal with its clock problem. Some companies use ZooKeeper to check whether the clock for the current workerId (the node ID) has been rolled back. If it has, the node sleeps for a fixed interval to see whether the clock catches up; if it catches up, ID generation continues; if it still has not caught up after a certain threshold, an error is reported. And because the middle 10 bits identify different nodes, the IDs produced across nodes do not form a single increasing sequence.
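Building on the Snowflake sketch above, here is one way the "sleep and re-check" idea could look; the 5 ms threshold is an assumption, and the ZooKeeper bookkeeping (persisting and comparing each workerId's last timestamp) is omitted:

```java
// Sketch of waiting out a small clock rollback: if the clock has moved backwards
// by only a few milliseconds, sleep and try once more; beyond that, fail fast.
// The threshold is an assumption; real implementations such as Leaf differ in detail.
public class ClockSafeSnowflake extends Snowflake {
    private static final long MAX_BACKWARD_MS = 5;

    public ClockSafeSnowflake(long workerId) {
        super(workerId);
    }

    @Override
    public synchronized long nextId() {
        try {
            return super.nextId();
        } catch (IllegalStateException clockMovedBack) {
            try {
                // Give the wall clock a chance to catch up, then try exactly once more.
                Thread.sleep(MAX_BACKWARD_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return super.nextId(); // still behind -> the exception propagates to the caller
        }
    }
}
```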

These ideas have already been implemented by a company: if you want to keep digging, search GitHub for the open-source project Leaf (from Meituan) and use it directly.

   If there is something wrong, please correct me.
