Many large Internet companies store enormous volumes of data, so they shard it across multiple databases and tables. After sharding, each record still needs a globally unique ID. That ID can be a numeric auto-increment or something like a UUID.

If IDs increment numerically, rows can be distributed evenly across the shards by hashing the ID, and MySQL stores and indexes rows much faster when the primary key is monotonically increasing. The drawback is that sequential order IDs leak business information: anyone can estimate how many orders you process. In that case you may need a generation scheme whose IDs are not simply sequential.
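The hash-based routing above can be sketched in a few lines. The shard count and modulo routing here are illustrative, not from any particular system:

```python
# Sketch: routing a numeric ID to one of N shards (hypothetical shard count).
NUM_SHARDS = 4

def shard_for(order_id: int) -> int:
    """Pick a shard by hashing (here: simple modulo) the numeric ID."""
    return order_id % NUM_SHARDS

# Consecutive auto-increment IDs spread evenly across the shards.
print([shard_for(i) for i in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```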
When you think of distributed ID generation, Redis may come to mind: the INCR command atomically increments a counter and returns the new value. This works fine, but INCR tops out at roughly 200,000 operations per second (the benchmark published on the official site is about 200,400). If your QPS stays well below that, Redis is clearly a reasonable choice.
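The Redis approach is essentially one call. With the real redis-py client this would be `redis.Redis().incr("order:id")`; the tiny in-memory stub below stands in for the client so the sketch runs without a server (the key name is illustrative):

```python
class FakeRedis:
    """Minimal stand-in exposing INCR semantics (atomic in real Redis)."""
    def __init__(self):
        self._counters = {}

    def incr(self, key: str) -> int:
        self._counters[key] = self._counters.get(key, 0) + 1
        return self._counters[key]

def next_order_id(r) -> int:
    """Each call returns the next value of the shared counter."""
    return r.incr("order:id")

client = FakeRedis()  # in production: redis.Redis(host=..., port=...)
print(next_order_id(client), next_order_id(client))  # 1 2
```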
So is there a scheme with higher availability, higher QPS, and lower latency? Let's look at the snowflake algorithm, open-sourced by Twitter.
A snowflake ID is 64 bits in total:

1. The first (sign) bit is unused.
2. 41 bits are a millisecond timestamp; 2^41 milliseconds is about 69 years, which is plenty.
3. 10 bits identify the worker machine, allowing 2^10 = 1024 worker nodes. Some companies split this into a 5-bit data-center ID and a 5-bit worker ID within the data center.
4. The final 12 bits are a sequence number, giving 4096 IDs per millisecond per node.
In theory this yields 4096 × 1000 ≈ 4.096 million IDs per second per node.
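The bit layout above can be sketched directly. This is a minimal single-threaded illustration of the technique, not Twitter's production code; the epoch constant is Twitter's published one, but any fixed epoch works:

```python
import time

class Snowflake:
    """Sketch of the snowflake layout: 1 unused sign bit, 41-bit millisecond
    timestamp, 10-bit worker ID, 12-bit per-millisecond sequence."""
    EPOCH = 1288834974657  # Twitter's custom epoch (2010-11-04)

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # 2**10 worker nodes
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1

    def _now_ms(self) -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        ms = self._now_ms()
        if ms < self.last_ms:
            raise RuntimeError("clock moved backwards")  # the rollback problem
        if ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12 bits: 4096/ms
            if self.sequence == 0:  # sequence exhausted: spin to next millisecond
                while ms <= self.last_ms:
                    ms = self._now_ms()
        else:
            self.sequence = 0
        self.last_ms = ms
        return ((ms - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence

gen = Snowflake(worker_id=1)
first, second = gen.next_id(), gen.next_id()
assert first < second  # IDs are strictly increasing on a single node
```

Note how the clock-rollback weakness discussed below appears naturally: the sketch can only raise an error when time goes backwards.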
The advantages of this approach:

1. QPS is very high, so the performance requirement is easily met.
2. It depends on no third-party middleware such as Redis; fewer dependencies mean higher availability.
3. The bit layout can be customized to your needs, i.e. the middle 10 bits can be allocated however you like.
Disadvantages:

The algorithm depends heavily on the system clock. If the clock is rolled back, the same ID may be generated twice.
A UUID is a 128-bit value, usually written as 32 hexadecimal digits. Generation is very fast, but a version-1 UUID is derived from the machine's MAC address plus a timestamp rather than coordinated across a cluster, and UUIDs are unordered and unfriendly as database primary keys, so they fall outside the scope of this discussion.
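For reference, the standard library makes the size and shape of a UUID easy to verify (`uuid1` embeds the MAC address and a timestamp; `uuid4` is random):

```python
import uuid

u = uuid.uuid4()          # random UUID; uuid.uuid1() would embed the MAC
print(len(str(u)))        # 36: 32 hex digits plus 4 hyphens
print(u.int.bit_length() <= 128)  # True: a UUID is a 128-bit value
```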
Now let's look at the distributed ID mechanism some large companies use: the number-segment scheme. Create a table with an 8-byte (64-bit) column that records how far ID allocation has progressed. For example, IDs start from 0 with a step of 1000; after the first segment is handed out, the recorded maximum is 1000, and after the next fetch it becomes 2000. The value stored in the table is always the smallest ID that has not yet been allocated.
The specific steps are:

1. Provide an ID-generation service. It reads the current value and step from the database, computes the range it may hand out, and the consumer caches that number segment locally for use.
2. As soon as a segment is handed to a caller, the database row is updated immediately.
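The steps above can be sketched as follows. The class name and in-memory state are illustrative stand-ins for the real table; a production service would advance `max_id` with a single transactional UPDATE:

```python
class SegmentDB:
    """Stand-in for the table that records allocation progress."""
    def __init__(self, start: int = 0, step: int = 1000):
        self.max_id = start   # smallest ID not yet handed out
        self.step = step

    def next_segment(self):
        """Hand out the next half-open range [lo, hi) and persist the new max.
        (In a real service this is one atomic UPDATE.)"""
        lo = self.max_id
        self.max_id += self.step
        return lo, self.max_id

db = SegmentDB(start=0, step=1000)
print(db.next_segment())  # (0, 1000): first caller may use IDs 0..999
print(db.next_segment())  # (1000, 2000): stored value is always the
                          # smallest unallocated ID
```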
The advantages of this scheme:

1. Good fault tolerance: if the DB has a problem, the segments already cached in memory keep the service running for a while.
2. 8 bytes is enough room for any business's ID needs.
3. The starting value can be set manually, so a migrated business can pick up from its existing maximum and keep going.
Of course, there are also shortcomings:

1. If the database stays down long enough, the whole system becomes unusable.
2. The segments are strictly increasing, so order volumes are easy for outsiders to guess.
3. If many services fetch IDs at the same moment, or network jitter drives up database I/O, system stability suffers.
The fix for the last problem is a double-buffer mechanism: a number segment is read into memory and consumed from there, and once 10% of it has been used, a new thread is started to fetch the next segment. When the current buffer runs out, the service switches to the other one; when that buffer in turn reaches 10% consumed, a fetch thread is started again, and so on.

The benefit is that the database is no longer hit by many simultaneous requests that drive up I/O, and the two alternating segments avoid the stall that occurs when a single cache runs dry. Sizing each segment at about 600 × QPS lets the service keep running for 10–20 minutes even if the database is briefly unavailable.
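The double-buffer idea can be sketched as below. To stay simple and deterministic, this sketch fetches the standby segment inline at the 10% mark, where a real implementation would do it on a background thread; `make_segment_source` is an illustrative stand-in for the database:

```python
def make_segment_source(step: int = 100):
    """Stand-in for the DB table: each call hands out the next [lo, hi) range."""
    state = {"max_id": 0}
    def fetch():
        lo = state["max_id"]
        state["max_id"] += step
        return lo, state["max_id"]
    return fetch

class DoubleBufferAllocator:
    """Serve IDs from one cached segment; prefetch the other at 10% used."""
    def __init__(self, fetch_segment):
        self.fetch = fetch_segment
        self.start, self.limit = self.fetch()  # current segment [start, limit)
        self.next = self.start
        self.standby = None                    # prefetched second buffer

    def next_id(self) -> int:
        if self.next >= self.limit:              # current segment exhausted:
            self.start, self.limit = self.standby  # swap to the prefetched one
            self.next = self.start
            self.standby = None
        nid = self.next
        self.next += 1
        used = (self.next - self.start) / (self.limit - self.start)
        if self.standby is None and used >= 0.10:
            self.standby = self.fetch()  # real impl: background thread
        return nid

alloc = DoubleBufferAllocator(make_segment_source(step=100))
ids = [alloc.next_id() for _ in range(250)]
assert ids == list(range(250))  # continuous IDs, never stalled mid-segment
```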
One problem mentioned above remains: the IDs still increase predictably. How do we solve that? By falling back on snowflake and fixing its clock problem. Some companies use ZooKeeper to check whether the timestamp last recorded for the current workerId (the node ID) is ahead of the local clock, i.e. whether a rollback happened. On a rollback the node sleeps for a fixed interval and checks whether the clock has caught up; if it has, ID generation resumes; if it still lags beyond a threshold, an error is raised. Because the middle 10 bits distinguish nodes, IDs generated by different nodes will not conflict.
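The sleep-and-retry rollback check can be sketched as a small guard. The tolerance value and function names here are hypothetical; the clock and sleep are injected so the logic is easy to exercise:

```python
MAX_TOLERANCE_MS = 5  # hypothetical threshold; real services pick their own

def wait_out_rollback(last_ms: int, now_ms, sleep) -> int:
    """If the clock moved back a little, sleep and retry; if the gap is too
    large, fail rather than risk generating duplicate IDs.
    `now_ms` returns the current time in ms; `sleep` takes seconds."""
    current = now_ms()
    if current < last_ms:
        gap = last_ms - current
        if gap > MAX_TOLERANCE_MS:
            raise RuntimeError(f"clock rolled back {gap} ms, refusing to run")
        sleep((gap + 1) / 1000.0)  # wait roughly long enough to catch up
        current = now_ms()
        if current < last_ms:      # still behind after waiting: give up
            raise RuntimeError("clock still behind after waiting")
    return current
```

In production the "last seen" timestamp would come from the node's record in ZooKeeper rather than a local variable.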
These ideas have already been implemented and open-sourced by Meituan as Leaf. If you are interested in studying further, search for Leaf on GitHub and use it directly.