Location>code7788 >text

These 10 distributed IDs are so great!

Popularity:39 ℃/2024-09-19 11:20:39

preamble

Distributed ID, in fact, is used quite a lot in our daily development.

There are many business scenarios in use, for example:

  1. Trace_id for distributed link systems
  2. Primary key in a single table
  3. Distributed lock keys in Redis
  4. The id of the table after splitting library and table

Today I'm going to talk to you about some common scenarios for distributed ID, and I hope you'll find them helpful.

1 UUID

UUID (Universally Unique IDentifier), also known as GUID (Globally Unique IDentifier).

The UUID is a 128-bit length identifier that ensures its uniqueness in time and space.

UUID was initially used in the Apollo network computing system and subsequently in the Distributed Computing Environment (DCE) of the Open Software Foundation (OSF).

Allows distributed systems to generate unique identifiers, such as unique IDs for logging, without the use of a central node.

UUIDs are generated based on a variety of factors such as timestamps, MAC addresses, random numbers, etc., and are theoretically nearly impossible to duplicate globally.

Unique strings can be obtained in Java by using the randomUUID method of the UUID:

import ;

/**
 * @author Su San
 * @date 2024/9/13 morning10:38
 */
public class UuidTest {
    public static void main(String[] args) {
        String uuid = ().toString();
        (uuid);
    }
}

Run results:

22527933-d0a7-4c2b-a377-aeb438a31b02

Advantages: UUID does not use the central node, you can maintain the independence of the program, you can ensure that the program between different databases, do data migration, are not affected.

Cons: The string generated by UUID is too long, and the efficiency of querying the data through the index is relatively low. In addition, the order of the string generated by UUID is not guaranteed, it is not incremental, which does not satisfy some business scenarios at work.

In a distributed logging system or a distributed link tracking system, UUIDs can be used to generate unique identifiers that can be used to string together requested logs.

2 Database self-incrementing IDs

In many databases self-incremented primary key IDs are guaranteed to be unique by the database itself.

auto_increment in MySQL.

sequences in Oracle.

We do not need to do any processing in the business code, the value of this ID, is automatically generated by the database, and it will ensure the uniqueness of the data.

Pros: very simple and very efficient in querying data.

Disadvantages: can only guarantee the uniqueness of data in a single table, if cross-table or cross-database, ID may be duplicated. id is self-incrementing, the generation rules can be easily guessed, there is a security risk. id is generated based on the database, in the case of high concurrency, there may be performance problems.

In some old systems or internal management systems of companies, database incremental ID may be used as a distributed ID scheme, and the concurrent users of these systems are usually small and the amount of data is not large.

3 Database numbering model

In highly concurrent systems, frequent access to the database can affect the performance of the system.

An optimization can be made to the database self-incrementing ID scheme.

Generate IDs one at a time for a certain step size, e.g., the step size is 1000, and each time the database increments itself by 1000, the ID value changes from 100001 to 101001.


The 1000 IDs of the number segment 100002~101001 are cached into the server's memory from.

When a request to get a distributed ID comes through, the data is first fetched from the server's memory, and if it can be fetched, it is returned directly.

If it is not fetched, it means that the data of the cached number has been fetched.

At this point, you need to fetch the ID of the new number segment from the database once again, and cache it into the server's memory, so that next time you can fetch the ID directly from memory again.

Advantages: simple to implement, the dependence on the database is weakened, which can improve the performance of the system.

Disadvantages: ID is self-incrementing, the generation rules can be easily guessed, there is a security risk. If the database is a single node, there is a risk of rock machine.

4 Multi-master model for databases

To solve the above single node rock machine problem, we can use the multi-master mode of the database.

That is, there are multiple master database instances.

When generating IDs, a request can only write to one master instance.

In order to ensure the uniqueness of IDs under different master instances, we need to specify in advance a large interval under each master, e.g., master1's data starts with 10, master2's data starts with 11, and master3's data starts with 12.

Then each MASTER, still following the database number segment model.

Advantages: avoids the risk of single node rock machine of database number mode, improves the stability of the system, and the performance of the system is OK due to the combination of using the number mode.

Cons: IDs generated across multiple master instances may not be incremental.

5 Redis Generation ID

In addition to using a database, Redis can actually generate self-incrementing IDs.

We can use the incr command in Redis:

redis> SET ID_VALUE 1000
OK

redis> INCR ID_VALUE
(integer) 1001

redis> GET ID_VALUE 
"1001"

Set the value of 1000 for ID_VALUE and then use the INCR command to add 1 each time.

This scenario is similar to the one we previously discussed for scenario 1 (database self-incrementing ID).

Pros: simple scheme, better performance than scheme 1, avoids cross-table or cross-database, ID duplication problem.

Disadvantages: ID is self-incrementing, the generation rules can be easily guessed, there is a security risk. And Redis may also have the risk of single node, rock machine.

6 Zookeeper generates IDs

Zookeeper generates a sequence number mainly through its znode data version, which can generate 32-bit and 64-bit data version numbers, and clients can use this version number as a unique sequence number.

Due to the need to be highly dependent on Zookeeper and the fact that it is a synchronous call to the API, you need to consider using distributed locks if you are in a situation where there is a lot of contention.

As a result, performance is also less than optimal in highly concurrent distributed environments.

Few people will use Zookeeper to generate unique IDs.

7 Snowflake Algorithm

Snowflake (Snowflake Algorithm) is Twitter's open source distributed ID algorithm.

Core idea: use a 64 bit long type number as the global unique id.

The highest bit is the sign bit, which is always 0 and not available.

A 41-bit time series, accurate to the millisecond level, can be used for 69 years in 41-bit length. Another very important function of time bits is that they can be sorted according to time.

10-bit machine identification, 10-bit length supports deployment of up to 1024 nodes

12-bit counting sequence number, sequence number that is a series of self-incrementing id, can support the same node in the same millisecond to generate multiple ID serial number, 12-bit counting sequence number support each node every millisecond to generate 4096 ID serial number.

Advantages: simple algorithm, performed in memory, high efficiency. Generate non-repeating IDs in a highly concurrent distributed environment, up to a million non-repeating IDs per second.
Based on the timestamp, as well as the same timestamp under the sequence number of self-incrementation, basically to ensure that the ID of the orderly increment. And does not rely on third-party libraries or middleware, better stability.

Cons: Dependent on server time, duplicate IDs may be generated when the server clock is dialed back.

8 Leaf

Leaf is the open source distributed ID generation system of Meituan, which provides two ways to generate IDs:

  • Leaf-segment number mode
  • Leaf-snowflake snowflake algorithm

Leaf-segment number pattern, a table needs to be created:

This schema is the database number segment schema we talked about in Section 3.

The biz_tag is used to differentiate between businesses. max_id indicates the maximum value of the ID segment to which the biz_tag is currently assigned, and STEP indicates the length of the segment each time it is assigned.

Originally getting the ID required writing to the database each time, now you just need to set the STEP large enough, say 1000. then only when the 1000 numbers have been consumed does it go to re-reading and writing to the database once.

Leaf-snowflake snowflake algorithm, is on top of the traditional snowflake algorithm, plus Zookeeper, do a little transformation:

The Leaf-snowflake service needs to fetch the workId sequentially from Zookeeper, which will be cached locally.

If the Zookeeper has an exception, the Leaf-snowflake service will directly get the local workId, which is equivalent to a weak dependency on the Zookeeper.

Because this scheme relies on time, there is a risk of generating duplicate ID numbers if the machine's clock is rolled back, and it has an internal mechanism to address the issue of machine clock rollback:

If you want to know more details about Mission Leaf, take a look at the Github address:/Meituan-Dianping/Leaf

Recently organized a 100,000-word interview dictionary, can be given to you for free, get the way to add my WeChat: su_san_java, Remarks: Interview.

9 Tinyid

Tinyid is a distributed id generation system developed by Drip in Java , based on the database number algorithm to achieve .

Tinyid is extended on the basis of the ID generation algorithm Leaf in the United States, support for the database multi-master node mode, it provides a REST API and JavaClient two ways to get, relatively more convenient to use.

However, unlike the American Leaf, Tinyid only supports one mode for the number segment and does not support the Snowflake mode.

A simple architectural solution based on the database numbering model:

The ID generation system provides http service to the outside world, and the request is load-balanced router, which can route to one of the tinyid-servers, so that it can get an ID from the pre-loaded number.

If the number segment has not been loaded, or has already been used up, it is necessary to request a new available segment from the db. Between multiple servers, because of the atomicity of the segment generation algorithm, and to ensure that the available segments on each server are not heavy, so that the id generation is not heavy.

But it also brings up these questions:

  • When the id runs out you need to access the db to load a new section, db updates may also have version conflicts, at this time the id generation time consuming significantly increased.
  • The db is a single point, and while the db can be built with a highly available architecture such as master-slave, it is always a single point.
  • Getting an id using the http method has network overhead and is not very good in terms of performance and availability.

To solve these these problems: add tinyid-client to generate IDs locally, use double-segment caching, and add multi-db support to improve the stability of the service.

The final architectural solution is as follows:

The Tinyid solution mainly does the following optimizations:

  • Add tinyid-client: tinyid-client sends a request to tinyid-server to get the available number, and then builds the double number and id generation locally, so id generation becomes a purely local operation, which greatly improves the performance.
  • Use double segment cache: to avoid the program to get the unique ID slower in the case of getting a new segment, the segment in Tinyid will go to asynchronously load the next segment when it is used up to a certain level, to make sure that there are always available segments in the memory.
  • Add multi-db support: unique IDs can be generated for each DB, improving usability.

If you want more details on Drip Tinyid, check out the Github address:/didi/tinyid

10 UidGenerator

Baidu UID-Generator uses the Java language, based on the snowflake algorithm implementation.

UidGenerator work in the form of components in the application project, support for customizing the number of workerId bits and initialization policy, so as to apply to docker and other virtualization environments such as automatic restart of instances, drift and other scenarios.

In terms of implementation, UidGenerator solves the natural concurrency limitation of sequence by borrowing future time.

Adopting RingBuffer to cache generated UIDs, parallelize UID production and consumption, and replenish CacheLine to avoid hardware-level "pseudo-sharing" problems caused by RingBuffer. The final single-machine QPS can reach 6 million.

Description of Snowflake algorithm: specify the machine & at the same moment & a certain concurrent sequence that is unique. A 64 bits unique ID (long) can be generated accordingly. By default, the above byte allocation is used:

  • sign(1bit): fix the 1bit sign identity, i.e. the generated UID is positive.
  • delta seconds (28 bits): current time, incremental value relative to the time base "2016-05-20", unit: second, up to a maximum of about 8.7 years
  • worker id (22 bits): machine id, can support up to about 420w machine starts. The built-in implementation is to be allocated by the database at startup, the default allocation policy is use-it-or-lose-it, and a reuse policy can be provided later.
  • sequence (13 bits): concurrent sequence per second, 13 bits can support 8192 concurrences per second.

sequence determines the concurrency of the UidGenerator, 13 bits of sequence can support 8192/s concurrency, but in reality it is likely to be insufficient, thus giving birth to the CachedUidGenerator.

CachedUidGenerator uses RingBuffer to cache generated ids. ringBuffer is a ring array with a default size of 8192 (you can set the size via the boostPower parameter).

RingBuffer ring array, each element of the array becomes a slot.

The Tail pointer and Cursor pointer are used to read and write slots on a circular array:

  • Tail pointer: indicates the maximum number of slots produced by the Producer (this number starts from 0 and keeps incrementing).Tail can not exceed the Cursor, i.e. the Producer can not overwrite the unconsumed slots.When the Tail has already caught up with the curosr, then it can be specified by rejectedPutBufferHandler. PutRejectPolicy.
  • Cursor pointer: the smallest serial number that the Consumer has consumed (the serial number sequence is the same as the Producer's sequence), the Cursor can not exceed the Tail, that is, it can not consume the slot that has not been produced, when the Cursor has caught up with the tail, then it can be specified by rejectedTakeBufferHandler. TakeRejectPolicy.

RingBuffer fill trigger mechanism.

  • The RingBuffer is filled to the brim when the program starts.
  • When calling the getUID() method to get the ids, if the number of remaining ids in the RingBuffer is detected to be less than 50% of the total number, fill the RingBuffer to the brim.
  • Timed Fill (configurable whether to use it and the period of the timed task).

If you want more details about Baidu uid-generator, check out the Github address:/baidu/uid-generator

One final note (ask for attention, don't patronize me)

If this article is helpful to you, or if you are inspired, help scan the QR code below to pay attention to it, your support is my biggest motivation to keep writing.

Ask for a one-click trifecta: like, retweet, and watch at.
Concerned about the public number: [Su San said technology], in the public number reply: interviews, code artifacts, development manuals, time management there are awesome fan benefits, in addition to the reply: to add a group, you can communicate with a lot of BAT factory seniors and learn.