Java Interview Guide: Why do we need distributed IDs? What distributed ID generation schemes do big tech companies use?

Good afternoon, I'm Guide brother!

Today, I will share an interview question that a friend actually encountered when interviewing at JD: "Why do you need a distributed ID? How do you do it in your project?"

In this article, I will talk about my own views and introduce in detail the content related to distributed ID, including the basic requirements of distributed ID and common solutions for distributed ID.

This article is written in plain language throughout. I hope it helps you!

Originality is not easy, if it helps, like/share is my biggest encouragement!

My own abilities are limited. If anything in the article needs to be added, improved, or corrected, please point it out in the comments so we can improve together!

Distributed ID

What is ID?

In daily development, we use IDs to uniquely identify all kinds of data in the system. For example, a user ID corresponds to exactly one person, a product ID corresponds to exactly one product, and an order ID corresponds to exactly one order.

We also have various IDs in real life: an ID card number corresponds to exactly one person, and an address ID corresponds to exactly one address.

Simply put, an ID is the unique identifier of a piece of data.

What is distributed ID?

A distributed ID is an ID used in a distributed system. Distributed IDs do not exist in real life; they are a concept that belongs to computer systems.

Let me give a brief example based on database and table sharding.

One of our company's projects used a single MySQL instance. What I did not expect was that, within a month of going live, the number of users grew and the amount of data in the system kept getting larger.

The single MySQL instance could no longer cope, so we had to shard the databases and tables (Sharding-JDBC is recommended).

After the database is split, the data is spread across the databases on different servers, and the auto-incrementing primary key of the database can no longer satisfy the uniqueness of the generated primary key. How do we generate a globally unique primary key for different data nodes?

This is when you need to generate distributed IDs.
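To make the problem concrete, here is a minimal sketch of why per-database auto-increment breaks after sharding. The `Shard` class is a hypothetical in-memory stand-in for one MySQL node (it is not a real database API); each shard counts from 1 on its own, so both shards hand out the same keys.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for one MySQL shard with its own AUTO_INCREMENT counter.
class Shard {
    private long autoIncrement = 0;

    long insertRow() {
        return ++autoIncrement; // every shard starts counting from 1 on its own
    }
}

class ShardCollisionDemo {
    // Inserts rowsPerShard rows into each of two shards and returns the keys issued more than once.
    static Set<Long> duplicateKeys(int rowsPerShard) {
        List<Shard> shards = List.of(new Shard(), new Shard());
        Set<Long> seen = new HashSet<>();
        Set<Long> duplicates = new HashSet<>();
        for (int i = 0; i < rowsPerShard; i++) {
            for (Shard shard : shards) {
                long id = shard.insertRow();
                if (!seen.add(id)) {
                    duplicates.add(id); // the other shard already issued this key
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println("Keys issued by both shards: " + duplicateKeys(3));
    }
}
```

Every key issued by the first shard is issued again by the second, which is exactly the uniqueness problem a distributed ID has to solve.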

What requirements does a distributed ID need to meet?

As an indispensable part of a distributed system, distributed ID is used in many places.

At a minimum, a distributed ID needs to meet the following requirements:

Globally unique: The global uniqueness of the ID must be satisfied first!
High performance: Distributed IDs must be generated quickly and consume few local resources.
High availability: The service that generates distributed IDs must have availability as close to 100% as possible.
Convenient and easy to use: It works out of the box, is simple to use, and is quick to integrate!

Beyond these, a good distributed ID should also offer:

Security: The ID does not contain sensitive information.
Ordered increment: If the ID is stored in a database, ordered IDs improve database write speed. In addition, we often sort directly by ID.
Specific business meaning: If the generated ID carries business meaning, troubleshooting and development become more transparent (you can tell from the ID which business it belongs to).
Independent deployment: The distributed system has a separate ID issuer service dedicated to generating distributed IDs. This decouples ID generation from the business services, but it also adds the cost of network calls. In general, if many scenarios need distributed IDs, deploying the issuer service independently is still worthwhile.

Common solutions for distributed ID

Database

Database primary key increment

This method is relatively simple and straightforward, that is, the unique ID is generated by the self-incrementing primary key of the relational database.

Taking MySQL as an example, we can use the following method.

1. Create a database table.

CREATE TABLE `sequence_id` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `stub` char(10) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The stub field has no meaning of its own. It is only a placeholder that lets us insert or modify data. The stub field has a unique index to guarantee its uniqueness.

2. Insert data with replace into.

BEGIN;
REPLACE INTO sequence_id (stub) VALUES ('stub');
SELECT LAST_INSERT_ID();
COMMIT;

To insert data here, we use replace into rather than insert into. The specific steps are as follows:

1) Step 1: Try to insert the data into the table.

2) Step 2: If the insert fails because of a duplicate value in the primary key or a unique index, first delete the conflicting row containing the duplicate key value from the table, and then try to insert the data again.
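The effect of these two steps can be sketched with a small in-memory model. `SequenceIdTable` is a hypothetical class that only imitates the behavior described above (it does not talk to MySQL): because the conflicting row is deleted and re-inserted, every call yields a new auto-increment value while the table never holds more than one row.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the sequence_id table: stub is the unique key, id is AUTO_INCREMENT.
class SequenceIdTable {
    private final Map<String, Long> idByStub = new HashMap<>();
    private long autoIncrement = 0;

    // Imitates REPLACE INTO sequence_id (stub) VALUES (?) followed by SELECT LAST_INSERT_ID().
    synchronized long replaceInto(String stub) {
        idByStub.remove(stub);     // delete the conflicting row, if there is one
        long id = ++autoIncrement; // the re-inserted row receives a fresh auto-increment value
        idByStub.put(stub, id);
        return id;
    }

    synchronized int rowCount() {
        return idByStub.size();
    }

    public static void main(String[] args) {
        SequenceIdTable table = new SequenceIdTable();
        for (int i = 0; i < 3; i++) {
            System.out.println("id = " + table.replaceInto("stub") + ", rows = " + table.rowCount());
        }
    }
}
```

This is why the table stays tiny no matter how many IDs have been issued.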

The advantages and disadvantages of this approach are also obvious:

Advantages: It is relatively simple to implement, IDs increase in order, and storage space consumption is small.
Disadvantages: The supported concurrency is limited; the database is a single point of failure (a database cluster solves this, but adds complexity); the ID has no specific business meaning; there are security issues (for example, the daily order volume can be worked out from how order IDs increase, and that is a trade secret!); and every ID requires a round trip to the database (which adds load on the database and makes obtaining an ID slow).

Database number segment mode

The primary key auto-increment approach requires a database access every time an ID is obtained. When the demand for IDs is high, this will not hold up.

If we fetch IDs in batches and keep them in memory, we can simply take them from memory whenever we need them. This is what generating distributed IDs with the database number segment mode means.

The database number segment mode is one of today's mainstream ways to generate distributed IDs. For example, Didi's open-source Tinyid is based on this approach, although Tinyid further optimizes it with a double-segment cache and multi-DB support.

Taking MySQL as an example, we can use the following method.

1. Create a database table.

CREATE TABLE `sequence_id_generator` (
  `id` int(10) NOT NULL,
  `current_max_id` bigint(20) NOT NULL COMMENT 'current max id',
  `step` int(10) NOT NULL COMMENT 'length of the number segment',
  `version` int(20) NOT NULL COMMENT 'version number',
  `biz_type`    int(20) NOT NULL COMMENT 'business type',
   PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The current_max_id field and the step field are used to obtain a batch of IDs. The batch obtained is: current_max_id ~ current_max_id+step.

The version field is used to handle concurrency (optimistic locking), and the biz_type field indicates the business type.

2. Insert a row of data first.

INSERT INTO `sequence_id_generator` (`id`, `current_max_id`, `step`, `version`, `biz_type`)
VALUES
    (1, 0, 100, 0, 101);

3. Obtain the batch unique ID

SELECT `current_max_id`, `step`,`version` FROM `sequence_id_generator` where `biz_type` = 101

result:

id    current_max_id    step    version    biz_type
1    0    100    0    101

4. When the IDs run out, update the row and then SELECT again.

UPDATE sequence_id_generator SET current_max_id = 0+100, version=version+1 WHERE version = 0  AND `biz_type` = 101;
SELECT `current_max_id`, `step`,`version` FROM `sequence_id_generator` where `biz_type` = 101;

result:

id    current_max_id    step    version    biz_type
1    100    100    1    101

Compared with the primary key auto-increment approach, the number segment mode accesses the database far less often, so the pressure on the database is lower.
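The steps above can be sketched in Java as follows. This is only an illustration under stated assumptions: the database row is replaced by an in-memory `AtomicReference` (its `compareAndSet` stands in for `UPDATE ... WHERE version = ?`), the class and method names are hypothetical, and a claimed segment is taken to cover old current_max_id + 1 through the new current_max_id.

```java
import java.util.concurrent.atomic.AtomicReference;

class SegmentIdGenerator {
    // Immutable snapshot of the sequence_id_generator row for one biz_type.
    static final class Row {
        final long currentMaxId;
        final int step;
        final int version;

        Row(long currentMaxId, int step, int version) {
            this.currentMaxId = currentMaxId;
            this.step = step;
            this.version = version;
        }
    }

    private final AtomicReference<Row> dbRow; // stand-in for the shared database row
    private long nextId = 1;                  // next ID to hand out from the cached segment
    private long segmentEnd = 0;              // last ID of the cached segment (empty at start)

    SegmentIdGenerator(AtomicReference<Row> dbRow) {
        this.dbRow = dbRow;
    }

    synchronized long nextId() {
        if (nextId > segmentEnd) {
            fetchSegment(); // only now do we "visit the database"
        }
        return nextId++;
    }

    // Optimistic locking: retry until our update wins against the version we read.
    private void fetchSegment() {
        while (true) {
            Row old = dbRow.get();
            Row updated = new Row(old.currentMaxId + old.step, old.step, old.version + 1);
            if (dbRow.compareAndSet(old, updated)) {
                nextId = old.currentMaxId + 1;
                segmentEnd = updated.currentMaxId;
                return;
            }
        }
    }

    public static void main(String[] args) {
        AtomicReference<Row> row = new AtomicReference<>(new Row(0, 100, 0));
        SegmentIdGenerator serviceA = new SegmentIdGenerator(row);
        SegmentIdGenerator serviceB = new SegmentIdGenerator(row);
        System.out.println("A: " + serviceA.nextId() + ", B: " + serviceB.nextId());
    }
}
```

Two generators sharing the same row claim disjoint segments, and each one touches the row only once per `step` IDs.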

In addition, to avoid a single point of failure, you can use a master-slave setup to improve availability.

Advantages and disadvantages of the database number segment mode:

Advantages: IDs increase in order, and storage space consumption is small.
Disadvantages: The database is a single point of failure (a database cluster solves this, but adds complexity), the ID has no specific business meaning, and there are security issues (for example, the daily order volume can be worked out from how order IDs increase, and that is a trade secret!).

NoSQL

Among NoSQL solutions, Redis is the most commonly used. We can increment an ID atomically and sequentially with the incr command.

127.0.0.1:6379> set sequence_id_biz_type 1
OK
127.0.0.1:6379> incr sequence_id_biz_type
(integer) 2
127.0.0.1:6379> get sequence_id_biz_type
"2"

To improve availability and concurrency, we can use Redis Cluster. Redis Cluster is the official Redis clustering solution (version 3.0+).

Besides Redis Cluster, you can also use the open-source Redis cluster solution Codis (recommended for large clusters, for example with hundreds of nodes).

Beyond availability and concurrency, remember that Redis is memory-based, so we need to persist the data to avoid losing it when the machine restarts or fails. Redis supports two persistence methods: snapshotting (RDB) and the append-only file (AOF). In addition, Redis 4.0 introduced mixed RDB and AOF persistence (disabled by default; it can be enabled with the aof-use-rdb-preamble configuration item).

I won't go into much detail about Redis persistence here. If you are not familiar with it, see JavaGuide's summary of Redis knowledge points.

The advantages and disadvantages of the Redis solution:

Advantages: Performance is good, and the generated IDs increase in order.
Disadvantages: Similar to the disadvantages of the database primary key auto-increment scheme.

In addition to Redis, MongoDB ObjectId is often used as a distributed ID solution.

MongoDB ObjectId requires a total of 12 bytes to store:

0~3: Timestamp
4~6: Machine ID
7~8: Process ID
9~11: Auto-incrementing counter
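As a rough illustration of this layout, the sketch below splits a 24-character hexadecimal ObjectId string into the four fields: 4 bytes of timestamp, 3 bytes of machine ID, 2 bytes of process ID, and 3 bytes of counter. `ObjectIdLayout` is a hypothetical helper written for this article, not part of the MongoDB driver, and the sample value is arbitrary.

```java
import java.time.Instant;

class ObjectIdLayout {
    final long timestampSeconds; // bytes 0~3: seconds since the Unix epoch
    final int machineId;         // bytes 4~6
    final int processId;         // bytes 7~8
    final int counter;           // bytes 9~11

    private ObjectIdLayout(long timestampSeconds, int machineId, int processId, int counter) {
        this.timestampSeconds = timestampSeconds;
        this.machineId = machineId;
        this.processId = processId;
        this.counter = counter;
    }

    // Each byte is two hex characters, so byte n starts at character 2 * n.
    static ObjectIdLayout parse(String hex) {
        if (hex.length() != 24) {
            throw new IllegalArgumentException("An ObjectId has 12 bytes (24 hex characters)");
        }
        long timestamp = Long.parseLong(hex.substring(0, 8), 16);
        int machine = Integer.parseInt(hex.substring(8, 14), 16);
        int process = Integer.parseInt(hex.substring(14, 18), 16);
        int counter = Integer.parseInt(hex.substring(18, 24), 16);
        return new ObjectIdLayout(timestamp, machine, process, counter);
    }

    public static void main(String[] args) {
        ObjectIdLayout id = ObjectIdLayout.parse("507f1f77bcf86cd799439011");
        System.out.println("created at " + Instant.ofEpochSecond(id.timestampSeconds));
        System.out.println("machine=" + id.machineId + " process=" + id.processId + " counter=" + id.counter);
    }
}
```

Because the timestamp occupies the leading bytes, ObjectIds generated later sort after earlier ones at a granularity of one second.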

The advantages and disadvantages of the MongoDB solution:

Advantages : The performance is good and the generated ID is orderly increasing
Disadvantages : Need to solve the problem of duplicate IDs (when the machine time is wrong, it may cause duplicate IDs), there are security problems (the ID generation is regular)

Algorithm

UUID

UUID is the abbreviation of Universally Unique Identifier. A UUID contains 32 hexadecimal digits, displayed in five groups (8-4-4-4-12).

JDK provides a ready-made UUID method, just one line of code.

// Sample output: cb4a9ede-fa5e-4585-b9bb-d60bce986eaa
System.out.println(UUID.randomUUID());

(Figure: an example of a UUID from RFC 4122.)

Let's focus on the version here: different versions use different UUID generation rules.

The meanings of the 5 different version values (see Wikipedia's introduction to UUID):

Version 1: The UUID is generated from the time and a node ID (usually the MAC address);
Version 2: The UUID is generated from an identifier (usually a group or user ID), the time, and a node ID;
Versions 3 and 5: A deterministic UUID is generated by hashing a namespace identifier and a name;
Version 4: The UUID is generated from random or pseudo-random numbers.

(Figure: an example of a UUID generated under version 1.)

The UUID generated by UUID.randomUUID() in the JDK is version 4.

UUID uuid = UUID.randomUUID();
int version = uuid.version();// 4
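For comparison, the JDK can also produce name-based UUIDs. The sketch below uses only `java.util.UUID` methods that exist in the JDK (`randomUUID`, `nameUUIDFromBytes`, `version`); the `UuidVersionDemo` class, its `hexDigits` helper, and the name "javaguide" are just illustrations. It shows that `randomUUID()` yields version 4, while `nameUUIDFromBytes()` yields a deterministic version 3 UUID.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class UuidVersionDemo {
    // Name-based UUID: the same name always maps to the same UUID.
    static UUID fromName(String name) {
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }

    // The 32 hexadecimal digits without the four hyphens of the 8-4-4-4-12 form.
    static String hexDigits(UUID uuid) {
        return uuid.toString().replace("-", "");
    }

    public static void main(String[] args) {
        UUID random = UUID.randomUUID();
        UUID named = fromName("javaguide");
        System.out.println(random + " -> version " + random.version());
        System.out.println(named + " -> version " + named.version());
        System.out.println("same name, same UUID: " + named.equals(fromName("javaguide")));
        System.out.println("hex digits: " + hexDigits(random).length());
    }
}
```

Random UUIDs differ on every run, so only the version numbers and lengths are stable across runs.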

In addition, the variant field also has 4 different values, each with its own meaning. I won't cover them here, since you rarely need to pay attention to them.

When you do need them, see Wikipedia's introduction to the UUID variant.

As the introduction above shows, UUIDs can be treated as unique because their generation rules draw on elements such as the MAC address, timestamp, namespace, random or pseudo-random numbers, and clock sequence. A UUID generated by a computer under these rules is, for practical purposes, extremely unlikely to be repeated.

Although UUID can be globally unique, we rarely use it.

For example, it is very inappropriate to use UUID as the primary key of the MySQL database:

A database primary key should be as short as possible, and a UUID consumes a relatively large amount of storage (32 hexadecimal characters, 128 bits).
UUIDs are unordered. Under the InnoDB engine, unordered primary keys seriously hurt database performance.

Finally, let's briefly analyze the advantages and disadvantages of UUID (you may be asked about them in an interview!):

Advantages: Generation is relatively fast, and it is easy to use.
Disadvantages: Large storage space consumption (32 hexadecimal characters, 128 bits); insecure (algorithms that generate the UUID from the MAC address leak the MAC address); unordered (not auto-incrementing); no specific business meaning; and the duplicate ID problem has to be handled (when the machine time is wrong, duplicate IDs may be generated).

Snowflake (Snowflake Algorithm)

Snowflake is Twitter's open-source distributed ID generation algorithm. A Snowflake ID is a 64-bit binary number. These 64 bits are divided into several parts, and the data stored in each part has a specific meaning:

Bit 0: The sign bit (marks positive or negative). It is always 0 and can be ignored.
Bits 1~41: 41 bits in total, used for the timestamp in milliseconds, which can cover 2^41 milliseconds (about 69 years).
Bits 42~51: 10 bits in total. Generally, the first 5 bits represent the data center ID and the last 5 bits represent the machine ID (a real project can adjust this as needed). This distinguishes nodes in different clusters or data centers.
Bits 52~63: 12 bits in total, used for the sequence number. The sequence number auto-increments and determines the maximum number of IDs a single machine can generate per millisecond (2^12 = 4096), which means a single machine can generate up to 4096 unique IDs per millisecond.
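The bit layout above can be sketched in a few lines of Java. This is a simplified illustration, not Twitter's original implementation: the class name, the custom epoch (2021-01-01 UTC), and the decision to throw when the clock moves backwards are all assumptions made for the example.

```java
class SnowflakeSketch {
    private static final long EPOCH = 1609459200000L; // assumed custom epoch: 2021-01-01T00:00:00Z
    private static final int SEQUENCE_BITS = 12;
    private static final int WORKER_BITS = 5;
    private static final int DATACENTER_BITS = 5;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1; // 4095
    private static final long MAX_NODE = (1L << WORKER_BITS) - 1;       // 31
    private static final int WORKER_SHIFT = SEQUENCE_BITS;
    private static final int DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS;
    private static final int TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS + DATACENTER_BITS;

    private final long datacenterId;
    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    SnowflakeSketch(long datacenterId, long workerId) {
        if (datacenterId < 0 || datacenterId > MAX_NODE || workerId < 0 || workerId > MAX_NODE) {
            throw new IllegalArgumentException("datacenterId and workerId must be in 0..31");
        }
        this.datacenterId = datacenterId;
        this.workerId = workerId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Simplest possible reaction to clock rollback: refuse to generate IDs.
            throw new IllegalStateException("Clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // All 4096 sequence numbers of this millisecond are used: wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {
                    Thread.onSpinWait();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << TIMESTAMP_SHIFT)
                | (datacenterId << DATACENTER_SHIFT)
                | (workerId << WORKER_SHIFT)
                | sequence;
    }

    static long datacenterOf(long id) { return (id >> DATACENTER_SHIFT) & MAX_NODE; }
    static long workerOf(long id) { return (id >> WORKER_SHIFT) & MAX_NODE; }
    static long sequenceOf(long id) { return id & MAX_SEQUENCE; }

    public static void main(String[] args) {
        SnowflakeSketch generator = new SnowflakeSketch(3, 7);
        long id = generator.nextId();
        System.out.println(id + " datacenter=" + datacenterOf(id) + " worker=" + workerOf(id));
    }
}
```

Because the timestamp sits in the highest bits, IDs from one generator increase over time, and the data center and machine bits keep different nodes from colliding.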

If you want to use the Snowflake algorithm, you generally don't need to reinvent the wheel yourself. There are many open source implementations based on the Snowflake algorithm, such as Meituan's Leaf and Baidu's UidGenerator, and these open source implementations optimize the original Snowflake algorithm.

In addition, in actual projects, we generally also modify the Snowflake algorithm. The most common thing is to add business type information to the ID generated by the Snowflake algorithm.

Let's take a look at the advantages and disadvantages of the Snowflake algorithm:

Advantages : The generation speed is relatively fast, the generated ID is increased in an orderly manner, and it is more flexible (the Snowflake algorithm can be simply modified, such as adding business ID)
Disadvantages : Need to solve the problem of duplicate IDs (depending on time, when the machine time is not correct, it may cause duplicate IDs).

Open source framework

UidGenerator (Baidu)

UidGenerator is a unique ID generator based on Snowflake (Snowflake algorithm) open sourced by Baidu.

However, UidGenerator improves on Snowflake, and the unique IDs it generates are composed as follows.

As you can see, the composition differs from that of the unique IDs generated by the original Snowflake algorithm. Moreover, all of the above parameters can be customized.

(Figure: the introduction from the official UidGenerator documentation.)

UidGenerator has basically not been maintained since 2018, so I won't cover it further here. If you want to know more, see UidGenerator's official introduction.

Leaf (Meituan)

Leaf is a distributed ID solution open sourced by Meituan. The project name Leaf comes from a saying of the German philosopher and mathematician Leibniz: "There are no two identical leaves in the world". It is a really good name with a rather literary flair!

Leaf provides two modes for generating distributed IDs: the number segment mode and the Snowflake mode. It also supports double number segments and solves the system clock rollback problem of Snowflake IDs. However, its solution to the clock problem has a weak dependency on ZooKeeper.

Leaf was born to solve the problem of various and unreliable methods for generating distributed IDs in various business lines of Meituan.

Leaf improves the original number segment mode. For example, it adds a double number segment so that threads requesting IDs are not blocked while a new segment is being fetched. Put simply, before one segment is used up, Leaf proactively fetches the next segment in advance (picture from Meituan's official article "Leaf-Meituan Dianping Distributed ID Generation System").

According to the project README, on a 4C8G VM (4 cores, 8 GB of memory) the stress test reached a QPS of nearly 50,000, with a TP999 of 1 ms.

Tinyid (Didi)

Tinyid is a unique ID generator based on the database number segment model open sourced by Didi.

The principle of the database number segment mode has been introduced above. So what are the highlights of Tinyid?

In order to clarify this problem, let's take a look at a simple architecture scheme based on the database number segment model. (The picture comes from Tinyid's official wiki: "Tinyid Principle Introduction" )

In this architecture mode, we apply for a unique ID from the issuer service through HTTP requests. The load balancing router will send our request to one of the tinyid-servers.

What's wrong with this scheme? In my opinion (the official Tinyid wiki also covers this), there are two main problems:

While a new number segment is being fetched, obtaining a unique ID is relatively slow.
The DB must be kept highly available, which is cumbersome and resource intensive.

In addition, HTTP calls also have network overhead.

The principle of Tinyid is relatively simple, and its architecture is shown in the figure below:

Compared with the simple architecture scheme based on the database number segment mode, the Tinyid scheme mainly makes the following optimizations:

Double number segment cache: Obtaining a unique ID would otherwise be slow while a new segment is being fetched. To avoid this, once the current segment in Tinyid has been used up to a certain point, the next segment is loaded asynchronously, so there is always a usable segment in memory.
Multi-DB support: Multiple DBs are supported, and each DB can generate unique IDs, which improves availability.
tinyid-client: IDs are generated purely locally with no HTTP request overhead, which greatly improves performance and availability.
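The double number segment cache can be sketched roughly as below. This is not Tinyid's or Leaf's actual code: the class, the `Supplier` that stands in for "fetch the next segment from the DB", and the preload threshold expressed as a percentage are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class DoubleSegmentCache {
    // A segment covers the IDs start..end, both inclusive.
    static final class Segment {
        final long start;
        final long end;

        Segment(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    private final Supplier<Segment> segmentLoader; // hypothetical: claims the next segment from the DB
    private final int preloadPercent;              // usage level that triggers the background load
    private Segment current;
    private long nextId;
    private CompletableFuture<Segment> next;       // null until a preload has been started

    DoubleSegmentCache(Supplier<Segment> segmentLoader, int preloadPercent) {
        this.segmentLoader = segmentLoader;
        this.preloadPercent = preloadPercent;
        this.current = segmentLoader.get();
        this.nextId = current.start;
    }

    synchronized long nextId() {
        if (nextId > current.end) {
            // Switch to the preloaded segment; this waits only if loading has not finished yet.
            if (next == null) {
                next = CompletableFuture.supplyAsync(segmentLoader);
            }
            current = next.join();
            next = null;
            nextId = current.start;
        }
        long id = nextId++;
        long used = id - current.start + 1;
        long size = current.end - current.start + 1;
        if (next == null && used * 100 >= size * preloadPercent) {
            next = CompletableFuture.supplyAsync(segmentLoader); // load the next segment in the background
        }
        return id;
    }

    public static void main(String[] args) {
        AtomicLong currentMaxId = new AtomicLong(0); // stand-in for the current_max_id column
        Supplier<Segment> loader = () -> {
            long end = currentMaxId.addAndGet(100);
            return new Segment(end - 99, end);
        };
        DoubleSegmentCache cache = new DoubleSegmentCache(loader, 20);
        for (int i = 0; i < 3; i++) {
            System.out.println(cache.nextId());
        }
    }
}
```

With a single cache instance the IDs stay consecutive across segment switches, and callers only wait when the background load has not finished by the time the current segment runs out.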

I won't analyze the advantages and disadvantages of Tinyid here. You can work them out from the advantages and disadvantages of the database number segment mode together with the principle of Tinyid.

Summary of Distributed ID Generation Scheme

In this article, I have basically summarized the most common distributed ID generation schemes.

Postscript

Finally, I recommend a very good open-source Java tutorial project: JavaGuide. I created the JavaGuide project when I was preparing for autumn recruitment interviews in my junior year. The project currently has 100k+ stars. Related reading: "1049 days, 100K! Simple review!".

It is very helpful for you to learn Java and prepare for the interview in the direction of Java! As the author said, this is a: Java learning + interview guide covering the core knowledge that most Java programmers need to master!

I am Guide brother. I embrace open source and like cooking. I am the author of the open-source project JavaGuide (Github: Snailclimb). Over the next few years, I hope to keep improving JavaGuide and help more friends who are learning Java. Let's encourage each other!

In addition to the methods described above, middleware such as ZooKeeper can also help us generate unique IDs. There is no silver bullet, so you must choose the most suitable scheme based on your actual project.
