[Miscellaneous Talk] How to choose the primary key ID

1. How does the generation position affect the selection?

Database round trip time

When using autoincrement, the ID is generated by the database when performing an INSERT operation; while the UUID can be generated at the application layer.

Consider a scenario like this:

A method requires inserting two entities A and B. The data of B needs to refer to the ID of A.

If using the auto-increment ID, two network requests must be made with the database:

1. Insert A first to get the ID of A;

2. Then use the ID of A to build B and insert B.

If UUID is used, the IDs of A and B can be generated at the application layer at the same time, and then the two entities are submitted to the database at one time, thereby reducing the number of network round trips and improving efficiency.

ID conflict

The self-increment number will not conflict within one table, what about the two tables?

When the business develops to the point where the database and table are divided, the self-increment number can no longer be used as the ID, because the two libraries/tables will generate the same ID. This kind of ID conflict and no errors will lead to data confusion.

Although UUID is globally unique, extreme situations will also conflict, so the application layer needs to find a way to handle it to ensure the uniqueness of the ID, such as IID generation combined with machine ID.

2. How does order and disorder affect selection?

Whether the ID is ordered has an important impact on database performance, especially using B-Tree index.

Ordered ID:For B-Tree index, the key values are ordered, so the data inserted continuously will be on one or a few nodes, which means that the database only needs to read and write I/O to a small number of physical blocks, and the I/O range. Small, good performance.

eg: Insert 20 pieces of data in succession, all on one node, just load a physical block.

Unordered ID:The key values are disordered, so the data inserted continuously may be randomly distributed on each node, and the database needs to read and write a large number of physical blocks, which has poor performance.

eg: 20 pieces of data are inserted continuously, and after calculation, they are distributed on 20 different nodes, and 20 physical blocks need to be loaded.

Obviously, the autoincrement ID is ordered, so in most cases it performs better than the unordered UUID.

It is worth mentioning that UUID also has multiple versions, some of which are ordered, such as UUIDv7 and Snowflake algorithm (Snowflake)

The UUID used by Java JDK is v4 version and is unordered.

How does size affect the selection?

ID size refers to the storage space occupied. The autoincrement number is usually INT or BIGINT, which is 4 bytes and 8 bytes respectively. UUID is generally composed of 32 characters and accounts for 16 bytes.

The height of the tree

In B-Tree, for fixed-size nodes, the key value size will affect the number of key values that a node can store (here is the number of IDs). If the key values stored by a single node will affect the number of nodes, and thus the height of the tree.

The higher the tree, the more nodes you need to access in one query, and the slower the query.

Memory overhead

The same number of records, the size of the ID will affect the cache (index and data rows) of the database engine. For the same size cache space, the larger the single ID, the less data it can cache.

For example: InnoDB uses innodb_buffer_pool_size to control cache pool size

Summarize

In general, ordered UUID (UUIDv7, snowflake algorithm) is more suitable for distributed services, and self-increasing ID is more suitable for stand-alone services.

But it also depends on the specific business. For example, if the data in the sub-table does not have cross-table access, then the ID conflict will conflict without any impact.

Moreover, ID policies can also be used in a mixed manner. Different tables in a database can use different ID generation strategies.