Vlad Mihalcea:The best UUID type for a database Primary Key(转)

2023-09-12 数据库, database, PrimaryKey

0 评论 • 0 浏览

转载Vlad Mihalcea:"The best UUID type for a database Primary Key"

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.

So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!

想象一下，有一个可以自动检测 JPA 和 Hibernate 性能问题的工具。那不是很棒吗？

嗯，Hypersistence Optimizer就是这个工具！它可以与 Spring Boot、Spring Framework、Jakarta EE、Java EE、Quarkus 或 Play Framework 配合使用。

因此，享受将时间花在您喜欢的事情上，而不是在周六晚上修复生产系统中的性能问题！

Introduction

In this article, we are going to see what UUID (Universally Unique Identifier) type works best for a database column that has a Primary Key constraint.

While the standard 128-bit random UUID is a very popular choice, you’ll see that this is a terrible fit for a database Primary Key column.

在本文中，我们将了解哪种 UUID（通用唯一标识符）类型最适合具有主键约束的数据库列。

虽然标准 128 位随机 UUID 是一个非常流行的选择，但您会发现它非常适合数据库主键列。

Standard UUID and database Primary Key

A universally unique identifier (UUID) is a 128-bit pseudo-random sequence that can be generated independently without the need for a single centralized system in charge of ensuring the identifier’s uniqueness.

The RFC 4122 specification defines five standardized versions of UUID, which are implemented by various database functions or programming languages.

For instance, the UUID() MySQL function returns a version 1 UUID number.

And the Java UUID.randomUUID() function returns a version 4 UUID number.

通用唯一标识符 (UUID) 是一个 128 位伪随机序列，可以独立生成，无需单个集中式系统负责确保标识符的唯一性。

RFC 4122 规范定义了UUID 的五个标准化版本，它们由各种数据库函数或编程语言实现。

例如，UUID()MySQL 函数返回版本 1 UUID 编号。

JavaUUID.randomUUID()函数返回版本 4 UUID 号。

For many devs, using these standard UUIDs as a database identifier is very appealing because:

The ids can be generated by the application. Hence no central coordination is required.
The chance of identifier collision is extremely low.
The id value being random, you can safely send it to the UI as the user would not be able to guess other identifier values and use them to see other people’s data.

对于许多开发人员来说，使用这些标准 UUID 作为数据库标识符非常有吸引力，因为：

id 可以由应用程序生成。因此不需要中央协调。
标识符冲突的可能性极低。
id 值是随机的，您可以安全地将其发送到 UI，因为用户无法猜测其他标识符值并使用它们来查看其他人的数据。

But, using a random UUID as a database table Primary Key is a bad idea for multiple reasons.

First, the UUID is huge. Every single record will need 16 bytes for the database identifier, and this impacts all associated Foreign Key columns as well.

Second, the Primary Key column usually has an associated B+Tree index to speed up lookups or joins, and B+Tree indexes store data in sorted order.

但是，出于多种原因，使用随机 UUID 作为数据库表主键并不是一个好主意。

首先，UUID 很大。每条记录都需要 16 个字节作为数据库标识符，这也会影响所有关联的外键列。

其次，主键列通常有一个关联的 B+Tree 索引来加速查找或连接，并且 B+Tree 索引按排序顺序存储数据。

However, indexing random values using B+Tree causes a lot of problems:

Index pages will have a very low fill factor because the values come randomly. So, a page of 8kB will end up storing just a few elements, therefore wasting a lot of space, both on the disk and in the database memory, as index pages could be cached in the Buffer Pool.
Because the B+Tree index needs to rebalance itself in order to maintain its equidistant tree structure, the random key values will cause more index page splits and merges as there is no pre-determined order of filling the tree structure.

然而，使用 B+Tree 索引随机值会导致很多问题：

索引页的填充因子非常低，因为这些值是随机出现的。因此，8kB 的页面最终将只存储几个元素，因此浪费了磁盘和数据库内存上的大量空间，因为索引页可以缓存在缓冲池中。
由于B+Tree索引需要重新平衡自身以维持其等距树结构，因此随机键值将导致更多索引页分裂和合并，因为没有预先确定填充树结构的顺序。

If you’re using SQL Server or MySQL, then it’s even worse because the entire table is basically a clustered index.

如果您使用的是 SQL Server 或 MySQL，那么情况会更糟，因为整个表基本上都是聚集索引。

And all these problems will affect the secondary indexes as well because they store the Primary Key value in the secondary index leaf nodes.

所有这些问题也会影响二级索引，因为它们将主键值存储在二级索引叶节点中。

In fact, almost any database expert will tell you to avoid using the standard UUIDs as database table Primary Keys:

事实上，几乎所有数据库专家都会告诉您避免使用标准 UUID 作为数据库表主键：

TSID – Time-Sorted Unique Identifiers

If you plan to store UUID values in a Primary Key column, then you are better off using a TSID (time-sorted unique identifier).

One such implementation is offered by the Hypersistence TSID OSS library, which provides a 64-bit TSID that’s made of two parts:

a 42-bit time component
a 22-bit random component

The random component has two parts:

a node identifier (0 to 20 bits)
a counter (2 to 22 bits)

如果您计划将 UUID 值存储在主键列中，那么最好使用 TSID（按时间排序的唯一标识符）。

Hypersistence TSID OSS 库提供了一种这样的实现，它提供了一个由两部分组成的 64 位 TSID：

42 位时间分量
22 位随机分量

随机分量有两部分：

节点标识符（0 到 20 位）
计数器（2 至 22 位）

You can create a TSID object like this:

TSID tsid = TSID.fast();
From the Tsid object, we can extract the following values:

the 64-bit numerical value,
the Crockford’s Base32 String value that encodes the 64-bit value,
the Unix milliseconds since epoch that is stored in the 42-bit sequence
To visualize these values, we can print them into the log:

long tsidLong = tsid.toLong();
String tsidString = tsid.toString();
long tsidMillis = tsid.getUnixMilliseconds();
 
LOGGER.info(
    "TSID numerical value: {}",
    tsidLong
);
 
LOGGER.info(
    "TSID string value: {}",
    tsidString
);
 
LOGGER.info(
    "TSID time millis since epoch value: {}",
    tsidMillis
);
And we get the following output:

TSID numerical value: 470157774998054993
TSID string value: 0D1JP05FJBA2H
TSID time millis since epoch value: 1689931148668
When generating ten values:

for (int i = 0; i < 10; i++) {
    LOGGER.info(
        "TSID numerical value: {}",
        TSID.fast().toLong()
    );
}
We can see that the values are monotonically increasing:

TSID numerical value: 470157775044192338
TSID numerical value: 470157775044192339
TSID numerical value: 470157775044192340
TSID numerical value: 470157775044192341
TSID numerical value: 470157775044192342
TSID numerical value: 470157775044192343
TSID numerical value: 470157775044192344
TSID numerical value: 470157775048386649
TSID numerical value: 470157775048386650
TSID numerical value: 470157775048386651
Awesome, right?

Using the TSID in your application

Because the default TSID factories come with a synchronized random value generator, it’s better to use a custom TsidFactory that provides the following optimizations:

It can generate the random values using a ThreadLocalRandom, therefore avoiding Thread blocking on synchronized blocks
It can use a small number of node bits, therefore leaving us more bits for the random-generated numerical value.

So, we can define the following TsidUtil that provides us a TsidFactory to use whenever we want to generate a new TSID object:

由于默认 TSID 工厂附带同步随机值生成器，因此最好使用 TsidFactory提供以下优化的自定义：

它可以使用 a 生成随机值 ThreadLocalRandom，从而避免同步块上的线程阻塞
它可以使用少量的节点位，因此为随机生成的数值留下更多位。

因此，我们可以定义以下内容 TsidUtil，以便在我们 TsidFactory想要生成新 TSID对象时使用：

public class TsidUtils {
    public static final String TSID_NODE_COUNT_PROPERTY =
        "tsid.node.count";
    public static final String TSID_NODE_COUNT_ENV =
        "TSID_NODE_COUNT";
 
    public static TSID.Factory TSID_FACTORY;
 
    static {
        String nodeCountSetting = System.getProperty(
            TSID_NODE_COUNT_PROPERTY
        );
        if (nodeCountSetting == null) {
            nodeCountSetting = System.getenv(
                TSID_NODE_COUNT_ENV
            );
        }
 
        int nodeCount = nodeCountSetting != null ?
            Integer.parseInt(nodeCountSetting) :
            256;
 
        TSID_FACTORY = getTsidFactory(nodeCount);
    }
 
    private TsidUtils() {
        throw new UnsupportedOperationException(
            "TsidUtils is not instantiable!"
        );
    }
 
    public static TSID randomTsid() {
        return TSID_FACTORY.generate();
    }
 
    public static TSID.Factory getTsidFactory(int nodeCount) {
        int nodeBits = (int) (Math.log(nodeCount) / Math.log(2));
 
        return TSID.Factory.builder()
            .withRandomFunction(
                TSID.Factory.THREAD_LOCAL_RANDOM_FUNCTION
            )
            .withNodeBits(nodeBits)
            .build();
    }
 
    public static TSID.Factory getTsidFactory(int nodeCount, int nodeId) {
        int nodeBits = (int) (Math.log(nodeCount) / Math.log(2));
 
        return TSID.Factory.builder()
            .withRandomFunction(TSID.Factory.THREAD_LOCAL_RANDOM_FUNCTION)
            .withNodeBits(nodeBits)
            .withNode(nodeId)
            .build();
    }
}
When running this test case that operates 16 threads on 2 nodes:

int threadCount = 16;
int iterationCount = 100_000;
 
CountDownLatch endLatch = new CountDownLatch(threadCount);
 
ConcurrentMap<TSID, Integer> tsidMap = new ConcurrentHashMap<>();
 
long startNanos = System.nanoTime();
 
AtomicLong collisionCount = new AtomicLong();
 
int nodeCount = 2;
 
for (int i = 0; i < threadCount; i++) {
    final int threadId = i;
    new Thread(() -> {
        TSID.Factory tsidFactory = TsidUtils.getTsidFactory(
            nodeCount,
            threadId % nodeCount
        );
 
        for (int j = 0; j < iterationCount; j++) {
            TSID tsid = tsidFactory.generate();
            Integer existingTsid = tsidMap.put(
                tsid,
                (threadId * iterationCount) + j
            );
            if(existingTsid != null) {
                collisionCount.incrementAndGet();
            }
        }
 
        endLatch.countDown();
    }).start();
}
LOGGER.info("Starting threads");
endLatch.await();
 
LOGGER.info(
    "{} threads generated {} TSIDs in {} ms with {} collisions",
    threadCount,
    new DecimalFormat("###,###,###").format(
        threadCount * iterationCount
    ),
    TimeUnit.NANOSECONDS.toMillis(
        System.nanoTime() - startNanos
    ),
    collisionCount
);

We get the following result:

16 threads generated 1,600,000 TSIDs<span> </span>``in `1179 ms with 0 collisions`

The thread id is included in the random part, so up to 256 threads there shouldn’t be any collisions when generating the TSID.

In the unlikely case that you get a collision and you end up with a constraint violation, you can use the @Retry annotation to retry the business method and avoid the collision to propagate to the client.

Check out this article for more details about how you can use the @Retry annotation.

我们得到以下结果：

16 threads generated 1,600,000 TSIDs<span> </span>``in `1179 ms with 0 collisions`

线程ID包含在随机部分中，因此最多256个线程在生成TSID时不应该有任何冲突。

在不太可能发生的情况下，您会遇到冲突并最终违反约束，您可以使用注释 @Retry来重试业务方法并避免冲突传播到客户端。

请查看这篇文章，了解有关如何使用 @Retry注释的更多详细信息。

Using the TSID as a Primary Key value

Since the TSID is a time-sorted 64-bit number, the best way to store it in the database is to use a bigint column type:

由于 TSID 是按时间排序的 64 位数字，因此将其存储在数据库中的最佳方法是使用 bigint列类型：

CREATE TABLE post (
    id bigint NOT NULL,
    title varchar(255),
    PRIMARY KEY (id)
)
And on the application side, you need to use a 64-bit number, like the Java Long object type:

@Entity
@Table(name = "post")
public class Post {
 
    @Id
    private Long id;
 
    private String title;
   
}

That’s it!

Conclusion

Using the standard UUID as a Primary Key value is not a good idea unless the first bytes are monotonically increasing.

For this reason, using a time-sorted TSID is a much better idea. Not only that it requires half the number of bytes as a standard UUID, but it fits better as a B+Tree index key.

While SQL Server offers a time-sorted GUID via the NEWSEQUENTIALID, the size of the GUID is 128 bits, so it’s twice as large as a TSID.

The same issue is with version 7 of the UUID specification, which provides a time-sorted UUID. However, it uses the same canonical format (128 bits) which is way too large. The impact of the Primary Key column storage is amplified by every referencing Foreign Key columns.

If all your Primary keys are 128-bit UUIDs, then the Primary Key and Foreign Key indexes are going to require a lot of space, both on the disk and in the database memory, as the Buffer Pool holds both table and index pages.

使用标准 UUID 作为主键值并不是一个好主意，除非第一个字节单调递增。

因此，使用按时间排序的 TSID 是一个更好的主意。它不仅需要标准 UUID 一半的字节数，而且更适合作为 B+Tree 索引键。

虽然 SQL Server 通过提供按时间排序的 GUID NEWSEQUENTIALID，但 GUID 的大小为 128 位，因此它是 TSID 的两倍。

UUID 规范第 7 版也存在同样的问题，它提供了按时间排序的 UUID。然而，它使用相同的规范格式（128 位），但太大了。每个引用外键列都会放大主键列存储的影响。

如果所有主键都是 128 位 UUID，那么主键和外键索引将需要大量磁盘和数据库内存空间，因为缓冲池同时保存表页和索引页。

标题：Vlad Mihalcea:The best UUID type for a database Primary Key(转)
作者：yazong
地址：https://blog.llyweb.com/articles/2023/09/12/1694459148053.html

NEWER 新一篇 OLDER 日志：每个软件工程师都应该知道的有关实时数据的统一抽象