The perils of UUID primary keys in SQLite

36 points by andersmurphy a day ago on lobsters | 26 comments

lifepillar | a day ago

Nice. It would be interesting to see the numbers for rowid tables where the integer key is random rather than monotonically increasing.

One thing the article doesn't mention, but I think it's important, is that the primary index on rowid tables is a B+-tree, while for “without rowid” tables it is a B-tree. So, generally speaking, the latter is not ideal when the average record's size exceeds a certain threshold because the internal nodes of the index store full records (iirc, SQLite's manual suggests 1/20th of the page size as a rule of thumb).

[OP] andersmurphy | a day ago

Thanks. I'll run those tests later today and report back/update the article.

jklowden | a day ago

All that work to measure UUID performance, and no consideration natural keys.

Surrogate keys of any kind — integer or UUID or other — add complexity, add no information, and disguise normalization errors.

The author cites “avoid duplicates” as a reason to reach for a UUID. It is not. Every row in every table on every database should have at least one key: a group of columns that uniquely identify the row. Any database without such keys harbors duplicate information and is thus subject to logical inconsistency. A database having such keys has no need of surrogates. Assertions of “needed for performance” — provided natural keys are already present and enforced — require proof because they represent additional cost and otherwise needless complexity.

Maybe I'm just not very imaginative, but when working with data that comes from the real world, I generally struggle to find any true natural keys there. Seemingly unique values tend not to be as unique as I'd hoped, or values that I'd expected to be immutable end up needing to be changed. Surrogate keys, on the other hand, mean that I get to define what identity looks like in my system, without having to rely on how anyone else does (or usually doesn't do) identity.

There are exceptions - if I'm defining the whole model, and none of the data comes from the real world, as it were, then natural keys make more sense. But I feel like surrogate keys come in handy whenever I'm trying to design a schema to fit some real-world data that has probably never been fully normalised.

[OP] andersmurphy | 14 hours ago

The author cites “avoid duplicates” as a reason to reach for a UUID

Eh? Those words don't appear in the blog post. In fact I never mention why you would use UUIDs at all.

An integer surrogate monotonous key has a bunch of immediate practical uses:

  • Allows you to refer to any row using an integer. Which simplifies access enormously.
  • Provides a human friendly way to refer it. Easy enough to remember, dead simple to write in queries.
  • It does carry information on its own specially if it's monotonous. This is redundancy but it is still true.
  • Optimizes lookup up speed. It's the optimal case for being index using Brees, bitmaps. Etc.

People use UUIDs mostly because they are confused. Their argument is usually to obfuscate the key and make it impossible to guess. Why they think one must impose these on the primary key and not a dedicated identifier for those purposes, it's something I gave up understanding. I guess they are confused somehow?

Student | a day ago

Are people really using uuids as their primary keys? What’s the benefit rather than having the uuid as an auxiliary key for when you want a uuid?

markerz | a day ago

It's for scale. If you want to generate a unique ID in a distributed manner (many computers, many datacenters around the world) at a high rate (e.g., S3 uploads) you don't want to lock on a single incrementing integer. Locks are slow to sync globally.

GUID and UUIDs solve this by their construction.

  • v1 and v6 encode (machine ID + timestamp), so it's kind of an auto-incrementing integer namespaced to each machine

The confusion I see many people have is assuming UUID is random. That's only true for v4, and choosing that has unfortunately costs. There are many times where you want some determinism (v3, v5, v7) because it organizes your data (for perf or collision guarantees).

bitshift | a day ago

Regarding scale: there's also a human factor. You absolutely can use an integer primary key with an auxiliary UUID. But if you need that UUID for some correctness reason, such as preventing duplicates in a distributed system, you need to be more disciplined about never exposing those integer IDs outside of your database.

It's not impossible to do integers on the inside, UUIDs on the outside. But having two different IDs does tempt Murphy's Law.

ubernostrum | 19 hours ago

Honestly, though, at the point where you decide you need a solution like this for "scale"/distributed systems, it's probably better to just switch to something like Snowflake IDs that are purpose-built for that use case.

UUIDv7 is the result of standardizing all the various somethings like Snowflake IDs.

ubernostrum | 49 minutes ago

As I understand it, UUID7 uses randomness to make collisions unlikely, where things like Snowflake use guaranteed-unique node identifiers to make collisions impossible.

bitshift | 5 minutes ago

I guess the distinction is that UUID v7 is still going for "universally unique", whereas Snowflake is "unique within a specific system"?

From skimming Wikipedia it looks like Snowflake produces 64-bit IDs, which might give a minor boost to storage/performance versus UUIDs (which kind of have to be at least 128-bit in order to have a high probability of universal uniqueness).

[OP] andersmurphy | a day ago

A random UUID (or any random values in a uniform distribution) on an auxiliary key is still going to slow down your inserts as although it's not the clustered index it's still random inserts into a B-tree for that index.

bitshift | a day ago

Isn't that the case for adding an index in general, though?

I would assume that indexing a v4 UUID column would slow down inserts, but that the slowdown would be on the same order of magnitude as indexing an email address or a username. Those aren't uniformly distributed, but the inserts would be: any given insert is going to be equally likely to fall above the median element as it is to fall below it.

[OP] andersmurphy | a day ago

Really depends on the distribution. A lot of things follow a normal distribution in which case there's a lot less rebalancing going on.

bitshift | 21 hours ago

This didn't match my intuition, so I wrote an ~80-line benchmark program in Python. It's similar to your experiment (inserting 10 million rows in batches of 1 million), except instead of benchmarking different kinds of primary keys, I'm benchmarking different kinds of values in an auxiliary indexed column:

  • As a baseline, indexing a random hex string that mostly increases in value (mock v7 UUID): the final insert batch was relatively fast, 1619–1862 ms. Which makes sense, because inserting at the end of the tree ought to be fast.
  • Indexing a normally-distributed hex string (many values close to "7fff…"/"8000…" while having enough range to avoid most duplicates): slow, 3666–3755 ms.
  • Indexing a uniformly-distributed hex string (mock v4 UUID): about as slow, 3556–3746 ms.

Granted, normally-distributed strings are not a great approximation for columns like email or username. But at least in this limited test, I didn't see any noticeable perf difference from changing the distribution.

[OP] andersmurphy | 14 hours ago

Oh that's super interesting and definitely fodder for my next post. Thank you! My intuition was clearly completely off.

I guess that might actually be intuitive given some more thought as you still have a degree of randomness/balancing going on in the middle of the distribution. So it should actually be similar to a uniform distribution?

bitshift | an hour ago

At least in my test, clustered values and uniform values were similarly slow. There are lots of things I didn't try, such as if there are many duplicates, or if there are multiple tight clusters.

As far as distributions go, I suspect the bigger factor is whether all values come from the same distribution. I suspect UUID v7 is also normally or lognormally distributed, but the active ingredient that makes it faster is that the bell curve is constantly shifting to the right: almost every new value is to the right of the existing B-tree. Technically, each value comes from a different distribution.

Whereas if you had a distribution that tended to produce more new timestamps than old ones (such as a constant minus a lognormal), but the distribution's parameters stayed constant, I suspect it would perform similarly to uniform inserts. But I haven't tested it!

[OP] andersmurphy | 21 minutes ago

Interesting! So I've just done a follow up post. Where I do a basic pre sort of the batch before insertion to make the data constantly shift to the right as you described (at least within a batch) and it does improve performance. Which gets interesting as with a single writer batching is where you end up anyway

https://andersmurphy.com/2026/06/07/sqlite-improving-performance-with-pre-sort.html

bryce | 8 hours ago

You can generate the primary keys at different times and places without coordination and be reasonably sure they’ll be unique. The reason for UUID over big RNG is that there’s lots of writing, defaults, library support, and documentation for it by that name.

silentbicycle | a day ago

Version 7 UUIDs have a 48-bit timestamp at the head, so they aren't randomly distributed like this. That will also help with the excessive paging and rebalancing.

[OP] andersmurphy | a day ago

Yes UUID7 is what the article explores.

chiply | 5 hours ago

Beautiful! Very intuitive as well. I often find that database topics are difficult to comprehend, but you made this very intuitive. How did you create the flame graph?

[OP] andersmurphy | 3 hours ago

Thank you! The flamegraphs were made with clj-async-profiler (clojure/JVM specific).

https://github.com/clojure-goes-fast/clj-async-profiler

chiply | 3 hours ago

Clojure to boot! I love lisps, right on :)