Not too long ago, the Honeybadger team was debating whether to use ULIDs or UUIDs for primary keys. Ben, our dev-ops master, mentioned that he wished he'd used ULIDs instead of UUIDs for a particular system we built.
Like any seasoned engineer in software development, my initial reaction was to mumble something non-committal and then sneak over to Google to try to figure out what the hell a ULID is. UUIDs are a bit more common, but I hadn't even heard of ULIDs.
Two hours later, I emerged with a thousand-yard stare and the realization that the world of unique identifiers is larger and more wondrous than I ever could have imagined.
Before we get started with the UUID vs ULID debate, let's go back to the basics and discuss what UUIDs are.
What's the problem with "regular" IDs?
Most web applications that use databases default to numeric IDs that increment automatically as a primary key and unique identifier. For example, in Rails, you're probably used to seeing behavior like this:
p1 = Person.create!
p1.id
# => 1
p2 = Person.create!
p2.id
# => 2
Rails leans on integers for primary keys by default, which has some advantages but also challenges.
The database can generate sequential IDs because it stores a counter that increments each time a new record is created. The sequential ID is usually an integer, but it's sometimes a BIGINT for tables you expect to grow very large.
This pattern can also be seen outside of databases. Sometimes we need to assign identifiers manually, and we might store a custom counter in something faster than a database (like a Redis instance).
Sequential IDs are easy to implement for low-volume use cases - and are relatively human-readable - but they become more problematic as volume increases for a few reasons:
- Inserts can't run fully in parallel because each one has to wait in line to receive its ID. Picking the "next" ID is inherently a one-at-a-time operation.
- Requesting a sequential ID may require a network round trip and result in slower database performance.
- It isn't easy to scale out data stores that provide sequential IDs. If you have any kind of distributed system, you have to worry about counters on different servers getting out of sync.
- If you keep the counter on just one server, it's easy for the node with the counter to become a single point of failure.
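To make the "wait in line" cost concrete, here's a minimal in-process sketch. The SequentialIdGenerator class is hypothetical and only stands in for a database sequence or a Redis counter; a Mutex plays the role of the shared lock.

```ruby
# Hypothetical stand-in for a database sequence or Redis counter.
class SequentialIdGenerator
  def initialize
    @mutex = Mutex.new
    @last_id = 0
  end

  # Only one thread at a time may bump the counter, so callers queue up here.
  def next_id
    @mutex.synchronize { @last_id += 1 }
  end
end

generator = SequentialIdGenerator.new

# Eight workers each ask for 100 IDs; every request funnels through one lock.
workers = Array.new(8) { Thread.new { Array.new(100) { generator.next_id } } }
ids = workers.flat_map(&:value)
```

The IDs come out unique and gap-free, but only because every worker shares the same counter. Move that counter across a network and each call also pays for a round trip.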
Sequential IDs also leak data, which may be a problem in some cases:
- You can easily guess the IDs of resources that may not belong to you, because they're not random numbers.
- If you create a user and its ID is 20, you know that the service has roughly 20 users.
- It makes scraping records easier.
UUIDs solve many problems with sequential IDs
A UUID, or universally unique identifier, looks a lot different from a sequential ID. UUIDs are 128-bit numbers, typically written as 32 hexadecimal digits in five hyphen-separated groups:
123e4567-e89b-12d3-a456-426655440000
A universally unique identifier is created using one of the algorithms defined in RFC 4122. UUIDs attempt to solve many of the problems that occur with sequential IDs.
For starters, you can generate UUIDs on any number of nodes without any shared state or coordination between nodes. This makes them more resilient than sequential IDs for all sorts of distributed systems. Creating a new UUID does not require knowledge of the "previous" UUID, so you can generate them in parallel more easily. With distributed systems becoming more common, this is one of the strongest arguments for UUIDs.
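As a rough illustration, the sketch below generates random UUIDs from several threads using SecureRandom.uuid from Ruby's standard library. The threads stand in for independent nodes (an assumption made for the example); the point is that none of them needs a lock or a shared counter.

```ruby
require "securerandom"

# Each thread plays the role of an independent node. There is no shared
# counter and no lock: every call to SecureRandom.uuid stands on its own.
nodes = Array.new(4) { Thread.new { Array.new(1_000) { SecureRandom.uuid } } }
uuids = nodes.flat_map(&:value)

puts uuids.first
puts "generated #{uuids.size} UUIDs, #{uuids.uniq.size} of them distinct"
```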
They're also a little less guessable than sequential IDs, which makes your systems more secure by design. Combine this with the fact that they don't divulge the number of records and the choice seems obvious. Imagine your user edit profile page looks like yourwebsite/users/793/details. Using the sequential ID in the URL like this is a common pattern, but it makes exploiting security holes easier.
If you had a bug that didn't validate that the authenticated user was the same user as the one whose details are being requested, users could see details for other users. This is a problem by itself but is exacerbated by sequential IDs, because someone could run a script to get the details of all the users by simply incrementing the ID in the URL. We'll dig into this a bit later.
So why would anyone ever not use UUIDs? The tradeoff here is that there's a small chance of two nodes independently generating the same ID. This event is called a "collision" and is obviously not ideal when IDs are meant to be unique.
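To get a feel for how small that chance is, here's a back-of-the-envelope sketch for randomly generated UUIDs. It assumes 122 random bits per ID (128 bits minus the six fixed version and variant bits) and uses the simple "number of pairs divided by number of possible values" estimate, which is only reasonable while the result is tiny. The collision_probability helper is a name invented for this example.

```ruby
# A random UUID has 122 random bits: 128 minus 4 version bits and 2 variant bits.
RANDOM_BITS = 122

# Approximate chance that at least one pair among `count` random IDs collides.
# This is the pairs / possible-values estimate, so treat it as a rough upper
# bound that is only meaningful while the result stays very small.
def collision_probability(count, bits = RANDOM_BITS)
  pairs = count * (count - 1) / 2
  pairs.fdiv(2**bits)
end

one_billion = collision_probability(1_000_000_000)
puts format("%.2e", one_billion)
```

Even with a billion IDs, the estimate stays many orders of magnitude below anything you'd notice in practice, assuming a good random source.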
The different types of UUIDs
There are five types of UUID algorithms defined in the RFC. They fall into two categories:
The first category covers the time-based and randomness-based algorithms. These produce a new UUID on every run:
- Type 4: A randomly generated ID. This is probably our best bet for new code.
- Type 1: The ID contains the host's MAC address and the current timestamp. These are often discouraged because they expose that information and are easier to guess.
- Type 2: These seem to be uncommon. They appear to be purpose-built for an antiquated form of RPC.
The second category is the name-based algorithms, which are a little different. They always produce the same UUID for a given set of inputs. The types are:
- Type 5: Uses a SHA-1 hash to generate the UUID. Recommended.
- Type 3: Uses an MD5 hash. The RFC prefers type 5 when backward compatibility isn't a concern, and MD5 is widely considered too weak for new code.
In Ruby, you can generate UUIDs via the uuidtools gem. It supports every type except the mysterious type 2. Here are a few examples of creating a new UUID with uuidtools:
# Code stolen from the uuidtools readme. :)
require "uuidtools"
# Type 1
UUIDTools::UUID.timestamp_create
# => #<UUID:0x2adfdc UUID:64a5189c-25b3-11da-a97b-00c04fd430c8>
# Type 4
UUIDTools::UUID.random_create
# => #<UUID:0x19013a UUID:984265dc-4200-4f02-ae70-fe4f48964159>
# Type 3
UUIDTools::UUID.md5_create(UUIDTools::UUID_DNS_NAMESPACE, "www.widgets.com")
# => #<UUID:0x287576 UUID:3d813cbb-47fb-32ba-91df-831e1593ac29>
# Type 5
UUIDTools::UUID.sha1_create(UUIDTools::UUID_DNS_NAMESPACE, "www.widgets.com")
# => #<UUID:0x2a0116 UUID:21f7f8de-8051-5b89-8680-0195ef798b6a>
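If you're curious what a name-based UUID involves under the hood, here's a rough stdlib-only sketch of the type 5 recipe: hash the namespace bytes plus the name with SHA-1, keep the first 16 bytes, then stamp in the version and variant bits. The uuid_v5 helper and the DNS_NAMESPACE constant are names made up for this illustration; the constant's value is the DNS namespace ID listed in RFC 4122. For real code, a maintained library is the safer choice.

```ruby
require "digest"

# DNS namespace ID from RFC 4122 (the constant name is ours).
DNS_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# Illustrative type 5 generator: the same namespace and name always yield
# the same UUID.
def uuid_v5(namespace, name)
  namespace_bytes = [namespace.delete("-")].pack("H*")
  bytes = Digest::SHA1.digest(namespace_bytes + name.b).bytes.first(16)

  bytes[6] = (bytes[6] & 0x0f) | 0x50 # set the version nibble to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # set the RFC 4122 variant bits

  hex = bytes.pack("C*").unpack1("H*")
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

first = uuid_v5(DNS_NAMESPACE, "www.widgets.com")
second = uuid_v5(DNS_NAMESPACE, "www.widgets.com")
puts first
puts first == second
```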
Now that you've had a crash course in UUIDs, you're probably wondering what the deal is with ULIDs. We'll dig into that and the UUID vs ULID argument next.
Digging into ULIDs vs UUIDs
ULIDs, Universally Unique Lexicographically Sortable Identifiers, are a useful new take on unique identifiers. The most obvious difference is that they look a little different:
01ARZ3NDEKTSV4RRFFQ69G5FAV
They are made up of two base32-encoded numbers (using Crockford's base32 alphabet): a UNIX timestamp followed by a random number. Here's the structure, as defined in the specification:
 01AN4Z07BY      79KA1307SR9X4MV3

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits
This structure is fascinating! If you recall, UUIDs rely either on timestamps or randomness, but ULIDs use both timestamps and randomness.
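Here's a minimal sketch of how an ID with that layout could be assembled in plain Ruby. The encode_base32 and generate_ulid helpers are names invented for this example, not part of any ULID library, and the sketch skips extras that real libraries may offer (such as keeping IDs ordered within the same millisecond).

```ruby
require "securerandom"

# Crockford's base32 alphabet: digits and uppercase letters minus I, L, O, and U.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Encode a non-negative integer as `length` base32 characters, padded with "0".
def encode_base32(value, length)
  chars = Array.new(length) do
    char = CROCKFORD_ALPHABET[value % 32]
    value /= 32
    char
  end
  chars.reverse.join
end

# 10 characters of millisecond timestamp followed by 16 characters (80 bits)
# of randomness.
def generate_ulid(time = Time.now)
  millis = time.to_i * 1000 + time.nsec / 1_000_000
  randomness = SecureRandom.random_number(2**80)
  encode_base32(millis, 10) + encode_base32(randomness, 16)
end

ulid = generate_ulid
puts ulid
```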
As a result, ULIDs have some interesting and unique properties:
- They are lexicographically (i.e., alphabetically) sortable.
- The timestamp is accurate to the millisecond.
- They're prettier than UUIDs :)
These open up some cool possibilities. To start, if you're partitioning your database by date, you can use the timestamp embedded in the ULID to select the correct partition. This isn't possible with most UUID algorithms.
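As a sketch of the partitioning idea, the code below decodes the first ten characters of a ULID back into a time and maps it to a partition name. The ulid_time and partition_for helpers and the "events_YYYY_MM" naming scheme are assumptions made up for this example; it also assumes well-formed, uppercase input.

```ruby
# Crockford's base32 alphabet, as used by ULIDs.
CROCKFORD_ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Decode the 10-character timestamp prefix of a ULID into a UTC Time.
# Assumes a valid, uppercase ULID; there is no error handling in this sketch.
def ulid_time(ulid)
  millis = ulid[0, 10].each_char.reduce(0) do |total, char|
    total * 32 + CROCKFORD_ALPHABET.index(char)
  end
  Time.at(millis / 1000, millis % 1000, :millisecond).utc
end

# Hypothetical naming scheme: one partition per month, such as "events_2016_07".
def partition_for(ulid)
  ulid_time(ulid).strftime("events_%Y_%m")
end

puts partition_for("01ARZ3NDEKTSV4RRFFQ69G5FAV")
```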
You can also sort by ULID instead of a separate created_at column if millisecond precision is acceptable, saving you an entirely separate database column. It's pretty neat that you can use the same attribute for a unique identifier and a timestamp!
There are some possible downsides to be aware of. First, if exposing the timestamp is a bad idea for your application, ULIDs may not be the best option. Second, the sort by ULID approach may not work if you need sub-millisecond accuracy, because records created in the same millisecond share a timestamp. It's up to you to decide if these tradeoffs are okay for your application!
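Both the sorting trick and its same-millisecond caveat can be seen with a plain string sort. The IDs below are sample values chosen for illustration, and the record hashes are hypothetical.

```ruby
# Records arrive out of order; a plain string sort on the ID restores
# creation order because the timestamp comes first in every ULID.
records = [
  { id: "01ARZ3NDEKTSV4RRFFQ69G5FAV", name: "second" },
  { id: "01ARYZ6S41TSV4RRFFQ69G5FAV", name: "first" }
]
by_creation = records.sort_by { |record| record[:id] }.map { |record| record[:name] }

# These two share the timestamp prefix 01ARZ3NDEK, so their relative order
# depends only on the random suffix, not on which was created first.
same_millisecond = ["01ARZ3NDEKTSV4RRFFQ69G5FAV", "01ARZ3NDEK79KA1307SR9X4MV3"]
tie_order = same_millisecond.sort

p by_creation
p tie_order
```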
Picking your unique identifiers
UUIDs are and will continue to be the standard for unique identifiers. They have clear advantages over sequential integer IDs and are easy to use. They've been around forever, so libraries are available in every language imaginable. Between sequential IDs, UUIDs, and ULIDs, using UUIDs is a reasonable bet.
However, newer approaches like ULIDs are worth considering, especially as we enter a software development world that's increasingly run by distributed systems. New unique-ID approaches may help us solve problems that weren't prevalent when RFC 4122 was published, so keep your eyes (and your mind!) open.