How Shopify’s engineering improved database writes by 50% with ULID

Рет қаралды 110,827

Hussein Nasser

Күн бұрын

Пікірлер: 107

@hnasr Жыл бұрын

Get my fundamentals of database engineering udemy course database.husseinnasser.com

@kapssul Ай бұрын

20:56 ULID composed of 2 parts, first part is ordred but the second part is still random, so it will lead to non-sequential writes.. and this will cause page splits and rearrangement in the database, impacting performance.. am i missing something

@gwapolaub Жыл бұрын

Pure database architectural gold. Thank you for this analysis.

@pqnet84 Жыл бұрын

24:30 you can tradeoff cache memory for mutex contention by adding some random bits at the START of the id so that for a certain timestamp you have multiple pages

@johannsebastianbach3411 Жыл бұрын

Could you elaborate on that a little further for a simpleton like me? 😂

@pqnet84 Жыл бұрын

@@johannsebastianbach3411 by having random bits at the start of the key you are distributing the objects on various parts of the btree. Example, if your first bit is random half of the keys will start with 0 then a timestamp, other half will start with 1 then a timestamp. Because of the timestamp all the 0s are going to be grouped together and all the 1s are going to be grouped together but they will be in different part of the btree. Like having the same street name, but different cities. So if you have a cache big enough for 2 pages you can keep both in memory and both will be fast (you are still limited to 2 pages), but you won't always access the same page (they are randomly load balanced to the two pages) so you reduce the contemption on the page mutex

@johannsebastianbach3411 Жыл бұрын

@@pqnet84 omg, that makes perfect sense! Thanks. We use something similar at work for foldering strategy, like a/b/abab-abab-ababa-restOfUUID, where we get the first two folder names from the first two chars of the uuid, never would have thought about doing that for indexing though!

@amitmahadik4653 10 ай бұрын

For race condition, we can think of synchronization (allow one thread to write), but it will generally slow down insertion

@abdelrahmandwedar 8 ай бұрын

That's exactly what I thought about. How would it increase the speed of insert by 50 percent while having everything synchronized?

@justinsullard1519 Жыл бұрын

I've been doing something like this for almost a decade now. The first 12 hex digits a time stamp, the next 4 he'd digits used as either a shard ID or a sequence, and then the rest random. Useful for so many reasons.

@JAlejandroMG 9 ай бұрын

Hey @justinsullard1519, from somebody who has not worked a lot in DB, how secure is this approach? Have you had any security risk issue using this? I've been looking for info about this aspect, but haven't found anything at all.

@Winnetou17 Жыл бұрын

Hussein: "A GET request by definition must be idempotent [...] Nothing chages on the backend, nothing should change" View counters: 😨😰

@torocatala Жыл бұрын

Logs: 😨😰

@jerynmathew Жыл бұрын

One could always use POST for fetching counters and logs. Like how elasticsearch exposes /_search API

@abdelrahmandwedar 8 ай бұрын

Well... Yeah that didn't change the response of GET request for most cases... 😕 There's cases like YT watch counter but mostly it doesn't increase on your GET request itself.

@lfsr1288 Жыл бұрын

It's a great approach. The only case when you ended up inserting "old time request" is with the retry logic retrying some requests since they have a timeout of 24hs or less. But again, this is a great approach to improve the insert

@dhillaz Жыл бұрын

Having the timestamp embedded in the UID also means you can use the sorted index for deleting/archiving old rows when doing housekeeping, without the overhead of maintaining an additional index.

@nemtudom5074 Жыл бұрын

This is the kinda optimizing we used to see in the 90's when you *HAD* to optimize your code because you couldnt just buy more performance, not like this anyway. Not to mention that even if you did, you didnt get much. Things perform so well now that you can just let your code be unoptimized and people wont care. Im glad they are putting in the effort.

@javedutube10 Жыл бұрын

Just bought 'Fundamentals of Networking for Effective Backend Design' from UDEMY, even though I could learn it from different channels that also suggested by you : ), including yours, of course. That's your magic. Thanks. Love you.

@guss77 Жыл бұрын

Correction about UUID randomness - please note that only UUID version 4 are random. There are other UUID versions, notably version 1 - which is time based and has lexicographic sorting, just like ULID, but it is standard and have a lot of supported implementations. The main problem with UUIDv1 is that the bit order is a bit weird and if you look at it as an opaque bit set, it looks quite random as the least significant bits of the time (time low in UUID spec) are in front. This was done purposefully. MysQL version 8 improved support for using UUIDv1 as primary keys, with good index locality, by adding specific functions to reorder the time fields in a UUID so that they are ordered from most significant to less significant - similar to cardinals, and that convert the UUID string representation to a compact binary storage that is almost as efficient as using an integer.

@gsilva877 Жыл бұрын

Thanks for the coment, is useful ❤

@abhaygoswami7021 Жыл бұрын

One can also use UUIDv7 which also use timestamp and is sortable in nature and 62 bits randomness as compare to 80 bits in ULID

@DrBatroni Жыл бұрын

I also wonder why? They come from UUID v4, so in both cases the new timespamt would be new. So i wonder whats the big diffrence between UUID v7 and ULID?

@Tordek Жыл бұрын

ULIDs are (slightly) shorter, the encoding is trivial (and easier to split: you split the first 8 characters for the timestamp without fiddling with dashes or variable length fields). Standards only matter if you need to interact with someone else, and you shouldn't depend on other people parsing your IDs. There doesn't seem to be any downsides to ULID over UUID.

@CTimmerman Жыл бұрын

TL;DW: UUID tends to be too long and random. ULID is 10 byes shorter and starts with a sortable timestamp.

@EngineerNick Жыл бұрын

Thankyou for this :) I love the idea that data can be 'tucked in' to a db 🧸

@MaxPrehl 5 ай бұрын

This is a great breakdown! Wish you also had some visual aids!

@kenbee85 Жыл бұрын

I just stumble upon your video as suggested by YT. Though I merely focusing on Frontend I really got curious about this.

@sanzharsuleimenov6380 Жыл бұрын

Certainly! Here's a simpler version of your text: Thanks! I found your video about Discord switching from Cassandra, and I really like how you share your thoughts.

@ankgaur Жыл бұрын

Can anyone name youtube channels like this with this kind of detailed backend videos

@prathameshbhat9816 Жыл бұрын

Please tag me as well if someone is replying

@tesla1772 Жыл бұрын

You can check arpit bhayani and bytebytego

@ankgaur Жыл бұрын

@@prathameshbhat9816 FYI

@ankgaur Жыл бұрын

@@tesla1772 thanks

@siddharthsingh7281 Жыл бұрын

Theo brown is the best

@Aleks-fp1kq Жыл бұрын

I didn't understand the part about 24hrs indempotency. How is it related to ulid?

@Stig12 Жыл бұрын

This is either the same or very similar to a Comb Guid. Not sure if you mentioned it, but you'll want to modify the byte order before storing in MS SQL Server to get the same sequential benefit.

@skreibb 10 ай бұрын

UUID v4 could be beneficial for databases with partitions under the hood (dynamo db). ULID could make writes to the last partition really expensive

@nathansgreen Жыл бұрын

Use UUIDv7, as defined in the drafted revisions to RFC 4122. Yes, it's slightly different, but it enables compatibility with existing UUID code, which already exists in a lot of places.

@mosespeter9711 Жыл бұрын

I have been using the sorted UUID strategy since MYSQL 8.0 with UUID_TO_BIN("uuid",1) and this function further convert it to BINARY(16) instead of storing it as string... this is super fast!

@DavisTibbz Жыл бұрын

Yes! More people should upgrade to mysql 8.0

@svencowart Жыл бұрын

Fantastic explanation

@kickeddroid Жыл бұрын

Great video man keep up the great work!!!!

@abcdef-fo1tf Жыл бұрын

I feel like in url shortener, the ULID would still help reads as you're more likely to read newer URLs which might be in the buffer pool

@figloalds Жыл бұрын

The RID mechanism I use on my program is similar to that, but I use 32 byte array containing a machine specific signature, a timestamp, a random number and a in-memory sequential number, it is designed to absolutely never conflict

@Milchmann2 Жыл бұрын

Using a "timestamp" as key will then also redirect all writes to the same node of the distributed database and thus creating a hotstpot. I wish the article and you talked a bit more why this is not an issue for them.

@fishoil5310 Жыл бұрын

The database is probably sharded using user id. The write is still effectively distributed across different nodes.

@colt4by5 Жыл бұрын

Edit: I forgot about the 80 bits of random data! I think partitioners will hash the partition key, so any randomness will cause the data to be spread across nodes and avoid hotspots. I was also going to make this point... good insight! Using a timestamp as the partition key in a distributed database can indeed create a hotspot and degrade performance. I think it's not an issue for them because they are using MySQL, which is not a true distributed database, so that caveat doesn't apply. I guess the potential issue for them is they're limited to the write performance of a SQL database, which can't scale up as high as a distributed database can.

@miladin4023 Жыл бұрын

Can someone elaborate more on this? Thanks :)

@colt4by5 Жыл бұрын

@@miladin4023 I think the main point is that distributed databases use the partition key to "route" data to a physical node, so if you use just a timestamp, then all data will be routed to the same partition (and therefore node) for each time slice, which can create a hotspot and remove much of the value of using a distributed database. A 48-bit timestamp should allow for millisecond granularity, so maybe it's not a huge issue, but probably better to avoid it. The 80 bits of random data in the ULID should prevent the issue altogether though. You can find a quick explanation if you search for "DynamoDB designing partition keys to distribute your workload". On the relational database side, Hussein already did a much better job of explaining how data is routed than I could :).

@junaid1555 Жыл бұрын

@@colt4by5 if the partition key is hashed to avoid hotspots, won't the range queries be very expensive?

@ashishbtech Жыл бұрын

That explains well why SQL-type DBs do not perform well with the long tail of data.

@pollathajeeva23 Жыл бұрын

One video on Search engine indepth please

@JohnZakaria Жыл бұрын

That's my first time hearing about ULID

@engineeranonymous Жыл бұрын

Never use GUID's for anything if your database have the ability to use b-tree. Eventually some day you have to use the GUID field for something and you will regret it. Use integer keys and hashid library in order to use it with controllers.

@megalodon4272 Жыл бұрын

This is gold

@nerdative Жыл бұрын

Do they assume that clients have the correct time? What happens if the client have the wrong time? Then it would generate "old" or "incorrect" ULIDs. Unless they somehow synchronize the time between server and client, or it happens so rarely that it doesn't matter.

@nuculabs Жыл бұрын

Great content, thank you!

@tesla1772 Жыл бұрын

So can we say that mongodb ids are more efficient compared to ulids as it is 96 bits and are also sequential

@tesla1772 Жыл бұрын

@@JamesSmith-cm7sg we can genrate same ids locally

@btom1990 Жыл бұрын

@@JamesSmith-cm7sg but MongoDB is webscale kzbin.info/www/bejne/mGOpXnd_qb2nkNU

@atmrar Жыл бұрын

@@tesla1772 local generation of ulid is error prone. If system clocks are not synchronised the local id can go back in the time .

@tesla1772 Жыл бұрын

@@atmrar yeah. But i think machine id and process ids are aslo included in it right?

@MrOlivm Жыл бұрын

Sequential importantly implies that mongo’s “web scale” ain’t so web scale. I don’t know why you would use mongo when you could use /dev/null

@Vikasptl07 Жыл бұрын

Great stuff !!

@athar-q 4 ай бұрын

Tip: Don't forgot to watch at 1.75x speed.

@billywang3829 Жыл бұрын

I love this man

@ahmedelzubair6907 Жыл бұрын

Why do you think that ULID is not gonna help in reads? At the end, it's sorted, right ? So it can be easily indexed, and if we indexed our ULIDs, it will help in reading a well because we are going to use the index in reading this data.

@oah8465 Жыл бұрын

fantastic video but why are we going with GUIDs, UUID, ULID, if we have our beloved auto-increment primary key in MySQL? Let me put it this way, in which scenario is the primary key (auto-increment) by MySQL is no good and we have to go either the UUID or the ULID way?

@SaeedAbdulRaheem Жыл бұрын

7:05

@DavisTibbz Жыл бұрын

Because it is Predictable

@eddie_dane 6 ай бұрын

Which book or resource would you recommend to learn in depth knowledge on database

@DurgaShiva7574 Жыл бұрын

Amazing video, but, one doubt, who is actually generating the UUID or ULID's .. in both the scenarios.. client or Server ??

@esra_erimez Жыл бұрын

The more things change, the more things stay the same. IBM ISAM/VSAM

@davidvultur8704 Жыл бұрын

Any plans on doing a new twitter architecture/speed video with the introduced updates? (compared to the old "Twitter Backend is slow" vide) Thanks

@danielbocelli 5 ай бұрын

Please explain to me why a second column with a timestamp couldn’t serve this purpose?

@1Eagler Жыл бұрын

So either threads will go sequentially either in memory or in disk

@raunaquepatra3966 Жыл бұрын

What about BTree getting skewed

@harshsahu7825 Жыл бұрын

just joined you DB course on udemy :)

@urrahman196 Жыл бұрын

Can you please make a video on refresh token? Why do we need this along with JWT? Pros n cons.

@fedemtz6 Жыл бұрын

if the first 48 bits are time why is the random part so long? the chances of two identical random parts being generated at the exact same time are very low

@johnsailor3590 Жыл бұрын

Yay

@abcdef-fo1tf Жыл бұрын

I'm curious why we couldn't just use the auto-incr feature on the SQL DB if we wanted sequential primary keys? Is this because this would make it harder to make it idempotent? If we're not using it for that in the case with URL shortener, we can just use auto-incr right? Also are there cases where ULID is worse than UUID? You talked about downsides, but it seems like they were also the case for UUID

@Zuriki09 Жыл бұрын

The drawbacks of sequential IDs are discussed early on in the video. ULID can be worse in the following cases: Where you specifically don't want to expose the relative position of the ID - the timestamp gives some indication of when the ULID was created in relation to other ULIDs, this may be important for cryptographic reasons. When you are interfacing with systems that expect UUIDs. ULIDs can be converted to UUIDs but that may be more burdensome than simply using UUIDs to begin with. UUID spec is considered more stable and future proof, this may be important in large enterprise environments where change is extremely difficult to implement - you are unlikely to encounter this problem, but that is "a" criticism of ULIDs.

@hirisraharjo Жыл бұрын

What about CUID ? is it better than ULID and UUID?

@mayyar123 Жыл бұрын

Would the same problem of inefficient indexing occur if we use Non-clustered indexing w/ UUID ?

@hnasr Жыл бұрын

Still exists but not as bad as clustered indexes. Clustered indexes leaf pages are rich with data which means UUID inserts are more likely to cause page splits and encores more random IO and dirty pages flushing. non clustered uuids still has the same behavior but it will just take more rows to get there.

@akovbovich Жыл бұрын

Why not use UUIDv4 with hash indexes for fast lookups and a separate timestamp column for range selection?

@hck1bloodday Жыл бұрын

In my little knowledge of databases I feel that will cause slower inserts and they want to maximize insert speed

@-Jason-L Жыл бұрын

If you're selecting a range, it better be indexed on that tange

@jitxhere 5 ай бұрын

what hash function would you use??

@guerra_dos_bichos Жыл бұрын

Thats not a type universally, university universe should be proceeded by A not AN

@cvoges12 Жыл бұрын

Be this guy and spend 30 minutes on talking about how sorted data makes things predictable.

@awnion Жыл бұрын

ULID is not something new btw

@cvoges12 Жыл бұрын

Stop playing with the mic. It's both highly distracting and messing with the audio by getting louder and quieter at near random. And aim the mic at your chest/jaw, not your forehead.