The effect of Random UUID on database performance

Рет қаралды 72,027

Hussein Nasser

Күн бұрын

Пікірлер: 167

@hnasr Жыл бұрын

database fundamentals course database.husseinnasser.com

@MrAtomUniverse Жыл бұрын

This is good , i'm on udemy business, do keep more courses open to business plan users! thank you! great videos you have there

@bishalkandel9976 Жыл бұрын

Let's make it bigger.

@barebears289 Жыл бұрын

That's what she said😂😂

@warrenarnoldmusic 19 күн бұрын

That caught me off guard 9:40 😂😂😂 damn all indians are unhinged

@AffyisAffy Жыл бұрын

The whole time I was wonder why not just insert on a new index rather than pointing to a UUID, that way it's always ordered. But then you said MySQL defaults to this at the end of the video. That's critical for the 'why' of this video

@tirthwork9288 Жыл бұрын

He said that in the start of the video as well

@berndeckenfels Жыл бұрын

You can use a sequence but keeping them has its own scalability problem, plus the pk won’t be stable

@nalbertcerqueira6079 2 ай бұрын

He talked about this at 4:45. He explains that MySQL uses the primary key index as the table itself, this implementation is called clustered index. In the case of MySQL you can have only one clustered index per table, because the index is the table itself. The other indexes you create are "common indexes" which stores only the index id and a pointer to the actual data, and this pointer is actually the primary key.

@berndeckenfels Жыл бұрын

Yes when you age them out, they might leave gaps in the index tree but also reuse the gaps regularly. They might not “nowhere near each other” but always in between. Sp with larger pages the splits are less likely. (Of course IOT is to a good usecase)

@ahmedalaaeldin7362 Жыл бұрын

Actually, I was thinking about that question a couple of days ago Thank you :)

@RealEvangelizer Жыл бұрын

Ive used UUID generated outside the database and stored as primary key varchar within the db. Its worked a lot better for over 100 million devices that I was working with. Primary goal was to use it as hashtables when use other systems like redis/dynamo etc. For lookups, with a bit off magic in UUID generation which is not being generated by DB, you can apply Bloom filters as well. I've been out of engineering field for a decade+ but these sort of videos are always much needed to discuss the fundamentals of how things work.

@ddanielsandberg Жыл бұрын

MySQL 8 uses UUID v1 (kind of compatible with UUID v2). By using the UUID_TO_BIN()/BIN_TO_UUID() function with the optional second argument 'swap_flag' it will reformat the UUID before converting it to a BINARY(16). This will make them sequential (since UUID v1/v2 is based on the timestamp).

@someoneanonymous6707 Жыл бұрын

How costly is that operation tho?

@cedricbrisson7240 Жыл бұрын

uuidv1 is not secure tho

@joesb Жыл бұрын

@@cedricbrisson7240 "What security and why should it apply for this case?" uuidv1 has some characteristic that make it possible for someone who want to *intentionally* clash/guess your id, they might be able to. If YOUR OWN service is creating this id, why would you intentionally do it? Don't just parrot that something is insecure. Knows if it applies to your use case and attack vector as well.

@cedricbrisson7240 Жыл бұрын

Jesus chill lol It's simply good practice to avoid using vulnerable services/tools when safer alternatives can be used. As someone on a CSEC team I'd rather pre-emptively protect myself than wait for someone to find a use case for a specific flaw :) But sure, I'm just "parroting". You're the kind that praises himself as a software engineer thinking their above "programmers" right?

@joesb Жыл бұрын

@@cedricbrisson7240 do you think “integer” is secure? Do you avoid using integer? It’s nonsense statement. Uuid v1 is insecure for a certain use case. Using it as an id you generate yourself is not one of them.

@OzoneGrif Жыл бұрын

The worst is using UUID with SQL Server; since the PK is Clustered; performance becomes awful very quickly. I use Postgresql and never had performance issues with UUIDs.

@ShinigamiZone Жыл бұрын

Can you tell me a bit more about this? What means clustered in that case for SQL Server? I just structured a whole DB with PK UUIDv4 in SQL Server but it’s not shipped to prod yet 👀

@ShinigamiZone Жыл бұрын

At least the prod system is not high-performance required I expect around 2000 writes per day and 1_000_000 reads per day And let’s say 10_000_000 rows in total with around 10 tables that are read mainly, 5 more tables for not so much read tables and 2 views (maybe need to be materialized views)

@OzoneGrif Жыл бұрын

@@ShinigamiZone Clustered means physically ordered on the storage. Random UUID used as a Clustered PK (the default on SQL Server) means your records will be fragmented everywhere on your storage, leading to awful performance.

@tempaccount8256 Жыл бұрын

Never thought, Hussein will come with a pun in the video "that's what she said"

@Maman-Setrum 3 ай бұрын

me usually used index "order by timestamp", thats why UUID (in this case, just for unique identity) not a problem.

@gam1l Жыл бұрын

This one cracked me up 09:59 😂 Great work as usual 👏 Please consider making a video on Pinecone and vector databases 🙏

@siya.abc123 Жыл бұрын

He made it bigger.

@vasiliynet3425 Жыл бұрын

Awesome! Thanks. Dialectic of randomness and order. Very beautiful.

@redpillsatori3020 Жыл бұрын

Pro tip: watch at 1.5x speed. You're welcome

@sad_man_no_talent 2 ай бұрын

thx

@simon3121 Жыл бұрын

Missing the point. A UUID allows an application to create the ID rather having a two step approach of asking the DB for an ID first.

@AnOmegastick Жыл бұрын

Applications can create a ULID.

@saggitt Жыл бұрын

If your IDs are public and you make them sequential, be aware that third parties will know the number of entities created over a time period.

@warrenarnoldmusic 19 күн бұрын

Enumeration attack, you can prehash to protect against that

@mikestaub Жыл бұрын

Please do a video on vector databases.

@Notion615 Жыл бұрын

set page fill to 71% and periodically rebuild your index at low fragmentation percent and problem solved for indexes on uuid4

@durgeshchoudhary Жыл бұрын

can you please point to documentation for these? would be helpful.

@ehfoss Жыл бұрын

What shower of a db requires that you rebuild it all the time haha

@Notion615 Жыл бұрын

@@ehfoss set up a service that runs every so often that rebuild your index programatically during periods of low load.

@Notion615 Жыл бұрын

@@durgeshchoudhary To my fellow neckbeards who wish to know the black arts. kzbin.info/www/bejne/oKmQd6iEpZmZhaM

@alegon2007 Жыл бұрын

What about distributed NO SQL DBs (document, key-value, etc) where the recommendation is to use partition keys with high cardinality and avoid sequential values because that will create “hot” partitions for inserts at least? Maybe its a different use case but it would be an interesting topic too

Жыл бұрын

where did you read that? Could you share the article?

@mishikookropiridze Жыл бұрын

Different scenario.

@ehfoss Жыл бұрын

There are 3 new uuids in rfc draft. ulid is equivalent to uuidv7. uuidv8 let's you define your own data. And uuidv6 has a machine key followed by a timestamp akin to ulid which can help with the situation you're describing.

@mishikookropiridze Жыл бұрын

Also possible problem is that ULID doesn't have RFC, so implementation can differ.

@avinashboppudi Жыл бұрын

@alegon2007, that is true for NO SQL DB's as it does not use B+ tree datastructure to store data. If I take DynamoDB as an example, It relies on something like partition key to hash and figure out the partition. This makes sure your single partition is not going to be overloaded with requests. While for sort key, it's recommended to keep it sequential if possible as internally in the shard it must have been using a B+ tree to sort on the sort keys

@consistentlearners Жыл бұрын

Happy Teachers Day Husseinn

@usamaabubakar3061 Жыл бұрын

Will this concept be applied to columns that are Varchar and indexed? Because strings are also random like username, email, url slugs, etc.. If so, what is the workaround for storing string values as indexed?

@shmmh Жыл бұрын

His videos are so interesting but so slow. I can easily watch it at 2x speed.

@kavyadevadiga2446 24 күн бұрын

What factors led Shopify to favor ULID over UUIDv7? What specific advantages does ULID provide that make it a better fit compared to UUIDv7, especially considering that sorting could also be achieved with UUIDv7?

@anuragbhagsain9724 Жыл бұрын

lowkey "That's what she said" reference at 9:57 😂

@RZhuAmpere Жыл бұрын

10:00 would have never expected that joke from you 😂

@SwapnilSoni 3 ай бұрын

BRUHHH XDDD

@vmaxxk Жыл бұрын

I've seen DBs with millions of records across various tables perform just fine with UUIDs leveraged everywhere. Either the performance difference is negligible or it only affects extremely large datasets.

@willi1978 Жыл бұрын

i think so too. usually data has no correlation between each other for example customers in a table. for timeseries data uid v7 can be used

@IvanRandomDude Жыл бұрын

Databases are much more powerful than we think.

@DirkSteinkamp Жыл бұрын

Sorry, but having a B+ tree with a page size of 2 is very close to degenerate the B+ tree to a binary tree -- and binary trees either need to be rebalanced regularly, or might degenerate into a linear list. Rebalancing is an expensive operation, especially if it involved a secondary storage like HDD / SSD. And that's what B+ trees are optimized for: to minimize HDD access, leveraging having some leeway for inserts and deletes without "reshuffling", and generally keep the tree depth low. This requires the page size to be considerably larger than two, and be harmonized with the block size on the disk. Besides that, different DBMS might offer different variants of indexes, e.g. hash indexes, that are not implemented as trees, but as hash tables. Depending on the use case this might be the index type to choose, and your argument becomes void. I personally don't like UUIDs very much in many use cases, but for other reasons: first is they are not easily graspable for humans, they don't tell a story, don't have some inherent use case information in their structure, ... yet they are very useful when randomness is required, because the systems involved only synchronize their data eventually. But besides that: too much cognitive load ... Anyhow: I don't buy your argument, sorry.

@amrojjeh Жыл бұрын

Hash table would still suffer from performance as it would have to load from disk if inserts aren't stored close to each other

@energy-tunes 4 ай бұрын

it was just for an example dumbass we all know what b trees do

@malibudancer8472 Жыл бұрын

This is very narrowly scoped on dbs that don't optimize for distributed cases...

@SmartK8 Жыл бұрын

You never had to distribute databases into multiple locations, that insert independently and then access global data. I use UUIDs for all the data tables. You have 100 databases with millions of IDs. Just insert independently whenever you want, the worst thing that can happen is a rare collision hypothetically (never happened to me yet). They will still be OK with UUIDs. Now imagine this with auto-increment integer. Insanity. How would you keep track of this auto-increment ID and ensure it's unique across all databases, other then synchronizing them. Auto-increment integers as a key are for script kiddie databases, when you have one local database. I still wouldn't use integers even then. Because you never know, whether you'll need distributed database later.

@ConsuelaPlaysRS Жыл бұрын

"Auto-increment integers as a key are for script kiddie databases" I'm willing to bet the house that most tables in most databases powering most applications use auto-incrementing PKs lol

@SmartK8 Жыл бұрын

@@ConsuelaPlaysRS That doesn't automatically mean there's a good reason to have such keys. Also most databases are probably local file databases. Everyone can do whatever they want. I'm just saying after some gathered experience, you realize that little gain in being sequential is worth sacrificing for a more robust and secure database.

@shubhambravo Жыл бұрын

I work for a major b2c org and we use auto increment IDs to store order info 😅

@Ryan-gf1sz Жыл бұрын

@@ConsuelaPlaysRS While I agree that's true and it's convenient in many case,it just won't work in large distributed system.

@omniphage9391 Жыл бұрын

for the longest time you wouldnt distribute your databases if you could avoid it, i know that sql sharding exists but that is something you would have to build your persistence layer around. With the rise of nosql databases this has changed, although that in turn also has implications for design. Its engineering. The perfect solution is the one that works.

Жыл бұрын

does it mean that, generally, indexing on a random value column is hurt?

@fullstackdotnet Жыл бұрын

it is, but dont start avoiding it and enforcing new guidelines yet, wait till you reach shopify level scale where you can benefit from that sort of performance improvement. its all about ROI

@tmbarral664 Жыл бұрын

Hum…Hussein, I’m a kind of surprised here…. To a point I maybe doubtful…. Here you’re showing the ordering of the whole rows ( as stating item as key value) when we’re talking about indexes (indices). Indexes are ordered (by nature ;)) and are pointing to the value, no matter where the value is. But here, you’re saying the primary index, the one for the primary key, is not behaving as a normal index…. If that’s the case, that’s a major (I should use uppercase here) problem. That means that we shouldn’t be allowed, in any case(uppercase again) to use anything but autoincrement number ! See, that’s so big it makes me doubt.

@berndeckenfels Жыл бұрын

For clustered index, also called IOT the value is in the index - however even if the value is not there you need to shuffle them

@ankurdutta3277 Жыл бұрын

I learned this the hard way. In mongodb I keep getting 'document not found' even though it does.

@AbhishekSingh-pu6dg Жыл бұрын

The silent joke at 10:00 That's what she said. Haha, classic.

@therealtuyen Жыл бұрын

I just randomly stumbled upon your video, and I've always had this itch about UUID performance. Your video really boosted my confidence in what I was thinking. Thanks a lot! Oh, and quick question: Do databases like MariaDB, MySQL, or PostgreSQL automatically play nice with ULIDs? And how do they know to sort 'em out for indexing?

@nirnullz Жыл бұрын

Postgresql, mysql 8++, mariadb 10.7++ support uuid internally. Postgresql have best implementation since it’s saved as 128bit binary but can be accessed as string without custom function on client. MariaDB/MySql client need to call certain function if you want to query a binary uuid field.

@therealtuyen Жыл бұрын

@@nirnullz I asked about ULID, bro ?

@PabloGnesutta Жыл бұрын

What about NOSql databases like mongo db?

@Iandavidandrino Жыл бұрын

I was just thinking about this 😵

@utkarshsaxena8034 Жыл бұрын

How about when we go ahead and distribute the data. Won't this make the last shard hot all the time?

@xD-saleem Жыл бұрын

thanks so much its make sense now

@jameswestbrook5709 Жыл бұрын

how about ULID thought?

@mardix Жыл бұрын

ULID leaks timestamp. For sensitive data, you may not want to expose this info.

@julianpurse956 Жыл бұрын

What about database performance of using a uuid stored in varchar column vs new db versions which support a native uuid type

@jacob_90s Жыл бұрын

Doesn't change the problem, and in fact you'll probably get even worse performance since you'll have a less compact data type.

@alfrendosilalahi6652 Жыл бұрын

Any one can explain about what is page in that explanation? i don't really understand

@muratkhan1453 Жыл бұрын

it is like a file that DB put indexes in. Lets say you want to get a record from DB with id number 80. if indexing in pages were not used it would have to scan entire DB to find where a record with id=80 saved. This wont be an eficient way of getting data. this is actualy how file system work. In DB it uses indexes that points which sector that record are saved. And pages are basicly where those indexesare saved and its pointers are where actual data/record are saved. So DB first goes to pages and finds index. then looks for the exact record from where index is pointing. Long story short pages are phonebook for DB...

@jwbonnett Жыл бұрын

UULD is not 100% accurate, the issue is the datetime ticks, not every generation, generates a tick. So you can still have this shuffling, though it won't be to the same extent.

@michawhite7613 7 ай бұрын

If you need an index on usernames, won't you have the same problem?

@MrAtomUniverse Жыл бұрын

For postgresql, why would anyone use UUID , SERIAL is the default method for incremental ID right?

@dz5483 Жыл бұрын

Perhaps you don't want to reveal the IDs of your records such that it becomes easy to query your records incrementally. This could be a security risk in API available publicly

@gamemoves2415 Жыл бұрын

@@dz5483I'm thinking when you get the record from the DB on the front end rather you show the UUID.

@gamemoves2415 Жыл бұрын

@@dz5483I'm thinking when you get the record from the DB on the front end rather you show the UUID.

@sasonbraha4599 Жыл бұрын

I remember reading about an incident in GithHub involving auto-incremented counters: "For example, in one incident at GitHub [13], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users." Kleppmann, Martin. Designing Data-Intensive Applications (p. 250). O'Reilly Media. Kindle Edition.

@MrAtomUniverse Жыл бұрын

@@dz5483 in this case, people use slug which is unique , otherwise it's like youtube watch?v=qRv5g5MEF2k , let me know i anyone has a better method.

@usun_development3947 Жыл бұрын

So what are we supposed to use as a primary key identifier in MySQL, if we use distributed database setup?

@ferrysuhandri Жыл бұрын

using UUID for distributed database is fine. the video is about using identifier in general case. using distributed database is a special case.

@VTOLfreak Жыл бұрын

And now you get locking contention on that last page. Fix one problem, create another.

@AnkurVermachords Жыл бұрын

I am confused at the statement that indexes are b+ tree and it gets indexed and it gets balanced but if gets balanced isnt the order gets messed up...i am a bit confused 😐 🤔

@Aspartame12 Жыл бұрын

The tree gets balanced by shifting which node is treated as root , balanced binary search tree concept.

@AnkurVermachords Жыл бұрын

@@Aspartame12 but if change the root isnt the root becomes the first node then where is the order still same confuision

@Aspartame12 Жыл бұрын

@@AnkurVermachords order in a balanced BST is never there, it's for search only , in a balanced bst search takes less time compared to unbalanced that's the whole premise, it also supports less than and greater than queries easily.

@AnkurVermachords Жыл бұрын

@@Aspartame12he mentioned the ds to be ordered at 3.40 .. aaaaahhhhhhh i am confused now, had the idea of tree now doesnt have the idea of plant as well😂

@Aspartame12 Жыл бұрын

@@AnkurVermachords Okay, so a BST is ordered in the sense that there is a order in which the nodes are arranged left child is smaller right child is bigger and so on, if you balance the tree your root value changes , that means you do not have ready access to smallest value or largest value, but the middle value ( approx ) , so while searching for something you start from middle value ( root ) and go left or right to find it. You can also write a simple recursive/stack based code to print contents of a BST in ascending sorted order or descending sorted order.

@CaimAstraea Жыл бұрын

They never made sense to me. There's no force powerful enough in the universe to convience me I need them over autoincremented integers.

@CaleMcCollough Жыл бұрын

Great info. I need a tutorial on how to do this with PlanetScale. I ordered your Udemy class. You should add it to that.

@bernard-ng Жыл бұрын

I messed up 😂😂😂 gotta talk to my CTO as soon as possible

@mitchdigitalnetwork Жыл бұрын

😂😂😂😂 Seems to be a little bit small, so let's make it bigger............ that's what she said 9:56

@ДаниялбекДаниялов-у3ч Жыл бұрын

so what about storing uuid in hashmap index?

Жыл бұрын

if a B-tree index on UUID is hurt, will the hash index on UUID be better?

@ferrysuhandri Жыл бұрын

hash index for secondary index is possible, but for primary key not all systems can set it with hash. also, uuid is large, we should only use it as primary key when we really need.

@amrojjeh Жыл бұрын

My guess is that it'd be better but would still suffer because inserts would be stored far apart, and so you'd still have the caching issue

@michawhite7613 7 ай бұрын

Updates are slower on hash indices in the worst case. When a bucket becomes full, you now have to update every single bucket, since ID % (number of buckets) is now resulting in a different value for every single ID

@hipertracker Жыл бұрын

Using numbers as the example is misleading. UUID are not ordered by design so you cannot sort them. I do not understand why the index has to be sorted? What about hash indexes?

@berndeckenfels Жыл бұрын

You can sort them by comparing their raw byte or hex string representations, both is used depending on your db type

@hipertracker Жыл бұрын

@@berndeckenfels yes, but it doesn't make sense, you have no control over the sort order.

@berndeckenfels Жыл бұрын

@@hipertracker for an index which is supposed to find a specific UUID the sort order does not matter as long as it can be used to find it.

@michawhite7613 7 ай бұрын

Hash indices are even slower to update. Instead of updating a few nodes, you now have to update every single bucket, since ID % (number of buckets) is now resulting in a different value for every single ID.

@banterfc2073 Жыл бұрын

Why not use auto incremented IDs instead

@michawhite7613 7 ай бұрын

My understanding is that's not a problem in-of-itself. But when you have another vulnerability, this will make it way worse. For example, say I found a way to see a user's phone number. I do this for user 1, then user 2, etc. I can run the attack on every single user rather quickly. If the IDs are random, then I have to guess and check user IDs, and I won't be able to affect as many people.

@experimentalhypothesis1137 Жыл бұрын

does this apply also for uuid5?

@philheathslegalteam Жыл бұрын

Yes. UUIDv6 is the implementation which is proposed to fix this

@BrianWoodruff-Jr Жыл бұрын

I get the sense you hate indexed random UUIDs in databases.

@yadneshkhode3091 Жыл бұрын

👍

@adarshkr532 Жыл бұрын

did he just make a that's what she said joke at 10:00 😂

@shubhambravo Жыл бұрын

Use escalidraw

@IAjayMukhiya Жыл бұрын

the speed of the video is 1.25x

@RaduCruceru Жыл бұрын

All talk without showing numbers. You can theorize on how the index performs all you want, but you should take the time to run some performance tests before-hand and present the data. Otherwise this video is just basic guessing and a waste of time.

@nbme-answers Жыл бұрын

I appreciate the conceptual overview

@AtomicCodeX 10 ай бұрын

What you expect? Watch him calculate how long it takes to conclude which method they use? You do realise he’s talking about how it works, a performance test won’t help with that. If you don’t have time to learn and just need to know which is faster, then just do that. Instead of shutting on other peoples work.

@warrenarnoldmusic Ай бұрын

You can analyse algorithms analytically not just statistically. You can request for a follow up video instead of discrediting the hypothesis and the author

@ArunGordon Жыл бұрын

That's what she said 🤭

@Cthulch Жыл бұрын

Do you need to have your index ordered? A linked list? TERRIBLE. A tree? gosh! I always imagined random uuids would be utilizing hash table indexes, which is completely useless to sort. Using random numbers with a sorted index seems like a pretty dumb mistake. I feel like when you say "index", you don't even consider that there's more than one index type.

@michawhite7613 7 ай бұрын

Hash indices are really fast for select queries, but they make the updates even worse. If a page gets full, now instead of just updating a few nodes, you have to update every single bucket, since the ID % (number of buckets) will result in a different value now.

@Cthulch 7 ай бұрын

@@michawhite7613 If it was a student who wrote the hashing lib then yes. In SQL, however, it's so well optimized that the page overflow happens extremely rarely and it's not as dramatic as it may seem. Plus in vast majority of cases it's totally worth it to have inserts longer if it means having O(1) for reading. And the complexity of the inserts is also linear or at least log(n), so insignificant.

@andrewbrown8463 11 ай бұрын

There is a big difference between a database that understands a uuid type and is storing it as a number compared with a uuid stored as a string (varchar). Ultimately though using either a int64/bigint or uuid is the right way to create a primary key for a table. Never should a user entered value be a primary key. Otherwise the rest of this video is mostly garbage or irrelevant in the scheme of things when it comes to performance in a database.

@kraigochieng6395 Жыл бұрын

10:02

@solitary200 Жыл бұрын

ULID > UUID

@silicalnz Жыл бұрын

you've never worked with distributed databases 😂

@roy20021 Жыл бұрын

can you elaborate?

@voicevy3210 Жыл бұрын

@@roy20021you need uuid to handle data migrations, data should be independent of where it's located

@malibudancer8472 Жыл бұрын

@@roy20021 he is correct.

@roy20021 Жыл бұрын

@@malibudancer8472 ok, can you elaborate a bit more?

@malibudancer8472 Жыл бұрын

@@roy20021 I don't really have time to spend time on a stranger on youtube, and you can't really post links on youtube without getting your content deleted. So I can only guide you what to google for. Read about the partitioner of Cassandra and the hash function used. Search for partitioning hot spots and educate yourself about it. Also all relevant databases optimize their indexes for truly random UUIDs and allow tuples to be indexed. That is all I can do for you, if I could I would have posted you direct links instead. Thank youtube for being a P**e of s***.