How Reddit designed their metadata store to serve 100k req/sec at p99 of 17ms

25,930 views

Arpit Bhayani

1 day ago

System Design for SDE-2 and above: arpitbhayani.me/masterclass
System Design for Beginners: arpitbhayani.me/sys-design
Redis Internals: arpitbhayani.me/redis
Build Your Own Redis / DNS / BitTorrent / SQLite - with CodeCrafters.
Sign up and get 40% off - app.codecrafters.io/join?via=...
Recommended videos and playlists
If you liked this video, you will find the following videos and playlists helpful
System Design: • PostgreSQL connection ...
Designing Microservices: • Advantages of adopting...
Database Engineering: • How nested loop, hash,...
Concurrency In-depth: • How to write efficient...
Research paper dissections: • The Google File System...
Outage Dissections: • Dissecting GitHub Outa...
Hash Table Internals: • Internal Structure of ...
Bittorrent Internals: • Introduction to BitTor...
Things you will find amusing
Knowledge Base: arpitbhayani.me/knowledge-base
Bookshelf: arpitbhayani.me/bookshelf
Papershelf: arpitbhayani.me/papershelf
Other socials
I keep writing and sharing my practical experience and learnings every day, so if you resonate then follow along. I keep it no fluff.
LinkedIn: / arpitbhayani
Twitter: / arpit_bhayani
Weekly Newsletter: arpit.substack.com
Thank you for watching and supporting! It means a ton.
I am on a mission to bring out the best engineering stories from around the world and make you all fall in love with engineering. If you resonate with this then follow along; I always keep it no-fluff.

Comments: 59
@richi12345678910 · 1 month ago
The Kafka CDC can solve the problem of synchronous write inconsistencies, but not the backfill overwriting. I suspect they might do some kind of business logic or SHA/checksum validation to ensure that they are not overwriting the data during backfilling. Correct me if I'm missing something bro.
@vinayak_98 · 1 month ago
Hey Arpit, thanks a lot for putting this up. Your writing skills are next level, crisp and crystal clear. Could you please tell what's the setup you use for taking these notes? Thanks in advance.
@ankitpandey3724 · 1 month ago
Weekend party ❌ Asli Engineering ✅
@nextgodlevel4056 · 1 month ago
Successfully ruined my upcoming weekend. Have to view all of your videos now 😢
@GSTGST-dw5rf · 12 days ago
Great that there are such useful videos on KZbin. Thank you, Minister!
@prtk2329 · 1 month ago
Nice, very informative 👍
@raj_kundalia · 19 days ago
Thank you for doing this!
@AayushThokchom · 1 month ago
Large Data Migration -> Event-Driven Architecture. Also, interesting to learn about Postgres's extensions, which are not required if going with a serverless database solution like DynamoDB.
@user-ob1zi3jc1r · 1 month ago
How many shards were used to hold those partitions to achieve that much throughput?
@JardaniJovonovich192 · 1 month ago
I guess we don't need both the CDC setup and dual writes; just the CDC setup would suffice to insert the data into the new DB, correct?
@mehulgupta8218 · 1 month ago
As usual great stuff🙌🏻
@ramannanda · 1 day ago
Postgres is still the king :) with extensions it is all you need...
@guytonedhai · 1 month ago
Asli Engineering!
@karanhotwani5179 · 19 days ago
Nice. The Kafka part seemed like over-engineering. They could just verify the hash before writing to the new metadata DB in the syncing phase.
@dreamerchawla · 1 month ago
Hey Arpit… thanks for the video. I liked the idea of partitioning as a policy that runs on a cron. But wouldn't moving data around in partitions also warrant a change in the backend (reads)? Or are you saying the backend has been written in a way that takes partitioning into account while reading the data?
@AsliEngineering · 1 month ago
They use range-based partitioning, so no repartitioning is required. A new range every 90 million IDs.
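For illustration, a minimal sketch of what range partitioning on a monotonically increasing post ID could look like in Postgres; the table and column names (media_metadata, post_id), the psycopg2 usage, and the 90-million-wide ranges are assumptions drawn from this thread, not Reddit's actual schema or code.

```python
# Hypothetical sketch: range partitioning on a monotonically increasing post_id.
# Table/column names and the range width are illustrative assumptions.
import psycopg2

RANGE_WIDTH = 90_000_000  # a new partition every ~90 million IDs

PARENT_DDL = """
CREATE TABLE IF NOT EXISTS media_metadata (
    post_id    BIGINT NOT NULL,
    metadata   JSONB  NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
) PARTITION BY RANGE (post_id);
"""

def ensure_partition(cur, max_post_id: int) -> None:
    """Create the partition covering max_post_id if it does not exist yet.

    A periodic cron job (as discussed in the comments above) could call this
    whenever the current ID range is close to being exhausted.
    """
    lo = (max_post_id // RANGE_WIDTH) * RANGE_WIDTH
    hi = lo + RANGE_WIDTH
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS media_metadata_{lo} "
        f"PARTITION OF media_metadata FOR VALUES FROM ({lo}) TO ({hi});"
    )

with psycopg2.connect("dbname=media user=app") as conn, conn.cursor() as cur:
    cur.execute(PARENT_DDL)
    ensure_partition(cur, max_post_id=182_345_678)  # lands in [180M, 270M)
```

Because the IDs are monotonic, a batch read for recent posts falls almost entirely inside the newest partition, which is what keeps the lookups cheap.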
@anurag-vishwakarma · 1 month ago
Can you share the notes, please?
@namanjoshi5089 · 1 month ago
Amajeeng
@poojanpatel2437 · 1 month ago
Good video. Would appreciate it a lot if you could attach any resources you used in the video, like the Reddit blog mentioned in the description. Would be great if the link were attached there too.
@AsliEngineering · 1 month ago
Already added in the description. You are just a scroll away from finding that out.
@sachinmalik5837 · 1 month ago
Hi Arpit, I think you could have gone a bit more in depth, like they have in their blog: a bit about how they are using an incrementing post_id, which allows them to serve most queries from one partition only. Not complaining at all. Thanks for being awesome as always. TL;DR: 7 minutes seems a bit short.
@AsliEngineering · 1 month ago
I deliberately skipped it because it would have taken 4 more minutes of explanation. I experimented in this video by keeping it very surface-level and around the 8-minute mark, and the retention for this one is through the roof 😅 In the last 15 videos I saw a massive drop in retention numbers when I started explaining implementation nuances or when the video length went beyond 8 minutes. So I wanted to test out that hypothesis in this one video. Hence you see I did not even inject the ad for my courses or the intro; I jumped right into the topic. But yes, given their IDs are monotonic, their batch gets for media metadata would almost always hit a single partition if they partition the data by ranges.
@SOMEWORKS456 · 1 month ago
Would appreciate it if you could make another 8-minute video for the details. I am here for the meat. Surface-level stuff is OK. But meat. No complaints, just stating the opinion of a random lurker.
@amoghyermalkar5502 · 1 month ago
@AsliEngineering so, in short, your focus is more on viewers than on better video quality, right?
@AsliEngineering · 1 month ago
@amoghyermalkar5502 if you really feel that way then please check out the other videos on my channel. I go to a depth that other people cannot even think about or comprehend. Remember, it hurts to put 2 days of effort into a video only for it to be seen by just 2000 people in 7 days.
@sachinmalik5837 · 1 month ago
@AsliEngineering Absolutely, I can understand that. Without naming names, there are so many "tech" creators who are getting 10x the views we get here, but they just never seem to talk about substance. I just want you to know we do appreciate it a lot. I am still trying to read more blogs on my own, so I don't think I am being spoon-fed by watching a video.
@code-master · 1 month ago
How will you handle search, given the relevant data might be in partitions several days older? Even if they're using a secondary data store, the date/time range-based partitioning or even sharding will not suffice. What do you think?
@AsliEngineering · 1 month ago
Why would it be several days older? The migration was a one-time activity. Post-switch, the writes of media metadata always go to the unified media metadata store.
@code-master · 1 month ago
Thank you for your response. Apologies, my question was not clear. My question was more related to searching through such a data store where the data is partitioned daily, i.e. partitioned on created_at. Let's say you search for a post with term 'X'; the result will ideally require a lookup across several partitions. For example, if there is a relevant post from a year back, we are looking at many partitions. To build the search result, the DB has to load each daily partition. Daily partitioning will work well only if the lookup is limited to a couple of days back. That's my understanding.
@user-vv7ph2xr1o · 1 month ago
Arpit, using CDC and Kafka still does not solve the problem of data from the old source overriding data in the new Aurora Postgres during the migration, right? What am I missing? You will still need a bulk batch job that picks up all the archival data from the multiple sources and ingests it into the new Aurora. Using CDC does not solve that backfill, correct?
@2020goro · 1 month ago
CDC can transfer the historical data using snapshots of the existing databases if/when the transaction log is not available for old data, and then the consumers report any write conflicts into a separate table which the devs can remediate later on. Hope that answers your question.
@AsliEngineering · 1 month ago
The consumers of Kafka have this responsibility. It is not that just adding Kafka solved the problem; the core conflict management is written in the Kafka consumer, which does a check-and-set in the database.
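To make the check-and-set concrete, here is a hedged sketch of a conditional upsert such a CDC/backfill consumer could run so that a stale backfill row never overwrites a newer write; the last_modified column, the unique key on post_id, and the table name are illustrative assumptions, not Reddit's implementation.

```python
# Illustrative check-and-set upsert for a CDC/backfill consumer.
# Assumes a unique key on post_id and a last_modified timestamp column.
from psycopg2.extras import Json

UPSERT = """
INSERT INTO media_metadata (post_id, metadata, last_modified)
VALUES (%(post_id)s, %(metadata)s, %(last_modified)s)
ON CONFLICT (post_id) DO UPDATE
SET metadata      = EXCLUDED.metadata,
    last_modified = EXCLUDED.last_modified
WHERE media_metadata.last_modified < EXCLUDED.last_modified;
"""

def apply_event(cur, event: dict) -> None:
    """Apply one change event; older (stale) events silently lose the conflict."""
    cur.execute(UPSERT, {
        "post_id": event["post_id"],
        "metadata": Json(event["metadata"]),
        "last_modified": event["last_modified"],
    })
    if cur.rowcount == 0:
        # The row already held newer data; record it for later remediation.
        print(f"skipped stale event for post {event['post_id']}")
```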
@keshavb2896 · 1 month ago
Why didn't Reddit go for a document DB for this storage, given the structure and access pattern? What do you think about it, @arpit?
@AsliEngineering · 1 month ago
According to me, familiarity with the stack could be the biggest reason. Apart from this, the query pattern here is that most requests hit a single partition (given IDs are monotonically increasing and partitions are created on a range basis). Most KV stores do hash-based partitioning, because of which lookups need to be fanned out across the shards, which is quite expensive. One database that does support range-based lookups per shard is DDB, and that managed offering becomes very expensive at scale. These are some of the pointers I could think of. But again, this is a pure guess.
@nextgodlevel4056 · 1 month ago
How does PgBouncer minimize the cost of creating a new process for each request? Maybe I am wrong; can you tell me how the cost is reduced here?
@niteshlohchab9219 · 1 month ago
Let's say each connection spawns a new process, and killing a connection kills the process. What would you do, logically? Think before reading the next line... Simply reuse the connections. That's what every database proxy in front usually does, in simple words: the connections are reused and managed accordingly.
@nextgodlevel4056 · 1 month ago
@niteshlohchab9219 Thanks, bro, for the easy and simple explanation; I appreciate it. What I was thinking was that the term "cost" is used for money, but I was wrong. Here, "cost" means scalability and performance, ensuring that each client gets a response as quickly as possible. So, in terms of money, we increase the cost, and in terms of scalability and performance, we decrease the cost. If we look at it for the long term in enterprise applications, having a scalable product also increases revenue. Let me know if I am correct or not 🙃
@AsliEngineering · 1 month ago
Because it does connection pooling, so connections are reused.
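For intuition, a minimal sketch of connection reuse with a client-side pool; PgBouncer plays the same role as a proxy sitting in front of Postgres. The DSN, pool sizes, and table name are placeholders.

```python
# Minimal illustration of connection pooling: connections are opened once
# and reused across requests instead of spawning a new backend per request.
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(minconn=2, maxconn=20, dsn="dbname=media user=app")

def fetch_metadata(post_id: int):
    conn = pool.getconn()   # borrow an already-open connection
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT metadata FROM media_metadata WHERE post_id = %s",
                (post_id,),
            )
            return cur.fetchone()
    finally:
        pool.putconn(conn)  # return it to the pool instead of closing it
```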
@LeoLeo-nx5gi · 1 month ago
Thanks Arpit!! Also, what are your thoughts on using Pandas as a metadata DB? Dropbox had a post about them using Pandas wherein they explained in depth why other DBs were not better for them. (Would like to know your views on it too.)
@AsliEngineering · 1 month ago
I am not aware of this. Let me take a look.
@calvindsouza1305 · 1 month ago
What is used over here to write down the notes?
@AsliEngineering · 1 month ago
GoodNotes.
@calvindsouza1305 · 1 month ago
Thanks, looks very clean.
@suhanijain5026 · 1 month ago
Why are they using Postgres if they are storing it as JSON?
@AsliEngineering · 1 month ago
Stack familiarity, plus range-based partitioning support.
@atanusikder4217 · 1 month ago
What is the CDC mentioned here? Please suggest some pointers.
@gauravsrivastava3884 · 1 month ago
Change Data Capture
@myjourney7713 · 1 month ago
How did they check whether the reads from the old vs. new database are the same?
@AsliEngineering · 1 month ago
A simple diff would work, given that the final JSON has to be the same since no changes were made to the client.
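A hedged sketch of what such a shadow-read diff could look like: serve from the old store, read the same key from the new store, and compare the JSON. The read_old / read_new callables and the logging sink are illustrative assumptions, not Reddit's code.

```python
# Illustrative shadow-read comparison during the migration.
# read_old / read_new are placeholders for the old and new data paths.
import json
import logging

log = logging.getLogger("shadow_diff")

def get_metadata(post_id: int, read_old, read_new) -> dict:
    old = read_old(post_id)  # the old store stays the source of truth
    try:
        new = read_new(post_id)
        if json.dumps(old, sort_keys=True) != json.dumps(new, sort_keys=True):
            log.warning("mismatch for post %s: old=%s new=%s", post_id, old, new)
    except Exception:
        log.exception("shadow read failed for post %s", post_id)
    return old               # clients keep receiving the old response
```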
@myjourney7713 · 1 month ago
@AsliEngineering if there is any issue at scale, wouldn't it be very hard to debug?
@bhumit070 · 1 month ago
Dayumm
@anand.garlapati · 1 month ago
Are you saying that Reddit has a unified database per region?
@AsliEngineering · 1 month ago
Unified database implies that the data that was split across multiple services has been moved to one database. Now this unified one can be replicated across regions to improve client-side response times.
@anand.garlapati · 1 month ago
What was Reddit's motivation to initially go for a dedicated database per service? Could you please tell how many such services they have? Regarding your second point, the Reddit team allowed only reads from the replicated databases and not writes, correct?
@LeftBoot · 1 month ago
How can I use AI to make this sound like my native language?
@shalinikandwal7845 · 1 month ago
What is CDC?
@AbhisarMohapatra · 1 month ago
Change Data Capture ... it means streaming the database's binlog files.
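For reference, a rough illustration of the shape of a CDC change event (Debezium-style, derived from the database's WAL/binlog) that a downstream consumer would see; all field names and values here are illustrative.

```python
# Rough, illustrative shape of a CDC (Change Data Capture) event as emitted
# from a database's WAL/binlog by tools such as Debezium.
change_event = {
    "op": "u",  # c = create, u = update, d = delete
    "source": {"db": "posts", "table": "media_metadata", "lsn": 123456789},
    "before": {"post_id": 42, "metadata": {"width": 640}},
    "after":  {"post_id": 42, "metadata": {"width": 1280}},
    "ts_ms": 1717000000000,
}
```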