Amazing content as always, dude. Love how in-depth you go in all of your videos! My favorite channel by far! Have recommended this to several friends.
@jordanhasnolife5163 2 years ago
Thanks Snehil!
@Ms452123 2 years ago
Sshhh, man's been hiding the gun show this whole time. Giga Chad on the low
@jordanhasnolife5163 2 years ago
Gotta do it to compensate for my minuscule peen
@cc-to2jn a year ago
dude, you along with neetcode are my go-tos. Great content and clear explanations.
@jordanhasnolife5163 a year ago
I appreciate it!!
@mickeyp1291 7 months ago
as always, great videos - 18:00 forgot all that stuff about ES's caching, so thanks for the reminder, gonna reread that part in the ES docs. Great job knowing about Lucene - most of my applicants have no clue about ES, and definitely not that Lucene is not a DB but a search engine (hate the JSON syntax, but what can you do). Again, super fun to listen to your vids and watch this content
@jordanhasnolife5163 7 months ago
Thanks Mickey!
@shivamsinha642 2 years ago
liked solely for the description
@SwapnilSuhane 3 months ago
great depth on the core search design, discussed with a bit of comedy ;)
@RandomShowerThoughts a year ago
16:00 exactly, I was thinking the same thing. Typically you write to the source of truth and use a queue to send it out to the various locations
@mickeyp1291 7 months ago
these days you'd assume the queue is the source of truth, then spill into S3
@RandomShowerThoughts a year ago
16:00 we can also use debezium (for certain databases), which would write to Kafka, and we'd listen on that topic
@jordanhasnolife5163 a year ago
I'll have to look into this! Haven't had the privilege of using Kafka during my career, so I haven't heard of debezium
@RandomShowerThoughts a year ago
@jordanhasnolife5163 it's pretty cool, I used it at my last company. We used Debezium to capture changes from the database using the WAL; it would then write to a Kafka topic and we could read off it. The one downside is that it writes all the messages into a single topic and a single partition to ensure ordering. So the approach you mentioned of writing directly to Kafka would allow us to write to multiple partitions if needed (allowing more parallelization)
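For readers who haven't seen it, a minimal sketch of applying Debezium-style change events to an index. The envelope fields (`op`, `before`, `after`) follow Debezium's documented event shape, but the plain dict standing in for the search index and all function names are invented for illustration; in the setup described above these events would arrive in order on a single Kafka partition.

```python
import json

# Apply one Debezium-style change event to an in-memory "index".
# "op" codes: "c" = create, "u" = update, "d" = delete.
def apply_change_event(index, raw_event):
    payload = json.loads(raw_event)["payload"]
    op = payload["op"]
    if op in ("c", "u"):
        doc = payload["after"]       # new row image
        index[doc["id"]] = doc
    elif op == "d":
        index.pop(payload["before"]["id"], None)  # old row image

# Hand-built sample events in the same envelope shape
create = json.dumps({"payload": {"op": "c", "before": None,
                                 "after": {"id": 1, "text": "hello"}}})
delete = json.dumps({"payload": {"op": "d", "before": {"id": 1},
                                 "after": None}})
```

Since a single partition preserves WAL order, applying events in consumption order keeps the index consistent with the database.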
@kyabia2333 10 months ago
amazing, very helpful
@yashagarwal8249 6 months ago
Will the Search Service pull the actual documents from the DB once it receives the document IDs from the cache/search index?
@jordanhasnolife5163 6 months ago
Yep!
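A tiny sketch of that two-step read: the inverted index returns doc IDs, and the service then hydrates the full documents from the primary store. All names and sample data are made up for the example.

```python
# Step 1: look up matching doc IDs in the inverted index.
# Step 2: hydrate full documents from the primary document store.
def search(inverted_index, doc_store, term):
    doc_ids = inverted_index.get(term, [])
    return [doc_store[d] for d in doc_ids if d in doc_store]

doc_store = {1: {"id": 1, "text": "giga chad"}, 2: {"id": 2, "text": "chad"}}
inverted_index = {"chad": [1, 2], "giga": [1]}
results = search(inverted_index, doc_store, "giga")
```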
@maxmanzhos8411 6 months ago
Wrote a long comment about how a posting list (the documents containing a term) is implemented as a skip list + encoding, per the apache/lucene GitHub repo's Lucene99PostingsFormat, as I was wondering why we can't use a similar idea for follower/following list storage in the news feed problem (from System Design 2). But it's only viable if you either store the data in Lucene (I guess no one does that with this purpose in mind) or have full control over the DB code so that you can do such advanced customization over a column (also not practical). nice guns
@jordanhasnolife5163 6 months ago
Interesting, I haven't heard of that data structure, but I'd agree it may be an overoptimization. Thanks, I work hard on the guns haha
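For the curious, a toy version of the skip-pointer idea from the comment above: store the sorted posting list as deltas between consecutive doc IDs, plus a small skip table of (absolute ID, position) pairs so membership checks can jump ahead instead of scanning from the start. Lucene's actual Lucene99PostingsFormat is far more elaborate (block-based, bit-packed); this is just the shape of the idea, with all names invented.

```python
SKIP = 2  # record a skip entry every SKIP-th posting

def encode(doc_ids):
    deltas, skips, prev = [], [], 0
    for i, d in enumerate(sorted(doc_ids)):
        deltas.append(d - prev)      # store gap from previous doc id
        prev = d
        if i % SKIP == 0:
            skips.append((d, i))     # (absolute doc id, position in deltas)
    return deltas, skips

def contains(deltas, skips, target):
    """Membership test that uses the skip table to avoid a full scan."""
    if not deltas or target < skips[0][0]:
        return False
    start_val, start_idx = skips[0]
    for val, idx in skips:           # last skip entry <= target
        if val <= target:
            start_val, start_idx = val, idx
    current = start_val
    if current == target:
        return True
    for i in range(start_idx + 1, len(deltas)):
        current += deltas[i]         # decode forward from the skip point
        if current >= target:
            return current == target
    return False

deltas, skips = encode([3, 7, 11, 20, 21])
```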
@idobleicher a year ago
I liked your videos, new sub!
@anupamdey4893 a year ago
Love your content! Keep up the good work!!
@jordanhasnolife5163 a year ago
Thanks Anupam!!
@AmolGautam 6 months ago
Thanks giga bro
@jordanhasnolife5163 6 months ago
np gigachad
@neethielizabethjoseph 3 months ago
Don't we need a parser/lexer service between Kafka and the search index that parses the tweets and hashes them to the correct partitions of the search index?
@jordanhasnolife5163 3 months ago
Something like Elasticsearch will do this for us, hence why I don't explicitly include it.
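A rough sketch of what that implicit step amounts to: analyze (lowercase + tokenize) the tweet text, then route each document to an index partition by hashing its ID. The tokenizer regex, partition count, and all names are arbitrary choices for the example, not what Elasticsearch actually does internally; real ES analyzers are configurable chains of char filters, tokenizers, and token filters.

```python
import re

NUM_PARTITIONS = 4  # illustrative shard count

def analyze(text):
    """Lowercase and tokenize, keeping hashtags/mentions intact."""
    return re.findall(r"[a-z0-9#@]+", text.lower())

def route(doc_id):
    """Pick the index partition for a document by hashing its ID."""
    return hash(doc_id) % NUM_PARTITIONS

tokens = analyze("Giga Chad #gains")
```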
@raj_kundalia 6 months ago
thank you!
@kamalsmusic 2 years ago
If we use the local index (meaning each node stores term -> [doc id's] and multiple nodes can reference the same term), does this mean we need to query all the nodes to answer a search query? How do we know which nodes have the term we are interested in if we are not partitioning by term?
@jordanhasnolife5163 2 years ago
Yes, you have to query them all and aggregate. It's unfortunate, but there's typically too much data to shard by term as opposed to document.
@axings1 10 months ago
@jordanhasnolife5163 could we first partition by term, then further partition into multiple shards if a single term has too much data?
@user-zc7os7on7k 9 months ago
I don't understand most of the things, but thanks for the video.
@jordanhasnolife5163 9 months ago
feel free to elaborate
@FarhanKhan-wu3fq a year ago
Did you really just "NOPQRS" your way to figuring out what comes after P?
@jordanhasnolife5163 a year ago
I am dumb
@neek6327 2 years ago
Hey man, qq. Do you think it would be important in an interview to mention how we know which machine holds which partition? I was thinking we could have a distributed search/index service that maintains the mapping between partition -> machine, and that mapping could be made consistent across the "search/index service" nodes via a consensus algo, or maybe ZK. Does this make sense at all, or am I missing something? Maybe it's the local secondary indexes that take care of the problem I'm describing and I just don't understand 🤷‍♂️
@neek6327 2 years ago
Like, rather than relying on the local index, if we knew which machine held which partition, couldn't we just go directly to the correct shard and perform a binary search?
@jordanhasnolife5163 2 years ago
Yes, you would use ZooKeeper or a gossip protocol to keep track of which docs are held on which partition. Though this shouldn't really matter, since we have to query each partition anyways.
@neek6327 2 years ago
Hmm sorry, maybe this is going over my head. Why is it that we need to query each partition if we know exactly which partition contains the word we're looking for? Like, say someone searches the word "gigachad" and we know that machine 1 holds the partition range with that word in it. Couldn't we go directly to machine 1 and perform a binary search there rather than querying all the shards? Maybe my understanding is off?
@jordanhasnolife5163 2 years ago
@neek6327 We aren't partitioning that way here - we're partitioning by groups of document IDs, not by term. While in theory partitioning by term is optimal, in reality there are often too many document IDs associated with one term to fit on a given machine, so we have no real choice but to use local indexes over a group of documents.
@neek6327 2 years ago
Got it, that makes sense. Thanks 🙏
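The scheme this thread settles on can be sketched in a few lines: each shard keeps a local term -> [doc IDs] index over its own slice of documents (partitioned by doc ID), so a query scatters to every shard and gathers the merged partial results. All class and function names here are illustrative.

```python
from collections import defaultdict

class Shard:
    """One node's local inverted index over its slice of documents."""
    def __init__(self):
        self.index = defaultdict(list)  # term -> doc ids on this shard

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.index[term].append(doc_id)

    def search(self, term):
        return self.index.get(term, [])

def build(docs, num_shards):
    shards = [Shard() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards].add(doc_id, text)  # partition by doc id
    return shards

def search_all(shards, term):
    # Scatter the query to every shard, then gather and merge the results.
    return sorted(d for s in shards for d in s.search(term))

shards = build({1: "giga chad", 2: "chad only", 3: "hello world"}, 2)
```

Note the trade-off the thread describes: every query touches every shard, which is the price of not partitioning by term.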
@RandomShowerThoughts a year ago
Grokking the system design sucks at this question, ngl - searched for a solution right after reading it
@eudaimonian9473 2 years ago
Gigachad42 in da house
@jordanhasnolife5163 2 years ago
I've actually evolved to gigachad43 now
@art4eigen93 2 years ago
Interviewee: API design is going to be pretty tiny
Interviewer: How tiny?
Interviewee: You know....
@jordanhasnolife5163 2 years ago
This guy gets it 😙
@RandomShowerThoughts a year ago
00:40 lmaooooo the "day in my life as a software engineer" videos are cringey af