Congrats on 55K now. Your videos are among the best out there. I love how you segue from one video to another, building on the last, ending with challenges, and covering the solutions in the next video.
@jordanhasnolife5163 2 months ago
Thanks man!!
@art4eigen93 1 year ago
Congratulations on 10K, Jordan!
@vankram1552 1 year ago
Congrats on your data-intensive applications YouTube channel
@jordanhasnolife5163 1 year ago
Thank you! Martin Kleppmann has no idea
@rebirthless1 5 months ago
9:25 I'm guessing the result of the table-table join is used within the consumer itself instead of being externalized, and the consumer consumes other streams in addition to these 2 CDC streams that need the joined result, except they're not drawn in the diagram?
@jordanhasnolife5163 5 months ago
Not necessarily! You may just sink the result elsewhere. That being said, it's certainly possible this join could in turn be used with a third stream.
@sushantsrivastav 1 year ago
I noticed that a majority of your designs lean heavily towards the Kafka + Flink combination. Is this a personal choice, or do you run your designs by senior engineers? Don't get me wrong, these designs are not textbook-y; they are real and heavily tend towards what is "in" now (as against Alex Xu's designs, which seem, for lack of a better word, a bit dated). I have 17 years of experience in the industry (I am a fossil), but I get to learn many things from your discussions. This is incredibly rare for someone with 2 years of experience. Thanks for everything that you do!
@jordanhasnolife5163 1 year ago
Mainly a personal choice, but I experience the pitfalls of using non-replayable message brokers every day at work, so I'm definitely pro log-based MQs and stateful consumers whenever I notice the opportunity to use them. That being said, I understand that in practice this may have a high cost to implement due to storage and/or latency, and many people might compromise and opt for fewer partitions/simpler solutions. My designs are generally pretty idealized and not at all optimized for cost, which is why you don't see stuff like this too much IRL.
@sushantsrivastav 1 year ago
@@jordanhasnolife5163 On the contrary, I honestly believe these *are* real world and not textbook-y and cookie-cutter like "Grokking". If someone were to build these systems in 2023, they would choose this tech, as against, say, 2017-18, when "system design" questions became mainstream.
@bokistotel 6 months ago
I watched this video again and I don't understand how frequently the consumer should fetch/process elements from the demographics queue. If I got this right, each time the consumer processes an element from the demographics queue, a new in-memory copy of the database is created. So, for example, if the current state is sent to the queue every time the DB is updated, the in-memory database in the consumer would change frequently. My question is: when a new "search term" arrives at the consumer, how should we handle the case where the in-memory database is mid-update? The first thing that pops into my head is to give events from the demographics queue higher priority on the consumer, so that if the consumer has something from the demographics queue, it processes that first, and only then processes the potential join. And if anything from the demographics queue takes precedence over the search-term queue, we would have to wait until the demographics queue is empty; once it's empty, the database state is consistent, and then we perform the join?
@jordanhasnolife5163 6 months ago
There needs to be some sort of locking so that they aren't both grabbing the demographic table at the same time. That being said, there's not really any concept of determinism here. We just process events as they come in, no need for a "priority" or anything like that.
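The locking described above could be sketched roughly like this: a minimal, illustrative stateful consumer whose in-memory demographics table is guarded by a lock, with events from both streams handled simply in arrival order (every name and schema here is hypothetical, not from the video):

```python
import threading

class EnrichmentConsumer:
    """Minimal sketch of a consumer joining a search-term stream against an
    in-memory demographics table fed by CDC. All names are illustrative."""

    def __init__(self):
        self.demographics = {}        # user_id -> demographic row
        self.lock = threading.Lock()  # serializes table reads and writes

    def on_demographic_event(self, event):
        # A CDC event updates the in-memory copy of the demographics table.
        with self.lock:
            self.demographics[event["user_id"]] = event["row"]

    def on_search_term(self, event, emit):
        # No priority between streams: events are processed as they come in.
        # The lock only prevents a read racing a concurrent table update.
        with self.lock:
            row = self.demographics.get(event["user_id"])
        emit({"term": event["term"], "demographics": row})
```

Usage would just be wiring each queue's events into the matching handler; if the demographic row hasn't arrived yet, the join output simply carries `None` for it.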
@LeoLeo-nx5gi 1 year ago
Hi, thanks for covering the issues in depth at the end; I was about to post a comment asking how the in-memory state in all the consumers would work, and more. Just as a side note: if this problem were to be solved without Flink or any such tool, is there any other approach?
@jordanhasnolife5163 1 year ago
I think you'd probably end up reinventing the wheel. Taking distributed snapshots is really expensive, so Flink found a great way to do it without hugely impacting performance.
@LeoLeo-nx5gi 1 year ago
@@jordanhasnolife5163 Makes sense
@shibhamalik1274 10 months ago
Hi @jordan, awesome video! Is there a CDC deep-dive video? Is it a queue push after the db save? And if yes, then it is similar to 2-phase commit, isn't it?
@jordanhasnolife5163 10 months ago
Similar, however keep in mind that the push to the queue is *from* the db! And we don't need that push to happen for the write to be committed to the database. If the db goes down, or can't communicate with the queue, it's ok if we place those writes there later. That's the main difference.
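The ordering described in the reply above can be sketched as a toy simulation (everything here is illustrative, not a real CDC connector): the write commits to the database's log first, and a separate tailer pushes log entries to the queue afterwards. Unlike 2-phase commit, the commit never waits on the queue, and a downed queue just means the tailer catches up later.

```python
db_log = []      # stands in for the database's write-ahead log
queue = []       # stands in for the message queue
tail_offset = 0  # how far the CDC tailer has read into the log

def commit(write):
    # Committing a write depends only on the database, never on the queue.
    db_log.append(write)

def tail_log(queue_available=True):
    # The CDC tailer runs asynchronously after commit.
    global tail_offset
    if not queue_available:
        return  # fine: the log is durable, we catch up on the next pass
    while tail_offset < len(db_log):
        queue.append(db_log[tail_offset])
        tail_offset += 1

commit({"id": 1})
tail_log(queue_available=False)  # queue is down; the commit still succeeded
commit({"id": 2})
tail_log()                       # tailer catches up: both entries pushed
```

In 2PC, by contrast, `commit` would block until the queue also voted to commit, coupling database availability to queue availability.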
@indraneelghosh6607 1 year ago
Would it make sense to query the db as a fallback for data that is not in memory and has no CDC event in the queue for that ID? If you have several TBs of data in the DB, maintaining such a large number of consumers may be rather costly, right (as RAM is expensive)? Is there a more cost-effective solution?
@jordanhasnolife5163 1 year ago
You could theoretically store the Flink state on disk, I believe, but yeah, if latency isn't a main concern you could always just query a db.
@zachlandes5718 1 year ago
Can you review some cases where we'd do table-table joins with streams? Is it mainly to offload work from the db (separation of consumers and db)? Or to improve performance?
@jordanhasnolife5163 1 year ago
If you want realtime joins that actually update for you, it would be useful. Think of a books table and an authors table. They're both big, so you don't want to query them both over and over. Maybe a new book gets added, and it turns out many authors had contributed to it, some of whom were already in the table. Let's do a join. Maybe a new author gets added, and it turns out they contributed to many books in the table. Now we only fetch the rows we need without having to redo a full join.
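The incremental books/authors join from the reply above can be sketched like this (a toy sketch: the schemas, a single-key author-book link, and all names are assumptions for illustration). Each incoming event joins only the new row against the other table, rather than redoing the full join:

```python
from collections import defaultdict

books = {}                            # book_id -> title
authors_by_book = defaultdict(list)   # book_id -> list of author names
joined = []                           # materialized join output

def on_book(book_id, title):
    books[book_id] = title
    # Join just this new book against authors we've already seen.
    for name in authors_by_book[book_id]:
        joined.append((title, name))

def on_author(book_id, name):
    authors_by_book[book_id].append(name)
    # Join just this new author against the book, if we've seen it.
    if book_id in books:
        joined.append((books[book_id], name))

on_author(1, "Kleppmann")       # no matching book yet, nothing emitted
on_book(1, "DDIA")              # book arrives: emits ("DDIA", "Kleppmann")
on_author(1, "Hypothetical Coauthor")  # emits one new row, no full rescan
```

Each event touches only its own key, which is what keeps the join realtime: the work per event is proportional to its matches, not to the size of either table.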
@Rahul-pr1zr 10 months ago
So if the in-memory tables are huge, you mentioned we can partition the incoming data into multiple queues. Does this mean we need to maintain multiple consumers, each consuming from a specific queue?
@jordanhasnolife5163 10 months ago
That's correct!
@bokistotel 6 months ago
@8:05 Are you talking about 2 tables in separate databases?
@jordanhasnolife5163 6 months ago
Yeah
@markelmorehome 5 months ago
I love Martin Kleppmann almost as much as I love you. He'd be a NYT best seller if he added tissue jokes.
@jordanhasnolife5163 5 months ago
lmao imagine
@krish000back 6 months ago
Can you please provide a few real-world examples where these 3 enrichment types are used?
@jordanhasnolife5163 6 months ago
Is that not what I did in the video? You're welcome to check out DDIA if you're looking for something more concrete, it's in the streaming chapter.
@krish000back 6 months ago
@@jordanhasnolife5163 I thought that was just made up (the Google search example), and that in general people pull from the DB itself.
@tamarapensiero8048 1 year ago
congrats on 10k hottie
@jordanhasnolife5163 1 year ago
I literally spontaneously combusted
@pushpendrasingh1819 1 year ago
do you make videos after waking up and smoking one joint?
@jordanhasnolife5163 1 year ago
No I do them right between when I smoke crack and go to bed
@yrfvnihfcvhjikfjn 1 year ago
where do I buy the foot pics?
@jordanhasnolife5163 1 year ago
You don't actually buy them, I just give them out in exchange for job referrals
@salmanrizwan9730 1 year ago
video from jordan finaaaaaaaaaaaaaaaaaaalllllllllyyyyyyyyyy😍😍😍😍😍