Realtime Advertisement Clicks Aggregator | System Design

19,522 views

Code with Irtiza

Let’s design a real-time advertisement clicks aggregator with Kafka, Flink, and Cassandra. We start with a simple design and gradually make it scalable while discussing different trade-offs.
Note: pdfhost.io/edit?doc=8a32143c-...
System Design Playlist: • System Design Beginner...
🥹 If you found this helpful, follow me online here:
✍️ Blog / irtizahafiz
👨‍💻 Website irtizahafiz.com
📲 Instagram / irtiza.hafiz
00:00 Why Track & Aggregate Clicks?
01:07 Simple System
02:12 Will it scale?
04:00 Logs, Kafka & Stream Processing
12:02 Database Bottlenecks
17:13 Replace MySQL
18:59 Data Model
25:45 Data Reconciliation
29:00 Offline Batch Process
32:10 Future Videos
#systemDesign #programming #softwareDevelopment

Comments: 86
@kevindebruyne17 · 1 year ago
Your videos are really great: no fluff, straight to the topic, and they cover a lot of detail. Thank you and keep it up!
@irtizahafiz · 1 year ago
Hi! Even though you are a City fan, thank you for watching the video haha!
@nosh3019 · 10 months ago
Thanks for the effort in making this! Very informative and a perfect companion to the System Design Interview Volume 2 book.
@irtizahafiz · 9 months ago
Yes, please refer to the book for more details! It's a brilliant book!
@sarthakgupta290 · 3 months ago
Thank you so much for this perfect explanation!!
@freezefrancis · 1 year ago
I think you deserve a much larger audience! The quality of the content is really good. Thanks for sharing.
@irtizahafiz · 1 year ago
Thank you for watching! Please like and share to reach more people.
@indavarapuaneesh2871 · 4 months ago
What I would've done differently: have both warm and cold storage. If your data access pattern mostly reads data from the last 90 days (pick your number), store that data in a warm tier like Vitess (sharded MySQL) or some other distributed relational DB, and run a periodic background process that vacuums stale data from the warm tier and exports it to a cold tier such as a data lake. This way you're optimizing both read-query latency and storage cost. Best of both worlds.
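A minimal sketch of that vacuum-and-export job, assuming a DB-API style connection and a hypothetical "clicks" table with a "click_date" column:

```python
import datetime

WARM_RETENTION_DAYS = 90  # assumption: reads mostly target the last 90 days

def vacuum_warm_tier(conn, export_to_cold):
    """Export rows older than the warm-retention window, then delete them."""
    cutoff = datetime.date.today() - datetime.timedelta(days=WARM_RETENTION_DAYS)
    cur = conn.cursor()
    # "clicks" and "click_date" are hypothetical names for this sketch.
    cur.execute("SELECT * FROM clicks WHERE click_date < %s", (cutoff,))
    export_to_cold(cur.fetchall())  # e.g. write Parquet files into a data lake
    cur.execute("DELETE FROM clicks WHERE click_date < %s", (cutoff,))
    conn.commit()
```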
@ranaafifi5487 · 1 year ago
Very organized and neat! Thank you
@irtizahafiz · 1 year ago
Glad it was helpful!
@TheZoneTrader · 1 year ago
Each day's click storage = 0.1 KB * 3B = 0.3 TB/day, and not 3 TB/day? Correct me if I am wrong.
@ramin5021 · 5 months ago
I think you missed the K in KB. BTW, he didn't calculate it correctly either. I think the calculation below is correct: 0.1 KB = 100 bytes; 100 bytes * 3B = 300 TB. Correct me if I am wrong.
@erickadinugraha5906 · 3 months ago
@ramin5021 You're wrong: 100 bytes * 3B = 300 GB = 0.3 TB.
@srawat1212 · 3 months ago
It should be 0.3 TB or 300 GB.
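For reference, a quick sanity check of the thread's numbers under the video's assumptions (0.1 KB per click, 3 billion clicks per day):

```python
record_size_bytes = 100                # 0.1 KB per click record
clicks_per_day = 3_000_000_000         # 3 billion clicks per day

daily_bytes = record_size_bytes * clicks_per_day
print(daily_bytes / 10**9, "GB/day")   # 300.0 GB/day
print(daily_bytes / 10**12, "TB/day")  # 0.3 TB/day
```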
@sergiim5601 · 1 year ago
Great explanation and examples!
@irtizahafiz · 1 year ago
Thank you!
@weixing8985 · 1 year ago
Thanks for the tutorials! I think you're following the topics of the book System Design Interview Volume 2, but in a way that's a lot easier to understand. I struggled with those topics in the book until I came across your tutorials!
@irtizahafiz · 8 months ago
Yes! I refer to both volumes of that book. I mention it in the video, and it's linked in most descriptions.
@pratikjain2264 · 4 months ago
This was by far the best video. Thanks for doing it!
@irtizahafiz · 3 months ago
Most welcome!
@rajeshkishore7119 · 5 months ago
Excellent
@mohsanabbas6835 · 2 years ago
An event data streaming platform. It's a more complex system, where data is processed either as real-time streams or in batches: ETL, data pipelines, etc.
@jkl89966 · 1 year ago
awesome
@karthikbidder · 1 month ago
Since I'm working in adtech and looking to modernize our approach, I fortunately came across your video, and it has helped me a lot. My question: how about using ClickHouse instead of Cassandra? Will it work well, or lead to any issues?
@protyaybanerjee5051 · 4 months ago
Some notes about this design:
- "Adding more topics" is a very vague statement. We have to define the data model that captures each click event, and then allow data partitioning based on advertisement_id and some form of timestamp.
- Not sure why replication lag is stated as an issue here. The read patterns for this design don't require reading consistent data, so this should not be a problem.
- "Relational DBs won't do well with aggregation queries" is a little misleading. Doing aggregation queries efficiently requires storing the data in a column-major format, which unlocks efficient compression and data loading.
- Why provision stream-processing infra just to upload data to cold storage? Once a log file reaches X MB, we can place an event in Kafka with a (file_id, offset) pair. A consumer would read this and upload the data to S3. This avoids unnecessary dollar cost as well as the operational cost of maintaining stream infrastructure.
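On the first point above, a hedged sketch of what a click event and its partition key could look like, assuming the confluent-kafka client and a hypothetical "ad-clicks" topic; keying by advertisement_id keeps all clicks for an ad in one partition:

```python
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_click(advertisement_id: str, user_id: str, ip: str) -> None:
    """Publish one click event, keyed by advertisement_id so every click for
    a given ad lands in the same partition and aggregates without shuffling."""
    event = {
        "advertisement_id": advertisement_id,
        "user_id": user_id,
        "ip": ip,
        "click_ts": int(time.time() * 1000),  # event time in epoch millis
    }
    producer.produce("ad-clicks", key=advertisement_id, value=json.dumps(event))
    producer.poll(0)  # serve delivery callbacks without blocking
```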
@rishabhjain2404 · 3 months ago
We could use a count-min sketch for real-time click aggregation in the stream processor; it is very fast, and you can query data at last-minute granularity. A MapReduce system can be useful for exact click information: clicks can be batched, put into an HDFS system, reduced into aggregates, and saved to the DB.
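For reference, a toy count-min sketch in Python; the width, depth, and hash construction here are illustrative, and estimates can over-count due to collisions but never under-count:

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 2048, depth: int = 5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # Derive one independent-ish hash per row by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._cells(key):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # The minimum across rows is the least collision-inflated counter.
        return min(self.table[row][col] for row, col in self._cells(key))

cms = CountMinSketch()
for _ in range(42):
    cms.add("ad_123")
print(cms.estimate("ad_123"))  # 42 (or slightly more under heavy collisions)
```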
@monodeepdas2778 · 3 months ago
AFAIK the count-min sketch is applicable to the top-k problem. I know we can use some faster, less accurate algorithms, but that is exactly what stream processors can do.
@kirillkirooha3848 · 1 year ago
Thanks for the video, that's brilliant! But I didn't quite understand the problem with inaccurate data in the stream processor. Late data can arrive at the stream processor, but I suppose it has the timestamp from the Apache log. So can't we just insert that late data into Cassandra?
@irtizahafiz · 1 year ago
Hi! Yes, you can insert the late data when storing individual clicks. However, when you are computing aggregations such as "total clicks every hour", you will have already emitted the count at the end of the hour (with some buffer). Then, when late data arrives, you won't be able to "correct" your aggregate.
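A toy illustration of that late-data problem, assuming hourly tumbling windows with a 5-minute grace period; real stream processors such as Flink express the same idea with watermarks and allowed lateness:

```python
from collections import defaultdict

WINDOW_SECONDS = 3600   # hourly tumbling windows
GRACE_SECONDS = 300     # 5-minute buffer before a window is finalized

windows = defaultdict(int)  # (ad_id, window_start) -> click count
watermark = 0               # highest event time seen so far

def on_click(ad_id: str, event_ts: int) -> None:
    global watermark
    watermark = max(watermark, event_ts)
    window_start = event_ts - (event_ts % WINDOW_SECONDS)
    if watermark > window_start + WINDOW_SECONDS + GRACE_SECONDS:
        # The hourly count was already emitted; this click can no longer
        # "correct" it in real time, so defer it to batch reconciliation.
        print(f"late click for {ad_id}, deferring to the offline batch job")
    else:
        windows[(ad_id, window_start)] += 1
```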
@roopashastri9908 · 3 months ago
Your videos are great! Very clearly articulated! I was curious why we have to use a NoSQL DB if we are storing only the aggregated data based on advertiser ID. What are the drawbacks of using a columnar DB like Snowflake in this case?
@irtizahafiz · 3 months ago
You can use whatever DB works best for your case. I think Snowflake will work just as well here.
@bobuputheeckal2693 · 11 months ago
Is it 300 GB or 3 TB?
@code-commenter-vs7lb · 6 months ago
Hello, when we introduce a log file, how do we ensure the aggregates are still "near real-time"? IMO, when you introduce log files in the middle, you get append-only logs that we will probably only publish once the log file has finished appending and started a new file. So there is a delay of some time, maybe a minute or so (depending on how big your rolling window is).
@irtizahafiz · 4 months ago
That depends on how you are reading the log file. You can add a "watcher" that tails the log file and publishes a message downstream whenever a new line is appended. Alternatively, as a first step, you can write to Kafka, and the Kafka consumer can both process the data and add it to the log file.
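A minimal sketch of such a watcher, assuming the confluent-kafka client and a hypothetical "ad-clicks" topic; a production version would also persist its file offset to survive restarts and log rotation:

```python
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def tail_and_publish(path: str, topic: str = "ad-clicks") -> None:
    """Follow an append-only log file and publish each new line to Kafka."""
    with open(path) as f:
        f.seek(0, 2)              # jump to end of file, like `tail -f`
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)   # nothing new yet; poll again shortly
                continue
            producer.produce(topic, value=line.strip())
            producer.poll(0)      # serve delivery callbacks
```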
@sumonmal009 · 4 months ago
Capturing the click with application logging is a good idea. The main crux is at 6:30 and 21:30.
@tonyliu2858 · 1 year ago
Can we use MapReduce for stream processing? Will it meet the latency requirement? Or do we have to use other stream processors such as Flink/Spark?
@irtizahafiz · 8 months ago
I think it all depends on your application. If you want the most real-time results, I know Flink or Spark (with micro-batches) can get you that.
@lytung1532 · 1 year ago
The tutorial is helpful. Could you give a sample of the format of the records stored in Cassandra? Do we need to store all the data from the original click in Cassandra?
@irtizahafiz · 1 year ago
Depends on your use case. You could have two Cassandra tables, one for individual records and one for aggregations.
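A sketch of what those two tables could look like via the Python Cassandra driver; the "ads" keyspace and all column names are assumptions, not the video's exact model:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("ads")  # assumes keyspace "ads" exists

# Raw clicks, partitioned by (ad, day) so no single partition grows unbounded.
session.execute("""
    CREATE TABLE IF NOT EXISTS clicks_raw (
        advertisement_id text,
        click_date date,
        click_ts timestamp,
        user_id text,
        PRIMARY KEY ((advertisement_id, click_date), click_ts)
    )
""")

# Pre-aggregated counts: one counter row per ad per hourly window.
session.execute("""
    CREATE TABLE IF NOT EXISTS clicks_hourly (
        advertisement_id text,
        window_start timestamp,
        click_count counter,
        PRIMARY KEY (advertisement_id, window_start)
    )
""")
```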
@ax5344 · 4 months ago
@5:35, 0.1 KB * 3B = 3 TB. Hi, how is this computation done? I thought 3B has 9 zeros; multiplying it by 0.1 leaves 8 zeros, and 1 TB is 1e9 KB. So I thought it would be 0.3 TB. Did I get something wrong?
@utkarshgupta2909 · 1 year ago
Correct me if I am wrong, but this seems more like a Lambda architecture: the aggregation path is fast but inaccurate, whereas the S3 path is slow but accurate.
@irtizahafiz · 1 year ago
Yup, I think you can use that term for this.
@freezefrancis · 1 year ago
Yes. That's right.
@ankitagarwal4022 · 5 months ago
Excellent video, going from a basic design to a scalable design. A few questions: 1. What technology can we actually use for the log watcher? 2. Can you correct my understanding of the stream processors? For saving data to S3, the stream processor can be any consumer of Kafka events, like a simple Java service that pushes data to S3; and for the stream processor that aggregates the data and saves it to Cassandra, we can use Flink.
@irtizahafiz · 3 months ago
1. I don't remember off the top of my head, but you can even write a small cron job to read the file every few minutes. There are better tools, though, if you Google around. 2. Yes, that should work. Flink can automatically checkpoint its state to S3 while doing the aggregation.
@PrateekSaini · 4 months ago
Thanks for such a clear and detailed explanation. Could you please share a couple of blogs/articles for reference, where companies use this kind of system?
@irtizahafiz · 3 months ago
You should find some in the Uber engineering blog.
@VishalThakur-wo1vx · 5 months ago
We could also keep state in a Kafka Streams application (local or global state) and use interactive queries to fetch the result of the aggregation. Can you please share how to decide when to offload the aggregation result to an external DB vs. when to use interactive queries? I understand that durability can be one factor, but what are the others?
@irtizahafiz · 4 months ago
If your tables are on a relational DB and they are relatively large, aggregations will do poorly. Instead, either store precomputed aggregations in a different table, or compute on the fly using something like Flink.
@user-eq4oy6bk5p · 2 years ago
What do we store in S3? Apache log files?
@irtizahafiz · 2 years ago
No. In S3 we store the individual click events. In Cassandra we store the aggregations. We are doing that under the assumption that individual clicks are rarely accessed, only aggregations are accessed regularly.
@shadytanny · 2 years ago
How are we using log files? Is reading from log files the best way to source events?
@irtizahafiz · 2 years ago
It totally depends on your system. In this example we are deciding to read from log files because of its simplicity. If you want, you can also run a Spark job on some logs for every website visit.
@dhruvgarg722 · 6 months ago
Great video! Did you create these diagrams in Obsidian, or are they images?
@irtizahafiz · 6 months ago
I use a combination of Obsidian and Miro to create the notes.
@KShi-vq4mg · 10 months ago
Big fan! Hopefully a lot more people find these. Just one piece of feedback: you covered what kind of data we store; would it also be worth going a little deeper into the data model?
@irtizahafiz · 9 months ago
Hi! That's really good feedback! It's definitely worth diving deeper, but it's difficult to do that in a high-level system design video. If that's something you would find valuable, I can definitely create videos on data models.
@vikramreddy7586 · 3 months ago
Correct me if I'm wrong: we are also processing the data fetched via the batch job, so essentially we are processing the data twice to get rid of the inaccuracies?
@irtizahafiz · 3 months ago
It's been a while since I uploaded the video, but I believe we are only processing it once with the stream processing pipeline. However, it's pretty common to run some kind of a nightly job to "correct" some of the small inaccuracies coming from the real-time aggregations.
@parthmahajan6535 · 2 months ago
That was an awesome video; I had a similar approach and got it validated. I was wondering if you could also start a code series on building such systems (as demonstrated in the video).
@irtizahafiz · 2 months ago
Thank you for watching! I plan on building similar sub-systems, but TBH, building an e2e system like this without an actual use case (and traffic) is not really worth it. I don't think it will add much value either. Thank you for the suggestion though. Appreciate it.
@parthmahajan6535 · 2 months ago
@irtizahafiz It would make sense though, if someone's just starting. I was thinking we could use some dataset of clickstream logs, create a stream of the logs coming in (simulate a stream through Python), and then build the system.
@krutikpatel906 · 5 months ago
For the batch job, I think you need a separate stream processor; you can't mix real-time data and old data. Please share your thoughts.
@irtizahafiz · 4 months ago
You don't need a stream processor for batch jobs. You can do it offline.
@matthayden1979 · 4 months ago
Is it correct to say that Kafka is a data ingestion platform, since data is stored in topics and later processed by the stream processor?
@irtizahafiz · 3 months ago
Yup, you can say that. But you shouldn't treat Kafka as persistent storage; treat it as a temporary buffer. If you want to keep the incoming data for longer, you can dump it into S3 or a data warehouse like Redshift/Snowflake.
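A rough sketch of such a dump-to-S3 consumer, assuming confluent-kafka and boto3; the topic, bucket, and batch size are placeholders, and real pipelines usually partition object keys by date:

```python
import boto3
from confluent_kafka import Consumer

s3 = boto3.client("s3")
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "s3-archiver",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["ad-clicks"])

BATCH_SIZE = 10_000
batch, file_seq = [], 0

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append(msg.value().decode())
    if len(batch) >= BATCH_SIZE:
        s3.put_object(Bucket="my-click-archive",  # placeholder bucket name
                      Key=f"clicks/batch-{file_seq:08d}.jsonl",
                      Body="\n".join(batch).encode())
        consumer.commit()  # commit offsets only after the upload succeeds
        batch, file_seq = [], file_seq + 1
```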
@eternalsunshine313 · 10 months ago
Why can't we process the clicks a bit later than they happen, so we capture late data and avoid the batch job? This wouldn't be 100% real-time, but do most systems need super-fast real-time processing for ad clicks?
@irtizahafiz · 9 months ago
Totally depends on your use case. There are real-life use cases where you need it to be real-time; in those cases, the trade-off of dropping "some" late data is acceptable.
@srishtiganjoo8549 · 4 months ago
Given that Kafka is durable, why are we storing the clicks in log files? Would this not hamper system performance? If we need logs, maybe the Kafka consumer can log them to files.
@irtizahafiz · 4 months ago
Kafka is durable only for a short time (14 to 30 days, depending on how it's configured). You also cannot easily run analytical queries on data in Kafka, or connect it to reporting software like Power BI and Tableau. For those reasons and more, you usually use Kafka as a buffer rather than permanent storage.
@unbox2359 · 2 months ago
Can someone help me with the Cassandra database schema design? Like, what tables and columns would there be?
@irtizahafiz · 2 months ago
It depends on the type of application you are trying to build.
@unbox2359 · 2 months ago
@irtizahafiz I'm asking for this application only.
@shubhamdalvi6424 · 9 months ago
Great video! A minor error: 0.1 KB x 3B = 300 GB.
@irtizahafiz · 9 months ago
Thank you! I will start posting again soon, so please let me know what type of content interests you the most.
@siarheikaravai · 1 year ago
What about cost calculations? How do we estimate how much money this system will consume?
@irtizahafiz · 8 months ago
That depends on your AWS, GCP, etc. platform costs. Most of these systems and DBs will be hosted on instances with different billing requirements.
@chetanyaahuja1241 · 2 months ago
Thanks for clearly explaining the end-to-end design. Just a couple of questions: 1) Could you explain a little about how the Apache log files get the click information, and how that is real-time? 2) Also, do you have a link to these notes/diagrams? The one in the description doesn't work.
@irtizahafiz · 2 months ago
The simplest would be to write a cron job or something similar that runs every couple of minutes, reads the log file, and writes new data to Kafka. You can also poll using a continuously running Python program: a program running in, say, a "while" loop that reads from the file every couple of minutes and writes to Kafka. These are two solutions you can quickly prototype. For more comprehensive solutions, there are dedicated file-watcher daemons you could use.
@irtizahafiz · 2 months ago
Right, about the links. Unfortunately, they expired. Even I don't have access to most of them anymore. Sorry!
@chetanyaahuja1241 · 2 months ago
@irtizahafiz Thank you for the explanation.
@neelakshisoni6118 · 3 months ago
I can't download your PDF notes.
@irtizahafiz · 3 months ago
Sorry, some of the links have expired!
@szyulian · 2 months ago
Watched.
@irtizahafiz · 2 months ago
Thank you!