How Razorpay scaled their notification system

22,195 views

Arpit Bhayani


Comments: 84
@y5it056 2 years ago
Razorpay Engg here. One thing that was critical for us is ensuring we consume all the events published by the clients. We implemented two important things: an Outbox pattern on the publishing side, and an API layer that doesn't write to the DB, since that can become another bottleneck. You can read about the Outbox pattern in another blog post we have written; it is a critical component in scaling a microservices architecture. If you're interested, we can come talk about this on your channel too.
@AsliEngineering 2 years ago
I would love to host you. Although I have never had a guest on my channel, it would be fun to do a deep dive (so long as Razorpay permits) on the design. Let me know, once you are comfortable. You can reach out via LI or Twitter twitter.com/arpit_bhayani www.linkedin.com/in/arpitbhayani/
@y5it056 2 years ago
@AsliEngineering we can do it officially. Someone will reach out
@vyshnavramesh9305 1 year ago
Does the outbox pattern suit the notification system in this video? I understand it suits communicating transactional domain events across microservices, but I can't see how it fits here.
@y5it056 1 year ago
@vyshnavramesh9305 for us, webhook delivery to the merchant's system is a critical part of the payment flow. We need to guarantee at-least-once delivery. Hence, we have to ensure that the payment system's events reach the notification platform. From there, the notification platform ensures at-least-once delivery. You can't believe how many messages get lost over the network at this scale.
@nawabmohdamaan13 2 years ago
Explained in such layman's language … never imagined I could understand such a complex architecture in a span of 15-17 min .. This content is too good to be free… Kudos to you :)
@StingSting844 1 year ago
Stop giving ideas for monetization man😂
@ranganathg 9 months ago
@StingSting844 yeah ;-)
@coincidentIndia 2 years ago
We can have SNS in place of the limiter and integrate it with SQS. For ordering, we can use SNS FIFO and SQS FIFO. Since SNS and SQS are fully managed services, we can somewhat avoid the rate limiter concept. We can apply SNS filtering rules to push events to the respective SQS queues. Along with this, each SQS queue can have its own DLQ so that a worker (AWS Lambda) can check the DLQ and process the messages. This will help reduce the latency and the cron job work.
@aravindh1989 2 years ago
Good content ! Couple of suggestions - 1) Justifying choice of kinesis over SQS (again) for async writes would help since cost is important criteria in the design. 2) Re-usability of the "task prioritization" module across new requests and scheduling failed ones for retries - Does it make sense to move it to a separate microservice/API ?
@ramnewton 2 years ago
Hey Arpit, great video! One question though: At about 7:20, you discuss how read load would increase during peaks. But I don't see how the solution that has been implemented would address this issue. The solution would address write loads since we are making asynchronous writes to the DB. But read loads would still be high, since worker / executors might need info from DB to process the event. Please correct me if I am missing something 🙂 Edit: Is it a secondary effect? By reducing write load using async behaviour, are we freeing up more IOPS bandwidth for reads?
@AsliEngineering 2 years ago
Yes. It is a secondary effect. You free up IOPS to do reads while async workers are doing staggered writes.
@santosh_bhat 2 years ago
What a content man !! Hats off
@SudhanshuShekhar6151 2 years ago
#AsliEngineering is happening here .. no need to go anywhere .. Kudos to your content Man! Thanks!
@charansaigummadavalli7098 2 years ago
I feel the database at the end of the flow can be removed if the information can be reconstructed from the source DBs. Dead-letter queues could be the better option; the scheduler can act on the DLQs.
@prarthdesai3112 2 years ago
Great stuff. Thanks for teaching these things in simple terms 🎉
@LeoLeo-nx5gi 2 years ago
Man hands down this is just so awesome. And you have nailed it!! Thanks a ton
@sudhakarch2611 2 years ago
Excellent Session
@jaigangwani4912 2 years ago
Quality content! The rate limiter to mellow down the spikes from a single user was a good learning. Thanks for putting out these videos. Love the passion with which you explain things.
@kushalkamra3803 2 years ago
Thanks Arpit 🙏
@premparihar 1 year ago
Very well explained, please keep posting such videos ❤
@dhananjay7513 2 years ago
Man, hats off to you. Please don't stop putting up videos like these. You and Gaurav Sen are legends; nowhere on YouTube did I find content similar to yours 🙂🙂 Keep bringing more System Design videos
@koteshwarraomaripudi1080 2 years ago
Instead of using Kinesis, a SQL DB and a scheduler, can we introduce retry SQS queues that the workers pick up?
@sahazeerks2363 2 years ago
Awesome explanation, definitely purchasing your course
@AsliEngineering 2 years ago
Thank you. Looking forward to it ✨
@ambarishbanerjee414 1 year ago
I see some concerns/doubts with this design:
1. SQS queues ensure at-least-once delivery, not exactly-once, right? Hence they must ensure that their notification system handles duplicates, else a customer will get a shock on receiving 2 debit notifications for 1 transaction.
2. If their workers are Lambdas and a huge number of Lambdas are triggered by a huge number of messages in SQS, and these Lambdas do anything else, like calling some service or reading from a DB, I am sure it will throttle that service. How do they handle that, or is it not the case? Once the Lambdas spring up, there is no way for one to know how many others are actively calling a downstream, so some control is needed on the event-source side.
3. Since this entire process is asynchronous, is their API also asynchronous? If so, just curious how they make their public APIs asynchronous: is it pub/sub based, polling based, or a webhook kind of thing? After what time does the client retry if the process fails, or do they ensure 100% delivery?
4. How is this scheduler designed? Is it a cron job that goes over the DB once an hour to check for failures? If so, it introduces a lag in retrying. Why can't they use a NoSQL DB like AWS DynamoDB and use DynamoDB Streams events, which would immediately trigger a Lambda on a failure and send the message for retry? Converting to a trigger-based solution can get rid of the latency.
5. Why a SQL DB just for maintaining event status; why not a NoSQL DB like DynamoDB? Is MySQL serverless, or are they handling the maintenance themselves, which increases the on-call load?
@charan775 1 month ago
We can use Bloom filters to avoid re-sending notifications. This means we can miss some genuine notifications, but it is memory efficient. Generally, payment systems are asynchronous; even Razorpay would receive webhooks asynchronously from its integrated banks and then send webhooks to clients.
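The duplicate-notification concern raised above can be sketched as an idempotent consumer. This is only an illustration under assumptions: each event is assumed to carry a unique `id` field, `send` is a hypothetical delivery callable, and the in-memory set stands in for a durable store. Unlike a Bloom filter, an exact set never reports a false "already seen", so it never suppresses a genuine notification; the cost is more memory.

```python
class IdempotentNotifier:
    """Drops redeliveries by remembering the ids of events already handled."""

    def __init__(self, send):
        self._send = send   # hypothetical delivery callable (email, webhook, ...)
        self._seen = set()  # assumption: a real system would use a durable store with expiry

    def handle(self, event):
        event_id = event["id"]
        if event_id in self._seen:
            return False    # duplicate delivered by an at-least-once queue
        self._send(event)
        # Record the id only after a successful send, so a failed send is retried.
        self._seen.add(event_id)
        return True
```

A redelivered event returns `False` and triggers no second send, so the customer sees one notification per transaction even when the queue delivers the message twice.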
@HSBTechYT 1 year ago
I have a question. Why have MySQL in the new solution? Can't Kinesis plug directly into the scheduler? (Or is it that the scheduler is persisting the jobs so that if the server is restarted, it can still reschedule lost events?)
@YT-vx9sz 3 months ago
I think EventBridge would be a preferred choice instead of SQS, because it can read the request body and route messages to the respective consumers based on the service. Plus, you can modify the request body, and failed messages are archived and can be replayed free of cost. Of course, you would trade off throughput, as EventBridge doesn't have the agility of SQS, but you have already dampened that in this architecture by using the rate limiter.
@MohammadImran-if2ou 1 year ago
Great explanation
@ranganathg 3 months ago
I think - adding a message queue reduces the IOPS spikes on the DB from executors but how did this reduce the overall latency of the system? Is it because the executors had lot of retries which caused the delay during the spikes (>1000 TPS) and added the additional lag which is fixed with the queue? but with queue you introduce additional process also right?
@liuzijian5114 7 months ago
Great video and thanks for making it! A quick question on the DB choice: is it necessary to pick SQL for the DB?
@shishirchaurasiya7374 1 year ago
I really loved it
@raiyansarker 1 year ago
Why not just acknowledge the message after a successful notification response, so you don't have to worry about writing to the database? If that fails, it can be queued again. Databases are very hard to scale, but queueing services like Kafka are highly distributed and scalable.
@satyamshubham6676 1 year ago
I think for audit purposes, but that definitely doesn't need to be synchronous. Only the failed ones can be synchronous.
@sumitmore4680 2 years ago
Thanks for the awesome video. I think sending mail and recording it in the DB is a sort of distributed transaction (there are patterns like the outbox pattern which can solve this problem), and hence it might play a role in the scaling strategy of the system.
@hc90919 1 year ago
Do you have any resources for the outbox pattern?
@tharun8164 2 years ago
I see we could replace sqs with event bus like Kafka itself that way it can be used for persistence also. I don't see the necessity to store it on message queue and again on event bus. Thoughts?
@AsliEngineering 2 years ago
Yes. Even SQS persists. For this use case, Kafka/SQS would have given similar performance, but Kafka would be costlier.
@charan775 1 month ago
Why write to MySQL? Why couldn't the scheduler read from Kinesis to retry?
@aravind1129 3 months ago
Awesome explanation, thanks. Recently I connected to a VPN on my mobile. When I randomly opened the ICICI and Hotstar (UAE) applications, they detected it and showed an alert that I am using a VPN. I am still surprised how they get to know that I am using a VPN.
@swagatpatra2139 2 years ago
How will the read load be mitigated by async calls? If data reaches the DB slowly/asynchronously, won't the systems dependent on it be slowed down again? What's the workaround here? DB scaling? Sharding?
@ramnewton 2 years ago
Yeah, was wondering the same. I think the solution doesn't use async calls for reading, it uses it only for writing. But that said, I still don't get how read performance would improve during peak. Only reasonable explanation that comes to mind is: Maybe since write load is reducing due to async behaviour, DB might have more IOPS bandwidth for reads. I'm not sure if that explanation is valid though 😅
@kunalsharma1621 2 years ago
Thank you 👏
@adianimesh 2 years ago
at times, I find it hard to keep up with the videos :) can't even fathom how you manage to read, try and share so much cool engineering stuff outside work. 👏
@AsliEngineering 2 years ago
I am realizing this and hence starting next week I am chopping freq to 2 per week :)
@adianimesh 2 years ago
@AsliEngineering much appreciated :) pls do not reduce the frequency any further though. Also, please do a little bit on how to be productive outside work. "A day in the life of a normal curious software engineer", pun intended.
@AsliEngineering 2 years ago
@adianimesh hahaha :) a lot of people have asked for this, but it is very hard for me to record such a video. I just don't want to put out a narcissistic video :D I stay away from anything that holds the potential to distract me :) If by any chance such a video gets big traction, I will be tempted to take that route, and hence I typically avoid it. I hope you understand. But yes, the short one-line answer to this is PASSION. I am extremely passionate about the field and have a huge bias for action.
@adianimesh 2 years ago
@AsliEngineering u are an inspiration for me :) thanks
@pesmobileclips7925 23 days ago
Why can't we just use LSM trees for a write-heavy system here?
@shantanutripathi 2 years ago
But won't the Reads be impacted when we are using Kinesis (asynchronous writes)?
@AsliEngineering 2 years ago
We do not need consistent reads here.
@shantanutripathi 2 years ago
@AsliEngineering Yeah, realized later... but why were they using synchronous writes in the first place? 😅
@AsliEngineering 2 years ago
@shantanutripathi no one thinks about optimization on Day 0. It is all about shipping and getting things done
@vanshikasharma5438 9 months ago
USP of this channel is Short, meaningful content - no hour-long videos.
@AsliEngineering 9 months ago
Thanks Vanshika 🙌
@sahilsiddiqui3210 6 months ago
why mysql ??
@rjphotos2393 2 years ago
It's a very common architecture
@yakshitjain5374 2 years ago
Excellent Topic! Slightly off-topic question! Which tool do you use to record/edit videos! Is it Loom?
@AsliEngineering 2 years ago
OBS
@satyamshubham6676 1 year ago
What about the latency you're introducing in the system due to Kinesis?
@AsliEngineering 1 year ago
Why is that a problem?
@randomstuffsofdayandnight 8 months ago
Video starts @2:38
@LawsOfNature108 2 years ago
What tool do you use for drawing the architecture? I need it to present in an interview.
@AsliEngineering 2 years ago
GoodNotes
@raj_kundalia 8 months ago
I know prioritized queues are used here. So is it that P1 would not be consumed until P0 is consumed? What if P1 consumption is in progress and the consumer is blocked for some time; how does the system make sure that P0 gets picked up at that time?
@AsliEngineering 8 months ago
They are all consumed in parallel. Just the number of consumers would vary.
@raj_kundalia 8 months ago
@AsliEngineering but then Kinesis is still a queue, right? How does the system make sure that the important ones go before anything else?
@AsliEngineering 8 months ago
@raj_kundalia Different priorities are different topics.
@raj_kundalia 8 months ago
@AsliEngineering makes sense, thank you for replying. Big fan and learning every day from you :)
@tharun8164 2 years ago
I'm assuming there will be data loss if SQS is unavailable. Something like a CDC pipeline may mitigate this issue, even though CDC is typically used only for data integration. Thoughts?
@AsliEngineering 2 years ago
We could have used CDC, but with extra filters and edge-case handling. Keeping systems simple is important in the real world.
@samyaknayak5731 2 years ago
Bhaiya what are workers that you mentioned here?
@parthpathak5712 2 years ago
Worker will pick up (consume) the multiple events & executor will write the data/message in relational db & obviously push the notification as well.
@nettemsarath3663 1 year ago
Hi, can you please explain how to solve the SES rate-limiting issue at scale? SES rate-limits sending, e.g. to 14 emails/sec. I have a scenario at a startup where I need to send marketing emails to 50k users on Sunday morning.
@AsliEngineering 1 year ago
Talk to Razorpay. It is artificial rate limiting.
@nettemsarath3663 1 year ago
How can I send 50k emails on Sunday morning using SES, given that SES has rate limiting?
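One way to think about the question above is client-side pacing: space the sends so the provider's limit is never exceeded and let the job run as long as the arithmetic requires. At the commenter's figure of 14 emails/sec, 50,000 emails take about 50,000 / 14 ≈ 3,571 seconds, roughly an hour. The sketch below is an assumption-laden illustration: `send_email` is a hypothetical callable (no real SES call is made), and the actual quota is account-specific, so the rate is a parameter.

```python
import time


class Pacer:
    """Spaces calls evenly so that at most `rate` calls happen per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()  # earliest time the next call may proceed

    def wait(self):
        now = self._clock()
        if now < self._next_slot:
            self._sleep(self._next_slot - now)
        # Reserve the following slot one interval after the one just used.
        self._next_slot = max(now, self._next_slot) + self._interval


def send_all(recipients, send_email, pacer):
    """Send to every recipient, pausing as needed to respect the pacer."""
    count = 0
    for recipient in recipients:
        pacer.wait()
        send_email(recipient)  # hypothetical: would wrap the real provider call
        count += 1
    return count
```

Usage would look like `send_all(users, send_email, Pacer(rate=14))`, started early enough on Sunday morning to finish within the hour the arithmetic predicts.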
@rewantkedia4232 2 years ago
How would IOPS to MySQL reduce by introducing Kinesis? If the write rate to MySQL is lower than the produce rate to Kinesis, wouldn't this choke Kinesis?
@AsliEngineering 2 years ago
Staggered consumption
@hc90919 1 year ago
@asliengineering - aren't there consumers on Kinesis that actually write to the DB? How is Kinesis able to write to the DB directly? Are there going to be any Lambda function triggers?
@AsliEngineering 1 year ago
There are consumers consuming and writing to db.
@hc90919 1 year ago
@AsliEngineering - Got it. Are the consumers going to be services on physical servers, or are they cron job programs running every night?
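The "staggered consumption" answer in this thread can be illustrated with a small simulation. It is only a sketch: a `deque` stands in for the stream, a tick stands in for a second, and the numbers are made up. What it shows is that a burst of arrivals lands in the buffer, while the consumer writes to the DB at a capped rate, so the DB sees a flat write load instead of the spike.

```python
from collections import deque


def simulate(arrivals_per_tick, max_writes_per_tick):
    """Return (DB writes performed in each tick, backlog left at the end)."""
    buffer = deque()
    writes = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Producer side: events are appended to the stream as fast as they arrive.
        buffer.extend((tick, i) for i in range(arrivals))
        # Consumer side: drain at most `max_writes_per_tick` events per tick.
        batch = min(len(buffer), max_writes_per_tick)
        for _ in range(batch):
            buffer.popleft()  # stands in for one DB write
        writes.append(batch)
    return writes, len(buffer)


writes, backlog = simulate([100, 2000, 100, 0, 0, 0], max_writes_per_tick=500)
print(writes, backlog)  # [100, 500, 500, 500, 500, 100] 0
```

The spike of 2,000 events never reaches the DB as 2,000 writes in one tick; it is spread over the following ticks. The buffer only becomes a problem if arrivals outpace the drain rate for longer than the stream retains data.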
@nettemsarath3663 1 year ago
How can a single consumer consume events from 3 SQS queues?
@AsliEngineering 1 year ago
Multi-threading.
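A minimal sketch of the multi-threading answer, under assumptions: in-process `queue.Queue` objects stand in for the three SQS queues, and the queue names and thread counts are illustrative. One consumer process starts worker threads per queue, and giving a higher-priority queue more threads matches the earlier reply that all priorities are consumed in parallel with a varying number of consumers.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker thread to exit


def _worker(name, q, handle):
    while True:
        item = q.get()  # blocks until an item is available
        if item is _STOP:
            return
        handle(name, item)


def start_consumer(queues, threads_per_queue, handle):
    """Start worker threads for every queue; return a function that stops them."""
    workers = []
    for name, q in queues.items():
        for _ in range(threads_per_queue[name]):
            t = threading.Thread(target=_worker, args=(name, q, handle))
            t.start()
            workers.append((t, q))

    def shutdown():
        for _, q in workers:
            q.put(_STOP)  # one sentinel per worker, queued behind pending items
        for t, _ in workers:
            t.join()

    return shutdown
```

Because each sentinel is queued behind the pending items, `shutdown()` returns only after every item already in the queues has been handled.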