Please continue sharing videos and insights! 😊 Just sharing my thoughts here:
First Approach: We can host a single SNS topic at TCS and have all clients subscribe to it. I don't see the need to create separate SNS topics for each client. Additionally, using a global cache instead of a local one would eliminate the need for multiple SQS queues. However, accessing a global cache may introduce network latency.
Second Approach: This approach may result in delayed cache invalidation due to batching at TCS and periodic calls from the client side. If the system can tolerate this delay, then this approach is viable. However, the first approach offers near real-time cache invalidation.
Both approaches have their pros and cons, and the optimal choice depends on the specific requirements of the system.
@johnclukosereloaded · 14 days ago
As an MVP, SNS+SQS can be built faster than Batching+S3+Long Polling. Hence the first approach. ... I guess.
@prasad_rathod_py · 14 days ago
Hey Arpit, thanks for the video. At the end of the day, all these decisions revolve around our SLAs and cost appetite. We use the first (push-based SNS-SQS) architecture for our in-memory cache invalidation, as we have a very tight SLA for cache invalidation. For medium-level workloads, this architecture works fine.
@anmolthind9646 · 11 days ago
Are you a developer at Atlassian?
@prasad_rathod_py · 10 days ago
@anmolthind9646 No. Sorry if my response sounded that way. I was talking about the cache invalidation use case in my team.
@rahulkulkarni1780 · 13 days ago
Would be interesting to observe the latency difference between the two approaches, assuming the S3 approach is at parity with SNS and SQS. Also very weird that they chose an SQS queue per compute node instead of per logical service.
@nikiteshsoneji4461 · 12 days ago
Because the locally cached data won't be invalidated by the logical service. You need a per-pod queue to do that.
@UnverifiedMods · 13 days ago
This seems like a perfect use case for Kafka without a consumer group. Simply have a single Kafka topic with a single partition; all pods assign to that partition and do not commit offsets. Every time a pod starts up, have it start consuming from the latest offset.
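A minimal sketch of that idea, assuming a kafka-python client, a single-partition topic named cache-invalidation, and messages whose value is the cache key to drop (all names and the payload format are assumptions, not details from the video):

```python
from kafka import KafkaConsumer, TopicPartition

local_cache = {}  # stand-in for the pod's in-process cache

consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",  # assumed broker address
    group_id=None,                   # no consumer group, offsets are never committed
    enable_auto_commit=False,
)
tp = TopicPartition("cache-invalidation", 0)  # assumed single-partition topic
consumer.assign([tp])       # direct assignment instead of group subscription
consumer.seek_to_end(tp)    # each pod starts from the latest offset on boot

for msg in consumer:
    key = msg.value.decode()      # assumed payload: the cache key to invalidate
    local_cache.pop(key, None)    # drop the stale entry; the next read repopulates it
```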
@senpaioppai2662 · 13 days ago
Sounds like a good solution to me. There should also be some performance advantage to using Kafka vs. S3.
@rohitatiit · 9 days ago
I differ with this opinion: if there are 1000 pods we would need 1000 consumer groups. Further, the volume of cache invalidations would be extremely low, something like 1 MB per 10 minutes. While we could put everything in one partition (since order matters) because of the low volume, having 1000 threads reading the same partition would be another level of Kafka server congestion. We would be forced to scale up drastically with very low utilization.
@UnverifiedMods · 9 days ago
@@rohitatiit There is no need to commit offsets - there would be no consumer groups; scaling would be easy and cheap.
@imhiteshgarg · 9 days ago
@@rohitatiit I doubt that. Given how widely Kafka is used for exactly this kind of workload, it has likely already been optimised for that sort of server congestion; this is a very common use case in my opinion.
@aayush5474 · 13 days ago
Why not use Kafka or Kinesis streams?
@adilsheikh9916 · 7 hours ago
Before cache invalidation, I think the Atlassian folks first needed to understand what to cache. If something changes too frequently, why cache it at all?
@saurabhjagtap · 14 days ago
Even while ensuring scalability and availability, they will still get stale data because of the time spent batching the messages plus the eventual PULL from S3 (a pod keeps serving stale data until it pulls from S3). I guess this is the tradeoff they accepted for their use case. You get some, you lose some.
@MrZiyak99 · 14 days ago
I am not super sure about their new solution; their old solution looks good to me. The scalable version of the old solution is CDC + Kinesis, and the batching should happen on the consumer side IMO. I also don't know how the ingestion service can guarantee a write to both S3 and Dynamo without some sort of two-phase commit.
@poshakajay · 13 days ago
How will the consumer do the batching when it has already received the events? It's the producer that controls the rate at which events are sent.
@Sibearian_ · 14 days ago
You forgot about the off-by-one error.
@noobgameplay2720 · 13 days ago
Hello sir, I didn't understand the approach. If we are just using S3, why can't we maintain a hash for every topic in a DB / centralized server? Whenever a new batched write lands, we update the hash in the DB, and the sidecar can long-poll it; if the hash has changed, invalidate the cache and request fresh data. What advantage does S3 provide? Would appreciate a reply from anyone.
@MrKar18 · 13 days ago
The devil is in the details - your proposed solution would involve a DB per client, or another per-app DB. You'd also need to run another batch job later to clean up stale hash entries. Provisioning a new DB just for hashes isn't preferred either.
@sanjivsingh091 · 14 days ago
My naive question on Solution 2: why can't the sidecar directly PULL from the DB (the single source of truth) and invalidate the cache? What advantage does S3 bring?
@shubhampnp · 8 days ago
There are multiple downsides I can see: 1) Mixed responsibilities - you are updating the DB while multiple consumers read from it in parallel, which puts a heavy load on the DB. 2) The number of DB connections is limited. 3) If tomorrow you need to switch to a more advanced database, the consumers are tightly coupled to what you have today; placing S3 in between decouples the two sides and brings flexibility to the whole design. @arpit can correct me.
@RaviSingh-lc1tc · 14 days ago
I am curious why they cache in a sidecar (what made them choose this approach)?
@mahendrahegde6400 · 14 days ago
1. The data to be cached is reasonably small, so it doesn't make sense to spin up a Redis server. 2. Faster data access (same-node access rather than a network hop).
@thefirstone2010 · 1 day ago
In the second approach I have a few questions: how will clients know which invalidation message is for them? Are we going to create a bucket for each client? Once there is a sizable clientele, will segregating messages per client cause delays?
@AsliEngineering · 1 day ago
I have answered this already in another comment.
@SAsquirtle · 14 days ago
Why not directly subscribe the sidecar (SC) to the SNS topic, via HTTPS for example?
@johnclukosereloaded · 14 days ago
SNS doesn't store messages, so if the sidecar could not process a message in time, the message is lost. SQS will wait for acknowledgment before deleting a message.
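For context, a rough boto3 sketch of what such a per-pod SQS consumption loop could look like (the queue URL, message format, and invalidate_locally helper are assumptions, not Atlassian's actual code). Deleting the message is the acknowledgment; an unprocessed message is simply redelivered after the visibility timeout:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pod-xyz-invalidation"  # hypothetical

def invalidate_locally(event):
    """Hypothetical helper: drop the affected keys from the in-process cache."""
    ...

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long poll instead of busy polling
    )
    for m in resp.get("Messages", []):
        envelope = json.loads(m["Body"])              # SNS wraps the payload in an envelope
        invalidate_locally(json.loads(envelope["Message"]))
        # Delete = acknowledge; until then SQS keeps the message and will
        # redeliver it after the visibility timeout if the sidecar crashes.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
```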
@SAsquirtle · 14 days ago
@@johnclukosereloaded What if we use an in-memory buffer to queue up notifications received from SNS and then process them when possible? Also, we could use a DLQ to gather messages sent while the SC was down, which could then be polled later.
@royalthomas · 14 days ago
@@SAsquirtle If the instance goes down you lose everything in the in-memory queue. You are also unlikely to be able to write to the in-memory queue if the host can't process SNS messages in the first place.
@anishshah9762 · 12 days ago
How does the sidecar know that data has been pushed to S3 and that it needs to pull it? If the sidecar doesn’t receive this information within seconds, we might still display stale data.
@AsliEngineering · 12 days ago
They periodically poll the same path/dir on S3.
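A sketch of what such polling could look like with boto3, assuming a fixed object key and a JSON list of cache keys as the payload (bucket, key, and format are guesses, not confirmed details). Comparing ETags avoids re-downloading an unchanged object:

```python
import json
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "tcs-invalidation", "latest/invalidations.json"  # hypothetical names
local_cache = {}   # stand-in for the pod's in-process cache
last_etag = None

while True:
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ETag"] != last_etag:                   # the batch file changed since the last poll
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        for key in json.loads(body):                # assumed payload: JSON list of cache keys
            local_cache.pop(key, None)
        last_etag = head["ETag"]
    time.sleep(1)                                   # the poll interval bounds the staleness window
```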
@anshumanupadhyaycodes · 10 days ago
What about data consistency in this pull solution? Since the SC now pulls at an interval, it can hold stale data for some amount of time, can't it?
@AsliEngineering · 9 days ago
You are signing up for staleness the moment you start caching things.
@inbha6543 · 14 days ago
Why isn't a DB read replica enough? Without knowing what the application does, it's difficult to come up with a design.
@tamberlame27 · 14 days ago
Even with a read replica of the DB, the read is still disk I/O, which is a costly operation.
@imhiteshgarg · 9 days ago
Not gonna lie, I first thought we could replace SNS with a Kafka topic and create multiple consumer groups (one per pod), so each consumer group subscribes to the topic individually. While writing this, I also thought: why not create a single SNS topic and have the clients themselves subscribe to it in the first place? Why do we need separate SQS queues / uploads to S3? There could be issues in both approaches, so I'm open to a healthy and respectful discussion. Cheers!
@MohitBhagwani94 · 14 days ago
Can you please share the notes link?
@growmoreyt4192 · 10 days ago
What is the minimum work experience I should have to watch these videos?
@rohitatiit · 11 days ago
@ArpitBhayani In the pull approach, how does the sidecar identify which files it needs to pull? Say a sidecar hit network jitter or got rate limited by S3 and missed one cache invalidation file - how will it know it still needs to pull that file? Also, why S3? Why not Redis if they are doing pulls? In both cases it's a key-value lookup.
@AsliEngineering · 11 days ago
A consistent file path, or listing a fixed S3 directory.
@rohitatiit · 10 days ago
@@AsliEngineering Not very clear. Is it something like s3root/invalidation.txt? So every second the micro-batching process puts this file on S3 (if not empty). Isn't that a huge unnecessary cost (greater than a queue per sidecar)? Secondly, it still doesn't account for a network partition of more than 1 second: the sidecar may not be able to read invalidation.txt at 00:00:01 but read it at 00:00:02. If it becomes a listing on an S3 directory, that's twice the cost and the problem is still the same - how will the sidecar keep track of which files it hasn't read yet? I think that's why this is one of the two toughest things in software engineering. Don't they still need ZK/etcd to store the last-read timestamp?
@adilkhatri7475 · 14 days ago
In solution 2, if the sidecar periodically reads the blob from S3 and updates its cache, then there is a chance we might serve stale data to the user, right?
@sarthakbiswas6925 · 14 days ago
Yeah, but it's being used for Jira tickets, right? So it's not too serious if Jira ticket data is stale for a few seconds or minutes. If it were something mission-critical like bank transactions or a ticket-booking system, this solution might not have worked. A key takeaway is that every solution is built for its circumstances: it works here, but it won't work in every situation.
@royalthomas · 14 days ago
In all of the solutions we pick availability over consistency.
@nayeemrafsan356 · 6 days ago
The solution didn't make sense to me. Why not have Kafka instead of SQS and SNS? Just push to Kafka and every client can consume from there. Also, if you're pushing only the invalidation information, why not push the latest update of the row directly to Kafka, so that when a client consumes it, it can update its cache accordingly? That would save another API call to the database. But I might not be getting the full picture here, so I could be wrong.
@AsliEngineering · 6 days ago
Legacy architecture. There is no one right answer when you are designing systems. I mentioned this in the video: do not judge the approach. Some design decisions were made considering the team's expertise and context at the time, and the SNS + SQS combo was one of them. Today they would design this system in a totally different way.
@nayeemrafsan356 · 6 days ago
@AsliEngineering I understand. Thanks for the reply 🙌
@yashgupta6327 · 6 days ago
@@AsliEngineering Would you say this approach is feasible at scale? I personally think it is a good approach.
@abhis1560 · 12 days ago
Hi, why can't we have a global cache or CDN instead of the local cache?
@rohankumarpanigrahi7475 · 10 days ago
CDN invalidation has a much greater lag than application-layer cache invalidation.
@arpitsharma2015 · 13 days ago
Why are we using multiple SQS queues inside a single TCS client (which has multiple pods)? Can't those pods subscribe to the same SQS queue and poll messages on trigger events, since all pods are on the same cluster?
@Ruyefa · 13 days ago
Multiple consumers cannot all receive the same SQS message; by design a message is delivered to one consumer and then deleted.
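Which is why the design fans one SNS topic out to one SQS queue per pod. A rough boto3 sketch of how that per-pod wiring could be set up at startup (topic ARN and names are hypothetical, and the queue access policy that allows SNS to deliver is omitted for brevity):

```python
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:cache-invalidation"  # hypothetical shared topic
pod_name = "tcs-pod-1"                                               # hypothetical pod identity

# Each pod provisions its own queue on startup...
queue_url = sqs.create_queue(QueueName=f"invalidation-{pod_name}")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# ...and subscribes it to the shared topic, so SNS delivers a copy of every
# invalidation message to every pod's queue (fan-out), which each pod then
# consumes and acknowledges independently.
sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=queue_arn)
```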
@msingla135 · 12 days ago
@@Ruyefa What if SQS is replaced by Kafka?
@himanshudewan7238 · 14 days ago
I think Kafka could be an alternative solution instead of SQS.
@saurabhjagtap · 14 days ago
IMO the problem stays the same - cache invalidation needs to happen at the pod level, so you'll have to make sure the message gets consumed by EVERY pod.
@himanshudewan7238 · 13 days ago
@@saurabhjagtap Agreed, each pod needs to consume the message. But then we don't need different topics.
@hemanthaugust7217 · 13 days ago
Yes, each pod should be a consumer, so each one can maintain its own sequence number in the log/queue and process at its own pace. Kafka is scalable. Now the question is: if they use Kinesis/Kafka, is it any cheaper than SNS + SQS? S3 has absolutely no admin cost/effort, but it's not clear how the consumers read the S3 files containing only the modified data without losing any updates, since there can be many files in an S3 dir.
@himanshudewan7238 · 12 days ago
@@hemanthaugust7217 Agreed, not sure which will be cheaper. But Kafka looks like a pretty decent and scalable solution to me; S3 looks very counter-intuitive for this.
@sor465789 · 14 days ago
Can anyone explain how one pulls from S3 the same way they pull from a queue?
@ritikasthana465 · 13 days ago
S3 offers REST APIs to GET/PUT objects, so the sidecar can simply fetch the object over HTTP.
@42Siren · 5 days ago
They just overcomplicated things too much... yes, it solved the problem, but the solution is overly complicated, which makes it harder to maintain, adds more points of failure, and makes it harder to debug.
@kunalbabbar7399 · 13 days ago
Can't we use a Kafka cluster instead of a message queue? We could even partition invalidation messages into multiple topics, and each TCS client could subscribe based on its needs. Kafka also offers the option to push data in batches, so no extra logic needs to be implemented, and the latency tradeoff on the consumer side would be smaller compared to S3. Are there any pointers I'm missing in this approach?