Cache Invalidation using SNS + SQS at Atlassian and then they moved away!

17,813 views

Arpit Bhayani

1 day ago

Comments: 67
@mohitlaheri6037 11 days ago
Please continue sharing videos and insights! 😊 Just sharing my thoughts here.
First approach: we could host a single SNS topic at TCS and have all clients subscribe to it; I don't see the need to create a separate SNS topic per client. Additionally, using a global cache instead of local ones would eliminate the need for multiple SQS queues, though accessing a global cache may introduce network latency.
Second approach: this may result in delayed cache invalidation due to batching at TCS and periodic polling from the client side. If the system can tolerate that delay, the approach is viable; the first approach, however, offers near real-time invalidation.
Both approaches have their pros and cons, and the optimal choice depends on the specific requirements of the system.
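As a rough illustration of the single-topic idea, here is a minimal boto3 sketch (the topic name, queue ARNs, and region are made up, and the SQS access policies that SNS delivery requires are omitted): TCS publishes once, and every client subscribes its own endpoint to the same topic.

```python
import boto3

sns = boto3.client("sns", region_name="us-east-1")

# One shared topic owned by TCS (hypothetical name).
topic_arn = sns.create_topic(Name="tcs-cache-invalidation")["TopicArn"]

# Each client subscribes its own endpoint to the same topic; SQS queues here,
# but it could equally be an HTTPS endpoint or a shared/global cache service.
client_queue_arns = [
    "arn:aws:sqs:us-east-1:123456789012:client-a-invalidation",
    "arn:aws:sqs:us-east-1:123456789012:client-b-invalidation",
]
for queue_arn in client_queue_arns:
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)

# TCS publishes a single message and SNS fans it out to every subscriber.
sns.publish(TopicArn=topic_arn, Message='{"entity": "issue-123", "action": "invalidate"}')
```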
@johnclukosereloaded 14 days ago
As an MVP, SNS + SQS can be built faster than batching + S3 + long polling. Hence the first approach... I guess.
@prasad_rathod_py 14 days ago
Hey Arpit, thanks for the video. At the end of the day, all these decisions revolve around our SLAs and cost appetite. We use the first (push-based SNS-SQS) architecture for our in-memory cache invalidation, as we have a very tight SLA for cache invalidation. For medium-level workloads, this architecture works fine.
@anmolthind9646 11 days ago
Are you a developer at Atlassian?
@prasad_rathod_py 10 days ago
@anmolthind9646 No. Sorry if my response sounded that way; I was talking about the cache invalidation use case in my team.
@rahulkulkarni1780 13 days ago
Would be interesting to observe the latency difference between the two approaches, assuming the S3 approach is at parity with SNS and SQS. Also very odd that they chose an SQS queue per compute node instead of per logical service.
@nikiteshsoneji4461 12 days ago
Because locally cached data won't be invalidated by a logical service; you need a queue per pod to do that.
@UnverifiedMods 13 days ago
This seems like a perfect use case for Kafka without a consumer group. Simply have a single Kafka topic with a single partition, have all pods assign to that partition, and do not commit offsets. Every time a pod starts up, have it begin consumption at the latest offset.
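A minimal sketch of that idea with kafka-python (the broker address, topic name, and print-based handler are assumptions): the pod assigns itself to the single partition directly, never joins a consumer group or commits offsets, and seeks to the latest offset at startup.

```python
from kafka import KafkaConsumer, TopicPartition

# Hypothetical broker and topic names.
consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",
    enable_auto_commit=False,  # never commit offsets
    # no group_id: the pod does not participate in any consumer group
)

tp = TopicPartition("cache-invalidation", 0)
consumer.assign([tp])      # manual partition assignment instead of subscribe()
consumer.seek_to_end(tp)   # start from the latest offset on pod startup

for record in consumer:
    # record.value is the raw invalidation event (bytes); parse it and evict
    # the matching entries from the pod's local cache here.
    print("invalidate:", record.value)
```

Starting at the latest offset means a restarted pod skips events published while it was down, which seems acceptable here since a fresh pod begins with an empty local cache anyway.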
@senpaioppai2662 13 days ago
Sounds like a good solution to me. There should also be a performance advantage to Kafka vs S3.
@rohitatiit 9 days ago
I differ with this opinion: if there are 1000 pods, we would need 1000 consumer groups. Further, the volume of cache invalidations would be extremely low, something like 1 MB per 10 minutes. While we can put everything in one partition (since order matters) because of the low volume, having 1000 threads reading the same partition would be another level of Kafka server congestion. We would be forced to scale up drastically with very low utilization.
@UnverifiedMods 9 days ago
@rohitatiit There is no need to commit offsets, so there would be no consumer groups; scaling would be easy and cheap.
@imhiteshgarg 9 days ago
@rohitatiit I doubt that. Given how widely Kafka is used for these use cases, the server-congestion concern has likely already been optimised for; this seems like a very common use case to me.
@aayush5474 13 days ago
Why not use Kafka or Kinesis streams?
@adilsheikh9916 7 hours ago
Before cache invalidation, I think the Atlassian people first needed to decide what to cache. If something changes too frequently, why cache it at all?
@saurabhjagtap 14 days ago
Even with scalability and availability ensured, they will still serve stale data because of the time spent batching the messages plus the eventual PULL from S3 (a pod keeps serving stale data until it pulls the update). I guess this is the trade-off they accepted for their use case. You get some, you lose some.
@MrZiyak99 14 days ago
I am not super sure about their new solution; their old solution looks good to me. The scalable version of the old solution is CDC + Kinesis. The batching should happen on the consumer side, IMO. I don't even know how the ingestion service can guarantee a write to both S3 and Dynamo without some sort of two-phase commit.
@poshakajay 13 days ago
How would the consumer do the batching when it has already received the events? It's the producer that controls the rate at which events are sent.
@Sibearian_ 14 days ago
You forgot about the off-by-one error.
@noobgameplay2720 13 days ago
Hello sir, I didn't understand the approach. If we are just using S3, why can't we maintain a hash for every topic in a DB / centralized server? Whenever a new batched write happens, we update the hash in the DB, and the sidecar can long-poll it; if the hash has changed, invalidate the cache and request the new data. What advantage does S3 provide? Would appreciate a reply from anyone.
@MrKar18 13 days ago
The devil is in the details. Your proposed solution would involve a DB per client, or another per-app DB. You would also need to run another batch job later to clean up stale hash entries. Provisioning a new DB just for hashes is not preferred either.
@sanjivsingh091 14 days ago
My naive question on Solution 2: why can't the sidecar directly PULL from the DB (the single source of truth) and invalidate the cache? What advantage does S3 bring?
@shubhampnp 8 days ago
There are multiple downsides I can see: 1) Mixed responsibilities: you are updating the DB while consuming from it in parallel, and multiple consumers put a heavy load on the DB. 2) The number of DB connections is a limitation. 3) If tomorrow you need to switch to a more advanced database, the consumers are tightly coupled to what you have; placing S3 in between decouples the two sides and brings flexibility to the whole design. @arpit can correct me.
@RaviSingh-lc1tc 14 days ago
I am curious: why cache in a sidecar? (What made them choose this approach?)
@mahendrahegde6400 14 days ago
1. The data to be cached is reasonably small, so it doesn't make sense to spin up a Redis server. 2. Faster access to the data (same-node access rather than network access).
@thefirstone2010 1 day ago
I have a few questions about the second approach: how will clients know which invalidation message is for them? Are we going to create a bucket for each client? Once there is a sizable clientele, will segregating messages per client cause delays?
@AsliEngineering 1 day ago
I have answered this already in another comment.
@SAsquirtle 14 days ago
Why not directly subscribe the sidecar (SC) to the SNS topic, via HTTPS for example?
@johnclukosereloaded 14 days ago
SNS doesn't store messages, so if the sidecar could not process a message in time, the message is lost. SQS waits for an acknowledgment before deleting a message.
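A minimal boto3 sketch of that acknowledgment behaviour (the queue URL and cache hook are hypothetical): a message is only removed from SQS when the consumer explicitly deletes it after processing, so a sidecar that dies mid-processing gets the message redelivered once the visibility timeout expires.

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/sidecar-invalidation"


def invalidate_local_cache(body: str) -> None:
    # Placeholder for the sidecar's actual cache-eviction logic.
    print("invalidate:", body)


while True:
    # Long poll for up to 20 seconds instead of hammering the API.
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        invalidate_local_cache(msg["Body"])
        # Only now is the message acknowledged; a crash before this line means
        # SQS makes it visible again and redelivers it.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```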
@SAsquirtle 14 days ago
@johnclukosereloaded What if we use an in-memory queue to buffer notifications received from SNS and process them when possible? Also, we could use a DLQ to gather messages sent while the SC was down, which could then be polled later.
@royalthomas 14 days ago
@SAsquirtle If the instance goes down, you lose everything in the in-memory queue. You are also unlikely to be able to write to an in-memory queue if the host is unable to process SNS messages in the first place.
@anishshah9762 12 days ago
How does the sidecar know that data has been pushed to S3 and that it needs to pull it? If the sidecar doesn’t receive this information within seconds, we might still display stale data.
@AsliEngineering 12 days ago
They periodically poll the same path/dir on S3.
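A minimal sketch of what that polling could look like with boto3 (bucket name, key, poll interval, and the cache hook are guesses, not Atlassian's actual layout): the sidecar re-checks a fixed key's ETag on a timer and only downloads and applies the batch when the object has changed.

```python
import time
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "tcs-invalidation", "latest/invalidations.json"  # hypothetical path


def apply_invalidations(blob: bytes) -> None:
    # Placeholder: parse the batch and evict the matching local cache entries.
    print("applying", len(blob), "bytes of invalidations")


last_etag = None
while True:
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ETag"] != last_etag:
        # TCS has published a new batch: fetch it and apply it.
        blob = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        apply_invalidations(blob)
        last_etag = head["ETag"]
    time.sleep(1)  # the real poll interval is not specified in the video
```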
@anshumanupadhyaycodes 10 days ago
What about data consistency in this pull-based solution? Since the SC now pulls at an interval, it can hold stale data for some amount of time, can't it?
@AsliEngineering 9 days ago
You are signing up for staleness the moment you start caching things.
@inbha6543 14 days ago
Why isn't a DB read replica enough? Without knowing what the application does, it's difficult to come up with a design.
@tamberlame27 14 days ago
Even if it is a read replica of the DB, the read is still disk I/O, which is a costly operation.
@imhiteshgarg 9 days ago
Not gonna lie, I thought we could replace SNS with a Kafka topic and create multiple consumer groups (one per pod) subscribing to it, so that each consumer group consumes the topic individually. While writing this, I also thought: why not create a single SNS topic and have the clients themselves subscribe to it in the first place? Why do we need separate SQS queues / uploads to S3? There could be issues with both approaches, so I'm open to a healthy and respectful discussion. Cheers!
@MohitBhagwani94 14 days ago
Can you please share the notes link?
@growmoreyt4192 10 days ago
What is the minimum work experience I should have before watching these videos?
@rohitatiit 11 days ago
@ArpitBhayani In the pull approach, how does the sidecar identify which files it needs to pull? Say a sidecar hits network jitter or S3 rate limiting and fails to pull one cache-invalidation file; how will it know it still needs to pull another invalidation file? And why S3, why not Redis, if they are doing a pull? In both cases it's a key-value lookup.
@AsliEngineering 11 days ago
A consistent file path, or a listing of an existing, constant S3 directory.
@rohitatiit 10 days ago
@AsliEngineering Not very clear. Is it something like s3root/invalidation.txt? So every second the micro-batching process puts this file on S3 (if not empty)? Isn't that a huge unnecessary cost (greater than a queue per sidecar)? Secondly, it still doesn't account for a network partition of more than one second: the sidecar may not be able to read invalidation.txt at 00:00:01 but reads it at 00:00:02. If it becomes a listing of an S3 directory, that's twice the cost and the problem is still the same: how does the sidecar keep track of which files it hasn't read yet? I think that's why this is one of the two toughest things in software engineering. Don't they still need ZooKeeper/etcd to store the last-read timestamp?
@adilkhatri7475 14 days ago
In solution 2, if the sidecar periodically reads the blob from S3 and updates its cache, then there is a chance that we might serve stale data to the user, right?
@sarthakbiswas6925 14 days ago
Yeah, but it's being used for Jira tickets, right? So it's not too serious if Jira ticket data is stale for a few seconds or minutes. If it were something mission-critical like bank transactions or a ticket-booking system, this solution might not have worked. So a key takeaway is that every solution is built for a specific circumstance: it works here but won't work in every general case.
@royalthomas 14 days ago
In all of the solutions we pick availability over consistency.
@nayeemrafsan356 6 days ago
The solution didn't make sense to me. Why not have Kafka instead of SQS and SNS? Just push to Kafka and every client can consume from there. Also, if you're pushing only the invalidation information, why not push the latest update of the table directly to Kafka, so that when a client consumes it, it can update its cache accordingly? It would save another API call to the database. But I might not be getting the full picture here, so I could be wrong.
@AsliEngineering 6 days ago
Legacy architecture. There is no one right answer when you are designing systems; I mentioned this in the video, so do not judge the approach. Some design decisions were made considering the team's expertise and context at the time, and the SNS + SQS combo was one of them. Today they would design this system in a totally different way.
@nayeemrafsan356 6 days ago
@AsliEngineering I understand. Thanks for the reply 🙌
@yashgupta6327 6 days ago
@AsliEngineering Would you say this approach is feasible at scale? I personally think it is a good approach.
@abhis1560 12 days ago
Hi, why can't we have a global cache or a CDN instead of the local cache?
@rohankumarpanigrahi7475 10 days ago
CDN invalidation has a much greater lag than application-layer cache invalidation.
@arpitsharma2015 13 days ago
Why are we using multiple SQS queues inside a single TCS client (which has multiple pods)? Can't those pods subscribe to the same SQS queue and poll messages on trigger events, since all the pods are on the same cluster?
@Ruyefa 13 days ago
Multiple clients cannot consume the same SQS message by design.
@msingla135 12 days ago
@Ruyefa What if SQS is replaced by Kafka?
@himanshudewan7238 14 days ago
I think Kafka should be the alternative solution instead of SQS.
@saurabhjagtap 14 days ago
IMO the problem stays the same: cache invalidation needs to be done at the pod level, so you'll have to make sure the message gets consumed by EVERY pod.
@himanshudewan7238 13 days ago
@saurabhjagtap Agreed, each pod needs to consume the message. But then we don't need different topics.
@hemanthaugust7217 13 days ago
Yes, each pod should be a consumer, so each one can maintain its own sequence number in the log/queue and process at its own pace. Kafka is scalable. Now the question is: if they use Kinesis/Kafka, is it in any way cheaper than SNS + SQS? S3 has absolutely no admin cost/effort, but it's not clear how the consumers read the S3 files containing only the modified data without losing any updates, since there can be many files in an S3 directory.
@himanshudewan7238 12 days ago
@hemanthaugust7217 Agreed, and I'm not sure which would be cheaper. But Kafka looks like a pretty decent and scalable solution to me; S3 looks very counter-intuitive for this.
@sor465789 14 days ago
Can anyone explain how one pulls from S3 the same way one pulls from a queue?
@ritikasthana465 13 days ago
S3 offers REST APIs to GET/POST objects.
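For example, with boto3 (the bucket and prefix names are made up), "pulling" just means listing a prefix and GETting whatever objects are there:

```python
import boto3

s3 = boto3.client("s3")

# List the invalidation batches under a prefix, then GET each object.
resp = s3.list_objects_v2(Bucket="tcs-invalidation", Prefix="batches/")
for obj in resp.get("Contents", []):
    data = s3.get_object(Bucket="tcs-invalidation", Key=obj["Key"])["Body"].read()
    print(obj["Key"], len(data), "bytes")
```

Unlike a queue, though, S3 has no acknowledgement or delete-on-read semantics, so the poller itself has to track which keys or ETags it has already processed.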
@42Siren 5 days ago
They just over-complicated things too much... yes, it solved the problem, but the solution is overly complicated, which makes it hard to maintain and debug and adds more points of failure.
@kunalbabbar7399 13 days ago
Can't we use a Kafka cluster instead of a message queue? We could even partition invalidation messages into multiple topics, and each TCS client could subscribe to them based on its needs. Kafka also gives the option to push data in batches, so no extra logic needs to be implemented, and the latency trade-off on the consumer side would also be smaller compared to S3. Are there any pointers I'm missing with this approach?
@premvishwakarma6409 14 days ago
First ..