This is one of the best episodes in the series. Effective use of the serverless model. 👍
@praveensripati (4 years ago)
A major chunk of the work is done by EMR, and the state of the system is stored in RDS, which is also very critical; both are not serverless. BTW, really enjoyed the session.
@opherdubrovsky4175 (4 years ago)
@@praveensripati Stay tuned for our next update. By now we are running Spark in serverless mode as well (it's our own architecture, not using EMR, and it scales up and down on a dime) - we are going to talk about it at re:Invent (free virtual conference this year). Search for the talk by Boaz and me when the agenda goes public. So the only piece that is still not serverless is the DB, but we don't have an immediate blocker with it that will cause us to try hard to find an alternative.
@praveensripati (4 years ago)
@@opherdubrovsky4175 Thanks for the response. Just curious - what's the stack: K8s, AWS Glue, or Knative?
@trailerhaul8200 (2 years ago)
@@praveensripati EMR now has a serverless model
@gsb22 (4 years ago)
The guy asking the questions is a legend. He was asking the very same questions that were coming to my mind. If it weren't for his bad-ass questions, this video would have been an average video. Thanks guys. Loved this video.
@ivanguerra1260 (4 years ago)
Yeah, and his skill to simplify.
@prakashpoudele (4 years ago)
Implementing rate limiting so that you don't hurt your partners is next-level badass!
@VishalVyas30 (4 years ago)
Very detailed architecture walkthrough, Opher and the entire Nielsen team. This will definitely help millions of AWS customers facing a similar situation.
@michaelstaub22 (4 years ago)
Wait so the entire internet surveillance economy is powered by SQS, S3, RDS, and Lambda? Incredible. It literally could not be more simple.
@sb9377 (4 years ago)
And EMR!
@NiccoloHamlin (4 years ago)
Of course not. They have competitors.
@hellohelloronin (4 years ago)
Spark on EMR being the core component. The video title is so misleading.
@itaiyaffe (4 years ago)
@@hellohelloronin Actually, the title is not misleading. Spark on EMR is only one component, and not even the most significant one.
@hellohelloronin (4 years ago)
@@itaiyaffe Why do you think Spark on EMR is not a significant component?
@bobhaffner5902 (4 years ago)
One of the best This Is My Architecture episodes to date. Great job, gents
@tanveeriqbal6680 (2 years ago)
It is amazing to see the power of Lambda functions.
@rachellejanssen2655 (4 years ago)
when your system is such a badass that your partners' servers think it's a DDoS, that's epic
@manzilkiit (4 years ago)
When your system is so scalable, that you end up DDoS-ing your own services xD
@andrewevans5750 (4 years ago)
I mean, some systems come down easy. I once had a job aggregating and cleaning public data. I took down the MT DMV and public data servers for a good 5 days because I forgot a 0 in my timeout. I only had about 3 servers, and that was back when running a scaled system meant working with Akka and Scala. Needless to say, that and a growing relationship with a co-worker ended that gig. The only reason it was a layoff was that my boss was confused. He tried to claim the open source I was using was his, he got a nice cease and desist, and listed me as a layoff with a payout instead of a termination. I make more with AWS anyway now.
@cloud_c5222 (2 years ago)
@@andrewevans5750 Thanks for sharing. Which domain in AWS do you work in: Solutions Architect, Security Engineer, Networking, etc.?
@abrarcalculas (4 years ago)
I can't even imagine how this system can handle 250 BN events a day. The architecture looks very elegant and extremely optimized. The most interesting takeaway for me was how they rate-limited the system and were able to reduce the cost per BN events by simply tuning the Lambda configurations. Excellent insights. Loved this.
@opherdubrovsky4175 (4 years ago)
Check out our talk at the re:Invent conference - free and online this year. We explain the idea behind how we've implemented the rate limiting mechanism. This is the name of the talk - Managing the serverless challenge: Running big data on AWS Lambda - ARC310
@opherdubrovsky4175 (4 years ago)
The agenda is up - here are the dates/times for the talk (GMT times). It will be broadcast 3 times.
@abrarcalculas (4 years ago)
@@opherdubrovsky4175 Thanks for the heads up. Registered and added the time for the talk to my calendar. Eagerly awaiting.
@pemessh (2 years ago)
Thank god Lambda Power Tuning exists. We can now figure out which memory configuration to use without much effort. No writing simulators and things. Yay!!!
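A minimal sketch of driving the open-source Lambda Power Tuning state machine (a Step Functions app); the ARNs and payload below are placeholders, and the input fields should be checked against the tool's documentation:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Input shape for the power-tuning state machine; values are illustrative.
tuning_input = {
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-delivery-fn",
    "powerValues": [128, 256, 512, 1024, 1536, 3008],  # memory sizes to test
    "num": 50,             # invocations per memory size
    "payload": {"test": True},
    "strategy": "cost",    # optimize for cost rather than speed
}

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:powerTuning",
    input=json.dumps(tuning_input),
)
print(execution["executionArn"])
```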
@salessiteboost7665 (4 years ago)
I will be moving to Lambda/serverless for my next projects. I can write Python or Node functions within Lambda and scale very easily. Awesome!
@jaganmayeesahoo4210 (4 years ago)
Thanks for the overview, and I appreciate your effort to come up here and share the knowledge. Could you please upload the videos for each module and the challenges?
@opherdubrovsky4175 (4 years ago)
There is a talk I gave about some of the challenges at the DevTalks conference a few months ago. You can check it out here - kzbin.info/www/bejne/e5vMnHadZrZqpqc We'll also be doing another talk about the system at re:Invent later this year.
@YGL53 (4 years ago)
It must be very satisfying to write on that blackboard with a white marker :D Thanks for the content though :)
@arrjay3814 (3 years ago)
You should consider using RDS Proxy to help overcome the connection limits.
@tomstravelingadventures (4 years ago)
Love watching these architecture videos
@salessiteboost7665 (4 years ago)
This is what an elegant solution looks like. Well done guys. :)
@AnGELsPearhead (4 years ago)
Amazing walkthrough and well thought out architecture. Really enjoyed watching it.
@michailxirouchakis8325 (4 years ago)
Thanks for sharing this :) Regarding the problem of "writing back to the database" (5:57), have you used the new RDS Proxy solution, and if yes, what is your feedback?
@opherdubrovsky4175 (4 years ago)
We just deployed it to 2 of our smaller regions last week. It looks like a really good solution, but I'll know more next week once we deploy in the US. One thing to know is that since it is a new feature, it is currently not available in all availability zones. We ran into this while implementing. But I guess this is a short-term issue that will get resolved in the coming weeks.
@egomezr (4 years ago)
So funny: your system works so well and follows such a nice, scalable approach that your partners think you're doing something wrong. Amazing! :-o
@anandsunderraman8643 (4 years ago)
Very nice presentation. Makes me ask more questions. What is the runtime of your Lambda? How do you handle DLQ scenarios? You mentioned running out of DB connections and hence piping the data via SQS. How did piping the data through SQS actually limit the number of DB connections? When you say data per second, how do you measure it? Did you run into memory issues with the Lambda, and if yes, how did you overcome them?
@opherdubrovsky4175 (4 years ago)
See the comment about DB locks in reply to "uncle Tetsu". The SQS helped since, instead of having thousands of Lambdas each updating 1 row in the DB, they can send their message to an SQS queue, and we have an "update Lambda" that reads the SQS and updates the DB. So instead of having thousands of Lambdas update the DB, you have just a few copies of that update Lambda updating it. That reduces the connections to the DB. That said - AWS came out with RDS Proxy lately, which gives you all that stuff out of the box. I suggest you check it out. We just deployed it and it is sweet. Memory issues - yes, at first. We did a lot of profiling to figure out why and which libraries are not really needed. Lambda runtime - our average is 10 sec with a 30 sec std dev. That means that some runs could end up being 1-2 min long, but most are short.
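A minimal sketch of the fan-in pattern described above, assuming a simple tasks table and message shape (neither is Nielsen's actual schema): thousands of workers write "file done" messages to SQS, and a handful of copies of this Lambda drain the queue in batches, so only a few DB connections are ever open.

```python
import json
import os
import psycopg2  # assumed to be bundled with the deployment package

# One connection per container, reused across warm invocations.
conn = psycopg2.connect(os.environ["DB_DSN"])

def handler(event, context):
    # Each SQS-triggered invocation receives a batch of status messages.
    updates = [json.loads(record["body"]) for record in event["Records"]]
    with conn.cursor() as cur:
        cur.executemany(
            "UPDATE tasks SET status = %(status)s, updated_at = now() "
            "WHERE file_id = %(file_id)s",
            updates,
        )
    conn.commit()
```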
@ABK6969 (2 years ago)
@@opherdubrovsky4175 Thank you for the detailed reply. Is the Lambda processing using Python, Go or something else?
@Rachelebanham (4 years ago)
Thanks #AWS, this was a really great short use case.
@keent (2 years ago)
$4.25/billion events is freakin impressive any way you slice it. Without platforms like AWS/Google Cloud, there would be no innovation in the tech world.
@upuldi (2 years ago)
1000 dollars per day is extremely cheap compared to the massive load it handles. Our Splunk cluster is costing more than that. Very good use of technology.
@petersavnik6777 (4 years ago)
We love this blackboard and the colors you use in these videos. Can you share some details on which markers and blackboard you are using?
@bziniman (4 years ago)
We use these markers - www.amazon.com/Chalk-Markers-Chalkboard-VersaChalk-Reversible/dp/B00LOTZRUE/.
@desi97244 (4 years ago)
Very nice explanation. Wonder if any alternatives were considered to the RDS and why this was chosen.
@AnkitK-1 (4 years ago)
Would love to know about those kinds of decisions too.
@opherdubrovsky4175 (4 years ago)
See comments above (just search for the text) - this one: "the DB was originally an afterthought. we were mainly focused..." and this one: "The DB is the queue!! It's an intelligent queue..."
@odwa.m (4 years ago)
@@opherdubrovsky4175 I'm wondering if it wouldn't be better to use DynamoDB instead of RDS, since the way you are using the database is more in a NoSQL fashion. Plus, DynamoDB will not hit a performance bottleneck in the future, which is something that might affect your RDS instances (at least in terms of write capacity).
@LonliLokli (4 years ago)
I think it's a pleasure to work there, you know. Challenging and interesting tasks.
@rahulnath9655 (4 years ago)
Great video! Answered so many questions for me.
@tamerelfeky1721 (4 years ago)
Rich, quick case study. Good job.
@odwa.m (4 years ago)
I see a lot of supporting services which are probably used in the background but are not explicitly mentioned. This includes CloudWatch, which is probably monitoring the Lambda functions and databases. CloudWatch can also be used with CloudWatch Insights or Contributor Insights to help with the optimization of Lambda functions. SNS can be used to launch system tickets for issues.
@opherdubrovsky4175 (4 years ago)
That's correct. Due to the short amount of time for the video, we had to keep it at a higher level. We do use SNS in one part of the system, we use CloudWatch for monitoring and for autoscaling decisions on the Spark side of the system, and we recently added RDS Proxy to the system and also EKS. Come listen to my talk at re:Invent 2020 and you'll hear more about it. This year it is online and free - so no reason not to register. reinvent.awsevents.com/
@opherdubrovsky4175 (4 years ago)
Another relevant talk I am giving at Data+AI Summit in November (online and free to attend) is: databricks.com/session_eu20/scale-out-using-spark-in-serverless-herd-mode
@ced4298 (4 years ago)
That’s all very impressive!
@jujharsingh8128 (4 years ago)
Love these videos. Keep them coming
@DerekMurawsky (2 years ago)
Great episode, but the audio seems really low. Just me? Other videos are louder.
@yahtadi5152 (4 years ago)
Holy moly, this is super good. I should consider switching
@nicolasrecalde8561 (4 years ago)
Use Kinesis streams instead of SQS. This change makes a BIG difference in the bill, and the performance is much better!
@opherdubrovsky4175 (4 years ago)
Our SQS is about $10/day. That's 1% of the system cost. We are more focused on saving costs on the parts that make up a larger chunk of the system cost.
@tomerdubrovsky4073 (4 years ago)
Thanks for a great presentation!! Looking good! Well done.
@RaymondChenon (4 years ago)
Cost: 1000 USD/day at 6:26 for such a big system is cheap. Does it include everything on the blackboard (Lambda, SQS, RDS, S3, EMR) and outbound bandwidth?
@markusgulden4068 (4 years ago)
Yes, I would also be interested in the cost distribution over the different services (obviously, Lambda seems to be the biggest portion). Great talk, by the way!
@opherdubrovsky4175 (4 years ago)
@@markusgulden4068 Here is a breakdown of a typical day in our biggest region (US). The other regions are much smaller and make up the rest:
Lambda: $481.62
EC2 instances: $133.16
S3: $90.72
Relational Database Service: $47.80
CloudWatch: $0.17
SQS: $9.14
EC2 other: $3.20
Total cost: $765.80
@markusgulden4068 (4 years ago)
@@opherdubrovsky4175 Thanks for sharing this
@lordpablo1985 (2 years ago)
What are the work managers exactly? I'm interested to know more about the fan-out scaling method.
@LudgerPeters (4 years ago)
Thanks for a very informative video. A couple of questions come to mind. How do you manage Lambda failures and retries for all your other Lambdas, not just the delivery Lambda with the SQS queue? Also, with regards to costs, if you are running a large number of Lambdas, would switching to Fargate instances and scaling them out not be cheaper, since you would be able to scale out based on SQS queue depth? When running tests with our data processing, we ended up going with Fargate because of cold starts, and because Fargate is able to cache information between invocations since it is more long-running. How do you invoke the Lambda from the RDS instance to decide which files need to be distributed? And finally, do you have any locking anywhere? Do you lock at the DB level? Thanks.
@opherdubrovsky4175 (4 years ago)
We manage Lambda failures by having our system automatically free tasks in the DB that have been in "processing" status for longer than the Lambda timeout period (we set it to 5 min). If the task does not change to "Done" status, we have another Lambda that periodically checks for stuck tasks and changes the status back to "pending" + increments a reprocess counter. We currently try 3 times before we drop the task. We have other systems that use the dead letter queue to process failed tasks, but not in this system. Fargate could potentially be cheaper, but you have to manage your containers very carefully, since if you do not constantly keep them at 100% capacity and kill them immediately as they complete, you end up paying a lot for the aggregation of all that idle time. That is not easy to do, and Lambdas do that for you automatically with no hassle at all. We tried something similar with OpenFaaS but had a hard time keeping the cluster fully loaded all the time, so even though on paper it was cheaper, in practice we had a hard time getting it to be cheaper. We might go down the path of trying to do this with EKS pods, but I prefer that AWS comes up with a business offering for large-load systems over Lambdas, as the architecture is perfect; it's just the cost side that needs a better offering. Something like being able to pay for reserved capacity of Lambdas and get a cheaper price - like they do with EC2 systems. I think eventually they will realize they need to come up with an offering for large systems, as their current offering is focused on small-scale systems built on Lambdas. Regarding SQS - yes, that's a good way to do it. In fact, we have migrated our Spark over EMR to many Spark pods running on EKS and are using SQS for it just as you described. Regarding cold starts - we don't have any problem with that, as the containers remain available for a long period of time even when not used. So in any system that has constant traffic, this is not a problem. We even measured the cold start times and saw that even containers that are not invoked for a long time (sometimes even up to an hour) can remain available, so that the next time you call them they already have it all cached. DB locks - see the other comment I wrote about it here, in the recent comment "uncle Tetsu" asked (just search for it).
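A sketch of the stuck-task watchdog described above, as a scheduled Lambda; the table, columns, and thresholds are assumptions based on the description (5-minute timeout, 3 attempts):

```python
import os
import psycopg2

RELEASE_STUCK = """
    UPDATE tasks
       SET status = 'pending', retries = retries + 1
     WHERE status = 'processing'
       AND started_at < now() - interval '5 minutes'  -- past the Lambda timeout
       AND retries < 3                                -- drop after 3 attempts
"""

def handler(event, context):
    # psycopg2's connection context manager commits the transaction on success.
    with psycopg2.connect(os.environ["DB_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(RELEASE_STUCK)
            print(f"released {cur.rowcount} stuck tasks")
```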
@mrkandreev (4 years ago)
Thanks. That is exciting work!
@DanishAnsari-hw7so (2 years ago)
Can I practice this architecture on AWS by using a sample dataset? Any resource links for the same?
@josuelim4 (4 years ago)
That's great! Have you considered using DynamoDB instead of RDS?
@oll236 (4 years ago)
I think because of speed, structured data, and know-how, it's a better idea to use RDS.
@josuelim4 (4 years ago)
Franco Peña I'm not sure. DynamoDB can be much faster, cheaper, and easier to scale and maintain in many scenarios (as compared to RDS), especially when it's being primarily queried from Lambdas. The structure of their data and the query complexity may be a good point though. Would be interesting to know more.
@opherdubrovsky4175 (4 years ago)
@@josuelim4 The DB was originally an afterthought. We were mainly focused on solving the main problems, so we started with a DB that we knew we could get some runway with and thought we'd reconsider later. However, it's been working very well, and the fact that it is a relational DB has become a big advantage. We are running all kinds of complex control queries on it, and are able to easily analyze trends like costs, data size, loads, etc. Due to the fact that it is easy to run analytical queries, we use it for almost any research we need to do on the system when trying to improve it - like which flows are more expensive, where the bottlenecks are, which areas we should tackle next, etc. We also use it for the rate limiting feature we have - as rate limiting is not about a single file, but about aggregates - so you have to keep doing those aggregates to be able to make rate limiting decisions.
@tiktak2234 (4 years ago)
DynamoDB can be very restrictive when you do not have very strict and defined access patterns. DynamoDB can be great for many scenarios but probably does not fit their use case
@christophercaldwell6641 (4 years ago)
I ran into this exact scenario. An SQL database was the better choice for our team because of the inability to know how the data would be accessed. Some people wanted to run their own data visualization tools over our reported data, which would be nearly impossible with Dynamo because they never know the access pattern. Someone might want to see things by a given metric that their tool handles automatically with SQL, but they would have to create their own implementation had the data been in Dynamo.
@Abrakadabra9to5 (4 years ago)
At $1000 per day being spent, I was wondering if having reserved instances for handling standard workloads/traffic and using spot instances for bursting would be cheaper.
@opherdubrovsky4175 (4 years ago)
Could be, but managing the scaling becomes more complex. The advantage of the Lambdas is that you get autoscaling built in, which makes scaling up and down a breeze.
@daniel.lafraia (4 years ago)
@@opherdubrovsky4175 If cost were an issue, I'd say spot instances would do the job at less than 20% of this cost ($200/day). It wouldn't be too complex if you used ECS or EKS and Docker containers. I'd just replace the Lambda stack.
@sfhopkinson (4 years ago)
Would be great to know if the recent 100ms -> 1ms billing granularity of Lambda has continued to dramatically reduce the costs.
@opherdubrovsky4175 (3 years ago)
In our case, since our Lambda runs average 10.5 seconds, the effect was negligible. We are saving about ~0.5% of the cost. For lots of short-running tasks, the savings would have been much more significant. You can easily estimate it for any system: just calculate the expected mean time savings per Lambda and divide by the average Lambda run time. Obviously, the longer the average Lambda runtime is, the smaller the % savings you get.
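As a quick sanity check of that ~0.5% figure: under the old 100 ms granularity, the average billing overshoot per invocation is about half the rounding step, i.e. ~50 ms.

```python
avg_runtime_s = 10.5    # average Lambda run time from the comment above
avg_overshoot_s = 0.05  # mean of a uniform 0-100 ms rounding overshoot
print(f"{avg_overshoot_s / avg_runtime_s:.2%}")  # ~0.48%, i.e. roughly 0.5%
```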
@zhiweio (2 years ago)
That's pretty COOL
@ChristWolves (4 years ago)
RDS is not serverless, unless you are using the more expensive Aurora Serverless. Any thoughts on optimizing the RDS costs while still running at optimum efficiency?
@opherdubrovsky4175 (4 years ago)
After shooting the video, we also got off EMR and are running Spark on EKS (i.e. the system has evolved further). So the whole system is now totally serverless except the DB. This is a bit annoying and is a potential block point. However, it is running well right now, so there is no urgency to change it. The main problem is that it is not really clear what we can replace it with. The relational DB gives us a lot of power in analyzing data and complex rate limiting abilities based on data aggregations, so replacing it is not simple. Any ideas?
@LudgerPeters (4 years ago)
@@opherdubrovsky4175 I would imagine a mix between DynamoDB and Elasticsearch. We run a similar system for marking up large amounts of data; we store metadata in DynamoDB and then have it stream to an Elasticsearch cluster where we can create very powerful dashboards and run queries. You can scale out your Elasticsearch cluster; if you do not need performance, you can make it very small and it will easily handle querying a large amount of data.
@LindsayForbes (4 years ago)
Well done. This is amazing.
@mohammedramadan3480 (3 years ago)
Is it possible to have a million Lambda invocations in one second? Not the same Lambda - I mean the max number of concurrent Lambdas per second.
@Godrose (4 years ago)
Thanks for the great presentation. What kind of tooling did you use to find the sweet spot for Lambdas? I mean the offline part: SAM, SLS, something else?
@@HebrewCloud The power tuning tool is a great way to optimize. We originally wrote our own simulator that does something similar. I suppose the next time we do an optimization we will use the power tuning tool. Another thing to try is to remove all unneeded libraries in your code and also make it more memory efficient. You need to look at all your "includes" in your code and figure out if you really need them. Whatever you do not need, remove it. It will save memory but also reduce the warmup time of the Lambda. We were also able to save a lot of memory by loading the data into the Lambda in batches, instead of a whole file. That makes processing all of the data slower, but if you save enough cost on the lower memory footprint, it may make it worthwhile, as it will reduce the memory requirements of the Lambda, and that translates to immediate cost savings.
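A sketch of the batching idea, assuming newline-delimited records in S3: stream the object instead of reading it whole, so the Lambda's peak memory tracks the batch size rather than the file size (the bucket, key, and batch size are placeholders).

```python
import boto3

s3 = boto3.client("s3")

def handle(batch):
    ...  # placeholder for the per-batch processing/upload logic

def process_in_batches(bucket, key, batch_size=10_000):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    batch = []
    for line in body.iter_lines():  # streamed from S3, not buffered in full
        batch.append(line)
        if len(batch) >= batch_size:
            handle(batch)
            batch.clear()
    if batch:
        handle(batch)  # flush the final partial batch
```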
@MrDagonzin (2 years ago)
Great episode! I am surprised they were able to make an intelligent manager work with Lambdas and the limitations on the number of libraries that they have. Did they not need to use EC2?
@dondreytaylor8001 (4 years ago)
Very informative, I like this one a lot.
@loke261989 (4 years ago)
What about your EMR configuration? Can you publish the number of nodes and the instance type?
@opherdubrovsky4175 (4 years ago)
It would change over time depending on the amount of data and load. We did not need a very large cluster, as the data transformations are simple grouping and/or partitioning for each account + a few enrichment fields. So most of the time it was 1 master + 1 core + 7 task spot instances using c5d.2xlarge. This cluster was always on, running in cycles and pulling the next batch once it finished the current one. When we had high loads or needed to reprocess lots of data, we would either add more task nodes or start a second EMR cluster in parallel (we've built it so you can run multiple independent clusters side by side).
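A rough reconstruction of that cluster shape with boto3's run_job_flow; only the instance topology comes from the comment above, while the release label, roles, and other settings are assumptions.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="dataout-transform",
    ReleaseLabel="emr-5.30.0",  # assumed release for the period
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "c5d.2xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "c5d.2xlarge", "InstanceCount": 1},
            {"InstanceRole": "TASK", "InstanceType": "c5d.2xlarge",
             "InstanceCount": 7, "Market": "SPOT"},
        ],
        # Always-on cluster that keeps pulling the next batch in cycles.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```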
@deepus9156 (4 years ago)
Superb
@omidvahdaty7636 (4 years ago)
Impressive work!
@mohitvachhani1133 (4 years ago)
What was the specific reason for using RDS instead of DynamoDB? DynamoDB is write-friendly.
@opherdubrovsky4175 (4 years ago)
We do a lot of analytic queries, especially for queue assessments and for rate limiting, so a relational DB was convenient. That said, we will reevaluate the DB in the future.
@rockersamurai (4 years ago)
RDS could be DynamoDB if they don't need full SQL.
@opherdubrovsky4175 (4 years ago)
We are doing a lot of SQL aggregations and analysis, so we kept the RDS in there. That said, it is the odd part of the system, not being serverless, so we might eventually replace it with something else.
@StephenYang21 (4 years ago)
That was AWSome!
@anuragmishra6961 (2 years ago)
Why is there a need for a Lambda function between SQS and RDS, and again a Lambda function that pushes data to the EMR cluster for transformations? Couldn't Spark on the EMR cluster directly fetch the data from SQS and do all sorts of work before writing back to RDS? That way, two Lambda functions could be removed: the first between SQS and RDS, and the other between RDS and EMR.
@CommunityOkcom (3 years ago)
What blackboard are you using? Thank you.
@Tibetan-experience (4 years ago)
Great 👌
@achyutvyas (4 years ago)
Really amazing
@kushagrabhushan (4 years ago)
I have no idea what these guys are talking about. Where do I start learning about it?
@jrapp654 (4 years ago)
I'm curious what your RDS costs look like monthly. What instance type do you have it running on?
@opherdubrovsky4175 (4 years ago)
In the US, where we have a big instance, we are paying about $1700/month for the RDS. It is currently running on an r4.4xlarge. However, we are looking into moving it to a smaller instance, since we just implemented RDS Proxy (a new feature in AWS), which allows us to use fewer connections to the DB and less memory. For the other, smaller regions (like the EU) we use db.r4.large - that one is ~$200/month.
@jeffevans8200 (4 years ago)
@@opherdubrovsky4175 Have you ever considered using Azure? Or only AWS?
@brennanbugbee (4 years ago)
very cool
@spiralni (4 years ago)
Hard to imagine that volume
@opherdubrovsky4175 (4 years ago)
Yes, I agree. I keep scratching my head, as it is really hard to grasp the volumes as well as how cheap it is to process so much. I guess just a few years ago it would have seemed like science fiction :)
@jamesren4949 (4 years ago)
Hi Opher, thanks for sharing the amazing serverless design. May I ask how you would handle large S3 objects with Lambda?
@opherdubrovsky4175 (4 years ago)
Great question - the step previous to the Lambdas is a Spark server that does the data transformations and data preparations. It used to be one large Spark cluster, but lately (after the video was recorded) we have transitioned it to an EKS Kubernetes cluster running small Spark pods. Anyway, this Spark step prepares the files and makes sure the sizes are never above a certain limit (currently 50 MB). However, as the data for each account varies in quantity, 85% of the files are below 5 MB.
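A sketch of how a Spark prep step can bound output file sizes; the records-per-file cap standing in for the 50 MB limit, and the column and path names, are assumptions for illustration, not Nielsen's actual job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://example-bucket/raw/")  # placeholder input path

(df.repartition("account_id")
   .write
   .option("maxRecordsPerFile", 500_000)  # tuned so output files stay under ~50 MB
   .partitionBy("account_id")
   .json("s3://example-bucket/prepared/"))
```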
@BlueBird0424 (4 years ago)
Thanks so much for this video. In fact, our company is facing a problem right now where we need to deal with a lot of files on a daily basis, and this video is well worth referencing! Have you thought of using serverless ECS or AWS Batch to solve this problem?
@opherdubrovsky4175 (4 years ago)
We have not tried AWS Batch. It might also be a viable solution. Would love to hear how it works for you. Regarding other types of serverless - we have a workload that runs on EKS and scales up and down, and that works very well too. We'll probably do another talk about it sometime in the future.
@BlueBird0424 (4 years ago)
@@opherdubrovsky4175 With AWS Batch and EC2 Spot Instances you can save up to 90% while maintaining performance (vs. on-demand instances). E.g., you can use about 20 c5ad.24xlarge spot instances (96 vCPU, 192 GiB, 2 x 1900 NVMe SSDs, $1.5454 per hour) to do batch work, which costs around $750 per day. The downside is that it may be limited by AWS service quotas. The problem we're having now is that we need to extract and simply process more than 100 GB of data per day from the RDS, and after seeing this video I think Lambda is very feasible!
@opherdubrovsky4175 (4 years ago)
@@BlueBird0424 We are using spot instances in many other areas and are very fluent in using them. We have thought of building an architecture around them, but the awesomeness of the autoscaling you get with Lambdas tipped us towards Lambdas.
@odwa.m (4 years ago)
I find that EMR is better suited for big data jobs like this than Batch. This is because Hadoop has a lot of transparency with performance optimization and the type of programs you want to run over the data. This means that you have the additional option of being able to use machine learning libraries like Spark MLlib or Mahout. You can also produce reports for managers or stakeholders using Hive. I'm sure you get my point.
@TheMegaMrMe (2 years ago)
Sooo the heavy lifting is done by EMR, and Lambdas do basic routing. I wouldn't phrase the title like you did.
@AlexandrCherednichenko (4 years ago)
Isn't there a limit of 1000 concurrent lambdas per AWS account?
@dangtrinhnt (4 years ago)
The maximum execution time of a Lambda function is 15 minutes, so I wonder how you make it handle a large amount of data?
@MohamedMahmoud-od2zy (4 years ago)
You can trigger the Lambda on object creation when objects are uploaded to the S3 bucket, and you can break the events into smaller files from EMR to be written to S3. That way you can have millions of Lambdas running in parallel without exceeding the Lambda limits like memory or execution time.
@opherdubrovsky4175 (4 years ago)
@@MohamedMahmoud-od2zy That's exactly what we do. We have a few thousand concurrent Lambdas running in parallel. The trick is to make sure the tasks each one gets can be completed in time before they expire. We actually set our Lambdas to 5 min because it helps us cap runaway costs in cases when we accidentally deploy a bug. So in theory, you'd want your tasks to be as short as possible, up to the point where making them shorter would make them less efficient.
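A minimal sketch of that S3-triggered fan-out: each small file written to S3 fires one short-lived Lambda, so concurrency follows the number of files (the delivery function is a placeholder).

```python
def deliver_file(bucket, key):
    ...  # placeholder: fetch the file and upload it to the ad platform

def handler(event, context):
    # Standard S3 "ObjectCreated" event shape: one record per new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        deliver_file(bucket, key)
```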
@gipbwok2008 (2 years ago)
Duh, that was above my head 😊
@sumitkhandelwal8155 (4 months ago)
I'm surprised you don't have any DLQ set up in the system. If there is an error, how does the system know about it during that time, and if we want to reprocess the failed events, how can we do that?
@awssupport (4 months ago)
Hi, our re:Post community alongside our devs may be able to provide some alternate solutions. Don't hesitate to post any questions you may have here: go.aws/4d7YsNE. ^CM
@dannyruchman (4 years ago)
Respect Opher :-)
@opherdubrovsky4175 (4 years ago)
You started it, man... I just pushed it along a few more hills along the road :)
@dannyruchman (4 years ago)
@@opherdubrovsky4175 Amazing to see this in action at such scale.
@daniel.lafraia (4 years ago)
I wouldn't say $1k a day is a low-cost approach. I believe that if they used spot instances, this would make a great impact on cost. Spot instances with ECS (or EKS) with autoscaling, running Docker containers with their code, would be a good, cost-effective solution.
@LudgerPeters (4 years ago)
Yeah, I think using ECS tasks with auto-scaling would bring down costs, especially if you are running faster languages. I personally run Java on ECS Fargate instances and saw significant savings over using Lambdas. If your Lambdas are running 24 hours a day, you might as well run an instance; it will work out cheaper.
@itaiyaffe (4 years ago)
Trust me, $1k/day is an extremely low-cost approach for this kind of system (even after exploring the non-managed FaaS option).
@braybilly (4 years ago)
Curious what language they chose for their lambda code
@agesnipes (4 years ago)
Billy B, me too!
@opherdubrovsky4175 (4 years ago)
Scala
@COMPTROL (4 years ago)
@@opherdubrovsky4175 Is there any specific reason for this choice? Immutable data structures should be available in other languages as well. Or was it a specific framework like Spark driving your decision? Thanks in advance.
@piedroconte (4 years ago)
I did not see a real-time processing workload; I think these workloads are batch and ETL-based on AWS.
@opherdubrovsky4175 (4 years ago)
Correct. These workloads are batch. They come in through files that "drip" in and get processed. That said, the system is constantly processing data, so as data comes in, it gets picked up and processed.
@jayw8348 (4 years ago)
Very informative video, and great comments on the details, Opher! I checked out "THE RISE AND FALL OF SERVERLESS COSTS" or so! Q: Do you have to consider locking when doing the RDS query? You mention RDS as an intelligent queue. Q: Are your Spark pods fine-tuned for different accounts / different file sizes to achieve higher utilization? Would love to learn more from this piece, as we have a use case to read S3 from either Glue or Spark on EMR/EKS.
@opherdubrovsky4175 (4 years ago)
As the scale increased over time, we got to a point where we had some DB locking issues and also ran into too many connections to the DB (our limit was 5000). Our solution back then was to do the file status update via a queue and a Lambda that reads messages from it. That reduces the connections to the DB and the danger of locks. Lately AWS came out with RDS Proxy. We tried it out and deployed it last week. Our DB connections dropped from ~5000 to ~1400. So next we will see if we can work with a smaller DB instance and save on the DB costs. We're thinking of going to 1/4 the size.
@g.egziabher1522 (2 years ago)
Does anyone know where to get these magnetic stickers?
@shivamnarware4325 (2 years ago)
How are you managing the Lambda execution time limit at this scale?
@idcarlos (2 years ago)
They have a timeout of 5 minutes; each file is processed in parallel.
@tobennanwokike (4 years ago)
Wow!
@themoah1 (4 years ago)
There is no way it costs $1000 per day. I don't know if they are getting special pricing, but storing 30 TB of data daily on S3 would cost about $600, and that amount of egress traffic (e.g. the mentioned 250 Mbps) also costs a lot. Let's assume that all the Lambdas and EMR don't even produce logs and metrics (CloudWatch can also cost a lot).
@opherdubrovsky4175 (4 years ago)
We are not storing 30 TB a day in that system. This system only prepares data and uploads it to ad platforms. The overall DMP system stores about 5 petabytes of data in total, but the Dataout system in the video is just one subsystem of it and does not store a lot of data except the data in process. The DB just holds the metadata about the tasks to process. The data itself is in files on S3.
@dellendinho (4 years ago)
Would love to see their RDS bill.
@opherdubrovsky4175 (4 years ago)
The current RDS bill for our biggest region (US) is $1500/month. We just moved to using RDS Proxy, which should allow us to move to a smaller, 1/4-size RDS instance due to the lower number of connections. I expect that will help drop the cost to about $375/month.
@dellendinho (4 years ago)
@@opherdubrovsky4175 Thanks. How many IOPS, and how did you determine the appropriate number?
@abhinee (4 years ago)
great stuff
@frenkelalexandre (4 years ago)
Nice!
@upendrareddy6304 (4 years ago)
How can your system have 3000 Lambdas concurrently running in a minute? The max parallel invocations are only 1000, right?
@abhinee (4 years ago)
AWS will increase the limit if required.
@kirin9991 (4 years ago)
Tell us that sweet spot
@deepakkwatra (4 years ago)
Do you somehow try to manage the order of events before sending them to different networks?
@opherdubrovsky4175 (4 years ago)
In our case the order is not important, so we don't worry about it. However, events do roughly go out in FIFO (first in, first out) order, except when there is reprocessing or a delay for some reason.
@ans0600 (4 years ago)
How about the RDS storage cost, which increases by a few TB per day?
@itaiyaffe (4 years ago)
RDS only stores the metadata in this case (AFAIK), so it's a lot less than a few TB/day.
@opherdubrovsky4175 (4 years ago)
We also delete the metadata from RDS after a few months. We keep it there that long just for statistics.
@ppgab (4 years ago)
Couldn't he use a queue instead of RDS to keep track of the tasks?
@opherdubrovsky4175 (4 years ago)
The DB is the queue!! It's an intelligent queue. We could have used a regular queue (like SQS or any other), but we would have lost the ability to intelligently control the flows. For example - if you wanted to do rate limiting just for a few accounts, you would basically need a separate queue for each account. We have a few hundred accounts, so we would have needed hundreds of queues. Also - when you need to reprocess data for various reasons, all we have to do is write a query and set the relevant completed files back to the "pending" state, and they automatically get picked up again. With a queue this becomes very hard. We had a queue in another part of the system before, and we ended up getting rid of it and moving that to the DB. We still have queues that are used as buffers in the system, but not for smart control of the flows.
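A sketch of what such an intelligent queue can look like on Postgres, with assumed table and column names: workers claim a slice of pending tasks without colliding (SKIP LOCKED), and reprocessing is a single UPDATE instead of re-queueing thousands of messages.

```python
# Claim up to 100 pending tasks atomically; concurrent claimers skip rows
# that another transaction has already locked.
CLAIM_TASKS = """
    UPDATE tasks
       SET status = 'processing', started_at = now()
     WHERE id IN (
         SELECT id FROM tasks
          WHERE status = 'pending'
          ORDER BY created_at
          LIMIT 100
          FOR UPDATE SKIP LOCKED
     )
    RETURNING id, s3_key, account_id
"""

# Reprocessing everything for one account is a single query.
REPROCESS_ACCOUNT = """
    UPDATE tasks SET status = 'pending'
     WHERE account_id = %s AND status = 'done'
"""
```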
@jrapp654 (4 years ago)
@@opherdubrovsky4175 Really interesting way to handle reprocessing failed data. I've always thought to rely on a DLQ, but I like the control that RDS gives you.
@abhinee (4 years ago)
@@opherdubrovsky4175 Hi Opher, how are you ingesting data? Can you please elaborate - is it SQS?
@papieskimegatron (4 years ago)
How long did it take to implement the system? How many people worked (or are still working) on this?
@opherdubrovsky4175 (4 years ago)
We started out about 2 years ago and had between 2-3 people working on it constantly. That said, a lot of the work is not related to the scaling but rather to ad-platform-specific support we gradually added + peripheral tools we built to provide our solutions team with ways to upload various data types. Also - it took us quite some time to learn how to build it and make it robust. When we started, a lot of things that in hindsight look obvious were not obvious. Unfortunately, we did not have hindsight back then!! 😁 I imagine if I had to build it today, now that I know how to do it, and focusing just on the core system, I could build it with 2 developers within a few months.
@papieskimegatron (4 years ago)
@@opherdubrovsky4175 Thank you very much for your answer. The talk also skipped the fact that EMR is not serverless and isn't really pay-per-use like the rest of the system (well, neither is RDS, but let's skip that part for a minute :P). How good is EMR autoscaling when working with billions of events daily? Did you have to create custom "smart" autoscaling rules to spin up and tear down the clusters? Or is just the min/max instances sufficient, with YARN doing its job well? I imagine you have thousands of Spark jobs running every hour. Great talk! Looking forward to hearing more.
@opherdubrovsky4175 (4 years ago)
@@papieskimegatron EMR was decently OK, but the problem was that due to data skew between the different ad accounts we upload to (some small and some large), the cluster did not scale well and you got diminishing returns as you added more instances. To fix this, we ended up moving from Spark over EMR to a serverless-like Spark where we run lots of small independent Spark instances and scale up/down by adding/removing instances. This works beautifully and scales really well. And, due to the improved scaling, the cost is about 50% of what it was on EMR - which is great 😊 I am giving a talk about it with one of my developers at the Data+AI Summit in November (online and free to register this year). You can check it out and register here - databricks.com/session_eu20/scale-out-using-spark-in-serverless-herd-mode
@papieskimegatron (4 years ago)
@@opherdubrovsky4175 Does that mean the EMR in the talk is not really EMR but your own Kubernetes + Spark (in "herd" mode)? :P I will certainly join, as it sounds great. Thank you!
@opherdubrovsky4175 (4 years ago)
@@papieskimegatron When we recorded it a few months ago, it was plain vanilla Spark over EMR. However, the video took a while to be published due to Corona delays in the video post-production. By that time, we had already moved to the next system improvement I mentioned above. The next time we talk about it, we will already be talking about the next iteration of the architecture.
@bharath700i (4 years ago)
How much data is that RDS Postgres holding? Is there one record for each event?
@opherdubrovsky4175 (4 years ago)
RDS just stores the metadata about each task (file), so it has a small amount of data in total - a few million rows each day.
@juanvassallo2993 (4 years ago)
@@opherdubrovsky4175 Does the $1000/day cost include RDS and all the other services, or is it just for Lambda?
@jett_royce (4 years ago)
Using a bunch of lambdas for this sounds like it's way more expensive (cost-wise) than it should be. Even after accounting for the convenience.
@kirin9991 (4 years ago)
Have you ever seen server maintenance costs?
@chris-ew9wl (4 years ago)
It looks simple, but with that large an amount of processing, you are better off using Kubernetes. The Lambda cost would be at least $100,000+ a day, and that's just one of the four Lambda services in his architecture diagram.
@itaiyaffe (4 years ago)
That's actually far from the truth. As Opher describes in the video at 6:40, the cost of the system (not just one Lambda) is only ~$1000 per day (as of the making of this video). As Opher says, the team was able to reduce the costs from $7.7 per billion events to $4.25 per billion events. At about 250 billion events per day, that gets you to about $1000 per day. Note that by "events" we don't mean Lambda invocations, but rather the raw events coming into the system in those files shown on the left-hand side of the blackboard.
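For reference, the arithmetic behind those two numbers:

```python
events_per_day = 250e9      # ~250 billion events/day
cost_per_billion = 4.25     # USD after optimization (down from 7.7)
print(events_per_day / 1e9 * cost_per_billion)  # ~1062 USD/day
```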
@chris-ew9wl (4 years ago)
@@itaiyaffe Hey, thanks for correcting me. I missed that part at 6:40 where they discussed the cost of the system. When I saw the SQS and 250 billion events, I immediately thought Lambda was processing each of those 250B events individually. Hence my calculation. Was there a comparison with a similar architecture on a Kubernetes cluster? From what I gather, processing events of this magnitude... But if peak traffic is just 30 million invocations, then that's just $5 (on Lambda compute alone).
@itaiyaffe (4 years ago)
@@chris-ew9wl So Lambda cost, as far as I recall, is based on the memory cap and execution time. Not sure it's $5, but I think we both agree on the magnitude of the costs 🙂 As for the K8s costs - I don't think the numbers were published, but they were in fact greater, for sure. Opher is the cost-reduction master; you can trust that he checked it thoroughly.
@opherdubrovsky4175 (4 years ago)
In parallel to the Lambda processing, we've built another way to dispatch the data using an OpenFaaS cluster. We would then send a % of the traffic to that cluster. The goal was to reduce costs. However, we never managed to drive down the costs there, as it was really hard to keep the system constantly busy so there would be no wasted compute. That said, it's probably solvable if we invest enough time. But - it's complex!! I'd much rather get a better price plan for volume use on Lambdas and keep using them, since the architecture is amazing and helps you build very good yet simple solutions, with autoscaling built in. Hopefully AWS will come up with additional pricing plans like EC2 has, and then this whole discussion will become irrelevant.
@chris-ew9wl (4 years ago)
@@opherdubrovsky4175 Agreed, and I share your sentiment. Hey Opher, thanks for replying to this thread. I really appreciate it. I guess the whole point of serverless is to invest more time developing rather than managing. So I guess, even though there's money that can be saved via a Kubernetes cluster (I'm a fan of OpenFaaS too), we have to factor in engineering time to see if the money saved will offset this loss. I know this is AWS's This Is My Architecture, but did you also factor in vendor lock-in when you were designing the system?
@joshhardy5646 (2 years ago)
I’d imagine at that scale they were busting the ceilings out of the max concurrent Lambda instances.
@EndvrX (a year ago)
Would like to know if somebody has made a project using this architecture; I'd like to connect!
@manmitsingh4958 (4 years ago)
Waoooo🔥
@ssb26 (4 years ago)
300K opex per year for such a large operation is a no-brainer. If the same thing were done on-prem, it would have run into millions. Optimization is unnecessary.
@wtfzalgo (4 years ago)
Isn't Lambda concurrency capped at 1k?
@opherdubrovsky4175 (4 years ago)
There is a default limit, but you can ask AWS to increase it to a higher level.
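A sketch of requesting that increase programmatically through Service Quotas (a support case works too); the quota code below is written from memory and should be verified with list_service_quotas before relying on it.

```python
import boto3

sq = boto3.client("service-quotas")
sq.request_service_quota_increase(
    ServiceCode="lambda",
    QuotaCode="L-B99A9384",  # assumed code for Lambda "Concurrent executions"
    DesiredValue=10000.0,
)
```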
@wtfzalgo (4 years ago)
@@opherdubrovsky4175 oh that's right, forgot about that
@superdkls (4 years ago)
No use of Glue or Athena?
@opherdubrovsky4175 (4 years ago)
How would you propose using Glue or Athena in the setup described?
@1991anirudh (2 years ago)
It was a bit annoying that the interviewer kept on interrupting him
@henrywwilson5092 (4 years ago)
What's the maximum Lambda concurrency you reached? Did your Lambdas encounter any throttling on S3 API requests?
@opherdubrovsky4175 (4 years ago)
Currently we set our max concurrency for the Lambda uploading the files at 7,000, but in the past it was lower (3,000). We have run into the limit a few times when we had a large backlog, mostly due to a bug introduced in new code that required reprocessing of a lot of files. We did reach throttling at certain times, but our system is resilient to that. It's important to stress that you have to design your system to be tolerant of failures and be able to recover from them. In our case, the Lambda would rerun using the AWS dead-letter queue mechanism when the disruption was for a short amount of time. When it persists for a longer time, we have a retry mechanism in the system that releases the files back to pending status, and they get picked up again by the system later on for processing.
@henrywwilson5092 (4 years ago)
@@opherdubrovsky4175 Thank you very much. Brilliant!