This is one of the best episodes in the series. Effective use of the serverless model. 👍
@praveensripati (4 years ago)
A major chunk of the work is done by EMR, and the state of the system is stored in RDS, which is also very critical; both are not serverless. BTW, really enjoyed the session.
@opherdubrovsky4175 (4 years ago)
@@praveensripati Stay tuned for our next update. By now we are running Spark in serverless mode as well (it's our own architecture, not using EMR, and it scales up and down on a dime) - we are going to talk about it at re:Invent (free virtual conference this year). Search for the talk by Boaz and me when the agenda goes public. So the only piece that is still not serverless is the DB, but we don't have an immediate blocker with it that will cause us to try hard to find an alternative.
@praveensripati (4 years ago)
@@opherdubrovsky4175 Thanks for the response. Just curious - what's the stack: K8s, AWS Glue, or Knative?
@trailerhaul8200 (2 years ago)
@@praveensripati EMR now has a serverless model
@gsb22 (4 years ago)
The guy asking the questions is a legend. He was asking the very same questions that were coming to my mind. If it weren't for his bad-ass questions, this video would have been an average video. Thanks guys. Loved this video.
@ivanguerra1260 (4 years ago)
Yeah, and his skill to simplify.
@prakashpoudele (4 years ago)
Implementing rate limiting so that you don't hurt your partners is next-level badass!
@VishalVyas30 (4 years ago)
Very detailed architecture walkthrough, Opher and the entire Nielsen team. This will definitely help millions of AWS customers facing a similar situation.
@michaelstaub22 (4 years ago)
Wait so the entire internet surveillance economy is powered by SQS, S3, RDS, and Lambda? Incredible. It literally could not be more simple.
@sb9377 (4 years ago)
And EMR!
@NiccoloHamlin (4 years ago)
Of course not. They have competitors.
@hellohelloronin (4 years ago)
Spark on EMR being the core component. The video title is so misleading.
@itaiyaffe (4 years ago)
@@hellohelloronin Actually, the title is not misleading. Spark on EMR is only one component, and not even the most significant one.
@hellohelloronin (4 years ago)
@@itaiyaffe Why do you think Spark on EMR is not a significant component?
@bobhaffner5902 (4 years ago)
One of the best This Is My Architecture episodes to date. Great job, gents
@tanveeriqbal6680 (2 years ago)
It is amazing to see the power of Lambda functions.
@rachellejanssen2655 (4 years ago)
when your system is such a badass that your partners' servers think it's a DDoS, that's epic
@manzilkiit (4 years ago)
When your system is so scalable, that you end up DDoS-ing your own services xD
@andrewevans5750 (4 years ago)
I mean, some systems come down easy. I once had a job aggregating and cleaning public data. I took down the MT DMV and public data servers for a good 5 days because I forgot a 0 in my timeout. I only had about 3 servers, and that was back when running a scaled system meant working with Akka and Scala. Needless to say, that and a growing relationship with a co-worker ended that gig. The only reason it was a layoff was that my boss was confused. He tried to claim the open source I was using was his, he got a nice cease and desist, and listed me as a layoff with a payout instead of a termination. I make more with AWS anyway now.
@cloud_c5222 (2 years ago)
@@andrewevans5750 Thanks for sharing. Which domain in AWS do you work in: Solutions Architect, Security Engineer, Networking, etc.?
@abrarcalculas (4 years ago)
I can't even imagine how this system can handle 250 BN events a day. The architecture looks very elegant and extremely optimized. The most interesting takeaway for me was how they rate-limited the system and were able to reduce the cost per BN events by simply tuning the Lambda configurations. Excellent insights. Loved this.
@opherdubrovsky4175 (4 years ago)
Check out our talk at the re:Invent conference - free and online this year. We explain the idea behind how we've implemented the rate limiting mechanism. This is the name of the talk - Managing the serverless challenge: Running big data on AWS Lambda - ARC310
@opherdubrovsky4175 (4 years ago)
The agenda is up - here are the dates/times for the talk (GMT times). It will be broadcast 3 times.
@abrarcalculas (4 years ago)
@@opherdubrovsky4175 Thanks for the heads up. Registered and added the time for the talk to my calendar. Eagerly awaiting.
@pemessh (2 years ago)
Thank god Lambda Power Tuning exists. We can now figure out which memory configuration to use without much effort. No writing simulators and things. Yay!!!
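A minimal sketch of driving the open-source Lambda Power Tuning state machine (a Step Functions app); the ARNs and payload below are placeholders, and the input fields should be checked against the tool's documentation:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Input shape for the power-tuning state machine; values are illustrative.
tuning_input = {
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789012:function:my-delivery-fn",
    "powerValues": [128, 256, 512, 1024, 1536, 3008],  # memory sizes to test
    "num": 50,             # invocations per memory size
    "payload": {"test": True},
    "strategy": "cost",    # optimize for cost rather than speed
}

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:powerTuning",
    input=json.dumps(tuning_input),
)
print(execution["executionArn"])
```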
@salessiteboost7665 (4 years ago)
I will be moving to Lambda/serverless for my next projects. I can write Python or Node functions within Lambda and scale very easily. Awesome!
@jaganmayeesahoo4210 (4 years ago)
Thanks for the overview, and I appreciate your effort to come up here and share the knowledge. Could you please upload the videos for each module and the challenges?
@opherdubrovsky4175 (4 years ago)
There is a talk I gave about some of the challenges at the DevTalks conference a few months ago. You can check it out here - kzbin.info/www/bejne/e5vMnHadZrZqpqc We'll also be doing another talk about the system at re:Invent later this year.
@YGL53 (4 years ago)
It must be very satisfying to write on that blackboard with a white marker :D Thanks for the content though :)
@arrjay3814 (3 years ago)
You should consider using RDS Proxy to help overcome the connection limits.
@tomstravelingadventures (4 years ago)
Love watching these architecture videos
@salessiteboost7665 (4 years ago)
This is what an elegant solution looks like. Well done guys. :)
@AnGELsPearhead (4 years ago)
Amazing walkthrough and well thought out architecture. Really enjoyed watching it.
@michailxirouchakis8325 (4 years ago)
Thanks for sharing this :) Regarding the problem of "writing back to the database" (5:57), have you used the new RDS Proxy solution, and if yes, what is your feedback?
@opherdubrovsky4175 (4 years ago)
We just deployed it to 2 of our smaller regions last week. It looks like a really good solution, but I'll know more next week once we deploy in the US. One thing to know is that since it is a new feature, it is currently not available in all availability zones. We ran into this while implementing. But I guess this is a short-term issue that will get resolved in the coming weeks.
@egomezr (4 years ago)
So funny: your system works so well and follows such a nice, scalable approach that your partners think you're doing something wrong. Amazing! :-o
@anandsunderraman8643 (4 years ago)
Very nice presentation. Makes me ask more questions. What is the runtime of your Lambda? How do you handle DLQ scenarios? You mentioned running out of DB connections and hence piping the data via SQS. How did piping the data through SQS actually limit the number of DB connections? When you say data per second, how do you measure it? Did you run into memory issues with the Lambda, and if yes, how did you overcome them?
@opherdubrovsky4175 (4 years ago)
See the comment about DB locks in reply to "uncle Tetsu". The SQS helped since, instead of having thousands of Lambdas each updating 1 row in the DB, they can send their message to an SQS queue, and we have an "update Lambda" that reads the SQS and updates the DB. So instead of having thousands of Lambdas update the DB, you have just a few copies of that update Lambda updating it. That reduces the connections to the DB. That said - AWS came out with RDS Proxy lately, which gives you all that stuff out of the box. I suggest you check it out. We just deployed it and it is sweet. Memory issues - yes, at first. We did a lot of profiling to figure out why and which libraries are not really needed. Lambda runtime - our average is 10 sec with a 30 sec std dev. That means that some runs could end up being 1-2 min long, but most are short.
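A minimal sketch of the fan-in pattern described above, assuming a simple tasks table and message shape (neither is Nielsen's actual schema): thousands of workers write "file done" messages to SQS, and a handful of copies of this Lambda drain the queue in batches, so only a few DB connections are ever open.

```python
import json
import os
import psycopg2  # assumed to be bundled with the deployment package

# One connection per container, reused across warm invocations.
conn = psycopg2.connect(os.environ["DB_DSN"])

def handler(event, context):
    # Each SQS-triggered invocation receives a batch of status messages.
    updates = [json.loads(record["body"]) for record in event["Records"]]
    with conn.cursor() as cur:
        cur.executemany(
            "UPDATE tasks SET status = %(status)s, updated_at = now() "
            "WHERE file_id = %(file_id)s",
            updates,
        )
    conn.commit()
```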
@ABK6969 (2 years ago)
@@opherdubrovsky4175 Thank you for the detailed reply. Is the Lambda processing using Python, Go or something else?
@Rachelebanham (4 years ago)
Thanks #AWS, this was a really great short use case.
@keent (2 years ago)
$4.25/billion events is freakin impressive any way you slice it. Without platforms like AWS/Google Cloud, there would be no innovation in the tech world.
@upuldi (2 years ago)
1000 dollars per day is extremely cheap compared to the massive load it handles. Our Splunk cluster is costing more than that. Very good use of technology.
@petersavnik6777 (4 years ago)
We love this blackboard and the colors you use in these videos. Can you share some details on which markers and blackboard you are using?
@bziniman (4 years ago)
We use these markers - www.amazon.com/Chalk-Markers-Chalkboard-VersaChalk-Reversible/dp/B00LOTZRUE/.
@desi97244 (4 years ago)
Very nice explanation. Wonder if any alternatives were considered to the RDS and why this was chosen.
@AnkitK-1 (4 years ago)
Would love to know about those kinds of decisions too.
@opherdubrovsky4175 (4 years ago)
See comments above (just search for the text) - this one: "the DB was originally an afterthought. we were mainly focused..." and this one: "The DB is the queue!! It's an intelligent queue..."
@odwa.m (4 years ago)
@@opherdubrovsky4175 I'm wondering if it wouldn't be better to use DynamoDB instead of RDS, since the way you are using the database is more in a NoSQL fashion. Plus, DynamoDB will not hit a performance bottleneck in the future, which is something that might affect your RDS instances (at least in terms of write capacity).
@LonliLokli (4 years ago)
I think it's a pleasure to work there, you know. Challenging and interesting tasks.
@rahulnath9655 (4 years ago)
Great video! Answered so many questions for me.
@tamerelfeky1721 (4 years ago)
Rich, quick case study. Good job.
@odwa.m (4 years ago)
I see a lot of supporting services which are probably used in the background but are not explicitly mentioned. This includes CloudWatch, which is probably monitoring the Lambda functions and databases. CloudWatch can also be used with CloudWatch Insights or Contributor Insights to help with the optimization of Lambda functions. SNS can be used to launch system tickets for issues.
@opherdubrovsky4175 (4 years ago)
That's correct. Due to the short amount of time for the video, we had to keep it at a higher level. We do use SNS in one part of the system, we use CloudWatch for monitoring and for autoscaling decisions on the Spark side of the system, and we recently added RDS Proxy to the system and also EKS. Come listen to my talk at re:Invent 2020 and you'll hear more about it. This year it is online and free - so no reason not to register. reinvent.awsevents.com/
@opherdubrovsky4175 (4 years ago)
Another relevant talk I am giving at Data+AI Summit in November (online and free to attend) is: databricks.com/session_eu20/scale-out-using-spark-in-serverless-herd-mode
@ced4298 (4 years ago)
That’s all very impressive!
@jujharsingh8128 (4 years ago)
Love these videos. Keep them coming
@DerekMurawsky (2 years ago)
Great episode, but the audio seems really low. Just me? Other videos are louder.
@yahtadi5152 (4 years ago)
Holy moly, this is super good. I should consider switching
@nicolasrecalde8561 (4 years ago)
Use Kinesis streams instead of SQS. This change makes a BIG difference in the bill, and the performance is much better!
@opherdubrovsky4175 (4 years ago)
Our SQS is about $10/day. That's 1% of the system cost. We are more focused on saving costs on the parts that make up a larger chunk of the system cost.
@tomerdubrovsky4073 (4 years ago)
Thanks for a great presentation!! Looking good! Well done.
@RaymondChenon (4 years ago)
Cost: 1000 USD/day at 6:26 for such a big system is cheap. Does it include everything on the blackboard (Lambda, SQS, RDS, S3, EMR) and outbound bandwidth?
@markusgulden4068 (4 years ago)
Yes, I would also be interested in the cost distribution over the different services (obviously, Lambda seems to be the biggest portion). Great talk, by the way!
@opherdubrovsky4175 (4 years ago)
@@markusgulden4068 Here is a breakdown of a typical day in our biggest region (US). The other regions are much smaller and make up the rest:
Lambda: $481.62
EC2 instances: $133.16
S3: $90.72
Relational Database Service: $47.80
CloudWatch: $0.17
SQS: $9.14
EC2 other: $3.20
Total cost: $765.80
@markusgulden4068 (4 years ago)
@@opherdubrovsky4175 Thanks for sharing this
@lordpablo1985 (2 years ago)
What are the work managers exactly? I'm interested to know more about the fan-out scaling method.
@LudgerPeters (4 years ago)
Thanks for a very informative video. A couple of questions come to mind. How do you manage Lambda failures and retries for all your other Lambdas, not just the delivery Lambda with the SQS queue? Also, with regards to costs, if you are running a large number of Lambdas, would switching to Fargate instances and scaling them out not be cheaper, since you would be able to scale out based on SQS queue depth? When running tests with our data processing, we ended up going with Fargate because of cold starts, and because Fargate is able to cache information between invocations since it is more long-running. How do you invoke the Lambda from the RDS instance to decide which files need to be distributed? And finally, do you have any locking anywhere? Do you lock at the DB level? Thanks.
@opherdubrovsky4175 (4 years ago)
We manage Lambda failures by having our system automatically free tasks in the DB that have been in "processing" status for longer than the Lambda timeout period (we set it to 5 min). If the task does not change to "Done" status, we have another Lambda that periodically checks for stuck tasks and changes the status back to "pending" + increments a reprocess counter. We currently try 3 times before we drop the task. We have other systems that use the dead letter queue to process failed tasks, but not in this system. Fargate could potentially be cheaper, but you have to manage your containers very carefully, since if you do not constantly keep them at 100% capacity and kill them immediately as they complete, you end up paying a lot for the aggregation of all that idle time. That is not easy to do, and Lambdas do that for you automatically with no hassle at all. We tried something similar with OpenFaaS but had a hard time keeping the cluster fully loaded all the time, so even though on paper it was cheaper, in practice we had a hard time getting it to be cheaper. We might go down the path of trying to do this with EKS pods, but I prefer that AWS comes up with a business offering for large-load systems over Lambdas, as the architecture is perfect; it's just the cost side that needs a better offering. Something like being able to pay for reserved capacity of Lambdas and get a cheaper price - like they do with EC2 systems. I think eventually they will realize they need to come up with an offering for large systems, as their current offering is focused on small-scale systems built on Lambdas. Regarding SQS - yes, that's a good way to do it. In fact, we have migrated our Spark over EMR to many Spark pods running on EKS and are using SQS for it just as you described. Regarding cold starts - we don't have any problem with that, as the containers remain available for a long period of time even when not used. So in any system that has constant traffic, this is not a problem. We even measured the cold start times and saw that even containers that are not invoked for a long time (sometimes even up to an hour) can remain available, so that the next time you call them they already have it all cached. DB locks - see the other comment I wrote about it here, in the recent comment "uncle Tetsu" asked (just search for it).
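A sketch of the stuck-task watchdog described above, as a scheduled Lambda; the table, columns, and thresholds are assumptions based on the description (5-minute timeout, 3 attempts):

```python
import os
import psycopg2

RELEASE_STUCK = """
    UPDATE tasks
       SET status = 'pending', retries = retries + 1
     WHERE status = 'processing'
       AND started_at < now() - interval '5 minutes'  -- past the Lambda timeout
       AND retries < 3                                -- drop after 3 attempts
"""

def handler(event, context):
    # psycopg2's connection context manager commits the transaction on success.
    with psycopg2.connect(os.environ["DB_DSN"]) as conn:
        with conn.cursor() as cur:
            cur.execute(RELEASE_STUCK)
            print(f"released {cur.rowcount} stuck tasks")
```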
@mrkandreev (4 years ago)
Thanks. That is exciting work!
@DanishAnsari-hw7so (2 years ago)
Can I practice this architecture on AWS by using a sample dataset? Any resource links for the same?
@josuelim4 (4 years ago)
That's great! Have you considered using DynamoDB instead of RDS?
@oll236 (4 years ago)
I think because of speed, structured data, and know-how, it's a better idea to use RDS.
@josuelim4 (4 years ago)
Franco Peña I'm not sure. DynamoDB can be much faster, cheaper, and easier to scale and maintain in many scenarios (as compared to RDS), especially when it's being primarily queried from Lambdas. The structure of their data and the query complexity may be a good point though. Would be interesting to know more.
@opherdubrovsky4175 (4 years ago)
@@josuelim4 The DB was originally an afterthought. We were mainly focused on solving the main problems, so we started with a DB that we knew we could get some runway with and thought we'd reconsider later. However, it's been working very well, and the fact that it is a relational DB has become a big advantage. We are running all kinds of complex control queries on it, and are able to easily analyze trends like costs, data size, loads, etc. Due to the fact that it is easy to run analytical queries, we use it for almost any research we need to do on the system when trying to improve it - like which flows are more expensive, where the bottlenecks are, which areas we should tackle next, etc. We also use it for the rate limiting feature we have - as rate limiting is not about a single file, but about aggregates - so you have to keep doing those aggregates to be able to make rate limiting decisions.
@tiktak2234 (4 years ago)
DynamoDB can be very restrictive when you do not have very strict and defined access patterns. DynamoDB can be great for many scenarios but probably does not fit their use case
@christophercaldwell6641 (4 years ago)
I ran into this exact scenario. An SQL database was the better choice for our team because of the inability to know how the data would be accessed. Some people wanted to run their own data visualization tools over our reported data, which would be nearly impossible with Dynamo because they never know the access pattern. Someone might want to see things by a given metric that their tool handles automatically with SQL, but they would have to create their own implementation had the data been in Dynamo.
@Abrakadabra9to5 (4 years ago)
At $1000 per day being spent, I was wondering if having reserved instances for handling standard workloads/traffic and using spot instances for bursting would be cheaper.
@opherdubrovsky4175 (4 years ago)
Could be, but managing the scaling becomes more complex. The advantage of the Lambdas is that you get autoscaling built in, which makes scaling up and down a breeze.
@daniel.lafraia (4 years ago)
@@opherdubrovsky4175 If cost were an issue, I'd say spot instances would do the job at less than 20% of this cost ($200/day). It wouldn't be too complex if you used ECS or EKS and Docker containers. I'd just replace the Lambda stack.
@sfhopkinson (4 years ago)
Would be great to know if the recent 100ms -> 1ms billing granularity of Lambda has continued to dramatically reduce the costs.
@opherdubrovsky4175 (3 years ago)
In our case, since our Lambda runs average 10.5 seconds, the effect was negligible. We are saving about ~0.5% of the cost. For lots of short-running tasks, the savings would have been much more significant. You can easily estimate it for any system: just calculate the expected mean time savings per Lambda and divide by the average Lambda run time. Obviously, the longer the average Lambda runtime is, the smaller the % savings you get.
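As a quick sanity check of that ~0.5% figure: under the old 100 ms granularity, the average billing overshoot per invocation is about half the rounding step, i.e. ~50 ms.

```python
avg_runtime_s = 10.5    # average Lambda run time from the comment above
avg_overshoot_s = 0.05  # mean of a uniform 0-100 ms rounding overshoot
print(f"{avg_overshoot_s / avg_runtime_s:.2%}")  # ~0.48%, i.e. roughly 0.5%
```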
@zhiweio (2 years ago)
That's pretty COOL
@ChristWolves (4 years ago)
RDS is not serverless, unless you are using the more expensive Aurora Serverless. Any thoughts on optimizing the RDS costs while still running at optimum efficiency?
@opherdubrovsky4175 (4 years ago)
After shooting the video, we also got off EMR and are running Spark on EKS (i.e. the system has evolved further). So the whole system is now totally serverless except the DB. This is a bit annoying and is a potential block point. However, it is running well right now, so there is no urgency to change it. The main problem is that it is not really clear what we can replace it with. The relational DB gives us a lot of power in analyzing data and complex rate limiting abilities based on data aggregations, so replacing it is not simple. Any ideas?
@LudgerPeters (4 years ago)
@@opherdubrovsky4175 I would imagine a mix between DynamoDB and Elasticsearch. We run a similar system for marking up large amounts of data; we store metadata in DynamoDB and then have it stream to an Elasticsearch cluster where we can create very powerful dashboards and run queries. You can scale out your Elasticsearch cluster; if you do not need performance, you can make it very small and it will easily handle querying a large amount of data.
@LindsayForbes (4 years ago)
Well done. This is amazing.
@mohammedramadan3480 (3 years ago)
Is it possible to have a million Lambda invocations in one second? Not the same Lambda - I mean the max number of concurrent Lambdas per second.
@Godrose (4 years ago)
Thanks for the great presentation. What kind of tooling did you use to find the sweet spot for Lambdas? I mean the offline part: SAM, SLS, something else?
@@HebrewCloud The power tuning tool is a great way to optimize. We originally wrote our own simulator that does something similar. I suppose the next time we do an optimization we will use the power tuning tool. Another thing to try is to remove all unneeded libraries in your code and also make it more memory efficient. You need to look at all your "includes" in your code and figure out if you really need them. Whatever you do not need, remove it. It will save memory but also reduce the warmup time of the Lambda. We were also able to save a lot of memory by loading the data into the Lambda in batches, instead of a whole file. That makes processing all of the data slower, but if you save enough cost on the lower memory footprint, it may make it worthwhile, as it will reduce the memory requirements of the Lambda, and that translates to immediate cost savings.
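A sketch of the batching idea, assuming newline-delimited records in S3: stream the object instead of reading it whole, so the Lambda's peak memory tracks the batch size rather than the file size (the bucket, key, and batch size are placeholders).

```python
import boto3

s3 = boto3.client("s3")

def handle(batch):
    ...  # placeholder for the per-batch processing/upload logic

def process_in_batches(bucket, key, batch_size=10_000):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    batch = []
    for line in body.iter_lines():  # streamed from S3, not buffered in full
        batch.append(line)
        if len(batch) >= batch_size:
            handle(batch)
            batch.clear()
    if batch:
        handle(batch)  # flush the final partial batch
```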
@MrDagonzin (2 years ago)
Great episode! I am surprised they were able to make an intelligent manager work with Lambdas and the limitations on the number of libraries that they have. Did they not need to use EC2?
@dondreytaylor8001 (4 years ago)
Very informative, I like this one a lot.
@loke261989 (4 years ago)
What about your EMR configuration? Can you publish the number of nodes and the instance type?
@opherdubrovsky4175 (4 years ago)
It would change over time depending on the amount of data and load. We did not need a very large cluster, as the data transformations are simple grouping and/or partitioning for each account + a few enrichment fields. So most of the time it was 1 master + 1 core + 7 task spot instances using c5d.2xlarge. This cluster was always on, running in cycles and pulling the next batch once it finished the current one. When we had high loads or needed to reprocess lots of data, we would either add more task nodes or start a second EMR cluster in parallel (we've built it so you can run multiple independent clusters side by side).
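A rough reconstruction of that cluster shape with boto3's run_job_flow; only the instance topology comes from the comment above, while the release label, roles, and other settings are assumptions.

```python
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="dataout-transform",
    ReleaseLabel="emr-5.30.0",  # assumed release for the period
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "c5d.2xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "c5d.2xlarge", "InstanceCount": 1},
            {"InstanceRole": "TASK", "InstanceType": "c5d.2xlarge",
             "InstanceCount": 7, "Market": "SPOT"},
        ],
        # Always-on cluster that keeps pulling the next batch in cycles.
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```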
@deepus9156 (4 years ago)
Superb
@omidvahdaty7636 (4 years ago)
Impressive work!
@mohitvachhani1133 (4 years ago)
What was the specific reason for using RDS instead of DynamoDB? DynamoDB is write-friendly.
@opherdubrovsky4175 (4 years ago)
We do a lot of analytic queries, especially for queue assessments and for rate limiting, so a relational DB was convenient. That said, we will reevaluate the DB in the future.
@rockersamurai (4 years ago)
RDS could be DynamoDB if they don't need full SQL.
@opherdubrovsky4175 (4 years ago)
We are doing a lot of SQL aggregations and analysis, so we kept the RDS in there. That said, it is the odd part of the system, not being serverless, so we might eventually replace it with something else.
@StephenYang21 (4 years ago)
That was AWSome!
@anuragmishra6961 (2 years ago)
Why is there a need for a Lambda function between SQS and RDS, and again a Lambda function that pushes data to the EMR cluster for transformations? Couldn't Spark on the EMR cluster directly fetch the data from SQS and do all sorts of work before writing back to RDS? That way, two Lambda functions could be removed: the first between SQS and RDS, and the other between RDS and EMR.
@CommunityOkcom (3 years ago)
What blackboard are you using? Thank you.
@Tibetan-experience (4 years ago)
Great 👌
@achyutvyas (4 years ago)
Really amazing
@kushagrabhushan (4 years ago)
I have no idea what these guys are talking about. Where do I start learning about it?
@jrapp654 (4 years ago)
I'm curious what your RDS costs look like monthly. What instance type do you have it running on?
@opherdubrovsky4175 (4 years ago)
In the US, where we have a big instance, we are paying about $1700/month for the RDS. It is currently running on an r4.4xlarge. However, we are looking into moving it to a smaller instance, since we just implemented RDS Proxy (a new feature in AWS), which allows us to use fewer connections to the DB and less memory. For the other, smaller regions (like the EU) we use db.r4.large - that one is ~$200/month.
@jeffevans8200 (4 years ago)
@@opherdubrovsky4175 Have you ever considered using Azure? Or only AWS?
@brennanbugbee (4 years ago)
very cool
@spiralni (4 years ago)
Hard to imagine that volume
@opherdubrovsky4175 (4 years ago)
Yes, I agree. I keep scratching my head, as it is really hard to grasp the volumes as well as how cheap it is to process so much. I guess just a few years ago it would have seemed like science fiction :)
@jamesren4949 (4 years ago)
Hi Opher, thanks for sharing the amazing serverless design. May I ask how you would handle large S3 objects with Lambda?
@opherdubrovsky4175 (4 years ago)
Great question - the step previous to the Lambdas is a Spark server that does the data transformations and data preparations. It used to be one large Spark cluster, but lately (after the video was recorded) we have transitioned it to an EKS Kubernetes cluster running small Spark pods. Anyway, this Spark step prepares the files and makes sure the sizes are never above a certain limit (currently 50 MB). However, as the data for each account varies in quantity, 85% of the files are below 5 MB.
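A sketch of how a Spark prep step can bound output file sizes; the records-per-file cap standing in for the 50 MB limit, and the column and path names, are assumptions for illustration, not Nielsen's actual job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("s3://example-bucket/raw/")  # placeholder input path

(df.repartition("account_id")
   .write
   .option("maxRecordsPerFile", 500_000)  # tuned so output files stay under ~50 MB
   .partitionBy("account_id")
   .json("s3://example-bucket/prepared/"))
```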
@BlueBird0424 (4 years ago)
Thanks so much for this video. In fact, our company is facing a problem right now where we need to deal with a lot of files on a daily basis, and this video is well worth referencing! Have you thought of using serverless ECS or AWS Batch to solve this problem?
@opherdubrovsky4175 (4 years ago)
We have not tried AWS Batch. It might also be a viable solution. Would love to hear how it works for you. Regarding other types of serverless - we have a workload that runs on EKS and scales up and down, and that works very well too. We'll probably do another talk about it sometime in the future.
@BlueBird0424 (4 years ago)
@@opherdubrovsky4175 With AWS Batch and EC2 Spot Instances you can save up to 90% while maintaining performance (vs. on-demand instances). E.g., you can use about 20 c5ad.24xlarge spot instances (96 vCPU, 192 GiB, 2 x 1900 NVMe SSDs, $1.5454 per hour) to do batch work, which costs around $750 per day. The downside is that it may be limited by AWS service quotas. The problem we're having now is that we need to extract and simply process more than 100 GB of data per day from the RDS, and after seeing this video I think Lambda is very feasible!
@opherdubrovsky4175 (4 years ago)
@@BlueBird0424 We are using spot instances in many other areas and are very fluent in using them. We have thought of building an architecture around them, but the awesomeness of the autoscaling you get with Lambdas tipped us towards Lambdas.
@odwa.m (4 years ago)
I find that EMR is better suited for big data jobs like this than Batch. This is because Hadoop has a lot of transparency with performance optimization and the type of programs you want to run over the data. This means that you have the additional option of being able to use machine learning libraries like Spark MLlib or Mahout. You can also produce reports for managers or stakeholders using Hive. I'm sure you get my point.
@TheMegaMrMe (2 years ago)
Sooo the heavy lifting is done by EMR, and Lambdas do basic routing. I wouldn't phrase the title like you did.
@AlexandrCherednichenko (4 years ago)
Isn't there a limit of 1000 concurrent lambdas per AWS account?
@dangtrinhnt (4 years ago)
The maximum execution time of a Lambda function is 15 minutes, so I wonder how you make it handle a large amount of data?
@MohamedMahmoud-od2zy (4 years ago)
You can trigger the Lambda on object creation when objects are uploaded to the S3 bucket, and you can break the events into smaller files from EMR to be written to S3. That way you can have millions of Lambdas running in parallel without exceeding the Lambda limits like memory or execution time.
@opherdubrovsky4175 (4 years ago)
@@MohamedMahmoud-od2zy That's exactly what we do. We have a few thousand concurrent Lambdas running in parallel. The trick is to make sure the tasks each one gets can be completed in time before they expire. We actually set our Lambdas to 5 min because it helps us cap runaway costs in cases when we accidentally deploy a bug. So in theory, you'd want your tasks to be as short as possible, up to the point where making them shorter would make them less efficient.
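A minimal sketch of that S3-triggered fan-out: each small file written to S3 fires one short-lived Lambda, so concurrency follows the number of files (the delivery function is a placeholder).

```python
def deliver_file(bucket, key):
    ...  # placeholder: fetch the file and upload it to the ad platform

def handler(event, context):
    # Standard S3 "ObjectCreated" event shape: one record per new object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        deliver_file(bucket, key)
```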
@gipbwok2008 (2 years ago)
Duh, that was above my head 😊
@sumitkhandelwal8155 (4 months ago)
I'm surprised you don't have any DLQ set up in the system. If there is an error, how does the system know about it during that time, and if we want to reprocess the failed events, how can we do that?
@awssupport (4 months ago)
Hi, our re:Post community alongside our devs may be able to provide some alternate solutions. Don't hesitate to post any questions you may have here: go.aws/4d7YsNE. ^CM
@dannyruchman (4 years ago)
Respect Opher :-)
@opherdubrovsky4175 (4 years ago)
You started it, man... I just pushed it along a few more hills along the road :)
@dannyruchman (4 years ago)
@@opherdubrovsky4175 Amazing to see this in action at such scale.
@daniel.lafraia (4 years ago)
I wouldn't say $1k a day is a low-cost approach. I believe that if they used spot instances, this would make a great impact on cost. Spot instances with ECS (or EKS) with autoscaling, running Docker containers with their code, would be a good, cost-effective solution.
@LudgerPeters (4 years ago)
Yeah, I think using ECS tasks with auto-scaling would bring down costs, especially if you are running faster languages. I personally run Java on ECS Fargate instances and saw significant savings over using Lambdas. If your Lambdas are running 24 hours a day, you might as well run an instance; it will work out cheaper.
@itaiyaffe (4 years ago)
Trust me, $1k/day is an extremely low-cost approach for this kind of system (even after exploring the non-managed FaaS option).
@braybilly (4 years ago)
Curious what language they chose for their lambda code
@agesnipes (4 years ago)
Billy B, me too!
@opherdubrovsky4175 (4 years ago)
Scala
@COMPTROL (4 years ago)
@@opherdubrovsky4175 Is there any specific reason for this choice? Immutable data structures should be available in other languages as well. Or was it a specific framework like Spark driving your decision? Thanks in advance.
@piedroconte (4 years ago)
I did not see a real-time processing workload; I think these workloads are batch and ETL-based on AWS.
@opherdubrovsky4175 (4 years ago)
Correct. These workloads are batch. They come in through files that "drip" in and get processed. That said, the system is constantly processing data, so as data comes in, it gets picked up and processed.
@jayw8348 (4 years ago)
Very informative video, and great comments on the details, Opher! I checked out "THE RISE AND FALL OF SERVERLESS COSTS" or so! Q: Do you have to consider locking when doing the RDS query? You mention RDS as an intelligent queue. Q: Are your Spark pods fine-tuned for different accounts / different file sizes to achieve higher utilization? Would love to learn more from this piece, as we have a use case to read S3 from either Glue or Spark on EMR/EKS.
@opherdubrovsky4175 (4 years ago)
As the scale increased over time, we got to a point where we had some DB locking issues and also ran into too many connections to the DB (our limit was 5000). Our solution back then was to do the file status update via a queue and a Lambda that reads messages from it. That reduces the connections to the DB and the danger of locks. Lately AWS came out with RDS Proxy. We tried it out and deployed it last week. Our DB connections dropped from ~5000 to ~1400. So next we will see if we can work with a smaller DB instance and save on the DB costs. We're thinking of going to 1/4 the size.
@g.egziabher1522 (2 years ago)
Does anyone know where to get these magnetic stickers?
@shivamnarware4325 (2 years ago)
How are you managing the Lambda execution time limit at this scale?
@idcarlos (2 years ago)
They have a timeout of 5 minutes; each file is processed in parallel.
@tobennanwokike (4 years ago)
Wow!
@themoah1 (4 years ago)
There is no way it costs $1000 per day. I don't know if they are getting special pricing, but storing 30 TB of data daily on S3 would cost about $600, and that amount of egress traffic (e.g. the mentioned 250 Mbps) also costs a lot. Let's assume that all the Lambdas and EMR don't even produce logs and metrics (CloudWatch can also cost a lot).
@opherdubrovsky4175 (4 years ago)
We are not storing 30 TB a day in that system. This system only prepares data and uploads it to ad platforms. The overall DMP system stores about 5 petabytes of data in total, but the Dataout system in the video is just one subsystem of it and does not store a lot of data except the data in process. The DB just holds the metadata about the tasks to process. The data itself is in files on S3.
@dellendinho (4 years ago)
Would love to see their RDS bill.
@opherdubrovsky4175 (4 years ago)
The current RDS bill for our biggest region (US) is $1500/month. We just moved to using RDS Proxy, which should allow us to move to a smaller, 1/4-size RDS instance due to the lower number of connections. I expect that will help drop the cost to about $375/month.
@dellendinho (4 years ago)
@@opherdubrovsky4175 Thanks. How many IOPS, and how did you determine the appropriate number?
@abhinee (4 years ago)
great stuff
@frenkelalexandre (4 years ago)
Nice!
@upendrareddy6304 (4 years ago)
How can your system have 3000 Lambdas concurrently running in a minute? The max parallel invocations are only 1000, right?
@abhinee (4 years ago)
AWS will increase the limit if required.
@kirin9991 (4 years ago)
Tell us that sweet spot
@deepakkwatra (4 years ago)
Do you somehow try to manage the order of events before sending them to different networks?
@opherdubrovsky4175 (4 years ago)
In our case the order is not important, so we don't worry about it. However, events do roughly go out in FIFO (first in, first out) order, except when there is reprocessing or a delay for some reason.
@ans0600 (4 years ago)
How about the RDS storage cost, which increases by a few TB per day?
@itaiyaffe (4 years ago)
RDS only stores the metadata in this case (AFAIK), so it's a lot less than a few TB/day.
@opherdubrovsky4175 (4 years ago)
We also delete the metadata from RDS after a few months. We keep it there that long just for statistics.
@ppgab (4 years ago)
Couldn't he use a queue instead of RDS to keep track of the tasks?
@opherdubrovsky4175 (4 years ago)
The DB is the queue!! It's an intelligent queue. We could have used a regular queue (like SQS or any other), but we would have lost the ability to intelligently control the flows. For example - if you wanted to do rate limiting just for a few accounts, you would basically need a separate queue for each account. We have a few hundred accounts, so we would have needed hundreds of queues. Also - when you need to reprocess data for various reasons, all we have to do is write a query and set the relevant completed files back to the "pending" state, and they automatically get picked up again. With a queue this becomes very hard. We had a queue in another part of the system before, and we ended up getting rid of it and moving that to the DB. We still have queues that are used as buffers in the system, but not for smart control of the flows.
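A sketch of what such an intelligent queue can look like on Postgres, with assumed table and column names: workers claim a slice of pending tasks without colliding (SKIP LOCKED), and reprocessing is a single UPDATE instead of re-queueing thousands of messages.

```python
# Claim up to 100 pending tasks atomically; concurrent claimers skip rows
# that another transaction has already locked.
CLAIM_TASKS = """
    UPDATE tasks
       SET status = 'processing', started_at = now()
     WHERE id IN (
         SELECT id FROM tasks
          WHERE status = 'pending'
          ORDER BY created_at
          LIMIT 100
          FOR UPDATE SKIP LOCKED
     )
    RETURNING id, s3_key, account_id
"""

# Reprocessing everything for one account is a single query.
REPROCESS_ACCOUNT = """
    UPDATE tasks SET status = 'pending'
     WHERE account_id = %s AND status = 'done'
"""
```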
@jrapp654 (4 years ago)
@@opherdubrovsky4175 Really interesting way to handle reprocessing failed data. I've always thought to rely on a DLQ, but I like the control that RDS gives you.
@abhinee (4 years ago)
@@opherdubrovsky4175 Hi Opher, how are you ingesting data? Can you please elaborate - is it SQS?
@papieskimegatron (4 years ago)
How long did it take to implement the system? How many people worked (or are still working) on this?
@opherdubrovsky4175 (4 years ago)
We started out about 2 years ago and had between 2-3 people working on it constantly. That said, a lot of the work is not related to the scaling but rather to ad-platform-specific support we gradually added + peripheral tools we built to provide our solutions team with ways to upload various data types. Also - it took us quite some time to learn how to build it and make it robust. When we started, a lot of things that in hindsight look obvious were not obvious. Unfortunately, we did not have hindsight back then!! 😁 I imagine if I had to build it today, now that I know how to do it, and focusing just on the core system, I could build it with 2 developers within a few months.
@papieskimegatron (4 years ago)
@@opherdubrovsky4175 Thank you very much for your answer. The talk also skipped the fact that EMR is not serverless and isn't really pay-per-use like the rest of the system (well, neither is RDS, but let's skip that part for a minute :P). How good is EMR autoscaling when working with billions of events daily? Did you have to create custom "smart" autoscaling rules to spin up and tear down the clusters? Or is just the min/max instances sufficient, with YARN doing its job well? I imagine you have thousands of Spark jobs running every hour. Great talk! Looking forward to hearing more.
@opherdubrovsky4175 (4 years ago)
@@papieskimegatron EMR was decently OK, but the problem was that due to data skew between the different ad accounts we upload to (some small and some large), the cluster did not scale well and you got diminishing returns as you added more instances. To fix this, we ended up moving from Spark over EMR to a serverless-like Spark where we run lots of small independent Spark instances and scale up/down by adding/removing instances. This works beautifully and scales really well. And, due to the improved scaling, the cost is about 50% of what it was on EMR - which is great 😊 I am giving a talk about it with one of my developers at the Data+AI Summit in November (online and free to register this year). You can check it out and register here - databricks.com/session_eu20/scale-out-using-spark-in-serverless-herd-mode
@papieskimegatron (4 years ago)
@@opherdubrovsky4175 Does that mean the EMR in the talk is not really EMR but your own Kubernetes + Spark (in "herd" mode)? :P I will certainly join, as it sounds great. Thank you!
@opherdubrovsky4175 (4 years ago)
@@papieskimegatron When we recorded it a few months ago, it was plain vanilla Spark over EMR. However, the video took a while to be published due to Corona delays in the video post-production. By that time, we had already moved to the next system improvement I mentioned above. The next time we talk about it, we will already be talking about the next iteration of the architecture.
@bharath700i (4 years ago)
How much data is that RDS Postgres holding? Is there one record for each event?
@opherdubrovsky4175 (4 years ago)
RDS just stores the metadata about each task (file), so it has a small amount of data in total - a few million rows each day.
@juanvassallo2993 (4 years ago)
@@opherdubrovsky4175 Does the $1000/day cost include RDS and all the other services, or is it just for Lambda?
@jett_royce (4 years ago)
Using a bunch of lambdas for this sounds like it's way more expensive (cost-wise) than it should be. Even after accounting for the convenience.
@kirin9991 (4 years ago)
Have you ever seen server maintenance costs?
@chris-ew9wl (4 years ago)
It looks simple, but with that large an amount of processing, you are better off using Kubernetes. The Lambda cost would be at least $100,000+ a day, and that's just one of the four Lambda services in his architecture diagram.
@itaiyaffe (4 years ago)
That's actually far from the truth. As Opher describes in the video at 6:40, the cost of the system (not just one Lambda) is only ~$1000 per day (as of the making of this video). As Opher says, the team was able to reduce the costs from $7.7 per billion events to $4.25 per billion events. At about 250 billion events per day, that gets you to about $1000 per day. Note that by "events" we don't mean Lambda invocations, but rather the raw events coming into the system in those files shown on the left-hand side of the blackboard.
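For reference, the arithmetic behind those two numbers:

```python
events_per_day = 250e9      # ~250 billion events/day
cost_per_billion = 4.25     # USD after optimization (down from 7.7)
print(events_per_day / 1e9 * cost_per_billion)  # ~1062 USD/day
```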
@chris-ew9wl (4 years ago)
@@itaiyaffe Hey, thanks for correcting me. I missed that part at 6:40 where they discussed the cost of the system. When I saw the SQS and 250 billion events, I immediately thought Lambda was processing each of those 250B events individually. Hence my calculation. Was there a comparison with a similar architecture on a Kubernetes cluster? From what I gather, processing events of this magnitude... But if peak traffic is just 30 million invocations, then that's just $5 (on Lambda compute alone).
@itaiyaffe (4 years ago)
@@chris-ew9wl So Lambda cost, as far as I recall, is based on the memory cap and execution time. Not sure it's $5, but I think we both agree on the magnitude of the costs 🙂 As for the K8s costs - I don't think the numbers were published, but they were in fact greater, for sure. Opher is the cost-reduction master; you can trust that he checked it thoroughly.
@opherdubrovsky4175 (4 years ago)
In parallel to the Lambda processing, we've built another way to dispatch the data using an OpenFaaS cluster. We would then send a % of the traffic to that cluster. The goal was to reduce costs. However, we never managed to drive down the costs there, as it was really hard to keep the system constantly busy so there would be no wasted compute. That said, it's probably solvable if we invest enough time. But - it's complex!! I'd much rather get a better price plan for volume use on Lambdas and keep using them, since the architecture is amazing and helps you build very good yet simple solutions, with autoscaling built in. Hopefully AWS will come up with additional pricing plans like EC2 has, and then this whole discussion will become irrelevant.
@chris-ew9wl (4 years ago)
@@opherdubrovsky4175 Agreed, and I share your sentiment. Hey Opher, thanks for replying to this thread. I really appreciate it. I guess the whole point of serverless is to invest more time developing rather than managing. So I guess, even though there's money that can be saved via a Kubernetes cluster (I'm a fan of OpenFaaS too), we have to factor in engineering time to see if the money saved will offset this loss. I know this is AWS's This Is My Architecture, but did you also factor in vendor lock-in when you were designing the system?
@joshhardy5646 (2 years ago)
I’d imagine at that scale they were busting the ceilings out of the max concurrent Lambda instances.
@EndvrX (a year ago)
Would like to know if somebody has made a project using this architecture; I'd like to connect!
@manmitsingh4958 (4 years ago)
Waoooo🔥
@ssb26 (4 years ago)
300K opex per year for such a large operation is a no-brainer. If the same thing were done on-prem, it would have run into millions. Optimization is unnecessary.
@wtfzalgo (4 years ago)
Isn't Lambda concurrency capped at 1k?
@opherdubrovsky4175 (4 years ago)
There is a default limit, but you can ask AWS to increase it to a higher level.
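A sketch of requesting that increase programmatically through Service Quotas (a support case works too); the quota code below is written from memory and should be verified with list_service_quotas before relying on it.

```python
import boto3

sq = boto3.client("service-quotas")
sq.request_service_quota_increase(
    ServiceCode="lambda",
    QuotaCode="L-B99A9384",  # assumed code for Lambda "Concurrent executions"
    DesiredValue=10000.0,
)
```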
@wtfzalgo (4 years ago)
@@opherdubrovsky4175 oh that's right, forgot about that
@superdkls (4 years ago)
No use of Glue or Athena?
@opherdubrovsky4175 (4 years ago)
How would you propose using Glue or Athena in the setup described?
@1991anirudh (2 years ago)
It was a bit annoying that the interviewer kept on interrupting him
@henrywwilson5092 (4 years ago)
What's the maximum Lambda concurrency you reached? Did your Lambdas encounter any throttling on S3 API requests?
@opherdubrovsky4175 (4 years ago)
Currently we set our max concurrency for the Lambda uploading the files at 7,000, but in the past it was lower (3,000). We have run into the limit a few times when we had a large backlog, mostly due to a bug introduced in new code that required reprocessing of a lot of files. We did reach throttling at certain times, but our system is resilient to that. It's important to stress that you have to design your system to be tolerant of failures and be able to recover from them. In our case, the Lambda would rerun using the AWS dead-letter queue mechanism when the disruption was for a short amount of time. When it persists for a longer time, we have a retry mechanism in the system that releases the files back to pending status, and they get picked up again by the system later on for processing.
@henrywwilson5092 (4 years ago)
@@opherdubrovsky4175 Thank you very much. Brilliant!