Thanks a lot, everything is put in as simple a format as possible for us to understand.
2 years ago
Pleased to return again, this time to point out an additional limitation to take into account: the IPs available in the VPC. A Glue job occupies ENIs/IPs the way EC2 instances do, and if there are not enough free IPs, the job will crash. So it is important to verify the available IPs before parallelizing.
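For reference, this limit can be checked before fanning out. Below is a minimal boto3 sketch with a placeholder subnet ID; each worker of a VPC-connected Glue job consumes one ENI/IP from the connection's subnet, so the free-IP count should comfortably exceed workers-per-run times planned parallel runs:

```python
import boto3

ec2 = boto3.client("ec2")

def free_ips(subnet_ids):
    """Return the free-IP count per subnet used by the job's Glue connection."""
    resp = ec2.describe_subnets(SubnetIds=subnet_ids)
    return {s["SubnetId"]: s["AvailableIpAddressCount"] for s in resp["Subnets"]}

# subnet-0123456789abcdef0 is a placeholder; use the subnets of your Glue connection
print(free_ips(["subnet-0123456789abcdef0"]))
```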
@AWSTutorialsOnline 2 years ago
I agree. Glue will occupy IPs only if you are working with VPC-based resources.
@kameshnatarajan7735 1 month ago
Thank you so much! If you don't mind, could you help me understand how to orchestrate multiple Glue jobs in parallel?
@gatsbylee2773 3 years ago
I now have some idea of what max concurrency = 4 is for. Based on your example, you still need to create multiple run instances (more precisely, 200 "runs" of one job), since you set "Source Table Name" and "Target Table Name" as parameters of the same Glue job. Basically, you can group runs within a job by increasing max concurrency, but you still need to start 200 runs of the job, while sharing one code base across all 200 runs. I really appreciate your video; it helped me get an idea of what the parameter is for. Thank you.
@AWSTutorialsOnline 3 years ago
Yeah. It is one code base and configuration for the job, but you are running multiple instances of it with different parameters.
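To make that concrete, here is a minimal sketch of one job definition started as several concurrent runs with different parameters; the job name, argument names, and table pairs are all illustrative. The job's max concurrency must be at least the number of simultaneous runs, or `start_job_run` raises `ConcurrentRunsExceededException`:

```python
import boto3

glue = boto3.client("glue")

table_pairs = [("raw_orders", "clean_orders"), ("raw_users", "clean_users")]

for src, dst in table_pairs:
    run = glue.start_job_run(
        JobName="copy-table-job",      # assumed job name
        Arguments={
            "--source_table": src,     # read inside the script via getResolvedOptions
            "--target_table": dst,
        },
    )
    print(f"{src} -> {dst}: run {run['JobRunId']}")
```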
@sheikirfan2652 1 year ago
Nice tutorial. One question here: how do I configure the Glue job to run multiple SQL queries in parallel instead of reading from multiple tables?
@AWSTutorialsOnline 1 year ago
I think you are looking for this one - kzbin.info/www/bejne/h3mUe5ZvjNejb7s
@sheikirfan2652 1 year ago
@@AWSTutorialsOnline Thanks, brother. I will check and let you know.
@sheikirfan2652 1 year ago
@@AWSTutorialsOnline Thanks. I looked into it, and it seems that video explains how to have parallel runs corresponding to one column. But my requirement is to pass a SQL query as a job parameter, and using that job parameter I should pass more than one SQL query, either through the CLI or a Step Function. For example, my job concurrency is 2, so the job should run in parallel with queries like "select * from emp inner join students where std_id = 5" and "select * from emp inner join class where class_id = 10" and fetch the results into their respective S3 locations.
@sheikirfan2652 1 year ago
Also, I have a solution where I can run more than one SQL query in my Glue job, but that approach works sequentially, not in parallel.
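A sketch under this commenter's own assumptions: pass one SQL query per run and let max concurrency provide the parallelism. The job name, argument names, bucket, and queries are illustrative:

```python
import boto3

glue = boto3.client("glue")

work = [
    ("select * from emp inner join students where std_id = 5",
     "s3://my-bucket/out/students/"),
    ("select * from emp inner join class where class_id = 10",
     "s3://my-bucket/out/class/"),
]

for query, out_path in work:                   # needs max concurrency >= 2
    glue.start_job_run(
        JobName="sql-runner-job",              # assumed job name
        Arguments={"--query": query, "--output_path": out_path},
    )

# Inside the Glue script the parameters come back like this:
#   from awsglue.utils import getResolvedOptions
#   args = getResolvedOptions(sys.argv, ["query", "output_path"])
#   spark.sql(args["query"]).write.parquet(args["output_path"])
```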
@zubinbal1880 8 months ago
Hi Sir, is it possible to enable job bookmarks for concurrent runs of a single script orchestrated with Step Functions?
@ashishvishwakarma8790 2 years ago
Excellent explanation. I'm working on a similar use case; however, I need to run the same job multiple times for the same table (writing to different partitions). The problem I'm facing is that the moment one of the many parallel executions finishes, it wipes the temporary directory (created by Spark) in the table directory, deleting the temp data of the other executions writing to the same table. This results in data loss, because while the other parallel executions were still in progress, the first job to complete deleted their temp data. Do you have a solution to that problem?
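One common mitigation, sketched below rather than a definitive fix: give every parallel run its own scratch prefix via the `--TempDir` special parameter, so no run can wipe another's temporary files. The job name, partition values, and bucket paths are illustrative:

```python
import uuid
import boto3

glue = boto3.client("glue")

for partition in ["2023-01-01", "2023-01-02", "2023-01-03"]:
    glue.start_job_run(
        JobName="partition-writer-job",    # assumed job name
        Arguments={
            "--partition": partition,
            # isolate Spark scratch space per run
            "--TempDir": f"s3://my-bucket/glue-tmp/{uuid.uuid4().hex}/",
        },
    )
```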
@victorgueorguiev6500 2 years ago
It turns out you can't really use Glue Workflows for running them in parallel. When trying to add a job multiple times in different nodes of the workflow, it throws an error that the "action contains duplicate job name", which prevents one from adding the same job more than once, whether in sequence or in parallel. Really silly, since Glue inherently lets you have concurrent runs. Luckily Step Functions works fine, but it's really disappointing that Glue doesn't natively support this in Workflows. Maybe I'm doing something wrong?
@victorgueorguiev6500 2 years ago
Thank you for the video, by the way! It was really informative.
@AWSTutorialsOnline 2 years ago
You are right. Unfortunately, Workflows do not allow running the same job in parallel.
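Because Glue itself allows concurrent runs of one job, the fan-out can come from a Step Functions Map state or, as in this minimal sketch, straight from the SDK (the job name and table list are illustrative):

```python
import boto3

glue = boto3.client("glue")

# one job definition, three concurrent runs (max concurrency must be >= 3)
for table in ["orders", "users", "events"]:
    glue.start_job_run(JobName="etl-job", Arguments={"--table": table})
```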
@rtzkdt 7 months ago
Nice tutorial, thanks. Can it run in sequence? I want to run the job with different parameters, but I want the second run to start only after the first one has finished, like a queue. Or must we set max concurrency to 1 and handle the retry ourselves if a max-concurrency error occurs?
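A sketch of the queue behaviour asked about here, assuming you drive the runs from your own code: start a run, poll `get_job_run` until it reaches a terminal state, then start the next. The job name and parameters are illustrative:

```python
import time
import boto3

glue = boto3.client("glue")
TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def run_and_wait(job_name, arguments):
    """Start one run and block until it finishes, returning its final state."""
    run_id = glue.start_job_run(JobName=job_name, Arguments=arguments)["JobRunId"]
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in TERMINAL:
            return state
        time.sleep(30)

for args in [{"--table": "orders"}, {"--table": "users"}]:
    print(run_and_wait("etl-job", args))
```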
3 years ago
Nice tutorial; I just made 5 jobs, but I will try the third approach. My doubt is what happens when the size of the table is variable... can the number of workers change?
@AWSTutorialsOnline 3 years ago
I don't think you can change job capacity at run time when calling the job from a Glue Workflow or Step Function. However, if you are starting the job using the CLI or code, you do have the opportunity to change the allocated capacity, max capacity, and worker type.
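For example, a sketch of a per-run capacity override from the SDK; the job name, worker type, and count are illustrative, and equivalent `--worker-type` / `--number-of-workers` options exist on `aws glue start-job-run`:

```python
import boto3

glue = boto3.client("glue")

glue.start_job_run(
    JobName="etl-job",                  # assumed job name
    Arguments={"--table": "big_table"},
    WorkerType="G.2X",                  # larger workers for the bigger table
    NumberOfWorkers=10,                 # overrides the count configured on the job
)
```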
@MahimDashoraHackR 1 year ago
What happens if the Python script itself uses multiprocessing to achieve concurrency?
@veerachegu 2 years ago
Can you please clarify: I have 15 datasets in one of the sources. How do I run concurrent runs from the raw layer to the cleansed layer? The script may be different for each, based on DQ. In this scenario, how do I run concurrent jobs?
@AWSTutorialsOnline 2 years ago
Is each job doing the same thing between the raw and cleansed layers?
@veerachegu 2 years ago
@@AWSTutorialsOnline Yes
@veerachegu 2 years ago
You are implementing this through Step Functions; can you please suggest how to do concurrent runs in a Glue Workflow?
@pulkitdikshit9474 2 years ago
Hi, I have a Lambda function where I pass a list of tables, and the Lambda triggers a Glue job. The Glue job is configured with 2 workers and max concurrency = 1. Later I saw that only one element (one table from the list passed by the Lambda) gets executed. What is the reason for that? Will it cost more if I increase concurrency? In this case, is it important to keep max concurrency equal to the length of the list (the number of elements in the Python list)? If not, what is the best approach so that the Glue job processes all the table elements in the list passed by the Lambda? FYI, I'm storing the results in an S3 bucket. Please do reply. Thanks in advance :)
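One way to cover every element, sketched with an assumed event shape and job name: have the Lambda start one run per table, and raise the job's max concurrency to at least the list length so the extra runs are not rejected:

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    """Expects an event like {"tables": ["t1", "t2", ...]} (assumed shape)."""
    run_ids = []
    for table in event["tables"]:
        run = glue.start_job_run(
            JobName="table-loader-job",        # assumed job name
            Arguments={"--table_name": table},
        )
        run_ids.append(run["JobRunId"])
    return {"started": run_ids}
```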
@hsz7338 3 years ago
Hello, thank you for the tutorial. It is fantastic as always. Regarding where the concurrent (parallel) runs actually execute: do those runs share one serverless Glue compute cluster, or do they each get their own? If it is the former, it is concurrency but not pure parallelization. If it is the latter, then the Glue job we create acts as a job definition, and that definition can be deployed across multiple serverless compute clusters in parallel (within the max concurrency)?
@AWSTutorialsOnline 3 years ago
It is one job definition which can run as more than one instance at the same time, no matter how you start the job.
@anti2117 2 years ago
Thank you for this video, very insightful. How does this work with job bookmarks (transformation_ctx)?
@siddharthsatapathy1366 2 years ago
Hello Sir, in the case of concurrent runs, how are resources shared between the different runs?
@AWSTutorialsOnline 2 years ago
Each run is allocated the same capacity as configured in the job.
@IranianButterfly 2 years ago
But there is a drawback here in terms of pricing. Let's say you have 20 tables and you run them with concurrency, and each run finishes in 1 minute. G.1X bills for a minimum of 10 minutes, which means you will pay for 20 × 10 minutes instead of 20 × 1 minute.
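The arithmetic, written out under the commenter's assumptions (one G.1X worker = 1 DPU per run, a 10-minute billing minimum, and the $0.44/DPU-hour list price; all three are assumptions, and newer Glue versions bill with a 1-minute minimum, so check current pricing):

```python
runs = 20
actual_minutes = 1
billed_minutes = max(actual_minutes, 10)    # assumed 10-minute minimum per run
dpu_hours = runs * billed_minutes / 60      # 20 * 10 / 60 ~= 3.33 DPU-hours
print(f"billed {runs * billed_minutes} min vs {runs * actual_minutes} min used, "
      f"~${dpu_hours * 0.44:.2f} at $0.44/DPU-hour")
```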