Great. I learned the most important part of transformation. Thanks
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@compton8301 3 years ago
So glad I found this channel. Thank you! :)
@AWSTutorialsOnline 3 years ago
You are so welcome!
@shubham_sb 1 year ago
Great video. Thanks, I was able to flatten a JSON file with this.
@AWSTutorialsOnline 1 year ago
Glad it helped
@imtiyazali7003 1 year ago
Great video. I'm glad to have found this channel.
@maheralabboodi9667 2 years ago
Very helpful video. Thank you!
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@pallavianand564 2 years ago
Hi, the video is super helpful. But can we create a nested structure from flattened data using AWS Glue Studio? Please let me know.
@jocelynbaduria63 2 years ago
Great tutorial :). Thank you.
@kanchankumar7174 2 years ago
very good demo
@OPopoola 3 years ago
Thanks for this video. I had a massive tweet object that I could not deal with. Will try relationalize on it.
@AWSTutorialsOnline 3 years ago
Sure. Please share your experience
@vijaymani83 2 years ago
Many thanks for this video, and once again you chose to discuss a topic that isn't being addressed elsewhere. We had a similar requirement (to flatten a nested JSON with a dynamic schema and convert to Parquet), and I wish I had seen this before going with the Pandas approach (using the json_normalize and explode functions). I tried the relationalize function, but I end up with multiple dynamic frames (5 of them) in my collection. I do not seem to find the columns that should be used for the join/merge of these dynamic frames. Can you please help?
@AWSTutorialsOnline 2 years ago
I am not sure I get your question. Once you get the dynamic frame collection, you can use the keys() method to find the number of dynamic frames (5 in your case). Then use the keys to fetch the individual dynamic frames into separate dynamic frame variables. After that, you can do a row-level or column-level join using the Join method.
@sanjaybedwal2385 1 year ago
I think you are talking about the columns that should be used when joining the parent and child tables. It's too late to answer now, but as shown in this video too, the parent table will contain an id column for the child table (ideally, that column in the parent will be named after the child table). The child table will have a corresponding column simply named "id". The index column in the child table differentiates the multiple child records belonging to a single parent record; it denotes the 1:many relationship between the parent and child tables. :)
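The parent/child shape described above can be sketched in plain Python. The table and column names (`orders`, `orders.val`, `customer`) are made-up examples, not taken from the video — only the id/index pattern mirrors what relationalize produces:

```python
# Parent table: the nested column ("orders") is replaced by an
# integer foreign key pointing into the child table.
parent = [
    {"customer": "alice", "orders": 1},
    {"customer": "bob", "orders": 2},
]

# Child table: "id" matches the parent's "orders" key, and "index"
# numbers the elements of the original list (the 1:many relationship).
child = [
    {"id": 1, "index": 0, "orders.val": "book"},
    {"id": 1, "index": 1, "orders.val": "pen"},
    {"id": 2, "index": 0, "orders.val": "lamp"},
]

# Re-joining parent and child on parent["orders"] == child["id"]
# recovers one row per original list element.
joined = [
    {**p, "item": c["orders.val"]}
    for p in parent
    for c in child
    if p["orders"] == c["id"]
]
```

Here `joined` has three rows: two for alice (book, pen) and one for bob (lamp), matching the 1:many shape.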
@darrensmith4118 2 years ago
Great video. My question is: now that I have the data frame flattened, how do you write the result so that it can then be used as a table in Athena?
@AWSTutorialsOnline 2 years ago
Use PySpark to write the data frame to S3, catalog the S3 data sets using a crawler, and then query them using Athena. Check the PySpark code here - docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
@stjepan_8902 2 years ago
Thanks. You didn't quite explain how you cataloged your 3 JSON files, though. You mentioned the disadvantages of crawling nested data earlier in the video, but then in the example you used cataloged data to create a Spark dataframe. Doesn't really make sense.
@vivekjacobalex 3 years ago
Very useful video. We can use pandas json_normalize for the same process, correct?
@AWSTutorialsOnline 3 years ago
Yes, definitely
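For anyone curious, a minimal pandas sketch of what json_normalize does with nested dicts (the record here is a made-up example). One caveat worth knowing: unlike relationalize, json_normalize does not expand list-valued fields by default — that needs record_path, or a separate explode step as mentioned in another comment:

```python
import pandas as pd

# A nested record similar to what Glue's relationalize flattens.
record = {"name": "anil", "address": {"city": "pune", "zip": "411001"}}

# json_normalize flattens nested dicts into dotted column names,
# much like the struct-flattening part of relationalize.
df = pd.json_normalize(record)
print(sorted(df.columns))  # ['address.city', 'address.zip', 'name']
```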
@diyalizavarghese9968 2 years ago
Hi, I just had a doubt. What if my XML file contains a few fields that might have an int type, but I want the schema to use string type only? How do I make the crawler produce a custom schema, apart from editing it manually in the classifiers? withSchema is something I heard about, but it doesn't seem to be working.
@AWSTutorialsOnline 2 years ago
You cannot control how the crawler will interpret data types. Either you update the schema (manually or using the API) after the crawler generates it, or you use the console or API to create the catalog and schema yourself.
@mind-conscious 2 years ago
I have data crawled from DynamoDB. I applied relationalize. I needed to join the root with the other keys. I want to pivot some columns that are similar, varying only by suffix: a.0, a.1, a.2. The next day the data might have a.10, a.15. How do I pivot these similar columns in a generic way to form values under a new column? I will appreciate any ideas on how to solve this using Python/Spark.
@AWSTutorialsOnline 2 years ago
Sorry, I am not able to understand the question. Can you please provide more details?
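One possible reading of the question above, sketched in plain Python: rows arrive with a varying set of `a.N` columns that should be melted into index/value pairs under a single new column. The row shape and column names here are assumptions about what the commenter meant:

```python
# A flattened row whose "a.N" suffixes vary from day to day.
row = {"pk": "item1", "a.0": 10, "a.2": 30, "a.15": 70}

# Melt every "a.N" column into (index, value) records, so the
# logic no longer depends on which suffixes happen to exist.
melted = [
    {"pk": row["pk"], "a_index": int(k.split(".")[1]), "a_value": v}
    for k, v in row.items()
    if k.startswith("a.")
]
melted.sort(key=lambda r: r["a_index"])
```

In Spark the same idea is usually expressed with a stack/melt over the dynamically discovered column list.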
@sriadityab4794 2 years ago
Thank you. Sometimes Glue crawlers are not able to identify the schema of the data; it shows col0, col1… How do we handle this case? Can you guide me through this?
@AWSTutorialsOnline 2 years ago
I think it happens when you have a CSV file with no header row. The simplest solution is to add a header. Otherwise you can always edit the schema after it is generated, but that is painful :)
@sriadityab4794 2 years ago
@@AWSTutorialsOnline thanks for your response. Yes, I have CSV files with headers. I can see them when I query the data in S3.
@AWSTutorialsOnline 2 years ago
@@sriadityab4794 This is odd. Try creating a custom classifier where you can specify that your first row is the header, and then use this classifier in your crawler. Let me know how it goes.
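The custom classifier suggested above can also be created programmatically via the Glue CreateClassifier API. Below is a sketch of the request shape as I understand it — the classifier name is a made-up example, so double-check the field names against the Glue docs before relying on this:

```python
# Request payload for glue.create_classifier: a CSV classifier that
# tells the crawler the first row is a header ("ContainsHeader").
csv_classifier_request = {
    "CsvClassifier": {
        "Name": "csv-with-header",    # hypothetical classifier name
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",  # treat row 1 as column names
    }
}

# In a real script this would be passed to the Glue client, e.g.:
#   boto3.client("glue").create_classifier(**csv_classifier_request)
# and the classifier then referenced in the crawler's configuration.
```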
@volakola9495 2 years ago
For an AWS data ETL engineer, is it necessary to learn EMR, or is Glue enough?
@AWSTutorialsOnline 2 years ago
I will say yes. Glue uses Apache Spark and the Hive Metastore, but EMR supports a larger number of big data frameworks and apps. For completeness, you should know both.
@volakola9495 2 years ago
@@AWSTutorialsOnline Are jobs specific to Glue or EMR, or do they expect knowledge of both?
@AWSTutorialsOnline 2 years ago
@@volakola9495 It depends on the hiring needs of the company. Glue is in high demand nowadays. You might start with it.
@aprajitamishra8148 2 years ago
If I am processing a JSON file daily, how do I append the flattened data to the existing data, given that the id column will always start from 1 for the child/auxiliary table? Can you please help with this?
@AWSTutorialsOnline 2 years ago
Good one. You can always transform the data in the data frame before you append it. If you don't need the id column, you can drop it, or you can generate your own custom id. Hope this helps.
@sanjaybedwal2385 1 year ago
It's too late to answer, but Aprajita, you will have to use a Glue job bookmark for this.
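A minimal sketch of the "generate your own custom id" idea suggested above: since the child-table ids restart at 1 on every run, prefixing each id with the run date keeps daily loads from colliding. This is an illustrative workaround, not something shown in the video (job bookmarks solve the separate problem of not re-reading already-processed files):

```python
from datetime import date

def make_run_id(run_date: date, local_id: int) -> str:
    """Combine the run date with the per-run id to get a globally
    unique key across daily appends."""
    return f"{run_date.isoformat()}-{local_id}"

# Two runs on consecutive days: the local ids (1, 2) repeat, but the
# prefixed ids never collide.
day1 = [make_run_id(date(2022, 5, 1), i) for i in (1, 2)]
day2 = [make_run_id(date(2022, 5, 2), i) for i in (1, 2)]
```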
@adarshverma5429 2 years ago
It's working fine, but after applying the relationalize function, some of the fields are missing from the resultant dataframes. How can I handle this?
@AWSTutorialsOnline 2 years ago
Can you please share a sample? I can then investigate.
@adarshverma5429 2 years ago
Where can I share the sample?
@AWSTutorialsOnline 2 years ago
@@adarshverma5429 brajends@aws-dojo.com
@hsz7338 3 years ago
Thank you, AWS Tutorials. This is extremely useful. I have two questions: 1. Is 'create_dynamic_frame.from_catalog' a Glue Spark feature (not yet available in EMR)? 2. What's the main difference between 'relationalize' and 'Unbox' & 'UnnestFrame'? It seems that 'relationalize' covers a much larger range of problems and data classification types.
@AWSTutorialsOnline 3 years ago
With EMR, you use PySpark SQL, but that does not matter, because all you need is data in a dataframe, and SQL also creates dataframes. Now: unbox converts a JSON string into a struct; unnest converts a struct into a flat frame with native data types; relationalize covers both. It can break large complex data into multiple frames, establish relations/links between the frames, and also flatten the frames with native data types. relationalize is more handy.
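A plain-Python analogue of the two steps described above may help: "unbox" is essentially parsing a JSON string into a struct, and "unnest" flattens that struct into dotted top-level fields. The real Glue methods operate on DynamicFrames; this only illustrates the data shapes, and the record is a made-up example:

```python
import json

raw = '{"user": {"name": "meera", "age": 30}}'

# "unbox": a JSON *string* becomes a struct (dict).
struct = json.loads(raw)

# "unnest": the struct is flattened into dotted top-level fields.
def unnest(d, prefix=""):
    flat = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(unnest(v, key + "."))
        else:
            flat[key] = v
    return flat

flat = unnest(struct)  # {"user.name": "meera", "user.age": 30}
```

relationalize additionally splits list-valued fields into separate child frames, which neither step above does on its own.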
@hsz7338 3 years ago
@@AWSTutorialsOnline Thank you for the reply. To confirm my understanding of the EMR question: in EMR, Spark SQL can be used to parse the data, but that parsing step must be custom developed, which is different from relationalize in Glue Spark, where the complexity has been taken away (it is supported and managed) by AWS Glue, and is therefore less troublesome. Is my understanding correct?
@shivangsingh2364 3 years ago
Hello, when I try to use unbox in Glue, it gives me an error like "DataFrame object has no attribute unbox". Can you please tell me how to resolve this? @hao sky zhou and @AWS Tutorials. Thanks
@AWSTutorialsOnline 3 years ago
@@shivangsingh2364 Hi, unbox is a method of the Glue DynamicFrame, not DataFrame. Please check which object type you are calling the unbox method on.
@shivangsingh2364 3 years ago
@@AWSTutorialsOnline Thank you, I was able to use unbox on a dynamic frame. One thing I wanted to ask: my JSON is now flattened and I write the dataframe in Parquet format to S3, but when I crawl the S3 data with a Glue crawler, the table contents (records) that I see in Athena are still in JSON format. Is there any way I can see the records in table format? Please suggest!! Thanks
@chatchaikomrangded960 3 years ago
Nice.
@AWSTutorialsOnline 3 years ago
Thanks!
@noraznizam6123 2 years ago
Good tutorial... Do you have a tutorial on managing incremental data loads? I'd appreciate it if you could make one. :)
@AWSTutorialsOnline 2 years ago
Not yet, let me plan for it.
@vishalsingh-ku4uk 2 years ago
@@AWSTutorialsOnline Any update on managing incremental data loads?
@chris0628 2 years ago
Useful bits start at 8:00
@sardesaisantosh 1 year ago
Video link for the custom classifier?
@biplobshrivastava1100 2 years ago
How do I combine these output data frames into a single output file?
@AWSTutorialsOnline 2 years ago
You can join the data frames column-wise using the key.
@amn5341 2 years ago
The author has pretty much confused DataFrame vs DynamicFrame; he refers to a DynamicFrame as a Data Frame. So be aware of that.
@adarshverma5429 1 year ago
Having one issue here: the relationalize function is making the job take a long time to execute.
@AWSTutorialsOnline 1 year ago
Hard to tell why. Can you please share your file structure?
@adarshverma5429 1 year ago
@@AWSTutorialsOnline Yeah, sure!! Where should I share that?