Great. I learned the most important part of transformation. Thanks
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@compton8301 3 years ago
So glad I found this channel. Thank you! :)
@AWSTutorialsOnline 3 years ago
You are so welcome!
@shubham_sb 1 year ago
Great video. Thanks, I was able to flatten a JSON file with this.
@AWSTutorialsOnline 1 year ago
Glad it helped
@imtiyazali7003 1 year ago
Great video. I'm glad to have found this channel.
@maheralabboodi9667 2 years ago
Very helpful video. Thank you!
@AWSTutorialsOnline 2 years ago
Glad it was helpful!
@pallavianand564 2 years ago
Hi, the video is super helpful. But can we create a nested structure from flattened data using AWS Glue Studio? Please let me know.
@jocelynbaduria63 2 years ago
Great tutorial :). Thank you.
@kanchankumar7174 2 years ago
very good demo
@OPopoola 3 years ago
Thanks for this video. I had a massive tweet object that I could not deal with. Will try relationalize on it.
@AWSTutorialsOnline 3 years ago
Sure. Please share your experience
@vijaymani83 2 years ago
Many thanks for this video, and once again you chose to discuss a topic that isn't being addressed elsewhere. We had a similar requirement (to flatten a nested JSON with a dynamic schema and convert to Parquet), and I wish I had seen this before going with the Pandas approach (using the json_normalize and explode functions). I tried the relationalize function, but I end up with multiple dynamic frames (5 of them) in my collection. I do not seem to find the columns that should be used for the join/merge of these dynamic frames. Can you please help?
@AWSTutorialsOnline 2 years ago
I am not sure I get your question. Once you get the dynamic frame collection, you can use the keys() method to find the number of dynamic frames (5 in your case). Then use the keys to fetch the individual dynamic frames into separate dynamic frame variables. After that, you can do a row-level or column-level join using the Join method.
@sanjaybedwal2385 1 year ago
I think you are talking about the columns that should be used when joining the parent and child tables. It's too late to answer now, but as shown in this video too, the parent table will contain an id column for the child table (ideally, that column in the parent will be named after the child table). The child table will have a corresponding column simply named "id". The index column in the child table differentiates the multiple child records belonging to a single parent record; it denotes the 1:many relationship between the parent and child tables. :)
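The parent/child shape described above can be sketched in plain Python. The table and column names (`orders`, `orders.val`, `customer`) are made-up examples, not taken from the video — only the id/index pattern mirrors what relationalize produces:

```python
# Parent table: the nested column ("orders") is replaced by an
# integer foreign key pointing into the child table.
parent = [
    {"customer": "alice", "orders": 1},
    {"customer": "bob", "orders": 2},
]

# Child table: "id" matches the parent's "orders" key, and "index"
# numbers the elements of the original list (the 1:many relationship).
child = [
    {"id": 1, "index": 0, "orders.val": "book"},
    {"id": 1, "index": 1, "orders.val": "pen"},
    {"id": 2, "index": 0, "orders.val": "lamp"},
]

# Re-joining parent and child on parent["orders"] == child["id"]
# recovers one row per original list element.
joined = [
    {**p, "item": c["orders.val"]}
    for p in parent
    for c in child
    if p["orders"] == c["id"]
]
```

Here `joined` has three rows: two for alice (book, pen) and one for bob (lamp), matching the 1:many shape.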
@darrensmith4118 2 years ago
Great video. My question is: now that I have the data frame flattened, how do you write the result so that it can then be used as a table in Athena?
@AWSTutorialsOnline 2 years ago
Use PySpark to write the data frame to S3, catalog the S3 data sets using a crawler, and then query them using Athena. Check the PySpark code here - docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html
@stjepan_8902 2 years ago
Thanks. You didn't quite explain how you cataloged your 3 JSON files, though. You mentioned the disadvantages of crawling nested data earlier in the video, but then in the example you used cataloged data to create a Spark dataframe. Doesn't really make sense.
@vivekjacobalex 3 years ago
Very useful video. We can use pandas json_normalize for the same process, correct?
@AWSTutorialsOnline 3 years ago
Yes, definitely
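For anyone curious, a minimal pandas sketch of what json_normalize does with nested dicts (the record here is a made-up example). One caveat worth knowing: unlike relationalize, json_normalize does not expand list-valued fields by default — that needs record_path, or a separate explode step as mentioned in another comment:

```python
import pandas as pd

# A nested record similar to what Glue's relationalize flattens.
record = {"name": "anil", "address": {"city": "pune", "zip": "411001"}}

# json_normalize flattens nested dicts into dotted column names,
# much like the struct-flattening part of relationalize.
df = pd.json_normalize(record)
print(sorted(df.columns))  # ['address.city', 'address.zip', 'name']
```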
@diyalizavarghese9968 2 years ago
Hi, I just had a doubt. What if my XML file contains a few fields that might have an int type, but I want the schema to use string type only? How do I make the crawler produce a custom schema, apart from editing it manually in the classifiers? withSchema is something I heard about, but it doesn't seem to be working.
@AWSTutorialsOnline 2 years ago
You cannot control how the crawler will interpret data types. Either you update the schema (manually or using the API) after the crawler generates it, or you use the console or API to create the catalog and schema yourself.
@mind-conscious 2 years ago
I have data crawled from DynamoDB. I applied relationalize. I needed to join the root with the other keys. I want to pivot some columns that are similar, varying only by suffix: a.0, a.1, a.2. The next day the data might have a.10, a.15. How do I pivot these similar columns in a generic way to form values under a new column? I will appreciate any ideas on how to solve this using Python/Spark.
@AWSTutorialsOnline 2 years ago
Sorry, I am not able to understand the question. Can you please provide more details?
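One possible reading of the question above, sketched in plain Python: rows arrive with a varying set of `a.N` columns that should be melted into index/value pairs under a single new column. The row shape and column names here are assumptions about what the commenter meant:

```python
# A flattened row whose "a.N" suffixes vary from day to day.
row = {"pk": "item1", "a.0": 10, "a.2": 30, "a.15": 70}

# Melt every "a.N" column into (index, value) records, so the
# logic no longer depends on which suffixes happen to exist.
melted = [
    {"pk": row["pk"], "a_index": int(k.split(".")[1]), "a_value": v}
    for k, v in row.items()
    if k.startswith("a.")
]
melted.sort(key=lambda r: r["a_index"])
```

In Spark the same idea is usually expressed with a stack/melt over the dynamically discovered column list.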
@sriadityab4794 2 years ago
Thank you. Sometimes Glue crawlers are not able to identify the schema of the data; it shows col0, col1… How do we handle this case? Can you guide me through this?
@AWSTutorialsOnline 2 years ago
I think it happens when you have a CSV file with no header row. The simplest solution is to add a header. Otherwise you can always edit the schema after it is generated, but that is painful :)
@sriadityab4794 2 years ago
@@AWSTutorialsOnline thanks for your response. Yes, I have CSV files with headers. I can see them when I query the data in S3.
@AWSTutorialsOnline 2 years ago
@@sriadityab4794 This is odd. Try creating a custom classifier where you can specify that your first row is the header, and then use this classifier in your crawler. Let me know how it goes.
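The custom classifier suggested above can also be created programmatically via the Glue CreateClassifier API. Below is a sketch of the request shape as I understand it — the classifier name is a made-up example, so double-check the field names against the Glue docs before relying on this:

```python
# Request payload for glue.create_classifier: a CSV classifier that
# tells the crawler the first row is a header ("ContainsHeader").
csv_classifier_request = {
    "CsvClassifier": {
        "Name": "csv-with-header",    # hypothetical classifier name
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",  # treat row 1 as column names
    }
}

# In a real script this would be passed to the Glue client, e.g.:
#   boto3.client("glue").create_classifier(**csv_classifier_request)
# and the classifier then referenced in the crawler's configuration.
```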
@volakola9495 2 years ago
For an AWS data ETL engineer, is it necessary to learn EMR, or is Glue enough?
@AWSTutorialsOnline 2 years ago
I will say yes. Glue uses Apache Spark and the Hive Metastore, but EMR supports a larger number of big data frameworks and apps. For completeness, you should know both.
@volakola9495 2 years ago
@@AWSTutorialsOnline Are jobs specific to Glue or EMR, or do they expect knowledge of both?
@AWSTutorialsOnline 2 years ago
@@volakola9495 It depends on the hiring needs of the company. Glue is in high demand nowadays. You might start with it.
@aprajitamishra8148 2 years ago
If I am processing a JSON file daily, how do I append the flattened data to the existing data, given that the id column will always start from 1 for the child/auxiliary table? Can you please help with this?
@AWSTutorialsOnline 2 years ago
Good one. You can always transform the data in the data frame before you append it. If you don't need the id column, you can drop it, or you can generate your own custom id. Hope this helps.
@sanjaybedwal2385 1 year ago
It's too late to answer, but Aprajita, you will have to use a Glue job bookmark for this.
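A minimal sketch of the "generate your own custom id" idea suggested above: since the child-table ids restart at 1 on every run, prefixing each id with the run date keeps daily loads from colliding. This is an illustrative workaround, not something shown in the video (job bookmarks solve the separate problem of not re-reading already-processed files):

```python
from datetime import date

def make_run_id(run_date: date, local_id: int) -> str:
    """Combine the run date with the per-run id to get a globally
    unique key across daily appends."""
    return f"{run_date.isoformat()}-{local_id}"

# Two runs on consecutive days: the local ids (1, 2) repeat, but the
# prefixed ids never collide.
day1 = [make_run_id(date(2022, 5, 1), i) for i in (1, 2)]
day2 = [make_run_id(date(2022, 5, 2), i) for i in (1, 2)]
```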
@adarshverma5429 2 years ago
It's working fine, but after applying the relationalize function, some of the fields are missing from the resultant dataframes. How can I handle this?
@AWSTutorialsOnline 2 years ago
Can you please share a sample? I can then investigate.
@adarshverma5429 2 years ago
Where can I share the sample?
@AWSTutorialsOnline 2 years ago
@@adarshverma5429 brajends@aws-dojo.com
@hsz7338 3 years ago
Thank you, AWS Tutorials. This is extremely useful. I have two questions: 1. Is 'create_dynamic_frame.from_catalog' a Glue Spark feature (not yet available in EMR)? 2. What's the main difference between 'relationalize' and 'Unbox' & 'UnnestFrame'? It seems that 'relationalize' covers a much larger range of problems and data classification types.
@AWSTutorialsOnline 3 years ago
With EMR, you use PySpark SQL, but that does not matter, because all you need is data in a dataframe, and SQL also creates dataframes. Now: unbox converts a JSON string into a struct; unnest converts a struct into a flat frame with native data types; relationalize covers both. It can break large complex data into multiple frames, establish relations/links between the frames, and also flatten the frames with native data types. relationalize is more handy.
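A plain-Python analogue of the two steps described above may help: "unbox" is essentially parsing a JSON string into a struct, and "unnest" flattens that struct into dotted top-level fields. The real Glue methods operate on DynamicFrames; this only illustrates the data shapes, and the record is a made-up example:

```python
import json

raw = '{"user": {"name": "meera", "age": 30}}'

# "unbox": a JSON *string* becomes a struct (dict).
struct = json.loads(raw)

# "unnest": the struct is flattened into dotted top-level fields.
def unnest(d, prefix=""):
    flat = {}
    for k, v in d.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(unnest(v, key + "."))
        else:
            flat[key] = v
    return flat

flat = unnest(struct)  # {"user.name": "meera", "user.age": 30}
```

relationalize additionally splits list-valued fields into separate child frames, which neither step above does on its own.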
@hsz7338 3 years ago
@@AWSTutorialsOnline Thank you for the reply. To confirm my understanding of the EMR question: in EMR, Spark SQL can be used to parse the data, but that parsing step must be custom developed, which is different from relationalize in Glue Spark, where the complexity has been taken away (it is supported and managed) by AWS Glue, and is therefore less troublesome. Is my understanding correct?
@shivangsingh2364 3 years ago
Hello, when I try to use unbox in Glue, it gives me an error like "DataFrame object has no attribute unbox". Can you please tell me how to resolve this? @hao sky zhou and @AWS Tutorials. Thanks
@AWSTutorialsOnline 3 years ago
@@shivangsingh2364 Hi, unbox is a method of the Glue DynamicFrame, not DataFrame. Please check which object type you are calling the unbox method on.
@shivangsingh2364 3 years ago
@@AWSTutorialsOnline Thank you, I was able to use unbox on a dynamic frame. One thing I wanted to ask: my JSON is now flattened and I write the dataframe in Parquet format to S3, but when I crawl the S3 data with a Glue crawler, the table contents (records) that I see in Athena are still in JSON format. Is there any way I can see the records in table format? Please suggest!! Thanks
@chatchaikomrangded960 3 years ago
Nice.
@AWSTutorialsOnline 3 years ago
Thanks!
@noraznizam6123 2 years ago
Good tutorial... Do you have a tutorial on managing incremental data loads? I'd appreciate it if you could make one. :)
@AWSTutorialsOnline 2 years ago
Not yet, let me plan for it.
@vishalsingh-ku4uk 2 years ago
@@AWSTutorialsOnline Any update on managing incremental data loads?
@chris0628 2 years ago
Useful bits start at 8:00
@sardesaisantosh 1 year ago
Video link for the custom classifier?
@biplobshrivastava1100 2 years ago
How do I combine these output data frames into a single output file?
@AWSTutorialsOnline 2 years ago
You can join the data frames column-wise using the key.
@amn5341 2 years ago
The author has pretty much confused DataFrame vs DynamicFrame; he refers to a DynamicFrame as a Data Frame. So be aware of that.
@adarshverma5429 1 year ago
Having one issue here: the relationalize function is making the job take a long time to execute.
@AWSTutorialsOnline 1 year ago
Hard to tell why. Can you please share your file structure?
@adarshverma5429 1 year ago
@@AWSTutorialsOnline Yeah, sure!! Where should I share that?