*Notes: Part 2:*

*AWS Glue Components:*
1) Serverless ETL engine:
a. A serverless ETL engine based on Apache Spark.
b. Apache Spark or Python Shell jobs - we provide Spark or Python scripts and Glue takes care of the entire lifecycle of job execution: it spins up the necessary cluster, runs the script, and shuts the machines down. We only pay for what we use. (A minimal job-script sketch follows these notes.)
c. A visual tool (Glue Studio) is also provided to create jobs interactively; Glue compiles those jobs into Apache Spark scripts.
2) AWS Glue Data Catalog:
a. Centralised metadata store.
b. Fully managed.
c. Hive-metastore compatible.
d. Many AWS services, third-party partners, and open-source tools integrate with this catalog.
3) Crawlers:
a. Used to load and maintain the Data Catalog.
b. They infer the metadata of our tables, i.e. the schema.
c. They also support schema evolution through versioning.
4) Workflow management:
a. Orchestrates triggers, crawlers, and jobs.
b. Helps us build and monitor complex workflows for our pipelines in a reliable fashion.

*What is Glue used for? (Glue use cases)*
1) Building data lakes: Customers take all their data and store it on Amazon S3 (a ubiquitous, low-cost, highly durable object store). They break down their data silos, use AWS Glue jobs and workflows to ingest data from those silos into S3, and process that data from stage to stage. AWS Glue crawlers load and maintain the Data Catalog, and customers also use the AWS Lake Formation service to secure the data lake. The data lake can then be accessed for analysis, business intelligence, or machine learning using tools like Athena, QuickSight, EMR, Redshift, SageMaker, etc.
2) Loading data warehouses: AWS Glue is also used to load data warehouses via traditional ETL processing.
3) Data preparation for AI/ML and data science workloads: cleaning and enriching data, extracting features, building training sets, etc. Data scientists also use notebooks connected to Glue for data exploration and experimentation.
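To make point 1b concrete, here is a minimal sketch of a Glue PySpark job script that reads a crawler-registered table from the Data Catalog, remaps a couple of columns, and writes Parquet back to S3. The database, table, column, and bucket names are hypothetical placeholders; the `awsglue` library is only available inside the Glue job environment.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

# Glue passes the job name in as a runtime argument.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog
# ("sales_db" and "raw_orders" are hypothetical names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="raw_orders",
)

# Rename/cast columns (hypothetical source schema).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the cleaned data back to S3 as Parquet; Glue tears the
# cluster down after the job finishes, so we pay only for the run.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake/stage/orders/"},
    format="parquet",
)

job.commit()
```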
@sukulmahadik0303 · 3 years ago
*Notes: Part 5:*

*AWS Glue custom connectors:* To be introduced in December 2020. This feature allows us to create our own custom connectors for our data sources and use them in our Glue jobs. We can also easily deploy partner-developed connectors from AWS Marketplace.

*AWS Glue DataBrew:* A new interface for cleaning and normalizing our data. It profiles our data to detect patterns and anomalies, and we can choose from over 250 built-in cleaning transformations and apply them visually at scale.

*AWS Glue Schema Registry:* Centrally discover, control, and evolve our data schemas. This allows us to enforce schemas and schema evolution to prevent downstream application failures. It helps improve data quality for our data streaming applications and integrates easily with Amazon MSK, Kinesis Data Streams, and Kinesis Data Analytics for Apache Flink. (A registration sketch follows these notes.)
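As an illustration of the Schema Registry's compatibility enforcement, here is a sketch using the boto3 Glue client to create a registry, register an Avro schema with BACKWARD compatibility, and add a compatible second version. The registry name, schema name, and record layout are hypothetical; an incompatible version would be rejected by the compatibility check.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a registry to group the streaming schemas together.
glue.create_registry(RegistryName="streaming-schemas")

# Version 1 of a hypothetical click-event record.
avro_schema_v1 = """{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page", "type": "string"}
  ]
}"""

# BACKWARD compatibility: consumers on the new schema can still
# read data produced with the previous version.
glue.create_schema(
    RegistryId={"RegistryName": "streaming-schemas"},
    SchemaName="click-events",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=avro_schema_v1,
)

# Version 2 adds an optional field with a default, which passes
# the BACKWARD compatibility check; removing a field that readers
# rely on would fail it and protect downstream applications.
avro_schema_v2 = """{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}"""

glue.register_schema_version(
    SchemaId={"RegistryName": "streaming-schemas",
              "SchemaName": "click-events"},
    SchemaDefinition=avro_schema_v2,
)
```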
@DJ-ws6je · 3 years ago
lots of data is an understatement
@maa1dz1333q2eqER · 3 years ago
Great presentation, thanks. Still would prefer a full 60 minutes.
@mangeshxjoshi · 3 years ago
Excellent presentation. Very much appreciated.
@mangeshxjoshi · 3 years ago
Hi Sir, we have been comparing two cloud-based ETL tools, AWS Glue and Azure Data Factory.
1) Scenario: we need to process/extract S3 files, and the S3 files are large (possibly millions of records, more than 100 MB each). How do we process such large files through AWS Glue? If you could throw some ideas our way, that would help; I believe AWS Glue 2.0 is much more recommended here compared to Azure Data Factory.
2) How can S3 file encryption/decryption be handled through AWS Glue? We are looking for encryption key management through AWS Glue, and for how encrypted S3 files can be processed into an AWS RDS PostgreSQL engine. We need some thoughts on the encryption mechanism in AWS Glue.
Regards, Mangesh