How to upload on-premise database data to AWS S3 | Build a Data Lake | Python

  16,656 views

BI Insights Inc

2 years ago

A data lake is centralized cloud storage in which you can store all of your data, both structured and unstructured, at any scale. This platform is fast becoming the standard for users looking to store and process big data. In this video we cover how to build an AWS S3 data lake from an on-premise SQL Server database. S3 is an easy-to-use data store; we use it to load large amounts of data for later analysis. A minimal code sketch of the workflow is included after the topic list below.
Link to medium article on this topic: blog.devgenius.io/how-to-build-a-s3-data-lake-with-python-from-on-premise-database-23d5d2cdd1da
Link to GitHub repo: github.com/hnawaz007/pythondataanalysis/tree/main/AWS%20Data%20Lake
Subscribe to our channel:
kzbin.info
---------------------------------------------
Follow me on social media!
GitHub: github.com/hnawaz007
Instagram: bi_insights_inc
LinkedIn: www.linkedin.com/in/haq-nawaz/
---------------------------------------------
#AWS #S3 #DataLake
Topics covered in this video:
0:00 - Intro: data lake from on-premise to AWS S3
1:03 - Create S3 user with programmatic access
2:37 - Create S3 bucket
3:04 - Python setup
3:56 - Read data from SQL Server
5:04 - Load Data to S3 Bucket
6:59 - Code Demo
7:36 - Review S3 Data Lake
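
A minimal sketch of the workflow shown in the video: read a table from SQL Server with pandas/SQLAlchemy and upload it to S3 as CSV with boto3. The connection string, table name and bucket name below are placeholders rather than the exact values used in the video; see the GitHub repo above for the actual script.

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; substitute your own server, database and ODBC driver.
engine = create_engine(
    "mssql+pyodbc://user:password@SERVER/AdventureWorksDW?driver=ODBC+Driver+17+for+SQL+Server"
)

# Read the source table into a DataFrame.
df = pd.read_sql("SELECT * FROM DimProduct", engine)

# Write the DataFrame to an in-memory CSV buffer and push it to the bucket.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)

s3 = boto3.client("s3")  # credentials are picked up from the environment or ~/.aws
s3.put_object(Bucket="my-data-lake-bucket", Key="DimProduct.csv", Body=csv_buffer.getvalue())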

Comments: 50
@BiInsightsInc
@BiInsightsInc 2 years ago
Link to AWS Playlist: kzbin.info/www/bejne/mXeyY2CdqJppmpo
@muppallavenkatadri1777
@muppallavenkatadri1777 1 year ago
I have 4 different CSV files in S3 and I need to load them into four different tables in Redshift. Can you tell me how to do this using a Lambda function?
@BiInsightsInc
@BiInsightsInc 1 year ago
@@muppallavenkatadri1777 Feel free to check out the AWS Lambda video in this series.
@richardhoppe4991
@richardhoppe4991 10 months ago
Excellent content, just what I need to know! Looking forward to your Airflow videos next
@RogerDunbar
@RogerDunbar 1 year ago
Very informative video, thanks!
@yjtechnical4700
@yjtechnical4700 1 year ago
Excellent video.
@kshitijbansal3672
@kshitijbansal3672 1 year ago
Hey, how can I automate this process, so that after a certain time the code runs automatically and uploads the SQL data to S3?
@uatcloud9122
@uatcloud9122 10 months ago
Thanks for the great video! Instead of CSV, would you recommend uploading the same structured data as Parquet?
@BiInsightsInc
@BiInsightsInc 10 months ago
Yeah, if you have the option then go with Parquet. Parquet files are smaller than CSV files, and they can be read and written much faster.
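A hedged sketch of the Parquet variant of the upload, assuming pandas with pyarrow (or fastparquet) installed; the DataFrame contents, bucket and key names are placeholders.

import io

import boto3
import pandas as pd

# Stand-in for the SQL Server extract from the video.
df = pd.DataFrame({"ProductKey": [1, 2], "EnglishProductName": ["Bike", "Helmet"]})

# to_parquet requires pyarrow or fastparquet; a BytesIO buffer keeps the file in memory.
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)

s3 = boto3.client("s3")
s3.put_object(Bucket="my-data-lake-bucket", Key="DimProduct.parquet", Body=buffer.getvalue())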
@jaswanth333
@jaswanth333 9 months ago
Hey Haq, how do you do an incremental load in this pipeline? Do we need to rewrite all the new data into S3 again, or is there a way to make field-level changes/inserts/deletes on existing buckets? I am assuming this is where data lake file formats such as Apache Hudi come into the picture; correct me if I am wrong, and please walk me through a workaround process.
@BiInsightsInc
@BiInsightsInc 9 months ago
Amazon S3 is simply a storage system; it will store whatever data is provided. It is not possible to 'update' an object in Amazon S3: an object can be overwritten (replaced), but it cannot be appended. You can use an incremental approach to write only the changed data to S3, or use S3 as a staging layer and push only the changed data via Glue to Redshift or any other database service available in AWS. Here is an overview of the various incremental approaches: kzbin.info/www/bejne/l5C3aauIlqiCbJI kzbin.info/www/bejne/aWOoo6l-lNKSpNE&t
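A rough sketch of the watermark-style incremental approach described above, assuming the source table has a ModifiedDate column; the table, column, bucket and high-water-mark handling below are placeholders.

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@SERVER/AdventureWorksDW?driver=ODBC+Driver+17+for+SQL+Server"
)

# In practice the previous high-water mark would come from a control table or file.
last_run = "2023-01-01"

df = pd.read_sql(
    text("SELECT * FROM DimProduct WHERE ModifiedDate > :last_run"),
    engine,
    params={"last_run": last_run},
)

if not df.empty:
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    # Write the delta as a new object instead of overwriting the full extract.
    boto3.client("s3").put_object(
        Bucket="my-data-lake-bucket",
        Key=f"DimProduct/incremental_{last_run}.csv",
        Body=buf.getvalue(),
    )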
@ahmetaslan9261
@ahmetaslan9261 1 month ago
Thanks for the informative content! How would you deal with large tables, especially for the initial loads? Assume that a table is 200-300 GB; selecting all the data and keeping it in DataFrames/in-memory objects doesn't look practical, so I believe defining a batch key/partitions on the source side and iterating over it in the code could be a way.
@BiInsightsInc
@BiInsightsInc 1 month ago
Thanks. It seems you are working with a large dataset. You can partition your data and also use chunksize to limit the amount of data read into memory, if your source supports it. Keep in mind Pandas is single threaded, so it will be slow processing a dataset of this size. However, there are specialized tools for large datasets, i.e. Flink, Polars and Spark. Flink and Spark are designed for this and run on a cluster, so they are not restricted by a single machine's limits; they are worth exploring for large datasets. I have covered PySpark on the channel if you are interested: kzbin.info/aero/PLaz3Ms051BAkwR7d9voHsflTRmumfkGVW Here are the links to these frameworks: flink.apache.org/ spark.apache.org/docs/latest/api/python/index.html
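A minimal sketch of the chunked approach mentioned above, assuming pandas reading through SQLAlchemy; the table, chunk size, bucket and key prefix are placeholders.

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@SERVER/AdventureWorksDW?driver=ODBC+Driver+17+for+SQL+Server"
)
s3 = boto3.client("s3")

# With chunksize, read_sql returns an iterator, so only one chunk is held in memory at a time.
for i, chunk in enumerate(pd.read_sql("SELECT * FROM FactInternetSales", engine, chunksize=100_000)):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False)
    s3.put_object(
        Bucket="my-data-lake-bucket",
        Key=f"FactInternetSales/part_{i:05d}.csv",
        Body=buf.getvalue(),
    )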
@ahmetaslan9261
@ahmetaslan9261 1 month ago
@@BiInsightsInc Thank you so much for your time and the prompt response. I'll definitely go through your posts related to PySpark.
@socialawareness1643
@socialawareness1643 1 year ago
I have these two databases from Amazon AWS S3 in my on-premise SQL Server. Is there a way I can migrate this entire DB from Amazon S3 to Snowflake?
@BiInsightsInc
@BiInsightsInc 1 year ago
If your data is already in S3 then you can use Snowpipe to ingest it into Snowflake from S3. Here is an overview of this approach: medium.com/geekculture/how-to-load-data-from-s3-to-snowflake-2113b844d980
@muppallavenkatadri1777
@muppallavenkatadri1777 1 year ago
I have 4 different CSV files in S3 and I need to load them into four different tables in Redshift. Can you tell me how to do this using a Lambda function?
@BiInsightsInc
@BiInsightsInc 1 year ago
Feel free to check out the rest of the series. I have a video on ETL with AWS Lambda using Python Shell. You can use it as a starting point. kzbin.info/www/bejne/mXeyY2CdqJppmpo
@ihab6796
@ihab6796 1 year ago
I ran into a problem like this: importing rows 0 to 606 for table DimProduct gives "Data load error: name 'upload_file_bucket' is not defined". How can I solve this case? Thanks.
@ihab6796
@ihab6796 1 year ago
My bucket name is engpc-sqlserver-s3 and the region is Singapore. I already changed them but it still isn't connecting.
@jasonp6597
@jasonp6597 1 year ago
What if the tables are so big that you cannot load all the data on your local machine/Pandas? It would be great to show how to batch the table to S3 using SQL and Pandas, or SQL and PySpark. I am currently using a Docker container, but my source is so big that, even with all my Mac's computing power allocated to the container, the transfer still fails with an OOM 137 from Docker.
@BiInsightsInc
@BiInsightsInc 1 year ago
Hi Jason, if your table is too big for a single machine then it's time to move on to a distributed engine such as Dask, Spark, Flink or Kafka to handle big data. However, you can use batching with SQLAlchemy and server-side cursors to only load a specified number of rows into memory. I will cover an example of this in the future.
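A rough sketch of what batching with SQLAlchemy server-side (streaming) results could look like, as a complement to the plain chunksize example above; whether rows are truly streamed depends on the driver, and the connection, table and bucket names are placeholders.

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@SERVER/AdventureWorksDW?driver=ODBC+Driver+17+for+SQL+Server"
)
s3 = boto3.client("s3")

# stream_results asks the DBAPI to fetch rows in batches instead of materialising the whole result set.
with engine.connect().execution_options(stream_results=True) as conn:
    for i, chunk in enumerate(pd.read_sql(text("SELECT * FROM FactInternetSales"), conn, chunksize=50_000)):
        buf = io.StringIO()
        chunk.to_csv(buf, index=False)
        s3.put_object(
            Bucket="my-data-lake-bucket",
            Key=f"FactInternetSales/batch_{i:05d}.csv",
            Body=buf.getvalue(),
        )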
@jasonp6597
@jasonp6597 1 year ago
@@BiInsightsInc Awesome, thank you. Yes, I am blocked by cloud network/permission issues, so I have to process locally until those issues are sorted out. I am creating a mini POC using the same SQLAlchemy batch approach you mentioned. Appreciate your videos.
@AnanthuS-mi2qm
@AnanthuS-mi2qm 1 year ago
Can I move any type of file using this?
@BiInsightsInc
@BiInsightsInc 1 year ago
Yes, you can move various file types with this script, such as CSV, Excel, Parquet, JSON, text, etc.
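A small sketch of that idea using boto3's upload_file, which streams straight from disk and so also copes with files that don't fit in memory; the file names and bucket are placeholders.

import boto3

s3 = boto3.client("s3")

for local_path in ["sales.parquet", "customers.xlsx", "notes.txt"]:
    # upload_file handles multipart uploads for large files automatically.
    s3.upload_file(Filename=local_path, Bucket="my-data-lake-bucket", Key=f"raw/{local_path}")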
@mukeshgupta933
@mukeshgupta933 3 months ago
I am working with MS SQL; the database is on AWS RDS and contains billions of rows. I want to extract and load the data to S3 via Glue. How can I do it?
@BiInsightsInc
@BiInsightsInc 3 months ago
You can use the following guide to move data from RDS to S3 using Glue. The article sources from MySQL, but you can substitute MS SQL. docs.aws.amazon.com/prescriptive-guidance/latest/archiving-mysql-data/export-data-by-using-aws-glue.html
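For orientation, a rough outline of a Glue PySpark job that copies a catalog table to S3 as Parquet; it assumes the RDS table has already been crawled into the Glue Data Catalog, and the database, table and path names are placeholders (the linked guide is the authoritative walkthrough).

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the RDS-backed table from the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(database="rds_db", table_name="dbo_fact_sales")

# Write it out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake-bucket/fact_sales/"},
    format="parquet",
)
job.commit()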
@AnshuJoshi-oh1io
@AnshuJoshi-oh1io 3 months ago
Where would you run main.py if you don't want to run it locally?
@BiInsightsInc
@BiInsightsInc 3 months ago
Hello there, you can run the Python script on an EC2 instance or in a Lambda function depending on the load size. Keep Lambda's restrictions in mind if you end up going that route.
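A minimal sketch of the Lambda route, assuming the function is triggered on a schedule (e.g. EventBridge) and that the database driver and network access are set up separately; the extract here is a stand-in and all names are placeholders.

import io

import boto3
import pandas as pd


def lambda_handler(event, context):
    # Stand-in for the SQL Server extract (driver layer and VPC networking not shown).
    df = pd.DataFrame({"ProductKey": [1, 2], "EnglishProductName": ["Bike", "Helmet"]})

    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(
        Bucket="my-data-lake-bucket", Key="DimProduct.csv", Body=buf.getvalue()
    )
    return {"rows_uploaded": len(df)}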
@KeshavChoudhary-dx4xd
@KeshavChoudhary-dx4xd 1 year ago
Is it possible to update an existing file on S3 line by line?
@BiInsightsInc
@BiInsightsInc 1 year ago
Yes, you can read the file and compare it against another file or source, get the differences and update the object with them. An easier approach would be to overwrite the file with the new incoming file.
@KeshavChoudhary-dx4xd
@KeshavChoudhary-dx4xd 1 year ago
@@BiInsightsInc Yes, that sounds about right. My use case is that there are on-premises sensors connected to a Raspberry Pi; every 3 seconds the RPi writes 3 lines to a local CSV file and also sends 3 JSON strings to the AWS IoT MQTT broker. Now I have to save this JSON string to an object inside an S3 bucket. I don't want to upload the entire file again and again to the S3 bucket; I just want to append the newly generated JSON string to an existing S3 object. Any suggestions on this?
@BiInsightsInc
@BiInsightsInc 1 year ago
@@KeshavChoudhary-dx4xd S3 does not directly support append; however, we can read in the file and append to it using Python. One caveat: as the file grows larger, so will the read time. Another option would be to create a "folder" in S3 instead of a file. That way you can write file/1 and then next time write file/2 and so on. Practically you might use a millisecond Unix timestamp as the incremental identifier, or a date-time including milliseconds, like 2018-01-21_08:45:32:752, for readability. Anyway, here is a sample of the append method to get you started...

import json

import boto3

s3_path = 'some/s3/path'
bucket = 'somebucket'
local_data = {"id": "123", "name": "XYZ", "transaction": [20.0, 30.0]}

s3 = boto3.client('s3')

# Read the existing object (assumed to hold a JSON list), append, and write it back.
resp = s3.get_object(Bucket=bucket, Key=s3_path)
json_data = json.loads(resp['Body'].read())
json_data.append(local_data)
s3.put_object(Bucket=bucket, Key=s3_path, Body=json.dumps(json_data).encode())
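And a small sketch of the "folder of objects" alternative mentioned above, where each batch of readings becomes its own timestamped key under a common prefix; the bucket, prefix and payload are placeholders.

import json
import time

import boto3

s3 = boto3.client('s3')
readings = [{"sensor": "rpi-01", "value": 21.7}]  # stand-in for the incoming JSON strings

# A millisecond Unix timestamp keeps the keys unique and naturally ordered.
key = f"sensor-data/{int(time.time() * 1000)}.json"
s3.put_object(Bucket='somebucket', Key=key, Body=json.dumps(readings).encode())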
@KeshavChoudhary-dx4xd
@KeshavChoudhary-dx4xd 1 year ago
@@BiInsightsInc Thank you for your response and for taking the time; I appreciate it. This method is known to me, but do you have any information on s3fs? Mounting an S3 bucket as a filesystem using s3fs and then appending data to an S3 object?
@BiInsightsInc
@BiInsightsInc 1 year ago
@@KeshavChoudhary-dx4xd There are various sources covering this topic that you can explore. I have not covered it yet, so I can't point you to an internal resource. However, here is an article on the subject to get you started: medium.com/tensult/aws-how-to-mount-s3-bucket-using-iam-role-on-ec2-linux-instance-ad2afd4513ef
@montassarbenkraiem3029
@montassarbenkraiem3029 1 year ago
Hi, could you show us how to connect to MSSQL in detail?
@BiInsightsInc
@BiInsightsInc 1 year ago
Here is the video that covers how to connect to a SQL Server database using Python. This is a common question that comes up in the ETL series: kzbin.info/www/bejne/sJXIq3hsibyGh7M
@montassarbenkraiem3029
@montassarbenkraiem3029 1 year ago
@@BiInsightsInc I wrote the same code and put it in a DAG in order to launch it, but the DAG status == fail and I don't know why; I already used the code and it works, but not in Airflow. I also have another question: how do I get the Host value of my MSSQL in order to create the MSSQL connection? I have already installed the MSSQL providers for my Apache Airflow. Thanks.
@BiInsightsInc
@BiInsightsInc 1 year ago
@@montassarbenkraiem3029 Airflow may not be able to reach SQL Server, so you need to make sure SQL Server accepts remote connections and the SQL Server port is open. Anyway, I would suggest utilizing the Airflow mechanism to create and test a connection to a database. I have covered how to add the SQL Server library to Airflow and how to connect to SQL Server: kzbin.info/www/bejne/qmXLZampirGqfKc&t
@montassarbenkraiem3029
@montassarbenkraiem3029 1 year ago
@@BiInsightsInc I followed the same steps: I built a simple Python script to connect to MSSQL and select * from a table, and it works. I took the same "server_name" parameter and put it in the Apache Airflow web UI to create the MSSQL connection, but there is still an error: (20009, b'DB-Lib error message 20009, severity 9: Unable to connect: Adaptive Server is unavailable or does not exist (desktop-mgkahft) '). TCP/IP is already enabled.
@BiInsightsInc
@BiInsightsInc 1 year ago
@@montassarbenkraiem3029 The server name is your computer name or the IP address where your database is located. If you are unable to reach SQL Server then go through the steps outlined in this video: kzbin.info/www/bejne/qmXLZampirGqfKc&t
@sjtrout7705
@sjtrout7705 1 year ago
Is the raw code available anywhere?
@BiInsightsInc
@BiInsightsInc 1 year ago
The source code is available on GitHub; the link is in the description of the video.
@kwabena44u
@kwabena44u 1 year ago
What's your email? I have some questions for you.
@kshitijbansal3672
@kshitijbansal3672 1 year ago
Hey, how can I automate this process, so that after a certain time the code runs automatically and uploads the SQL data to S3?
@GuilhermeMendesG
@GuilhermeMendesG 10 months ago
You can use Airflow, Mage or other workflow tools.
@kshitijbansal3672
@kshitijbansal3672 1 year ago
Hey, how can I automate this process, so that after a certain time the code runs automatically and uploads the SQL data to S3?
@BiInsightsInc
@BiInsightsInc 1 year ago
Hey Kshitij, you can use a number of tools to schedule a Python script. I have covered Windows Task Scheduler, Apache Airflow and Dagster, to name a few, which run scripts on a schedule. If you want to do it in the AWS ecosystem then use AWS Step Functions. kzbin.info/www/bejne/f6TYcp-qhbSce9k&t kzbin.info/www/bejne/m4vJdWmubqt9aqc
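A minimal sketch of the Airflow option, assuming the extract-and-upload logic from the video is wrapped in a function; the DAG id, schedule and function body are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_upload():
    # Call the SQL Server -> S3 logic from the video's script here.
    ...


with DAG(
    dag_id="sql_server_to_s3",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run the upload once a day
    catchup=False,
) as dag:
    PythonOperator(task_id="load_to_s3", python_callable=extract_and_upload)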