I know right?! I guess not many people like to watch me code...
@IrakliChitishvili 3 years ago
@@stratascratch I think it's just a matter of time before your channel and the work you do take off.
@flipcase 3 years ago
I second that!
@toilinginobscurity3091 3 years ago
There's really no rhyme or reason in the universe. This channel and these tutorials are gold, but instead I am bombarded by ML channels that play with toy datasets and brand themselves with "all you need" clickbait. You, Sir, are a saint.
@stratascratch 3 years ago
haha =) I know exactly what you mean. Too many of those channels around.
@Networkprofessor 2 years ago
Just what I needed! This will take you to a whole new level. Thank you.
@stratascratch 2 years ago
You are welcome. Thank you for the nice comment. 😀
@prateek2159 3 years ago
Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for this particular series, because there's literally no one on YouTube with such guidance on DS projects and I have been looking for one for a very long time. I have my placements in just 12 months and I really want to build a full-stack data science project. Thank you.
@stratascratch 3 years ago
Thanks for the kind words! I have created a series of these videos, from importing a pd df into a db to grabbing data from an API. I think most of the videos in the series should get you started on grabbing and managing data. Are there other types of skills you're interested in?
@jaquelinerazzo 3 years ago
I've already played back the video like 4 times just to watch the ads because I don't know how else to thank you! Great channel, great video. :)
@andrealves3104 5 months ago
I was having weird issues inserting data via the pandas to_sql method into a PostgreSQL database on AWS while passing it through a SQLAlchemy engine API interface. For now, adding it manually through a SQL query while iterating row by row in pandas worked fine, both for my project's purpose and for the video's goal. All in all, thank you very much for the overall explanation.
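Since to_sql trips people up here, the row-by-row fallback this comment describes can be sketched like so (stdlib sqlite3 stands in for the Postgres/SQLAlchemy connection; the table, columns, and rows are made up for illustration — with psycopg2 you would use %s placeholders instead of ?):

```python
import sqlite3

# hypothetical video stats; in the video these rows come from the YouTube API
rows = [
    {"video_id": "a1", "title": "Intro", "view_count": 100},
    {"video_id": "b2", "title": "Pandas to SQL", "view_count": 250},
]

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
conn.execute("CREATE TABLE videos (video_id TEXT, title TEXT, view_count INTEGER)")

insert = "INSERT INTO videos (video_id, title, view_count) VALUES (?, ?, ?)"
for row in rows:  # mirrors iterating the dataframe one row at a time
    conn.execute(insert, (row["video_id"], row["title"], row["view_count"]))
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
print(count)  # 2
```

Slower than a bulk load, but each failed row is easy to isolate and debug.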
@RenatoCosta1 2 years ago
Hey Nate, amazing stuff! I think it would be great if you could do two more videos covering the topics you've mentioned in your video about the only project one would need to get a job in data science. I guess the next videos would be about machine learning modelling, and the last one on how to deploy it online. That would be awesome! Thanks anyway
@stratascratch 2 years ago
We're planning to publish 1 data project a month! When they are released, it'd be great if you could provide feedback. For example: did we dive deep enough into the technical aspects of the project? Did we dive too deep? Is the video too long or too short? This will help us iterate on the videos so we can provide something of value for all viewers. Thanks for the feedback!
@its_me7363 3 years ago
A much-awaited video for me... thanks for your time and effort, Nate
@stratascratch 3 years ago
Thanks for watching! I just updated the description with a link to the github repo that has the final version of this project. I'd definitely take a look at that and follow along with the notebook. I felt a bit too rushed on this video trying to explain all the concepts. But if there's anything to take away from this video, it's (1) how to properly structure your functions and (2) how to solve for memory & performance issues when potentially dealing with millions and billions of rows of data.
@its_me7363 3 years ago
@@stratascratch How do industry people manage large data stored in databases while exploring it? What are the best practices for data scientists or analysts in these cases?
@stratascratch 3 years ago
@@its_me7363 The dbs are optimized to handle large amounts of data with speed. You definitely won't be using any Postgres or MySQL dbs. You'll be using Hive or Greenplum or Snowflake. Also, depending on what you are trying to do, you will likely move away from pandas and onto PySpark for some work. All of this is taught when you join the company. Most people know that not all entry-level DSs have used these tools.
@its_me7363 3 years ago
@@stratascratch Are these dbs similar to PostgreSQL? And what do you mean by moving away from pandas... does the industry not use pandas in its projects?
@stratascratch 3 years ago
@@its_me7363 Yes, the dbs are similar to Postgres. There are just minor differences in syntax, but it's easy to learn. The industry does use pandas, but when dealing with large amounts of data you sometimes want to work with something built to handle it, and that's PySpark.
@quadrialli3715 1 year ago
Great and detailed explanation
@zynkers401 3 years ago
Great video Nate! You really have the spirit for teaching. The vocabulary and content are easy to understand and follow. I hope StrataScratch has continued exponential growth.
@stratascratch 3 years ago
Thanks for the kind words! And thanks for watching!
@codingstyle9480 3 years ago
Hi, thank you for the video. I have a question: how would you go about updating your database in real time without manually running the code? Perhaps we could design the code so that it is triggered when something changes in the source data, inserting or updating data automatically in real time (with some delay, of course).
@muhammedmarong6645 3 years ago
Hey, did you figure that out?
@pashlenenaidoo3695 2 years ago
I would also like to know if you have achieved this.
@stratascratch 2 years ago
You would need something like Airflow or Jenkins to schedule the trigger.
@caioriet458 2 years ago
I would like to know if you solved this problem!
@stratascratch 2 years ago
@@caioriet458 Yes, I left a comment. You'll need to use a scheduler like Airflow in order to keep updating your db without manually running the code. The scheduler will run it for you.
@awds121 3 years ago
Hey Nate! Great video! I have some feedback. First, the videos are very helpful and you should definitely continue doing more like this. Here are some ideas: how to schedule scripts to run on AWS; create tables, analyze data, and create a dashboard. Second, the playlist that includes this video is unordered. It is not hard to figure out which video should be watched first, but it doesn't hurt to sort it in the order you want the videos watched. Third, AWS doesn't include PostgreSQL in the free tier (at least not anymore). Just a heads up. Keep up the good work!
@JeffersonCanedo 2 years ago
I'm so glad I found this video. Thank you!
@stratascratch 2 years ago
You're welcome.
@bodhid3091 2 years ago
Good video. A couple of questions: Why do you store the new rows in another dataframe rather than calling insert_into_table directly? Any reason you don't use the PostgreSQL INSERT ON CONFLICT option to create an upsert, combining the insert and update into one call?
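For the second question, an upsert would indeed collapse the two calls into one statement. A minimal sketch of ON CONFLICT (shown with stdlib sqlite3, which shares this syntax, so the example is self-contained; table and values are made up, and Postgres via psycopg2 would use %s placeholders instead of ?):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ON CONFLICT needs a unique constraint to detect duplicates, hence the PRIMARY KEY
conn.execute("CREATE TABLE videos (video_id TEXT PRIMARY KEY, title TEXT, view_count INTEGER)")
conn.execute("INSERT INTO videos VALUES ('abc', 'First upload', 100)")

# one statement: insert a new row, or update the metrics if video_id already exists
upsert = """
INSERT INTO videos (video_id, title, view_count)
VALUES (?, ?, ?)
ON CONFLICT (video_id) DO UPDATE SET
    title = excluded.title,
    view_count = excluded.view_count
"""
conn.execute(upsert, ("abc", "First upload", 150))  # existing -> updated
conn.execute(upsert, ("xyz", "New video", 10))      # new -> inserted

rows = sorted(conn.execute("SELECT * FROM videos"))
print(rows)  # [('abc', 'First upload', 150), ('xyz', 'New video', 10)]
```

`excluded` refers to the row that failed to insert, which is what makes the update half of the upsert work.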
@javierjdaza 3 years ago
Brother, I never miss a video. This is pure gold!! Keep doing this please. Can I connect with you on LinkedIn? Thanks bro, you're the best
@stratascratch 3 years ago
Thanks so much for watching! And glad you like my videos =) Yes, please connect with me on LinkedIn. You can search for StrataScratch and I should pop up (I think?). If not, let me know.
@javierjdaza 3 years ago
@@stratascratch I can't find you. :(
@stratascratch 3 years ago
@@javierjdaza Sorry about that. Here you go! linkedin.com/in/nathanrosidi
@wisjnujudho3152 2 years ago
Hi Nate, is there any alternative to AWS RDS, a cloud database server that doesn't require a credit card even on the free tier? Alternatively, is it possible (maybe) to use Heroku Postgres as the database? Thx
@stratascratch 2 years ago
Hi Wisjnu, I found this article helpful. You might want to check it out: www.lastweekinaws.com/blog/10-free-cloud-databases-you-should-consider-and-1-you-shouldnt/
@elextures 9 months ago
I'm having a problem with the channel ID. In CHANNEL_ID, I put my own channel, but each time I pull the response, I get a different channel.
@GauravKumar-xl1fs 3 years ago
Please make more videos in this series and on pipelines
@stratascratch 3 years ago
glad you liked the videos. Will do if there are more views!
@PATRICKCHUAD 3 years ago
Thanks Nate for your video on pandas. I will apply this to my project in Python. Thanks,
@stratascratch 3 years ago
Great! Happy to help, thanks for watching!
@TripPiper 2 years ago
If your connection to AWS fails, try creating a new inbound security group that allows all access. But also learn how to create a connection with TCP or some form of security, because in production you'd never have inbound connections open to all access.
@denyseperezdevera7188 2 years ago
Hello! Do you have any idea why it returns "database "database-1" does not exist"?
@williamausenka9251 3 years ago
Great video! One question, what ML implementation would you do with that data? I'm trying to come up with something, but no success so far.
@jessinthamathew3032 3 years ago
Hi Nate! Really appreciate your content, its relevance and quality. So much to learn with each video! I wanted to ask why you opted for PostgreSQL specifically? Can I use MS SQL Server too?
@stratascratch 3 years ago
I'm just used to postgres and it's open source. MS SQL is fine to learn as well. In the future, the platform will be able to handle both db engines but currently it only uses postgres. Thanks for watching!
@acsrr4288 2 years ago
What library shall we use to load data to an Azure cloud DB?
@stratascratch 2 years ago
You can use various libraries and frameworks to connect.
@acsrr4288 2 years ago
@@stratascratch Can you recommend a couple for us to explore? Thx
@yashgupte3044 2 years ago
Getting a connection timed out error while connecting to the RDS instance. I have allowed all traffic in the inbound rules of the security group.
@stratascratch 2 years ago
Hi Yash, this link could help with troubleshooting: aws.amazon.com/premiumsupport/knowledge-center/rds-oracle-connection-errors/
@jakubjankovic4144 2 years ago
Have you managed to solve it? I am having the same issue
@llamashockz 2 years ago
Having the same issue for hours; nothing seems to fix it.
@jakubjankovic4144 2 years ago
@@llamashockz I ended up using ElephantSQL instead of AWS to host the db and it resolved the issue for me.
@hogthebuilder 2 years ago
2:20 in and I know that this is about to be the video for me.
@MartinoxxHD 3 years ago
Nice vid, I actually learned new things! One question: what's the point of creating a temporary DF for the new videos and then UPDATing them in a new for loop? Can't we just UPDATE and INSERT the values in the same for loop?
@stratascratch 3 years ago
I hold the new videos in a temp df so that I only need to upload the new videos or replace existing videos with new metrics. I'm trying to stay away from updating all videos, for performance. Then I do the update and insert all at once, but only for the new vids. Hope that makes sense. Thanks for watching!
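The temp-df split described here can be sketched with a boolean mask; the ids and columns below are made up, and the set of existing ids would really come from a SELECT against the db:

```python
import pandas as pd

db_ids = {"a1", "b2"}  # video ids already stored in the db table (hypothetical)
df = pd.DataFrame({"video_id": ["a1", "b2", "c3"], "view_count": [150, 260, 10]})

# existing videos get their metrics UPDATEd; only the rest go into the temp df
new_vids_df = df[~df["video_id"].isin(db_ids)]
print(new_vids_df["video_id"].tolist())  # ['c3']
```

Only `new_vids_df` then needs INSERTs, which is what keeps the job cheap as the table grows.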
@suprithm3802 2 years ago
I have a column of data type CLOB. I am not able to update the data, but the append function works fine.
@suprithm3802 2 years ago
Any suggestions?
@andrew6233 2 years ago
Newbie question: what is the database manager you show at 11:00? Thank you so much for these videos.
@stratascratch 2 years ago
It's DataGrip!
@stratascratch 2 years ago
It's my favorite, BTW. I hate in-browser db managers.
@andrew6233 2 years ago
@@stratascratch Thanks! It looks really nice + convenient to work with compared to what I've been using.
@noobshady 2 years ago
That was really helpful.
@stratascratch 2 years ago
Glad it was helpful!
@khairulhakimi508 2 years ago
I'm new to this field and have one noob question. Why do people (DA, DE, DS) use Python (pandas DataFrames) for data ingestion into SQL? Why not just upload the CSV into SSMS using a query or the flat file import?
@stratascratch 2 years ago
It's basically to allow processing or manipulation of the column names and data, if you want to do that.
@ruthfussee5503 3 years ago
Hi Nate, thank you for this video, super useful; I was scanning the internet for such a solution but could not find it. I am wondering: if there are many columns (e.g. 100 columns), wouldn't it be inefficient to type in row['column1'], ..., row['column100']? Is there a way to put the column information in a tuple, list, or dictionary and pass it all through, like after SET or in vars_to_update? Another issue is when the Excel column names have spaces (different from the SQL columns); how do I then write the code, as I keep getting syntax errors? Lastly, in the real world the Excel files sometimes don't follow the template we stipulated; how should I design a validation check that is useful?
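Not from the video itself, but the first two questions here share a common pattern: normalize the headers first, then build the SQL from the column list instead of typing each name. A small sketch with made-up headers:

```python
import pandas as pd

# made-up Excel-style headers with the spaces/case problems described above
df = pd.DataFrame({"Video Title": ["Intro"], "View Count ": [100]})

# normalize headers into valid SQL identifiers: trim, lowercase, underscore
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# build the SET clause from the column list instead of typing row['column1'] ... row['column100']
set_clause = ", ".join(f"{col} = %s" for col in df.columns)
print(set_clause)  # video_title = %s, view_count = %s
```

The values can then be passed as a tuple per row, e.g. `cursor.execute(sql, tuple(row))`, so adding a column never means editing the SQL string by hand.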
@denyseperezdevera7188 2 years ago
Hi Nate, thank you so much for these amazing resources! Do you know why, when I try to connect to the database, I get "OperationalError: FATAL: database "database-1" does not exist"?
@stratascratch 2 years ago
Do you have a db? Did you create one for the project? You need to add your credentials.
@denyseperezdevera7188 2 years ago
Just solved my issue! I had switched the DB instance identifier with the DB username.
@artyomashigov 5 months ago
@@denyseperezdevera7188 Oh thanks, I had the same issue; my username and db name are the same.
@skyace173 3 years ago
Hey Nate! Great video! How much time would you expect this to take if it were a project at work?
@stratascratch 3 years ago
This is probably a 1-2 day project depending on what is being asked of me. But it shouldn't take too long if you already know what you're doing.
@Digital-Light 2 years ago
I'm getting this error: "DatatypeMismatch: column "upload_date" is of type date but expression is of type double precision LINE 3: ... VALUES('mTL23Gd-T3g','Engagement_Ring_Animation','NaN'::flo..." Please help!
@stratascratch 2 years ago
The error is saying that you have a dtype mismatch in the upload_date column. So clean up the values in the column and get it to a dtype that the db is expecting.
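A sketch of that cleanup with pandas (the sample values are made up): the `'NaN'::float` in the error is what Postgres sees when a missing date slips through as a float NaN, so coerce the column to real dates and send missing values as None, which the driver turns into SQL NULL:

```python
import pandas as pd

# hypothetical column with one missing upload date
df = pd.DataFrame({"upload_date": ["2021-03-01", None, "2021-05-20"]})

# parse to real dates; anything unparseable becomes NaT instead of raising
parsed = pd.to_datetime(df["upload_date"], errors="coerce")

# replace NaT with None so the db driver inserts NULL rather than float NaN
df["upload_date"] = [d.date() if pd.notna(d) else None for d in parsed]
print(df["upload_date"].tolist())
```

After this, the column holds `datetime.date` objects and `None`, both of which a DATE column accepts.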
@Digital-Light 2 years ago
@@stratascratch Thank you for the reply! I love your videos.
@ShahnazMalik. 2 years ago
Inserting or updating a table with millions of records in a SQL database takes hours to complete. What is the best solution? Please advise. Thank you.
@stratascratch 2 years ago
That's normal. Sometimes jobs just take hours to complete. I would optimize for ensuring that the process doesn't break, and if it does, that you don't lose all the work. One way to do this is to batch the updates/inserts so that if the process does break halfway through, you have at least updated/inserted half the records.
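A minimal sketch of that batching idea (stdlib sqlite3 stands in for the real db; the table, row data, and batch size are made up):

```python
import sqlite3

def batched(rows, size):
    """Yield successive chunks of rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (video_id TEXT, view_count INTEGER)")

rows = [(f"vid{i}", i * 10) for i in range(10)]

# commit after each batch: if the job dies halfway through, the finished batches survive
for chunk in batched(rows, 3):
    conn.executemany("INSERT INTO metrics VALUES (?, ?)", chunk)
    conn.commit()

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(count)  # 10
```

`executemany` per batch also cuts round trips compared with one execute-and-commit per row, which is usually where the hours go.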
@amadysvlog3449 3 years ago
Great video! Thank you! But a question: how do I get access to the Amazon RDS free databases? I created an account on AWS but can't find any database-yt.
@stratascratch 3 years ago
You'll need to create your own db. That's a whole topic by itself. But you can try it by going on AWS and then creating a db in the RDS service.
@andrewraad5626 2 years ago
Hey Nate, thanks for this video. How do you deal with foreign keys? My CSV file has strings instead of the foreign keys needed to upload to a specific table that uses foreign key constraints. What do you suggest?
@stratascratch 2 years ago
I have no idea how to fix your situation. I would just change the strings to FKs. It seems some data manipulation will be needed for your issue.
@reality8221 3 years ago
Hi Nate, thanks for the great videos. Wouldn't it be simpler to insert it as part of the first for loop, instead of creating a df and then iterating through it again? Would like to know your thoughts.
@stratascratch 3 years ago
My thinking about the temp df was to only update the new videos and metrics, and leave the rest alone. I think that's what I did in this project. Hope that makes sense.
@mcmoodoo 2 years ago
Why not just use pandas.to_sql("table_name", engine), which will create the table based on the dataframe object? You can also pass a types dictionary to specify the exact types for the columns, like varchar and integer.
@stratascratch 2 years ago
Yes, you can do that. It doesn't always work the way I want it to when I use to_sql(), and I often like to have QA checks along the way, so my method helps with that. But there are multiple ways to do what I'm doing. I would choose the option that allows you to get your work done!
@Davidkiania 2 years ago
Hi Nate, any chance you might show us how to automatically run the script? Is it a cron job, or how does that work? Thanks
@stratascratch 2 years ago
Yeah, it's essentially a cron/scheduled job. I use Airflow at work but I've seen others use Jenkins.
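For a single script, a plain crontab entry is the simplest version of this scheduled job; the script and log paths below are made up for illustration:

```shell
# run the ETL script every day at 02:00 (minute hour day month weekday);
# both paths are hypothetical - substitute your own
0 2 * * * /usr/bin/python3 /home/ubuntu/youtube_etl/main.py >> /var/log/youtube_etl.log 2>&1
```

Airflow or Jenkins add retries, alerting, and dependency management on top of this basic scheduling, which is why they show up once there is more than one job.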
@SensiStarToaster 3 years ago
Nice video. But writing SQL directly gets messy to maintain; why not use something like SQLAlchemy to separate your Python code from having to worry about the details of the specific SQL implementation, and make it independent of the type of database manager used?
@stratascratch 3 years ago
Yes, you can definitely do that. I haven't used SQLAlchemy but have seen the documentation and agree with you. It could be a much better solution.
@hoanglam2814 2 years ago
Hi Nate, thanks for your videos. I have to retrieve data from almost 100 APIs and your video helped a lot. But normally it takes me almost 10 hours to update all the data/records each day. Yesterday, I came across some articles about webhooks, and they mentioned that webhooks push data, whereas we have to pull data from an API. What do you think, could I use webhooks to update data and insert new data into the table? It would only retrieve data when events happen, instead of retrieving all data like pulling through an API. For the API, I think I can use it to retrieve old data from last year.
@stratascratch 2 years ago
How many records do you have where it would take 10 hrs a day? If it takes that long, I wouldn't implement this method. Seems like you have a lot of data. I would use Airflow and set up DAGs to get those jobs scheduled. Without an understanding of what type of data you're trying to move and for what purpose, I can't tell you if webhooks are the right way.
@hoanglam2814 2 years ago
@@stratascratch I retrieve about 40 thousand records each day; these are orders from a CRM system. My company helps around 100 clients manage their business, and each of them has one account in this CRM provided by a third party (this means 100 API sources). Actually, we are building an analytics system to work on it (including the ETL process and a dashboard), but until then, I have to collect data from the 100 APIs one by one. So I tried to implement your method and it works. But, yeah, like you said, it took too much time. I also did research on Airflow, but I don't know what the difference is between using Airflow and a cron tool like Power Automate/Zapier. For me, Power Automate/Zapier is much easier to learn.
@stratascratch 2 years ago
@@hoanglam2814 The difference between Airflow and something like Zapier is that Zapier is good for low-volume jobs that a non-technical person can set up. Airflow is the industry standard and can help manage huge databases and data lakes. For your use case, an automated ETL process where you collect data from APIs should work just fine for 40K records. Just make sure you automate the pulls since there are 100 of them. You can try webhooks if you'd like; I just don't have any experience using webhooks for your use case.
@hoanglam2814 2 years ago
@@stratascratch Thank you for your short and concise answer :)) It is really helpful!
@suprithm3802 2 years ago
The code is not updating; it's only appending the new values.
@hardikvig2723 1 year ago
Where do I get the name of the database?
@stratascratch 1 year ago
It's your db that you need to set up, so it can be any name you give it.
@nqomuleya 2 years ago
InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')