Automating Your Data Science Tasks In Python (importing CSV files to database AUTOMATION TUTORIAL)

24,609 views

StrataScratch

Comments: 91
@stratascratch 4 years ago
Find Part 1, where I create the script's functionality before automating the work, here: kzbin.info/www/bejne/raKld5qXgsx5h6M. Let me know what you guys think. Is this useful?
@trongtienhoang9522 4 years ago
Hi Nate, your materials are fantastic! I wonder if you will offer a discount for Black Friday this year?
@stratascratch 4 years ago
@@trongtienhoang9522 Thank you so much! Part 3 will come next week to finish off the series. I'm glad you like it. Yes I do have a Black Friday special which is 50% off any of the premium plans. You can use coupon code THANKS50. The discount is good until the end of the month. If you have any problems, feel free to email me at nate@stratascratch.com. Thank you!
@rezahamzeh3736 2 years ago
That is the difference between pure tutorials and best practices for real-world scenarios. Can you please continue producing more ETL tutorials in the same style?
@prateek2159 3 years ago
Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For Your Data Science Project", and I really want you to continue it, because there's literally no one on YouTube with this kind of guidance on DS projects and I have been looking for it for a very long time. My placements are just 12 months away and I really want to build a full-stack data science project. Thank you.
@gregNFL 2 years ago
Thank you. Beautiful. So clean! Even addresses Unicode errors. The cleanup bits are an added bonus. 10/10
@Davidkiania 2 years ago
Most helpful YouTube video series I have seen in a long, long time... thank you so much.
@Shane_Diesel_2010 2 years ago
You just got my sub. Nice tutorial. I really need to start learning to automate the repetitive tasks I do at work 👌
@DCCfhrue 2 years ago
Thanks Nate for the great value! Here's a minor helpful modification for when the CSV filenames start with a number/integer, to prevent the sqlite3.OperationalError: unrecognized token: "0001test" error (using f-string notation for recent Python versions):

    # drop table with same name
    # cursor.execute("drop table if exists %s;" % (tbl_name))  # older placeholder style
    cursor.execute(f"drop table if exists '{tbl_name}';")  # f-string version

    # quoting the table name lets a leading digit through the parser
    # cursor.execute("create table %s (%s);" % (tbl_name, col_str))  # ERROR
    cursor.execute(f"create table '{tbl_name}' ({col_str});")  # works with an integer as the filename's first character

Cheers!
@DCCfhrue 2 years ago
Just found the primary key setting in the pandas.DataFrame.to_sql docs:

    dataframe.to_sql(tbl_name, con=conn, dtype={'MyPKColumnnamehere': 'INTEGER PRIMARY KEY'})
@monascholtz2440 2 years ago
Thank you for sharing your knowledge - I enjoy your tutorials!
@nargisparvin4267 4 years ago
Excellent work, sir. You are making Python so easy. Great job, and I respect your work.
@stratascratch 4 years ago
Thanks! Let me know if there are other topics you want to see.
@mysteriousbd3743 4 years ago
Awesome... I've been looking for code like this for a while. Thanks for sharing it!
@stratascratch 4 years ago
Thanks! Stay tuned for part 3 in this series where I turn the code into functions so the entire script is cleaner and can be used for other projects.
@mdabulkalamazad6775 4 years ago
Love your teaching style.
@stratascratch 4 years ago
Thanks so much. Glad you like it!
@oluwaseunatoyebi6796 4 years ago
This is incredible my friend. Can you do a video where you load each of the 3 files into separate tables? I'd be curious to see how you dynamically handle that. Thanks for the great work.
@stratascratch 4 years ago
With this method, you should be able to load 3 tables pretty easily with a for loop. Just make sure you close and open the connection each time. This method can handle pretty large datasets and row counts; I've imported 5GB files into db tables. Other methods that rely on INSERTs have limitations on the number of rows, not necessarily the number of files. If there are 100K+ rows, use the batch parameter found in the psycopg2 documentation, and batch the rows into the table 1K at a time. Hope that provides some more insight into the limitations of this method.
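To make that concrete, here's a minimal sketch of that loop, assuming Postgres via psycopg2 (the credentials and file names are hypothetical, and the tables are assumed to already exist):

    import csv
    import psycopg2
    from psycopg2.extras import execute_batch

    csv_files = ['file1.csv', 'file2.csv', 'file3.csv']  # hypothetical names

    for csv_file in csv_files:
        # open a fresh connection for each file, and close it when done
        conn = psycopg2.connect(host='localhost', dbname='mydb',
                                user='user', password='password')
        cursor = conn.cursor()
        tbl_name = csv_file.replace('.csv', '')

        with open(csv_file) as f:
            reader = csv.reader(f)
            header = next(reader)  # skip the header row
            placeholders = ', '.join(['%s'] * len(header))
            insert_sql = f'insert into {tbl_name} values ({placeholders});'
            # batch the rows 1K at a time instead of one INSERT per row
            execute_batch(cursor, insert_sql, reader, page_size=1000)

        conn.commit()
        cursor.close()
        conn.close()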
@adiliophi 2 years ago
Your help is very useful! Thanks for sharing
@harshamadhwani8438 3 years ago
Simply Superb... love the way you explain and make it sound so simple. Thank you so much!!
@stratascratch 3 years ago
Thanks so much for watching!
@everettwitt9464 4 years ago
This is an incredible channel - thanks a ton, man!
@stratascratch 4 years ago
Thanks so much and thanks for watching. Let me know if there's any topic you'd like me to cover. I love getting ideas from my audience.
@everettwitt9464 4 years ago
@@stratascratch This is more of a concept. Data science is so much more than using algorithms to analyze data. I always heard and always think about the concept of "garbage in, garbage out", but I never really pushed further and asked how you can verify that the data you are using is not garbage. Writing scripts to clean and sanitize parts of the data is great, but how can you confirm your data is useful? I guess, in a sense, building a prediction model on the data and seeing that your predictions match reality closely is a secondhand test confirming that your data is in fact quality. However, if your model fails, there are many possible reasons why; it would be nice to know that data quality isn't one of them. When you have massive datasets, it is impossible to go through everything; can you use sampling to test validity so you can make a statement about the whole?

Then there's the whole process: having nothing, writing a scraper to get some public information, making sure the data you have is quality, adding new data every day, having the ability to fetch some other historical data and add it to your dataset. If the data is going to be massive, what platform should you use to store it? How should you query it efficiently? Once you have a nice, big, quality dataset, then you can do some fun analysis. I guess I'm rambling because I want to start collecting horse racing data, so I have thought about everything above in that context.
@stratascratch 4 years ago
@@everettwitt9464 Wow, there are a lot of topics to unpack there. On the note about data quality: you're absolutely right. How do you know if it's even quality data? How do you know the data you have collected and cleaned is going to be useful for predicting and building models? You never really know, and even if you build a great model or a bad model, you still don't really know. It could have been luck or bad luck. What's important is iterating on the model to make it better every time you optimize it, collect more data, and/or refine your features.

Which brings me to your last few thoughts about collecting data. "A massive collection of data" means a lot of things to a lot of people. You could collect millions of rows but still keep it on one db server. You could collect billions of rows and still keep it in a relational db that's distributed and sharded. Or you can go NoSQL if you need to. I never really worry about the db too much for personal use, even if I do collect millions of rows. Once you get bigger, you definitely want to go to industry tools like Snowflake and/or Hive and use other means like PySpark. But if it comes to that, there are more qualified people to handle the data so that you can keep churning out the analysis. Those are some of my initial thoughts. There's so much more though... thanks for the brain dump!
@everettwitt9464 4 years ago
@@stratascratch So what about a video or something on PySpark and handling "massive datasets" where using pandas simply doesn't cut it? Or on using Hive/Snowflake: things that you might potentially see on the job but not when doing a project on your own (because you aren't working with nearly the same magnitude of data)? And on the data quality part: there must be automated tests (like sampling from the data and testing that sample to generalize something about the population/whole dataset), no? The proxy for data quality cannot just be model accuracy, I assume? Regardless, thanks for the current videos and responses :)
@stratascratch 4 years ago
@@everettwitt9464 I would love to do a video on PySpark and Hive/Snowflake, but it's hard to get access to those tools and the massive datasets, mainly because they are industry grade. I can't use my work resources for YT videos. For hobbyists, pandas is probably fine. Otherwise, if you do need a big compute resource, your job would be able to train you on those tools. I was trained on Hive and PySpark because my job required it, but I interviewed on basic SQL and Python questions.

That's a good question about data quality. You would think there are automated tests for sampling the data, testing for normality, etc., but I haven't encountered one. It seems like they expect all data scientists to perform those data tests before they start building the model. But those tests do take a while to perform, so automated tests are a good idea. Something to consider if you're interested in providing a service for other data scientists =) I'll keep trying out new ideas and seeing what sticks. It's definitely hard when I don't have access to industry-grade tools and data.
@culpgrant21 4 years ago
Good video! Is it just a preference, using .format vs. f-strings?
@stratascratch 4 years ago
I think you can use .format or f-strings when you're just trying to print something. F-strings have gotten more popular over the years, so I'm starting to see more of that. But when you're passing SQL code to the server/database, you should use parameterized queries, which are similar to f-strings (but not quite the same). A parameterized query passes the SQL statement to the db and then passes the values, so it's a 2-step process. If you used .format to pass SQL queries to the db, it would pass the entire query in 1 step, which can make it vulnerable to SQL injection attacks. Here's more info on it: docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.execute I don't explain this in the video, but it's good to know if you're going to build anything that would go into production.
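Here's a minimal sketch of the difference, using sqlite3 and a hypothetical users table:

    import sqlite3

    conn = sqlite3.connect('demo.db')
    cursor = conn.cursor()
    user_input = "O'Brien"

    # Vulnerable: the value is spliced into the SQL string before it ever
    # reaches the database, so crafted input can rewrite the query.
    # cursor.execute("select * from users where name = '{}';".format(user_input))

    # Parameterized: the statement and the value are passed separately,
    # and the driver handles quoting/escaping.
    cursor.execute("select * from users where name = ?;", (user_input,))
    print(cursor.fetchall())
    conn.close()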
@XxShadowzeekxX 3 years ago
What do you mean by running os.system and executing the batch command? How do I do this? My files have not moved to the new datasets folder. Thank you!
@stratascratch 3 years ago
There's an os library in Python that can help you navigate the folders. I don't remember the specifics of the batch command; it probably means bulk inserting the rows of large CSV files. There's a parameter in the import function that allows for this.
@chk3967 2 years ago
If you are coding on Windows, replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the single and double quotes, since Windows uses move instead of mv).
@luzestrada9536 A year ago
@@chk3967 Thank you so much! Making these changes worked for me.
@lettalkaboutit A year ago
Your video is fantastic and it helped me, but here is where I struggle, and I don't know if you can help with the following: my entire CSV file needs to go into the results table. The dataframe created from the CSV file has columns that contain foreign keys; I mean the CSV file is a mix of values from different tables, and each column of my dataframe corresponds to a table. My use case:
1/ First step: loop through each column and compare the values in the column with the values in the corresponding table in the database; if a value does not exist, create a SQL INSERT query to add the new value. After getting all the new values into the database, do step 2.
2/ Second step: get the values from each corresponding table in the database, replace the data in each column with the table ID (foreign key), and then send the whole dataframe to the database.
Can you help me achieve this, please?
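A minimal sketch of the two-step flow described above, assuming pandas + psycopg2 and hypothetical table/column names (a 'color' column mapping to a colors lookup table):

    import pandas as pd
    import psycopg2

    df = pd.read_csv('results.csv')
    conn = psycopg2.connect(host='localhost', dbname='mydb',
                            user='user', password='password')
    cursor = conn.cursor()

    # Step 1: insert any values the lookup table doesn't have yet
    cursor.execute('select name from colors;')
    known = {row[0] for row in cursor.fetchall()}
    for value in set(df['color']) - known:
        cursor.execute('insert into colors (name) values (%s);', (value,))
    conn.commit()

    # Step 2: re-read the lookup table and swap each value for its ID
    cursor.execute('select name, id from colors;')
    id_map = dict(cursor.fetchall())
    df['color'] = df['color'].map(id_map)  # column now holds foreign keys
    # ...then write df to the results table as in the video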
@T12F34 A year ago
This is awesome! I'm wondering how I can use this to automate a .csv load into an MS SQL table. Any ideas?
@stratascratch A year ago
Yes, absolutely. Most of the script can be used for a csv --> MySQL load. You'll obviously need to change the credentials and use a MySQL wrapper to connect to your db. You'll probably need a new SQL query (written in MySQL) to insert the data. That's probably all you need to do.
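A minimal sketch of that kind of load, assuming SQLAlchemy with the pymysql driver (the credentials, file, and table names are hypothetical):

    import pandas as pd
    from sqlalchemy import create_engine

    # MySQL wrapper: a SQLAlchemy engine over the pymysql driver
    engine = create_engine('mysql+pymysql://user:password@localhost:3306/mydb')

    df = pd.read_csv('my_file.csv')
    # write the dataframe straight into a MySQL table,
    # creating or replacing it as needed
    df.to_sql('my_table', con=engine, if_exists='replace', index=False)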
@shubhanshukanungo5424 2 years ago
Hello sir, this video is really helpful. I tried to import 5 CSV files into a SQL Server database on my local machine. I didn't have any difficulties up to creating the 5 different tables in the database, but after that I got stuck while importing the values. I tried numerous methods; I don't want to change the code manually for multiple files. Please make a video on this if possible.
@wanderborn. 3 years ago
Thanks a lot. It's just what I was looking for. Just a thought: wouldn't it be right/appropriate to clean the data in the .csv first and then move it to the db?
@stratascratch 3 years ago
It would be very helpful to clean the data before moving it to the db. I created some automated cleanup functions in this video that clean up the column headers and convert data types, but I don't touch the actual data. Since the CSV is converted to a pandas dataframe, it's easy to clean up.
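A minimal sketch of that kind of automated cleanup on the dataframe before it's written to the db (the column conventions here are assumptions, not the video's exact code):

    import pandas as pd

    df = pd.read_csv('my_file.csv')

    # standardize column headers: trim, lowercase, and strip characters
    # that are awkward in SQL identifiers
    df.columns = (df.columns.str.strip()
                            .str.lower()
                            .str.replace(' ', '_')
                            .str.replace(r'[^0-9a-z_]', '', regex=True))

    # map pandas dtypes to SQL column types for the CREATE TABLE statement
    replacements = {'object': 'varchar', 'float64': 'float',
                    'int64': 'int', 'datetime64[ns]': 'timestamp'}
    col_str = ', '.join('{} {}'.format(name, replacements.get(str(dtype), 'varchar'))
                        for name, dtype in zip(df.columns, df.dtypes))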
@jacobsaddler7664 2 years ago
Hey, love the video! Just wondering if you've ever come across this issue: the kernel is running properly, but nothing within the for loops is appearing in the output. For example, if I make a new kernel and simply print('hello world'), the output shows: hello world. However, within my for loop (for k in csv_files), nothing is printing even though the kernel is still running without error. I have print('opened database successfully') in this for loop, which does NOT show up in my output. However, print('all tables have been successfully imported into the db'), outside of the for loop, DOES show up in the output. Any idea what could be causing this? I was not previously having this issue, but at some point I re-ran all cells and am no longer getting the outputs within the for loops.
@stratascratch 2 years ago
Hard to really say without looking at the code. I would just try to print an output at every step to see what's going on. And try to see if there is any data being ingested into the loop.
@jacobsaddler7664 2 years ago
@@stratascratch I appreciate the response! A good friend who knows Python was able to fix the bug for me. This vid was extremely helpful; keep up the great content!
@Lyriks_ 2 years ago
Hello StrataScratch, thanks for your videos (the whole series on automating CSV to Postgres). I couldn't help but notice your eCornell diploma in the background; is it worth it? Cheers from Switzerland
@stratascratch 2 years ago
Thanks! I actually attended Cornell for my grad school (in person, not online). I don't know much about their online program but their in-person one is definitely worth it. I had a great time =)
@prachipatel9388 A year ago
Hello Nate, how can I import a live data (CSV) file from any application (both iOS and Android)?
@stratascratch A year ago
Sorry, what do you mean?
@prachipatel9388 A year ago
@@stratascratch I mean, if I want data from any mobile application (it is in .csv form), how do I fetch it?
@stratascratch A year ago
@@prachipatel9388 I have no idea. We don't cover data collection other than from an API in this series.
@sudarshanvu4807 3 years ago
Awesome learning platform. My question is: instead of a pandas df, can we do the same project task in PySpark, PySpark SQL, and AWS RDS???
@stratascratch 3 years ago
For AWS RDS, change the language to Postgres in the platform. Unfortunately, I don't have pySpark on the platform.
@arthuramanyire526 2 years ago
Hi Nathan, thanks a lot for sharing this. I have tried to follow along but I have hit a wall. At the point where the datasets directory is created all is fine, but the code to move the CSV files into the datasets directory runs with no errors and the files are not moved. I'm using Windows and I noticed you are using a Mac; could the issue be the OS I'm using? One other thing: would you consider continuing this series by showing how to pull the data from the db into Tableau/Power BI for real-time analysis?
@stratascratch 2 years ago
There is a difference between windows and mac related to how you write the path so that's one potential issue. But there could be a lot of different reasons why it's not working since everyone's setup is different. Happy to extend this series once I have some availability.
@chk3967 2 years ago
If you are coding on Windows, replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the single and double quotes, since Windows uses move instead of mv).
@sushmita4713 2 years ago
Hi Arthur, I am sure you must have figured it out already, but here is what I did when I got the same error on Windows:

    for file in csv_files:
        original = r'{0}'.format(file)
        target = r'{0}'.format(data_dir)
        shutil.move(original, target)
@arthuramanyire526 2 years ago
@@sushmita4713 Thanks Sushmita, will try it out.
@majubodas6206 2 years ago
@@chk3967 Hey, I tried this and it finally worked! Thank you so much!!
@pulakkabir2276 3 years ago
Bro, how do I start with an (almost) empty CLASS with the headers of your data file?
@Victoria-dl9uk 3 years ago
This is absolutely amazing, Nathan. Can't thank you enough for sharing your talent for teaching with the DS community. Would there be a "quick" way to change this code to open .xls instead of .csv? And, if possible, to transform that .xls into a .csv file and proceed with the rest of your code? That would be super helpful if you have any advice (or other videos to refer to). Thanks!
@stratascratch 3 years ago
Try this resource (pythonbasics.org/read-excel/). My only advice is that Excel uploads can be messy because they have a lot of encodings (like UTF) that can't be read by pandas. You'll see weird symbols if there are decoding errors when you import your Excel file. It's the main reason why I stick with CSV files.
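A minimal sketch of the .xls-to-.csv step, assuming an engine for legacy Excel files (e.g., xlrd) is installed and a hypothetical file name:

    import pandas as pd

    df = pd.read_excel('my_file.xls')      # read the Excel sheet into a dataframe
    df.to_csv('my_file.csv', index=False)  # write it back out as CSV
    # from here, the rest of the CSV pipeline can run unchanged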
@iveependo9997 3 years ago
Hi!

    for csv in csv_files:
        mv_file = "mv '{0}' {1}".format(csv, dataset_dir)
        os.system(mv_file)

This part didn't move my files into the new datasets directory. Am I missing something? Thanks!
@stratascratch 3 years ago
Really hard to say and debug without the notebook. But I would print out the for loop and see what you have. If you think it's right, try running the code manually to see if it moved your files.
@mohammadfaruque2865 2 years ago
@@stratascratch I think it is happening because the mv command is not recognized on Windows. I have found people using shutil to move files on Windows systems.
@chk3967 2 years ago
If you are coding on Windows, replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the single and double quotes, since Windows uses move instead of mv).
@sushmita4713 2 years ago
Hi, if you are using Windows, use this:

    for file in csv_files:
        original = r'{0}'.format(file)
        target = r'{0}'.format(data_dir)
        shutil.move(original, target)
@manuelbonilla1638 11 months ago
my database created folder is empty :(
@TensaiCollectibles 2 years ago
Awesome thank you!
@danielschellhorn8838 2 years ago
Why don't you use a list comprehension to find the csv files?
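For reference, a sketch of the suggested alternative:

    import os

    # one line instead of a loop with an if statement
    csv_files = [f for f in os.listdir(os.getcwd()) if f.endswith('.csv')]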
@diaconescutiberiu7535 2 years ago
I am on Windows and the mv command is not recognized. What can I replace it with?
@stratascratch 2 years ago
Try google to get the answer?
@chk3967 2 years ago
If you are coding on Windows, replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the single and double quotes, since Windows uses move instead of mv).
@stephensander3061 3 years ago
Hi, I love your videos. Is it possible to make this script a REST API?
@stratascratch 3 years ago
I think so? One way is to use Flask to turn it into a REST API. That's maybe what I'd do, but I usually collect from APIs rather than write them, so I'm a bit of a beginner there =)
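A minimal sketch of the Flask route idea, assuming the script's logic has been wrapped in a hypothetical import_csv_to_db() function:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route('/import', methods=['POST'])
    def run_import():
        filename = request.json.get('filename')
        import_csv_to_db(filename)  # hypothetical function from the script
        return jsonify({'status': 'ok', 'file': filename})

    if __name__ == '__main__':
        app.run(port=5000)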
@stephensander3061 3 years ago
@@stratascratch Thanks man!
@Veronica.cs.9 A year ago
amazing!!!!
@anthonyobrienvicenciopalla641 3 years ago
Hi Nate, nice video :) I have a question... How do I fix this?

    COPY from stdin failed: error in .read() call: UnicodeDecodeError 'charmap' codec can't decode byte 0x81 in position 885: character maps to <undefined>
    CONTEXT: COPY tags, line 9260
@lucydavis6325 A year ago
This is an issue with the encoding; you must have a special character in your file. I was able to fix it by setting the encoding when opening the file to latin_1, i.e.:

    my_file = open(k, encoding="latin_1")
@TheCpopp 2 years ago
If you are using AWS services, you might want to import boto3 and use SDK API calls instead of psycopg2. Also, if you are using AWS, you don't want to pass your database password as plaintext in a parameter. It's better to use boto3 and pull the sensitive data from the AWS Secrets Manager service.
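A minimal sketch of that pattern, assuming a hypothetical secret named 'mydb/credentials' that stores JSON, with AWS credentials already configured in the environment:

    import json
    import boto3

    client = boto3.client('secretsmanager', region_name='us-east-1')
    response = client.get_secret_value(SecretId='mydb/credentials')
    secret = json.loads(response['SecretString'])

    # use these instead of hardcoding credentials in the script
    host, password = secret['host'], secret['password']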
@stratascratch 2 years ago
That's a good point. I've never played with boto3 but if it does what you say then I agree, it's much better than the current implementation.
@abdobourenane9294 4 years ago
What amazing code!! I love your explanation. Could you give us the GitHub link to the data so we can practice? Thank you!
@stratascratch 4 years ago
Here's the github repo URL: github.com/Strata-Scratch/csv_to_db_automation Thanks for watching!
@abdobourenane9294 4 years ago
@@stratascratch Thank you
@stratascratch 4 years ago
@@abdobourenane9294 Thanks for watching!
@javierjdaza 4 years ago
Dude, why use COPY instead of INSERT in SQL???
@stratascratch 4 years ago
You're like the only person that caught that! In reality, I prefer doing INSERTs, because when I upload data to databases it's usually in batches, repeated on a schedule. So I'd create an automatic job that would grab new data and upload the new rows into the db table, even for multi-GB files. If I used an INSERT, I would have had to batch my inserts. The end result would be the same, but I just thought it was faster to do it using COPY. Thanks for the question!
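A minimal sketch of the COPY-based load, using psycopg2's copy_expert to stream the whole file in one statement (the table/file names and credentials are hypothetical, and the table must already exist):

    import psycopg2

    conn = psycopg2.connect(host='localhost', dbname='mydb',
                            user='user', password='password')
    cursor = conn.cursor()

    copy_sql = "copy my_table from stdin with csv header delimiter ','"
    with open('my_file.csv') as f:
        cursor.copy_expert(copy_sql, f)  # stream the CSV straight into the table

    conn.commit()
    conn.close()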
@javierjdaza 4 years ago
@@stratascratch Thanks for the answer, dude. Another question: how would the code look if, for example, I had a previous table, let me call it table1 in DB1, and I just want to insert a dataframe into this table, avoiding the "copy" query? What do I need to use? Thank you, man. (Sorry for my English, I'm a Spanish guy)
@stratascratch 4 years ago
@@javierjdaza Basically you can use the pd.to_sql() method to append the data in your pandas df to table1. It could look something like this: df.to_sql('table1', engine, if_exists='append'), where engine is your SQL engine (e.g., Postgres). There are other ways, but this one uses a method from the pandas library, which is kind of nice. Let me know if that makes sense.
@javierjdaza 4 years ago
@@stratascratch Man, thank you, I'll check it out. Please keep doing these kinds of videos. A hug!
@stratascratch 4 years ago
@@javierjdaza Thanks so much man! Will keep trying to do coding videos!
@jaimejosejuarezrosales9927 7 months ago
@stratascratch Good video! Thanks for sharing. I wonder if you could find the time to finish part 3 of the code (Edited a few minutes later: I have found it!: kzbin.info/www/bejne/hGHEeZiBrrN4qtE). Great series of videos.