Find Part 1, where I build the script's functionality before automating the work here (kzbin.info/www/bejne/raKld5qXgsx5h6M). Let me know what you guys think. Is this useful?
@trongtienhoang9522 · 4 years ago
Hi Nate, your materials are fantastic! I wonder if you will offer a discount for Black Friday this year?
@stratascratch · 4 years ago
@@trongtienhoang9522 Thank you so much! Part 3 will come next week to finish off the series. I'm glad you like it. Yes, I do have a Black Friday special, which is 50% off any of the premium plans. You can use coupon code THANKS50. The discount is good until the end of the month. If you have any problems, feel free to email me at nate@stratascratch.com. Thank you!
@rezahamzeh3736 · 2 years ago
That is the difference between pure tutorials and best practices for real-world scenarios. Can you please continue producing more ETL tutorials in the same style?
@prateek2159 · 3 years ago
Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for this particular series because there's literally no one on KZbin with such guidance on DS projects, and I have been looking for one for a very long time. My placements are just 12 months away, and I really want to make a full-stack data science project. Thank you.
@gregNFL · 2 years ago
Thank you. Beautiful. So clean! Even addresses Unicode errors. The cleanup bits are an added bonus. 10/10
@Davidkiania · 2 years ago
Most helpful KZbin video series I have seen in a long, long time... thank you so much.
@Shane_Diesel_2010 · 2 years ago
You just got my sub. Nice tutorial. I really need to start learning to automate the repetitive tasks I do at work 👌
@DCCfhrue · 2 years ago
Thanks Nate for the great value! Here's a minor helpful modification for when the CSV filenames start with a number/integer, to prevent the *sqlite3.OperationalError: unrecognized token: "0001test"* error (using f-string notation for recent Python versions):

# drop table with same name
# cursor.execute("drop table if exists %s;" % (tbl_name))   # older placeholder notation style
cursor.execute(f"drop table if exists '{tbl_name}';")       # f-string version

# cursor.execute("create table %s (%s);" % (tbl_name, col_str))    # ERROR
# cursor.execute("create table '%s' (%s);" % (tbl_name, col_str))  # ERROR
cursor.execute(f"create table '{tbl_name}' ({col_str});")   # integer as filename's 1st character WORKS

Cheers!
@DCCfhrue · 2 years ago
Just found the primary key setting in the pandas.DataFrame.to_sql doc:

dataframe.to_sql(tbl_name, con=conn, dtype={'MyPKColumnnamehere': 'INTEGER PRIMARY KEY'})
@monascholtz2440 · 2 years ago
Thank you for sharing your knowledge - I enjoy your tutorials!
@nargisparvin4267 · 4 years ago
Excellent work, sir. You are making Python so easy. Great job, and I respect your work.
@stratascratch · 4 years ago
Thanks! Let me know if there are other topics you want to see.
@mysteriousbd3743 · 4 years ago
Awesome... I've been looking for code like this for a while. Thanks for sharing it!
@stratascratch · 4 years ago
Thanks! Stay tuned for part 3 in this series where I turn the code into functions so the entire script is cleaner and can be used for other projects.
@mdabulkalamazad6775 · 4 years ago
Love your teaching style.
@stratascratch · 4 years ago
Thanks so much. Glad you like it!
@oluwaseunatoyebi6796 · 4 years ago
This is incredible, my friend. Can you do a video where you load each of the 3 files into separate tables? I'd be curious to see how you dynamically handle that. Thanks for the great work.
@stratascratch · 4 years ago
In this method, you should be able to load 3 tables pretty easily with a for loop. Just make sure you close and open the connection each time. This method can handle pretty large datasets and row counts; I've imported 5gb files to db tables. Other methods that rely on INSERTs have limitations on the number of rows, not necessarily the number of files. If there are over 100K+ rows, use the batch parameter found in the psycopg2 documentation. Then just batch the rows into the table 1K at a time. Hope that provides some more insight into the limitations of this method.
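Here's a rough sketch of what that loop could look like with psycopg2's execute_batch (the connection params, the dataframes dict, and the table names are placeholders, not the exact code from the video):

import psycopg2
from psycopg2.extras import execute_batch

db_params = dict(host="localhost", dbname="mydb", user="me", password="secret")  # placeholders

def load_table(table_name, df):
    conn = psycopg2.connect(**db_params)  # open a fresh connection per file...
    cursor = conn.cursor()
    cols = ", ".join(df.columns)
    values = ", ".join(["%s"] * len(df.columns))
    insert_query = f"INSERT INTO {table_name} ({cols}) VALUES ({values})"
    # batch the rows into the table 1K at a time instead of one INSERT per row
    execute_batch(cursor, insert_query, df.itertuples(index=False, name=None), page_size=1000)
    conn.commit()
    cursor.close()
    conn.close()  # ...and close it before moving on to the next file

for table_name, df in dataframes.items():  # e.g. the 3 CSVs loaded as pandas dataframes
    load_table(table_name, df)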
@adiliophi · 2 years ago
Your help is very useful! Thanks for sharing
@harshamadhwani8438 · 3 years ago
Simply superb... love the way you explain and make it sound so simple. Thank you so much!!
@stratascratch · 3 years ago
Thanks so much for watching!
@everettwitt9464 · 4 years ago
This is an incredible channel - thanks a ton, man.
@stratascratch · 4 years ago
Thanks so much and thanks for watching. Let me know if there's any topic you'd like me to cover. I love getting ideas from my audience.
@everettwitt9464 · 4 years ago
@@stratascratch This is more of a concept - Data Science is so much more than using algorithms to analyze data. I always heard and always think about the concept of "garbage in, garbage out", but I never really pushed further and asked: how can you verify that the data you are using is not garbage? Writing scripts to clean and sanitize parts of the data is great, but how can you confirm your data is useful? I guess in a sense making a prediction model on the data and seeing that your predictions match reality closely is probably a secondhand test confirming that your data is in fact quality. However, if your model fails there are many possible reasons why - it would be nice to know that your data quality isn't one of them. When you have data sets that are massive it is impossible to go through everything - can you use sampling to test the validity so you can try and make a statement about the whole? The process of having nothing, writing a scraper to get some public information, making sure the data you have is quality, adding new data every day, having the ability to fetch some other historical data and then add it to your dataset. If the data is going to be massive, what platform should you use to store it? How should you query it efficiently? When you have a nice, big, quality dataset then you can do some fun analysis. I guess I'm rambling because I want to start collecting horse racing data, so I have thought about everything above in that context.
@stratascratch · 4 years ago
@@everettwitt9464 Wow, there's a lot of topics to unpack there. On the note about data quality -- you're absolutely right. How do you know if it's even quality data? How do you know the data you have collected and cleaned is going to be useful for predicting and building models? You never really know, and even if you build a great model or a bad model, you still don't really know. It could have been luck or bad luck. What's important is iterating on the model to make it better every time you optimize your model, collect more data, and/or refine your features. Which brings me to your last few thoughts about collecting data. "A massive collection of data" means a lot of things to a lot of people. You could collect millions of rows but still keep it on 1 db server. You could collect billions of rows and still keep it on a relational db that's distributed and sharded. Or you can go NoSQL if you need to. I never really worry about the db too much for personal use, even if I do collect millions of rows. Once you get bigger, you definitely want to go to industry tools like Snowflake and/or HIVE and use other means like pySpark. But if it comes to that -- there are more qualified people to handle the data so that you can keep churning out the analysis. Those are some of my initial thoughts. There's so much more though... thanks for the brain dump!
@everettwitt9464 · 4 years ago
@@stratascratch So what about a video or something on pySpark and handling "massive datasets" where using pandas simply doesn't cut it? Or using HIVE/Snowflake - things that you might potentially see on the job but not when doing a project on your own (because you aren't working with nearly the same magnitude of data)? And on the data quality part - there must be automated tests (like sampling from the data, testing that sample to generalize something about the population/whole dataset), no? The proxy for data quality cannot just be model accuracy, I assume? Regardless, thanks for the current videos and responses :)
@stratascratch · 4 years ago
@@everettwitt9464 I would love to do a video on pySpark and HIVE/Snowflake, but it's hard to get access to those tools and the massive datasets, mainly because they are industry grade. I can't use my work resources for YT videos. For hobbyists, pandas is probably fine. Otherwise, if you do need a big compute resource, your job would be able to train you on those tools. I was trained on HIVE and pySpark because my job required it. But I interviewed on basic SQL and Python questions. That's a good question about data quality. You would think there are automated tests for sampling the data, testing for normality, etc., but I haven't encountered one. It seems like they expect all data scientists to perform those data tests before they start building the model. But those tests do take a while to perform, so automated tests are a good idea. Something to consider if you're interested in providing a service for other data scientists =) I'll keep trying out new ideas and seeing what sticks. It's definitely hard when I don't have access to industry grade tools and data.
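Just to sketch what one of those automated checks could look like (the file and column names here are made up):

import pandas as pd
from scipy import stats

df = pd.read_csv("my_data.csv")                                    # hypothetical dataset
sample = df["my_column"].dropna().sample(n=5000, random_state=42)  # sample instead of scanning everything
stat, p_value = stats.normaltest(sample)                           # D'Agostino-Pearson normality test
print("looks roughly normal" if p_value > 0.05 else "fails the normality check")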
@culpgrant21 · 4 years ago
Good video! Is it just a preference between using .format and f-strings?
@stratascratch · 4 years ago
I think you can use .format or f-strings when you're just trying to print something. F-strings have gotten more popular over the years, so I'm starting to see more of that. But when you're passing SQL code to the server/database, you should use parameterized queries, which are similar to f-strings (but not quite the same). A parameterized query will pass the SQL statement to the db and then pass the values, so it's a 2-step process. If you used .format to pass SQL queries to the db, it would pass the entire query in 1 step, which can make it vulnerable to SQL injection attacks. Here's more info on it: docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.execute I don't explain this in the video, but it's good to know if you're going to build anything that would go into production.
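To make the 1-step vs. 2-step difference concrete, here's a minimal sqlite3 sketch (the table and value are made up):

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS students (name TEXT)")

user_value = "Robert'); DROP TABLE students; --"  # classic injection payload

# 1 step: the value is baked into the SQL string before the db ever sees it (injectable)
# cursor.execute("SELECT * FROM students WHERE name = '{}'".format(user_value))

# 2 steps: statement first, values separately; the driver escapes the value safely
cursor.execute("SELECT * FROM students WHERE name = ?", (user_value,))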
@XxShadowzeekxX · 3 years ago
What do you mean by running the os system and executing the batch command? How do I do this? My files have not moved to the new datasets folder. Thank you!
@stratascratch · 3 years ago
There's an os library in Python that can help you navigate the folders. I don't remember the specifics about the batch command. It probably means bulk inserting the rows in large CSV files; there's a parameter in the import function that allows for this.
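For the file-moving part, here's a portable sketch (not the exact code from the video); shutil.move behaves the same on Windows and Mac, which sidesteps the mv vs. move issue that comes up below:

import os
import shutil

dataset_dir = "datasets"
os.makedirs(dataset_dir, exist_ok=True)           # create the folder if it doesn't exist

csv_files = [f for f in os.listdir(os.getcwd()) if f.endswith(".csv")]
for f in csv_files:
    shutil.move(f, os.path.join(dataset_dir, f))  # portable across operating systems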
@chk3967 · 2 years ago
If you are coding on Windows, then replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the 'single quotes' and "double quotes" with each other), since Windows uses the move command instead of mv.
@luzestrada9536 · 1 year ago
@@chk3967 Thank you so much! Making these changes worked for me.
@lettalkaboutit · 1 year ago
Your video is fantastic and it helped me, but here is where I struggle, and I don't know if you can help with the following: my entire csv file needs to go into the results table, but the dataframe created from the csv file has columns that contain foreign keys. I mean, the csv file is a mix of different tables' values; each column of my dataframe represents a table. My use case is:
1/ First step: I have to loop through each column and compare the values in the column with the values from the corresponding table in the database; if a value does not exist, then I create a SQL insert query to add this new value. After getting all new values into the database, I do step 2.
2/ Second step: I get the values from each corresponding table in the database, and for each column I have to replace the data with the table ID (foreign key), and then send the whole dataframe to the database.
Can you help me achieve this please?
@T12F34 · 1 year ago
This is awesome! I'm wondering how I can use this to automate a .csv load to an MS SQL table. Any ideas?
@stratascratch · 1 year ago
Yes, absolutely. Most of the script can be used for a CSV --> MySQL load. You'll obviously need to change the credentials and use a MySQL wrapper to connect to your db. You'll probably need a new SQL query (written in MySQL) to insert the data. That's probably all you need to do.
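A minimal sketch of that variant, assuming SQLAlchemy with the PyMySQL driver (the connection string, file name, and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost:3306/mydb")  # placeholder credentials
df = pd.read_csv("my_file.csv")
df.to_sql("my_table", engine, if_exists="append", index=False)  # insert the rows into MySQL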
@shubhanshukanungo5424 · 2 years ago
Hello sir, this video is really helpful. I tried to import 5 csv files into a SQL Server database on my local machine. I didn't have any difficulties up to creating the 5 different tables in the database, but after that I got stuck while importing values. I tried numerous methods. I don't want to change the code manually for multiple files. Please make a video on this if possible.
@wanderborn. · 3 years ago
Thanks a lot. It's just what I was looking for to learn how. Just a thought: wouldn't it be right/appropriate to clean the data in the .csv and then move it to the db?
@stratascratch · 3 years ago
It would be very helpful to clean the data before moving it to the db. I created some automated cleanup functions in this video that clean up the column headers and convert data types. But I don't touch the actual data. Since the CSV is converted to a pd dataframe, it's easy to clean it up.
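As one example, the header cleanup could look something like this (a sketch, not the exact functions from the video; the file name is made up):

import pandas as pd

df = pd.read_csv("my_file.csv")
# lowercase the headers and replace characters that tend to break SQL identifiers
df.columns = [c.lower().strip().replace(" ", "_").replace("-", "_") for c in df.columns]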
@jacobsaddler7664 · 2 years ago
Hey, love the video! Just wondering if you've ever come across this issue: the kernel is running properly, but nothing within the for loops is appearing in the output. For example, if I make a new kernel and simply print('hello world'), the output shows as: hello world. However, within my for loops (for k in csv_files), nothing is printing even though the kernel is still running without error. I have print('opened database successfully') in this for loop, which DOES NOT show up in my output. However, print('all tables have been successfully imported into the db'), outside of the for loop, DOES show up in the output. Any idea what could be causing this? I was not previously having this issue, but at some point I re-ran all cells and am no longer getting the outputs within the for loops.
@stratascratch · 2 years ago
Hard to really say without looking at the code. I would just try to print an output at every step to see what's going on. And try to see if there is any data being ingested into the loop.
@jacobsaddler7664 · 2 years ago
@@stratascratch I appreciate the response! A good friend who knows Python was able to fix the bug for me. This vid was extremely helpful; keep up the great content!
@Lyriks_ · 2 years ago
Hello Stratascratch, thanks for your videos (the whole series on automating csv to Postgres). I couldn't help but notice your eCornell diploma in the background; is it worth it? Cheers from Switzerland
@stratascratch · 2 years ago
Thanks! I actually attended Cornell for my grad school (in person, not online). I don't know much about their online program but their in-person one is definitely worth it. I had a great time =)
@prachipatel9388 · 1 year ago
Hello Nate, how can I import a live data (csv) file from any application (both iOS and Android)?
@stratascratch · 1 year ago
Sorry, what do you mean?
@prachipatel9388 · 1 year ago
@@stratascratch I mean, if I want data from any mobile application (it is in .csv form), how do I fetch it?
@stratascratch · 1 year ago
@@prachipatel9388 I have no idea. We don't cover data collection other than from an API in this series.
@sudarshanvu4807 · 3 years ago
Awesome learning platform. Here my question is: instead of a pandas df, can we do the same project task in PySpark, PySpark SQL, and AWS RDS?
@stratascratch · 3 years ago
For AWS RDS, change the language to Postgres in the platform. Unfortunately, I don't have pySpark on the platform.
@arthuramanyire526 · 2 years ago
Hi Nathan, thanks a lot for sharing this. I have tried to follow along, but I have hit a wall. At the point where the datasets directory is created, all is fine, but the code to move the csv files to the datasets directory runs with no errors and the files are not moved. I'm using Windows and I noticed you are using Mac; could the issue be the OS I'm using? One other thing: would you consider continuing this series by showing how to pull the data from the db into Tableau/Power BI for real-time analysis?
@stratascratch · 2 years ago
There is a difference between Windows and Mac related to how you write the path, so that's one potential issue. But there could be a lot of different reasons why it's not working, since everyone's setup is different. Happy to extend this series once I have some availability.
@chk3967 · 2 years ago
If you are coding on Windows, then replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the 'single quotes' and "double quotes" with each other), since Windows uses the move command instead of mv.
@sushmita4713 · 2 years ago
Hi Arthur, I am sure you must have figured it out already, but here is what I did when I got the same error on Windows:

import shutil

for file in csv_files:
    original = r'{0}'.format(file)
    target = r'{0}'.format(data_dir)
    shutil.move(original, target)
@arthuramanyire526 · 2 years ago
@@sushmita4713 Thanks Sushmita, I will try it out.
@majubodas6206 · 2 years ago
@@chk3967 Hey, I tried this and it finally worked! thank you so much!!
@pulakkabir2276 · 3 years ago
Bro, how do I start with an (almost) empty CLASS with the headers of your data file?
@Victoria-dl9uk · 3 years ago
This is absolutely amazing, Nathan. Can't thank you enough for sharing your talent for teaching with the DS community. Would there be a "quick" way to change this code to open .xls instead of .csv? And, if possible, to transform that .xls into a .csv file and proceed with the rest of your code? That would be super helpful if you have any advice (or other videos to refer to). Thanks!
@stratascratch · 3 years ago
Try this resource (pythonbasics.org/read-excel/). My only advice is that Excel uploads can be messy because they have a lot of encodings (like UTF) that can't be read by pandas. You'll see weird symbols if there are decoding errors when you import your Excel file. It's the main reason why I stick with CSV files.
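If you do want to convert first, here's a quick sketch (the file name is made up; reading .xls needs the xlrd engine installed, and .xlsx needs openpyxl):

import pandas as pd

df = pd.read_excel("my_file.xls")
df.to_csv("my_file.csv", index=False)  # then run the rest of the pipeline on the CSV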
@iveependo9997 · 3 years ago
hi!

for csv in csv_files:
    mv_file = "mv '{0}' {1}".format(csv, dataset_dir)
    os.system(mv_file)

This part didn't move my files into the new datasets directory. Am I missing something? Thanks!
@stratascratch · 3 years ago
Really hard to say and debug without the notebook. But I would print out the for loop and see what you have. If you think it's right, try running the code manually to see if it moved your files.
@mohammadfaruque2865 · 2 years ago
@@stratascratch I think it is happening because the mv command is not recognized on Windows. I have found people are using shutil to move files on Windows systems.
@chk3967 · 2 years ago
If you are coding on Windows, then replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the 'single quotes' and "double quotes" with each other), since Windows uses the move command instead of mv.
@sushmita4713 · 2 years ago
Hi, if you are using Windows, use this:

import shutil

for file in csv_files:
    original = r'{0}'.format(file)
    target = r'{0}'.format(data_dir)
    shutil.move(original, target)
@manuelbonilla1638 · 11 months ago
My created database folder is empty :(
@TensaiCollectibles · 2 years ago
Awesome, thank you!
@danielschellhorn8838 · 2 years ago
Why don't you use a list comprehension to find the csv files?
@diaconescutiberiu7535 · 2 years ago
I am on Windows and the mv command is not recognized. What can I replace it with?
@stratascratch · 2 years ago
Try Google to get the answer?
@chk3967 · 2 years ago
If you are coding on Windows, then replace "mv '{0}' {1}" with 'move "{0}" {1}' (i.e., swap the 'single quotes' and "double quotes" with each other), since Windows uses the move command instead of mv.
@stephensander3061 · 3 years ago
Hi, I love your videos. Is it possible to make this script a REST API?
@stratascratch · 3 years ago
I think so? I think one way is to use Flask to turn it into a REST API. That's maybe what I'd do, but I usually collect from APIs, not write them, so I'm a bit of a beginner there =)
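A bare-bones sketch of that idea (load_csv_to_db is a stand-in for the script's loading logic, and the route/payload are made up):

from flask import Flask, request

app = Flask(__name__)

@app.route("/load", methods=["POST"])
def load():
    csv_path = request.json["csv_path"]  # hypothetical request payload
    load_csv_to_db(csv_path)             # placeholder for the script's main function
    return {"status": "ok"}

if __name__ == "__main__":
    app.run()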
@stephensander3061 · 3 years ago
@@stratascratch Thanks man!
@Veronica.cs.9 · 1 year ago
amazing!!!!
@anthonyobrienvicenciopalla641 · 3 years ago
Hi Nate, nice video :) I have a question... How do I fix this?

COPY from stdin failed: error in .read() call: UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 885: character maps to <undefined>
CONTEXT: COPY tags, line 9260
@lucydavis6325 · 1 year ago
This is an issue with the encoding; you must have a special character in your file. I was able to fix this by setting the encoding to latin_1 when opening the file, i.e.:

my_file = open(k, encoding="latin_1")
@TheCpopp · 2 years ago
If you are using AWS services, you might want to import boto3 and use SDK API calls instead of psycopg2. Also, if you are using AWS, you don't want to pass your database password as plaintext in a parameter. It's better to use boto3 and pull the sensitive data from the AWS Secrets Manager service.
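For example, a sketch of the Secrets Manager part (the region and secret name are placeholders):

import json
import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")
response = client.get_secret_value(SecretId="my-db-credentials")  # hypothetical secret name
creds = json.loads(response["SecretString"])
# then connect with creds["username"] / creds["password"] instead of hard-coded values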
@stratascratch · 2 years ago
That's a good point. I've never played with boto3 but if it does what you say then I agree, it's much better than the current implementation.
@abdobourenane9294 · 4 years ago
What amazing code!! I love your explanation. Could you give us the GitHub link to the data so we can practice? Thank you!
@stratascratch · 4 years ago
Here's the GitHub repo URL: github.com/Strata-Scratch/csv_to_db_automation. Thanks for watching!
@abdobourenane9294 · 4 years ago
@@stratascratch Thank you
@stratascratch · 4 years ago
@@abdobourenane9294 Thanks for watching!
@javierjdaza · 4 years ago
Dude, why use COPY instead of INSERT INTO in SQL???
@stratascratch · 4 years ago
You're like the only person that caught that! In reality, I prefer doing INSERTs because when I upload data to databases, it's usually in batches, repeated on a schedule. So I'd create an automatic job that would grab new data and upload the new rows into the db table (even for 5gb files). If I used an INSERT here, I would have had to batch my inserts. The end result would be the same, but I just thought it was faster to do it using COPY. Thanks for the question!
@javierjdaza · 4 years ago
@@stratascratch Thanks for the answer, dude. Another question: what would the code look like if, for example, I had an existing table, let me call it table1 in DB1, and I just want to insert a dataframe into this table? Avoiding the "copy" query, what do I need to use? Thank you, man. (Sorry for my English, I'm a Spanish guy.)
@stratascratch · 4 years ago
@@javierjdaza Basically you can use the pd.to_sql() method to append the data in your pandas df to table1. It could look something like this:

df.to_sql('table1', engine, if_exists='append')

where `engine` is your SQL engine, like Postgres. There are other ways, but this way uses a method from the pandas library, which is kind of nice. Let me know if that makes sense.
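Spelled out a bit more, assuming a SQLAlchemy Postgres engine (the connection string and file name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/DB1")  # placeholder credentials
df = pd.read_csv("my_file.csv")
df.to_sql("table1", engine, if_exists="append", index=False)  # appends the rows to the existing table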
@javierjdaza · 4 years ago
@@stratascratch Man, thank you, I'll check it out. Please keep doing these kinds of videos. A hug!
@stratascratch · 4 years ago
@@javierjdaza Thanks so much man! Will keep trying to do coding videos!
@jaimejosejuarezrosales9927 · 7 months ago
@stratascratch Good video! Thanks for sharing. I wonder if you could find the time to finish part 3 of your code. (*Edited a few minutes later:* I have found it!: kzbin.info/www/bejne/hGHEeZiBrrN4qtE) Great series of videos.