I know right?! I guess not many people like to watch me code...
@IrakliChitishvili 3 years ago
@@stratascratch I think it's just a matter of time before your channel and the work you do take off.
@flipcase 3 years ago
I second that!
@toilinginobscurity3091 3 years ago
There's really no rhyme or reason in the universe. This channel and these tutorials are gold, but instead I am bombarded by ML channels that play with toy datasets and brand themselves with "all you need" clickbait. You, Sir, are a saint.
@stratascratch 3 years ago
haha =) I know exactly what you mean. Too many of those channels around.
@Networkprofessor 2 years ago
Just what I needed! This will take you to a whole new level. Thank you.
@stratascratch 2 years ago
You are welcome. Thank you for the nice comment. 😀
@prateek2159 3 years ago
Hey Nate, your videos are just too good. I love how your channel is so dedicated to real-world data science. By the way, I noticed that you started a video series, "For your Data Science Project", and I really want you to continue making videos for this particular series, because there's literally no one on YouTube with such guidance on DS projects and I have been looking for one for a very long time. I have my placements in just 12 months and I really want to build a full-stack data science project. Thank you.
@stratascratch 3 years ago
Thanks for the kind words! I have created a series of these videos, from importing a pd df into a db to grabbing data from an API. I think most of the videos in the series should get you started on grabbing and managing data. Are there other types of skills you're interested in?
@jaquelinerazzo 3 years ago
I've already played back the video like 4 times just to watch the ads because I don't know how else to thank you! Great channel, great video. :)
@andrealves3104 5 months ago
I was having weird issues inserting data via the pandas to_sql method into a PostgreSQL database on AWS while passing it through a SQLAlchemy engine API interface. For now, adding it manually through a SQL query while iterating row by row in pandas worked fine, both for my project's purpose and for the video's goal. All in all, thank you very much for the overall explanation.
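Since to_sql trips people up here, the row-by-row fallback this comment describes can be sketched like so (stdlib sqlite3 stands in for the Postgres/SQLAlchemy connection; the table, columns, and rows are made up for illustration — with psycopg2 you would use %s placeholders instead of ?):

```python
import sqlite3

# hypothetical video stats; in the video these rows come from the YouTube API
rows = [
    {"video_id": "a1", "title": "Intro", "view_count": 100},
    {"video_id": "b2", "title": "Pandas to SQL", "view_count": 250},
]

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres connection
conn.execute("CREATE TABLE videos (video_id TEXT, title TEXT, view_count INTEGER)")

insert = "INSERT INTO videos (video_id, title, view_count) VALUES (?, ?, ?)"
for row in rows:  # mirrors iterating the dataframe one row at a time
    conn.execute(insert, (row["video_id"], row["title"], row["view_count"]))
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM videos").fetchone()[0]
print(count)  # 2
```

Slower than a bulk load, but each failed row is easy to isolate and debug.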
@RenatoCosta1 2 years ago
Hey Nate, amazing stuff! I think it would be great if you could do two more videos covering the topics you've mentioned in your video about the only project one would need to get a job in data science. I guess the next videos would be about machine learning modelling, and the last one on how to deploy it online. That would be awesome! Thanks anyway
@stratascratch 2 years ago
We're planning to publish 1 data project a month! When they are released, it'd be great if you could provide feedback. For example: did we dive deep enough into the technical aspects of the project? Did we dive too deep? Is the video too long or too short? This will help us iterate on the videos so we can provide something of value for all viewers. Thanks for the feedback!
@its_me7363 3 years ago
A much-awaited video for me... thanks for your time and effort, Nate
@stratascratch 3 years ago
Thanks for watching! I just updated the description with a link to the github repo that has the final version of this project. I'd definitely take a look at that and follow along with the notebook. I felt a bit too rushed on this video trying to explain all the concepts. But if there's anything to take away from this video, it's (1) how to properly structure your functions and (2) how to solve for memory & performance issues when potentially dealing with millions and billions of rows of data.
@its_me7363 3 years ago
@@stratascratch How do industry people manage large data stored in databases while exploring it? What are the best practices for data scientists or analysts in these cases?
@stratascratch 3 years ago
@@its_me7363 The dbs are optimized to handle large amounts of data with speed. You definitely won't be using any Postgres or MySQL dbs. You'll be using Hive or Greenplum or Snowflake. Also, depending on what you are trying to do, you will likely move away from pandas and onto PySpark for some work. All of this is taught when you join the company. Most people know that not all entry-level DSs have used these tools.
@its_me7363 3 years ago
@@stratascratch Are these dbs similar to PostgreSQL? And what do you mean by moving away from pandas... does the industry not use pandas in its projects?
@stratascratch 3 years ago
@@its_me7363 Yes, the dbs are similar to Postgres. There are just minor differences in syntax, but it's easy to learn. The industry does use pandas, but when dealing with large amounts of data you sometimes want to work with something built to handle it, and that's PySpark.
@quadrialli3715 1 year ago
Great and detailed explanation
@zynkers401 3 years ago
Great video Nate! You really have the spirit for teaching. The vocabulary and content are easy to understand and follow. I hope StrataScratch has continued exponential growth.
@stratascratch 3 years ago
Thanks for the kind words! And thanks for watching!
@codingstyle9480 3 years ago
Hi, thank you for the video. I have a question: how would you go about updating your database in real time without manually running the code? Perhaps we could design the code so that it is triggered when something changes in the source data, inserting or updating data automatically in real time (with some delay, of course).
@muhammedmarong6645 3 years ago
Hey, did you figure that out?
@pashlenenaidoo3695 2 years ago
I would also like to know if you have achieved this.
@stratascratch 2 years ago
You would need something like Airflow or Jenkins to schedule the trigger.
@caioriet458 2 years ago
I would like to know if you solved this problem!
@stratascratch 2 years ago
@@caioriet458 Yes, I left a comment. You'll need to use a scheduler like Airflow in order to keep updating your db without manually running the code. The scheduler will run it for you.
@awds121 3 years ago
Hey Nate! Great video! I have some feedback. First, the videos are very helpful and you should definitely continue doing more like this. Here are some ideas: how to schedule scripts to run on AWS; create tables, analyze data, and create a dashboard. Second, the playlist that includes this video is unordered. It is not hard to figure out which video should be watched first, but it doesn't hurt to sort it in the order you want the videos watched. Third, AWS doesn't include PostgreSQL in the free tier (at least not anymore). Just a heads up. Keep up the good work!
@JeffersonCanedo 2 years ago
I'm so glad I found this video. Thank you!
@stratascratch 2 years ago
You're welcome.
@bodhid3091 2 years ago
Good video. A couple of questions: Why do you store the new rows in another dataframe rather than calling insert_into_table directly? Any reason you don't use the PostgreSQL INSERT ON CONFLICT option to create an upsert, combining the insert and update into one call?
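For the second question, an upsert would indeed collapse the two calls into one statement. A minimal sketch of ON CONFLICT (shown with stdlib sqlite3, which shares this syntax, so the example is self-contained; table and values are made up, and Postgres via psycopg2 would use %s placeholders instead of ?):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ON CONFLICT needs a unique constraint to detect duplicates, hence the PRIMARY KEY
conn.execute("CREATE TABLE videos (video_id TEXT PRIMARY KEY, title TEXT, view_count INTEGER)")
conn.execute("INSERT INTO videos VALUES ('abc', 'First upload', 100)")

# one statement: insert a new row, or update the metrics if video_id already exists
upsert = """
INSERT INTO videos (video_id, title, view_count)
VALUES (?, ?, ?)
ON CONFLICT (video_id) DO UPDATE SET
    title = excluded.title,
    view_count = excluded.view_count
"""
conn.execute(upsert, ("abc", "First upload", 150))  # existing -> updated
conn.execute(upsert, ("xyz", "New video", 10))      # new -> inserted

rows = sorted(conn.execute("SELECT * FROM videos"))
print(rows)  # [('abc', 'First upload', 150), ('xyz', 'New video', 10)]
```

`excluded` refers to the row that failed to insert, which is what makes the update half of the upsert work.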
@javierjdaza 3 years ago
Brother, I never miss a video. This is pure gold!! Keep doing this please. Can I connect with you on LinkedIn? Thanks bro, you're the best
@stratascratch 3 years ago
Thanks so much for watching! And glad you like my videos =) Yes, please connect with me on LinkedIn. You can search for StrataScratch and I should pop up (I think?). If not, let me know.
@javierjdaza 3 years ago
@@stratascratch I can't find you. :(
@stratascratch 3 years ago
@@javierjdaza Sorry about that. Here you go! linkedin.com/in/nathanrosidi
@wisjnujudho3152 2 years ago
Hi Nate, is there any alternative to AWS RDS, a cloud database server that doesn't require a credit card even on the free tier? Alternatively, is it possible (maybe) to use Heroku Postgres as the database? Thx
@stratascratch 2 years ago
Hi Wisjnu, I found this article helpful. You might want to check it out: www.lastweekinaws.com/blog/10-free-cloud-databases-you-should-consider-and-1-you-shouldnt/
@elextures 9 months ago
I'm having a problem with the channel ID. In CHANNEL_ID, I put my own channel, but each time I pull the response, I get a different channel.
@GauravKumar-xl1fs 3 years ago
Please make more videos in this series and on pipelines
@stratascratch 3 years ago
glad you liked the videos. Will do if there are more views!
@PATRICKCHUAD 3 years ago
Thanks Nate for your video on pandas. I will apply this to my project in Python. Thanks,
@stratascratch 3 years ago
Great! Happy to help, thanks for watching!
@TripPiper 2 years ago
If your connection to AWS fails, try creating a new inbound security group that allows all access. But also learn how to create a connection with TCP or some form of security, because in production you'd never have inbound connections open to all access.
@denyseperezdevera7188 2 years ago
Hello! Do you have any idea why it returns "database "database-1" does not exist"?
@williamausenka9251 3 years ago
Great video! One question, what ML implementation would you do with that data? I'm trying to come up with something, but no success so far.
@jessinthamathew3032 3 years ago
Hi Nate! Really appreciate your content, its relevance and quality. So much to learn with each video! I wanted to ask why you opted for PostgreSQL specifically? Can I use MS SQL Server too?
@stratascratch 3 years ago
I'm just used to postgres and it's open source. MS SQL is fine to learn as well. In the future, the platform will be able to handle both db engines but currently it only uses postgres. Thanks for watching!
@acsrr4288 2 years ago
What library shall we use to load data to an Azure cloud DB?
@stratascratch 2 years ago
You can use various libraries and frameworks to connect.
@acsrr4288 2 years ago
@@stratascratch Can you recommend a couple for us to explore? Thx
@yashgupte3044 2 years ago
Getting a connection timed out error while connecting to the RDS instance. I have allowed all traffic in the inbound rules of the security group.
@stratascratch 2 years ago
Hi Yash, this link could help with troubleshooting: aws.amazon.com/premiumsupport/knowledge-center/rds-oracle-connection-errors/
@jakubjankovic4144 2 years ago
Have you managed to solve it? I am having the same issue
@llamashockz 2 years ago
Having the same issue for hours; nothing seems to fix it.
@jakubjankovic4144 2 years ago
@@llamashockz I ended up using ElephantSQL instead of AWS to host the db and it resolved the issue for me.
@hogthebuilder 2 years ago
2:20 in and I know that this is about to be the video for me.
@MartinoxxHD 3 years ago
Nice vid, I actually learned new things! One question: what's the point of creating a temporary DF for the new videos and then UPDATing them in a new for loop? Can't we just UPDATE and INSERT the values in the same for loop?
@stratascratch 3 years ago
I hold the new videos in a temp df so that I only need to upload the new videos or replace existing videos with new metrics. I'm trying to stay away from updating all videos, for performance. Then I do the update and insert all at once, but only for the new vids. Hope that makes sense. Thanks for watching!
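The temp-df split described here can be sketched with a boolean mask; the ids and columns below are made up, and the set of existing ids would really come from a SELECT against the db:

```python
import pandas as pd

db_ids = {"a1", "b2"}  # video ids already stored in the db table (hypothetical)
df = pd.DataFrame({"video_id": ["a1", "b2", "c3"], "view_count": [150, 260, 10]})

# existing videos get their metrics UPDATEd; only the rest go into the temp df
new_vids_df = df[~df["video_id"].isin(db_ids)]
print(new_vids_df["video_id"].tolist())  # ['c3']
```

Only `new_vids_df` then needs INSERTs, which is what keeps the job cheap as the table grows.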
@suprithm3802 2 years ago
I have a column of data type CLOB. I am not able to update the data, but the append function works fine.
@suprithm3802 2 years ago
Any suggestions?
@andrew6233 2 years ago
Newbie question: what is the database manager you show at 11:00? Thank you so much for these videos.
@stratascratch 2 years ago
It's DataGrip!
@stratascratch 2 years ago
It's my favorite, BTW. I hate in-browser db managers.
@andrew6233 2 years ago
@@stratascratch Thanks! It looks really nice + convenient to work with compared to what I've been using.
@noobshady 2 years ago
That was really helpful.
@stratascratch 2 years ago
Glad it was helpful!
@khairulhakimi508 2 years ago
I'm new to this field and have one noob question. Why do people (DA, DE, DS) use Python (pandas DataFrames) for data ingestion into SQL? Why not just upload the CSV into SSMS using a query or the flat file import?
@stratascratch 2 years ago
It's basically to allow processing or manipulation of the column names and data, if you want to do that.
@ruthfussee5503 3 years ago
Hi Nate, thank you for this video, super useful; I was scanning the internet for such a solution but could not find it. I am wondering: if there are many columns (e.g. 100 columns), wouldn't it be inefficient to type in row['column1'], ..., row['column100']? Is there a way to put the column information in a tuple, list, or dictionary and pass it all through, like after SET or in vars_to_update? Another issue is when the Excel column names have spaces (different from the SQL columns); how do I then write the code, as I keep getting syntax errors? Lastly, in the real world the Excel files sometimes don't follow the template we stipulated; how should I design a validation check that is useful?
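Not from the video itself, but the first two questions here share a common pattern: normalize the headers first, then build the SQL from the column list instead of typing each name. A small sketch with made-up headers:

```python
import pandas as pd

# made-up Excel-style headers with the spaces/case problems described above
df = pd.DataFrame({"Video Title": ["Intro"], "View Count ": [100]})

# normalize headers into valid SQL identifiers: trim, lowercase, underscore
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# build the SET clause from the column list instead of typing row['column1'] ... row['column100']
set_clause = ", ".join(f"{col} = %s" for col in df.columns)
print(set_clause)  # video_title = %s, view_count = %s
```

The values can then be passed as a tuple per row, e.g. `cursor.execute(sql, tuple(row))`, so adding a column never means editing the SQL string by hand.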
@denyseperezdevera7188 2 years ago
Hi Nate, thank you so much for these amazing resources! Do you know why, when I try to connect to the database, I get "OperationalError: FATAL: database "database-1" does not exist"?
@stratascratch 2 years ago
Do you have a db? Did you create one for the project? You need to add your credentials.
@denyseperezdevera7188 2 years ago
Just solved my issue! I had switched the DB instance identifier with the DB username.
@artyomashigov 5 months ago
@@denyseperezdevera7188 Oh thanks, I had the same issue; my username and db name are the same.
@skyace173 3 years ago
Hey Nate! Great video! How much time would you expect this to take if it were a project at work?
@stratascratch 3 years ago
This is probably a 1-2 day project depending on what is being asked of me. But it shouldn't take too long if you already know what you're doing.
@Digital-Light 2 years ago
I'm getting this error: "DatatypeMismatch: column "upload_date" is of type date but expression is of type double precision LINE 3: ... VALUES('mTL23Gd-T3g','Engagement_Ring_Animation','NaN'::flo..." Please help!
@stratascratch 2 years ago
The error is saying that you have a dtype mismatch in the upload_date column. So clean up the values in the column and get it to a dtype that the db is expecting.
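A sketch of that cleanup with pandas (the sample values are made up): the `'NaN'::float` in the error is what Postgres sees when a missing date slips through as a float NaN, so coerce the column to real dates and send missing values as None, which the driver turns into SQL NULL:

```python
import pandas as pd

# hypothetical column with one missing upload date
df = pd.DataFrame({"upload_date": ["2021-03-01", None, "2021-05-20"]})

# parse to real dates; anything unparseable becomes NaT instead of raising
parsed = pd.to_datetime(df["upload_date"], errors="coerce")

# replace NaT with None so the db driver inserts NULL rather than float NaN
df["upload_date"] = [d.date() if pd.notna(d) else None for d in parsed]
print(df["upload_date"].tolist())
```

After this, the column holds `datetime.date` objects and `None`, both of which a DATE column accepts.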
@Digital-Light 2 years ago
@@stratascratch Thank you for the reply! I love your videos.
@ShahnazMalik. 2 years ago
Inserting or updating a table with millions of records in a SQL database takes hours to complete. What is the best solution? Please advise. Thank you.
@stratascratch 2 years ago
That's normal. Sometimes jobs just take hours to complete. I would optimize for ensuring that the process doesn't break, and if it does, that you don't lose all the work. One way to do this is to batch the updates/inserts so that if the process does break halfway through, you have at least updated/inserted half the records.
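A minimal sketch of that batching idea (stdlib sqlite3 stands in for the real db; the table, row data, and batch size are made up):

```python
import sqlite3

def batched(rows, size):
    """Yield successive chunks of rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (video_id TEXT, view_count INTEGER)")

rows = [(f"vid{i}", i * 10) for i in range(10)]

# commit after each batch: if the job dies halfway through, the finished batches survive
for chunk in batched(rows, 3):
    conn.executemany("INSERT INTO metrics VALUES (?, ?)", chunk)
    conn.commit()

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(count)  # 10
```

`executemany` per batch also cuts round trips compared with one execute-and-commit per row, which is usually where the hours go.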
@amadysvlog3449 3 years ago
Great video! Thank you! But a question: how do I get access to the Amazon RDS free databases? I created an account on AWS but can't find any database-yt.
@stratascratch 3 years ago
You'll need to create your own db. That's a whole topic by itself. But you can try it by going on AWS and then creating a db in the RDS service.
@andrewraad5626 2 years ago
Hey Nate, thanks for this video. How do you deal with foreign keys? My CSV file has strings instead of the foreign keys needed to upload to a specific table that uses foreign key constraints. What do you suggest?
@stratascratch 2 years ago
I have no idea how to fix your situation. I would just change the strings to FKs. It seems some data manipulation will be needed for your issue.
@reality8221 3 years ago
Hi Nate, thanks for the great videos. Wouldn't it be simpler to insert it as part of the first for loop, instead of creating a df and then iterating through it again? Would like to know your thoughts.
@stratascratch 3 years ago
My thinking about the temp df was to only update the new videos and metrics, and leave the rest alone. I think that's what I did in this project. Hope that makes sense.
@mcmoodoo 2 years ago
Why not just use pandas.to_sql("table_name", engine), which will create the table based on the dataframe object? You can also pass a types dictionary to specify the exact types for the columns, like varchar and integer.
@stratascratch 2 years ago
Yes, you can do that. It doesn't always work the way I want it to when I use to_sql(), and I often like to have QA checks along the way, so my method helps with that. But there are multiple ways to do what I'm doing. I would choose the option that allows you to get your work done!
@Davidkiania 2 years ago
Hi Nate, any chance you might show us how to automatically run the script? Is it a cron job, or how does that work? Thanks
@stratascratch 2 years ago
Yeah, it's essentially a cron/scheduled job. I use Airflow at work but I've seen others use Jenkins.
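For a single script, a plain crontab entry is the simplest version of this scheduled job; the script and log paths below are made up for illustration:

```shell
# run the ETL script every day at 02:00 (minute hour day month weekday);
# both paths are hypothetical - substitute your own
0 2 * * * /usr/bin/python3 /home/ubuntu/youtube_etl/main.py >> /var/log/youtube_etl.log 2>&1
```

Airflow or Jenkins add retries, alerting, and dependency management on top of this basic scheduling, which is why they show up once there is more than one job.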
@SensiStarToaster 3 years ago
Nice video. But writing SQL directly gets messy to maintain; why not use something like SQLAlchemy to separate your Python code from having to worry about the details of the specific SQL implementation, and make it independent of the type of database manager used?
@stratascratch 3 years ago
Yes, you can definitely do that. I haven't used SQLAlchemy but have seen the documentation and agree with you. It could be a much better solution.
@hoanglam2814 2 years ago
Hi Nate, thanks for your videos. I have to retrieve data from almost 100 APIs and your video helped a lot. But normally it takes me almost 10 hours to update all the data/records each day. Yesterday, I came across some articles about webhooks, and they mentioned that webhooks push data, whereas we have to pull data from an API. What do you think, could I use webhooks to update data and insert new data into the table? It would only retrieve data when events happen, instead of retrieving all data like pulling through an API. For the API, I think I can use it to retrieve old data from last year.
@stratascratch 2 years ago
How many records do you have where it would take 10 hrs a day? If it takes that long, I wouldn't implement this method. Seems like you have a lot of data. I would use Airflow and set up DAGs to get those jobs scheduled. Without an understanding of what type of data you're trying to move and for what purpose, I can't tell you if webhooks are the right way.
@hoanglam2814 2 years ago
@@stratascratch I retrieve about 40 thousand records each day; these are orders from a CRM system. My company helps around 100 clients manage their business, and each of them has one account in this CRM provided by a third party (this means 100 API sources). Actually, we are building an analytics system to work on it (including the ETL process and a dashboard), but until then, I have to collect data from the 100 APIs one by one. So I tried to implement your method and it works. But, yeah, like you said, it took too much time. I also did research on Airflow, but I don't know what the difference is between using Airflow and a cron tool like Power Automate/Zapier. For me, Power Automate/Zapier is much easier to learn.
@stratascratch 2 years ago
@@hoanglam2814 The difference between Airflow and something like Zapier is that Zapier is good for low-volume jobs that a non-technical person can set up. Airflow is the industry standard and can help manage huge databases and data lakes. For your use case, an automated ETL process where you collect data from APIs should work just fine for 40K records. Just make sure you automate the pulls since there are 100 of them. You can try webhooks if you'd like; I just don't have any experience using webhooks for your use case.
@hoanglam2814 2 years ago
@@stratascratch Thank you for your short and concise answer :)) It is really helpful!
@suprithm3802 2 years ago
The code is not updating; it's only appending the new values.
@hardikvig2723 1 year ago
Where do I get the name of the database?
@stratascratch 1 year ago
It's your db that you need to set up, so it can be any name you give it.
@nqomuleya 2 years ago
InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')