It was so fascinating to see at the end of the video how that huge amount of data was compressed down to such a manageable size.
@TechTrekbyKeithGalli 2 years ago
I agree! So satisfying :)
@dhssb999 2 years ago
Never used chunksize in read_csv before; it helps a lot! Great tip, thanks.
@TechTrekbyKeithGalli 2 years ago
Glad it was helpful!!
@mjacfardk 2 years ago
In my 3 years in the field of data science, this is the best course I've ever watched. Thank you brother, keep it up.
@TechTrekbyKeithGalli 2 years ago
Glad you enjoyed!
@fruitfcker5351 2 years ago
If (and only if) you only want to read a few columns, just specify the columns you want to process from the CSV by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* call. Took about 38 seconds to read on an M1 MacBook Air.
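For anyone skimming, that tip looks roughly like this in code (the file name is a placeholder for the large events CSV used in the video):

```python
import pandas as pd

# Parse only the three columns needed for the analysis; pandas skips the rest,
# which cuts both load time and memory use.
df = pd.read_csv(
    "events.csv",  # placeholder for the large dataset from the video
    usecols=["brand", "category_code", "event_type"],
)
print(df.shape)
```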
@michaelhaag3367 2 years ago
Glad you are back, my man. I am currently in a data science bootcamp and you are way better than some of my teachers ;)
@TechTrekbyKeithGalli 2 years ago
Glad to be back :). I appreciate the support!
@jacktrainer4387 2 years ago
No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!
@TechTrekbyKeithGalli 2 years ago
Definitely, "big" very much means different things to different people and circumstances.
@Nevir202 2 years ago
Ya, I've been trying to process a book in Sheets; processing 100k words (so a few MB) the way I'm trying to is already too much lol.
@CaribouDataScience 1 year ago
Since you are working with Python, another approach would be to import the data into a SQLite db, then create some aggregate tables and views ...
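A minimal sketch of that idea, assuming the same events CSV and columns mentioned above (the file, database, and table names are made up):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("events.db")

# Stream the CSV into a SQLite table in chunks so it never has to fit in RAM.
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    chunk.to_sql("events", con, if_exists="append", index=False)

# Let SQL do the aggregation, then pull the small result back into pandas.
summary = pd.read_sql(
    """
    SELECT brand, category_code, event_type, COUNT(*) AS count
    FROM events
    GROUP BY brand, category_code, event_type
    """,
    con,
)
con.close()
```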
@ahmetsenol6104 1 year ago
It was quick and straight to the point. Very good one, thanks.
@JADanso 2 years ago
Very timely info, thanks Keith!!
@andydataguy 1 year ago
Great video! Hope you start making more soon
@TechTrekbyKeithGalli 1 year ago
Thank you! More on the way soon :)
@AshishSingh-753 2 years ago
Pandas has capabilities I didn't know about - secret Keith knows everything.
@TechTrekbyKeithGalli 2 years ago
Lol I love the nickname "secret keith". Glad this video was helpful!
@lesibasepuru8521 2 years ago
You are a star my man... thank you
@abhaytiwari5991 2 years ago
Well-done Keith 👍👍👍
@TechTrekbyKeithGalli 2 years ago
Thank you :)
@rishigupta2342 2 years ago
Thanks Keith. Please do more videos on EDA in Python.
@firasinuraya7065 2 years ago
OMG..this is gold..thank you for sharing
@elu1 2 years ago
great short video! nice job and thanks!
@TechTrekbyKeithGalli 2 years ago
Glad you enjoyed!
@spicytuna08 2 years ago
Thanks for the great lesson! Wondering what the performance difference would be between output = pd.concat([output, summary]) vs output.append(summary)?
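For what it's worth, both of those copy the growing output on every iteration, so their performance is similarly poor over many chunks, and DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. A common alternative is to collect the per-chunk summaries in a plain list and concatenate once at the end; a rough sketch (file name and chunk size are placeholders):

```python
import pandas as pd

summaries = []
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    chunk["count"] = 1
    summary = chunk.groupby(
        ["brand", "category_code", "event_type"], as_index=False
    )["count"].sum()
    summaries.append(summary)  # appending to a Python list is cheap, no copying

# One concat at the end instead of rebuilding the DataFrame every iteration.
output = pd.concat(summaries, ignore_index=True)

# Re-aggregate because the same group can show up in several chunks.
output = output.groupby(
    ["brand", "category_code", "event_type"], as_index=False
)["count"].sum()
```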
@agnesmunee9406 1 year ago
How would I go about it if it was a jsonlines (jsonl) data file?
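In case it helps, pandas can stream JSON Lines the same way; read_json accepts chunksize when lines=True. A rough sketch (file name is a placeholder):

```python
import pandas as pd

# lines=True means one JSON record per line; with chunksize set, read_json
# returns an iterator of DataFrames instead of loading the whole file.
reader = pd.read_json("events.jsonl", lines=True, chunksize=100_000)

for chunk in reader:
    print(chunk.shape)  # replace with your per-chunk aggregation
```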
@manyes7577 2 years ago
I have an error message on this one. It says 'DataFrame' object is not callable. Why is that and how do I solve it? Thanks.

for chunk in df:
    details = chunk[['brand', 'category_code', 'event_type']]
    display(details.head())
    break
@TechTrekbyKeithGalli 2 years ago
How did you define "df"? I think that's where your issue lies.
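For anyone who hits the same error: the loop only makes sense when df is the chunk iterator returned by read_csv with a chunksize, not a regular DataFrame. A minimal sketch (file name is a placeholder; display() is the Jupyter helper, print works anywhere):

```python
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames rather than one big DataFrame.
df = pd.read_csv("events.csv", chunksize=100_000)

for chunk in df:
    details = chunk[["brand", "category_code", "event_type"]]
    print(details.head())
    break  # only inspect the first chunk
```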
@CS_n00b 1 year ago
Why not groupby().size() instead of groupby().sum() on the column of 1's?
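For comparison, .size() counts the rows in each group directly, so it gives the same numbers as summing a helper column of 1's; a tiny self-contained example (made-up data standing in for one chunk):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["apple", "apple", "samsung"],
    "category_code": ["electronics.phone"] * 3,
    "event_type": ["view", "view", "cart"],
})

# groupby(...).size() counts rows per group directly...
counts_a = (
    df.groupby(["brand", "category_code", "event_type"])
    .size()
    .reset_index(name="count")
)

# ...which matches summing a helper column of 1's.
df["count"] = 1
counts_b = df.groupby(
    ["brand", "category_code", "event_type"], as_index=False
)["count"].sum()

print(counts_a.equals(counts_b))  # True
```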
@DataAnalystVictoria 1 year ago
Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you.
@lukaschumchal6676 2 years ago
Thank you for the video, it was really helpful. But I am still a little confused. Do I have to process every big file with chunks because it's necessary, or is it just a quicker way of working with large files?
@TechTrekbyKeithGalli 2 years ago
The answer really depends on the amount of RAM that you have on your machine. For example, I have 16GB of RAM on my laptop. No matter what, I would never be able to load in a 16GB+ file all at once because I don't have enough RAM (memory) to do that.

Realistically, my machine is probably using about half the RAM for miscellaneous tasks at all times, so I wouldn't even be able to open up an 8GB file all at once. If you are on Windows, you can open up your Task Manager --> Performance to see details on how much memory is available.

You could technically open up a file as long as you have enough memory available for it, but performance will decrease as you get closer to your total memory limit. As a result, my general recommendation would be to load in files in chunks basically any time the file is greater than 1-2GB in size.
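To put rough numbers on that, one way to sanity-check before picking a chunk size is to compare free RAM against the in-memory size of a single parsed chunk (psutil is a third-party package, and the file name is a placeholder):

```python
import pandas as pd
import psutil  # third-party: pip install psutil

# How much RAM is actually free right now, in bytes.
available = psutil.virtual_memory().available
print(f"available RAM: {available / 1e9:.1f} GB")

# Parse one modest chunk and measure it to estimate bytes per row in memory.
sample = next(pd.read_csv("events.csv", chunksize=100_000))
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
print(f"~{bytes_per_row:.0f} bytes per row once parsed")
```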
@lukaschumchal6676 2 years ago
@TechTrekbyKeithGalli Thank you very much. I cannot even describe how helpful this is to me :).
@rokaskarabevicius 5 months ago
This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating it makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.
@rodemire 3 months ago
What method can we use if there are possible duplicates?
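One option, sketched under the assumption that some column (here a hypothetical event_id) uniquely identifies each row and that the set of seen keys fits in memory: filter each chunk against the keys already seen before aggregating.

```python
import pandas as pd

seen_ids = set()   # keys observed in earlier chunks
summaries = []

for chunk in pd.read_csv("events.csv", chunksize=100_000):
    # "event_id" is a hypothetical unique-key column; use whatever identifies a row.
    chunk = chunk[~chunk["event_id"].isin(seen_ids)]
    chunk = chunk.drop_duplicates(subset="event_id")
    seen_ids.update(chunk["event_id"])

    chunk["count"] = 1
    summaries.append(
        chunk.groupby(
            ["brand", "category_code", "event_type"], as_index=False
        )["count"].sum()
    )

output = pd.concat(summaries, ignore_index=True)
output = output.groupby(
    ["brand", "category_code", "event_type"], as_index=False
)["count"].sum()
```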
@machinelearning1822 1 year ago
I have tried and followed each step, however it gives this error: OverflowError: signed integer is greater than maximum.
@TechTrekbyKeithGalli 1 year ago
How big is the data file you are trying to open?
@oscararmandocisnerosruvalc8503 1 year ago
Cool videos, bro. Can you address load and dump for JSON please? :)
@TechTrekbyKeithGalli 1 year ago
No guarantees, but I'll put that on my idea list!
@konstantinpluzhnikov4862 2 years ago
Nice video! Working with big files when the hardware is not at its best means there is plenty of time to make a cup of coffee, discuss the latest news...
@TechTrekbyKeithGalli 2 years ago
Haha yep
@vickkyphogat 1 year ago
What about .SAV files?
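Not covered in the video, but as far as I know pandas can read SPSS .sav files through the optional pyreadstat package, and pyreadstat can iterate large files in chunks; a rough sketch (file name is a placeholder):

```python
import pandas as pd
import pyreadstat  # optional dependency: pip install pyreadstat

# Small files: read straight into a DataFrame.
df = pd.read_spss("survey.sav")

# Larger-than-memory files: iterate in chunks via pyreadstat.
for chunk, meta in pyreadstat.read_file_in_chunks(
    pyreadstat.read_sav, "survey.sav", chunksize=100_000
):
    print(chunk.shape)  # replace with per-chunk processing
```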
@dicloniusN35 9 months ago
But the new file has only 100,000 rows, not all the info. Are you ignoring the other data?
@oscararmandocisnerosruvalc8503 1 year ago
Why did you use the count there?
@TechTrekbyKeithGalli 1 year ago
If you want to aggregate data (make it smaller), counting the number of occurrences of events is a common method to do that. If you are wondering why I added an additional 'count' column and summed it, instead of just doing something like value_counts(), that's just my personal preferred method of doing it. Both can work correctly.
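For completeness, the value_counts() route on a chunk would look something like this and gives the same counts, just ordered by frequency instead of by group key (tiny made-up data for illustration):

```python
import pandas as pd

chunk = pd.DataFrame({
    "brand": ["apple", "apple", "samsung"],
    "category_code": ["electronics.phone"] * 3,
    "event_type": ["view", "view", "cart"],
})

# DataFrame.value_counts (pandas >= 1.1) counts unique rows across the selected columns.
counts = (
    chunk[["brand", "category_code", "event_type"]]
    .value_counts()
    .reset_index(name="count")
)
print(counts)
```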
@oscararmandocisnerosruvalc8503 1 year ago
@TechTrekbyKeithGalli Thanks a lot for your videos, bro!!!!