How to work with big data files (5gb+) in Python Pandas!

40,800 views

TechTrek by Keith Galli

1 day ago

Comments: 50
@Hossein118 2 years ago
The end of the video was so fascinating to see how that huge amount of data was compressed to such a manageable size.
@TechTrekbyKeithGalli 2 years ago
I agree! So satisfying :)
@dhssb999 2 years ago
Never used chunk in read_csv before, it helps a lot! Great tip, thanks
@TechTrekbyKeithGalli 2 years ago
Glad it was helpful!!
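For readers who haven't used it, the chunking tip from this thread can be sketched roughly like this (the in-memory buffer and column names are illustrative stand-ins for a real multi-gigabyte file):

```python
import io
import pandas as pd

# Small in-memory stand-in for a real multi-gigabyte CSV file.
csv_data = "brand,price\n" + "\n".join(f"b{i % 3},{i}" for i in range(10))

# Passing chunksize makes read_csv return an iterator of DataFrames
# instead of loading the whole file into memory at once.
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=4):
    total_rows += len(chunk)  # process each piece, then let it be freed

print(total_rows)  # 10
```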
@mjacfardk 2 years ago
In my 3 years in the field of data science, this is the best course I've ever watched. Thank you brother, keep going.
@TechTrekbyKeithGalli 2 years ago
Glad you enjoyed!
@fruitfcker5351 2 years ago
If (and only if) you only want to read a few columns, just specify the ones you want by adding *usecols=["brand", "category_code", "event_type"]* to the *pd.read_csv* call. Took about 38 seconds to read on an M1 MacBook Air.
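A rough sketch of that usecols tip (toy data in an in-memory buffer; the column names match the comment, but the values are made up):

```python
import io
import pandas as pd

# Tiny stand-in CSV; the real file would have far more rows and columns.
csv_data = (
    "brand,category_code,event_type,price\n"
    "acme,shoes,view,10\n"
    "acme,shoes,cart,12\n"
)

# usecols tells the parser to keep only these columns and skip the
# rest entirely, which saves both parse time and memory.
df = pd.read_csv(io.StringIO(csv_data), usecols=["brand", "category_code", "event_type"])

print(list(df.columns))  # ['brand', 'category_code', 'event_type']
```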
@michaelhaag3367 2 years ago
Glad you are back, my man. I am currently in a data science bootcamp and you are way better than some of my teachers ;)
@TechTrekbyKeithGalli 2 years ago
Glad to be back :). I appreciate the support!
@jacktrainer4387 2 years ago
No wonder I've had trouble with Kaggle datasets! "Big" is a relative term. It's great to have a reasonable benchmark to work with! Many thanks!
@TechTrekbyKeithGalli 2 years ago
Definitely. "Big" means very different things depending on the person and the circumstances.
@Nevir202 2 years ago
Yeah, I've been trying to process a book in Sheets; just 100k words, so a few MB, is already too much for the way I'm processing it lol.
@CaribouDataScience 1 year ago
Since you are working with Python, another approach would be to import the data into a SQLite db, then create some aggregate tables and views...
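A hedged sketch of that SQLite route, combined with chunked reading (the table and column names are made up, and an in-memory database stands in for a file on disk):

```python
import io
import sqlite3
import pandas as pd

# Stand-in CSV; a real workflow would point read_csv at the 5gb file.
csv_data = "brand,price\na,1\nb,2\na,3\n"

# Stream chunks straight into a SQLite table, then aggregate with SQL
# so the full dataset never has to sit in memory at once.
conn = sqlite3.connect(":memory:")  # use a file path for a persistent db
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    chunk.to_sql("events", conn, if_exists="append", index=False)

totals = pd.read_sql("SELECT brand, SUM(price) AS total FROM events GROUP BY brand", conn)
print(totals)
```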
@ahmetsenol6104 1 year ago
It was quick and straight to the point. A very good one, thanks.
@JADanso 2 years ago
Very timely info, thanks Keith!!
@andydataguy 1 year ago
Great video! Hope you start making more soon
@TechTrekbyKeithGalli 1 year ago
Thank you! More on the way soon :)
@AshishSingh-753 2 years ago
Pandas has capabilities I didn't know about - secret Keith knows everything
@TechTrekbyKeithGalli 2 years ago
Lol I love the nickname "secret keith". Glad this video was helpful!
@lesibasepuru8521 2 years ago
You are a star, my man... thank you
@abhaytiwari5991 2 years ago
Well done, Keith 👍👍👍
@TechTrekbyKeithGalli 2 years ago
Thank you :)
@rishigupta2342 2 years ago
Thanks Keith. Please do more videos on EDA in Python.
@firasinuraya7065 2 years ago
OMG... this is gold... thank you for sharing
@elu1 2 years ago
Great short video! Nice job and thanks!
@TechTrekbyKeithGalli 2 years ago
Glad you enjoyed!
@spicytuna08 2 years ago
Thanks for the great lesson! Wondering what the performance difference would be between output = pd.concat([output, summary]) vs output.append(summary)?
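On that performance question: both forms copy the accumulated frame on every iteration, so they get slow as the output grows. The usual faster pattern, sketched here with made-up toy data, is to collect per-chunk summaries in a list and concatenate once at the end:

```python
import pandas as pd

# Collect per-chunk summaries in a plain Python list...
pieces = []
for i in range(3):
    pieces.append(pd.DataFrame({"brand": [f"b{i}"], "count": [i]}))

# ...then pay the concatenation cost a single time.
output = pd.concat(pieces, ignore_index=True)
print(list(output["count"]))  # [0, 1, 2]
```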
@agnesmunee9406 1 year ago
How would I go about it if it was a jsonlines (jsonl) data file?
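One possible answer, offered as a sketch: pandas' read_json accepts lines=True for jsonl, and with chunksize it returns an iterator of DataFrames, mirroring the chunked read_csv pattern (the records here are invented):

```python
import io
import pandas as pd

# Each line is one JSON record, per the jsonl convention.
jsonl = (
    '{"brand": "a", "price": 1}\n'
    '{"brand": "b", "price": 2}\n'
    '{"brand": "a", "price": 3}\n'
)

# With lines=True and chunksize, read_json yields DataFrame chunks.
total = 0
reader = pd.read_json(io.StringIO(jsonl), lines=True, chunksize=2)
for chunk in reader:
    total += chunk["price"].sum()

print(total)  # 6
```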
@manyes7577 2 years ago
I get an error message on this one. It says 'DataFrame' object is not callable. Why is that and how do I solve it? Thanks.
for chunk in df:
    details = chunk[['brand', 'category_code', 'event_type']]
    display(details.head())
    break
@TechTrekbyKeithGalli 2 years ago
How did you define "df"? I think that's where your issue lies.
@CS_n00b 1 year ago
Why not groupby.size() instead of groupby.sum() on the column of 1's?
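The two spellings give the same totals; size() counts rows per group directly, so the helper column of 1's is optional. A quick toy check:

```python
import pandas as pd

df = pd.DataFrame({"brand": ["a", "a", "b"]})

# Count rows per group directly...
via_size = df.groupby("brand").size()

# ...or via the video's helper column of 1's.
df["count"] = 1
via_sum = df.groupby("brand")["count"].sum()

print(via_size.equals(via_sum))  # True
```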
@DataAnalystVictoria 1 year ago
Why and how do you use 'append' with a DataFrame? I get an error when I do the same thing. Only if I use a list instead, and then concat all the dfs in the list, do I get the same result as you.
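For context: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the video's append call raises an error. The list-then-concat workaround this comment describes looks roughly like:

```python
import pandas as pd

# Per-chunk summaries accumulated in a list (toy data for illustration).
summaries = [
    pd.DataFrame({"count": [1]}),
    pd.DataFrame({"count": [2]}),
]

# One concat at the end replaces the removed DataFrame.append.
output = pd.concat(summaries, ignore_index=True)
print(list(output["count"]))  # [1, 2]
```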
@lukaschumchal6676 2 years ago
Thank you for the video, it was really helpful. But I am still a little confused: do I have to read every big file in chunks because it's necessary, or is it just a quicker way of working with large files?
@TechTrekbyKeithGalli 2 years ago
The answer really depends on the amount of RAM you have on your machine. For example, I have 16gb of RAM on my laptop. No matter what, I would never be able to load a 16gb+ file all at once because I don't have enough RAM (memory) to do that. Realistically, my machine is probably using about half the RAM for miscellaneous tasks at all times, so I wouldn't even be able to open an 8gb file all at once. If you are on Windows, you can open up your task manager --> performance to see details on how much memory is available. You could technically open up a file as long as you have enough memory available for it, but performance will decrease as you get closer to your total memory limit. As a result, my general recommendation would be to load files in chunks basically any time the file is greater than 1-2gb in size.
@lukaschumchal6676 2 years ago
@@TechTrekbyKeithGalli Thank you very much. I cannot even describe how helpful this is to me :).
@rokaskarabevicius 5 months ago
This works fine if you don't have any duplicates in your data. Even if you de-dupe every chunk, aggregating it makes it impossible to know whether there are any dupes between the chunks. In other words, do not use this method if you're not sure whether your data contains duplicates.
@rodemire 3 months ago
What method can we use if there are possible duplicates?
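One possible answer, offered as a sketch rather than a definitive fix: if the data has a unique key column (here a hypothetical event_id), you can track seen keys across chunks and drop repeats before aggregating. This assumes the set of keys itself fits in memory.

```python
import io
import pandas as pd

# Toy data with a duplicated event_id=2 spanning two chunks.
csv_data = "event_id,brand\n1,a\n2,b\n2,b\n3,a\n"

seen = set()   # keys already counted in earlier chunks
counts = {}    # running per-brand totals
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    chunk = chunk[~chunk["event_id"].isin(seen)]  # drop cross-chunk dupes
    seen.update(chunk["event_id"])
    for brand, n in chunk.groupby("brand").size().items():
        counts[brand] = counts.get(brand, 0) + int(n)

print(counts)  # {'a': 2, 'b': 1}
```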
@machinelearning1822 1 year ago
I have tried and followed each step; however, it gives this error: OverflowError: signed integer is greater than maximum
@TechTrekbyKeithGalli 1 year ago
How big is the data file you are trying to open?
@oscararmandocisnerosruvalc8503 1 year ago
Cool videos, bro. Can you address load and dump for JSON please :)?
@TechTrekbyKeithGalli 1 year ago
No guarantees, but I'll put that on my idea list!
@konstantinpluzhnikov4862 2 years ago
Nice video! Working with big files when the hardware is not at its best means there is plenty of time to make a cup of coffee, discuss the latest news...
@TechTrekbyKeithGalli 2 years ago
Haha yep
@vickkyphogat 1 year ago
What about .SAV files?
@dicloniusN35 9 months ago
But the new file has only 100,000 rows, not all the info. Are you ignoring the other data?
@oscararmandocisnerosruvalc8503 1 year ago
Why did you use the count there?
@TechTrekbyKeithGalli 1 year ago
If you want to aggregate data (make it smaller), counting the number of occurrences of events is a common method to do that. If you are wondering why I added an additional 'count' column and summing, instead of just doing something like value_counts(), that's just my personal preferred method of doing it. Both can work correctly.
@oscararmandocisnerosruvalc8503 1 year ago
@@TechTrekbyKeithGalli Thanks a lot for your videos, bro!!!!
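For completeness, the value_counts alternative mentioned in the reply above agrees with the explicit count-column approach on toy data:

```python
import pandas as pd

df = pd.DataFrame({"event_type": ["view", "view", "cart"]})

# One-liner alternative...
via_value_counts = df["event_type"].value_counts()

# ...and the video's explicit count column.
df["count"] = 1
via_sum = df.groupby("event_type")["count"].sum()

print(via_value_counts["view"], via_sum["view"])  # 2 2
```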