Do these Pandas Alternatives actually work?

Views: 14,164

Rob Mulla

A day ago

In this video we benchmark the speed of several Python pandas alternative libraries on a large dataset. We look at four different libraries: Dask, Modin, Ray and Vaex. Pandas is a very popular library among data scientists who code in Python, and other libraries exist that claim to be faster than pandas. We put them to the test and see which is the fastest!
Timeline:
00:00 Intro
00:30 Setup
03:05 Pandas
05:54 Ray
10:24 Dask
13:30 Modin
15:45 Vaex
18:45 Summary
Follow me on twitch for live coding streams: / medallionstallion_
My other videos:
Speed Up Your Pandas Code: • Make Your Pandas Code ...
Speed up Pandas Code: • Make Your Pandas Code ...
Intro to Pandas video: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
Working with Audio data in Python: • Audio Data Processing ...
Efficient Pandas Dataframes: • Speed Up Your Pandas D...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #pandas #datascience #dataengineering

Comments: 86
@robmulla (a year ago)
If you enjoyed this video, you should also check out my video about Polars, a pandas alternative that I didn't cover in this video: kzbin.info/www/bejne/jHnUn2qrm86coqc&feature=shares
@joaomurilopalonefauvel942 (a year ago)
I wonder how Polars performs. It seems like the fastest pandas alternative from my research.
@robmulla (a year ago)
Thanks for mentioning Polars! I hadn't heard of it before, but I just read the GitHub page and it looks promising. Maybe I'll cover it in the next video!
@robmulla (a year ago)
@Charles I made a video about Polars! Check it out here: kzbin.info/www/bejne/jHnUn2qrm86coqc&feature=shares
@N147185 (a year ago)
There is an important subtlety that is being missed about Vaex: it lazily reads the Parquet file each time you do an operation. That means you add the time to read (stream) the data to the time it takes to actually do the math. That makes it all the more impressive (to me anyway). Other libraries read and hold the data in memory, so it is ready to be used. Vaex is more "memory safe", which is especially useful if you work with datasets that are much larger than RAM. Anyway, very nice video, keep up the good work!
@robmulla (a year ago)
Great point. I didn't know much about Vaex going into this. Interesting that it's memory safe.
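The lazy, stream-on-every-operation behaviour described in this thread can be illustrated without Vaex itself. The sketch below is a library-free approximation and not Vaex's API: `write_doubles` and `streamed_sum` are hypothetical helper names, and the raw float64 file only stands in for a columnar file. It shows the trade-off the comment points at: each operation re-reads the file in chunks through a memory map, so memory use stays bounded, but the read cost is paid on every pass.

```python
import mmap
import struct

ITEM = "<d"  # little-endian float64, an assumed on-disk layout for this sketch

def write_doubles(path, values):
    """Write values as raw float64 numbers, a stand-in for a columnar file."""
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack(ITEM, v))

def streamed_sum(path, chunk_items=1024):
    """Sum the column by memory-mapping the file and reading it chunk by chunk.

    Nothing is cached between calls: every call streams the file again,
    which keeps memory use small at the cost of repeated reads.
    """
    item_size = struct.calcsize(ITEM)
    step = chunk_items * item_size
    total = 0.0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for start in range(0, len(mm), step):
            chunk = mm[start:start + step]
            count = len(chunk) // item_size
            total += sum(struct.unpack(f"<{count}d", chunk))
    return total
```

Calling `streamed_sum` twice on the same file does the read work twice, which is the overhead the comment attributes to lazy reading; an in-memory library pays that cost once up front instead.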
@CedricDeBoom (a year ago)
I would have liked more focus on memory and CPU usage. Especially in contexts with big datasets but limited resources, this is crucial, and it would have been nice to see and compare the effects of lazy evaluation here.
@glitchaddict99 (a month ago)
Yeah, this barely touched on the real reason I use Dask: to do out-of-core data operations when I can't use pandas anymore.
@gingerjiang666 (a year ago)
Great video. Thank you very much. Quick question though: how did you change your JupyterLab theme? It looks so great.
@zhenliu6596 (a year ago)
Just want to say thank you for saving us the time.
@robmulla (a year ago)
I'm happy to! Thanks for watching 😀
@DarthJarJar10 (a year ago)
I was a tad surprised you didn't explore a Dask Delayed object or try out some of Dask's concurrency features, but I love your videos! I would have loved to see Polars in the mix. For the missing cumsum in Vaex, I want to check whether a reduce plus lambda combo would have worked...
@robmulla (a year ago)
Thanks for the feedback. I realize that this video only scratches the surface of what's possible with each library. I felt like it wouldn't be fair to use more than just the base API, but you make a fair point. Maybe I need to make a follow-up video.
@DarthJarJar10 (a year ago)
@robmulla, it was a pleasure! Your videos are great regardless! I'm immersing myself in this stuff bit by bit, but having your content whilst working from home has been amazing!
@bennri (a year ago)
Doesn't dask.dataframe run on multiple cores concurrently by default?
@bennri (a year ago)
@Javis_Lumu Doesn't dask.dataframe run on multiple cores concurrently by default?
@DarthJarJar10 (a year ago)
@bennri, I speak under correction. It may be that dask.dataframe runs concurrently by default, but my understanding was that whether this is the case, and the number of threads used, is actually a setting, and moreover that Dask utilises concurrency most efficiently through Dask Delayed. You're likely correct.
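The concurrency question in this thread is easier to reason about with the general pattern in view: split the data into partitions, run one task per partition, then combine the partial results. The following is a minimal sketch of that pattern using only Python's standard `concurrent.futures` module. It is not Dask's API, and `partition` and `parallel_sum` are hypothetical names chosen for illustration. Note that threads running pure Python code are limited by the interpreter lock, so this shows the structure of the computation rather than a guaranteed speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(values, n_partitions):
    """Split a list into at most n_partitions contiguous, near-equal chunks."""
    size = max(1, -(-len(values) // n_partitions))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def parallel_sum(values, n_partitions=4):
    """Sum each partition in a worker thread, then combine the partial sums."""
    chunks = partition(values, n_partitions)
    if not chunks:
        return 0  # nothing to schedule for empty input
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)
```

For example, `parallel_sum(list(range(1000)))` splits the input into four chunks of 250 items. The number of partitions and the kind of worker (threads or processes) are exactly the sort of settings the commenters are discussing.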
@terusensei_japones (a year ago)
Very interesting. It seems that I will keep mostly using pandas 🤣🤣 Thanks for sharing the experiment!
@robmulla (a year ago)
When I started making this video I thought that each library would outperform pandas in a different way. I was surprised by the results. I'm sure there are situations where they are better alternatives to pandas, but for the time being I too will be mostly sticking with pandas.
@igormriegel (a year ago)
Try Polars, it is awesome and has way better results than what was shown in the video.
@RockieYang (a year ago)
Did you by chance also test Vaex with the Arrow format? Vaex uses memory mapping, but with a Parquet file it still needs to load the whole thing, whereas with Arrow it might avoid the full load.
@robmulla (a year ago)
That's a good point. I just ran each package using the defaults with no modifications. If there is a way to change the backend format, let me know how it might be done.
@jorgetimes2 (a year ago)
Hi, @Rob Mulla, would you please share the link to the Parquet data, so that we can replicate your results and dig a bit deeper to tell where the actual problems lie? Thanks for your video!
@robmulla (a year ago)
The data is a combination of the Parquet files in this dataset: www.kaggle.com/datasets/robikscube/reddit-place-2022-official-canvas-history Good luck!
@androiduser457 (a year ago)
I like Modin the most because it can be backed by Dask, Ray or OmniSci and is compatible with the pandas API. If those can't handle the big data, PySpark it is.
@robmulla (a year ago)
Thanks for sharing. Is there a Modin backend you typically use more?
@bcak611 (a year ago)
Try Vaex for big data!
@jti107 (a year ago)
Nice! The big thing with pandas is the amount of resources available when learning and debugging. Any thoughts on Julia?
@robmulla (a year ago)
Agreed, the resources and documentation surrounding pandas make it hard to beat. I don't have any experience with Julia. Do you recommend it?
@jti107 (a year ago)
@robmulla I work in aerospace, so a lot of my colleagues started using it, but I love Python too much, so I haven't used it yet. I started with MATLAB and it took a lot of effort to transition to Python, so the switching cost is pretty high. When I have some time, I'd like to at least try some tutorials to see what the hype is about. Love your channel by the way, I've learned so much!!
@FabioRBelotto (a year ago)
You should share the notebook and the data (if it's available somewhere). It would be interesting to explore why tools such as Modin or Dask get such bad results.
@robmulla (a year ago)
Thanks! I did provide the code to the people at Modin and they were looking into how to speed it up, but I haven't heard anything about it lately.
@gokulakrishnanm (a year ago)
Which processor are you using? Is it an Intel processor? From what I heard, Modin is good at running on Intel CPUs. Please share your system specs.
@robmulla (a year ago)
I'm using a Ryzen chip. Maybe that's the issue.
@gokulakrishnanm (a year ago)
@robmulla Share your test code and data. I have an 11th-gen i5; I'll benchmark and share the results.
@kayderl (a year ago)
Your notebook looks really nice. Is that Jupyter Notebook with a theme or something else?
@robmulla (a year ago)
Thanks! JupyterLab with the solarized dark theme.
@nishantkumar-lw6ce (a year ago)
How do we use existing list comprehension functions in PySpark?
@robmulla (a year ago)
I'm not sure. I haven't used PySpark in a long time :D
@585ghz (a year ago)
In Dask, you can partition by the index, so it can aggregate by the index much faster.
@robmulla (a year ago)
That's true. I just wanted to compare it as a drop-in replacement.
@wayneh7067 (2 months ago)
Tbh if you have that much CPU memory, there's really no need to consider any pandas alternative. Maybe do some memory-heavy tasks like joining large dataframes, which I usually use Dask for.
@CNW21 (a year ago)
This is interesting because from my understanding pandas can only use 1 CPU core, whereas some or all of those alternatives *should* be able to use some or all of your Threadripper cores, which theoretically would drastically improve performance. Either way, from the looks of it I'd rather spend a few seconds or minutes waiting in pandas than reading the documentation for pandas alternatives.
@robmulla (a year ago)
You had the same thought that I did! While Python inherently uses only a single core, the vectorized NumPy and pandas functions are written in a lower-level language that can do multithreading. So that's why straight pandas is hard to beat when the data can fit into memory.
@bennri (a year ago)
@robmulla Yes, but when I look at the task manager, I don't see multiple cores running.
@rafaeel731 (a year ago)
It would be useful to know the specs of your machine. I think Dask makes sense when you have much larger data and clusters of executors, more like a Spark situation.
@robmulla (a year ago)
My machine has a 32-thread Ryzen CPU. There may be situations where it performs better, but my main goal was to show how it performs on a single machine. Most of the time pandas alone is the best.
@rafaeel731 (a year ago)
@robmulla Which is expected, as pandas can parallelise on a single machine and the other options try to build on top of it, or some of them do. Thanks anyway!
@kv1kv (a year ago)
Vaex is the only package that provides working out-of-core functionality. You can process and explore data that just does not fit into memory at all on your desktop or laptop. This is its purpose: it works when pandas just does not work at all. It is an awesome package that I use on an almost daily basis. It can be a little slower sometimes because it does not load the full dataset into memory and tries to use multiple cores, so there is some expected overhead. It also really misses some functionality, so you sometimes need to convert pieces of data into pandas, which can be done easily.
@robmulla (a year ago)
Really cool. I haven't used Vaex much outside of this video. It seems similar to Polars, which I made a different video on.
@GiasoneP (a year ago)
Interesting video. I've been working on a problem to break up a 10 GB CSV file into multiple Parquet files grouped by date. I've attempted to do it via pandas chunksize= and Dask. Using Dask to read into a pandas dataframe (compute()) has been the fastest method so far. However, I think there are better methods. Moving on to PySpark next. The research continues…
@joaomurilopalonefauvel942 (a year ago)
Have you taken a look at Polars?
@GiasoneP (a year ago)
@joaomurilopalonefauvel942 I have not, but will check it out 👍
@robmulla (a year ago)
Thanks for the feedback. As I mentioned in the video, every dataset may respond differently. I didn't know how fast each would perform going into the video, and was a bit surprised by the results.
@igormriegel (a year ago)
@GiasoneP I'm sure Polars will shine for you. I'm using it on 40 GB datasets and it is pretty snappy.
@GiasoneP (a year ago)
@igormriegel I'll check it out this weekend. Thanks for sharing.
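The splitting task described at the top of this thread follows a streaming pattern that can be sketched with the standard library alone: read the big file row by row and route each row to an output file chosen by its date. This is a hedged illustration, not the commenter's code. `split_csv_by_date` is a hypothetical helper, it writes CSV rather than Parquet (Parquet needs a third-party library), and it assumes the date column starts with an ISO date such as `2022-04-01`. The pandas `read_csv(chunksize=...)` approach mentioned in the comment applies the same idea with chunks of rows instead of single rows.

```python
import csv
import os

def split_csv_by_date(src_path, out_dir, date_column):
    """Stream src_path and append each row to out_dir/<date>.csv.

    Assumes date_column values begin with an ISO date (YYYY-MM-DD).
    Returns a dict mapping each date to the number of rows written.
    """
    os.makedirs(out_dir, exist_ok=True)
    handles, writers, counts = {}, {}, {}
    try:
        with open(src_path, newline="") as src:
            reader = csv.DictReader(src)
            for row in reader:
                date = row[date_column][:10]
                if date not in writers:
                    # One open file per distinct date: fine for a sketch, but a
                    # job with very many dates would need to cap open handles.
                    out_path = os.path.join(out_dir, f"{date}.csv")
                    handles[date] = open(out_path, "w", newline="")
                    writers[date] = csv.DictWriter(
                        handles[date], fieldnames=reader.fieldnames
                    )
                    writers[date].writeheader()
                    counts[date] = 0
                writers[date].writerow(row)
                counts[date] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Only one input row is held in memory at a time, so the size of the source file does not matter; the limiting factor becomes the number of distinct dates, because each one keeps a file handle open.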
@soren-1184 (a year ago)
Pandas 2 with the Arrow backend would be interesting here as well.
@jmoz (a year ago)
I spent days testing Dask and couldn't find any benefits or even get it to work for what I was trying. I needed to pivot a large 450M-row dataset and it simply wouldn't work. It maxed out memory and HDD space. I had to use standard pandas and some clever iterating.
@robmulla (a year ago)
I've been in the exact same situation a bunch of times before too! That's partly why I wanted to make this video. Thanks for watching.
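The "clever iterating" mentioned above is not spelled out, so the following is only one guess at what such an approach could look like: build the pivot incrementally, keeping running totals per (row, column) pair while the data is consumed chunk by chunk. `pivot_sum_in_chunks` is a hypothetical helper written with the standard library, not pandas or Dask code, and it assumes the aggregation is a sum and that the pivoted result itself fits in memory.

```python
from collections import defaultdict

def pivot_sum_in_chunks(chunks):
    """Build a pivot table of sums from an iterable of row chunks.

    Each chunk is a list of (row_key, column_key, value) tuples. Only the
    running totals are kept in memory, never the full dataset.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for chunk in chunks:
        for row_key, column_key, value in chunk:
            totals[row_key][column_key] += value
    # Convert to plain dicts so the result is easy to inspect or serialize.
    return {row: dict(cols) for row, cols in totals.items()}
```

Because `chunks` can be any iterable, it can be fed by a generator that reads a large file piece by piece, which is the property that keeps memory use tied to the number of distinct keys rather than the number of input rows.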
@LeandroGessner (2 months ago)
I missed DuckDB. In my tests it is by far faster than the libraries in the video (not sure about pandas).
@sawekb8102 (a year ago)
In order to speed up Dask you can configure the client to use all cores or pass .compute(scheduler="processes").
@robmulla (a year ago)
Good to know. I didn't want to add too many differences between the packages because I wanted to compare apples to apples.
@riptorforever2 (a year ago)
A suggestion: add the pyarrow lib if you do an update video about this :) The presentation 'PyArrow and the future of data analytics' (id: 6aWX9bZizu4) by EuroPython Conference impressed me.
@robmulla (a year ago)
Oh wow. I need to check that out. Doing a review of Polars soon. It's really good!
@p.v.h.8659 (4 months ago)
Tbh the comparison is like comparing pears and apples. You start off with a pandas dataframe that fits in your memory, so you are obviously faster running it in pandas because you have less overhead. But when you have datasets which just can't fit into your RAM, pandas starts to get useless and one has to switch to alternatives such as Dask, especially when you run computationally heavy stuff like bootstrapping on a cluster, where Dask supports the proper allocation of resources while pandas normally lacks this support.
@Maric18 (11 months ago)
Hm, I am not that happy with this comparison, as it doesn't add anything that just naively trying these things out doesn't already do. Pandas (to my knowledge) already uses NumPy under the hood, so it runs in parallel on your local machine, and everything else doing the exact same thing as pandas will be less efficient. Most of these are optimized for datasets that cannot fit in RAM; Dask is (to my knowledge) for clustering, at least that's how I am using it. So a bit more research, trying to get Ray to work for example, actually using Dask features, doing some applys and so on would have been nice. Otherwise this video is clickbait. A compatible title could be something like "Can these libraries be a direct drop-in improvement over pandas?" or something.
@lucasbraesch805 (a year ago)
What about Polars? This one beats everything else hands down in all the benchmarks that I have seen.
@robmulla (a year ago)
I've heard a lot of good things about Polars and need to check it out.
@fizipcfx (a year ago)
How about cuDF?
@robmulla (a year ago)
I didn't cover it in this video, but maybe in the future.
@CaribouDataScience (a year ago)
You misspelled "Tidyverse" 😂
@robmulla (a year ago)
😂
@ajaypranav1390 (a year ago)
Try with Polars.
@robmulla (a year ago)
I did! Check out my channel, I have two new videos about it.
@MichaelMantion (a year ago)
My butt puckers whenever I see people use "dd". It is not an urban legend that people have lost a lot of data misusing dd in bash.
@robmulla (a year ago)
Lol. That thought has never crossed my mind, but it's pretty funny.
@FabioRBelotto (a year ago)
If you are an experienced user and have issues with some libs, imagine what happens to a beginner lol
@robmulla (a year ago)
True! But this is a good thing to learn as a beginner too.
@barelmishal9668 (a year ago)
Hi, try Polars. It is the best, much better than pandas by far.
@robmulla (a year ago)
Absolutely! I made a whole video about it. Check it out here: kzbin.info/www/bejne/jHnUn2qrm86coqc&feature=shares
@tashfinbashar1943 (6 months ago)
Great video. Can you do one on Polars? @robmulla