Do these Pandas Alternatives actually work?

Views: 14,164


In this video we benchmark the speed of several alternatives to the Python pandas library on a large dataset. We look at four libraries: Dask, Modin, Ray and Vaex. Pandas is very popular among data scientists who code in Python, and other libraries claim to be faster than pandas. We put them to the test and see which is the fastest!
Timeline:
00:00 Intro
00:30 Setup
03:05 Pandas
05:54 Ray
10:24 Dask
13:30 Modin
15:45 Vaex
18:45 Summary
Follow me on twitch for live coding streams: / medallionstallion_
My other videos:
Speed Up Your Pandas Code: • Make Your Pandas Code ...
Speed up Pandas Code: • Make Your Pandas Code ...
Intro to Pandas video: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
Working with Audio data in Python: • Audio Data Processing ...
Efficient Pandas Dataframes: • Speed Up Your Pandas D...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #pandas #datascience #dataengineering

Comments: 86

@robmulla 1 year ago

If you enjoyed this video you should also check out my video about Polars, a pandas alternative that I didn't cover in this video: kzbin.info/www/bejne/jHnUn2qrm86coqc&feature=shares

@joaomurilopalonefauvel942 1 year ago

I wonder how polars performs. It seems like the fastest pandas alternative from my research.

@robmulla 1 year ago

Thanks for mentioning polars! I haven’t heard of it before but just read the GitHub page and it looks promising. Maybe next video I’ll cover it!

@robmulla 1 year ago

@Charles I made a video about polars! Check it out here! kzbin.info/www/bejne/jHnUn2qrm86coqc&feature=shares

@N147185 1 year ago

There is an important subtlety that is being missed about Vaex: it lazily reads the parquet file each time you do an operation. That means you add the time to read (stream) the data to the time it takes to actually do the math. That makes it all the more impressive (to me anyways). Other libraries read and hold the data in memory, so it is ready to be used. Vaex is more "memory safe", which is especially useful if you work with datasets that are much larger than RAM. Anyway, very nice video - keep up the good work!
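
To make that trade-off concrete, here is a minimal sketch using only the Python standard library. It is not how Vaex is implemented, and the file and function names (`sample.csv`, `eager_load`, `lazy_sum`) are made up for illustration. It only contrasts "read once and hold in memory" with "stream from disk on every operation".

```python
import csv
import os
import tempfile


def write_sample(path, n_rows):
    """Write a small CSV with one numeric column (stand-in for a big file)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["value"])
        for i in range(n_rows):
            writer.writerow([i])


def eager_load(path):
    """Read the whole column into memory once; later operations reuse the list."""
    with open(path, newline="") as f:
        return [int(row["value"]) for row in csv.DictReader(f)]


def lazy_sum(path):
    """Stream the file row by row on every call; only one row is held at a time."""
    total = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += int(row["value"])
    return total


path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_sample(path, 1000)

values = eager_load(path)    # pays the read cost once, keeps all rows in RAM
eager_total = sum(values)    # pure compute on data already in memory
lazy_total = lazy_sum(path)  # pays read cost plus compute on every call
```

Both approaches give the same answer. The eager one is faster per operation once the data is loaded, while the streaming one keeps working when the file is larger than memory.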

@robmulla 1 year ago

Great point. I didn’t know much about Vaex going into this. Interesting that it’s memory safe.

@CedricDeBoom 1 year ago

Would have liked more focus on memory and cpu usage. Especially in contexts with big datasets but limited resources, this is crucial, and it would have been nice to see and compare the effects of lazy evaluation here.

@glitchaddict99 1 month ago

yeah this barely touched on the real reason I use dask, to do out of core data operations when I can’t use pandas anymore

@gingerjiang666 1 year ago

Great video. Thank you very much. Quick question though: how did you change your JupyterLab theme? It looks so great.

@zhenliu6596 1 year ago

Just want to say thank you for saving us the time.

@robmulla 1 year ago

I’m happy to! Thanks for watching 😀

@DarthJarJar10 1 year ago

Was a tad surprised you didn't explore a Dask Delayed object, or try out some of Dask's concurrency features but love your videos! Would have loved to have seen Polars in the mix. For the missing cumsum in Vaex, I want to check if a reduce plus lambda combo would not have worked...

@robmulla 1 year ago

Thanks for the feedback. I realize that this video only scratches the surface of what’s possible with each library. I felt like it wouldn’t be fair to use more than just the base API, but you make a fair point. Maybe I need to make a follow-up video.

@DarthJarJar10 1 year ago

@robmulla, it was a pleasure! Your videos are great regardless! I'm immersing myself in this stuff bit by bit but having your content whilst working from home has been amazing!

@bennri 1 year ago

Doesn't dask.dataframe run on multiple cores concurrently by default?

@bennri 1 year ago

@Javis_Lumu Doesn't dask.dataframe run on multiple cores concurrently by default?

@DarthJarJar10 1 year ago

@bennri, I speak under correction: dask.dataframe may well run concurrently by default, but my understanding was that whether it does, and the number of threads used, is actually a setting, and moreover that Dask utilises concurrency most efficiently through the Delayed object. You're likely correct.

@terusensei_japones 1 year ago

Very interesting. It seems that I will keep mostly using pandas 🤣🤣 thanks for sharing the experiment!

@robmulla 1 year ago

When I started making this video I thought that each library would outperform pandas in a different way. I was surprised by the results. I'm sure there are situations where they are better alternatives to pandas - but for the time being I too will be mostly sticking with pandas.

@igormriegel 1 year ago

Try Polars, it is awesome and has way better results than what was shown in the video.

@RockieYang 1 year ago

Did you by chance test Vaex with the Arrow format as well? Vaex uses memory mapping, but with a parquet file it still needs to load the whole thing, while it might avoid the full load if using Arrow.

@robmulla 1 year ago

That's a good point. I just ran each package using the default with no modifications. If there is a way to change the backend format let me know how it might be done.

@jorgetimes2 1 year ago

Hi, @Rob Mulla, would you please share the link to the data parquet, so that we could replicate your results and dig a bit deeper to tell where the actual problems lie? Thanks for your video!

@robmulla 1 year ago

The data is a combination of the parquet files in this dataset: www.kaggle.com/datasets/robikscube/reddit-place-2022-official-canvas-history Good luck!

@androiduser457 1 year ago

I love Modin the most because it can be backed by Dask, Ray or OmniSci and is compatible with the pandas API. If those did not support processing big data, PySpark it is.

@robmulla 1 year ago

Thanks for sharing. Is there a backend for Modin you typically use more?

@bcak611 1 year ago

Try Vaex for Big Data!

@jti107 1 year ago

nice! the big thing with pandas is the amount of resources when learning and debugging. any thoughts on Julia?

@robmulla 1 year ago

Agreed, the resources and documentation surrounding pandas make it hard to beat. I don’t have any experience with Julia. Do you recommend it?

@jti107 1 year ago

@robmulla I work in aerospace, so a lot of my colleagues started using it, but I love Python too much so I haven't used it yet. I started with MATLAB and it took a lot of effort to transition to Python, so the switching cost is pretty high. When I have some time, I'd like to at least try some tutorials to see what the hype is about. Love your channel by the way, I've learned so much!!

@FabioRBelotto 1 year ago

You should share the notebook and the data (if it's available somewhere). It would be interesting to explore why tools such as Modin or Dask get such bad results.

@robmulla 1 year ago

Thanks! I did provide the code to the people at modin and they were looking into how to speed it up, but I haven't heard anything about it lately.

@gokulakrishnanm 1 year ago

Which processor are you using? Is it an Intel processor? From what I heard, Modin is good at running on Intel CPUs. Please share your system specs.

@robmulla 1 year ago

I’m using a Ryzen chip. Maybe that’s the issue.

@gokulakrishnanm 1 year ago

@robmulla share your test code with the data. I have an 11th-gen i5; I'll benchmark and share the result.

@kayderl 1 year ago

Your notebook looks really nice. Is that jupyter notebook with a theme or something else?

@robmulla 1 year ago

Thanks! Jupyterlab with the solarized dark theme.

@nishantkumar-lw6ce 1 year ago

How do we add existing list comprehension functions in pyspark?

@robmulla 1 year ago

I'm not sure. I haven't used pyspark in a long time :D

@585ghz 1 year ago

In Dask, you can split the data on the index, so it can aggregate by the index much faster.
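
A rough standard-library sketch of why that helps, under the assumption that "splitting on the index" means every key ends up in exactly one partition. This is not Dask's API or implementation, and all function names are hypothetical. In Dask itself the corresponding step would presumably be `set_index` on the grouping column.

```python
from collections import Counter, defaultdict


def aggregate_unpartitioned(partitions):
    """Keys may appear in any partition, so partial sums must be merged."""
    total = Counter()
    for part in partitions:
        partial = Counter()
        for key, value in part:
            partial[key] += value
        total.update(partial)  # merge step: adds partial sums key by key
    return dict(total)


def partition_by_key(rows, n_partitions):
    """Route each row to a partition chosen by its key (stand-in for indexing)."""
    parts = [[] for _ in range(n_partitions)]
    for key, value in rows:
        parts[hash(key) % n_partitions].append((key, value))
    return parts


def aggregate_partitioned(partitions):
    """Each key lives in one partition, so per-partition results just combine."""
    result = {}
    for part in partitions:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value
        result.update(partial)  # no key appears in two partitions
    return result


rows = [(1, 10), (2, 5), (1, 7), (3, 1), (2, 2)]

naive_parts = [rows[0::2], rows[1::2]]   # keys scattered across partitions
keyed_parts = partition_by_key(rows, 2)  # keys grouped per partition

naive_result = aggregate_unpartitioned(naive_parts)
keyed_result = aggregate_partitioned(keyed_parts)
```

The results are identical. The difference is that the keyed version never has to reconcile the same key across partitions, which is the expensive part once partitions live in different processes or machines. Partitioning has its own up-front cost, which is one reason it was left out of a drop-in comparison.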

@robmulla 1 year ago

That’s true. I just wanted to compare each one as a drop-in replacement.

@wayneh7067 2 months ago

Tbh if you have that much CPU memory, there’s really no need to consider any Pandas alternative. Maybe do some memory heavy tasks like joining large dataframes, which I usually use Dask for.

@CNW21 1 year ago

This is interesting because from my understanding pandas can only use 1 CPU core, whereas some or all of those alternatives *should* be able to use some or all of your Threadripper cores, which theoretically would drastically improve performance. Either way, from the looks of it I'd rather spend a few seconds/minutes waiting in pandas than reading the documentation for pandas alternatives.

@robmulla 1 year ago

You have the same thought that I did! While Python inherently uses only a single core, the vectorized NumPy and pandas functions are written in a lower-level language that can do multithreading. So that's why straight pandas is hard to beat when the data can fit into memory.

@bennri 1 year ago

@robmulla yes, but when I look at the task manager, I don't see multiple cores running.

@rafaeel731 1 year ago

It would be useful to know the specs of your machine. I think Dask makes sense when you have much larger data and clusters of executors, more like a Spark situation.

@robmulla 1 year ago

My machine has a 32-thread Ryzen CPU. There may be situations where it performs better, but my main goal was to show how it performs on a single machine. Most of the time pandas alone is the best.

@rafaeel731 1 year ago

@robmulla which is expected as Pandas can parallelise on a single machine and other options try to build on top of it, or some of them do. Thanks anyway!

@kv1kv 1 year ago

Vaex is the only package that provides working out-of-core functionality. You can process and explore data that does not fit into memory at all on your desktop or laptop. This is its purpose: it works when pandas just does not work at all. It is an awesome package that I use almost every day. It can be a little slower sometimes because it does not load the full dataset into memory and tries to use multiple cores, so there is some expected overhead. It also really misses some functionality, so you sometimes need to convert pieces of data into pandas, which can be done easily.

@robmulla 1 year ago

Really cool. I haven't used vaex much outside of in this video. Seems similar to polars, which I made a different video on.

@GiasoneP 1 year ago

Interesting video. I’ve been working on a problem to break up a 10 GB CSV file into multiple parquet files grouped by date. I’ve attempted to do it via pandas chunksize= and Dask. Using Dask to read into a pandas data frame (compute()) has been the fastest method so far. However, I think there are better methods. Moving on to PySpark next. The research continues…
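
For readers unfamiliar with the chunked approach, here is a minimal sketch of splitting a CSV by date one chunk at a time. To stay dependency-free it uses the standard `csv` module where the comment used pandas' `read_csv(chunksize=...)`, and it writes per-date CSV files rather than parquet. The column name `date`, the file names, and both helper functions are assumptions for illustration.

```python
import csv
import os
import tempfile
from collections import defaultdict
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def split_csv_by_date(src_path, out_dir, chunk_size=100_000, date_column="date"):
    """Append each chunk's rows to one output file per date; return row counts."""
    counts = {}
    with open(src_path, newline="") as src:
        reader = csv.DictReader(src)
        for chunk in iter_chunks(reader, chunk_size):
            # Group only the current chunk, so memory use is bounded by chunk_size.
            by_date = defaultdict(list)
            for row in chunk:
                by_date[row[date_column]].append(row)
            for date, rows in by_date.items():
                out_path = os.path.join(out_dir, f"{date}.csv")
                with open(out_path, "a", newline="") as out:
                    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
                    if date not in counts:  # first time this date is seen
                        writer.writeheader()
                    writer.writerows(rows)
                counts[date] = counts.get(date, 0) + len(rows)
    return counts


# Tiny made-up input so the sketch runs end to end.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "events.csv")
with open(src_path, "w", newline="") as f:
    f.write("date,value\n2022-04-01,1\n2022-04-02,2\n2022-04-01,3\n"
            "2022-04-01,4\n2022-04-02,5\n")

out_dir = os.path.join(work_dir, "by_date")
os.makedirs(out_dir)
counts = split_csv_by_date(src_path, out_dir, chunk_size=2)
```

Only one chunk is grouped in memory at a time, and each date's file grows by appending. Reopening an output file for every chunk is simple but not fast; a real job with many dates would likely keep writers open or batch rows per date.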

@joaomurilopalonefauvel942 1 year ago

Have you taken a look at polars?

@GiasoneP 1 year ago

@joaomurilopalonefauvel942 I have not, but will check it out 👍

@robmulla 1 year ago

Thanks for the feedback. As I mentioned in the video every dataset may respond differently. I didn’t know how fast each would perform going into the video - and was a bit surprised by the results.

@igormriegel 1 year ago

@GiasoneP I'm sure Polars will shine for you. I'm using it on 40 GB datasets and it is pretty snappy.

@GiasoneP 1 year ago

@igormriegel I’ll check it out this weekend. Thanks for sharing.

@soren-1184 1 year ago

Pandas 2 with arrow backend would be interesting here as well.

@jmoz 1 year ago

I spent days testing Dask and couldn’t find any benefits, or even get it to work for what I was trying. I needed to pivot a large 450M-row dataset and it simply wouldn’t work. It maxed out memory and HDD space. I had to use standard pandas and some clever iterating.

@robmulla 1 year ago

I’ve been in the exact same situation a bunch of times before too! That’s partly why I wanted to make this video. Thanks for watching.

@LeandroGessner 2 months ago

I missed DuckDB. In my tests, it is by far faster than the libraries in the video (not sure about pandas).

@sawekb8102 1 year ago

In order to speed up Dask you can configure the client to use all cores or pass .compute(scheduler="processes")

@robmulla 1 year ago

Good to know. I didn’t want to add too much difference to the packages because I wanted to compare apples to apples.

@riptorforever2 1 year ago

A suggestion: add the pyarrow lib if you do an update video about this :) The presentation 'PyArrow and the future of data analytics' (id: 6aWX9bZizu4) by EuroPython Conference impressed me.

@robmulla 1 year ago

Oh wow. I need to check that out. Doing a review of polars soon. It’s really good!

@p.v.h.8659 4 months ago

Tbh the comparison is like comparing pears and apples. You start off with a pd df which can fit in your memory; you are then obviously faster running it in pandas bc you're gonna have less overhead. But when you have datasets which just can't fit into your RAM, pandas starts to get useless and one has to switch to alternatives, for example Dask, especially when you run computationally heavy stuff like bootstrapping etc. on a cluster, where Dask supports the proper allocation of resources while pandas normally lacks this support.

@Maric18 11 months ago

Hm, I am not that happy with this comparison, as it doesn't add anything that just naively trying these things out doesn't already do. Pandas (to my knowledge) already uses NumPy under the hood, so it runs in parallel on your local machine, and everything else doing the exact same thing as pandas will be less efficient. Most of these are optimizing for datasets that cannot fit in RAM; Dask is (to my knowledge) for clustering, at least that's how I am using it. So a bit more research, trying to get Ray to work for example, actually using Dask features, doing some applys and so on would have been nice. Otherwise this video is clickbait; a compatible title could be something like "Can these libraries be a direct drop-in improvement over pandas?" or something.