Pandas Dataframes on your GPU w/ CuDF

  43,173 views

sentdex

A day ago

An overview and some quick examples of using CuDF's Pandas accelerator and how much faster it can be than vanilla Pandas for data analysis.
Colab demo of Rapids: nvda.ws/3LWggQj
AI and Data Science Virtual Summit: nvda.ws/3ZR3wjL
Notebook in this video: gist.github.co...
Install CuDF: pip install cudf-cu11 --extra-index-url=pypi.nvidia.com (or cu12)
Neural Networks from Scratch book: nnfs.io
Channel membership: / @sentdex
Discord: / discord
Reddit: / sentdex
Support the content: pythonprogramm...
Twitter: / sentdex
Instagram: / sentdex
Facebook: / pythonprogramming.net
Twitch: / sentdex

Comments: 73
@HarrisBallis 10 months ago
Enabling cuDF using a single flag is insane! However, I just wanted to point out (especially for new pandas users) that the proper way to calculate average price per city in pandas is by using groupby. Running `df.groupby('Town/City')['price'].mean()` in plain pandas is blazing fast (a few ms), nothing compared to 19 minutes. That doesn't mean that cuDF is not useful, but don't forget that using plain pandas properly can get you a long way.
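To make the groupby point concrete, here is a minimal sketch on made-up data. Only the column names 'Town/City' and 'price' come from the comment; the rows are hypothetical, and the loop is an assumption about the kind of per-city filtering the comments describe, not the video's actual code.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the comment above.
df = pd.DataFrame(
    {
        "Town/City": ["LONDON", "LEEDS", "LONDON", "YORK", "LEEDS"],
        "price": [500_000, 200_000, 700_000, 250_000, 300_000],
    }
)

# Loop version: filters the whole frame once per unique city.
loop_means = {}
for city in df["Town/City"].unique():
    loop_means[city] = df[df["Town/City"] == city]["price"].mean()

# groupby version: one grouped aggregation over the whole frame.
groupby_means = df.groupby("Town/City")["price"].mean()

print(groupby_means)
```

Both versions give the same averages; the loop rescans the full frame for every city, which is why it slows down so sharply on a multi-gigabyte file.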
@AchileasGalatis 5 months ago
😊
@bantaibaman5662 10 months ago
Hey Sentdex, can you take this video down so my manager doesn't find out that I sped up the entire codebase by 200 fold with just one line and I end up getting appreciation bonuses?? Jokes aside, this is absolutely wild. What a gamechanger. Thanks a lot as always, Kevin!
@harshvaragiya8834 10 months ago
Awesome video! I encountered a similar issue where I had to process ~8 GB of data using an AWS Lambda (limited RAM and time). I used polars (a pandas alternative written from scratch in Rust for performance) and I found it to be blazing fast. It's really, really useful, especially with non-NVIDIA devices like my Raspberry Pi and the AWS Lambda function. You should definitely check it out!
@incremental_failure 10 months ago
I've nearly forgotten pandas after going with polars. Pandas was great for its time.
@xylem87 10 months ago
There is also Dask, which allows deployment on clusters with several workers, similar to Spark.
@AyahuascaDataScientist 10 months ago
@incremental_failure polars doesn’t even offer a .info() method. Simply inferior()
@incremental_failure 10 months ago
@AyahuascaDataScientist df.describe()
@kenchang3456 10 months ago
Once again thank you for sharing :-) You are appreciated.
@megalomaniacal 10 months ago
Thanks bro, will give it a test run.
@Brickkzz 10 months ago
Missing your tutorials man, trying to install this on windows...
@jameslucas5590 10 months ago
Outstanding. Thank you for this information.
@damianshaw8456 10 months ago
For the read_csv operation I would be curious what is actually taking the most time with the Pandas object. I suspect it's building the Python string objects, and if so I wonder whether it would be much faster if you have PyArrow installed and set pd.options.future.infer_string = True. And in general it makes sense that using strings is slow in Pandas, because it falls back to looking up a Python object via a reference to it; it's actually a much more interesting comparison for number or datetime data types. For strings it would be much more interesting if you had used the PyArrow string data type.
@usamatahir7091 10 months ago
Thanks a lot for sharing. super useful
@extrememike 10 months ago
Great find!🎉
@__python__ 10 months ago
Thanks for sharing... I would be curious about a comparison between accelerated version of pandas and polars.
@perryholman5302 10 months ago
Wonderful! Thank you. It would be an interesting comparison with polars library as well.
@VermontStrolls 10 months ago
Thanks a ton!
@oguzhanyldrm962 10 months ago
Please post videos more often, Harrison
@ahmed-yassinechraa8731 10 months ago
Can you make a tutorial on how to install cuDF? I saw that there are a lot of things to install before it.
@maurice9327 10 months ago
What if my RAM (128GB) is larger than my VRAM (32GB)? Is normal pandas still faster for data that's larger than the VRAM?
@kascesar 10 months ago
Impressive!
@thetdg 10 months ago
Great video. Just one thing: instead of comparing cuDF with vanilla Pandas, wouldn’t a comparison with Modin be a more appropriate one?
@droit19 10 months ago
I'd like to see this as well, scale Modin on a Ray Cluster/Single Node using a GPU
@shaftymaze 9 months ago
Polars in rust wrapped in tqdm.
@Mil-Keeway 7 months ago
3:38 it doesn't have the prices "in quotes like a string", it's a properly exported csv that has ALL fields quoted. Your pd.read_csv is missing quoting=csv.QUOTE_ALL (or just quoting=1) and optionally quotechar='\"' . The only "magic" pandas is doing is interpreting that column as quoted. If you add those options, I'm guessing cudf will run just as well, since the ingest portion will still be using python standard lib or at least pandas C implementations.
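As a small check of how `pd.read_csv` treats a file in which every field is quoted, the sketch below parses a made-up two-row sample (the rows are hypothetical; only the column names echo the thread). With default arguments the parser strips the quote characters before inferring dtypes, so the quoted price column comes back as an integer column.

```python
import io

import pandas as pd

# Hypothetical sample in which every field, including the numbers, is quoted.
raw = '"price","Town/City"\n"250000","LEEDS"\n"410000","LONDON"\n'

# Default arguments: no quoting or quotechar options are passed.
df = pd.read_csv(io.StringIO(raw))

# Quotes are removed during parsing, so 'price' is inferred as an integer dtype.
print(df.dtypes)
```

Running the same check on a slice of the real file is a quick way to see whether any extra quoting options are needed for it.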
@delt19 10 months ago
10 sec compared to 19 min?!?! Holy f....!!!!!
@shashisaini7919 10 months ago
@sentdex sir, please make videos on 3D deep learning; it's really exciting to see your work on point clouds
@HeigthTrielli 4 months ago
Hi! Can you share what hardware you were running on?
@EarlZMoade 10 months ago
What's the reasoning for not using groupby in this demo? Wouldn't that be the more natural and faster pandas method to use - instead of looping over everything. Feels a little disingenuous to compare poorly optimised pandas code that no one would actually write.
@AlignmentLabAI 10 months ago
jesus christ my life has totally changed
@BohonChina 10 months ago
How about Mojo? Mojo can actually use the GPU to accelerate calculation too; currently Mojo supports numpy and pandas on the CPU. It would be fun to make a comparison with CuDF. Mojo is more like a superset of Python.
@mamiri8520 a month ago
Hi, thank you for showing this great way to use the GPU. It seems really easy, but I ran into an error that I couldn't find a solution for anywhere: UserWarning: cudf.pandas detected an already configured memory resource, ignoring 'CUDF_PANDAS_RMM_MODE'=managed_pool. Can anyone help me?
@vigneshpadmanabhan 10 months ago
CuDF vs Polars, maybe.
@Stinosko 10 months ago
Cool!
@sternsemasuka9716 10 months ago
How about Polars?
@mamiri8520 a month ago
Can you please make a video on using cudf in Python scripts? It's much trickier in scripts.
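For scripts, cuDF provides two routes: running the file as `python -m cudf.pandas script.py`, or calling `cudf.pandas.install()` at the top of the script before pandas is imported (the notebook equivalent is `%load_ext cudf.pandas`). Below is a minimal sketch of the second route. The fallback to plain pandas when cuDF is missing and the toy data are assumptions added for illustration, not something shown in the video.

```python
# Enable the accelerator before pandas is imported anywhere in the process.
try:
    import cudf.pandas

    cudf.pandas.install()
    accelerated = True
except ImportError:
    # Assumption for this sketch: fall back to plain pandas when cuDF
    # is not installed (for example, on a machine without an NVIDIA GPU).
    accelerated = False

import pandas as pd

# Hypothetical toy data; the rest of the script is ordinary pandas code.
df = pd.DataFrame({"city": ["A", "B", "A"], "price": [100, 200, 300]})
means = df.groupby("city")["price"].mean()

print("accelerated:", accelerated)
print(means)
```

The ordering is the important part: if pandas is imported first, the script keeps using the regular CPU implementation.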
@EtienneTremblay 10 months ago
What happens if your dataset doesn't fit in GPU memory?
@jurajjakubik2185 10 months ago
I have tested that and it is slower than on the CPU. Pretty much, you use all the GPU memory and the rest goes to RAM. Then it goes back and forth.
@BDog_1 7 months ago
How would you start to make an AI that deals with data using Python? I'm trying to learn more about this.
@d_b_ 10 months ago
This looks like it would beat out something like Dask for non-distributed large datasets. Is that the case?
@Andrea-du3or 4 months ago
Has anyone been able to install the library with pip as he showed? I keep getting errors like - Preparing metadata (pyproject.toml): finished with status 'error' :')
@ivchatov 10 months ago
Is this compatible with Python 3.7 at all? Last time I tried installing CuDF I remember version incompatibility stopping me.
@MuhammadNurdinnewspecies 10 months ago
Next, bro... how to manage your GPU memory: loading your dataset and training your model.
@bikkikumarsha 10 months ago
please make a video on custom GPTs, actions and open ai dev event
@AyahuascaDataScientist 10 months ago
Does cudf.pandas work with apple silicon MPS GPU framework instead of just cuda?
@catslover4745 9 months ago
Hey, does anyone know how to get this working in a normal Visual Studio Code Python file instead of opening it in Jupyter? Thanks
@youknowmyname12345 10 months ago
Does this new accelerator speed up groupby() operations?
@r.k.vignesh7832 10 months ago
9:01 I'm running late for work, but wouldn't it be possible to vectorize this code and it be faster than both the CuDF and the CPU versions of this benchmark? Curious to see how CuDF plays with vectorized versions. If I get the time I'll try some experiments and update this comment.
@r.k.vignesh7832 10 months ago
On a trial dataset with 10000 rows of fake data (~500kB in size), using GroupBy to find the mean of the "unique prices" was 72x faster than the one implemented in this one at 6:45 or so. I expect that it will become exponentially faster with a dataset 5GB in size. I used groupby in another part and that naively halved the time, but it's still far from optimized and I'll probably upload my findings here with a ~5GB dataset running on Colab sometime in the next couple of weeks. After that, I'll try the CuDF version.
@EarlZMoade 10 months ago
Very interested to see this comparison. No one writes pandas code like this video where looping is both unnatural and terribly optimised.
@r.k.vignesh7832 10 months ago
@EarlZMoade thanks for your comment! I made a 1.5 GB dataset of random data for benchmarks, and for 2 operations (the first groupby at 6:45 and the upper-lower price bands) Sentdex's code took 4 mins and 2 secs, and 19 mins and 43 secs respectively on my computer. An optimized version (just groupby and vectorization, nothing fancy) took 2.25 SECONDS and 13.5 seconds respectively. I'm sure CuDF would be even faster, but in this case there was a lot of performance left on the table.
@Summersault666 10 months ago
Will it work on an Apple notebook?
@mikahoy 4 months ago
Hi, how do I do this for geopandas? Is it the same?
@gingerjiang666 10 months ago
Does it still use memory to fit the entire dataset?
@citizenR1203 10 months ago
Very interesting, thank you for sharing 😊 but this seems not to be compatible with macOS and a 2,9 GHz Quad-Core Intel Core i7 processor.
@henryyoo3032 10 months ago
which would make perfect sense since you would need a cuda enabled nvidia gpu for cudf to work
@sillybuttons925 10 months ago
Does scipy work with it?
@JuanRamirez-di9bl 10 months ago
Is this faster than numpy??
@kelkka7 10 months ago
Will this work with geopandas?
@spicy.d 10 months ago
I'm waiting on AMD to enter the DS space so I can use my 7900XTX to do things LOL
@SwissPGO 10 months ago
Any alternative for Apple Silicon?
@mlcat 10 months ago
Dask maybe
@Adaministrator 10 months ago
Dask and swifter have like 1000x processing speed for some batch jobs I have with airflow, so def try that. It’s also drop in
@poloceccati 10 months ago
What about AMD GPUs?
@andreaqui1653 2 months ago
Anyone have any luck getting it installed on a local windows machine?
@Harmxn 8 days ago
You cannot install it on Windows because CuDF is only supported on Linux. Instead, you can make a WSL instance and install Python and CuDF on there.
@ai.simplified.. 10 months ago
3:20
@jamoncitovideos 10 months ago
Ditch pandas and use spark, your local data engineer will thank you
@Arctect 10 months ago
like ** 10 == 0.0s
@ashu- 10 months ago
Ngl, your video looks deepfaked. I also think at this point you could probably deepfake yourself with a bit of editing and make people believe it's real.
@noormohammedshikalgar 10 months ago
Hello Sentdex, I am reaching out regarding your Neural Networks from Scratch series. Any updates on that? You left off at pt 9. Please do continue, it's an awesome series. Also, any updates on book discounts for Black Friday? Please do help.
@acelaox6836 10 months ago
the kubota warrior is back with the heat 🗣🗣🗣