Polars: The Next Big Python Data Science Library... written in RUST?

  163,193 views

Rob Mulla

A day ago

In this video tutorial I explain everything you need to get started coding with polars. Polars is a multi-threaded DataFrame library, meaning that it allows using all the cores of a computer at the same time to achieve its full processing potential. It's been shown to have huge performance gains over pandas.
Timeline:
00:00 Intro
01:00 What is Polars?
02:43 Getting Started
06:32 Filtering
07:15 New Columns
08:10 Groupby
08:55 Combining Dataframes
10:17 Multithreaded Approach
11:21 Speed Test
12:50 Takeaways
Follow me on twitch for live coding streams: / medallionstallion_
My other videos:
Speed Up Your Pandas Code: • Make Your Pandas Code ...
Intro to Pandas video: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
Working with Audio data in Python: • Audio Data Processing ...
Efficient Pandas Dataframes: • Speed Up Your Pandas D...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #polars #datascience

Comments: 241
@rahuldev2380 1 year ago
Polars is built on top of Apache Arrow which pandas supports. So you can easily convert your polars dataframe to pandas with almost zero overhead. I use polars to do the hard part and jump back to pandas for the visualization stuff
@cryptoworkdonkey 1 year ago
Only if you use pyarrow first. Pandas converts Arrow data into its internal representation (NumPy arrays managed by the BlockManager) and back. It is not zero cost.
@rahuldev2380 1 year ago
@@cryptoworkdonkey Ah my bad. I thought they had updated their internals from numpy
@jakobullmann7586 1 year ago
Same here. There are some things where Pandas is more convenient, but for most stuff I strongly prefer Polars. It’s not just execution performance, but also the speed of writing the code.
@adrianjdelgado 1 year ago
@@cryptoworkdonkey Good news: the Pandas 2.0 release candidate now uses pyarrow as the backend. Polars Pandas conversions will be zero cost.
@bigphab7205 1 year ago
10000 points for printing the version. Every tutorial video should do that.
@robmulla 1 year ago
Thanks! I forget to do it on all of my videos but your comment is going to remind me to do it in the future.
@brd5548 1 year ago
Our team tried to integrate polars into our analytics pipeline last year, and the result was kinda on and off. To be honest, the performance of pandas is not that bad, we spent some time on doing several fine tunings, like rewriting key bottlenecks with our native modules or with these vectorized pandas methods, and the result turned out just ok. On the other hand, the integration work of polars did require some major revamping and refactoring, due to API gaps and implementation differences between the two. However, the performance gains didn't seem to justify the effort. What's worse, while pandas does come with pitfalls and caveats here and there, polars is a relatively young project and it comes with bugs on basic text manipulating operations. But don't get me wrong, that was my experience last year. I do think polars has the potential. It has a much more robust and modern architecture than pandas in my opinion. Its API style is cleaner and more consistent. And it comes with a query optimization engine, which many users can appreciate if you are familiar with tools like apache spark or some databases. Given time, I think polars should become another powerful player in the future. So, definitely give it a try if you're building something new!
@robmulla 1 year ago
Thanks for sharing! I haven't used polars in production yet, so it's interesting to hear about your experience. I guess there are limitations I didn't consider in this video. I totally agree it's worth giving a try.
@BiologyIsHot 1 year ago
This is the major bit. Who is bottlenecked by Pandas? I think the bottlenecks happen with ML or other modeling libraries which are working with the data in the form of Numpy arrays.
@leventelajos5078 1 year ago
"Its API style is cleaner" Really? I think Pandas is much more pythonic.
@incremental_failure 1 year ago
@@leventelajos5078 Agree. Column assignment in Pandas seems more pythonic.
@konstagold 1 year ago
@@BiologyIsHot When you're working with large data sizes, you will be bottlenecked by pandas in no time. Typically at that point, you switch to spark, which has its advantages, but also downsides. Polars looks to be a good middle fit between the two that dask was trying to achieve.
@jakobullmann7586 1 year ago
13:20 Regarding learning the syntax… It’s worth mentioning that Polars syntax is very similar to PySpark, so it’s really two birds with one stone.
@robmulla 1 year ago
That’s a good point. Thanks for pointing it out. I really need to do a spark vs polars comparison video.
@Joselias156216 1 year ago
Nice video. Very interesting to see how Polars works. I hope to see it more frequently in your future streams to learn more about its practical use.
@robmulla 1 year ago
Thanks Jose! I appreciate the feedback. I'm definitely going to give it a try in a future stream. I just need to find a good dataset for it.
@rohitnair4268 1 year ago
As usual, Rob, nice video. I have learned a lot from you.
@santiagoperman3804 1 year ago
Great timing, I was looking to start playing with Polars since Mark Tenenholtz mentioned it some days ago. I went back to Pandas because I couldn't find the assign() and astype() equivalents in Polars. I thought they were lacking, but they seem to be with_columns() and cast(). Now I will resume more persistently.
@robmulla 1 year ago
Glad you found this video helpful. It does seem like polars may be worth the time investment now that it's becoming more established.
@calum.macleod 1 year ago
Thanks for a good explanation of how Polars could benefit people who use Pandas and need more speed. In my project we already have a heavy emphasis on multi processing and fast inter process communication, so I am especially interested to see a Pandas vs Polars single core performance comparison for group and join. I hope that someone does the comparison and posts it to YouTube.
@robmulla 1 year ago
Glad it was helpful! If you look in the polars repo they have some queries that they benchmark. H2o also has a benchmark comparison of a few different libraries.
@calum.macleod 1 year ago
@@robmulla Thanks for the reply. I will look into the benchmarks and h2o.
@juan.o.p. 1 year ago
Thanks for the recommendation, I will definitely give it a try 😊
@robmulla 1 year ago
Please do and let me know what you think. There might be negatives about it that I'm not aware of.
@scraps7624 1 year ago
I saw some tweets about Polars but seeing it in action is something else. Also, I can't believe it took me this long to find your channel, subbed!
@robmulla 1 year ago
That’s awesome! Glad you found my channel. Feel free to share with others!
@gregharvey8574 1 year ago
Thanks for bringing this to my attention, I think I might include polars in some productionization processes. For data exploration, I typically only use parts of dataframes for plotting or investigation. Given that you can convert a polars dataframe to pandas, it seems like a good approach would be to have the full dataset in polars and then filter into a pandas dataframe and plot.
@robmulla 1 year ago
That's a good point about how you can convert the dataframe to pandas when you need to do exploration. I'll have to think about how to use this in my EDA pipelines.
@headbangingidiot 1 year ago
@@robmulla you can pass polars columns into plotting libs like plotly
@BiologyIsHot 1 year ago
The question though is do you save much time when doing this? Instantiation of Numpy arrays and Pandas dataframes themselves isn't the fastest. I guess if you have multiple "slow" actions to perform on the data you might have some benefits? Or if you really are working at such a massive scale with many many users that saving compute time is really valuable.
@jcbritobr 1 year ago
Nice stuff. Polars seems like a killer tool. Thank you for sharing.
@robmulla 1 year ago
Thanks for watching. It does seem promising.
@bryanwilly4086 10 months ago
Perfect, thank you!
@sonnix31 1 year ago
This is fantastic. Thank you
@robmulla 1 year ago
You're very welcome!
@tmb8807 7 months ago
I'm blown away by how fast this is. Sure there are some things it can't do, but man, even for just reading large data sets it's absolutely blazing.
@user-fb3lg6ys5s 1 year ago
OMG.. thank you!!
@gabrielperfumo1122 1 year ago
Great channel!! Thanks for sharing. I'll check it out for sure!
@robmulla 1 year ago
Thanks Gabriel!
@ChaiTimeDataScience 1 year ago
DataTable is also pretty legendary, you might also find it super awesome. Thanks again for your amazing videos, I have watched and learned from every one of them. I hope I'll interview you about your 100k celebration sometime next year 🙏
@robmulla 1 year ago
Thanks Sanyam! I need to check it out. Hopefully 100k will come next year, but maybe 2024! Talk soon.
@rackstar2 5 months ago
I recently decided to fully transition over to using polars instead of pandas for a data pipeline project. The primary reason I'm liking polars over pandas is not just the speed (the speed is nice, don't get me wrong) but the space usage! Almost all of my operations entailed working with data larger than memory. One of the operations I have to do is pivoting a dataframe. My end result has thousands of columns! My kernel never seems to hold steady when doing this with pandas, but polars is really doing the trick for me. One small problem I did face, though, is exporting the results of the pipeline. I still have to resort to something like pyarrow and use its writer to do the export in chunks. This might just be because of how low my system memory is. Regardless of this, polars seems to be an excellent option for data processing and manipulation, and if you do want to showcase your data, you can always convert back and forth with pandas!
@GiasoneP 1 year ago
Like PySpark AND Pandas. Second half mirrors PySpark. Due to the speed and out-of-the-box parallelization, I wonder how it stacks up against Spark and how its functionality compares to a cluster of machines. Take AWS for example, can it be applied to an EMR cluster? As a side note, I’m super excited about Rust and its future in data.
@cryptoworkdonkey 1 year ago
There are some Apache Arrow based Spark competitors (still too young), like Ballista (distributed DataFusion, written in Rust). We "buy" Spark for the "Resilient" in the RDD abbreviation. Polars can process 50 GB on a machine where Spark manages 35 GB, because of Spark's less efficient row-based abstraction (a trade-off of being distributed), Scala case classes blowing up memory, etc., versus the skinny Rayon runtime in Polars. The Ray platform has the same Arrow format backend and is more efficient than Spark, but it can't do streaming (yet). In the Polars repo, the polars-dask integration is empty.
@pabtorre 1 year ago
Yeah, the syntax is very similar to pyspark. Wonder how well it'll run on a spark cluster...
@robmulla 1 year ago
Good question. I don’t think polars is meant as a replacement for pyspark because from what I can tell it doesn’t distribute computation across nodes.
@AWest-ns3dl 1 year ago
I can confirm, Polars does not use nodes.
@RyanApplegatePhD 1 year ago
@@robmulla With the ever improving compute, I think Polars could be in a sweet spot between Spark and Pandas. I know when I was parsing very raw large datasets in pandas I did feel sometimes constrained and moved to Spark, however; there is a lot of overhead for using Spark effectively and this might split the difference.
@akhil-menon 1 year ago
Hi Rob, thank you for this super informative video! In one of your takeaways, you mentioned that Polars is a good fit if we have some really heavy data processing work. Would you be able to share some insight on how Polars would stack up against Pandas when having to perform heavy NumPy specific computations?(Think linear and vector algebra, trigonometry, matrix operations) I read on SO that it is imperative to not kill the parallelization that Polars provides by using Python specific code, so it is my intuition that applying NumPy operations on Polars columns could result in a loss of parallelization. It would be great if you could share your thoughts on this. Thank you again for the amazing content you produce!
@aminehadjmeliani72 1 year ago
Hi Rob, I think it's a good approach to diversify our tools these days, especially when it comes to dealing with memory (sometimes I find myself running out of time with pandas)
@robmulla 1 year ago
Absolutely! Well said.
@patrickonodje1428 1 year ago
I love your work. You should have a course on data science.. for folks like us just learning
@robmulla 1 year ago
Maybe one day! Thanks for watching Patrick!
@patrickonodje1428 1 year ago
@@robmulla Looking forward
@bubbathemaster 1 year ago
Extremely interesting. It’ll be hard to dethrone pandas due to the huge community support but I really like the lib.
@robmulla 1 year ago
I agree pandas is too entrenched at this point to be easily dethroned.
@michaelnorthrup2 1 year ago
This is my little trick for hyper optimizing data processing haha. Pivots are insanely fast in polars
@robmulla 1 year ago
Ohh. Never tried pivots in it.
@Mari_Selalu_Berbuat_Kebaikan 1 year ago
Let's always do good and encourage more people to do the same 🙏
@robmulla 1 year ago
ok!
@CaribouDataScience 1 year ago
Good stuff!!
@robmulla 1 year ago
Glad you enjoyed it
@tonik2558 1 year ago
The usage in Python seems to mirror a lot of the standard Rust iterator API. Looks like it would be even better if used directly in Rust. Thanks for making a video about this.
@brainsniffer 1 year ago
I think that there is so much for data that is built in python that it’s easier to use an abstraction like this than to do things in rust, especially for interactions. It’s an interesting idea.
@robmulla 1 year ago
I have learning RUST on my todo list. Will you teach me? 😝
@tonik2558 1 year ago
@@robmulla The Book is an amazing starting resource. It's how I learned Rust, and it's probably the fastest way to get started with the language
@shadowangel8005 1 year ago
@@robmulla google just posted a small course a week or so back
@pimziengs2900 1 year ago
Thanks for this video! I am a data scientist always looking for some new techniques xD. Cheers from the Netherlands! PS: There is some background noise in your video around 3:30.
@robmulla 1 year ago
Welcome! Glad to have a viewer from the Netherlands. Sorry about the noise at 3:30 - I didn't notice it until after I was done editing and then it was too late.
@nikjs 1 year ago
For the python library developers : Pls create a wrapper lib that does this job of converting regular pandas syntax into the wee-bit more complicated polars syntax. I can see that not all ops would be readily convertible, but there's definitely some low-hanging fruit here, which would cover a lot of simple use cases.
@robmulla 1 year ago
That would be nice. But I also think it’s nice to have it different to make it clear it’s not the same.
@mutley11 1 year ago
Very compelling presentation; many thanks. I would have liked to see an example of how user-friendly the error messages are. Rust error messages are surprisingly good in general and I was wondering if that is true of polars. You missed at least one opportunity to illustrate a typo. 😊
@robmulla 1 year ago
Glad it was helpful! Next time I'll try to throw more errors :D
@BiologyIsHot 1 year ago
I think the big problem is that it isn't inter-operable with Numpy-based libraries. I'm honestly struggling to think of many cases where Pandas is too slow. Some of the features like a lazy/eager API could be nice, but I think most of the slow computations people are doing are within libraries that are going to require conversion to Numpy arrays already.
@robmulla 1 year ago
Yea, I guess it really depends on your use case. I've run across a few recently where polars was helpful.
@adrianjdelgado 1 year ago
You can convert to and from Pandas very easily. Now that Pandas 2.0 will use pyarrow as the backend, that conversion will be truly zero cost.
@HyperFocusMarshmallow 1 year ago
The rust community really produce brilliant stuff. Very impressive! Did you find any areas where polars is lacking vs pandas? Btw, have you checked out nu-shell? It’s essentially a new shell language designed to do the Unix-philosophy but with data frames for data flow. At least as far as I understand it. Written in rust of course. It’s in pretty early development but it feel pretty great to play around with and can probably produce some nice workflows.
@robmulla 1 year ago
Never heard of nu-shell but I’ll check it out. I am not too familiar with the Rust community but this package is pretty solid. As people have mentioned, the syntax is much more verbose and it lacks some of the built-in pandas features.
@hensonjhensonjesse 1 year ago
It looks surprisingly similar to pyspark. Especially the lazy implementation. Pretty cool stuff!
@robmulla 1 year ago
Yea, a lot of similarities to pyspark!
@The-KP 1 year ago
@Rob Mulla Nice that Polars can perform rdbms-like ops, but what about the computation libs bind to Pandas dataframes, like numpy, scipy, scikit-learn? If it can be used with those, or somehow replaces them, I'm in! Hopefully Polars is not an island.
@robmulla 1 year ago
I know you can easily convert from polars back to a pandas dataframe and they use similar Apache Arrow.
@Pedro_Israel 1 year ago
Hey Rob, can you do a video about automatic EDA libraries? I used them and they blew my mind. I am amazed I didn't know about them earlier.
@robmulla 1 year ago
That's a good suggestion. What libraries have you used that you like? The main one I've seen is pandas profiling.
@ApeWithPants 1 year ago
Pandas has some strange quirks that always bothered me. Strange syntax or unintuitive copy/not copy behavior. Glad to see more competitors
@robmulla 1 year ago
I’m a big fan. But also think polars and others like it have good potential. Thanks for watching! Are you a kraken fan? Go Caps!
@jackychan4640 1 year ago
Happy New Year 2023
@robmulla 1 year ago
Same to you Jacky! 🎆
@DiegoSilva-dv9uf 10 months ago
Thanks!
@robmulla 10 months ago
Thanks so much 🙌
@AlexanderHyll 1 year ago
As a btw. If you want to plot smth quick, converting to a pandas is super fast (if ofc a bit mem inefficient). Can also just pass columns to plt. Just my 2 cents.
@robmulla 1 year ago
Good point, I do use df.plot() a lot though so it would take some getting used to.
@adrianjdelgado 1 year ago
Now that Pandas 2.0 uses pyarrow as backend, conversions will be truly zero cost.
@PlatinumDragonProductions999 1 year ago
I love Pandas, but I prefer Spark. This looks very Spark-like to me; I'm eager to make it my goto dataframe processor. :-)
@robmulla 1 year ago
If you prefer spark I’m guessing this will be a great package for you.
@aliyektaie9123 1 year ago
Hi, thanks for this great video! It looks like Polars is very similar to Spark. Do you know how they compare?
@robmulla 1 year ago
Thanks for the comment. They are very similar. Check out my most recent video where I compare the two.
@cryptoworkdonkey 1 year ago
I think Polars should replace Pandas in ETL tasks, but building expressions comfortably is still a bit of a struggle. In the Arrow universe there is also the DataFusion project as an alternative.
@robmulla 1 year ago
I agree. I haven't fully tested out the expressions to notice what I use in pandas that polars is missing. What is the Data Fusion project, I'm not familiar with that?
@cryptoworkdonkey 1 year ago
@@robmulla DataFusion is more of an "Arrow community" project (it is part of the Apache Arrow project), positioned as a Spark/Hive/MR challenger. It is designed to be more modular, with SQL and DataFrame APIs, and it can be used as a library by higher-level projects (it positions itself as a query engine for Arrow). Polars positions itself as a challenger to classical DataFrame libraries. Both can be used as a SQL CLI, and both have plan optimizers, Rayon parallelism, SIMD optimizations, etc. Both are cool. I don't know about the larger-than-memory capabilities of DataFusion. DataFusion is the foundation of the Blaze/Ballista distributed computing engines. The Polars Dask integration repo is currently not active.
@chris_kouts 1 year ago
You should do a benchmarking video i was waiting for you to tell me if i should start using it
@robmulla 1 year ago
I made a video about it just yesterday! Check it out on my channel.
@JustinGarza 1 year ago
I like this, but I wish it covered graphs. Does this use matplotlib or something else to make graphs and charts?
@robmulla 1 year ago
It doesn’t. But you can always convert it back to a pandas data frame to plot.
@JustinGarza 1 year ago
@@robmulla umm maybe I’ll wait til it gets more graphic/chart support or until pandas gets updated
@neronjp9909 1 year ago
How come every time you click the column name, it gets copied into the code you're typing? Is there a hotkey for that? My company's raw data column names are so long, with _ / space / dot, that typing them always slows me down. May I know how you do that at 8:07? Thanks.
@grabani 1 year ago
Interesting.
@robmulla 1 year ago
Glad you think so!
@MaavBR 1 year ago
7:10 Quick correction: SAN is San Diego, not San Francisco. San Francisco airport's code is SFO.
@robmulla 1 year ago
Doh! Good catch.
@simplemanideas4719 1 year ago
Speed is always a priority, because it is equal to resource optimization. However, this leads to the question: how efficient are both libs per core?
@robmulla 1 year ago
Good question. I'd guess polars is faster on all fronts but it would depend on a lot of things.
@AaronWoodrow1 1 year ago
I don't fully get why it's geared more toward data pipelining rather than data exploration (as mentioned @ 13:33) if the data needs to be contained to a single host. Even with parallelization across multiple CPUs, there's still a data size cap limited by available memory. A tool such as PySpark (or Dask) seems better suited for pipelining, which ultimately consumes larger amounts of data.
@robmulla 1 year ago
Yea. I see your point. Sometimes you have data in between or just want a faster pipeline for a small job you run on a regular basis. Either way, if it was identical to python and faster then people would use it for sure!
@AaronWoodrow1 1 year ago
@@robmulla True, just a minor nit. Great video btw!
@georgiyveter6391 1 year ago
I use Python 3.10. I created a dictionary: d = {'a': [1,2,3], 'b': [4, -5, 6]}. Then I created a dataframe: df = pl.DataFrame(d), followed by print(type(df)) and print(df). It all works. But if I change any number in dictionary d to a float, for example 6.8, then print(type(df)) still shows it's a dataframe, but the next print silently does nothing, like 'pass', and the script ends. Why?
@robmulla 1 year ago
That’s a great question. Is it only with 3.10?
@samstanton-cook1419 1 year ago
Great video, thanks Rob! Our data science teams use polars a lot. For long timeseries aggregation queries (100M+ rows) we use the pykx python package to access the q kdb+ language for even higher performance than pandas and polars. Have you seen it? kx.q.qsql.select(qtab, columns={'minCol2': 'min col2', 'medCol3': 'med col3'}, by={'groupCol1': 'col1'}, where=['col30.7'] )
@robmulla 1 year ago
I need to check that out. Pykx… first time hearing of it. Sounds cool though. Thanks for watching.
@Matias-eh2pn 1 year ago
How did you configure that theme on jupyter?
@robmulla 1 year ago
I have a whole video on my setup. Check it out here: kzbin.info/www/bejne/ipXFlqyjiciMj6c
@K-mk6pc 1 year ago
I am working on large data in pandas, but it's not a problem for me. Pandas does fine in a few minutes.
@bazoo513 1 year ago
I wonder why authors of these tabular data manipulation libraries didn't adopt relational algebra terminology (or even SQL as a, if not the, manipulation language). For example, why isn't choosing only some columns called "projection"? Subtle syntax (and _especially_ semantics) differences between libraries designed to do essentially the same tasks make the lives of users unnecessarily more difficult.
@robmulla 1 year ago
That’s a good point. Some libraries (like spark) do have the ability to write SQL directly on flat files like this.
@ankan650 1 year ago
Wow. It looks like Apache Spark might be obsolete soon. Can you also compare the Ray packages with Polars? I think Ray is not exactly for data processing but rather for more compute-intensive tasks. Thanks.
@robmulla 1 year ago
I benchmark ray in a different video if you want to check it out.
@JordiRosell 1 year ago
For plotting with polars, I think plotnine is a good option.
@robmulla 1 year ago
I have a video all about my favorite plotting libraries (including plotnine): kzbin.info/www/bejne/aoDCoGhplsxml8k&feature=shares
@yayasssamminna 2 months ago
Please make a tutorial on Dask!!!
@chintansawla 1 year ago
The library feels like it's based off the syntax/methods of pyspark. A lot of the methods used are similar to how RDDs are converted to DataFrames in pyspark
@robmulla 1 year ago
Yes, definitely a lot of similarities between pyspark and polars. Pyspark has always been much slower for me when running on a single node.
@chintansawla 1 year ago
@@robmulla that's a bit shocking! Both seem to be performing in a similar fashion theoretically (lazy evaluation, parallel computing). Going to try and compare polars soon. Thanks
@jordanfox470 1 year ago
@@robmulla have you tried pandas on spark? Databricks has that running.
@robmulla 1 year ago
@@jordanfox470 no. Have you? How does it compare?
@bazoo513 1 year ago
"Split, apply, combine" approach sounds like it could employ massively parallel processing of graphics cards. Is there a CUDA implementation?
@robmulla 1 year ago
Yes! It’s called rapids. I need to make a video about it.
@bazoo513 1 year ago
@@robmulla Thanks!
@user-fv1576 2 months ago
Looks a bit like SQL with the select. Newbie question, why not just use pandasql library?
@donnillorussia 1 year ago
Isn't this "split-apply-combine" approach similar to map-reduce? Just curious 😉
@robmulla 1 year ago
Yes! Exactly. Map reduce (like in spark) is very similar. Polars only runs single node, and map reduce I believe can be done across nodes.
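To make the comparison concrete, here is a pure-Python sketch of the split-apply-combine shape, using only the standard library. It is not how polars is implemented; the rows and function names are hypothetical, and Python threads will not actually speed up this pure-Python arithmetic. It only shows why the pattern parallelizes naturally: each group can be processed independently, much like a map step followed by a reduce step.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (origin, delay) rows.
rows = [("SAN", 10), ("SFO", 4), ("SAN", 20), ("SFO", 6), ("JFK", 9)]


def split(rows):
    """Split: bucket the values by key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def apply(values):
    """Apply: aggregate one group independently of the others."""
    return sum(values) / len(values)


def split_apply_combine(rows):
    groups = split(rows)
    with ThreadPoolExecutor() as pool:
        # Each group is handed to a worker; results come back in input order.
        means = pool.map(apply, groups.values())
        # Combine: stitch the per-group results back together.
        return dict(zip(groups.keys(), means))


result = split_apply_combine(rows)
print(result)
```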
@suvidani 1 year ago
How does the performance compare to pyspark? The syntax is very similar to pyspark.
@robmulla 1 year ago
Good question. I might need to test it out. Haven’t used spark in years and had some bad experiences but it’s probably gotten better since then.
@praveenmogilipuri4524 1 year ago
Hi, can anyone help me connect polars with snowflake? I can do it through pandas, but I don't want to use pandas.
@robmulla 1 year ago
I’ve never done anything like that before but maybe others will know how.
@akshaydushyanth9720 1 year ago
Is it similar to pyspark? Whats the difference between both?
@robmulla 1 year ago
Only runs on a single node. Much faster than pyspark when working with data that can fit in memory.
@user-ck3hp8cj4h 1 year ago
somehow it's very similar to Spark on AWS Glue ?
@robmulla 1 year ago
Yes, very similar but I think polars is intended for a single machine vs. spark which can be distributed across nodes.
@rhard007 1 year ago
Is it not possible to use Matplotlib or Seaborn with Polars?
@robmulla 1 year ago
It probably is possible. It's just not built into the dataframe as methods like it is in pandas. Just one additional step or you can convert the final data to pandas after processing.
@rahulrjb 3 months ago
Very PySpark-like syntax
@ArnabAnimeshDas 1 year ago
I would import another plotting library which produces a better plot anyways.
@robmulla 1 year ago
Yep, that's totally reasonable. Thanks for watching.
@ArnabAnimeshDas 1 year ago
@@robmulla also you can convert polars dataframe to pandas if you want to
@michaeldeleted 1 year ago
OMG I just completely replaced pandas with polars and all the regular pandas commands worked
@robmulla 1 year ago
Wait, what? I think the syntax should be very different. Unless they released a new version that I don't know about. Can you show an example?
@michaeldeleted 1 year ago
Oops, didn't change all my pd to pl. LOL was still using pandas
@robmulla 1 year ago
@@michaeldeleted oh! That explains it.
@valuetraveler2026 1 year ago
URLError:
@robmulla 1 year ago
Strange. Did you get this error when trying to pip install? Otherwise polars shouldn't be using anything to connect to the internet.
@ibekweobinna3514 1 year ago
Rob, can I add you to my website as one of the best tutors of data science? Man, you are good. Funnily enough, I am still learning pandas, and then boom, along came polars.
@robmulla 1 year ago
Thanks Ibekwe. Never stop learning!
@nikjs 1 year ago
3:35 - some audio interference starts from around this point, pls check the video
@robmulla 1 year ago
Thanks for the heads up. I noticed that when editing. Sorry about it.
@EircWong 1 year ago
Noise at 3:29, about 10 seconds
@robmulla 1 year ago
Yes! I noticed that. I forgot to put my phone further away from the mic. I tried to edit it out as much as possible. Hopefully it wasn't too distracting.
@AyahuascaDataScientist 9 months ago
Polars doesn’t have a .info() method? I can’t use it…
@JohannPetrak 1 year ago
Your timeit presentation includes the time to read the data which might not be such a good idea.
@robmulla 1 year ago
Nice catch, but I actually did that intentionally because data I/O is one area where polars can be much faster.
@JohannPetrak 1 year ago
@@robmulla It is just very bad practice to do this, and there are other issues which may totally distort the measurements, like the OS caching read data in buffers from a previous read.
@robmulla 1 year ago
@@JohannPetrak that’s a good point. Any idea how I could properly compare the read time in a way that wouldn’t be messed up by the caching?
@JohannPetrak 1 year ago
@@robmulla I think there is no way to avoid it, but it may be possible to reduce the effect by loading files that are much larger than what the OS might use for caching, and also by loading a sequence of many different files for a single benchmarking run, then repeating this several times and taking the average (and stdev). Also maybe check how much the external storage is the bottleneck by also loading from SSDs or memcached files. With HDDs this will be A LOT slower than the CPU-based benchmarks, so I would argue for separating these benchmarks from each other. But even with the CPU-based ones, running on larger data structures (on a computer that has even larger RAM) may give better results, as the impact of other OS, memory management, (JIT) interpreter etc. optimizations gets reduced. Sorry, I do not want to claim I know how to do proper benchmarks, but I do know (from experience) it is easy to not do it properly :)
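A minimal sketch of the "separate the benchmarks, repeat, and report average and stdev" suggestion above. It uses only the standard library so it runs anywhere; the helper names (`time_repeated`, `write_sample_csv`, `read_rows`, `sum_by_key`) and the tiny generated CSV are assumptions for illustration, not anything from the video. It does not defeat OS file caching; it only keeps the I/O number and the compute number apart so one cannot hide in the other.

```python
import csv
import os
import statistics
import tempfile
import time


def time_repeated(fn, repeats=5):
    """Run fn() `repeats` times; return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


def write_sample_csv(path, n_rows=10_000):
    """Create a small hypothetical CSV so the sketch is self-contained."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "value"])
        for i in range(n_rows):
            writer.writerow([i % 10, i])


def read_rows(path):
    """I/O step: load the whole file into memory."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def sum_by_key(rows):
    """Compute step: a simple group-by sum on rows that are already loaded."""
    totals = {}
    for row in rows:
        totals[row["key"]] = totals.get(row["key"], 0) + int(row["value"])
    return totals


tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "sample.csv")
write_sample_csv(csv_path)

# Time the read and the compute separately instead of as one block.
read_mean, read_std = time_repeated(lambda: read_rows(csv_path))
rows = read_rows(csv_path)
compute_mean, compute_std = time_repeated(lambda: sum_by_key(rows))

print(f"read:    {read_mean:.4f}s +/- {read_std:.4f}s")
print(f"compute: {compute_mean:.4f}s +/- {compute_std:.4f}s")
```

The same harness could wrap a pandas or polars read and a groupby as two separate callables; for the read step, larger and varied input files would be needed to reduce the caching effect described above.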
@fredgavin
@fredgavin Жыл бұрын
Tried Polars multiple times, and felt that it was too verbose. Just cannot give up R's data.table, which is the best data manipulation package in the data science world, no competitor at all.
@robmulla
@robmulla Жыл бұрын
Yea. Definitely more verbose than pandas. I haven’t used R in years but don’t remember it ever being the fastest.
@rolandheinze7182
@rolandheinze7182 Жыл бұрын
Polars syntax seems very similar to pyspark, and in my opinion therefore hurts readability vs pandas
@JayRodge
@JayRodge Жыл бұрын
Have you tried RAPIDS cuDF?
@robmulla
@robmulla Жыл бұрын
A little bit. It can be really fast but requires that your data is small enough to fit into your GPU memory.
@vzmaster
@vzmaster Жыл бұрын
I'm running into a different problem when I try to speed up pandas (or dask): they eat up memory really fast. In a JupyterLab environment, I load ~3-5 MB of data, use the pandas .extractall() function on a string field, and then compare the results with int fields (counting the matches). In a single thread it takes several weeks to calculate. If I use multiprocessing, then comparing the results with df.loc eats up 200 GB+ of memory.
@robmulla
@robmulla Жыл бұрын
That doesn't sound right. If your data is 3-5Mb I can't imagine any sort of processing needing to take a week to calculate. I'm thinking it's probably something in your code and not an issue with pandas or dask.
@vzmaster
@vzmaster Жыл бұрын
@@robmulla I actually have 2 tables, a big one and a small one. The big one has the data (~2M entries, ~150 MB), the small one has the patterns (~10k entries, ~2 MB). I need to run each pattern on all the data. That's why it may take long. But the primary problem is not speed, it's memory consumption. That's why I take small chunks of the big table, ~5-30 MB. But even with 5 MB I get a memory overflow at 200 GB. Here is the code:

    import numpy as np
    import pandas as pd
    import sqlalchemy as sql

    sql_engine = sql.create_engine('mysql+mysqlconnector://.......................')
    df = pd.read_sql_query("..........................", sql_engine)  # big table
    patterns = pd.read_sql_query("..........................", sql_engine)  # small table

    # ---------------- jupyterlab block separation ----------------

    def findoccurancesofpatterns(pat):
        (idx, row) = pat
        # One row per regex match, one column per capture group.
        res = df.summarynormalized.str.extractall(row.pattern)
        numvaluesstats = pd.DataFrame(columns=['pattern', 'numvalueorder', 'param1', 'param2', 'param3'])
        if len(res) > 0:
            numvaluecount = len(res.columns)
            # Attach the int fields of the source row to every match.
            res = pd.merge(res.reset_index(), df[['param1', 'param2', 'param3']], how='left', left_on='id', right_index=True)
            for i in range(numvaluecount):
                # Count how often capture group i equals each int field.
                numvaluesstats.loc[len(numvaluesstats)] = [
                    row.pattern,
                    i,
                    (res['param1'] == res[i].astype('Int64')).sum(),
                    (res['param2'] == res[i].astype('Int64')).sum(),
                    (res['param3'] == res[i].astype('Int64')).sum(),
                ]
        return numvaluesstats

    from multiprocessing.pool import Pool

    pool = Pool(50)
    allnumvaluesstats = []
    for numvaluesstats in pool.imap_unordered(findoccurancesofpatterns, patterns.iterrows()):
        allnumvaluesstats.append(numvaluesstats)
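For comparison, here is a rough standard-library sketch of the same counting idea as the code in the comment above: for one pattern, count how often capture group i equals each int field. It is not the commenter's method and makes no claim to fix their memory problem; the function name `count_param_matches` and the two sample records are made up for illustration. The only point it shows is that the per-pattern result can be held as a small counter instead of a merged DataFrame of every match.

```python
import re
from collections import Counter

PARAM_NAMES = ("param1", "param2", "param3")


def count_param_matches(pattern, records):
    """For one regex, count how often capture group i equals each param.

    records: iterable of (text, param1, param2, param3) tuples.
    Returns a Counter keyed by (group_index, param_name).
    """
    compiled = re.compile(pattern)
    counts = Counter()
    for text, *params in records:
        for match in compiled.finditer(text):
            for group_index, captured in enumerate(match.groups()):
                try:
                    value = int(captured)
                except (TypeError, ValueError):
                    # Group did not participate or is not an integer.
                    continue
                for name, param in zip(PARAM_NAMES, params):
                    if value == param:
                        counts[(group_index, name)] += 1
    return counts


# Hypothetical sample data: (text, param1, param2, param3).
records = [
    ("error 12 on node 7", 12, 7, 0),
    ("error 3 on node 7", 5, 7, 3),
]
stats = count_param_matches(r"error (\d+) on node (\d+)", records)
print(stats)
```

Because each call only returns a handful of integers, the results for many patterns can be collected in a plain list or dict whether the calls run in one process or several.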
@mishmohd
@mishmohd Жыл бұрын
Can we suggest they change the name to Polaris
@robmulla
@robmulla Жыл бұрын
Why do you suggest that?
@XavierSoriaPoma
@XavierSoriaPoma Жыл бұрын
So why should we use polars instead of pandas?
@robmulla
@robmulla Жыл бұрын
Did you watch the video? 😂 speed is the main reason.
@XavierSoriaPoma
@XavierSoriaPoma Жыл бұрын
@@robmulla yeah but still I'm not convinced, it's like tensorflow or pytorch they are not as fast as Flux, but we still use them in python
@hanabimock5193
@hanabimock5193 Жыл бұрын
I already see books and videos about polars. The same as with pandas. It is like come on, who needs a book for pandas? Are you kidding me ?
@robmulla
@robmulla Жыл бұрын
Why do you dislike the fact that there are books about it? Honestly curious. Thanks for watching!
@hawrezangana8240
@hawrezangana8240 Жыл бұрын
Unless you need to run your scripts over and over, I believe Polars cannot replace Pandas, as it takes more effort to write a simple aggregation. 2 seconds of faster execution is not worth 20 seconds of writing a line for every aggregation column and giving it an alias.
@robmulla
@robmulla Жыл бұрын
Yea. For quick scripts on small data and EDA, I’m sticking with pandas.
@leonidgrishenkov4183
@leonidgrishenkov4183 Жыл бұрын
In some cases Polars syntax seems like PySpark
@robmulla
@robmulla Жыл бұрын
I've been hearing that a lot :D
@leonidgrishenkov4183
@leonidgrishenkov4183 Жыл бұрын
@@robmulla ahaha sorry, I’m just a captain obvious 😂
@robmulla
@robmulla Жыл бұрын
@@leonidgrishenkov4183 No it's a good point that I didn't realize until people pointed it out. I personally don't use pyspark a ton. Thanks for watching.
@cradleofrelaxation6473
@cradleofrelaxation6473 Жыл бұрын
Is it just me, or is the syntax a bit more complicated than pandas whenever they differ?
@robmulla
@robmulla Жыл бұрын
Yes. I agree, it ends up being more verbose.
@commonsense1019
@commonsense1019 Жыл бұрын
Well the core of pandas can also be changed using RUST no big deal
@robmulla
@robmulla Жыл бұрын
It can. But will it?
@AWest-ns3dl
@AWest-ns3dl Жыл бұрын
Polars syntax is similar to spark
@robmulla
@robmulla Жыл бұрын
I’ve been hearing that 😃
@whitebai6367
@whitebai6367 Жыл бұрын
Okay, I'd like to use rust directly.
@robmulla
@robmulla Жыл бұрын
You can do it! Polars has a rust API too. Try it out and let me know what you think.
@BillyT83
@BillyT83 Жыл бұрын
So... Pandas + Dask = Polars?
@robmulla
@robmulla Жыл бұрын
Kinda… but it’s really just its own thing.
@go64bit
@go64bit Жыл бұрын
Panda bear vs Polar bear 😅
@robmulla
@robmulla Жыл бұрын
Bear fight!
@ErikS-
@ErikS- Жыл бұрын
Just take a huge amount of RAM. I did that also...
@robmulla
@robmulla Жыл бұрын
I used polars on a live stream and crashed my computer during it because it ate all my memory. There is a way to set it to limit the amount it uses I think
@richardbennett4365
@richardbennett4365 Жыл бұрын
This is the problem with people who use pandas: by and large, they don't know about polars. But why? Is it the Polars creator's fault for not promoting his product, or laziness by pandas users who just don't look for something better? Also, if one writes import polars as pd, then one doesn't need to rewrite code written for pandas. Or one can import polars as po. I never understood why people import this package as pl. That would be for a package called plank, like the dock replacement.
@robmulla
@robmulla Жыл бұрын
Importing as pl makes the most sense to me and it’s what their docs recommend.
@richardbennett4365
@richardbennett4365 Жыл бұрын
He said 15, but he wrote 10 at 7min 05s.
@robmulla
@robmulla Жыл бұрын
Good catch!
@xuantungnguyen9719
@xuantungnguyen9719 Жыл бұрын
Very similar to pyspark
@robmulla
@robmulla Жыл бұрын
It seems so. Just not distributed.
@nitinkumar29
@nitinkumar29 Жыл бұрын
I will let it mature before dealing with this.
@robmulla
@robmulla Жыл бұрын
That’s a fair approach. Adopting things too early can be problematic.
@NickWindham
@NickWindham Жыл бұрын
Just use Julia instead of Python. Then you can do all this with speed similar to Rust, in one language that has even simpler syntax than Python.
@robmulla
@robmulla Жыл бұрын
Oh really? I haven’t needed to use Julia yet, but I know it’s popular to use with spark.
@richardbennett4365
@richardbennett4365 Жыл бұрын
What??? Polars is supposed to give the same result as pandas. Duh. Polars is a pandas replacement.
@robmulla
@robmulla Жыл бұрын
Yep
@jaqo92
@jaqo92 Жыл бұрын
Looks like pyspark
@robmulla
@robmulla Жыл бұрын
Indeed! I just released a whole video comparing the two.
@ManAcadie
@ManAcadie Жыл бұрын
I'll stick to pandas
@robmulla
@robmulla Жыл бұрын
Any specific reason why?