Polars: The Next Big Python Data Science Library... written in RUST?

Views: 176,851

1 day ago

Comments: 247

@rahuldev2380 2 years ago

Polars is built on top of Apache Arrow which pandas supports. So you can easily convert your polars dataframe to pandas with almost zero overhead. I use polars to do the hard part and jump back to pandas for the visualization stuff

@cryptoworkdonkey 2 years ago

Only if you start from pyarrow. Pandas converts Arrow data into its internal representation (NumPy arrays managed by the BlockManager) and back again. It's not zero cost.

@rahuldev2380 2 years ago

@@cryptoworkdonkey Ah my bad. I thought they had updated their internals from numpy

@jakobullmann7586 2 years ago

Same here. There are some things where Pandas is more convenient, but for most stuff I strongly prefer Polars. It’s not just execution performance, but also the speed of writing the code.

@adrianjdelgado 1 year ago

@@cryptoworkdonkey Good news: the pandas 2.0 release candidate can now use pyarrow as a backend. With Arrow-backed columns, Polars-pandas conversions can be close to zero cost.

@bigphab7205 1 year ago

10000 points for printing the version. Every tutorial video should do that.

@robmulla 1 year ago

Thanks! I forget to do it on all of my videos but your comment is going to remind me to do it in the future.

@jakobullmann7586 2 years ago

13:20 Regarding learning the syntax… It’s worth mentioning that Polars syntax is very similar to PySpark, so it’s really two birds with one stone.

@robmulla 2 years ago

That’s a good point. Thanks for pointing it out. I really need to do a spark vs polars comparison video.

@Joselias156216 2 years ago

Nice video. Very interesting to see how Polars works. Hope to see it more frequently in your future streams to learn more about the practical use.

@robmulla 2 years ago

Thanks Jose! I appreciate the feedback. I'm going to definitely give it a try in a future stream. I just need to find a good dataset for it.

@brd5548 2 years ago

Our team tried to integrate polars into our analytics pipeline last year, and the result was kinda on and off. To be honest, the performance of pandas is not that bad, we spent some time on doing several fine tunings, like rewriting key bottlenecks with our native modules or with these vectorized pandas methods, and the result turned out just ok. On the other hand, the integration work of polars did require some major revamping and refactoring, due to API gaps and implementation differences between the two. However, the performance gains didn't seem to justify the effort. What's worse, while pandas does come with pitfalls and caveats here and there, polars is a relatively young project and it comes with bugs on basic text manipulating operations. But don't get me wrong, that was my experience last year. I do think polars has the potential. It has a much more robust and modern architecture than pandas in my opinion. Its API style is cleaner and more consistent. And it comes with a query optimization engine, which many users can appreciate if you are familiar with tools like apache spark or some databases. Given time, I think polars should become another powerful player in the future. So, definitely give it a try if you're building something new!

@robmulla 2 years ago

Thanks for sharing! I haven't used polars in production yet, so it's interesting to hear about your experience. I guess there are limitations I didn't consider in this video. I totally agree it's worth giving a try.

@BiologyIsHot 2 years ago

This is the major bit. Who is bottlenecked by Pandas? I think the bottlenecks happen with ML or other modeling libraries which are working with the data in the form of Numpy arrays.

@leventelajos5078 2 years ago

"Its API style is cleaner" Really? I think Pandas is much more pythonic.

@incremental_failure 1 year ago

@@leventelajos5078 Agree. Column assignment in Pandas seems more pythonic.

@konstagold 1 year ago

@@BiologyIsHot When you're working with large data sizes, you will be bottlenecked by pandas in no time. Typically at that point, you switch to spark, which has its advantages, but also downsides. Polars looks to be a good middle fit between the two that dask was trying to achieve.

@calum.macleod 2 years ago

Thanks for a good explanation of how Polars could benefit people who use Pandas and need more speed. In my project we already have a heavy emphasis on multiprocessing and fast inter-process communication, so I am especially interested to see a Pandas vs Polars single-core performance comparison for group and join. I hope that someone does the comparison and posts it to YouTube.

@robmulla 2 years ago

Glad it was helpful! If you look in the polars repo they have some queries that they benchmark. H2o also has a benchmark comparison of a few different libraries.

@calum.macleod 2 years ago

@@robmulla Thanks for the reply. I will look into the benchmarks and h2o.

@santiagoperman3804 2 years ago

Great timing, I was looking to start playing with Polars since Mark Tenenholtz mentioned it some days ago. I went back to Pandas because I couldn't find the assign() and astype() equivalents in Polars. I thought they were lacking, but they seem to be with_columns() and cast(). Now I will resume more persistently.

@robmulla 2 years ago

Glad you found this video helpful. It does seem like polars may be worth the time investment now that it's becoming more established.

@tmb8807 1 year ago

I'm blown away by how fast this is. Sure there are some things it can't do, but man, even for just reading large data sets it's absolutely blazing.

@gregharvey8574 2 years ago

Thanks for bringing this to my attention, I think I might include polars in some productionalization processes. For data exploration, typically I only use parts of dataframes for plotting or investigation. Given that you can convert a polars dataframe to pandas, it seems like a good approach would be to have the full dataset in polars and then filter into a pandas dataframe and plot.

@robmulla 2 years ago

That's a good point about how you can convert the dataframe to pandas when you need to do exploration. I'll have to think about how to use this in my EDA pipelines.

@headbangingidiot 2 years ago

@@robmulla you can pass polars columns into plotting libs like plotly

@BiologyIsHot 2 years ago

The question though is do you save much time when doing this? Instantiation of Numpy arrays and Pandas dataframes themselves isn't the fastest. I guess if you have multiple "slow" actions to perform on the data you might have some benefits? Or if you really are working at such a massive scale with many many users that saving compute time is really valuable.

@scraps7624 2 years ago

I saw some tweets about Polars but seeing it in action is something else. Also, I can't believe it took me this long to find your channel, subbed!

@robmulla 2 years ago

That’s awesome! Glad you found my channel. Feel free to share with others!

@GiasoneP 2 years ago

Like PySpark AND Pandas. Second half mirrors PySpark. Due to the speed and out-of-the-box parallelization, I wonder how it stacks up against Spark and how its functionality compares to a cluster of machines. Take AWS for example, can it be applied to an EMR cluster? As a side note, I’m super excited about Rust and its future in data.

@cryptoworkdonkey 2 years ago

There are some Apache Arrow-based Spark competitors (still too young), like Ballista (distributed DataFusion, written in Rust). We "buy" Spark for the "Resilient" in the RDD abbreviation. Polars can process 50 GB on a machine where Spark manages 35 GB, because of Spark's less efficient row-based abstraction (a trade-off for being distributed), Scala case classes blowing up memory, etc., versus the lean Rayon runtime in Polars. The Ray platform has the same Arrow format backend and is more efficient than Spark, but it can't do streaming (yet). In the Polars repo, the polars-dask integration is empty.

@pabtorre 2 years ago

Yeah, the syntax is very similar to pyspark. Wonder how well it'll run on a spark cluster...

@robmulla 2 years ago

Good question. I don’t think polars is meant as a replacement for pyspark because from what I can tell it doesn’t distribute computation across nodes.

@AWest-ns3dl 2 years ago

I can confirm, Polars does not use nodes.

@RyanApplegatePhD 2 years ago

@@robmulla With ever-improving compute, I think Polars could be in a sweet spot between Spark and Pandas. I know when I was parsing very raw large datasets in pandas I did feel sometimes constrained and moved to Spark. However, there is a lot of overhead for using Spark effectively and this might split the difference.

@rackstar2 1 year ago

I recently decided to fully transition over to using polars instead of pandas for a data pipeline project. The primary reason I'm liking polars over pandas is not just the speed (the speed is nice, don't get me wrong) but the space usage! Almost all of my operations entailed working with data larger than memory. One of the operations I have to do is pivoting a dataframe. My end result has thousands of columns! My kernel never seems to hold steady when doing this with pandas, but polars is really doing the trick for me. One small problem I did face, though, is exporting the results of the pipeline. I still have to resort to something like pyarrow and use its writer to do the export in chunks. This might just be because of how low my system memory is. Regardless, polars seems to be an excellent option for data processing and manipulation, and if you do want to showcase your data, you can always convert back and forth with pandas!

@ChaiTimeDataScience 2 years ago

DataTable is also pretty legendary, you might also find it super awesome. Thanks again for your amazing videos, I have watched and learned from every one of them. I hope I'll interview you about your 100k celebration sometime next year 🙏

@robmulla 2 years ago

Thanks Sanyam! I need to check it out. Hopefully 100k will come next year, but maybe 2024! Talk soon.

@jcbritobr 2 years ago

Nice stuff. This Polars seems like a killer tool. Thank you for sharing.

@robmulla 2 years ago

Thanks for watching. It does seem promising.

@DiegoSilva-dv9uf 1 year ago

Thanks!

@robmulla 1 year ago

Thanks so much 🙌

@curlyman_ 2 years ago

This is my little trick for hyper optimizing data processing haha. Pivots are insanely fast in polars

@robmulla 2 years ago

Ohh. Never tried pivots in it.

@juan.o.p. 2 years ago

Thanks for the recommendation, I will definitely give it a try 😊

@robmulla 2 years ago

Please do and let me know what you think. There might be negatives about it that I'm not aware of.

@patrickonodje1428 2 years ago

I love your work. You should have a course on data science.. for folks like us just learning

@robmulla 2 years ago

Maybe one day! Thanks for watching Patrick!

@patrickonodje1428 2 years ago

@@robmulla Looking forward

@mutley11 1 year ago

Very compelling presentation; many thanks. I would have liked to see an example of how user-friendly the error messages are. Rust error messages are surprisingly good in general and I was wondering if that is true of polars. You missed at least one opportunity to illustrate a typo. 😊

@robmulla 1 year ago

Glad it was helpful! Next time I'll try to throw more errors :D

@sonnix31 2 years ago

This is fantastic. Thank you

@robmulla 2 years ago

You're very welcome!

@bubbathemaster 2 years ago

Extremely interesting. It’ll be hard to dethrone pandas due to the huge community support but I really like the lib.

@robmulla 2 years ago

I agree pandas is too entrenched at this point to be easily dethroned.

@nikjs 2 years ago

For the python library developers : Pls create a wrapper lib that does this job of converting regular pandas syntax into the wee-bit more complicated polars syntax. I can see that not all ops would be readily convertible, but there's definitely some low-hanging fruit here, which would cover a lot of simple use cases.

@robmulla 2 years ago

That would be nice. But I also think it’s nice to have it different to make it clear it’s not the same.

@samstanton-cook1419 2 years ago

Great video, thanks Rob! Our data science teams use polars a lot. For long time-series aggregation queries (100M+ rows) we use the pykx Python package to access the q/kdb+ language for even higher performance than pandas and polars. Have you seen it? kx.q.qsql.select(qtab, columns={'minCol2': 'min col2', 'medCol3': 'med col3'}, by={'groupCol1': 'col1'}, where=['col30.7'] )

@robmulla 2 years ago

I need to check that out. Pykx… first time hearing of it. Sounds cool though. Thanks for watching.

@rohitnair4268 2 years ago

As usual, Rob, nice video. I have learned a lot from you.

@tonik2558 2 years ago

The usage in Python seems to mirror a lot of the standard Rust iterator API. Looks like it would be even better if used directly in Rust. Thanks for making a video about this.

@brainsniffer 2 years ago

I think that there is so much for data that is built in python that it’s easier to use an abstraction like this than to do things in rust, especially for interactions. It’s an interesting idea.

@robmulla 2 years ago

I have learning RUST on my todo list. Will you teach me? 😝

@tonik2558 2 years ago

@@robmulla The Book is an amazing starting resource. It's how I learned Rust, and it's probably the fastest way to get started with the language

@shadowangel8005 2 years ago

@@robmulla Google just posted a small course a week or so back

@The-KP 2 years ago

@Rob Mulla Nice that Polars can perform RDBMS-like ops, but what about the computation libs that bind to Pandas dataframes, like numpy, scipy, scikit-learn? If it can be used with those, or somehow replaces them, I'm in! Hopefully Polars is not an island.

@robmulla 2 years ago

I know you can easily convert from polars back to a pandas dataframe, and both support Apache Arrow.

@AaronWoodrow1 2 years ago

I don't fully get why it's geared more toward data pipelining rather than data exploration (as mentioned @ 13:33) if the data needs to be contained to a single host. Even with parallelization across multiple CPUs, there's still a data size cap limited by available memory. A tool such as PySpark (or Dask) seems better suited for pipelining, which ultimately consumes larger amounts of data.

@robmulla 2 years ago

Yea. I see your point. Sometimes you have data in between or just want a faster pipeline for a small job you run on a regular basis. Either way, if it was identical to python and faster then people would use it for sure!

@AaronWoodrow1 2 years ago

@@robmulla True, just a minor nit. Great video btw!

@aminehadjmeliani72 2 years ago

Hi @rub, I think it's a good approach to diversify our tools these days, especially when it comes to dealing with memory (sometimes I find myself running out of time with pandas)

@robmulla 2 years ago

Absolutely! Well said.

@BiologyIsHot 2 years ago

I think the big problem is that it isn't interoperable with Numpy-based libraries. I'm honestly struggling to think of many cases where Pandas is too slow. Some of the features like a lazy/eager API could be nice, but I think most of the slow computations people are doing are within libraries that are going to require conversion to Numpy arrays already.

@robmulla 2 years ago

Yea, I guess it really depends on your use case. I've run across a few recently where polars was helpful.

@adrianjdelgado 1 year ago

You can convert to and from Pandas very easily. Now that Pandas 2.0 can use pyarrow as a backend, that conversion can be close to zero cost for Arrow-backed columns.

@MaavBR 2 years ago

7:10 Quick correction: SAN is San Diego, not San Francisco. San Francisco airport's code is SFO.

@robmulla 2 years ago

Doh! Good catch.

@Mari_Selalu_Berbuat_Kebaikan 2 years ago

Let's always do good and encourage more people to do the same 🙏

@robmulla 2 years ago

ok!

@pimziengs2900 2 years ago

Thanks for this video! I am a data scientist always looking for some new techniques xD. Cheers from the Netherlands! PS: There is some background noise in your video around 3:30.

@robmulla 2 years ago

Welcome! Glad to have a viewer from the Netherlands. Sorry about the noise at 3:30 - I didn't notice it until after I was done editing and then it was too late.

@gabrielperfumo1122 2 years ago

Great channel!! Thanks for sharing. I'll check it out for sure!

@robmulla 2 years ago

Thanks Gabriel!

@nikjs 2 years ago

3:35 - some audio interference starts from around this point, pls check the video

@robmulla 2 years ago

Thanks for the heads up. I noticed that when editing. Sorry about it.

@jackychan4640 2 years ago

Happy New Year 2023

@robmulla 2 years ago

Same to you Jacky! 🎆

@neronjp9909 1 year ago

How come every time you click a column name, it gets copied into the code you're typing? Is there a hotkey for that? My company's raw data column names are long and contain underscores, slashes, spaces, and dots, so typing them always slows me down. May I know how you do that at 8:07? Thanks.

@AlexanderHyll 2 years ago

As a btw. If you want to plot smth quick, converting to a pandas is super fast (if ofc a bit mem inefficient). Can also just pass columns to plt. Just my 2 cents.

@robmulla 2 years ago

Good point, I do use df.plot() a lot though so it would take some getting used to.

@adrianjdelgado 1 year ago

Now that Pandas 2.0 can use pyarrow as a backend, conversions can be close to zero cost for Arrow-backed columns.

@chris_kouts 1 year ago

You should do a benchmarking video. I was waiting for you to tell me if I should start using it.

@robmulla 1 year ago

I made a video about it just yesterday! Check it out on my channel.

@hensonjhensonjesse 2 years ago

It looks surprisingly similar to pyspark. Especially the lazy implementation. Pretty cool stuff!

@robmulla 2 years ago

Yea, a lot of similarities to pyspark!

@ApeWithPants 2 years ago

Pandas has some strange quirks that always bothered me. Strange syntax or unintuitive copy/not copy behavior. Glad to see more competitors

@robmulla 2 years ago

I’m a big fan. But also think polars and others like it have good potential. Thanks for watching! Are you a kraken fan? Go Caps!

@akhil-menon 1 year ago

Hi Rob, thank you for this super informative video! In one of your takeaways, you mentioned that Polars is a good fit if we have some really heavy data processing work. Would you be able to share some insight on how Polars would stack up against Pandas when having to perform heavy NumPy-specific computations? (Think linear and vector algebra, trigonometry, matrix operations.) I read on SO that it is imperative to not kill the parallelization that Polars provides by using Python-specific code, so it is my intuition that applying NumPy operations on Polars columns could result in a loss of parallelization. It would be great if you could share your thoughts on this. Thank you again for the amazing content you produce!

@adityasrivastav7159 1 day ago

Polars is not working in my Jupyter Notebook. Whenever I import it, it shows that the kernel died.

@CaribouDataScience 2 years ago

Good stuff!!

@robmulla 2 years ago

Glad you enjoyed it

@cryptoworkdonkey 2 years ago

I think Polars should replace Pandas in ETL tasks, but constructing expressions comfortably is still a struggle. And in the Arrow universe there is the DataFusion project as an alternative.

@robmulla 2 years ago

I agree. I haven't fully tested out the expressions to notice what I use in pandas that polars is missing. What is the Data Fusion project, I'm not familiar with that?

@cryptoworkdonkey 2 years ago

@@robmulla DataFusion is a more "Arrow-community" project (it is part of the Apache Arrow project) that challenges Spark/Hive/MapReduce. It is designed to be more modular, with SQL and DataFrame APIs, and it can be used as a library by higher-level projects (it positions itself as a query engine for Arrow). Polars positions itself as a challenger to classical DataFrame libraries. You can use both as a SQL CLI, though. Both have plan optimizers, Rayon parallelism, SIMD optimizations, etc. Both are cool. I don't know about DataFusion's larger-than-memory capabilities. DataFusion is the foundation of the Blaze/Ballista distributed computing engines. The Polars Dask integration repo is currently not active.

@simplemanideas4719 1 year ago

Speed is always a priority, because it is equal to resource optimization. However, this leads to the question: how efficient are both libs per core?

@robmulla 1 year ago

Good question. I'd guess polars is faster on all fronts but it would depend on a lot of things.

@JustinGarza 2 years ago

I like this, but I wish it covered graphs. Does this use matplotlib or something else to make graphs and charts?

@robmulla 2 years ago

It doesn’t. But you can always convert it back to a pandas data frame to plot.

@JustinGarza 2 years ago

@@robmulla umm maybe I’ll wait til it gets more graphic/chart support or until pandas gets updated

@bryanwilly4086 1 year ago

Perfect, thank you!

@두두-b2d 2 years ago

OMG.. thank you!!

@PlatinumDragonProductions999 2 years ago

I love Pandas, but I prefer Spark. This looks very Spark-like to me; I'm eager to make it my goto dataframe processor. :-)

@robmulla 2 years ago

If you prefer spark I’m guessing this will be a great package for you.

@Pedro_Israel 2 years ago

Hey Rob, can you do a video about automatic EDA libraries? I used them and they blew my mind. I am amazed I didn't know them earlier.

@robmulla 2 years ago

That's a good suggestion. What libraries have you used that you like? The main one I've seen is pandas profiling.

@georgiyveter6391 2 years ago

Using Python 3.10. I created a dictionary: d = {'a': [1,2,3], 'b': [4, -5, 6]} and a dataframe: df = pl.DataFrame(d), then ran print(type(df)) and print(df). It all works. But if I change any number in dictionary d to a float, for example 6.8, then print(type(df)) still shows it's a dataframe, but the next print silently does nothing, like 'pass', and the script ends. Why?

@robmulla 2 years ago

That’s a great question. Is it only with 3.10?

@bazoo513 1 year ago

I wonder why the authors of these tabular data manipulation libraries didn't adopt relational algebra terminology (or even SQL as a, if not the, manipulation language). For example, why isn't choosing only some columns called "projection"? Subtle syntax (and _especially_ semantics) differences between libraries designed to do essentially the same tasks make the lives of users unnecessarily more difficult.

@robmulla 1 year ago

That’s a good point. Some libraries (like spark) do have the ability to write SQL directly on flat files like this.

@HyperFocusMarshmallow 2 years ago

The Rust community really produces brilliant stuff. Very impressive! Did you find any areas where polars is lacking vs pandas? Btw, have you checked out nu-shell? It’s essentially a new shell language designed to follow the Unix philosophy but with data frames for data flow. At least as far as I understand it. Written in Rust of course. It’s in pretty early development but it feels pretty great to play around with and can probably produce some nice workflows.

@robmulla 2 years ago

Never heard of nu-shell but I’ll check it out. I am not too familiar with the RUST community but this package is pretty solid. As people have mentioned, the syntax is much more verbose and it lacks some of the built-in pandas features.

@Myektaie 1 year ago

Hi, thanks for this great video! It looks like Polars is very similar to Spark, do you know how they compare?

@robmulla 1 year ago

Thanks for the comment. They are very similar. Check out my most recent video where I compare the two.

@JordiRosell 2 years ago

For plotting with polars, I think plotnine is a good option.

@robmulla 2 years ago

I have a video all about my favorite plotting libraries (including plotnine): kzbin.info/www/bejne/aoDCoGhplsxml8k&feature=shares

@bazoo513 1 year ago

"Split, apply, combine" approach sounds like it could employ massively parallel processing of graphics cards. Is there a CUDA implementation?

@robmulla 1 year ago

Yes! It’s called rapids. I need to make a video about it.

@bazoo513 1 year ago

@@robmulla Thanks!

@praveenmogilipuri4524 2 years ago

Hi, can anyone help me connect Polars with Snowflake? I can do it through pandas, but I don't want to use pandas.

@robmulla 2 years ago

I’ve never done anything like that before but maybe others will know how.

@valuetraveler2026 2 years ago

URLError:

@robmulla 2 years ago

Strange. Did you get this error when trying to pip install? Otherwise polars shouldn't be using anything to connect to the internet.

@user-fv1576 11 months ago

Looks a bit like SQL with the select. Newbie question, why not just use pandasql library?

@suvidani 2 years ago

How does the performance compare to pyspark? The syntax is very similar to pyspark.

@robmulla 2 years ago

Good question. I might need to test it out. Haven’t used spark in years and had some bad experiences but it’s probably gotten better since then.

@张世濠-j8e 2 years ago

somehow it's very similar to Spark on AWS Glue ?

@robmulla 2 years ago

Yes, very similar but I think polars is intended for a single machine vs. spark which can be distributed across nodes.

@Matias-eh2pn 2 years ago

How did you configure that theme on jupyter?

@robmulla 2 years ago

I have a whole video on my setup. Check it out here: kzbin.info/www/bejne/ipXFlqyjiciMj6c

@akshaydushyanth9720 2 years ago

Is it similar to pyspark? What's the difference between the two?

@robmulla 2 years ago

Only runs on a single node. Much faster than pyspark when working with data that can fit in memory.

@chintansawla 2 years ago

The library feels like it's based off the syntax/methods of pyspark. A lot of the methods used are similar to how RDDs are converted to DataFrames in pyspark

@robmulla 2 years ago

Yes, definitely a lot of similarities between pyspark and polars. Pyspark has always been much slower for me when running on a single node.

@chintansawla 2 years ago

@@robmulla that's a bit shocking! Both seem to be performing in a similar fashion theoretically (lazy evaluation, parallel computing). Going to try and compare polars soon. Thanks

@jordanfox470 2 years ago

@@robmulla have you tried pandas on spark? Databricks has that running.

@robmulla 2 years ago

@@jordanfox470 no. Have you? How does it compare?

@AyahuascaDataScientist 1 year ago

Polars doesn’t have a .info() method? I can’t use it…

@donnillorussia 2 years ago

Isn't this "split-apply-combine" approach similar to map-reduce? Just curious 😉

@robmulla 2 years ago

Yes! Exactly. Map reduce (like in spark) is very similar. Polars only runs single node, and map reduce I believe can be done across nodes.
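
The resemblance is easy to see in plain Python. This is only an illustration of the pattern on invented records, not how Polars or Spark implement it internally.

```python
from collections import defaultdict

# Hypothetical (key, value) records.
records = [("AA", 10), ("DL", 5), ("AA", 30), ("DL", 15), ("UA", 60)]

# Split: bucket the values by key (comparable to the map/shuffle side).
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply and combine: reduce each bucket to one number, gather into one result.
totals = {key: sum(values) for key, values in groups.items()}

print(totals)
```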

@K-mk6pc 1 year ago

I am working on large data in pandas, but it's not a problem for me. Pandas is doing fine in a few mins.

@yayasssamminna 11 months ago

Please make a tutorial on Dask!!!

@JohannPetrak 2 years ago

Your timeit presentation includes the time to read the data which might not be such a good idea.

@robmulla 2 years ago

Nice catch, but I actually did that intentionally because data I/O is one area where polars can be much faster.

@JohannPetrak 2 years ago

@@robmulla it is just very bad practice to do this, and there are other issues which may totally distort the measurements, like the OS caching read data in buffers from a previous read.

@robmulla 2 years ago

@@JohannPetrak that’s a good point. Any idea how I could properly compare the read time in a way that wouldn’t be messed up by the caching?

@JohannPetrak 2 years ago

@@robmulla i think there is no way to avoid it, but it may be possible to reduce the effect by loading files that are much larger than what the OS might use for caching, and also load a sequence of many different files for a single benchmarking run, then repeat this several times and take the average (and stdev). Also maybe check how much the external storage is the bottleneck by also loading from SSDs or memcached files. With HDDs this will be A LOT slower than the CPU based benchmarks, so I would argue to separate these benchmarks from each other. But even with the CPU based ones, running on larger data structures (on a computer that has even larger RAM) may give better results as the impact of other OS, memory management, (JIT) interpreter etc optimizations gets reduced. Sorry, I do not want to claim I know how to do proper benchmarks, but I do know (from experience) it is easy to not do it properly :)
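
The "repeat several times and take the average (and stdev)" part of that advice can be sketched with the standard library alone. The workload below is a hypothetical CPU-only stand-in, so file reads and OS caching stay out of the measurement.

```python
import statistics
import timeit


def workload():
    # Stand-in for the operation being measured (e.g. a group-by);
    # purely CPU-bound, so no file I/O or OS cache is involved.
    return sum(i * i for i in range(10_000))


# Repeat the measurement several times and report mean and spread,
# rather than trusting a single run.
runs = timeit.repeat(workload, number=20, repeat=5)
mean_s = statistics.mean(runs)
stdev_s = statistics.stdev(runs)

print(f"mean={mean_s:.6f}s stdev={stdev_s:.6f}s over {len(runs)} runs")
```

Timing data loading would go in a separate benchmark with its own repeats, so slow storage does not drown out the compute numbers.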

@FabioRBelotto 6 months ago

You should have tested polars with the same test as you did with dask, modin and vaex

@rhard007 2 years ago

Is it not possible to use Matplotlib or Seaborn with Polars?

@robmulla 2 years ago

It probably is possible. It's just not built into the dataframe as methods like it is in pandas. Just one additional step or you can convert the final data to pandas after processing.

@jay_wright_thats_right 2 months ago

Orders of magnitude faster? What does that even mean?

@XavierSoriaPoma 2 years ago

So why should we use polars instead of pandas?

@robmulla 2 years ago

Did you watch the video? 😂 speed is the main reason.

@XavierSoriaPoma 2 years ago

@@robmulla yeah but still I'm not convinced, it's like tensorflow or pytorch they are not as fast as Flux, but we still use them in python

@JayRodge 2 years ago

Have you tried RAPIDS cuDF?

@robmulla 2 years ago

A little bit. It can be really fast but requires that your data is small enough to fit into your GPU memory.

@ankan650 2 years ago

Wow. It looks like Apache Spark might be obsolete soon. Can you also compare the Ray packages with Polars? I think Ray is not exactly for data processing but rather for more compute-intensive tasks. Thanks.

@robmulla 2 years ago

I benchmark ray in a different video if you want to check it out.

@ArnabAnimeshDas 2 years ago

I would import another plotting library which produces a better plot anyways.

@robmulla 2 years ago

Yep, that's totally reasonable. Thanks for watching.

@ArnabAnimeshDas 2 years ago

@@robmulla also you can convert polars dataframe to pandas if you want to

@fredgavin 2 years ago

Tried Polars multiple times, and felt that it was too verbose. Just cannot give up R's data.table, which is the best data manipulation package in the data science world, no competitor at all.

@robmulla 2 years ago

Yea. Definitely more verbose than pandas. I haven’t used R in years but don’t remember it ever being the fastest.

@mishmohd 2 years ago

Can we suggest they change the name to Polaris

@robmulla 2 years ago

Why do you suggest that?

@JonLikesStats 5 months ago

Why do we compare polars to pandas instead of polars to dask? I dabble in Rust myself, so I'm interested in polars. But the comparison most people make seems inherently unfair because of multithreading.

@grabani 2 years ago

Interesting.

@robmulla 2 years ago

Glad you think so!

@michaeldeleted 2 years ago

OMG I just completely replaced pandas with polars and all the regular pandas commands worked

@robmulla 2 years ago

Wait, what? I think the syntax should be very different. Unless they released a new version that I don't know about. Can you show an example?

@michaeldeleted 2 years ago

Oops, didn't change all my pd to pl. LOL was still using pandas

@robmulla 2 years ago

@@michaeldeleted oh! That explains it.

@rolandheinze7182 1 year ago

Polars syntax seems very similar to pyspark, and in my opinion therefore hurts readability vs pandas

@EircWong 2 years ago

Noise at 3:29, about 10 seconds

@robmulla 2 years ago

Yes! I noticed that. I forgot to put my phone further away from the mic. I tried to edit it out as much as possible. Hopefully it wasn't too distracting.

@hanabimock5193 2 years ago

I already see books and videos about polars. The same as with pandas. It is like come on, who needs a book for pandas? Are you kidding me ?

@robmulla 2 years ago

Why do you dislike the fact that there are books about it? Honestly curious. Thanks for watching!

@ibekweobinna3514 2 years ago

Rob, can I add you to website as one of the best tutors of data science? Man, you are good. But funnily enough I am still learning pandas, then boom, along came polars.

@robmulla 2 жыл бұрын

Thanks Ibekwe. Never stop learning!

@Capsaicinophile 2 жыл бұрын

Unless you need to run your scripts over and over, I believe Polars cannot replace Pandas, as it takes more effort to write a simple aggregation. 2 seconds of faster execution is not worth 20 seconds of writing a line for every aggregation column and giving it an alias.

@robmulla 2 жыл бұрын

Yea. For quick scripts on small data and EDA, I’m sticking with pandas.

@AWest-ns3dl 2 жыл бұрын

Polars syntax is similar to spark

@robmulla 2 жыл бұрын

I’ve been hearing that 😃

@leonidgrishenkov 2 жыл бұрын

In some cases Polars syntax seems like PySpark

@robmulla 2 жыл бұрын

I've been hearing that a lot :D

@leonidgrishenkov 2 жыл бұрын

@@robmulla ahaha sorry, I’m just a captain obvious 😂

@robmulla 2 жыл бұрын

@@leonidgrishenkov No it's a good point that I didn't realize until people pointed it out. I personally don't use pyspark a ton. Thanks for watching.

@rahulrjb Жыл бұрын

Very PySpark-like syntax

@commonsense1019 2 жыл бұрын

Well the core of pandas can also be changed using RUST no big deal

@robmulla 2 жыл бұрын

It can. But will it?

@BillyT83 2 жыл бұрын

So... Pandas + Dask = Polars?

@robmulla 2 жыл бұрын

Kinda… but it’s really just its own thing.

@richardbennett4365 2 жыл бұрын

It is the problem with people who use pandas. They don't by and large know about polars. But why? Polars creator's fault for not promoting his product or laziness by pandas operators who just don't look for something better. Also, if one writes import polars as pd, then one doesn't need to rewrite code written for pandas. Or, one can import polars as po. I never understood why people import this package as pl. That would be for a package called plank, like the dock replacement.

@robmulla 2 жыл бұрын

Importing as pl makes the most sense to me and it’s what their docs recommend.

@whitebai6367 2 жыл бұрын

Okay, I'd like to use rust directly.

@robmulla 2 жыл бұрын

You can do it! Polars has a rust API too. Try it out and let me know what you think.

@NickWindham Жыл бұрын

Just use Julia instead of Python. Then you can do all this with speed similar to Rust in one language that has even simpler syntax than Python.

@robmulla Жыл бұрын

Oh really? I haven’t had a chance to need to use Julia but I know it’s popular to use with spark.

@cradleofrelaxation6473 2 жыл бұрын

Is it just me, the syntax is a bit more complicated than pandas whenever they differ!!

@robmulla 2 жыл бұрын

Yes. I agree, it ends up being more verbose.

@vzmaster 2 жыл бұрын

I'm running into a different problem when I try to speed up pandas (or Dask): they eat up memory really fast. In a JupyterLab environment, I load ~3-5 MB of data, use the pandas .extractall() function on a string field, and then compare the results with int fields (count of matches). In a single thread it takes several weeks to calculate. If I use multiprocessing, then comparing the results with df.loc eats up 200 GB+ of memory.

@robmulla 2 жыл бұрын

That doesn't sound right. If your data is 3-5Mb I can't imagine any sort of processing needing to take a week to calculate. I'm thinking it's probably something in your code and not an issue with pandas or dask.

@vzmaster 2 жыл бұрын

@@robmulla I actually have 2 tables, a big one and a small one. The big one has the data (~2M entries, ~150 MB); the small one has the patterns (~10k entries, ~2 MB). I need to run each pattern on all the data. That's why it may take long. But the primary problem is not speed, it's memory consumption. That's why I take small chunks of the big table, ~5-30 MB. But even with 5 MB I get a memory overflow at 200 GB. Here is the code:
import numpy as np
import pandas as pd
import sqlalchemy as sql
sql_engine=sql.create_engine('mysql+mysqlconnector://.......................')
df=pd.read_sql_query("..........................",sql_engine) #big table
patterns=pd.read_sql_query("..........................",sql_engine) #small table
# ---------------- JupyterLab block separation ----------------
def findoccurancesofpatterns(pat):
    (idx,row)=pat
    res=df.summarynormalized.str.extractall(row.pattern)
    numvaluesstats=pd.DataFrame(columns=['pattern','numvalueorder','param1','param2','param3'])
    if len(res)>0:
        numvaluecount=len(res.columns)
        res=pd.merge(res.reset_index(),df[[,'param1','param2','param3']],how='left',left_on='id',right_index=True)
        for i in range(numvaluecount):
            numvaluesstats.loc[len(numvaluesstats)]=[row.pattern,i,(res['param1']==res[i].astype('Int64')).sum(),(res['param2']==res[i].astype('Int64')).sum(),(res['param3']==res[i].astype('Int64')).sum()]
    return numvaluesstats
from multiprocessing.pool import Pool
pool = Pool(50)
allnumvaluesstats=[]
for numvaluesstats in pool.imap_unordered(findoccurancesofpatterns, patterns.iterrows()):
    allnumvaluesstats.append(numvaluesstats)

@richardbennett4365 2 жыл бұрын

What??? Polars is supposed to give the same result as pandas. Duh. Polars is a pandas replacement.

@robmulla 2 жыл бұрын

Yep

@nitinkumar29 2 жыл бұрын

I will let it mature before dealing with this.

@robmulla 2 жыл бұрын

That’s a fair approach. Adopting things too early can be problematic.

@ErikS- Жыл бұрын

Just take a huge amount of RAM. I did that also...

@robmulla Жыл бұрын

I used polars on a live stream and crashed my computer during it because it ate all my memory. There is a way to set it to limit the amount it uses I think

@richardbennett4365 2 жыл бұрын

He said 15, but he wrote 10 at 7min 05s.

@robmulla 2 жыл бұрын

Good catch!