Pandas 2.0: Everything You Need to Know

119,785 views

Rob Mulla

A day ago

In this video I give an overview of pandas 2.0 and the main changes related to the Apache Arrow backend.
Marc Garcia's Article: datapythonista.me/blog/pandas...
Timeline:
00:00 Intro
01:04 Legacy Numpy
02:49 Arrow Backend
03:44 Missing Values
04:33 Speed
05:47 Interoperability
07:42 Arrow Data Types
Check out my other videos:
Data Pipelines: Polars vs PySpark vs Pandas: • The BEST library for b...
Polars for Data Science: • Polars: The Next Big P...
Speed up Pandas Dataframes: • This INCREDIBLE trick ...
Avoid These Pandas Mistakes: • 25 Nooby Pandas Coding...
Links to my stuff:
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
::::::::::::::::::::
Music: Head Candy - William Rosati
Support by RFM - NCM: bit.ly/3jpOhJn
::::::::::::::::::::

Comments: 125
@irfanshaikh262 · a year ago
I'm not sure what I'd be without you and this YT channel, Rob. Thanks for being an amazing teacher to a rookie like myself. Lots of respect and love.
@robmulla · a year ago
Thanks for the feedback!
@JohnnyBoun · a year ago
Yeah. Except I'm no "rookie"; I'm a seasoned veteran of what was called, back in the day, a "database programmer." We carved our code into stone with flint arrowheads. Now, things are very different. No more stone carving. We've got "full stacks" now. So it took me about, oh... a couple of weeks to get any data on my console. Still can't get the "Full Monty", maybe because a 3.8 GHz Pentium and 8 GB isn't enough desktop power? Hard to believe, but hey, it's 2023. Time to face Big Data. By the way, I studied computational biology recently, and the amount of DNA sequences is known as "totally insane." Back to thanking Mr. Mulla. Can't thank you enough for getting the important imports imported in the correct order not to confuse the wonderful "full stack" too much!
@BabakFiFoo · a year ago
Rob, you are making the best videos. I am always watching them and learning new stuff. I learned Python and pandas myself, and your videos helped me improve severalfold. Thank you!
@TheMacister · a year ago
Thanks Rob! A couple of months watching your videos and your content is spot on! 🎉
@robmulla · a year ago
Hey! I appreciate the feedback, Marcos. I'll try to keep it up.
@igordeoliveirabarrosfaluhe6350 · a year ago
Thank you Rob! Your videos are always useful, with such a nice flow!
@edwardCYHsu · a year ago
I am working right now on a programming assignment about date manipulation, and it is very confusing. With the PyArrow datatype support, it will be a lot easier from now on. Thank you for highlighting the significance to us. You are a saint.
@mrdbourke · a year ago
Epic video Rob! I use pandas every day and am glad to hear it's getting faster!
@chintansawla · a year ago
Thanks for sharing the update! Very well articulated
@robmulla · a year ago
Appreciate the feedback!
@gustavojuantorena · a year ago
Great video Rob! I think Pandas will continue to be very useful in the data science community.
@robmulla · a year ago
Glad you liked the video. I agree, I don't think it's going anywhere anytime soon and 2.0 is a good move for them to adapt to newer backends.
@baldpolnareff7224 · a year ago
I agree, they used the right approach. Don't break legacy code, while allowing people to refactor it easily when necessary
@Alexr26 · a year ago
Your channel is pure gold.
@robmulla · a year ago
Thanks. I appreciate that!
@puzobaklan · a year ago
Thank you Rob! Well explained!
@robmulla · a year ago
Thanks for watching. 🙏
@Chuukwudi · a year ago
Awesome! Thanks for the update.
@robmulla · a year ago
Anytime Chuck.
@kevon217 · a year ago
Thanks for this update!
@robmulla · a year ago
Thanks for watching!
@rayankhan12 · a year ago
I particularly liked the conversion between pandas with the Arrow backend and Polars. Maybe you should make a separate video on it.
@farazahmed1668 · a year ago
Thank you so much for this information.
@robmulla · a year ago
Glad you found it helpful!
@AbrahamMendoza · a year ago
Thanks, Rob. Great video!
@robmulla · a year ago
Thanks for the comment!
@AFlockOfToasters · a year ago
Good job pronouncing Interoperability!
@michaelsoldmann7792 · a year ago
Hi Rob, thank you for the content. I'd be interested in seeing a video on using the old np.where() and np.select() in the new pandas 2.0.
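For reference, a minimal sketch of how np.where() and np.select() are typically used on a DataFrame column. The column names and thresholds are made up for illustration, and the example uses the default NumPy-backed columns; how these functions behave on Arrow-backed columns is not covered in the video and is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"score": [35, 58, 72, 91]})

# np.where: a two-way choice evaluated on the whole column at once.
df["passed"] = np.where(df["score"] >= 50, "yes", "no")

# np.select: several conditions checked in order; the first match wins.
conditions = [df["score"] >= 90, df["score"] >= 70, df["score"] >= 50]
choices = ["A", "B", "C"]
df["grade"] = np.select(conditions, choices, default="F")

print(df)
```

Both calls return plain NumPy arrays, which pandas then stores as new columns.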
@duypham6729 · a year ago
Great video, thanks Rob.
@LeveragedAlpha · a month ago
Dope video! Would be interesting to see how it compares to polars. Also, what software do you use to record your face in that circle?
@TheSoonAnn · a year ago
very informative, thanks
@robmulla · a year ago
Thanks for watching!
@incremental_failure · a year ago
I was about to switch to Polars but now with Pandas Arrow dtypes, I need to do more research. Speed is one thing but lazy processing and memory usage are very important.
@vincentverdugo · a year ago
Hey, your DS videos are awesome! I was using ChatGPT to learn more about Apache Arrow, Polars, etc from all your videos. Can you do a coding livestream or video about bioinformatics data like biological sequence data or drug development data? Thank you!
@franky12 · a year ago
I would be interested to know how much time you lose when converting from pandas to polars and back to pandas?
@robmulla · a year ago
That’s a good question that I should experiment with. The article says the only thing it actually needs to move is metadata, so I’m guessing it doesn’t take very long at all.
@omarei · a year ago
As a non coder I thought this was a joke
@dansplain2393 · a year ago
Depends on the temperature
@dev_time · a year ago
@dansplain2393 lmao
@x-axis97 · a year ago
Hey Rob! That text editor looks awesome. Which one is it?
@KenJee_ds · a year ago
Awesome! Are there any drawbacks to using the pyarrow backend?
@robmulla · a year ago
Thanks Ken! Backwards compatibility and the fact that the integration is still very new and might be buggy are two that come to mind.
@RuslanKovtun · a year ago
My very first thought was: yeah, with pyarrow as a backend, every single piece of software that relies on numpy has to be rewritten. But in the end it doesn't look like a big problem, or a problem at all.
@saitaro · a year ago
Yeah, cool thing indeed.
@berdeter · a year ago
Just one thing I don't understand. If booleans are stored as single bits (really a great idea), how come they can have 3 states (True, False, None)? Wouldn't that require 2 bits?
@wolfeygamedev1688 · 7 months ago
Ain't nothing stored as one bit in Python.
@nikjs · a year ago
This int converting into float has been a long time PITA for me. This upgrade will be much welcome.
@pietraderdetective8953 · a year ago
Ah yes, the much-needed pandas improvement! This is what I like in a competitive environment: it sharpens everyone involved. I agree with what has been said in the comment section: now we can use pandas to handle large datasets properly. That 20x speed improvement when reading data using the PyArrow backend is especially good!
@robmulla · a year ago
I agree. I'm excited to see how it takes off once officially released. The only problem is people won't be able to ask ChatGPT how to properly use pandas 2.0 since ChatGPT only goes up to 2021 😆
@katrinabryce · a year ago
In a few projects I've worked on recently, I found that switching from pandas to pure NumPy cut execution time from about 30 minutes to less than a second on my hardware with about 50 million rows of data.
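The comment does not say what was changed, so the following is only a sketch of one common pattern behind that kind of speedup: replacing a row-wise `DataFrame.apply` with arithmetic on the underlying NumPy arrays. The column names and the formula are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(10_000), "b": rng.random(10_000)})

# Row-wise apply: calls a Python function once per row (slow on large frames).
slow = df.apply(lambda row: row["a"] * 2 + row["b"], axis=1)

# Vectorized NumPy: one pass over the underlying arrays.
a = df["a"].to_numpy()
b = df["b"].to_numpy()
fast = a * 2 + b

# Both approaches produce the same numbers.
assert np.allclose(slow.to_numpy(), fast)
```

The actual gain depends on the operation and the data; timing both versions on real data is the only reliable check.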
@arjunekrishna7044 · a year ago
@robmulla ChatGPT 4 incoming
@Micro-bit · a year ago
Hey Rob, great stuff! I moved my data transformation app to pandas 2.0 and I have a problem with data conversion. When I'm converting data from int[pyarrow] to string using astype(str), I'm losing pyarrow and pandas converts it to object :/ I can't find the tool to do it properly. All the best!
@Alexander-pk1tu · a year ago
I would like to see more about aggregate performance: how it scales across CPU cores, or whether it is still single-core. All for varying sizes of DataFrames.
@robmulla · a year ago
That's a good idea. I'll try to think about how I could do that. There might be some settings for limiting the CPU use, but I also know h2o does some benchmarks across different libraries.
@EricLebigot · a year ago
Rob, do you have any limitations of Arrow to share? Until recently, for example, if I'm not mistaken, Arrow didn't seem to handle Pandas' multi indexes.
@robmulla · a year ago
That's a good question! I'm sure we will know a better answer once people start using it more. I forgot about the MultiIndex limitation! I need to see how that works in pandas 2.0.
@murphygreen8484 · a year ago
Can you make a video giving techniques to update pandas (or polars) columns using vectorization instead of .apply() for more complicated custom function calls? E.g. I want to take a str column, split it on a delimiter, do work on each section, and then combine them back and return it as the new value for the same column.
@robmulla · a year ago
I actually have a short about this topic - using the str methods on string columns in pandas: kzbin.info5eYTaYHzoEE
@incremental_failure · a year ago
What's the issue? You just return the new value in the applied function. You will need to use Python functions to split and join, though.
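As an alternative to `.apply()`, the split / transform / recombine workflow asked about in this thread can be sketched with the pandas `.str` accessor. The column name, delimiter, and per-section operations below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"code": ["ab-cd-ef", "gh-ij-kl"]})

# Split each string on the delimiter into separate columns 0, 1, 2.
parts = df["code"].str.split("-", expand=True)

# Do different work on each section using column-wise .str methods.
parts[0] = parts[0].str.upper()   # first section: upper-case
parts[1] = parts[1].str[::-1]     # second section: reversed
parts[2] = parts[2] + "!"         # third section: suffix added

# Join the sections back together and overwrite the original column.
df["code"] = parts[0].str.cat([parts[1], parts[2]], sep="-")
print(df["code"].tolist())
```

For object-dtype strings the `.str` methods still loop over the values internally, so the gain over `.apply()` may be modest; the main benefit is that each step is expressed per column rather than per row.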
@haierpad5669 · a year ago
Do you think this can have any "consequences" for the development of NumPy? I use pandas for data management but also use NumPy and SciPy, and the way you can work with all of them together has been very handy until now. I don't know if Arrow can make the job as easy. Also, can this make things more difficult for plotting with matplotlib and seaborn, for example? P.S. Thanks for the video and also for the previous 1h live stream. in-ter-oper-a-bi-li-ty :D
@robmulla · a year ago
Haha. I pronounced interoperability in this video 😂. I don't think NumPy is going anywhere. When working with non-tabular types of data, NumPy is strong. Also, libraries like PyTorch and TensorFlow allow you to convert between tensors and NumPy arrays, and those are very popular.
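On the question of handing pandas data to NumPy-based libraries: `Series.to_numpy()` converts a column with a nullable dtype into a plain array. The sketch below uses the NumPy-backed nullable `Int64` dtype as a stand-in so that it runs without pyarrow; whether the same call is equally cheap on Arrow-backed columns is not verified here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")  # nullable extension dtype

# Convert to a plain float array, mapping missing values to NaN,
# the form that NumPy and SciPy routines generally expect.
arr = s.to_numpy(dtype="float64", na_value=np.nan)

print(arr.dtype)
print(np.nanmean(arr))
```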
@Erosis · a year ago
When you swap between pandas and polars, is Python creating a copy of the dataframe in memory, or is it just referencing the already allocated block of memory with the pyarrow backend?
@robmulla · a year ago
Great question. It's a little complicated and for things like metadata there is a copy made, but my understanding is that the underlying data is not copied. The article I've linked in the description goes into a lot more detail about it and I'd suggest checking that out.
@Erosis · a year ago
@robmulla Awesome! Thank you so much! I've been avoiding pandas for large datasets, but this looks like it will make me a more dedicated user!
@robmulla · a year ago
@Erosis Exactly! I think that's the idea.
@victord8866 · a year ago
When you imported Polars, I thought you were going to do a versus on speed between Polars and Pandas with the new backend. But still very informative video, thanks so much!
@robmulla · a year ago
Maybe I can do that in the next video. I did test it out in my live stream video from yesterday. For speed generally: Polars > Pandas 2.0 > Pandas 1.5
@incremental_failure · a year ago
@robmulla Seeing the same here, Polars is still much faster than pandas 2.0.
@econhelp583 · a year ago
Thanks very much Rob! Your content is off the charts good, a true outlier, and a great outlier! To get my grad degree in stats, I had to take a PhD-level course in statistical consulting. The prof for that course actually wrote a whole book solely on the topic of outliers, outliers can be extremely important. To me, data science is a subject of stats, similar to the way experimental design or survey sampling are subjects of statistics. I think it is unfortunate that data science is not widely seen as belonging to “statistics” since it makes it harder (at least for me) to use data science tools when teaching stats, as the expectation is to use an old school approach to descriptive and inferential statistics in the classroom. I think your content has the clarity and insight to convince teachers that data science methods MUST be brought into the standard curriculum, e.g. courses like Introductory Business Statistics. Thanks again for posting truly remarkable content!
@tahaknk1485 · a year ago
Now that the beta is out, is there any estimate for the release? By the way, thanks for the video!
@robmulla · a year ago
I only know what has been publicly said by the core dev team. Hope to know soon.
@dmail44 · a year ago
There is an Int64 (capital I) which stores ints with nulls.
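A short illustration of the difference, using only pandas: with the default NumPy backend a missing value turns an integer column into float64, while the nullable `Int64` dtype keeps it integer.

```python
import pandas as pd

# Default NumPy backend: a missing value forces the integers to float64.
numpy_backed = pd.Series([1, 2, None])
print(numpy_backed.dtype)

# Nullable extension dtype (note the capital "I"): stays integer,
# and the missing entry is shown as <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable)
```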
@sitrakaforler8696 · a year ago
Pandas is life. Even if Polars is cool.
@robmulla · a year ago
Yes it is! 😆
@yveslaporte5808 · a year ago
One day, I learned that Python was slow for large data, so I started to study Julia, whose reputation is "the fastest". One programmer I talked with said to me, "No, Python is fast!" And maybe the new pandas helps with speed. How should I take it, from your point of view?
@robmulla · a year ago
I think it’s mostly about choosing the correct tool for the job. But this in general is moving pandas towards being a faster library.
@random-drops · a year ago
Wow! I'm still struggling to pronounce "Interoperability" and it came out of your mouth so fluidly. I bet you are going to release a video comparing pandas 2.0 vs Polars. Is NumPy still in pandas 2.0 for backward compatibility?
@robmulla · a year ago
Wait... did I actually say "Interoperability" correctly? I filmed myself saying it a bunch of times and just kept the one that sounded the best :D To answer your question: yes, pandas has backwards compatibility. As I show in the video, by default it still uses the NumPy backend.
@LeoAr37 · a year ago
Is the PyArrow backend not gonna be the default? It would be annoying to have to specify the dtype for every table I have.
@robmulla · a year ago
With a library as established as pandas, I don’t think they want to implement breaking changes, so having it as an option, at least in early releases I think is the right choice.
@AxDhan · a year ago
And what about loading big DataFrames? Will it improve?
@ys98110 · a year ago
Wow. First time seeing these data types. When would you not use pyarrow and just use default python or something else?
@robmulla · a year ago
I think eventually it will become adopted by most users, but pandas doesn’t want to make major changes. It also might be some time before it’s stable enough to trust in production code.
@MrOnePieceRuffy · a year ago
It's a very good video and this library is a good improvement. However, nobody stops you from using a single int and storing 32 different boolean states in it with bit-shifting operations. Using an interpreted scripting language means, from the beginning, "I trade resources for convenience." The underlying engine of Python is C, and in C there is no actual boolean datatype; the smallest unit is the smallest unit the operating system can provide, which is 1 byte / 8 bits, and which the operating system itself can only provide to you by reserving a whole virtual memory page for you. I just think that nitpicking about 7 wasted bits as a Python programmer is a little bit awkward. The rest was great, thanks for the video.
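The bit-shifting idea mentioned in this comment can be sketched in plain Python. The helper names `set_flag` and `get_flag` are made up for illustration; Python ints are arbitrary precision, so the limit of 32 flags is a convention here rather than something the language enforces.

```python
def set_flag(bits: int, index: int, value: bool) -> int:
    """Return `bits` with the flag at position `index` set to `value`."""
    if value:
        return bits | (1 << index)   # turn the bit on
    return bits & ~(1 << index)      # turn the bit off


def get_flag(bits: int, index: int) -> bool:
    """Read the flag stored at position `index`."""
    return bool((bits >> index) & 1)


flags = 0
flags = set_flag(flags, 0, True)
flags = set_flag(flags, 5, True)
flags = set_flag(flags, 0, False)

print(bin(flags))
print(get_flag(flags, 5), get_flag(flags, 0))
```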
@kerimsever6674 · a year ago
I get "unknown engine: pyarrow" but I have pandas 2.0 and PyArrow 10.0.1. Am I missing something?
@kliti09 · 9 months ago
pandas 2.0 with pyarrow backend VS pyspark dataframes?
@onedori1 · a year ago
Can you make a tutorial on how to use the new upgraded MediaPipe object detection in python with live stream footage?
@robmulla · a year ago
Great idea. I can try! Too many good ideas and not enough time.
@onedori1 · a year ago
@robmulla Nice to hear! If you make this work, you'll make my whole month. I've been struggling with this one since the upgrade came out.
@bagavathirajanramaraj7390 · a year ago
When should I be using pandas 2.0 vs Polars?
@robmulla · a year ago
Check out my video on data pipelines comparing them and pyspark!
@MattRose30000 · a year ago
Wait, how can Arrow store booleans as one bit each AND allow None values?
@robmulla · a year ago
Good catch! I think the bool values use a single bit but every column has a null array associated with it. But I also don't know that for sure. That would mean that a nullable bool would be more than one bit.
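To make that idea concrete: as far as I understand the Arrow columnar format, a boolean column stores its values bit-packed in one buffer and keeps a separate validity bitmap with one bit per value marking nulls. The sketch below mimics that layout with NumPy's `packbits`; it illustrates the concept and does not use Arrow's actual implementation.

```python
import numpy as np

# Hypothetical column with three states: True, False, and missing (None).
values = [True, None, False, True, None, True, False, False, True, True]

# Validity bitmap: 1 where a value is present, 0 where it is null.
validity = np.array([v is not None for v in values], dtype=np.uint8)
# Data bitmap: the boolean itself (null slots hold an arbitrary bit, 0 here).
data = np.array([bool(v) for v in values], dtype=np.uint8)

# Pack 8 flags per byte, least-significant bit first.
validity_bytes = np.packbits(validity, bitorder="little")
data_bytes = np.packbits(data, bitorder="little")

total = data_bytes.nbytes + validity_bytes.nbytes
print(len(values), "values ->", total, "bytes")
```

Under this layout a nullable boolean costs about two bits per value: one for the data and one for validity.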
@rito_ghosh · a year ago
Do you have any affiliations to Polars?
@robmulla · a year ago
Nope. I’m just me.
@TylerMacClane · a year ago
Hi Rob, my second comment. I'm glad :)
@robmulla · a year ago
Nice! The first commenter was able to see it before I even made the video public :D so let's call it a tie.
@sujeetomar5843 · a year ago
What is the purpose of use_nullable_dtype=True?
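Roughly, the option asks the reader to return nullable extension dtypes (such as `Int64`) instead of the default NumPy dtypes, so a gap in an integer column no longer turns it into floats. The sketch below assumes pandas 2.0 or later, where, as far as I know, the reader keyword is spelled `dtype_backend` rather than `use_nullable_dtypes`; the CSV content is made up.

```python
import io
import pandas as pd

csv = io.StringIO("id,name\n1,ann\n,bob\n3,\n")

# Default: the missing id turns the column into float64.
default = pd.read_csv(csv)

csv.seek(0)
# Nullable dtypes: id stays an integer column with <NA> for the gap.
nullable = pd.read_csv(csv, dtype_backend="numpy_nullable")

print(default.dtypes)
print(nullable.dtypes)
```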
@googleyoutubechannel8554 · 6 months ago
Hmm, Arrow seems great, but I had no idea something as fundamental as pandas and NumPy was so janky, with really, really sketchy quirks. Now I'm wondering what the speedup would be if you tried these operations with a system that wasn't using at least some scripting-language interpreter calls for raw data shuffling. Like just a normal struct.
@smellypunks · a year ago
This sorts out two big issues: non-nullable ints and the object dtype. Both can be frustrating.
@prashlovessamosa · a year ago
Hey Rob, this video is not showing up and I did not even get a notification. Please re-upload it. The only way I got access to this video was by going to your playlist.
@robmulla · a year ago
Hey! You aren't supposed to see this yet :D - I have it as unlisted and am surprised you could even find it!
@prashlovessamosa · a year ago
@robmulla Hey Rob, I actually haven't updated my YouTube app; I'm using a one-year-old version. I always get 5-second ads and I can also download videos in 1080p. I don't know how that's possible, but it works for me. Sometimes I am even able to see private videos; those YouTubers also always ask me how I am able to play their private videos. Conclusion: Google is stupid.
@arturjaroszewicz8424 · a year ago
First time watcher feedback: Great video! Clickbait-y title, could at least have a more positive spin :)
@robmulla · a year ago
Thanks for the feedback. Unfortunately the YouTube algorithm favors videos with higher click-through rates. I'm experimenting with more enticing titles, although I agree with you; I wish I didn't have to.
@joelluth6384 · a year ago
I wonder how much of my code this will break. Upgrading sqlalchemy to v2 was like sticking a grenade in my computer.
@robmulla · a year ago
It's supposed to be backwards compatible. You won't know for sure until you try.
@Kaassap · a year ago
I'd be living in the dark if it wasn't for this video 🤣
@tigerbojiteol · a year ago
in·tr·aa·puh·ruh·bi·luh·tee. Remember that Rob, 😉
@robmulla · a year ago
Oh no. Did I say it wrong 😑
@tigerbojiteol · a year ago
@robmulla Not at all. You nailed it 👍
@mlengineer8564 · 8 months ago
pd.options.mode.dtype_backend = 'pyarrow': this option is dead now in pandas 2.0.3. It must have only been part of the release candidate.
@solomon5888 · a year ago
Do you think 90% of data analysts will be eliminated due to GPT-4 and further integration of machine learning in MS 365?
@robmulla · a year ago
Absolutely not.
@zeus7914 · a year ago
lol. funny stuff.
@hannessteffenhagen61 · a year ago
I think with all this talk about pandas and polars people these days often neglect to even consider grizzlies.
@robmulla · a year ago
I can't BARE it! 😄
@fahnub · a year ago
Apache Arrow is so much better.
@robmulla · a year ago
Oh yea? In what way? Do you mean how it’s integrated into pandas 2.0?
@fahnub · a year ago
@robmulla yessir 💯
@comment8767 · a year ago
Background music is slightly annoying. Your normal voice is pleasant, so no need to supplement it.