Speed Up Your Pandas Dataframes

Views: 67,236

A day ago

In this video Rob Mulla teaches how to make your pandas dataframes more efficient by casting dtypes correctly. This will make your code faster and use less memory, and your data will be smaller when saved to disk or a database.
Timeline:
00:00 Intro
00:47 Imports and Data Creation
02:32 Dataframe Memory Use
03:20 Baseline Speed Test
04:15 Casting Categorical
05:45 Downcasting Ints
07:07 Downcasting Floats
08:15 Casting Bool Types
09:15 Benchmark Comparison
11:08 Outro
Thanks for taking the time to watch this video. Follow me on Twitch for live coding streams: / medallionstallion_
Speed up Pandas Code: • Make Your Pandas Code ...
Intro to Pandas video: • A Gentle Introduction ...
Exploratory Data Analysis Video: • Exploratory Data Analy...
* YouTube: youtube.com/@robmulla?sub_con...
* Discord: / discord
* Twitch: / medallionstallion_
* Twitter: / rob_mulla
* Kaggle: www.kaggle.com/robikscube
#python #code #datascience #pandas

Comments: 133

@anoopbhagat13 2 years ago

That's a brilliant way to save memory & computational cost. Thanks Rob ! it was very useful.👍

@robmulla 2 years ago

Exactly! Casting the correct column types is very important to speeding up your code.

@shreyaskulkarni5823 1 year ago

I really predict that this guy's channel is gonna grow a lot. The content is pure, without any bs, and straight to the point, with actually new info.

@robmulla 1 year ago

Thanks for the feedback. I hope you’re right.

@gilzeevi9263 1 year ago

Listen Rob, I came across your channel pretty randomly and your content is pure gold! Straight to the point and professionally presented! Thanks a lot! Keep rocking

@Cmax15 1 year ago

Wow, this is one of the things that we rarely encounter in courses and yet the impact matters so much for overall efficiency. Thank you for making these types of videos. Wish this channel the best!

@robmulla 1 year ago

Thanks for the positive feedback. Totally agree that some of these specific details might not be covered in school but are great at speeding up your code and pipelines!

@dirk-jantoot1029 1 year ago

@@robmulla It depends very much on the amount of data. Essentially you are making a trade-off between faster, more efficient code and development time. If you work with relatively small amounts of metadata and quickly want to get something done, this might be too much of a hassle. But if your code goes to production and has to go through massive amounts of data, then it's certainly worth it.

@robmulla 1 year ago

@@dirk-jantoot1029 Good point. There is always a tradeoff between time spent on implementation and code speed, but it's still best practice to properly set column dtypes.

@shrekinahell 2 years ago

this is incredibly useful information and explained nicely! subbed

@robmulla 2 years ago

Thanks for the feedback

@FilippoGronchi 2 years ago

I love all the videos with these tricks that are critical in the daily developer activities! thanks so much

@robmulla 2 years ago

Thanks Filippo

@clibonthegrind1105 1 year ago

Very nice videos mate! Thanks for sharing your knowledge with us

@robmulla 1 year ago

Glad you like them!

@edmundgoldsberry 1 year ago

Dude, where were you when I was starting out... you would have saved me hours of struggle. Great content. Please keep it coming.

@robmulla 1 year ago

So glad I could help you out Edmund! Share the channel with anyone else you think might also find it helpful.

@gigiosbar 10 months ago

That's brilliant! Thanks, Rob!

@mohamed0h0hamed 2 years ago

Thx Rob, really enjoyed this episode 👍🏼

@robmulla 2 years ago

Thanks!

@rakeshkumarkuwar6053 2 years ago

Thanks Rob, for sharing such a great concept.

@robmulla 2 years ago

Thanks for watching Rakesh.

@loctranp 2 years ago

It's just wow! Beginners like me really appreciate your video. Keep it up, my man.

@robmulla 2 years ago

Glad it was helpful to you. Thanks for the feedback.

@robmulla 2 years ago

This note in the docs goes into detail about how categorical values only reduce memory use when the number of unique values is low: pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-memory

@oommggdude 5 months ago

What would you recommend is a good formula for determining when it should be categorical vs simple string? Unique values < 50% df length?
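Rather than relying on a fixed cutoff, one option is to measure the column directly. The sketch below is only an illustration: the series `low` and `high` and the helper `category_ratio` are made-up names, not code from the video. It compares the memory of a string column before and after casting to `category`, once with 4 distinct values and once with every value distinct.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Low cardinality: 4 distinct strings repeated many times.
low = pd.Series(rng.choice(["red", "blue", "yellow", "green"], size=n), dtype=object)
# High cardinality: every value is a distinct string.
high = pd.Series([f"id_{i}" for i in range(n)], dtype=object)


def category_ratio(s: pd.Series) -> float:
    """Memory of the categorical version divided by memory of the original."""
    before = s.memory_usage(deep=True)
    after = s.astype("category").memory_usage(deep=True)
    return after / before


low_ratio = category_ratio(low)
high_ratio = category_ratio(high)
print(f"low cardinality ratio:  {low_ratio:.3f}")
print(f"high cardinality ratio: {high_ratio:.3f}")
```

A ratio well below 1 suggests the cast pays off; a ratio near or above 1 suggests keeping the plain strings. Running this check on a sample of your own column is likely more reliable than a rule such as "unique values < 50% of the length".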

@SimpleExcelVBA 1 year ago

I'm surprised how much info I learnt from this video, really good work!

@robmulla 1 year ago

Glad it was helpful! Check out my other videos and share it with some friends!

@SimpleExcelVBA 1 year ago

@@robmulla I will for sure! :)

@TheRecordedLife 2 years ago

Fantastic Video, definitely using this in my daily work.

@robmulla 2 years ago

Glad you like it! I hope to continue to create more videos like this in the future.

@wilmermorales 2 years ago

this is life changing

@deepsajwan 1 year ago

Thanks Rob! for making this useful video

@robmulla 1 year ago

Glad you found it useful. Thanks for watching.

@alfredoch3811 1 year ago

Nice vid! Glad I ran into your channel...subscribed!

@robmulla 1 year ago

Thanks so much for the feedback! Hope you enjoy the other videos too.

@gatorpika 1 year ago

Great explanation. Just found out about that a while ago and wish I would have seen this video first instead of doing a bunch of googling. Was trying to get a 36M record dataset with categoricals and positional data to fit in memory on an average laptop for a mapping application. Recasting the datatypes made it all work out.

@robmulla 1 year ago

Glad you enjoyed the video! Casting dtypes correctly is really helpful but easy to overlook

@anpham7108 1 year ago

Thank you for making this video.

@robmulla 1 year ago

My pleasure! Thanks for watching.

@maxwellarnold570 1 year ago

I deal with stock data for my job a lot- dealing with data frames that have daily data for 3000 companies across 20+ years means dealing with 16+ million rows. These tips are incredibly helpful for saving memory- which for my role is often the limiting factor of pandas and my computer. Too much memory load can slow down groupby calls, your computer as a whole and all code, and even worse crash your computer which has happened to me.

@robmulla 1 year ago

Glad this was helpful for you. Sounds like you are working with a lot of data, must be fun!

@TedMan55 3 months ago

Working with 24 million rows of TMY weather data and I totally feel your pain

@mahfujulalamanik7308 1 year ago

After watching 70% length of this video, i just stopped the video and mashed the like button. 🔥

@robmulla 1 year ago

What took you so long. 😂

@rajeshkakawat97 1 year ago

Thanks, it's useful. I will apply it in my project.

@robmulla 1 year ago

Glad to hear you found it useful Rajesh!

@bongkem2723 11 months ago

awesome man, gonna save me $$$$ by not upgrading cpu but downsizing code !!!

@robmulla 11 months ago

Glad I could help! Efficient code is important!

@ShiNguyenchu 3 months ago

helpful video !

@danielandarge6652 9 days ago

Perfect !

@brendenmorley2643 1 year ago

Thanks!

@robmulla 1 year ago

Appreciate the super thanks!!

@skarevaara 1 year ago

Very nice video, thanks for the tips! If you do an update you could talk about unsigned integers also, like "uint8" for the Age data?

@robmulla 1 year ago

Great suggestion! Thanks for the feedback.
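To illustrate the unsigned-integer suggestion, here is a small sketch. The `ages` series is made-up sample data, not the dataset from the video. Ages are never negative, so `uint8` (0 to 255) covers them in one byte per value, the same size as `int8` (-128 to 127) but with twice the positive range.

```python
import numpy as np
import pandas as pd

# Made-up ages between 1 and 49; the default integer dtype here is int64.
ages = pd.Series(np.random.default_rng(0).integers(1, 50, size=1_000))

# Check the representable range before casting, to avoid silent overflow.
print(np.iinfo("int8").min, np.iinfo("int8").max)
print(np.iinfo("uint8").min, np.iinfo("uint8").max)

# Explicit cast to an unsigned 8-bit integer.
as_uint8 = ages.astype("uint8")

# Alternatively, let pandas pick the smallest unsigned type that fits.
auto = pd.to_numeric(ages, downcast="unsigned")

print(ages.nbytes, as_uint8.nbytes, auto.dtype)
```

The explicit `astype` documents the intended type; `pd.to_numeric(..., downcast="unsigned")` is handy when the range is not known in advance.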

@obayram4615 1 year ago

Good explanation, thank you very much 👋👍🙂

@robmulla 1 year ago

Glad it was helpful! Thanks for watching.

@mehdismaeili3743 2 years ago

Hi, thanks for this video.

@robmulla 2 years ago

Thanks for watching. Hope you found it helpful.

@gangxaaku 2 years ago

awesome!!

@robmulla 2 years ago

Thanks Akshat

@cradleofrelaxation6473 5 months ago

This guy is a legend. He started by using “size” in the random function which he didn’t define. Please did you have the value of “size” in memory before you started recording?

@Omer698 2 months ago

I was about to ask this same question

@myself4024 3 months ago

🎯 Key Takeaways for quick navigation:
00:00 📊 *Efficient Memory Use in Pandas Introduction*
- Importance of efficient memory use in pandas for code speed, reduced memory consumption, and storage efficiency.
02:44 📉 *Initial Data Size and Considerations*
- Creating a large dataset (1 million rows) and checking its memory usage.
- Highlighting the impact of increasing data size on performance and memory requirements.
05:27 🧹 *Optimizing Categorical Columns*
- Demonstrating memory reduction by casting categorical columns (position and team).
- Significant reduction in dataset size by utilizing categorical data types.
06:44 🎲 *Downcasting Integer Columns*
- Explaining downcasting of integer columns to smaller types for memory optimization.
- Choosing appropriate integer types based on the data range to avoid information loss.
07:40 📉 *Downcasting Float Columns*
- Downcasting float columns to reduce memory usage while maintaining precision.
- Highlighting the impact of float type selection on data frame size.
09:02 🧹 *Optimizing Boolean Columns*
- Efficiently casting boolean columns for minimal memory usage.
- Using boolean type for binary data representation (win column).
10:23 🔄 *Performance Comparison*
- Comparing computation times before and after applying dtype optimizations.
- Demonstrating the overall improvement in code performance and memory efficiency.
Made with HARPA AI
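The steps summarized above can be sketched end to end as follows. This is a minimal illustration on made-up data, not the code from the video: the column names and values are assumptions chosen to resemble the summary, and the target dtypes (`int8`, `float32`) only make sense for ranges and precision like these.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Made-up dataset with one column per dtype discussed in the video.
df = pd.DataFrame({
    "team": pd.Series(rng.choice(["red", "blue", "yellow", "green"], size=n), dtype=object),
    "age": rng.integers(1, 50, size=n),      # int64 by default
    "prob": rng.uniform(0, 1, size=n),       # float64 by default
    "win": pd.Series(rng.choice(["yes", "no"], size=n), dtype=object),
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
small["team"] = small["team"].astype("category")               # few unique strings
small["age"] = small["age"].astype("int8")                     # values fit in -128..127
small["prob"] = small["prob"].astype("float32")                # accepts reduced precision
small["win"] = small["win"].map({"yes": True, "no": False})    # yes/no strings to bool
after = small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
print(small.dtypes)
```

Each cast trades generality for size, so it is worth confirming the value range (for integers) and the precision you need (for floats) before applying it to real data.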

@HAUPTSCHUELER99 1 year ago

Thanks

@robmulla 1 year ago

Welcome

@jsp2518 2 years ago

Sorry, could you tell me how the table had 1000 rows the first time you made it? I'm asking because it says size = size, but I don't get where the size of 1000 comes from. Awesome vids btw!

@robmulla 2 years ago

Great question. I think someone else pointed this out. I think I might have edited out that part of the video but I did end up editing that function to take in the size = 1000 at some point. Check the gist I posted here: gist.github.com/RobMulla/f04b144bb766b692f9314e3782d724d3

@jeffgraham1389 10 months ago

3x49x4x2=1176. Glad this comment was in here. Drove me a little nuts not knowing where 1000 came from.
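For anyone hitting the undefined `size` problem, one way to avoid it is to pass the row count into the data-creation function as a parameter. The sketch below is a hypothetical reconstruction, not the code from the gist: the function name `get_dataset`, the `seed` argument, and the column names and values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def get_dataset(size: int, seed: int = 0) -> pd.DataFrame:
    """Build a fake dataset with `size` rows (hypothetical columns)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "position": rng.choice(["left", "middle", "right"], size=size),
        "age": rng.integers(1, 50, size=size),  # 1 to 49 inclusive
        "team": rng.choice(["red", "blue", "yellow", "green"], size=size),
        "win": rng.choice(["yes", "no"], size=size),
        "prob": rng.uniform(0, 1, size=size),
    })


df = get_dataset(size=1_000)
print(df.shape)
```

Because `size` is an argument, calling `get_dataset(size=1_000_000)` later scales the same code up without relying on a variable defined elsewhere in the notebook.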

@John5ive 24 days ago

Thanks. Stuff a newb, i.e. me, would not think about.

@rockwellshabani5180 2 years ago

Excellent video. Would converting the 'yes/no' to 1 or 0 save as much space as converting them to bool?

@robmulla 2 years ago

Thanks for watching. I believe any int will always take up more memory than a bool. That is because a bool only uses one bit. int8, int16, etc use 8, 16 bits. A bool is essentially an int1

@juliansteden2980 2 years ago

@@robmulla This is not completely true. In theory a bool needs only 1bit (true/false, 0/1). In practice CPUs can't address anything smaller than a byte, therefore a bool usually needs 1byte (8bit) of memory just like an int8. Nevertheless great video, thanks!

@robmulla 2 years ago

@@juliansteden2980 Thanks for clarifying! I stand corrected, that's interesting to know but totally makes sense.
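The point about bools taking one byte can be checked directly. The sketch below uses made-up flag data (the variable names are illustrative) and compares the bytes used by `bool`, `int8`, and `int64` columns of the same length; the last step uses NumPy's `packbits` to show what true one-bit-per-flag storage would look like, which is not how pandas stores a `bool` column.

```python
import numpy as np
import pandas as pd

n = 1_000
flags_bool = pd.Series(np.zeros(n, dtype=bool))
flags_int8 = flags_bool.astype("int8")
flags_int64 = flags_bool.astype("int64")

# Bytes used by the underlying values (index excluded).
print(flags_bool.nbytes, flags_int8.nbytes, flags_int64.nbytes)

# Packing 8 flags into each byte is possible in NumPy, but it is a
# separate, manual representation rather than a pandas column dtype.
packed = np.packbits(flags_bool.to_numpy())
print(packed.nbytes)
```

So for memory purposes a yes/no column stored as 1/0 in `int8` is the same size as `bool`; the `bool` dtype mainly makes the intent clearer and enables boolean indexing directly.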

@vrbaac1641 1 year ago

very nice video ^^ just a question... will this help with the browser error "not enough memory" when doing EDA via Jupyter notebook? thanks ^^

@robmulla 1 year ago

Thanks for the comment. It’s probably the cause of the error if you are running out of memory.

@FabioRBelotto 1 year ago

Category columns are great, but it's important to set observed=True when doing a groupby.

@robmulla 1 year ago

Whoa! I didn’t know about that option. Need to try it next time. Actually would’ve been helpful on yesterday’s stream.
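Here is a small sketch of what the `observed` option changes. The data is made up: the `team` column declares a category ("green") that never appears in the rows. With `observed=False` the groupby result includes a row for every declared category; with `observed=True` it only includes categories that actually occur.

```python
import pandas as pd

df = pd.DataFrame({
    # "green" is a declared category with no rows in the data.
    "team": pd.Categorical(["red", "blue", "red"], categories=["red", "blue", "green"]),
    "score": [1, 2, 3],
})

all_groups = df.groupby("team", observed=False)["score"].sum()
seen_groups = df.groupby("team", observed=True)["score"].sum()

print(all_groups)
print(seen_groups)
```

When grouping by several categorical columns, the unobserved combinations can multiply quickly, which is where passing `observed=True` tends to matter most. Passing the argument explicitly also avoids depending on the default, which has differed between pandas versions.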

@bhavinmoriya9216 2 years ago

Awesome! I saw in a video by Matt Harrison that there is a library which sort of tells you which dtypes need to be converted to save memory. Unfortunately, I do not remember the exact video. Are you aware of any such library?

@robmulla 2 years ago

Thanks! I don’t know of that library but let me know if you find it.

@AmexL 1 year ago

Why did you use the ‘map’ method over the ‘astype’ when changing the yes/no strings to bool? Thanks again for this vid.

@robmulla 1 year ago

I think I did it that way because you need to define how to convert the strings to a bool. astype won’t automatically know how to convert those strings unless they were “true” or “false”.
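A quick sketch of the difference, on a made-up `win` series stored with object dtype. In that case `astype(bool)` follows Python truthiness, so every non-empty string, including "no", becomes `True`; an explicit mapping (or a comparison) says which string means what.

```python
import pandas as pd

win = pd.Series(["yes", "no", "yes"], dtype=object)

# Truthiness-based cast: any non-empty string is True.
naive = win.astype(bool)

# Explicit mapping from each string to the intended boolean.
mapped = win.map({"yes": True, "no": False})

# Equivalent one-liner when there are only two possible values.
compared = win == "yes"

print(naive.tolist())
print(mapped.tolist())
print(compared.tolist())
```

One design note: `map` turns any value missing from the dictionary into NaN, which can be a useful signal that the column contains unexpected strings, whereas the comparison silently treats them as `False`.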

@andyr8833 1 year ago

Hi Rob, very useful thank you, but how do you deal with the following situation: you have a large MongoDB collection that you want to use locally to develop functions and play with the data. If you import it as a pandas data frame, it is just too large for the PC to handle. What's the best practice in this case? Worth a video tutorial? Thank you

@robmulla 1 year ago

Thanks. Can you aggregate the data in some way before exploring it locally? You could also just get a really large ec2 instance to run it on :D - another option would be something like dask. I have a video about pandas alternatives you should check out.

@user-ld5dn3fv4m 1 year ago

if you use the astype and round operators on float data, pandas needs to set the signature or leave float64 by default ?

@robmulla 1 year ago

Not sure I totally understand the question. But float precision depends on how precise you need the values to be.

@prathamsingh7033 2 years ago

Does saving it to Parquet and then reading it back retain the dtypes? (Don't have a PC with pyarrow installed near me.)

@robmulla 2 years ago

Great question. Yes it does!

@prathamsingh7033 2 years ago

@@robmulla Great. Another reason to use parquet!

@dariuszspiewak5624 1 year ago

I know there's a Python module that takes a dataframe and calculates what type transformations one could do on the columns to reduce the size of the frame (it's pretty neat). Can't remember the name of the package now, though... I watched a YT video on it just about yesterday.

@robmulla 1 year ago

Cool! Let me know if you find it. There is a function that I’ve used before from Kaggle that does it.

@mikele5355 1 year ago

Hey! I've recently been really enjoying watching your videos. Could you maybe create a video in which you explain how I can run my Python scripts automatically online, so that I don't always have to do this manually and with my computer switched on? I am getting a little bit more advanced through your videos and I'd be super interested in this topic. Cheers! :)

@robmulla 1 year ago

Thanks for watching. That’s a great idea for a video. I think how you would automate it would differ a lot depending on what you were running: a small program vs. a really computationally intensive process.

@mikele5355 1 year ago

@@robmulla Sounds amazing! Thanks for appreciating the idea :)

@ChimeraGilbert 1 year ago

Can you use this astype method if the column contains missing data?

@robmulla 1 year ago

Good question. It depends. Int or bool columns can’t contain null. Floats can.
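A short sketch of this on made-up data. The plain NumPy-backed integer dtypes cannot hold missing values, so casting a column with NaN to `int8` raises an error. As an addition to the answer above, pandas also provides nullable extension dtypes (written with a capital letter, such as `"Int8"`, and `"boolean"`) that store a separate missing-value mask.

```python
import pandas as pd

s = pd.Series([1, 2, None])   # the None forces float64 with NaN
print(s.dtype)

# Casting to a NumPy integer dtype fails while a missing value is present.
try:
    s.astype("int8")
    cast_failed = False
except ValueError:
    cast_failed = True
print("plain int8 cast failed:", cast_failed)

# The nullable extension dtype keeps the missing value as <NA>.
nullable_int = s.astype("Int8")
nullable_bool = pd.Series([True, False, None], dtype="boolean")

print(nullable_int)
print(nullable_bool)
```

The nullable dtypes use a little extra memory for the mask, so the saving compared with `float64` is smaller than with a plain `int8`, but usually still substantial.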

@pierrebernard142 1 year ago

Question : what is the time complexity of all those cast operations ?

@robmulla 1 year ago

That’s a great question. I don’t know exactly but i haven’t ever come across a time when the cast operation has been an issue.

@pierrebernard142 1 year ago

@@robmulla thx :)

@luismontero3416 1 year ago

👏👏👏👏👏

@robmulla 1 year ago

💪

@Pedro_Israel 1 year ago

Isn't there a library or function to do this? Or at least some of the steps?

@Pedro_Israel 1 year ago

For example, I made this code which could help anyone. But I bet there are even better options out there:
# 6) Change DTYPES
# if column dtype == float & has no values after the decimal point = change dtype to int:
for col in data1.columns:
    if data1[col].dtype == 'float64':
        if data1[col].astype(int).equals(data1[col]):
            data1[col] = data1[col].astype(int)
        # Else: try to reduce float to float 8,16,32,64:
        else:
            if data1[col].min() >= -128 and data1[col].max() = -32768 and data1[col].max() = -2147483648 and data1[col].max() = -128 and data1[col].max() = -32768 and data1[col].max() = -2147483648 and data1[col].max()

@robmulla 1 year ago

I haven't used it myself but you could also look into this: pypi.org/project/pandas-dtype-efficiency/ Understanding this concept is important beyond just pandas dataframes, but I agree it could be somewhat automated.

@robmulla 1 year ago

This function is good (I've used it on Kaggle before), but you just need to be sure you are OK with casting the datatypes automatically. For instance, if you expect new data to be added that could be larger or more precise, or if you would not like to automatically cast to categorical, then you might not want to do it automatically.

@Pedro_Israel 1 year ago

@@robmulla Sorry for my late response. Thank you for the answer! Yes, that package automates a big part of the process. I would add some functionality to make it even more customizable, but it's great for anyone who wants to check it out. I also agree with you that it's important to understand the concept. I needed an automation because after your video I use this frequently on large datasets.

@Pedro_Israel 1 year ago

@@robmulla Yes, you are right. I usually work with past data that will not be updated in the future so the function comes in handy. But of course, if new data will be added one should cast dtypes carefully.
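As a sketch of how the automation discussed in this thread could look, the function below leans on `pd.to_numeric` with its `downcast` argument instead of hand-written range checks. It is not the code from the comment above or from the linked package; the name `downcast_numeric` and the sample frame are made up for illustration, and the caveats raised in the replies (future data may need a larger or more precise type) still apply.

```python
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns cast to smaller dtypes where possible."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_float_dtype(s):
            # Whole-number floats without missing values can become integers.
            if s.notna().all() and (s % 1 == 0).all():
                out[col] = pd.to_numeric(s, downcast="integer")
            else:
                # Smallest float type pandas will pick here is float32.
                out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
    return out


sample = pd.DataFrame({
    "whole_floats": [1.0, 2.0, 3.0],
    "fractions": [0.5, 1.25, 2.0],
    "big_ints": [1000, 2000, 3000],
})
result = downcast_numeric(sample)
print(result.dtypes)
```

Non-numeric columns are left untouched on purpose, so the decision to cast strings to `category` stays a manual one.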

@mcdolla7965 1 year ago

Sir, can you please explain how we can convert a string to a category without using any inbuilt function?

@robmulla 1 year ago

I'm not sure what you mean. You can set the dtype to 'category' using .astype('category') read more about it here: pandas.pydata.org/docs/user_guide/categorical.html

@mcdolla7965 1 year ago

@@robmulla Thanks for the reply, sir, but let me clarify my question: without using astype or any inbuilt method, how do you convert the dtype of any column to categorical?

@robmulla 1 year ago

@@mcdolla7965 not sure that’s possible.

@mcdolla7965 1 year ago

@@robmulla It is possible, sir. Inside, the Categorical class is being called and two lists are returned, but the mechanics were too complex for me to understand. I'm sure you will help me out. Please send me your email ID so that I can send you the GitHub link, then you can go through it. That would be great content for your channel too, because it isn't available anywhere.

@adelinorafailov233 1 year ago

"NameError: name 'size' is not defined" is what I get from the beginning. Why?

@robmulla 1 year ago

Are you sure you wrote the code correctly? It looks like you might have not been running size as a method and instead python thinks it's a variable.

@muhammadfadliaktsar7172 10 months ago

@@robmulla I got the same problem too. I have been following your instructions correctly and still get that error.

@muhammadfadliaktsar7172 10 months ago

And the error, as shown in the Kaggle notebook, looks like it is about a variable that has not been declared.

@bernard-ng 1 year ago

amazing 38 mb to 7 mb 🤩

@robmulla 1 year ago

Yes 😁 its crazy how much space can be saved!

@beda9beda 11 months ago

Easy way to optimize run time

@doullagdz9479 2 years ago

is 10_000 equivalent to 10000?

@robmulla 2 years ago

Yes! It was added in Python 3.6: peps.python.org/pep-0515/
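A tiny illustration of the underscore syntax (the variable names are arbitrary). The underscores are purely visual separators in the source code and do not change the value.

```python
n = 10_000
same = (n == 10000)
print(same)

# Underscores can also be used as a thousands separator when formatting.
formatted = f"{n:_}"
print(formatted)
```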

@yusufcan1304 16 days ago

this was traffic

@Omer698 2 months ago

"NameError: name 'size' is not defined"

@Hy60K 1 year ago

Team, not "time"! Where is the size variable initialized?

@robmulla 1 year ago

Not sure I get what you mean. Did I misspeak in the video?

@nonetype66 18 days ago

Huh!?

@ErikS- 1 year ago

If you really want to use Python and run into big bottlenecks, first try moving all things to numpy only...

@robmulla 1 year ago

That’s not always so easy but I agree in some cases it is necessary indeed.

@debunkthis 1 year ago

Error in the thumbnail: pd.DataFrame

@robmulla 1 year ago

Nice catch!

@kapamagicman 1 month ago

Perfect!