How do I make my pandas DataFrame smaller and faster?

66,738 views

Data School

8 years ago

Are you working with a large dataset in pandas, and wondering if you can reduce its memory footprint or improve its efficiency? In this video, I'll show you how to do exactly that in one line of code using the "category" data type, introduced in pandas 0.15. I'll explain how it works, and how to know when you shouldn't use it.
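A minimal sketch of the one-line conversion described above. The DataFrame here is made up for illustration (it is not the dataset from the video), and the exact byte counts will vary by pandas and Python version.

```python
import pandas as pd

# Hypothetical data: 10,000 rows drawn from only four distinct strings.
df = pd.DataFrame({"continent": ["Asia", "Europe", "South America", "Africa"] * 2500})
df["continent"] = df["continent"].astype(object)  # store as plain Python strings

# deep=True counts the memory used by the strings themselves.
before = df["continent"].memory_usage(deep=True)

# The one-line change: store the column as the category dtype.
df["continent"] = df["continent"].astype("category")
after = df["continent"].memory_usage(deep=True)

print(before, after)
```

The saving comes from storing each distinct string once and keeping a small integer code per row, so it shrinks as the number of distinct values approaches the number of rows.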
== RESOURCES ==
GitHub repository for the series: github.com/jus...
"info" documentation: pandas.pydata.o...
"memory_usage" documentation: pandas.pydata.o...
"astype" documentation: pandas.pydata.o...
Overview of categorical data in pandas: pandas.pydata.o...
API reference for categorical methods: pandas.pydata.o...

Comments: 245
@dataschool 6 years ago
Starting in pandas version 0.19, you can create a category column during the file reading process! Learn more here: kzbin.info/www/bejne/Y3_Fimp7bs1-rs0 And starting in pandas 0.21, the method for specifying ordered categories has changed. Learn the new method here: kzbin.info/www/bejne/qpaYe6WJeLxggrs
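A small sketch of creating a category column while reading a file, as mentioned above. The CSV text is invented for the example and read from memory; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "country,continent\nAlbania,Europe\nAlgeria,Africa\nAngola,Africa\n"

# The dtype mapping asks read_csv to store 'continent' as a category right away.
drinks = pd.read_csv(io.StringIO(csv_text), dtype={"continent": "category"})

print(drinks.dtypes)
```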
@WaltterValdez 8 years ago
Thanks, I reduced my data from 592.4 MB to 195.0 MB using categories. That's amazing!!!
@dataschool 8 years ago
That is awesome!!
@ilyastrojnov7627 3 years ago
Remember, with big data you need pd.eval and df.query for filtering; these functions don't use memory for a temporary boolean Series.
@BadriNathJK 8 years ago
I am recommending your channel to all my friends. You are too good.
@dataschool 8 years ago
Wow, thank you!
@amandal8170 3 years ago
Yes, he is too good. Even our professor recommended learning pandas from him. lol.
@amandal8170 3 years ago
@dataschool Thanks a lot. Could we have some R Shiny or Python visualisation videos? I like your teaching style.
@readtilleternity 5 years ago
Dude, you are awesome! This is THE best tutorial on Pandas I have come across on the internet. You are really doing the internet a great favor! Thanks a lot!
@dataschool 5 years ago
Wow! Thank you so much for your kind words! :) You are very welcome.
@andreacazzaniga8488 7 years ago
Very useful! I was still a bit skeptical, but the example with the country series made it all very clear! You are good at giving the best frame to understand things.
@dataschool 7 years ago
Excellent! Glad to hear that this video was helpful to you.
@fruitfcker5351 5 years ago
If anyone is seeing a FutureWarning when specifying categories, instead of:
df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered=True)
use:
quality_dtype = pd.api.types.CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df['quality'] = df.quality.astype(quality_dtype)
@dataschool 4 years ago
Right! The API changed in pandas 0.21. More details here: kzbin.info/www/bejne/qpaYe6WJeLxggrs
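For anyone who wants to try the newer syntax end to end, here is a short sketch. The small DataFrame is a stand-in invented for this example; only the CategoricalDtype usage comes from the comments above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data for illustration.
df = pd.DataFrame({
    "ID": [100, 101, 102, 103],
    "quality": ["good", "very good", "good", "excellent"],
})

# Define the logical order once, then convert the column with it.
quality_dtype = CategoricalDtype(categories=["good", "very good", "excellent"], ordered=True)
df["quality"] = df["quality"].astype(quality_dtype)

# Sorting and comparisons now follow the declared order, not alphabetical order.
sorted_df = df.sort_values("quality")
better_than_good = df[df["quality"] > "good"]

print(sorted_df)
print(better_than_good)
```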
@UndoubtablySo 11 months ago
The category feature is super powerful, glad I learnt this.
@dataschool 10 months ago
Great to hear!
@JR-di9uk 6 years ago
You should mention that if you perform a df['mycolumn'].astype('category'), you won't be able to enter arbitrary strings into the DataFrame anymore (write ops are limited to the exact categories). This may be an advantage (typo protection) or disadvantage, depending on the use case! Otherwise, thanks for the concise and clear instructions!
@dataschool 6 years ago
That's a great point, thank you for bringing it up! I really appreciate it.
@FabioRBelotto 1 year ago
I understand that a category column only accepts the values already defined for it, but what should I do when I need to edit it? For example, for gender I used to have Male or Female. Now I need to store many other values. How do I edit or extend the category list?
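One possible way to handle this, sketched with made-up values: the `.cat.add_categories` method extends the list of allowed values. The failed assignment is tried on a copy, and as far as I know the exact exception type depends on the pandas version, so both likely types are caught.

```python
import pandas as pd

# Hypothetical column with two known categories.
s = pd.Series(["Male", "Female", "Female"], dtype="category")

# Assigning a value outside the known categories is expected to be rejected.
attempt = s.copy()
try:
    attempt.iloc[0] = "Non-binary"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print("assignment rejected:", rejected)

# Register the new category first, then the assignment is allowed.
s = s.cat.add_categories(["Non-binary"])
s.iloc[0] = "Non-binary"

print(s.cat.categories)
```

Related methods such as `.cat.remove_categories` and `.cat.rename_categories` cover the other kinds of edits.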
@jiwonkim5315 5 years ago
You’re amazing at explaining, thanks for uploading this content
@dataschool 5 years ago
You're very welcome! Thanks for your kind comments :)
@JSchellergJ 5 years ago
Good lord man, this is awesome and your way of teaching is well paced and easy to follow. You're an incredible teacher, keep this up and you will hit the stars!
@dataschool 5 years ago
Thanks so much for your kind words! Much appreciated!
@ahmadmponda3294 1 year ago
Thank you a million. I've been struggling with inplace returning a NoneType df most of the time.
@Diachron 7 years ago
Well I must sound like a broken record about how good these videos are but they only get better. I've come close on occasion to manually implementing what the category dtype does, so thanks for that revelation.
@dataschool 7 years ago
Thank you! I'm glad the category tip was helpful to you!
@uguree 3 years ago
Creating a custom ordered category is now a bit different:
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True)
df.quality.astype(cat_type)
@hsrayyar 3 years ago
thanks! It works!
@nitishkumar-bk8kd 4 years ago
Loved your explanation, great teacher
@dataschool 4 years ago
Thank you! 😃
@kp9834 4 years ago
Thank you for an excellent video on writing memory efficient code with categorical data in input. I'm interested in understanding various options to read in large dataframes (other than common pandas and spark methods) containing only numerical data, iterate over its length, create smaller dataframe out of it based on a condition and do some processing, all of which in a faster and memory efficient way. Please cover it if possible.
@dataschool 4 years ago
Thanks for your suggestion!
@sibinh 7 years ago
Really useful tips. Thanks Kevin.
@dataschool 7 years ago
You're very welcome!
@Russel4973 8 years ago
Great explanation! Never knew about "category" before.
@dataschool 8 years ago
Thanks! It's so useful, I knew I had to cover it in the video series!
@GregHacob 8 years ago
Very useful tips. You make pandas easy to understand. Thank you!
@dataschool 8 years ago
You're very welcome!
@vinayakmaheshwari3697 6 years ago
Can you make a video on how to merge, join and concatenate in Python, and also the differences between these? Nice videos by the way!
@dataschool 6 years ago
Thanks for your suggestion, I'll consider it! :)
@senupranesh 5 years ago
Amazing explanation along with hands-on practice. I am really stunned by the way of teaching. Thank you very much. Your accent sometimes reminds me of Bruce Lee.
@dataschool 4 years ago
Thank you!
@silverahmad 4 years ago
Amazing as always. This entire playlist is in my favorites bar now! I have a quick question. I tried the bonus tip on the drinks-by-continent DataFrame just to see how it works:
drinks['continent']=drinks.continent.astype('category', categories=['South America', 'Africa', 'North America', 'Europe', 'Asia', 'Oceania'], ordered=True)
and I get this error:
TypeError: astype() got an unexpected keyword argument 'categories'
Any idea why?
@jolespin 5 years ago
Possible new topic: Methods in pandas that are not well known to most users. I've been using pandas for years and didn't know about the `cat`, `str`, and `memory_usage` methods. I'm familiar with `groupby`, `applymap`, `map`, etc. but it would be cool if you could showcase some other methods that are less well known to common users. Thanks
@dataschool 5 years ago
Great suggestion, thanks!
@FabioRBelotto 1 year ago
I usually have to work with very big data samples, even for simple analysis. The main issue I face is that pandas takes more time to read/store the DataFrame than to work on it. Sadly, it is quicker and easier to just run some extractions using SQL, as it runs on the database server, than to import the data to my local machine.
@jaikapoor3666 4 years ago
Why does .info() have parentheses? Isn't it an attribute of the DataFrame?
@s.baskaravishnu22 7 years ago
I very much congratulate you for sharing the code used in the video with us. Many thanks for that. It is very useful to me. My warm regards to you.
@dataschool 7 years ago
You're welcome!
@mmimpositive 5 years ago
How do I make the output appear in tabular form as shown in your video? It makes the data much clearer.
@dataschool 5 years ago
The way the output looks is determined by your editor. I'm using the Jupyter notebook, though note that the output varies even across different versions of the notebook.
@pldeepesh 4 years ago
This is one of the coolest tutorials I have watched on pandas. Thanks for making it. I have a question though: would these categories improve the speed of a for loop if I use iterrows() on the DataFrame?
@dataschool 4 years ago
Thanks for your kind words! As for your question, I'm not sure, sorry!
@Kralnor 4 years ago
Using iterrows() in pandas is an anti-pattern and should only be done as a last resort. See engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
@biswajitpatowary5784 7 years ago
That's too good. Can you please come up with tutorial videos on Matplotlib?
@dataschool 7 years ago
Thanks for the suggestion! :)
@jolespin 5 years ago
Didn't know about the memory_usage, cat, str, etc. Nice!
@dataschool 5 years ago
Thanks!
@amish1502 3 years ago
The tutorials are super nice and helpful, but I ran into a slight problem: the 'categories' and 'ordered' arguments are not working in Python 3.9 and pandas version 1.2.2
@dataschool 3 years ago
See here: kzbin.info/www/bejne/qpaYe6WJeLxggrs
@mrmuranga 3 years ago
Amazing....I enjoy learning from the channel
@dataschool 3 years ago
Thank you!
@sunoreal 4 years ago
There are no 'categories' or 'ordered' parameters in the astype() method. I use pandas version 0.25.1. So, how do I set an order in this version? Oh, you did explain in your message. Thank you
@dataschool 4 years ago
This should help: nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas_changes.ipynb
@rvg296 3 years ago
Seems like in the latest pandas 1.1.2 version
df['quality'] = df.quality.astype('category',categories=['good','verygood','excellent'],ordered=True)
throws an error saying unexpected categories argument. I guess this should work:
df['quality'] = pd.Categorical(df.quality,categories=['good','verygood','excellent'],ordered=True)
@dataschool 3 years ago
Thanks for sharing! Yes, the pandas API for ordered categories has changed since I recorded this video.
@niteshsrivastava6504 4 years ago
Thanks for sharing your knowledge. My question is how this category type is different from label encoding. Do they do the same thing?
@dataschool 4 years ago
Great question! When using the category data type, you are defining how pandas stores that column of data. However, you still treat that column as strings when working with it within pandas. With label encoding, your goal is to convert categories to numbers so that you can work with the numbers, not the strings. Does that answer your question?
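To make the storage point concrete, here is a small sketch with invented values. It shows the integer codes pandas keeps internally and that the column still behaves like strings from the outside.

```python
import pandas as pd

# Hypothetical column converted to the category dtype.
s = pd.Series(["Asia", "Europe", "Asia", "Africa"]).astype("category")

# The lookup table of distinct values, stored once.
print(s.cat.categories)

# The per-row integer codes that point into that lookup table.
print(s.cat.codes)

# String methods still work, because you keep working with the labels.
print(s.str.upper())
```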
@virenr5767 6 years ago
Great Videos. Thank you. Would appreciate your advice on the following - I am attempting to maintain customer-wise product wise monthly sales data. The index would be the product and the columns would be the customer name. Data would have to be captured into the table every month. 1. How would you recommend setting up the structure - As different data frames for each month or as a 3 dimensional array, with the 3rd dimension being the monthly data. 2. How do you set up a blank structure containing all possible products and customers and then populate each data frame with monthly sales data received? 3. Suppose you start dealing with a new customer mid year, how do you populate the entire table with this new customer Series and then start capturing their sales data from the month they start buying? Thank you in advance, for the answers
@dataschool 6 years ago
I'm sorry, but this is way beyond what I can address in a comment... good luck!
@jewel3761 4 years ago
Why did sort_values() method not work in line 9 and instead you used sorted()?
@AbrahamHoffman 8 years ago
Yeah, this one was totally awesome. Thanks for making the videos!
@dataschool 8 years ago
Ha! Thank you for the comment! And you are very welcome, I enjoyed making these videos.
@nelsonmacy1010 3 years ago
Brilliant video! Thx. The bonus was awesome
@dataschool 3 years ago
Glad you enjoyed it!
@KhalilYasser 3 years ago
Thank you very much. Amazing tutorial. When trying this line `df['Quality'] = df.Quality.astype('category', categories = ['good', 'very good', 'excellent'], ordered=True)`, I encountered an error `TypeError: astype() got an unexpected keyword argument 'categories'`
@KhalilYasser 3 years ago
I searched and solved it like this: `from pandas.api.types import CategoricalDtype`, then I changed the line to `df['Quality'] = df['Quality'].astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))`
@jaikishank 3 years ago
Great explanation. Thank you.
@dataschool 3 years ago
You are welcome!
@FabioRBelotto 1 year ago
What share of non-unique values makes a column still worth converting to a category?
@hoegwonkim1727 5 years ago
I should have found your channel earlier! Thanks for sharing a great video
@dataschool 4 years ago
😄
@priyankrajsharma 5 years ago
awesome tutorial.. you made it so easy
@dataschool 5 years ago
Thanks!
@oliverf2924 3 years ago
Great tutorial, thank you
@dataschool 3 years ago
You are welcome!
@vvasani 7 years ago
You are the best! I'm feeling Lucky that I found your channel at right time in my learning path ...Thanks a lot! I have one question here. could you please help understanding general idea behind using 'categories' in astype method since it is not a pre-defined parameter in method documentation (if we click shift+ tab :) )? I mean what all parameters we can use in place of kwargs in an instancemethod just like we used 'categories' here? (All properties/attributes of an object?)
@dataschool 7 years ago
Glad you like the videos! Please consider subscribing to the Data School mailing list: www.dataschool.io/subscribe/ Regarding your question, I don't know how to explain the technical details behind why you can pass the argument 'categories' in this case, other than to say that it's because the pandas code has been written to allow that argument. I'm sorry if that's not what you were looking for!
@Leonardo-jv1ls 5 years ago
Man. You are insanely good.
@dataschool 4 years ago
Thank you! 😊
@tugraalp01 3 years ago
(11:00) That method might be useful for data analysis studies, but if we apply some machine learning algorithms, we HAVE TO use label encoding or one-hot encoding or similar techniques, right? I actually want to know how correct it is to convert the attribute to the 'category' type in ML instead of applying encoding techniques.
@dataschool 3 years ago
You are correct that converting to the category type does not prepare it for ML. See this video for more: kzbin.info/www/bejne/ZqiaaXZ-gsSomK8
@itsme.samrat 4 years ago
loved this part
@dataschool 3 years ago
Thanks!
@olabrew 7 years ago
Hi, could you do a lesson on using the pivot function in pandas? Haven't seen a good example anywhere.
@dataschool 7 years ago
Thanks for the suggestion! Maybe this might be helpful to you? pbpython.com/pandas-pivot-table-explained.html
@olabrew 7 years ago
Thanks! That helps to explain it a bit better. Cheers
@user-rj9vs3pr2n 9 months ago
Hi! The data file URL doesn't seem to be working all of a sudden. Could you look into it please?
@dataschool 7 months ago
You can get the datasets from here if needed: github.com/justmarkham/pandas-videos
@amitghosh425 4 years ago
At 16:44 I get the error message "ValueError: Got an unexpected argument: categories" when running "df['quality'] = df.quality.astype('category', categories=['good', 'very good', 'excellent'], ordered =True)". Please help
@dataschool 4 years ago
The pandas API has changed. See this video: kzbin.info/www/bejne/qpaYe6WJeLxggrs
@AlonsoParejawee 4 years ago
Thank you! Is it possible to create multiple DataFrames based on the categories I have in my dataset?
@serdarb8995 6 years ago
You are great Kevin
@dataschool 6 years ago
Thanks! You are great Serdar!
@ItsWithinYou 2 years ago
As usual, great lesson. Many thanks!
@dataschool 2 years ago
Thank you!
@asiftandel8750 3 years ago
Great Video Sir
@dataschool 3 years ago
Thanks!
@RohanB-xg6vg 3 years ago
Hello, currently I am using pandas version 1.2.2, and I get an error while running this code:
df.quality.astype('category',categories=['good','very good','excellent'],ordered =True)
It says that astype() got an unexpected keyword argument 'categories'. Did they remove those parameters in newer versions of pandas, since this video is a few years old?
@dataschool 3 years ago
See this video: kzbin.info/www/bejne/qpaYe6WJeLxggrs
@jdavis38100 8 years ago
Great job Kevin!
@dataschool 8 years ago
Thanks! :)
@gcm4312 8 years ago
Very useful!
@dataschool 8 years ago
Agreed! It's surprising that it's not more widely known! I'm trying to change that :)
@geocarvalhont 7 years ago
Amazing tip, thank you again!
@dataschool 7 years ago
You're very welcome!
@haoshanduan6314 6 years ago
The way you wrote categories is deprecated, now we need to write it like this:
from pandas.api.types import CategoricalDtype
df['quality']=df.quality.astype('category')
CategoricalDtype(['good','very good','excellent'], ordered=True)
@dataschool 6 years ago
Thanks! You're correct that this changed in pandas 0.21. However, I think this is the correct substitute code, which is slightly different from what you wrote:
from pandas.api.types import CategoricalDtype
cat = CategoricalDtype(['good','very good','excellent'], ordered=True)
df['quality']=df.quality.astype(cat)
Hope that helps!
@dataschool 6 years ago
I discuss the new syntax for specifying categories in my latest video, "5 new changes in pandas you need to know about": kzbin.info/www/bejne/qpaYe6WJeLxggrs
@tkannab1 6 years ago
Excellent video!! thank you!
@dataschool 6 years ago
You're very welcome!
@grijeshmnit 4 years ago
brilliantly explained.
@dataschool 4 years ago
Thank you!
@richardanderson8377 8 years ago
My question is about using categorical variables to build a logistic regression model using statsmodels. I had some 0-1 integer variables that I wanted to use as some of the predictor variables to build a logistic regression model, but converted them to categorical thinking this would avoid being treated as numerical. However, I got a ValueError: unrecognized data structures: / . Do you understand why? I can take this to a different forum if that would be better..
@dataschool 8 years ago
My video coming out on July 12 will answer that question! I'll let you know when it's posted.
@dataschool 8 years ago
Check out my latest video, and see if it answers your question: kzbin.info/www/bejne/ZqTCYnyph7SaesU Hope that helps!
@richardanderson8377 8 years ago
Nice video. My question goes a bit further. Suppose you wanted to use your k-1 dummy variables in a statsmodels or scikit-learn logistic regression. Would you leave them as type integer or convert them to type categorical?
@dataschool 8 years ago
You would leave them as type integer. Good luck!
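A rough sketch of what "k-1 dummy variables left as integers" could look like with pandas alone. The data is invented and this is not necessarily the code from the linked video; `drop_first=True` drops one level and `dtype=int` keeps the columns as integers.

```python
import pandas as pd

# Hypothetical categorical feature with k = 3 levels.
df = pd.DataFrame({"continent": ["Asia", "Europe", "Africa", "Asia"]})

# k-1 = 2 dummy columns; the dropped level (Africa) becomes the baseline.
dummies = pd.get_dummies(df["continent"], prefix="continent", drop_first=True, dtype=int)

print(dummies)
print(dummies.dtypes)
```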
@spacedustpi 5 years ago
Thanks. Very useful. Why do you prefer df.loc[df.quality > 'good', :] over df[df.quality > 'good']?
@dataschool 5 years ago
Either is fine. The first is more explicit, whereas the second is more readable, so I go back and forth! :)
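A quick check, on invented data, that the two spellings select the same rows when the filter is a boolean condition on an ordered category column.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical data with an ordered 'quality' column.
quality_dtype = CategoricalDtype(["good", "very good", "excellent"], ordered=True)
df = pd.DataFrame({
    "ID": [100, 101, 102],
    "quality": pd.Series(["good", "very good", "excellent"], dtype=quality_dtype),
})

explicit = df.loc[df["quality"] > "good", :]   # rows and columns spelled out
shorthand = df[df["quality"] > "good"]         # boolean mask only

print(explicit.equals(shorthand))
```

The `loc` form becomes necessary once you also want to pick columns or assign to the selection in one step.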
@saurabhkhodake 7 years ago
For the bonus tutorial I got the error "_astype() got an unexpected keyword argument 'categories'". Has the definition of astype() changed? I'd appreciate it if someone could help.
@mleiano 7 years ago
I had a similar error, I think what you did is you somehow ran the code without the "ordered = True" bit of the code at first or some such partial code and then tried to run it again with all the arguments as shown in the tutorial above, in that case it does show the error you mentioned. Just run the DataFrame creation command; ie, df = pd.DataFrame(...) again and then run the df.quality.astype(...) code, it should work. It did for me anyways. Let me know how it goes. Can anyone explain why it happens though? I am not sure about that.
@dataschool 7 years ago
What version of pandas are you running?
@KimmoHintikka 7 years ago
Thanks to re-running the the df creation again worked. My pandas version info from conda. pandas 0.19.2 np112py36_1 ------------------------- file name : pandas-0.19.2-np112py36_1.tar.bz2 name : pandas version : 0.19.2 build string: np112py36_1 build number: 1 channel : defaults size : 8.4 MB arch : x86_64 date : 2017-02-04 license : BSD md5 : 5ce048ed69412b7bec27989c5c963678 noarch : None platform : darwin url : repo.continuum.io/pkgs/free/osx-64/pandas-0.19.2-np112py36_1.tar.bz2 dependencies: numpy 1.12* python 3.6* python-dateutil pytz
@mdzahidulislam6857 7 years ago
I am glad that I came across your videos. They are really helpful for me. However, can we use categorical and numeric features for building decision trees in sklearn? I am getting the following error: ValueError: could not convert string to float: 'Zimbabwe' Thank you very much for your help.
@dataschool 7 years ago
You can use categorical features with any scikit-learn model, however you will need to transform them to numeric values. Here are two videos that may help you: kzbin.info/www/bejne/ZqTCYnyph7SaesU kzbin.info/www/bejne/r521nXp5qaann6c
@mdzahidulislam6857 7 years ago
Thanks a lot! These are awesome..
@dataschool 7 years ago
You're very welcome! Glad they were helpful to you :)
@evapatrick3476 5 years ago
Hi there, thanks for your excellent tutorial. I have a question that I am unable to find an answer to. Can you use these columns (ones which have been converted into categories) in analysis, specifically machine learning models? If not, how can one manage without having to use the get_dummies option, since I have a column with about 8,000 unique values?
@dataschool 4 years ago
I recommend scikit-learn's OneHotEncoder for this case. No, you can't directly feed a category column to scikit-learn. Hope that helps!
@Kavyashree40 6 years ago
Hi, your videos are superb. I learnt a lot. Could you please explain pivot and pivot_table?
@dataschool 6 years ago
Thanks! I will consider that for future videos.
@rahulgulati890 8 years ago
Thanks for sharing such great videos. Can you create a video explaining pivot tables in pandas? That would be really helpful. Regards, Rahul
@dataschool 8 years ago
You're welcome! And, I will do my best to create one on pivot table. In the meantime, here's a good post on it: pbpython.com/pandas-pivot-table-explained.html
@rahulgulati890 8 years ago
@dataschool thank you Kevin
@LonglongFeng 7 years ago
question: at 5:20, when you coded drinks.memory_usage(deep=True).sum(), it gave '24920L'. What does the 'L' mean after the figure? I think I seemed to see the 'L' thing appears when using the '.shape' function. what does that 'L' mean?
@dataschool 7 years ago
L stands for "long", which I believe refers to the "long integer" type, which is the NumPy data type being used to store that data. In other words, it's an implementation detail that you don't really need to know. Hope that helps!
@nishitsethi9405 7 years ago
Thanks for the very informative video. I have one question. How do we convert multiple columns to 'category' data type at once? In my data set, I have 25 categorical columns and 6 integer columns. So is there an efficient way of converting these 25 columns to categorical while importing the data set or after importing? Thanks.
@dataschool 7 years ago
Great question! There might be an easy way to do this, perhaps with the apply function, but I'm not sure at the moment. Let me know if you figured out an efficient method!
@kostasnikoloutsos5172 7 years ago
I am wondering if there is any cryptographic system that can convert strings to integers and then decrypt them back. If yes then why pandas do not implement that in the background to reduce space? Also if we use this astype("category") does has any effects when we export this dataframe into csv or excel file?
@dataschool 7 years ago
Question 1 - I'm not sure. Question 2 - no effect. Hope that helps!
@srosell100 4 years ago
Hi, why do you have to put memory_usage = 'deep' and not only memory_usage?
@dataschool 4 years ago
That's how you specify the parameter
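A sketch of what the 'deep' setting changes, using an invented object column. Without it pandas reports only the size of the references to the strings, not the strings themselves; the exact numbers depend on your Python and pandas versions.

```python
import io
import pandas as pd

# Hypothetical object (string) column.
df = pd.DataFrame({"continent": pd.Series(["Asia", "Europe"] * 500, dtype=object)})

shallow = df.memory_usage().sum()        # pointers only
deep = df.memory_usage(deep=True).sum()  # pointers plus the string objects
print(shallow, deep)

# info() accepts the same idea through memory_usage='deep'.
buf = io.StringIO()
df.info(memory_usage="deep", buf=buf)
report = buf.getvalue()
print(report)
```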
@srosell100 4 years ago
@dataschool Thank you very much!!! I never thought you would answer, and thank you very much in general for your content, you have taught me so much!!!
@niteshsawant2716 4 years ago
How do I auto-update the ID column?
@experimentalhypothesis1137 5 years ago
these videos are excellent!
@dataschool 5 years ago
Thanks!
@muhammadfayyaz7134 6 years ago
I would be grateful if you made some tutorials on big data analytics, thanks
@dataschool 6 years ago
Thanks for your suggestion!
@muhammadfayyaz7134 6 years ago
@dataschool I hope we will see a great tutorial series from you about big data soon. 😊
@annelizabeth728 6 years ago
Thanks for another fantastic video! I tried the tip at the end, and got a warning message: "FutureWarning: specifying 'categories' or 'ordered' in .astype() is deprecated; pass a CategoricalDtype instead." I checked the pandas documentation and substituted CategoricalDType, e.g. "cat_type = CategoricalDtype(categories=["good", "very good", "excellent"],ordered=True) [newline] df['quality'].astype(cat_type)" but that didn't really work the way I was expecting either. Is there a newer way of accomplishing this?
@dataschool 6 years ago
Thanks for your kind words! Regarding your question, you are correct that this has changed in the latest versions of pandas. However, your proposed code looks exactly correct to me. What exactly are you expecting that you are not seeing? Just to be clear, you do need to overwrite the existing 'quality' column if you want there to be a permanent change: df['quality'] = df['quality'].astype(cat_type)
@dataschool 6 years ago
I discuss the new syntax for specifying categories in my latest video, "5 new changes in pandas you need to know about": kzbin.info/www/bejne/qpaYe6WJeLxggrs Hope that helps!
@ishaangupta2223 4 years ago
Hey, Python shows an error whenever I type categories in astype, saying: astype got an unexpected keyword argument 'categories'. Can you please help?
@anngu3086 4 years ago
The syntax got updated, you'd better check out the first comment he pinned on top
@ishaangupta2223 4 years ago
@anngu3086 Thanks
@ganeshs8522 5 years ago
Hi, thanks for the nice videos! df[df.quality >'good'] also works. Is there any reason you use df.loc[df.quality > 'good'] in the last part of this video? Under what conditions do you use df[condition] vs df.loc[condition]?
@dataschool 5 years ago
In this case, I use loc to be more explicit. In general, I use loc whenever its flexibility is required.
@hariharamoorthythennetipan2190 7 years ago
cool. Very nice examples.
@dataschool 7 years ago
Thanks! Glad it was helpful to you!
@reazshafqat5504 7 years ago
First of all, thank you for all of your videos! My question is: in your case the size of the continent category is 488KB but in my case it's 744KB. Can you explain the reason behind this difference?
@dataschool 6 years ago
Glad you like the videos! Regarding your question, it's probably due to the version of pandas or Python.
@safeeqahmed3306 5 years ago
Great video. I have a question. Suppose I have a dataset about computers, with a column for the number of antivirus programs installed on a computer. I have 100 observations in total but only 3 unique values for this column (1, 2 and 3). So should I consider this column numeric or categorical?
@dataschool 5 years ago
It depends - what are you trying to predict?
@safeeqahmed3306 5 years ago
@dataschool I am predicting if a particular machine will be attacked by malware soon, based on its configuration and a number of other parameters including the number of antivirus programs installed
@dataschool 5 years ago
You would consider the column numeric.
@safeeqahmed3306 5 years ago
@dataschool thanks a lot. May I know the reason please? And why does it depend on what is being predicted?
@bhanu4187 5 years ago
I want to compare two date and time columns and produce a categorical value in a new column: if both columns have the same date and time I need 1, else 0. How can it be done? Please help me
@dataschool 5 years ago
df['new'] = (df['first'] == df['second']).astype(int)
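A runnable sketch of that comparison. The column names `first` and `second` and the timestamps are placeholders invented here. Bracket notation is used because, in pandas versions that define a `DataFrame.first` method, `df.first` refers to that method rather than to a column, and `.astype(int)` turns True/False into the 1/0 that was asked for.

```python
import pandas as pd

# Hypothetical datetime columns to compare row by row.
df = pd.DataFrame({
    "first": pd.to_datetime(["2019-01-01 10:00", "2019-01-02 11:30"]),
    "second": pd.to_datetime(["2019-01-01 10:00", "2019-01-02 12:00"]),
})

# Element-wise equality gives booleans; cast them to 1 and 0.
df["new"] = (df["first"] == df["second"]).astype(int)

print(df)
```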
@rikicade2012 3 years ago
df['Quality'] = df.Quality.astype("category", categories=["good", "very good", "excellent"],ordered=True) Any idea why I get an error when I run this?
@dataschool 3 years ago
See this video for details: kzbin.info/www/bejne/qpaYe6WJeLxggrs
@rephechaun 5 years ago
Hi Kevin, does this mean we can throw this category-converted variable into a machine learning model like Logistic Regression in sklearn or statsmodels?
@dataschool 5 years ago
No, that's not how it works, sorry!
@Jacob930321 6 years ago
What about cols=['col1', 'col2']; df[cols].apply(lambda x: x.astype('category'))
@dataschool 6 years ago
That seems like it would work!
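One way to convert several columns at once, sketched on an invented DataFrame. It uses `DataFrame.astype` with a dictionary instead of `apply`; either way the result has to be assigned back, since the conversion does not happen in place.

```python
import pandas as pd

# Hypothetical frame with two string columns and one numeric column.
df = pd.DataFrame({
    "col1": ["a", "b", "a"],
    "col2": ["x", "x", "y"],
    "n": [1, 2, 3],
})

cols = ["col1", "col2"]

# Map each chosen column to the category dtype and keep the result.
df = df.astype({c: "category" for c in cols})

print(df.dtypes)
```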
@danielmayper6548 4 years ago
I've been following along on your examples and they've all been incredible, but I encountered an error I can't see to get around on this one. At about 16:45, the command df['quality'] = df.quality.astype('category', categories=['good','very good','excellent'], ordered=True) is given and whenever I try and submit that line to the compiler I get the error ValueError: Got an unexpected argument: categories Was there an update to Pandas that may have changed this function or is there some kind of error I'm not aware I'm making?
@danielmayper6548
@danielmayper6548 4 years ago
I had tried going to your GitHub and copying the line you used from there, but I was getting the same error.
@dataschool
@dataschool 4 years ago
The pandas API has changed, please see this video: kzbin.info/www/bejne/qpaYe6WJeLxggrs
@TheAlderFalder
@TheAlderFalder 5 years ago
This was awesome!
@dataschool
@dataschool 5 years ago
Thanks!
@aakashkumarnain7592
@aakashkumarnain7592 8 years ago
Hello Kevin!! How can I rename my columns which I changed to categorical data to the original names of the columns?
@dataschool
@dataschool 8 years ago
You can use the DataFrame method 'rename', which I talk about in this video: kzbin.info/www/bejne/ZqalmqWPe82csKc
@kostasnikoloutsos5172
@kostasnikoloutsos5172 7 years ago
You used a parameter called categories. This is not in the parameters of the astype method. I think it's in **kwargs. In the docs I found this: kwargs : keyword arguments to pass on to the constructor. Where is the constructor? I cannot understand this.
@dataschool
@dataschool 7 years ago
Sorry, I don't know how to answer your question!
@rdg8268
@rdg8268 6 years ago
I need something like categories for an age range, for example 0-10, 0-20... Is it possible?
@dataschool
@dataschool 6 years ago
Sure!
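The reply does not show how. One way to sketch it (an assumption, not the author's stated method) is `pd.cut`, which bins numeric values into labeled ranges and returns a categorical result. The bin edges and labels below are made up.

```python
import pandas as pd

# Made-up ages and bin edges, for illustration only.
ages = pd.Series([3, 15, 27, 42, 68])
bins = [0, 10, 20, 40, 100]
labels = ['0-10', '11-20', '21-40', '41-100']

# Each age falls into the interval (left, right]; include_lowest=True
# makes the first interval also contain its left edge (age 0).
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(age_group)
```

The result is an ordered categorical Series, so the groups can be sorted and compared in range order.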
@lonewolf2547
@lonewolf2547 5 years ago
For my dataset it reduced the size by approximately 50%. What I wanted to ask is: if it has to do a lookup each time, does this increase the time complexity?
@dataschool
@dataschool 5 years ago
No, the lookup shouldn't take a meaningful amount of time.
@PradeepKumar6
@PradeepKumar6 8 years ago
Amazing as always!!! Is it possible to convert this type of data into category while we read the data into Python? Also, there is another data type called datetime. I think it would be great if you could enlighten us on that as well, for the purpose of datetime manipulation in the future.
@dataschool
@dataschool 8 years ago
Thanks! Regarding your first question, I haven't figured out a way to do it. Regarding datetimes, I will cover that in an upcoming video :)
@dataschool
@dataschool 8 years ago
My latest video on the datetime format has been released: kzbin.info/www/bejne/r3TKe3qpnJWLl5Y Hope that helps!
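As the pinned comment at the top of this thread notes, pandas 0.19 and later can create a category column during file reading. A minimal sketch, assuming a small in-memory CSV with made-up column names in place of a real file:

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk.
csv_text = "country,continent\nAlbania,Europe\nAlgeria,Africa\nAndorra,Europe\n"

# The dtype mapping asks read_csv to build 'continent' as a category directly.
df = pd.read_csv(io.StringIO(csv_text), dtype={'continent': 'category'})

print(df.dtypes)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.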
@vasanthnayak4086
@vasanthnayak4086 7 years ago
Hi... Thanks for sharing the greatest series of videos on pandas! Quick question: is there a way to load a CSV (larger than 2 GB) into a pandas DataFrame on a system with 2 GB of RAM? I am getting a 'memory error' while executing the code. I can't use 'category'; I need the data exactly as it is in the CSV. Thanks!
@dataschool
@dataschool 7 years ago
Thanks for your kind words! One strategy is to read in only some of the rows and columns (only the ones you need), demonstrated here: kzbin.info/www/bejne/eF7VaomrgJ1jms0
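Reading only some of the rows and columns can be sketched with the `usecols` and `nrows` parameters of `read_csv`. The CSV content and column names below are made up, and an in-memory string stands in for the large file.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a large file on disk.
csv_text = "a,b,c\n1,x,10\n2,y,20\n3,z,30\n4,w,40\n"

# usecols keeps only the listed columns; nrows stops after that many data rows.
df = pd.read_csv(io.StringIO(csv_text), usecols=['a', 'c'], nrows=2)

print(df)
```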
@Om-iy9ix
@Om-iy9ix 6 years ago
Hi there, great videos. When we wrote drinks.continent.cat.codes.head() we got 1 2 0 2 0, and when I ran drinks.head() after that, it displayed Asia, Europe and so on instead of just numbers that point to a lookup table containing strings. Then I ran drinks.memory_usage(deep=True), which showed a reduced size for continent. How does this work? On one side the change is not reflected in the DataFrame, and on the other side the size is reduced. Hope you can help me out soon. Thanks a lot for your amazing videos. Please make more videos on data science and ML topics.
@dataschool
@dataschool 6 years ago
Great question! The integers are the internal encodings for those categories, and the size is reduced due to those encodings. Does that help? You might like this video series: kzbin.info/aero/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
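A small sketch of what the answer describes, using a made-up Series rather than the drinks dataset from the video: the displayed values stay as strings, while the integer codes stored internally are what shrink the memory footprint.

```python
import pandas as pd

# Made-up data: a few distinct strings repeated many times.
s = pd.Series(['Asia', 'Europe', 'Africa', 'Europe', 'Africa'] * 1000)
c = s.astype('category')

# The lookup table of distinct values, sorted: Africa, Asia, Europe.
print(c.cat.categories)

# The integer codes actually stored for the first five rows: 1, 2, 0, 2, 0.
print(c.cat.codes.head())

# Displaying the Series still shows the strings, not the codes.
print(c.head())

# Memory comparison between the original and the categorical version.
print(s.memory_usage(deep=True), c.memory_usage(deep=True))
```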
@tusharraikar7103
@tusharraikar7103 6 years ago
My data is 3 GB and it gives memory issues when I use df = pd.read_csv(path). How can I avoid this?
@dataschool
@dataschool 6 years ago
Sorry, I don't have a simple answer for you!
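One option that is sometimes used in this situation (not suggested in the reply, and not a guaranteed fix) is to read the file in pieces with the `chunksize` parameter of `read_csv` and aggregate as you go. The data below is made up and an in-memory string stands in for the 3 GB file.

```python
import io
import pandas as pd

# Made-up CSV with a single numeric column, standing in for a large file.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
rows = 0
# With chunksize set, read_csv yields DataFrames of at most 4 rows each,
# so only one chunk needs to be held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    rows += len(chunk)

print(rows, total)
```

This only helps when the analysis can be done chunk by chunk; it does not make the whole dataset fit in memory at once.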
@vanmemet
@vanmemet 7 years ago
Thanks for your great videos. I really enjoy watching them and am learning a lot. But most of these concepts are already addressed in the SQL world. I think that when you teach in the videos, you could relate these subjects to their SQL counterparts. IMHO.
@dataschool
@dataschool 7 years ago
SQL and pandas can indeed accomplish many of the same tasks. For SQL users, you are right that SQL comparisons might be helpful. You might like resource #5 here: www.dataschool.io/best-python-pandas-resources/
@TrevorHigbee
@TrevorHigbee 4 years ago
Looks like the pandas ordered categories syntax has changed. It should now be:
from pandas.api.types import CategoricalDtype
df['quality'] = df.quality.astype(CategoricalDtype(categories=['good', 'very good', 'excellent'], ordered=True))
@pdileepan
@pdileepan 4 years ago
What worked for me is: df['quality'] = pd.Categorical(df.quality, categories=['good', 'very good','excellent'], ordered=True)
@mehmetbugu
@mehmetbugu 4 years ago
@pdileepan thanks
@dataschool
@dataschool 4 years ago
Thanks for sharing! I have more details here: kzbin.info/www/bejne/qpaYe6WJeLxggrs
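The two approaches shared in this thread can be checked side by side. This is a sketch on made-up data (the 'quality' values are placeholders) that also shows what the ordering buys you: logical sorting and comparisons.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up data with the same three quality levels discussed above.
df = pd.DataFrame({'quality': ['good', 'excellent', 'very good', 'good']})
levels = ['good', 'very good', 'excellent']

# Approach 1: astype with a CategoricalDtype.
a = df['quality'].astype(CategoricalDtype(categories=levels, ordered=True))

# Approach 2: the pd.Categorical constructor.
b = pd.Series(pd.Categorical(df['quality'], categories=levels, ordered=True))

df['quality'] = a

# Sorting follows the declared order rather than alphabetical order.
sorted_q = df.sort_values('quality')['quality'].tolist()

# Comparisons against a category label use the same order.
better = df.loc[df['quality'] > 'good', 'quality'].tolist()

print(sorted_q)
print(better)
```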
@surbhiagrawal3951
@surbhiagrawal3951 4 years ago
Why am I getting the error ValueError: Got an unexpected argument: categories? I also get the same kind of error for ordered: ValueError: Got an unexpected argument: ordered
@surbhiagrawal3951
@surbhiagrawal3951 4 years ago
Just now got the answer. We now need to pass them in via CategoricalDtype, as the astype method no longer accepts them.
from pandas.api.types import CategoricalDtype
df = pd.DataFrame({'A':[1,2,3,4,5], 'B':['a','b','c','d','e'], 'C':['A','B','A','B','A']})
df['C'] = df['C'].astype(CategoricalDtype(categories=['A','B']))
@dataschool
@dataschool 3 years ago
Right, the pandas API for this has changed since the recording!
@dhananjaykansal8097
@dhananjaykansal8097 5 years ago
Guys I get this error. Why so? Is anyone facing the same. TypeError: '
@dataschool
@dataschool 5 years ago
I'm not clear what you are trying to do here. However, I don't recommend trying to use Python functions with pandas objects.