How do I create dummy variables in pandas?

Views: 86,012

Data School

1 day ago

Comments: 381

@dy1262 8 years ago

very easy to follow and understand, in contrast with many other tutorials I found, great and many thx

@dataschool 8 years ago

Great to hear! Thanks for your kind words!

@jwilliams8210 1 year ago

Amazingly clear explanation!!! Thank you!!

@dataschool 11 months ago

You're very welcome!

@souraneelmandal7912 2 years ago

How do you create that structured table style in Jupyter?

@brendensong8000 3 years ago

Wow!!! This is my first time watching you teach. It's crystal clear!!! Looking forward to more videos!

@dataschool 3 years ago

Awesome! Thank you!

@TheNikhileshYadav 5 years ago

Hello Kevin, for a multi-label categorical field with more than 600 entries, can the same dummy variable strategy be followed? If not, then please suggest ways in which it can be converted to numeric form. Thank you.

@dataschool 5 years ago

You can use the same strategy, though I would recommend using OneHotEncoder from scikit-learn. Hope that helps!
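A minimal sketch of the OneHotEncoder suggestion, using a hypothetical `city` column as a stand-in for a field with many categories. The toy data and the `handle_unknown="ignore"` setting are illustrative assumptions, not something shown in the video.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for a high-cardinality column
df = pd.DataFrame({"city": ["Paris", "Lima", "Oslo", "Lima", "Paris"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# The result is a sparse matrix by default, which keeps memory use
# manageable when there are hundreds of categories
encoded = encoder.fit_transform(df[["city"]])

print(encoded.shape)           # one column per distinct category
print(encoder.categories_[0])  # the categories learned during fit
```

Because the encoder remembers the categories it was fitted on, the same fitted object can later be applied to new data with `encoder.transform(...)`.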

@Raaajzzz 1 year ago

Thank you for illustrating it so well. I was not clear on the reasoning behind dropping the first column when using the dummies, but now I have a clear idea about that.

@dataschool 1 year ago

Glad I could be helpful!

@sandhya6818 4 years ago

That bonus is awesome... Thank you so much... You explained it so well...

@dataschool 4 years ago

My pleasure!

@tronalddump2444 2 years ago

Thanks bro. You are my hero ❤

@dataschool 2 years ago

Thank you!

@crigar001 4 years ago

Amazing bro, do you have material in español?

@ramleo1461 5 years ago

Hi Kevin, in relation to the bonus question, do I need to assign the results of get_dummies to a variable in order to make the changes permanent?

@dataschool 5 years ago

Yes you do!
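A short sketch of why the assignment is needed, assuming the bonus tip refers to passing the whole DataFrame to `get_dummies` with the `columns` parameter. The three-row `train` frame is a made-up stand-in for the Titanic data.

```python
import pandas as pd

# Toy stand-in for the Titanic training data (values are illustrative)
train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

# get_dummies returns a new DataFrame; calling it alone leaves `train` as it was
pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)
unchanged = list(train.columns)
print(unchanged)  # still contains 'Sex' and 'Embarked'

# Assigning the result back is what makes the change stick
train = pd.get_dummies(train, columns=["Sex", "Embarked"], drop_first=True)
print(list(train.columns))
```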

@luisportillo3491 3 years ago

Dude, you're amazing! new follower here!

@dataschool 3 years ago

Thanks!

@alainleclerc4523 1 year ago

you are a wonderful teacher!! thank you very much!!

@dataschool 1 year ago

Thanks so much!

@Negr0ni 3 years ago

Your videos are making me passionate about the data science career again, and they are making my first job in data analytics so much easier. Thank you so much!

@dataschool 3 years ago

You're welcome!

@haciendadad 5 years ago

I really like that he explains the extra attributes and the things that people gloss over. For example, the : and axis. I'm a newbie, so that little stuff was useful to me.

@dataschool 4 years ago

Great to hear!

@rohitjacob8890 6 years ago

Hello Kevin. I am a big fan of your work. Being a big user of R, your tutorials have made me like Python so much that I have completely switched to Python at work now. It would be very helpful if you did a video series on each of the other basic Python packages like numpy, matplotlib, seaborn, statsmodels and bokeh. Learning from your videos is so much easier and less time consuming. Currently I am working on my internship during my course and I use at least one of your tips daily at work. Thanks again. Hoping to see more good content like this. Cheers!!!!!!

@dataschool 6 years ago

That's awesome to hear! Thanks for your kind comments and suggestions! I will do my best :)

@nadineprins1647 4 years ago

This was so useful! i didn't know your channel before I googled how to make dummies in pandas. Definitely going to check out your other videos :)

@fet1612 4 years ago

2:05 The Series map method:
train['Sex_male'] = train.Sex.map({'female': 0, 'male': 1})
train.head(2)
Dummy encoded with map({'female': 0, 'male': 1}): female ==> 0, male ==> 1

@fet1612 4 years ago

7:10
train.Embarked.value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
The embarkation points of the RMS Titanic in April 1912 were: (1) Southampton, England, (2) Cherbourg, France, and finally (3) Queenstown, Ireland.

@fet1612 4 years ago

3:55 Try the following piece of code:
train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex_male'], dtype='object')

@AhmedKhaliet 2 years ago

Thank you 💞 it's really great ❣️

@dataschool 2 years ago

You're welcome!

@RachelBb-k6q 8 days ago

Hey Kevin, regarding dummy variables, what technique can I apply to a model's input data if I am foreseeing high sales performance in the future, pent-up demand, etc.? Would you still add 0 and 1 to flag those dates?

@aytachuseyn3810 3 years ago

I have a question. 🙋‍♂️ There are two variables and dozens of observations in the set that we converted to dummy variables. If we delete one of the dummy variables and then delete the original variable, how does the machine understand at training time which one belongs to which? E.g., Sex_female and Embarked_C have been deleted. Then a new observation comes in for prediction: Sex_male: 0, Embarked_Q: 0 Embarked_Q: 0, 1, 0. Is it Sex_female 0 or is it Embarked_C 0? How does the machine know which variable is Sex_female and which is Embarked_C? (There is no ordering because the real variables have been deleted.) P.S. If you do not understand the question, I'm sorry for my bad English.

@sarmigarmi 4 years ago

Awesome! I have a question. Why are we dropping the first column? As in, for example, Embarked_C?

@dikshyasurvi6869 3 years ago

This was useful. How do you create dummies for specific ranges? For instance, 10-50% as group 1, 50-70% as group 2, etc.

@andreacazzaniga8488 4 years ago

Very good, especially the last trick!

@dataschool 4 years ago

Thank you!

@Malachiasz1983 4 years ago

Great video. It's a shame that KZbin algorithm will probably demonetize it due to "sexual content" :(

@Jinsh0 5 years ago

SUPER VIDEO!! Very Useful!

@dataschool 5 years ago

Thanks!

@ROT4C 3 years ago

Suppose I have multiple columns of dummy variables and I simply want a sum of the variables across those columns, how do I do that?

@ashokgahatraj1210 2 years ago

It is crystal clear, thanks man ❤️

@dataschool 2 years ago

You're very welcome!

@ashwinsingh1325 5 years ago

These are great tutorials! Finally found a clear, concise explanation for why your code is written the way it is :)

@dataschool 4 years ago

Thank you!

@LS-rw3hn 5 years ago

Dude seriously, you just saved me a lot of work.

@dataschool 5 years ago

Awesome, that's great to hear!

@MrKingoverall 5 years ago

THAAAANKKK YOUUUUUU !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! I love you man !!!!!!!

@dataschool 5 years ago

Ha! You are very welcome! 😍

@VarunKumar-pz5si 4 years ago

Living Legend Kudos...!!!!!!

@dataschool 3 years ago

Thanks!

@JerryBlane 5 years ago

Hi, just wanted to say I love your videos. Can you please do a video on join(), concat(), and merge()?

@dataschool 5 years ago

Thanks for your suggestion! See here for concat: kzbin.info/www/bejne/Z2bUXpypbbWSfpY

@watheusbr 2 years ago

so helpful, thanks a lot!

@dataschool 2 years ago

Great to hear!

@rhettsmedia 4 years ago

Has the public repository moved?

@lumosyang 6 years ago

OMG you just saved my ass, thank you!!! and love you!!! will follow up and watch thru all your data videos.

@dataschool 6 years ago

You're very welcome! :)

@vijayanandhan4649 5 years ago

Great tutorial on how to deal with categorical variables using dummies. The last bonus tip helped with my assignment.

@dataschool 5 years ago

Great to hear!

@robertue1 3 years ago

Thank you so much for this video, really well and easily explained!

@dataschool 3 years ago

Thank you!

@angsumandas1 3 years ago

My dox. Its 4 year old I am seeing it now

@alensadventures2080 4 years ago

Hey I'm new to Python and I just wanted to say that your videos are super clear and easy to understand! This has been a great help for me! Teaching code is clearly your calling

@dataschool 4 years ago

Thanks very much for your kind words! I really appreciate it 🙏

@D4nte-RN 10 months ago

As usual... I try to understand some ML concept which is not clear to me. I go the same way: click, click, click between videos from YouTubers, most of whom make videos from the same source, without thinking, without understanding. And then finally, once again, I'm on your channel and you explain everything to me clearly and slowly. Thanks for your amazing job!

@dataschool 9 months ago

Thanks so much for your kind words!

@ajaykushwaha-je6mw 3 years ago

Best tutorial video on Dummy variable.

@jasonwong8315 4 years ago

awesome!! excellent!!!

@dataschool 4 years ago

Thank you!

@dembobademboba6924 11 months ago

Very helpful and very interesting....keep up the good work always bro....

@dataschool 9 months ago

Thank you!

@arjunpukale3310 5 years ago

Should we apply feature scaling to categorical columns?

@dataschool 4 years ago

I'm not sure there is a definitive answer to this, sorry!

@manasa41087 8 years ago

I am addicted to your videos ...I want to re do my old assignments with all the tricks :)

@dataschool 8 years ago

Ha! Great to hear :)

@vipul5340 4 years ago

Can we assign numbers from 0,1,2,3... to a categorical variable rather than making n-1 extra columns?

@dataschool 4 years ago

Yes, but you should generally only do that if the categories have a natural ordering.
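To illustrate the "natural ordering" case, here is a small sketch using a hypothetical `size` column (not from the Titanic data). It relies on pandas' ordered categorical dtype, which is one of several ways to get integer codes that respect a chosen order.

```python
import pandas as pd

# Hypothetical column whose categories have a natural order
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Declare the order explicitly: small < medium < large
size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)

# .cat.codes gives the position of each value in the declared order
df["size_code"] = df["size"].astype(size_type).cat.codes

print(df)
```

For an unordered feature such as Embarked, the same integers would suggest an order that does not exist, which is why separate dummy columns are usually the safer choice there.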

@Thelaunius 4 years ago

Hi. I understand we only need k-1 dummy variables because we can infer the last variable from the rest, but how would that affect certain classifiers like rule-based ones for example? If they don't have that last variable they cannot create rules like "IF Vk = 1 THEN class = 0". I am thinking that they might not be able to infer it because they only use what columns they have.

@dataschool 4 years ago

I'm not sure how to answer that question, I'm sorry!

@sandy011187 5 years ago

Thank you. I was searching for what drop_first=True does, and I found this video. The bonus tip you explained cleared up this doubt. Please make more videos like this on interesting tricks and tips for Python, machine learning and data science.

@dataschool 5 years ago

You are in luck, because I'm working on a video of my top 25 pandas tricks right now!! Stay tuned...

@flutterflowhack 2 years ago

Easy to understand, straight to the point thank you for your tutorials they have been of great help

@dataschool 2 years ago

You're welcome!

@8eck 3 years ago

Very clear and very helpful, thank you very much. But I still don't understand why we need to remove a column after making dummy columns. Is it like with training and test data?

@dataschool 3 years ago

Whether or not you need to depends on the circumstances. See this video for more: kzbin.info/www/bejne/hIrXqKysrtt3e80

@haciendadad 5 years ago

Wow, rarely do you see such a high rating. Usually about 10-20% vote down. Good ones are like 5%; this guy has less than 1%. Gotta subscribe to him if he is that good! I loved the first video, can't wait to see more.

@dataschool 4 years ago

Thank you so much!

@fet1612 4 years ago

6:45 Break the code below into several segments and contemplate the output. Ask yourself why it happens and when it doesn't. Then following along will start making more sense. Remember, a good data scientist is always thinking and is always LEARNING.
pd.get_dummies(train.Sex)
pd.get_dummies(train.Sex, prefix='Sex')
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]

@Ihsan_almohsin 3 years ago

you are simply awesome

@dataschool 3 years ago

Thank you!

@prathikasundaramurthy 5 years ago

Hey, what if I have a larger amount of categorical data? E.g.: 15000 unique values of that feature

@dataschool 5 years ago

It depends, but you probably wouldn't use dummy variables. There's not a simple answer, sorry!

@marklittlewood2418 7 years ago

If you can create a video or series on Tensorflow that is not esoteric then I would be more impressed than I already am with your video tut's, many thanks

@dataschool 7 years ago

Thanks for your suggestion!

@fet1612 4 years ago

3:50 Dummy variables, an alternative method:
pd.get_dummies(train.Sex)
This is a top-level function, meaning you have to write pandas. (or pd.) before it, as in pandas.get_dummies().

@twafsimon103 3 years ago

If we want all three values of the Embarked feature in a single column, meaning values 0, 1 and 2 for the individual categories, how could we do it?

@dataschool 3 years ago

You could use the map method, see this video for an example: kzbin.info/www/bejne/hpDUYaehjtapic0
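A minimal sketch of the map approach for Embarked. The four-row `train` frame and the particular numbering (C=0, Q=1, S=2) are illustrative choices, not taken from the video.

```python
import pandas as pd

# Toy stand-in for the Embarked column of the Titanic data
train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Map each category to an integer in a single new column
train["Embarked_num"] = train.Embarked.map({"C": 0, "Q": 1, "S": 2})

print(train)
```

Any category missing from the dictionary becomes NaN, so it is worth checking the result with `train.Embarked_num.isna().sum()` on real data.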

@sumitbali9194 5 years ago

Can't thank you enough for the BONUS tip!!!! Impressed!!!

@dataschool 5 years ago

You're very welcome! :)

@sushichanel7299 6 years ago

We'd like to know more about tensorflow and machine learning. Thanks so much for great videos.

@dataschool 6 years ago

Thanks for your suggestion!

@sushichanel7299 6 years ago

Thanks so much Sir.

@samc2481 6 years ago

yeah, Thanks kevin, but tensorflow tutorial would be booommmm, please try it, Thanks

@dataschool 6 years ago

I appreciate the suggestion!

@vinayaknaik540 4 years ago

Hi, I wanted to know which one you prefer: OneHotEncoder from sklearn or the pandas get_dummies function? What are the pros and cons of both methods?

@dataschool 4 years ago

I now recommend OneHotEncoder from scikit-learn if your goal is to prepare your dataset for Machine Learning. I have a whole video explaining exactly how to do this: kzbin.info/www/bejne/n6OrmXeDl9xmrtE

@amitdarak 7 years ago

Why does pd.get_dummies work with iloc and not loc?

@dataschool 7 years ago

It will work with either, but I use iloc because it allows me to always use the same code since I'm referencing columns by position. If you use loc, you have to reference columns by name, but the names will change every time. More information is here: kzbin.info/www/bejne/rqfTf3Rtl6hrmdU
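The difference can be sketched side by side. The toy `train` frame is a stand-in for the Titanic data; the point is that the `iloc` slice is identical for both features, while the `loc` selection has to name each feature's columns.

```python
import pandas as pd

# Toy stand-in for two Titanic columns
train = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

sex_dummies = pd.get_dummies(train.Sex, prefix="Sex")
embarked_dummies = pd.get_dummies(train.Embarked, prefix="Embarked")

# By position: the same slice drops the first dummy column for any feature
sex_by_position = sex_dummies.iloc[:, 1:]
embarked_by_position = embarked_dummies.iloc[:, 1:]

# By label: the column names must be spelled out for each feature
sex_by_label = sex_dummies.loc[:, ["Sex_male"]]
embarked_by_label = embarked_dummies.loc[:, ["Embarked_Q", "Embarked_S"]]

print(sex_by_position.equals(sex_by_label))
print(embarked_by_position.equals(embarked_by_label))
```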

@carlosdiaz3428 4 years ago

Hi Kevin, how could I apply this to numeric variables? For example, if the ticket fare is in [0, 2000), have a 0, and if it is in [2000, inf), have a 1. Thanks!

@bharath-cm2bt 5 years ago

thank u......

@dataschool 5 years ago

You're welcome!

@dipakraut6058 5 years ago

Great Explanation, Just Amazing.

@dataschool 4 years ago

Thank you!

@rashayahya 4 years ago

Can you please explain the difference between join, concat, and append? Thanks.

@dataschool 4 years ago

I just released a video on that topic! See here: kzbin.info/www/bejne/n4q6fJmLhNl6l9k

@sourovroy7951 5 years ago

Great!

@dataschool 5 years ago

Thanks!

@alal-zj4zb 5 years ago

Very nice video and great explanation. Keep it up 👏👏

@dataschool 5 years ago

Thank you!

@kostasnikoloutsos5172 7 years ago

I cannot understand why dummy variables are useful. At first I thought it was something similar to the category type we learned about earlier, but by the end of this video I realized that they are not, and I still don't know when and why I need those dummy variables!

@dataschool 7 years ago

I cover dummy encoding in this lesson: github.com/justmarkham/DAT8/blob/master/notebooks/10_linear_regression.ipynb Hope that helps!

@muslumyildiz5694 3 years ago

Thank you so much. You are a really wonderful great instructor..

@dataschool 3 years ago

Thank you so much!

@seansantiagox 3 years ago

Thanks for showing how to add this to the dataframe, very helpful!

@dataschool 3 years ago

Glad it was helpful!

@kuldipchauhan524 6 years ago

Your videos are awesome. I come back to your videos whenever I get stuck anywhere. Not only do I get solutions, I get a bonus, which is always for real.

@dataschool 6 years ago

Thanks for your kind words! Glad I can be helpful :)

@sabinadhikari2643 3 years ago

Which encoder should we use if the column has more than 100 categorical values?

@dataschool 3 years ago

That's a complex question, but you can always try one-hot encoding or ordinal encoding, regardless of the number of levels.

@jaikishank 4 years ago

Great video and simple explanation. Thank you. One clarification: if we need to feed the columns to the DataFrame for modelling, I assume we should not use drop_first=True (since the variable will be lost), or am I assuming wrong???

@alimahmood4158 4 years ago

Hi there bro, I have 24 different categories. So how many columns should I drop in that case?

@dataschool 4 years ago

Sorry, I'm not sure I understand?

@wuminminnie 3 years ago

This is awesome, thank you so much

@misslindiwelive 2 years ago

Once again, my fighter!

@ankitgupta6697 4 years ago

Sir, I want to know: what does the get_dummies() function do and why is it needed?

@dataschool 4 years ago

That's what the video covers! Hope it's helpful to you.

@borntolose_livetowin 6 years ago

Let's imagine I have NaN in my Embarked column. Regardless of my replacement value, I would have 4 new columns. How many columns (or which) would I have to remove?

@dataschool 6 years ago

I think I understand your question... you can still define any of those columns as the baseline level and remove it. Hope that helps!

@borntolose_livetowin 6 years ago

Ahhh, ok, I see ... a NaN-value is more or less nothing else than another category... just checked the documentation, NaN will be handled by the get_dummies-function by default as baseline :-) thanks!
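The default NaN behaviour described here can be checked with a short sketch. The four-value `embarked` Series is made up; `dummy_na` is a real `get_dummies` parameter that adds an explicit indicator column for missing values.

```python
import numpy as np
import pandas as pd

# Toy Embarked column with one missing value
embarked = pd.Series(["S", "C", np.nan, "Q"], name="Embarked")

# Default: no column is created for NaN, so the missing row is all zeros
default_dummies = pd.get_dummies(embarked, prefix="Embarked")

# dummy_na=True adds a separate indicator column for missing values
with_na = pd.get_dummies(embarked, prefix="Embarked", dummy_na=True)

print(default_dummies)
print(with_na)
```

With the default, a missing value is indistinguishable from the baseline level once a column is dropped, so `dummy_na=True` or an explicit fill value may be preferable when missingness carries information.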

@22MJangel 5 years ago

Detailed and systematic= easy to follow..

@dataschool 5 years ago

Thanks!

@eric3372 6 years ago

This was an exceptional video! Thank you so much! Sincerely!

@dataschool 6 years ago

You're very welcome! Glad it was helpful!

@jourdango2615 5 years ago

Hi, I understand how dummy variables work, but why would we want to drop the first dummy variable column? If I were someone looking at the DataFrame, I'm going to end up thinking that 'male' or 'not male' are the categorical values for Sex, and I'm going to think Embarked only has 'Q', 'S', and 'Not Q and Not S'; I'm not going to know that the other Embarked value is 'C'. Isn't this dropping readability and data? How does this help????

@juliangermek4843 5 years ago

As someone who didn't come too far in data science yet, I'd say: You're right, by dropping these columns you forget what this first category was. We don't create this dummy dataset for humans to read, however, but for computers and their algorithms. They don't know what these letters mean anyway (neither do I, as a matter of fact); for them it is just important to be able to distinguish between three cases; and this they can still do: Q, S, or neither of them. I stand to be corrected by someone with more experience ;) (Was curious: the letters apparently indicate the Port of Embarkation: C = Cherbourg; Q = Queenstown; S = Southampton)

@dataschool 5 years ago

Excellent answer, Julian! 👏
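The point that three cases remain distinguishable with only two columns can be sketched as follows. The toy `embarked` Series is illustrative; `drop_first` is the `get_dummies` parameter discussed in the thread.

```python
import pandas as pd

# Toy Embarked column covering all three ports
embarked = pd.Series(["C", "Q", "S", "C"], name="Embarked")

# All three dummy columns
full = pd.get_dummies(embarked, prefix="Embarked")

# drop_first=True removes the first level (C), which becomes the baseline
reduced = pd.get_dummies(embarked, prefix="Embarked", drop_first=True)

# Each port still has its own row pattern: C is (0, 0), Q is (1, 0), S is (0, 1)
patterns = reduced.astype(int).drop_duplicates()

print(reduced.astype(int))
print(len(patterns))  # number of distinct patterns
```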

@twafsimon103 3 years ago

I am always inspired by your lecture thanks

@dataschool 3 years ago

Thank you! 🙏

@PankajMishra-rt6hr 8 years ago

Hey Kevin :) One question... here if we use get_dummies we add more and more columns to our DataFrame. Is there any way to do this in place? For example, if our series has 'adult', 'kid', 'senior_citizen', then wherever it occurs adult gets replaced by 0, kid by 1, senior_citizen by 2, and so on for different values. Can I map like this? Thanks

@PankajMishra-rt6hr 8 years ago

EDIT: I have found it. For future readers, we can do this using sklearn's preprocessing package. STEPS:
1) Import the class: from sklearn.preprocessing import LabelEncoder
2) Make an object (or whatever it is called): le = LabelEncoder()
3) To convert into numbers: train['Sex'] = le.fit_transform(train['Sex'])
4) To convert back: train['Sex'] = le.inverse_transform(train['Sex'])
That's it :)

@dataschool 8 years ago

Right! LabelEncoder is useful for taking a series of categorical data and converting it into a series of integers representing the categories. You can also do this within pandas using factorize: pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.factorize.html
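The two approaches mentioned here can be compared on a toy `sex` Series (made up for illustration). One difference worth noticing is how the integer codes are assigned.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical data
sex = pd.Series(["male", "female", "female", "male"])

# scikit-learn: classes are sorted, so female -> 0 and male -> 1
le = LabelEncoder()
encoded = le.fit_transform(sex)
decoded = le.inverse_transform(encoded)  # back to the original strings

# pandas: codes follow order of first appearance, so male -> 0 and female -> 1
codes, uniques = pd.factorize(sex)

print(encoded, list(le.classes_))
print(codes, list(uniques))
```

The scikit-learn documentation describes LabelEncoder as intended for target labels rather than input features, so for feature columns `factorize` or an explicit `map` may be the more conventional choice.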

@thereadletter2426 7 years ago

That bonus tip is amazing. Thank you!

@dataschool 7 years ago

Glad you liked it! You're very welcome :)

@NikhilKumar-pz3uz 6 years ago

I used the bonus tip, but the columns I did not pass in the columns list are also getting converted. How can we solve this??

@dataschool 6 years ago

Sorry, it's hard for me to say without seeing your code. Good luck!

@dembobademboba6924 11 months ago

Please send me more Python links, the entire guidelines...

@dataschool 9 months ago

courses.dataschool.io/python-essentials-for-data-scientists

@resap.9128 5 years ago

I really like your teaching style. Very clear!

@dataschool 5 years ago

Thanks very much for your kind words!

@anandrathi871 4 years ago

How do you use get_dummies in a data pipeline, for example when test data and train data are not split from the same source?

@dataschool 4 years ago

For creating dummy variables within a pipeline, I definitely recommend using scikit-learn's OneHotEncoder instead. I have a lesson about that here: kzbin.info/www/bejne/n6OrmXeDl9xmrtE
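A rough sketch of what OneHotEncoder inside a pipeline might look like. The tiny training set, the labels, and the choice of LogisticRegression are assumptions made for illustration; the linked lesson may use a different setup.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data and labels
X_train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Embarked": ["S", "C", "S", "Q", "C", "S"],
})
y_train = [0, 1, 1, 0, 1, 0]

# New data from a separate source, including a category ("X") never seen in training
X_new = pd.DataFrame({"Sex": ["female", "male"], "Embarked": ["Q", "X"]})

# The encoder learns its columns from the training data only;
# handle_unknown="ignore" encodes unseen categories as all zeros
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"])
)
pipe = make_pipeline(ct, LogisticRegression())

pipe.fit(X_train, y_train)
pred = pipe.predict(X_new)
print(pred)
```

Calling `get_dummies` separately on train and test data can produce different columns when the categories differ, whereas the fitted pipeline always produces the column layout learned during `fit`.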

@mkosinski 8 years ago

Let's say you did conversion to dummy variable and now want to train multinomial classification algorithm (say logistic regression) on Embarked column. You would need the original Embarked column, wouldn't you?

@dataschool 8 years ago

You would not need the original Embarked column. The dummy variables encode the same information as the Embarked column, but in a numerical way that can be used by a machine learning model - that's the primary reason you create the dummy variables. Hope that helps!

@mkosinski 8 years ago

Thanks!