How do I encode categorical features using scikit-learn?

139,881 views

Data School

Comments: 455
@dataschool 5 years ago
*Are you new to Machine Learning?* Watch my video series, "Introduction to Machine Learning in Python with scikit-learn": kzbin.info/aero/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
@arunjohn492 4 years ago
Sir, what about the dummy variable trap when we use ColumnTransformer?
@dataschool 3 years ago
Great question! See this video: kzbin.info/www/bejne/hIrXqKysrtt3e80
@GoredGored 2 years ago
For beginners: When I tried to complete an ML project, say a simple model based on logistic or linear regression, it used to take me about a month. As I was a beginner in Python, pandas, SQL and the rest of it, I thought it would take me a long time to master, and that maybe I was a latecomer to this. But a year on, thanks to Data School, Sentdex, Krish Naik, StatQuest, Thinkful webinars and more, I am surprised that all I need is a day or less to complete these projects. When I need a deeper understanding, the meticulous analysis on Data School is where my GPS leads me. Thank you, Data School.
@dataschool 2 years ago
You are so very welcome!
@terryhenyo9216 5 years ago
The Legendary Data Science guy is back!
@dataschool 5 years ago
Thank you for the warm welcome! 😄
@altunbikubra 4 years ago
Your guide doesn't just involve basic code; it also covers very practical and useful functions. I want to sincerely thank you for your effort!
@dataschool 4 years ago
Thanks very much for your kind words!
@liquid_absabs1334 4 years ago
There is something about your explanations that makes me just get it instantly. You deserve an award.
@dataschool 4 years ago
You are too kind, thank you!
@dataschool 3 years ago
Yes, that is the role of the OneHotEncoder.
@fet1612 4 years ago
00:58 1) It allows you to properly cross-validate a process rather than just a model. In other words, when you are doing cross-validation like cross_val_score, normally you just pass a model to it. Well, there are cases when that is not going to give you accurate results because you're doing the preprocessing outside of the cross-validation. So a pipeline, generally speaking, is useful because you can cross-validate a process that includes (a) *preprocessing* as well as (b) *model building*.
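The point quoted above, cross-validating a whole process rather than only a model, can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the toy DataFrame and its Titanic-style column names (`Sex`, `Embarked`, `Pclass`, `Survived`) are assumptions made up for the example.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with Titanic-style columns (not the real dataset)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female",
            "male", "female", "male", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S", "S", "C"],
    "Pclass": [3, 1, 3, 2, 1, 3, 2, 3, 1, 2],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
})
X = df[["Sex", "Embarked", "Pclass"]]
y = df["Survived"]

# Preprocessing step: one-hot encode the categorical columns, pass Pclass through
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    remainder="passthrough",
)

# Chain preprocessing and model building into a single estimator
pipe = make_pipeline(column_trans, LogisticRegression())

# The encoder is re-fit on the training part of every fold,
# so the preprocessing happens inside the cross-validation
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Because the pipeline is passed to `cross_val_score` as one object, the encoder never sees the held-out rows of a fold while it is being fit.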
@hieungotrung5411 5 years ago
OMG!!! I've just started ML on Kaggle in the past few weeks. There's a lot of information to absorb, but you teach us in the most understandable way and still cover the up-to-date question of why we should use scikit-learn instead of dummies. This video is extremely helpful and informative. Thank you a lot!!! Guess I'm gonna spend the rest of the day watching all of your videos.
@dataschool 4 years ago
Awesome! Glad to hear this was helpful to you 👍
@420nyk 3 years ago
Thanks, this helps a lot. I was scratching my head over Pipeline and ColumnTransformer before this video. Also, you've got a very soothing voice, and it helps me relax and really enjoy the learning.
@dataschool 2 years ago
Great to hear!
@harshitarawat8941 3 years ago
Man I love you. I just love you. I love your videos. I love the way you explain things. I love the pace of your videos. I love everything. Thank you.
@dataschool 3 years ago
Thank you so much, Harshita! 🙏
@rommeltito123 4 years ago
Dayyyyuuummmm... why did I not stumble upon your videos earlier????!!!!!!
@dataschool 3 years ago
😄
@harshalkulkarni511 5 years ago
Preprocessing with a pipeline was a complex topic for me to understand before watching this video. Thanks a lot for the video.
@dataschool 4 years ago
You're very welcome! Glad it helped 👍
@quocanhhbui8271 2 years ago
My god I love your detailed solution. Even my 5yo sibling can understand it. Wonderful. Definitely worth a subscribe.
@dataschool 2 years ago
Awesome! 🙌
@Rationalist-Forever 2 years ago
I was looking for a clear explanation of Pipeline for a long time. You nailed it. Crystal clear explanation, and I understood it after watching just once. Thank you.
@dataschool 2 years ago
You're so very welcome! 🙏
@chr1112 3 years ago
You are the best tutor I have ever met, keep up the good work. Thank you.
@dataschool 3 years ago
Wow, thanks!
@christianiheanacho4976 5 years ago
You are a high quality TEACHER, thank you very much.
@dataschool 5 years ago
You are very welcome! 😄
@Tothefutureand 2 years ago
Thx Kevin, one of the best & simplest explanations of Pipeline
@dataschool 2 years ago
Glad it was helpful!
@sandeep1026 4 years ago
I feel fortunate that I stumbled across this video. Very well articulated. You slow down the pace so that folks can hear, understand and digest. Most videos I come across seem to rush through the content before one can digest it. Thanks for taking the time and sharing your knowledge.
@dataschool 3 years ago
Thanks very much for your kind words! 🙏
@Putinka1000 5 years ago
Thank you for speaking slowly. It’s nice to listen to a non-English speaking person
@dataschool 4 years ago
You're very welcome! :)
@Steven-se5jd 4 years ago
Just want to say thank you. I am a beginner and you teach much better than my professor.
@dataschool 4 years ago
Glad to hear I have been helpful! 🙏
@tald747 4 years ago
This is an excellent and simple explanation of this topic. I must say that you are very talented in the way you teach! You choose your words in a way that emphasizes only the important and relevant stuff. Thanks!!!
@dataschool 3 years ago
Wow, thank you!
@amitsharma8337 4 years ago
THANK YOU for this tutorial! I was wandering around the web trying to solve unexpected errors that came from following, apparently, outdated tutorials. If I had landed on this tutorial first, it would have saved me around 4 hours of useless surfing. Thanks again.
@dataschool 4 years ago
That's awesome to hear... glad I could be of help! By the way, I'll be launching a full course covering these topics (and more)... sign up here to get notified when it launches: scikit-learn.tips
@georgeognyanov 3 years ago
God damn this video is good. I was struggling with column_transformer and pipelines till late last night. The options you suggest here are so much better and easier to understand for me. I am totally going through your "Introduction to Machine Learning in Python with scikit-learn" playlist soon. Thanks for putting this out!
@dataschool 3 years ago
You're very welcome! If you want to go deeper into this topic, you may want to check out my course: courses.dataschool.io/building-an-effective-machine-learning-workflow-with-scikit-learn
@PaulBillingtonFW 1 year ago
Thanks for this clear and well-paced tutorial.
@dataschool 1 year ago
Glad it was helpful!
@horoshuhin 3 years ago
Thank you Kevin, very thorough explanation. I'm glad I found your channel. I like the way you teach.
@dataschool 3 years ago
Thank you so much! 🙏 That's great to hear!
@krishkonnect814 4 years ago
I just found the solution to my problem after watching your video. Thanks a lot.
@dataschool 3 years ago
You're welcome!
@fahadkhankhattak8339 3 years ago
Thank you so much!!!!! It was very helpful. Yours is the only channel I come running to for help whenever I'm stuck somewhere. Rich content!! Keep sharing these wonderful things.
@dataschool 2 years ago
Thank you so much!
@adarshr30 4 years ago
After searching a lot, I found this channel and I feel it's the best for me :)
@dataschool 3 years ago
Happy to hear that!
@salonisamant5410 3 years ago
Thank you for explaining the pipeline approach so well!
@dataschool 3 years ago
You're very welcome!
@fet1612 4 years ago
00:26 "What is the point of the pipeline? The point of the pipeline is to chain steps together sequentially. Normally, you put preprocessing steps and model building steps in a pipeline. Now, why should you build a pipeline? There are two main reasons."
@luisguaniloquinones8936 4 years ago
Thanks!
@frankgiardina205 4 years ago
Excellent! I was using pandas get_dummies, and your explanation of why Pipeline and OneHotEncoder are a better solution solves all my problems. Thanks again!
@dataschool 4 years ago
Glad it helped!
@JainmiahSk 5 years ago
Sir, just 5 minutes ago I visited your channel to ask you this same question, because I was finding it difficult to encode multiple variables in Kaggle's house price prediction (advanced regression) dataset. Fortunately and surprisingly, you posted exactly that. Thank you so much.
@dataschool 5 years ago
That's amazing! 🙌 I hope this video is helpful to you, and let me know if you have any questions!
@JainmiahSk 5 years ago
@dataschool I have a problem with functions: I can't write custom functions in Python, which is very important. What should I do, sir?
@dataschool 5 years ago
@JainmiahSk You can definitely write custom functions in Python!
@aaqibsoomro5776 5 years ago
You are a great teacher. Please make the tutorials or series for Data Visualization, In-Depth Data Analysis, and Cleaning, and Project Deployment, etc. Since after Learning Python and its libraries and ML, these are the next steps.
@dataschool 5 years ago
I have many more tutorials! Many of them are listed here: www.dataschool.io/launch-your-data-science-career-with-python/
@nishantchaudhary7528 2 years ago
That was really amazingly explained. I was looking to understand all these topics, and I got it in one go. Thanks a ton.
@dataschool 2 years ago
You're very welcome!
@jkore2554 4 years ago
Thank you for this tutorial. I was working with logistic regression this week and was trying to figure out how to one hot encode for a categorical variable with hundreds of categories. I was getting 100% accuracy and precision so something wasn’t right. I’m going to try the steps that you outlined in this tutorial. Thanks.
@dataschool 4 years ago
Good luck!
@dhananjaykansal8097 5 years ago
Nice to have you back, sir. This session was so fruitful. Thanks a ton. Keep it up!
@dataschool 5 years ago
That's awesome to hear!
@TheAstralftw 4 years ago
Finally someone properly explained to me what ColumnTransformer is and why we use Pipeline. I would like you to put your course on Udemy; then I'll buy it 100%. Maybe on average you will sell each course for a lower price, but trust me, you explain this so well that you could sell tens of thousands of courses in a few months. Or, in case you already have this on Udemy, please provide me with the link!
@dataschool 3 years ago
Thanks for your kind words and your suggestion! I know that many students like Udemy courses, but my values as a course creator don't align with their business model, and so I'm not currently interested in publishing a course there. I prefer to offer courses directly to interested students. Thanks for understanding!
@jobihara 2 years ago
Thank you, Data School. It was not only helpful, it was great, enlightening and awesome.
@dataschool 2 years ago
What a nice thing to say, thank you so much! 🙏
@aimenbaig6201 3 years ago
I just discovered your channel and I gotta tell you, you've got a permanent subscriber here!!! LOVE YOUR TEACHING STYLE!!!
@dataschool 3 years ago
Thank you! 🙏
@sandeeppreetam 4 years ago
Thank you good sir, this tutorial was better than many paid tutorials on Udemy. Blessed!
@dataschool 3 years ago
Glad it was helpful! 🙌
@Takk6 4 years ago
You are by far the best data science teacher on YouTube. Can you make a video on creating your own custom transformer, using it to modify your data, and then using that custom transformer in a ColumnTransformer and a Pipeline?
@dataschool 4 years ago
Thanks for your suggestion! I'm working on a course that will likely cover that topic. Sign up here to get notified when it launches: scikit-learn.tips
@lovejazzbass 4 years ago
Kevin, it's 5:20am Winston-Salem time and I am digging this. I was very confused. Thank you so much.
@dataschool 4 years ago
Excellent!
@David-fr7ee 4 years ago
Great content, I am learning this in my college data science class. You did better than my professor!
@CE-vd2px 3 years ago
Are you undergrad or grad?
@dataschool 3 years ago
Thank you! 🙏
@jatinshetty 4 years ago
Yo! Mind blown by the amount of things I learnt from this. Please keep at it!
@dataschool 4 years ago
Thank you! You might like my scikit-learn tips: github.com/justmarkham/scikit-learn-tips
@Anarchy977 4 years ago
Fantastic tutorial! Great teacher, best Machine Learning teacher on YouTube! Thank you!
@dataschool 4 years ago
Thanks so much!
@amitblizer4567 1 year ago
Very clearly explained and helpful video - Thank you!
@dataschool 1 year ago
Glad it was helpful!
@asimssheikh 3 years ago
Impressive explanation, and logical approach to material presentation. You just got a new sub.
@dataschool 3 years ago
Welcome aboard!
@abdelkaderkaouane1944 1 year ago
Your explanation is very clear, thank you very much.
@dataschool 1 year ago
You're welcome!
@artyb3115 4 years ago
Absolutely perfect and useful lessons! Thinking of becoming a patron member as I get a little more confident with ML.
@dataschool 4 years ago
That would be awesome, thank you so much! You can join here: www.patreon.com/dataschool
@brandonbermudez9047 1 year ago
Absolute goat bruh, really thankful for your content
@dataschool 1 year ago
Thank you!
@sanaullahkhanhassanzai8432 5 years ago
Thank you very much, and welcome back after a long time. You are as good as it gets when it comes to Machine Learning. You have taught me a lot. I can't wait for videos on deep learning. I hope you'll come up with deep learning soon. Thanks again.
@dataschool 5 years ago
Thanks very much for your kind words, and for your suggestion as well!
@NoWhiteGullibility 5 years ago
Perfect timing, I was just searching on pipelines the other day. It would be great to follow up by tacking on grid search in this context.
@dataschool 5 years ago
That's awesome to hear! I will definitely cover grid search of a pipeline at some point - thanks for the suggestion!
@sowash2020 1 year ago
You just gained another subscriber... this was super useful
@dataschool 1 year ago
Great to hear!
@xinchenzou4558 2 years ago
Thank you sir! You've really saved my life...
@dataschool 2 years ago
🙌
@12345shipreck 4 years ago
You are 100x better than my ML course teacher at uni. GG bro.
@dataschool 4 years ago
Thank you! 😄
@sophiar5280 4 years ago
Always love your step by step, clear lessons. Keep it coming.
@dataschool 4 years ago
Thank you!
@gardnmi 5 years ago
Since pandas get_dummies ignores non-categorical values, I've always done the below, but I might have to start using pipelines. Great video!
train = pd.get_dummies(train)
test = pd.get_dummies(test)
test = test.reindex(columns=train.columns, fill_value=0)
@dataschool 5 years ago
Thanks for sharing! It's still okay to use get_dummies, but you may end up with a gigantic DataFrame that includes columns you're not interested in. Plus, you will definitely have problems if any of the categorical features in your test data include different values than your training data. Anyway, glad you liked the video and I hope to bring you over to Pipeline! 😉
@gardnmi 5 years ago
@dataschool I ran into the misaligned shapes issue a lot. That's what test.reindex(columns=train.columns, fill_value=0) solved for me, but it seems Pipeline is a bit more elegant.
@dataschool 5 years ago
@gardnmi Even though reindexing *appears* to fix the problem with misaligned shapes, there's a high likelihood that the columns of your test DataFrame no longer match the column ordering of your train DataFrame. That's a significant problem because it means that your features are in the wrong order in test, and thus your model will make incorrect predictions. Pipeline thankfully solves that problem!
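For readers following this exchange, here is a small sketch of how OneHotEncoder deals with a category that appears in the test data but not in the training data. It is an illustration under assumptions: the `Embarked` column and the unseen value `"X"` are invented for the example, and `handle_unknown="ignore"` is one of the documented options rather than the only way to handle the situation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test data; "X" never appears in train
train = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})
test = pd.DataFrame({"Embarked": ["C", "X"]})

# The encoder learns its categories (and their column order) from train only
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Test rows are encoded with the columns learned from train;
# the unseen category "X" becomes a row of all zeros
encoded_test = ohe.transform(test).toarray()
print(ohe.categories_[0])
print(encoded_test)
```

Since the fitted encoder is what defines the output columns, train and test always end up with the same number of columns in the same order.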
@gyanendergandhar 2 years ago
Thanks a lot for this tutorial, Kevin. It really saved me 😅
@dataschool 2 years ago
Glad to hear that!
@eugenechew1476 4 years ago
Why pay $900 at Uni when you can watch this amazing tutorial for free, and it's wayyyy better!
@dataschool 4 years ago
Thanks! Stay tuned for a course that explores these topics in much more detail...
@83vbond 3 years ago
I paid $6000 :((
@barulli87 4 years ago
MIND BLOWN!!!! CV FOR A PROCESS!!! NOICE ONE!!
@dataschool 3 years ago
🤯
@abdoulayebalde2139 4 years ago
A very nice video that saved my life. I can see it is well explained. Keep uploading!
@dataschool 3 years ago
Thanks!
@joxa6119 2 years ago
God, this video answered a question I had been unable to solve for a month. God bless you.
@dataschool 2 years ago
Great to hear!
@Universe4mi 3 months ago
Thanks, very clear and insightful!!
@dataschool 3 months ago
You're welcome!
@ayyappahemanth7134 5 years ago
Oh my god! after so much of exhaustive waiting another video came, which is far more useful than others for me! I just love your videos, the content was really useful in my real life, most of the youtube channels they just take the ideal ones which I might not encounter in my whole life! please do these videos regularly!
@dataschool 5 years ago
That is awesome to hear, thanks so much for your kind words! 🙏 Actually, I publish a new Q&A video every month for Data School Insiders at the $5 level: www.patreon.com/dataschool
@salakkal 4 years ago
Really great that you did a video like this. It just helped me a lot and I am really thankful for it, brother. Keep going.
@dataschool 3 years ago
Thanks!
@trentjones6468 4 years ago
Amazing video. You are an excellent instructor. Got yourself a new subscriber :)
@dataschool 4 years ago
Thank you so much!
@Susuwho 4 years ago
This is so helpful that I have to comment. Great job. Thanks a lot.
@dataschool 4 years ago
Glad it was helpful!
@christianiheanacho4976 5 years ago
I am enriched by this teaching.
@dataschool 5 years ago
Great to hear!
@SaunakDey 3 years ago
Awesome explanation!! Thanks a lot
@dataschool 3 years ago
You're very welcome!
@honprarules 4 years ago
Amazing explanation, as always!
@dataschool 3 years ago
Thank you!
@AjayVerma-xi2us 5 years ago
Very good, it cleared up many of my doubts.
@dataschool 5 years ago
Great to hear!
@Pqj613 2 years ago
It's a good tutorial for some reasons that you will explain later.:D
@kishanlal676 5 years ago
Thank you for this amazing video. Please do some videos on feature selection and scaling techniques in Python!
@dataschool 5 years ago
I'm hoping to cover feature scaling in a future video, but I do have a video about feature selection: kzbin.info/www/bejne/j5Kufph3oa2ap7M Hope that helps!
@eatbreathedatascience9593 3 years ago
This video is excellent.
@dataschool 3 years ago
Thank you!
@1stophchr 4 years ago
Thank you very much, very clear video.
@dataschool 4 years ago
You're very welcome! 😄
@hichamamchtkou7343 5 years ago
Thank you very much, it's very interesting and, by the way, it is exactly what I need in my current ML project.
@dataschool 5 years ago
That's great to hear! Good luck with your project 🙌
@hichamamchtkou7343 5 years ago
@dataschool thanks 👍
@TheAdrianPardo 5 years ago
Thank you so much! You're the best! Please go over scaling when you have a chance :) Question: Is it OK to leave in all of the one-hot encoded columns with this pipe approach? I believe you previously mentioned that it's best to drop one of the columns to prevent multicollinearity. Is there any way to do this within the pipe?
@dataschool 5 years ago
You are so kind, thank you! 😊 Yes, I plan to cover StandardScaler at some point. Yes, it is okay to leave in all of the one-hot encoded columns. However, the "drop" parameter for OneHotEncoder (new in scikit-learn 0.21) does allow you to drop one feature per category. Hope that helps!
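The `drop` parameter mentioned in the reply above can be sketched as follows. This is a minimal illustration with an invented toy DataFrame; `drop="first"` is one of the values the parameter accepts, chosen here only for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: Sex has 2 categories, Embarked has 3
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Default behavior: one output column per category (2 + 3 columns)
full = OneHotEncoder().fit_transform(df).toarray()

# drop="first": the first category of each feature is dropped (1 + 2 columns)
reduced = OneHotEncoder(drop="first").fit_transform(df).toarray()

print(full.shape, reduced.shape)
```

The same `OneHotEncoder(drop="first")` object can be placed inside a ColumnTransformer and a Pipeline exactly like the default encoder.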
@ramleo1461 5 years ago
Even I had the same doubt... Thank you for clarifying 😊
@absar66 5 years ago
Great! Great! Great tutorial! Many thanks, Kevin
@dataschool 5 years ago
You're very welcome!
@surfzion 4 years ago
Extremely helpful, thank you so much!!!
@dataschool 3 years ago
Glad it helped!
@gisleberge4363 2 years ago
Great example, educational.
@dataschool 2 years ago
Thank you!
@nowhere5111 4 years ago
This video helps a lot👍👍👍
@dataschool 4 years ago
Great!
@victor-os9wq 2 years ago
Thanks for such a detailed tutorial. I am working on a similar problem where I have multiple categorical features. In my dataset, a categorical variable has more than 90 possible values, and as a result I get an additional 121 columns when I use get_dummies, but I actually want just four levels. Please kindly advise me.
@patrickmullan8356 5 years ago
When applying make_column_transformer() at 17:45, it returns the results (e.g., columns) in a different order than the input data. Is there a way of making it return the columns in the same order? Or at least of knowing which new columns belong to which original category, without having to do the math oneself? Especially if one is not using the introduced pipeline functionality but relies on this transformation tool anyway, for other work for example, it seems a bit difficult to handle, or at least to inspect. Great introduction to the modules, anyway ;)
@dataschool 5 years ago
Great question! The ordering is actually predictable: it's the ordering of the columns that I specified to the ColumnTransformer (2 columns for Sex and 3 columns for Embarked), followed by the columns that I passed through (1 column for Pclass). Does that make sense?
@patrickmullan8356 5 years ago
@dataschool Yes, makes sense. That's what I meant by "having to do the math" ... ;)
@ramleo1461 5 years ago
Hi, this will be very helpful. Thank you for making this video!!
@dataschool 5 years ago
You are very welcome! 🙌
@garychen6367 4 years ago
Hi Kevin, thanks for the terrific tutorial. I have two questions about the feature processing, 1) when do we need to standardize or normalize the value features before training? I know that standardizing or normalizing the value features can affect the performance of some ML-algorithm, whether we should do it seems depends on what kind of ML-algorithm we adopted (i.e., it is better to standardizing value features when using ANN, but may not when using DT-related algorithm). 2) if we do need to standardize the value features, should we do it before encoding the categorical features or after? (I used to do a stupid way: first split the value and categorical features, then standardize the former ones and encode the latter ones, then concatenate them, is there a better way to do it?). Again, thank you for this fabulous tutorial.
@siddhantmittal1157 4 years ago
1) We usually standardize our data when there is a huge difference in scale between the columns of our dataset. Consider an example of predicting the salaries of employees in a firm. The attributes can include years of experience and age, with salary as our target variable. The age and years-of-experience columns can have values of 20-60 and 1-15, but salary can have values such as 30000 or 50000. This can affect our model and its error, because when these values are fitted by our algorithm, the salary column will carry more weight (if not standardized), so we need to convert the data. 2) We do that after converting the categorical data into numbers. Thank you. PS: Correct me if I am wrong.
@garychen6367 4 years ago
@siddhantmittal1157 Hi Siddhant, thank you very much for answering, it really helps. For the second answer, does that mean we need to first encode the categorical features (e.g., after encoding, the categorical part would be binary numbers like [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]) and then scale them? After scaling, the categorical features above would change from the binary structure into different numbers, e.g., [[1.414, -0.707, -0.707], [-0.707, 1.414, -0.707], [-0.707, -0.707, 1.414]]. Does that matter? Thank you again for your reply! Update: I think I found the answer to my question 2). There is an example at scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_23_0.html#sphx-glr-auto-examples-release-highlights-plot-release-highlights-0-23-0-py which indicates that we should encode the categorical features and scale the numeric features separately, then concatenate/combine them for training!
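The conclusion reached in this thread (scale the numeric features and encode the categorical features separately, then combine them) can be sketched with a ColumnTransformer. The column names and values below are hypothetical, and StandardScaler is just one possible choice of scaler.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical employee data with two numeric and one categorical feature
X = pd.DataFrame({
    "Age": [25, 40, 33, 58],
    "YearsExperience": [2, 15, 8, 12],
    "Department": ["sales", "tech", "tech", "hr"],
})

ct = make_column_transformer(
    (StandardScaler(), ["Age", "YearsExperience"]),           # scale numeric columns only
    (OneHotEncoder(handle_unknown="ignore"), ["Department"]),  # encode categorical column only
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(X)

# 2 scaled columns followed by 3 one-hot columns; the one-hot
# columns stay as 0/1 because the scaler never touches them
print(out.shape)
print(out)
```

This removes the need to split, transform, and concatenate the columns by hand, and the whole transformer can be used as the first step of a Pipeline.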
@MohammadrezaMokhtari-qh2yg 6 months ago
Amazing information. Wow! Thank you so much, man.
@dataschool 6 months ago
You're very welcome!
@pivotai525 2 years ago
Simply the best!!
@dataschool 2 years ago
Thank you!
@WafazAli-b4u 11 months ago
Very well explained.
@dataschool 11 months ago
Thank you!
@schuylerblasy2192 4 years ago
This is a really interesting video. ColumnTransformer is sort of like a pipeline in itself. Kind of reminds me of VectorAssembler in Spark/PySpark.
@dataschool 4 years ago
Thanks Sky! One important difference is that ColumnTransformer stacks results side-by-side, whereas Pipeline feeds the output of one step to the input of the next step.
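The difference described in the reply above can be seen in a small sketch. Everything in it is illustrative: the two-column toy data is invented, and the choice of SimpleImputer followed by StandardScaler is only an example of a sequential pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric column with a missing value, one categorical
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "x"]})

# Pipeline: the imputer's output is fed into the scaler (sequential)
num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# ColumnTransformer: each transformer gets its own columns,
# and the results are stacked side by side
ct = make_column_transformer(
    (num_pipe, ["a"]),
    (OneHotEncoder(), ["b"]),
    sparse_threshold=0,  # always return a dense array
)
out = ct.fit_transform(X)

# 1 column from the numeric pipeline + 2 one-hot columns
print(out.shape)
```

The missing value in `a` is first replaced by the column mean and only then scaled, which is why the imputer and scaler belong in a Pipeline rather than side by side.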
@brendensong8000 4 years ago
I love it! Amazing tips!
@dataschool 4 years ago
Thank you!
@cogcog312 5 years ago
Just excellent. Thanks! I am very new to data science, so please bear with me. Question: for a dataset that has several categorical features, each with a lot of different values (say each categorical column has 100 different values, as opposed to just 2 for Gender, male or female), using OneHotEncoder to convert them to unordered numerical values makes the number of columns increase astronomically. If you then run the model and one or more of the categorical features are among the most useful, how do you reverse or convert back these encoded features to know which categorical feature each one represents?
@dataschool 4 years ago
I'm not sure off-hand, sorry!
@sihle_za 4 years ago
Simply the best.
@dataschool 4 years ago
Thank you!
@sonalisingh2136 5 years ago
Just AweSomE
@dataschool 4 years ago
Thank you!
@vincecarter7500 4 years ago
Thanks a lot for helping everyone out. I was just wondering if you will be uploading more videos in the future.
@dataschool 3 years ago
Yes! I just started posting again last week. Thanks for watching!
@salseid1033 5 years ago
Your tutorial is informative as always. Could you prepare a tutorial on how to interpret a model, like "black box" interpretation in random forests? Thank you.
@dataschool 5 years ago
Thanks for your suggestion! I'll consider it for the future!
@oeb5542 5 years ago
Just another amazing video. 😄
@dataschool 5 years ago
Thank you so much for your kind words! 😊
@Narriz 1 year ago
This is amazing.
@dataschool 1 year ago
Thank you! You might be interested in this course: courses.dataschool.io/building-an-effective-machine-learning-workflow-with-scikit-learn
@KVishya 5 years ago
Hi Kevin, thank you so much for the wonderful explanation. Could you also explain how to use GridSearchCV or RandomizedSearchCV along with Pipelines?
@dataschool 4 years ago
Great suggestion! I'm working on a tutorial that will be published on KZbin in late April. It will include that topic. Stay tuned!
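Until that tutorial, here is a rough sketch of what a grid search over a pipeline can look like. It is not taken from the video: the toy data is invented, and the choice of LogisticRegression and its `C` values is arbitrary. The step name `logisticregression` comes from `make_pipeline`, which names each step after its lowercased class name.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with Titanic-style columns
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female",
            "male", "female", "male", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", "S", "C", "Q", "S", "S", "C"],
    "Pclass": [3, 1, 3, 2, 1, 3, 2, 3, 1, 2],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
})
X = df[["Sex", "Embarked", "Pclass"]]
y = df["Survived"]

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    remainder="passthrough",
)
pipe = make_pipeline(column_trans, LogisticRegression())

# Parameters of a step are addressed as <step name>__<parameter name>
params = {"logisticregression__C": [0.1, 1, 10]}

grid = GridSearchCV(pipe, params, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_)
```

Because the whole pipeline is the estimator being searched, the preprocessing is re-fit inside every cross-validation fold for every parameter combination.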
@adityakharwade9501 4 years ago
Awesome video and thank you for this explanation!!! I have one request: could you please make a video on PCA?
@dataschool 4 years ago
Thanks for your suggestion!
@zohrehvahdati787 5 years ago
Thank you so much.😍😍🙏🙏👍👍 It helped me a lot.
@dataschool 4 years ago
Great to hear!
@amitkumards5609 4 years ago
No doubt the video is great, but one question: if I use a random forest and want to know the feature importances with feature names, how can we do it with this setup? (With ColumnTransformer we end up with an array without any column names; e.g., after one-hot encoding, the category name should be the column name, but that does not happen with this setup.)
@dataschool 3 years ago
Great question! Under certain conditions, you can use the ColumnTransformer's get_feature_names method to extract the feature names.