*Are you new to Machine Learning?* Watch my video series, "Introduction to Machine Learning in Python with scikit-learn": kzbin.info/aero/PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A
@arunjohn492 4 years ago
Sir, what about the dummy variable trap when we use ColumnTransformer?
@dataschool 3 years ago
Great question! See this video: kzbin.info/www/bejne/hIrXqKysrtt3e80
@GoredGored 2 years ago
For beginners: When I tried to complete an ML project, say a simple model based on logistic or linear regression, it used to take me about a month. As I was a beginner in Python, pandas, SQL and the rest of it, I thought it would take me a long time to master, and that maybe I was a latecomer to this. But a year on, thanks to Data School, Sentdex, Krish Naik, StatQuest, Thinkful Webinar and more, I am surprised that all I need is a day or less to complete these projects. When I need a deeper understanding, the meticulous analysis on Data School is where my GPS leads me. Thank you, Data School.
@dataschool 2 years ago
You are so very welcome!
@terryhenyo9216 5 years ago
The Legendary Data Science guy is back!
@dataschool 5 years ago
Thank you for the warm welcome! 😄
@liquid_absabs1334 4 years ago
There is something about your explanations that I just get instantly. You deserve an award.
@dataschool 4 years ago
You are too kind, thank you!
@dataschool 3 years ago
Yes, that is the role of the OneHotEncoder.
@altunbikubra 4 years ago
Your guideline does not only involve basic code, but it actually involves very practical and useful functions. I want to sincerely thank you for your effort!
@dataschool 4 years ago
Thanks very much for your kind words!
@hieungotrung5411 5 years ago
OMG!!! I've just started ML on Kaggle in the past few weeks. There's a lot of information to absorb, but you teach us in the most understandable way and also address the up-to-date question of why we should use scikit-learn instead of using dummies. This video is extremely helpful and informative. Thank you a lot!!! Guess I'm going to spend the rest of the day watching all of your videos.
@dataschool 4 years ago
Awesome! Glad to hear this was helpful to you 👍
@420nyk 3 years ago
Thanks, this helps a lot. Was scratching my head on pipeline and column transformer before this video. Also you got a very soothing voice and it helps to relax and really enjoy the learning.
@dataschool 2 years ago
Great to hear!
@Rationalist-Forever 2 years ago
I was looking for clear explanation of Pipeline for a long time. You nailed it. Crystal clear explanation and understood by watching one time. Thank you.
@dataschool 2 years ago
You're so very welcome! 🙏
@chr1112 4 years ago
you are the best tutor i have ever met , keep up the good work. Thank you
@dataschool 4 years ago
Wow, thanks!
@harshalkulkarni511 5 years ago
Preprocessing with a pipeline was a complex topic for me to understand before watching this video. Thanks a lot for the video.
@dataschool 4 years ago
You're very welcome! Glad it helped 👍
@harshitarawat8941 3 years ago
Man I love you. I just love you. I love your videos. I love the way you explain things. I love the pace of your videos. I love everything. Thank you.
@dataschool 3 years ago
Thank you so much, Harshita! 🙏
@rommeltito123 4 years ago
Dayyyyuuummmm.......why did I not stumble upon ur videos earlier ????!!!!!!
@dataschool 3 years ago
😄
@sandeep1026 4 years ago
I feel fortunate that I stumbled across this video. Very well articulated. You slow down the pace so that folks can hear, understand and digest. Most videos I come across seem to rush through the content before one can digest it. Thanks for taking the time and sharing your knowledge.
@dataschool 3 years ago
Thanks very much for your kind words! 🙏
@tald747 4 years ago
This is an excellent and simple explanation of this topic. I must say that you are very talented in the way you teach! You choose your words in a way that emphasizes only the important and relevant stuff. Thanks!!!
@dataschool 4 years ago
Wow, thank you!
@Steven-se5jd 5 years ago
just want to say thank you. I am a beginner and you teach much better than my professor.
@dataschool 5 years ago
Glad to hear I have been helpful! 🙏
@quocanhhbui8271 2 years ago
My god I love your detailed solution. Even my 5yo sibling can understand it. Wonderful. Definitely worth a subscribe.
@dataschool 2 years ago
Awesome! 🙌
@Tothefutureand 2 years ago
Thx kevin, one of best & simplest explanations of pipeline
@dataschool 2 years ago
Glad it was helpful!
@fahadkhankhattak8339 3 years ago
Thank you so much!!!!! It was very helpful. Yours is the only channel I come running to for help whenever I'm stuck somewhere. Rich content!! Keep sharing these wonderful things.
@dataschool 2 years ago
Thank you so much!
@amitsharma8337 4 years ago
THANK YOU for this tutorial! Was wandering around the web to solve unexpected errors that came by following, apparently, outdated tutorials. If I have landed up on this tutorial the very first time, it would have saved me around 4 hours of useless surfing. Thanks again
@dataschool 4 years ago
That's awesome to hear... glad I could be of help! By the way, I'll be launching a full course covering these topics (and more)... sign up here to get notified when it launches: scikit-learn.tips
@christianiheanacho4976 5 years ago
You are a high quality TEACHER , thank you very much.
@dataschool 5 years ago
You are very welcome! 😄
@horoshuhin 3 years ago
thank you Kevin, very thorough explanation. I'm glad I found your channel. I like the way you teach.
@dataschool 3 years ago
Thank you so much! 🙏 That's great to hear!
@fet1612 4 years ago
00:58 1) It allows you to properly cross-validate a process rather than just a model. In other words, when you are doing cross-validation like cross_val_score, normally you just pass a model to it. Well, there are cases when that is not going to give you accurate results because you're doing the preprocessing outside of the cross-validation. So a pipeline, generally speaking, is useful because you can cross-validate a process that includes (a) *preprocessing* as well as (b) *model building*.
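As a rough illustration of the point above, here is a minimal sketch of cross-validating a whole pipeline rather than a bare model. The tiny DataFrame is invented for this example (it only borrows Titanic-style column names), and the choice of LogisticRegression and cv=3 is an assumption, not necessarily what the video uses.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration (Titanic-style column names).
df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    "Sex":      ["female", "male", "male", "female", "male", "male",
                 "female", "female", "female", "male", "male", "female"],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})
X = df[["Pclass", "Sex"]]
y = df["Survived"]

# Preprocessing step: one-hot encode Sex, pass Pclass through unchanged.
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    remainder="passthrough",
)

# The pipeline chains preprocessing and model building into one estimator.
pipe = make_pipeline(column_trans, LogisticRegression())

# Within each fold, the encoder and the model are fit on the training part only.
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean())
```

Because the encoder lives inside the pipeline, nothing about the held-out fold is seen during preprocessing.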
@PaulBillingtonFW 1 year ago
Thanks, for this clear and well paced tutorial.
@dataschool 1 year ago
Glad it was helpful!
@georgeognyanov 3 years ago
God damn this video is good. I was struggling with column_transformer and pipelines till late last night. The options you suggest here are so much better and easier to understand for me. I am totally going through your "Introduction to Machine Learning in Python with scikit-learn" playlist soon. Thanks for putting this out!
@dataschool 3 years ago
You're very welcome! If you want to go deeper into this topic, you may want to check out my course: courses.dataschool.io/building-an-effective-machine-learning-workflow-with-scikit-learn
@Putinka1000 5 years ago
Thank you for speaking slowly. It’s nice to listen to a non-English speaking person
@dataschool 4 years ago
You're very welcome! :)
@salonisamant5410 3 years ago
Thank you for explaining the pipeline approach so well!
@dataschool 3 years ago
You're very welcome!
@adarshr30 4 years ago
After searching a lot, I found this channel, and I feel it's the best for me :)
@dataschool 3 years ago
Happy to hear that!
@krishkonnect814 4 years ago
I just found solution to my problem after watching your video. Thanks a lot.
@dataschool 3 years ago
You're welcome!
@Takk6 4 years ago
You are by far the best data science teacher on youtube. Can you make a video on creating your own custom transformers using it to modify your data, then using that custom transformer in a ColumnTransformer and a Pipeline?
@dataschool 4 years ago
Thanks for your suggestion! I'm working on a course that will likely cover that topic. Sign up here to get notified when it launches: scikit-learn.tips
@aimenbaig6201 3 years ago
i just discovered your channel and i gotta tell you , you got a permanent subscriber here!!! LOVE YOUR TEACHING STYLE!!!!!!!!!!!!!!!
@dataschool 3 years ago
Thank you! 🙏
@lovejazzbass 4 years ago
Kevin, it's 5:20am Winston-Salem time and I am digging this. I was very confused. Thank you so much.
@dataschool 4 years ago
Excellent!
@nishantchaudhary7528 2 years ago
That was really something amazingly explained, I was looking for all these topics to understand. I got it in one go. Thanks a ton.
@dataschool 2 years ago
You're very welcome!
@jobihara 2 years ago
Thankyou dataschool, it was not only helpful, it was great, enlightening and awesome.
@dataschool 2 years ago
What a nice thing to say, thank you so much! 🙏
@dhananjaykansal8097 5 years ago
Nice to have u back sir. This session was so fruitful. Thanks a ton. Keep it up!
@dataschool 5 years ago
That's awesome to hear!
@aaqibsoomro5776 5 years ago
You are a great teacher. Please make the tutorials or series for Data Visualization, In-Depth Data Analysis, and Cleaning, and Project Deployment, etc. Since after Learning Python and its libraries and ML, these are the next steps.
@dataschool 5 years ago
I have many more tutorials! Many of them are listed here: www.dataschool.io/launch-your-data-science-career-with-python/
@David-fr7ee 4 years ago
Great content, i am learning this in my college data science class. You did better than my professor!
@CE-vd2px 4 years ago
Are you undergrad or grad?
@dataschool 3 years ago
Thank you! 🙏
@TheAstralftw 4 years ago
Finally someone explained me properly what is columns transformer and why we use pipeline. I would like you to put your course to udemy , then i ll buy it 100% .. maybe on average you will sell each course for less price, but trust me, you are explaining this so good, you can sell tens of thousands of courses in few months , ... or in the case you have this on udemy , please provide me with the link!
@dataschool 4 years ago
Thanks for your kind words and your suggestion! I know that many students like Udemy courses, but my values as a course creator don't align with their business model, and so I'm not currently interested in publishing a course there. I prefer to offer courses directly to interested students. Thanks for understanding!
@Anarchy977 5 years ago
Fantastic tutorial! Great teacher, best Machine Learning teacher on youtube! Thank you!
@dataschool 5 years ago
Thanks so much!
@artyb3115 4 years ago
Absolutely perfect and useful lessons! Thinking of becoming a patron member as I get a little more confident with ML
@dataschool 4 years ago
That would be awesome, thank you so much! You can join here: www.patreon.com/dataschool
@sandeeppreetam 4 years ago
Thank you good sir, this tutorial was better than many paid tutorials on Udemy. Blessed!
@dataschool 3 years ago
Glad it was helpful! 🙌
@jatinshetty 4 years ago
yo! Mind blown with the amount of things i learnt from this. Please keep at it!
@dataschool 4 years ago
Thank you! You might like my scikit-learn tips: github.com/justmarkham/scikit-learn-tips
@frankgiardina205 4 years ago
Excellent! I was using the pandas dummies and your explanation of why pipeline and ohe is a better solution solves all the problems. thanks again
@dataschool 4 years ago
Glad it helped!
@JainmiahSk 5 years ago
Sir, just 5 minutes ago I visited your channel to ask you this same question, because it was difficult for me to encode multiple variables in Kaggle's house price prediction (advanced regression) dataset. Fortunately and surprisingly, you posted about the same thing. Thank you so much.
@dataschool 5 years ago
That's amazing! 🙌 I hope this video is helpful to you, and let me know if you have any questions!
@JainmiahSk 5 years ago
@@dataschool I have a problem with functions, I can't write custom functions in Python which is very important what to do sir?
@dataschool 5 years ago
@@JainmiahSk You can definitely write custom functions in Python!
@amitblizer4567 2 years ago
Very clearly explained and helpful video - Thank you!
@dataschool 2 years ago
Glad it was helpful!
@brandonbermudez9047 2 years ago
Absolute goat bruh, really thankful for your content
@dataschool 2 years ago
Thank you!
@sowash2020 2 years ago
You just gained another subscriber...this was super useful
@dataschool 2 years ago
Great to hear!
@jkore2554 4 years ago
Thank you for this tutorial. I was working with logistic regression this week and was trying to figure out how to one hot encode for a categorical variable with hundreds of categories. I was getting 100% accuracy and precision so something wasn’t right. I’m going to try the steps that you outlined in this tutorial. Thanks.
@dataschool 4 years ago
Good luck!
@sanaullahkhanhassanzai8432 5 years ago
Thank you very much and welcome back after a long time. You are as good as gets when it comes to Machine Learning. You have made me learn a lot. I cant wait for videos on deep learning. I hope you ll come up with deep learning soon. Thanks again
@dataschool 5 years ago
Thanks very much for your kind words, and for your suggestion as well!
@asimssheikh 4 years ago
Impressive explanation, and logical approach to material presentation. You just got a new sub.
@dataschool 4 years ago
Welcome aboard!
@12345shipreck 4 years ago
You are 100x better than my ML course teacher at uni. GG bro.
@dataschool 4 years ago
Thank you! 😄
@abdelkaderkaouane1944 1 year ago
Your explanation is very clear, thank you very much
@dataschool 1 year ago
You're welcome!
@sophiar5280 4 years ago
Always love your step by step, clear lessons. Keep it coming.
@dataschool 4 years ago
Thank you!
@gyanendergandhar 3 years ago
Thanks alot for this tutorial Kevin. It really saved me😅
@dataschool 2 years ago
Glad to hear that!
@xinchenzou4558 2 years ago
Thank you sir! You've really saved my life...
@dataschool 2 years ago
🙌
@ayyappahemanth7134 5 years ago
Oh my god! after so much of exhaustive waiting another video came, which is far more useful than others for me! I just love your videos, the content was really useful in my real life, most of the youtube channels they just take the ideal ones which I might not encounter in my whole life! please do these videos regularly!
@dataschool 5 years ago
That is awesome to hear, thanks so much for your kind words! 🙏 Actually, I publish a new Q&A video every month for Data School Insiders at the $5 level: www.patreon.com/dataschool
@NoWhiteGullibility 5 years ago
Perfect timing, was just searching on pipelines the other day. Would be great to follow-up by tacking on Gridsearch in this context.
@dataschool 5 years ago
That's awesome to hear! I will definitely cover grid search of a pipeline at some point - thanks for the suggestion!
@salakkal 4 years ago
Really great that you did a video like this . It just helped me a lot and I am really thankful for it brother . Keep going .
@dataschool 3 years ago
Thanks!
@barulli87 4 years ago
MIND BLOWN!!!! CV FOR A PROCESS!!! NOICE ONE!!
@dataschool 3 years ago
🤯
@trentjones6468 4 years ago
Amazing video. You are an excellent instructor. Got yourself a new subscriber :)
@dataschool 4 years ago
Thank you so much!
@abdoulayebalde2139 4 years ago
A very nice video that save my life I can see it is well explained keep uploading
@dataschool 3 years ago
Thanks!
@Susuwho 4 years ago
this is so helpful that I have to comment. great job. thanks a lot
@dataschool 4 years ago
Glad it was helpful!
@Universe4mi 5 months ago
Thanks, very clear and insightful!!
@dataschool 4 months ago
You're welcome!
@wexwexexort 4 years ago
OK, you've said that we are cross validating not the model but the pipeline at around 22:00. This might be useful in some other case, but what's the point of splitting first then applying one hot encoding? Result should be same if you do the one hot encoding first. Right? Am I missing something?
@dataschool 4 years ago
You will have different results if there are small categories and (by chance) some of them only appear in test but not in train.
@honprarules 4 years ago
Amazing explanation, as always!
@dataschool 3 years ago
Thank you!
@christianiheanacho4976 5 years ago
I am enriched by this teaching.
@dataschool 5 years ago
Great to hear!
@SaunakDey 3 years ago
awesome explanation!! Thanks a lot
@dataschool 3 years ago
You're very welcome!
@salmantabatabai 3 years ago
What happens if the new dataset has more categories than the training data? For example, besides C, S and Q in training, it also has something called B. How is this going to be handled?
@dataschool 3 years ago
Great question! You just have to tell OneHotEncoder how to handle unknown categories. See this video for details: kzbin.info/www/bejne/mHKZnox5ZsaSe8k
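A minimal sketch of one way to do that, assuming the handle_unknown="ignore" setting of OneHotEncoder (the linked video may show more detail). The data is made up: the category "B" appears only in the new data, never in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "B" is not seen while fitting.
train = pd.DataFrame({"Embarked": ["C", "S", "Q", "S"]})
new = pd.DataFrame({"Embarked": ["S", "B"]})

# With handle_unknown="ignore", unseen categories do not raise an error.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train)

# Learned categories are sorted (C, Q, S); the output is sparse by default.
encoded = ohe.transform(new).toarray()
print(ohe.categories_)
print(encoded)
```

The row for "B" is encoded as all zeros, since none of the learned categories match. With the default setting (handle_unknown="error"), the same transform call raises a ValueError instead.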
@patrickmullan8356 5 years ago
When applying 'make_column_transformer()' at 17:45, it returns the results (e.g., columns) in a different order than the input data. Is there a way of making it return the columns in the same order? Or at least of knowing which new columns belong to which original category, without having to do the math oneself? Especially if one is not using the introduced pipeline functionality but relies on this transformation tool anyway, for other work for example, this seems to me a bit difficult to handle, or at least to inspect. Great introduction to the modules, anyway ;)
@dataschool 5 years ago
Great question! The ordering is actually predictable: it's the ordering of the columns that I specified to the ColumnTransformer (2 columns for Sex and 3 columns for Embarked), followed by the columns that I passed through (1 column for Pclass). Does that make sense?
@patrickmullan8356 5 years ago
@@dataschool Yes, makes sense. That's what i meant with "having to do the math" ... ;)
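The ordering described in this thread can be checked on a tiny example. This is only a sketch with three invented rows; the column names mirror the ones mentioned above, and the densifying step is there because a ColumnTransformer may return either a sparse matrix or a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Three invented rows; Pclass comes first in the input on purpose.
X = pd.DataFrame({
    "Pclass":   [3, 1, 2],
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

column_trans = make_column_transformer(
    (OneHotEncoder(), ["Sex", "Embarked"]),
    remainder="passthrough",
)
out = column_trans.fit_transform(X)

# Densify if a sparse matrix came back, so the result prints as a plain array.
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# Column order: Sex (female, male), Embarked (C, Q, S), then passthrough Pclass.
print(out)
```

Here the first row (male, S, Pclass 3) comes out as 0, 1, 0, 0, 1, 3: the transformed columns first, in the order they were listed, followed by the passed-through column.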
@21Gannu 4 years ago
Wondering how one would combine grid search and standardization steps.
@dataschool 3 years ago
Great question! Standardization would either be one of the transformers in the ColumnTransformer or one of the steps in the Pipeline. GridSearchCV can be used to grid search the entire Pipeline. Hope that helps!
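One possible arrangement, sketched under assumptions: StandardScaler is placed inside the ColumnTransformer for a numeric column, and GridSearchCV searches over the whole pipeline. The data, the "Fare" column, and the grid of C values are all invented for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data invented for illustration.
df = pd.DataFrame({
    "Fare":     [7.3, 71.3, 7.9, 53.1, 8.1, 8.5, 51.9, 21.1, 11.1, 30.1, 16.7, 26.6],
    "Sex":      ["male", "female", "male", "female", "male", "male",
                 "female", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1],
})
X = df[["Fare", "Sex"]]
y = df["Survived"]

# Standardize the numeric column and one-hot encode the categorical one.
column_trans = make_column_transformer(
    (StandardScaler(), ["Fare"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
)
pipe = make_pipeline(column_trans, LogisticRegression())

# make_pipeline names each step after its lowercased class name, so
# parameters are addressed as "<step name>__<parameter name>".
params = {"logisticregression__C": [0.1, 1, 10]}

grid = GridSearchCV(pipe, params, cv=3, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Since the scaler is part of the pipeline, it is refit on the training portion of every fold during the search.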
@will6403 3 years ago
Do you still have to split the data for training and test when using pipeline?
@dataschool 3 years ago
You can use train/test split as your model evaluation procedure for a pipeline, or you can use cross-validation, or you can use both (depending on your goals). Hope that helps!
@ajaysaikiranpenumareddy9809 3 years ago
How can we map label encoded data in training to new incoming data for predicting any class.
@gardnmi 5 years ago
Since pandas get_dummies ignores non-categorical values, I've always done the below, but I might have to start using pipelines. Great video!
train = pd.get_dummies(train)
test = pd.get_dummies(test)
test = test.reindex(columns=train.columns, fill_value=0)
@dataschool 5 years ago
Thanks for sharing! It's still okay to use get_dummies, but you may end up with a gigantic DataFrame that includes columns you're not interested in. Plus, you will definitely have problems if any of the categorical features in your test data include different values than your training data. Anyway, glad you liked the video and I hope to bring you over to Pipeline! 😉
@gardnmi 5 years ago
@@dataschool I ran into the misaligned shapes issues a lot. That's what test.reindex(columns=train.columns, fill_value=0) solved for me but it seems pipeline is a bit more elegant.
@dataschool 5 years ago
@@gardnmi Even though reindexing *appears* to fix the problem with misaligned shapes, there's a high likelihood that the columns of your test DataFrame no longer match the column ordering of your train DataFrame. That's a significant problem because it means that your features are in the wrong order in test, and thus your model will make incorrect predictions. Pipeline thankfully solves that problem!
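For anyone who wants to inspect the get_dummies plus reindex approach from this thread, here is a small sketch on invented data. It only shows a way to compare the column labels of the two frames directly, and it makes no claim about how the approach behaves on a real dataset.

```python
import pandas as pd

# Invented data: "Q" appears only in test, "C" only in train.
train = pd.DataFrame({"Pclass": [1, 3], "Embarked": ["C", "S"]})
test = pd.DataFrame({"Pclass": [2, 3], "Embarked": ["S", "Q"]})

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)
print(list(train_d.columns))  # dummy columns derived from train
print(list(test_d.columns))   # dummy columns derived from test

# Reindex test to train's column labels, filling missing columns with 0.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

# Compare the labels directly rather than relying on matching shapes.
print(list(test_aligned.columns) == list(train_d.columns))
print(test_aligned)
```

In this toy case the test-only category "Q" is dropped without any warning, so that row ends up with zeros in every Embarked column. That silent handling is one reason to keep the encoding inside a pipeline, where the behavior for unknown categories is stated explicitly.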
@modhua4497 1 year ago
Thanks Kevin, do you have any video example that shows how to incorporate a self defined function in pandas pipeline?
@AjayVerma-xi2us 5 years ago
Very good, it cleared my many doubts
@dataschool 5 years ago
Great to hear!
@olamartins 2 years ago
Please can you teach on OneHotEncoding in sklearn 0.24.1? categorical_features is deprecated in the lower version. I'm stuck
@eatbreathedatascience9593 3 years ago
This video is excellent.
@dataschool 3 years ago
Thank you!
@12345shipreck 4 years ago
My categorical variable is y, and the method you used with the column transformer just doesn't work on it. It gives me an error :(
@roopeshroope5988 3 years ago
Does this tutorial explain how to prepare a dataset for k-fold cross-validation to avoid data leakage?
@victor-os9wq 2 years ago
Thanks for such a detailed tutorial. I am working on a similar problem where I have multiple categorical features. In my dataset, the categorical variables have more than 90 possible values; as a result, I end up with an additional 121 columns when I use get_dummies, but I actually want just four levels. Please kindly advise me.
@AmirKhan_KnowTech 4 years ago
If one of the columns has numerical data and it needs to be normalized, how can we do that with 'remainder'?
@IgnitedMountain 3 years ago
Hello, in the last example, how are the NaN values handled? Are they removed by one of the methods, or do you have to remove them yourself?
@frosty2164 3 years ago
is there a way to split the data into train and test and check the accuracy in the pipeline after the dummy creation. In short i want to add the train split code in the pipeline?
@dataschool 3 years ago
Yes, you can use train/test split with a pipeline (instead of cross-validation). You fit the pipeline on the training set, make predictions on the testing set, and then check the results. Hope that helps!
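The three steps in that reply (fit on the training set, predict on the testing set, check the results) might look roughly like this. It is a sketch only: the data is invented, and the 25% test size, random_state, and accuracy metric are arbitrary choices for the example.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for illustration.
df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 2, 1, 3, 2, 3, 1, 2, 3, 1],
    "Sex":      ["female", "male", "male", "female", "male", "male",
                 "female", "female", "female", "male", "male", "female"],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1],
})
X = df[["Pclass", "Sex"]]
y = df["Survived"]

# Split the raw data first; the dummy encoding happens inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    remainder="passthrough",
)
pipe = make_pipeline(column_trans, LogisticRegression())

pipe.fit(X_train, y_train)      # encoder and model learn from train only
y_pred = pipe.predict(X_test)   # the same fitted encoding is applied to test
print(accuracy_score(y_test, y_pred))
```

The split itself stays outside the pipeline; the pipeline only contains the steps that must be learned from the training data.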
@amitkumards5609 4 years ago
No doubt video is great, But one question, if I use Random Forest and want to know the feature importance with feature names(by using column transformers we will end up having an array without any column names, ex: after one hot encoding category name should be the column name, but that is not happening with this setup) how can we do it with this setup ?
@dataschool 3 years ago
Great question! Under certain conditions, you can use the ColumnTransformer's get_feature_names method to extract the feature names.
@ratnakarmaurya3066 4 years ago
Hey, don't we need to drop one column to avoid the dummy variable trap when using the ColumnTransformer class?
@dataschool 3 years ago
See this video: kzbin.info/www/bejne/hIrXqKysrtt3e80
@Camila-fv9qj 1 year ago
Why did you use fit_transform on all of the data?
@hichamamchtkou7343 5 years ago
Thank you very much, it 's very interesting and by the way, it is exactly what i need in my current ML project.
@dataschool 5 years ago
That's great to hear! Good luck with your project 🙌
@hichamamchtkou7343 5 years ago
@@dataschool thanks 👍
@1stophchr 5 years ago
thank you very much, very clear video
@dataschool 5 years ago
You're very welcome! 😄
@yoniziv 4 years ago
How would you deal with ordinal categorical variables (e.g. Small-Medium-Large...) since they are categorical but obviously have clear "numerical" order. Would you just convert the names to integers or is there a more methodical way to do it?
@dataschool 4 years ago
Great question! You would use OrdinalEncoder. See tip #6 here for an example: github.com/justmarkham/scikit-learn-tips Also, watch out for the tip #6 video in this playlist, which I'll publish on October 29: kzbin.info/aero/PL5-da3qGB5ID7YYAqireYEew2mWVvgmj6 Hope that helps!
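As a quick sketch of that suggestion (not taken from the linked tip), OrdinalEncoder can be given the category order explicitly. The "Size" column and its values are invented to match the Small-Medium-Large example in the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented example column with a natural order.
X = pd.DataFrame({"Size": ["Medium", "Small", "Large", "Small"]})

# Pass the logical order explicitly; otherwise categories are sorted
# alphabetically (Large, Medium, Small), which is not the intended order.
oe = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded = oe.fit_transform(X)

# Small -> 0, Medium -> 1, Large -> 2
print(encoded)
```

The result stays a single column, so the ordering information is preserved instead of being spread across dummy columns.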
@elvykamunyokomanunebo1441 1 year ago
Hi, When we encode Embarked, we end up with 3 dummy variables, however, it's not specified if any of these new features is dropped so that it can be the reference feature. Why is that? Thanks in advance
@dataschool 1 year ago
My thoughts on dropping a feature are here: kzbin.info/www/bejne/hIrXqKysrtt3e80
@riteshtripathi8626 4 years ago
Hello teacher, please explain in layman's term about the final predicted values that you got (what are they): pipe.predict(X_new) Results: array([1, 0, 1, 1, 0])#how should I be interpreting these values. Thanks
@kyledesana6441 4 years ago
Those are the predicted values for whether the passengers in the X_new data frame survived or not (Survived is a dummy variable)
@riteshtripathi8626 4 years ago
@@kyledesana6441 Thank you, this clears the doubt. :)
@dataschool 3 years ago
Great question! That's the prediction of survived (1) or did not survive (0) for the five samples in the new data.
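To make the meaning of that output concrete, here is a small sketch on invented data. It prints the class labels the fitted pipeline knows about, the predicted label for each new row, and (as an optional extra not discussed in the thread) the predicted probability of each class.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data invented for illustration.
train = pd.DataFrame({
    "Pclass":   [1, 3, 3, 2, 1, 3, 2, 3],
    "Sex":      ["female", "male", "male", "female",
                 "male", "male", "female", "female"],
    "Survived": [1, 0, 0, 1, 0, 0, 1, 1],
})
X = train[["Pclass", "Sex"]]
y = train["Survived"]

# New, unlabeled passengers (also invented).
X_new = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]})

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    remainder="passthrough",
)
pipe = make_pipeline(column_trans, LogisticRegression())
pipe.fit(X, y)

print(pipe.classes_)                # the labels the model can output
preds = pipe.predict(X_new)         # one label per row of X_new
proba = pipe.predict_proba(X_new)   # one column per entry of classes_
print(preds)
print(proba)
```

Each element of the predict output corresponds to the row of X_new at the same position, and its value is one of the labels in classes_ (0 for did not survive, 1 for survived).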
@joxa6119 2 years ago
God this video answered my month unsolved question. God blessed you.
@dataschool 2 years ago
Great to hear!
@raymondlee7280 4 years ago
After I used the OneHotEncoder, I could generate a number of columns attributing dummy variables, but I found it difficult to add to the original dataframe for other data-preprocessing steps, it shows 'ValueError: could not broadcast input array from shape (10,3) into shape (10)'. I know that there is a difference in dimension between the column being one-hot-encoded and the dummy variables columns. How could I replace the column with categorical variables with the new generated columns? or I just use ColumnTransformer to handle this issue? Thank you.
@dataschool 3 years ago
Great question! Ideally, you should be doing all of your preprocessing within scikit-learn. Here's why: kzbin.info/www/bejne/r6eXkpd6fMh5e5o
@ktay895 3 years ago
Am I missing something -- shouldn't there be a train-test-split somewhere?
@dataschool 3 years ago
Great question! Cross-validation is a model evaluation procedure, which is a superior alternative to train/test split for model evaluation. That being said, there is some utility in first doing a train/test split, then performing hyperparameter tuning (usually with a grid search) on the training set, and then testing the best model on the testing set. That gives you a more reliable estimate of out-of-sample performance, but (1) it adds code complexity, (2) it sacrifices some of your training data that could be used for hyperparameter tuning, and (3) it's entirely unnecessary if your only goal is to find the best parameters for your pipeline. In short: whether or not you should also include a train/test split is situation-dependent.
@adityakharwade9501 4 years ago
Awesome video and thank you for this explanation!!! I have one request could you please make video on PCA
@dataschool 4 years ago
Thanks for your suggestion!
@TheAdrianPardo 5 years ago
Thank you so much! You're the best! Please go over scaling when you have a chance :) Question: Is is ok to leave in all of the OneHotEncoded columns with this pipe approach? I believe you previously mentioned how it's best to drop one of the columns to prevent multicollinearity. Any way to do this within the pipe?
@dataschool 5 years ago
You are so kind, thank you! 😊 Yes, I plan to cover StandardScaler at some point. Yes, it is okay to leave in all of the one-hot encoded columns. However, the "drop" parameter for OneHotEncoder (new in scikit-learn 0.21) does allow you to drop one feature per category. Hope that helps!
@ramleo1461 5 years ago
Even I had the same doubt... Thank you for clarifying 😊
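The "drop" parameter mentioned in this thread can be compared against the default on a toy column. This is a sketch with invented values; drop="first" is one of the accepted settings for that parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented column with three categories: C, Q, S.
X = pd.DataFrame({"Embarked": ["C", "S", "Q", "S"]})

# Default: one output column per category.
full = OneHotEncoder().fit_transform(X).toarray()

# drop="first": the first category (C, in sorted order) gets no column.
dropped = OneHotEncoder(drop="first").fit_transform(X).toarray()

print(full.shape)     # three columns: C, Q, S
print(dropped.shape)  # two columns: Q, S
print(dropped)
```

With the first level dropped, a "C" row is represented by zeros in both remaining columns, so no information is lost.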
@adamwathieu1878 4 years ago
Is there a way to add a header row to the numPy Array so that you know what each hot encoded column represents?
@dataschool 3 years ago
Great question! You would just have to convert the NumPy array back to a pandas DataFrame in order to add column names, because NumPy arrays can't have a header row.
@lakshmitejaswi7832 5 years ago
If a column has 2 different values at training time and 3 different values at test time, how do you handle it?
@dataschool 4 years ago
Great question! See the handle_unknown parameter of OneHotEncoder: scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
@messi3210 5 years ago
Question: Why are we passing all the encoded features to the model? We should pass only k-1 dummy/encoded features when they have k categories. Or did you just not worry about it as you were trying to demonstrate the pipeline and column transform functionalities?
@dataschool 4 years ago
Great question! You can drop the first level, but the current "conventional wisdom" in the ML community is that you don't get any model performance benefit from dropping a level. Thus, if your goal is model performance, you should leave the first level intact just to minimize the complexity of your process.
@surfzion 4 years ago
Extremely helpful, thank you so much !!!
@dataschool 4 years ago
Glad it helped!
@karinahulka1486 1 year ago
so if I may ask - there's no point of creating one hot encoder for categorical or ordinal variables, which are numeric?
@dataschool 1 year ago
Great question! If they are already numeric, and they are logically ordered, then you are correct, there's no point in one-hot encoding them.