If you're not already familiar with the difference between "fit" and "transform", go back and watch tip #3 here: kzbin.info/www/bejne/nWO7pI2arMd2edU Thanks for watching, and let me know if you have any questions! 🙌
@neutroniumus 4 years ago
What if the "Embarked" column of the testing data has more than three values? Should I still use only "transform"?
@dataschool 4 years ago
Great question! Yes, you would still only use "transform" on the testing data. However, you would want to set the handle_unknown='ignore' parameter when creating an instance of OneHotEncoder, so that running the transform on the testing data does not cause an error. (Stay tuned for tip #7 in which I'll explain the handle_unknown parameter!)
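A minimal sketch of the behavior described in the reply above, assuming a toy single-column array in place of the real Embarked column (the values "S", "C", "Q" and the unseen "X" are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three categories appear in the training data
X_train = np.array([["S"], ["C"], ["Q"], ["S"]])
# The testing data contains a category ("X") that was never seen during fit
X_test = np.array([["C"], ["X"]])

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error during transform
ohe = OneHotEncoder(handle_unknown="ignore")

train_encoded = ohe.fit_transform(X_train).toarray()  # learn categories from training data only
test_encoded = ohe.transform(X_test).toarray()        # apply the learned categories, no refit

print(ohe.categories_)  # categories learned from X_train
print(test_encoded)     # the row for "X" is all zeros
```

With the default handle_unknown="error", the transform call on X_test would raise an error instead.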
@anandvyavahare2031 3 years ago
This is one of the most practical tips, and no one tells you about it when teaching ML/DS. Wow!
@dataschool 3 years ago
I agree, it's quite important! Thanks Anand!
@anandvyavahare2031 3 years ago
@dataschool Thanks Kevin 🙌
@mlkgpta2869 1 year ago
Just 4 minutes of listening gave me really good clarity. I had been searching the internet and could not understand it, but this made me understand.
@dataschool 1 year ago
Great to hear!
@randomdude79404 3 years ago
Was looking for this explanation everywhere. Extremely clear, concise, and to the point. Thank you!
@dataschool 3 years ago
You're very welcome! I'm glad to hear it was helpful to you!
@AnilChauhan-xm5wk 10 months ago
This is by far the best explanation of fit, transform I have come across. All the explanations that I have seen before made me all confused. But this is certainly the best. Thanks a lot.
@dataschool 8 months ago
Glad it was helpful!
@ninjaduck3534 3 years ago
Super important point, very well explained. Thank you!
@dataschool 3 years ago
You're very welcome! Glad it was helpful to you!
@levon9 3 years ago
Super clear explanation - thanks!
@dataschool 3 years ago
Thanks for your kind words!
@eygmrt 3 years ago
This is the clearest answer about that topic, thanks
@dataschool 3 years ago
You're very welcome! Glad it was helpful to you!
@jaikishank 4 years ago
It was a very essential and important tip for modeling. Thank you very much.
@dataschool 4 years ago
Glad it was helpful!
@thisismuchbetter2194 4 years ago
Why am I not subscribed to this yet... Good stuff. Really, really good stuff that many people ignore. Thank you.
@dataschool 4 years ago
You're very welcome! Thanks for your kind words!
@Chillos100 3 years ago
Genius!! Thanks a lot! I was struggling with these concepts.
@dataschool 3 years ago
You're very welcome! Glad it was helpful to you!
@geekyprogrammer4831 2 years ago
Perfect Explanation!
@dataschool 2 years ago
Thank you!
@DanielMak1234 2 years ago
I am sorry, but I don't quite understand... Suppose I instantiate imp_A = SimpleImputer() and imp_B = SimpleImputer(), and then do imp_A.fit_transform(train) and imp_B.fit_transform(test), i.e. I am fitting two separate imputers on train and test. And for the model logreg = LogisticRegression(), I do logreg.fit(train_X, train_y) and logreg.predict(test). Here, logreg is trained entirely on train, so I can't see how information leakage would happen. Presumably, even in real life we would still need to preprocess the new data before we can generate predictions for it, right?
@dataschool 2 years ago
I'm sorry, I'm not 100% clear on your question! One comment I do have is that you would not create both imp_A and imp_B. Instead, you would create imp, then imp.fit_transform(train), then imp.transform(test). Hope that helps!
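A small sketch of the single-imputer pattern from the reply above, assuming made-up one-column arrays for train and test. It also fits a second imputer on the test data (the imp_B idea from the question) only to show that the two would fill missing values differently:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the mean of the observed training values is 2.0
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[10.0], [np.nan]])

# One imputer: learn the fill value from train, then reuse it on test
imp = SimpleImputer()                    # default strategy is "mean"
train_filled = imp.fit_transform(train)
test_filled = imp.transform(test)

print(imp.statistics_)   # fill value learned from the training data
print(test_filled)       # the missing test value is filled with the training mean

# For comparison only: a second imputer fit on test learns a different fill value
imp_B = SimpleImputer().fit(test)
print(imp_B.statistics_)
```

The model is trained on data filled with the training mean, so filling the test data with a different value means the test data is preprocessed differently from what the model saw during training.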
@ramanenb4773 11 months ago
These 2 videos were very useful in clarifying this concept. Thanks :D
@dataschool 11 months ago
Glad it was helpful!
@vahajqureshi 1 year ago
Subscribing to this channel. What a fantastic explanation
@dataschool 1 year ago
Thank you so much!
@tymothylim6550 3 years ago
Thank you very much for this video! It is very helpful and I learnt something really important regarding this!
@dataschool 3 years ago
You're very welcome!
@ashabhumza3394 2 years ago
Perfect, perfect, perfect... I am a big fan 👍
@dataschool 2 years ago
Thank you!
@aimanzaidi474 3 years ago
Nice explanation and great example. Now I can distinguish it well.
@dataschool 3 years ago
Great to hear!
@kumaransekar5941 2 years ago
Excellent explanation! Thanks for keeping it very simple.
@dataschool 2 years ago
You're very welcome!
@gnaneshgn8341 2 years ago
Superb explanation... The best and clearest one of all the videos I've seen. Thanks, Kevin. I'm waiting for the new series.
@dataschool 2 years ago
Thank you so much!
@johnanih56 4 years ago
Sounds simple but a golden rule!
@dataschool 4 years ago
Agreed!
@reshaknarayan3944 3 years ago
Well explained. For better reach, using a whiteboard and marker would get better responses. I much appreciate your consistent efforts.
@dataschool 3 years ago
Thanks for your suggestion!
@aaronsarinanatoledo688 3 years ago
Thanks for the video. I still have a doubt. If we only use "transform" in the test data we will be using the mean and the standard deviation that we calculate with the training data, correct? That is, we are using information that comes from the training data, is this not a kind of data leakage?
@dataschool 3 years ago
Yes, you will be using the values that you calculate from the training data. No, that's not data leakage. Hope that helps!
@lonewolf234 11 months ago
I have the exact same doubt pal. In that case why can't they do the scaling and then split the data?
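A hedged sketch for the scaling questions in this thread, assuming made-up one-column arrays. It shows that transform on the test data reuses the mean and standard deviation learned from the training data, and that fitting a scaler before splitting would make those statistics depend on the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, already split
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [100.0]])

# Fit on the training data only, then apply the same statistics to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(scaler.mean_)       # mean of the training data only

# For comparison only: scaling before splitting means fitting on all rows,
# so the learned mean is influenced by the test rows
full_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))
print(full_scaler.mean_)  # mean of training and test rows combined
```

Information flowing from the training data to the test data is what also happens with genuinely new data. The reverse direction, where test rows influence what is learned, is the case the video warns about.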
@issabarack8590 3 years ago
Thanks a lot for this very helpful video! One question: is it possible to fit_transform the test set based on the encoding of the entire dataset (train + test)? That way we would encode all the possible values.
@dataschool 3 years ago
Excellent question! No, that's not allowed, because the whole point of a test set is to simulate the future, and if you look into the test set in order to inform your encoding, then you are not properly simulating the future. Hope that helps!
@kshitijbhargava2750 3 years ago
Thanks, beautifully described. Can you please create a video on the naive Bayes algorithm with examples?
@dataschool 3 years ago
Thanks for your kind words and suggestion!
@RaviKumar-vy3qt 3 years ago
I have a question. I have a dataset and have done the train/test split, and I have NaN values in both the training and testing sets. Should I compute the mean of the training set and replace its NaN values with the training mean, and similarly compute the mean of the testing set and replace all of its NaN values with the testing mean? Or do I replace the NaN values of the testing set with the training mean?
@dataschool 3 years ago
You replace the NaN values of the testing set with the training mean.
@blink4037 2 years ago
@dataschool Yeah, but an interesting question came up, Kevin, and I'd be happy if you could clarify. Since we usually do this preprocessing before the train/test split, doesn't it cause a small leakage in a process like k-fold CV, when we have no new data and need to estimate model performance on the full dataset we have? During k-fold, any fold containing filled values can serve as the test data, and yet that test data was used to compute the mean, so we have effectively used data that should have been unseen. Isn't that a kind of leakage from test to train, which would make the whole k-fold process suspect, at least in context? (In k-fold CV we don't do the train/test split manually.)
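One common way to address the k-fold concern above is to put the imputer inside a pipeline and pass the whole pipeline to cross_val_score, so that the imputer is refit on the training portion of each fold. A minimal sketch, assuming synthetic data generated just for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data: 100 rows, 3 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Knock out roughly 10% of the values to simulate missing data
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so within each fold it is fit only on
# that fold's training rows and then applied to the held-out rows
pipe = make_pipeline(SimpleImputer(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)

print(scores)
```

Imputing the full dataset first and then cross-validating only the model would let each held-out fold contribute to the fill values, which is the situation the question describes.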
@debatradas9268 3 years ago
thank you so much
@dataschool 3 years ago
You're welcome!
@Karthik_info_vlogs 4 years ago
When we prepare scoring code, how can we use the transform function without fitting, given that only new data will be available in that session?
@dataschool 4 years ago
A transformer object (such as a OneHotEncoder, for example) has a transform method, and you pass it the testing/new data. For example: ohe.transform(X_new) That will only work if you have first run fit (or fit_transform) on training data. For example: ohe.fit_transform(X) Does that help to answer your question?
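A sketch of one way to carry a fitted transformer into a separate scoring session. Saving the object with pickle is an assumption added here for illustration and is not something the reply prescribes. The arrays X and X_new are made-up toy data:

```python
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# --- Training session ---
X = np.array([["S"], ["C"], ["Q"]])          # hypothetical training data
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X)                                    # learn the categories once

# Serialize the fitted encoder (assumption: in practice this might be
# written to a file and shipped alongside the scoring code)
saved = pickle.dumps(ohe)

# --- Scoring session: only new data is available ---
ohe_loaded = pickle.loads(saved)              # restore the already-fitted encoder
X_new = np.array([["Q"], ["S"]])
encoded_new = ohe_loaded.transform(X_new).toarray()  # transform only, no fit

print(encoded_new)
```

The scoring session never calls fit. It relies entirely on what was learned from the training data.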
@biswajitmahalik4134 1 year ago
loved it
@dataschool 1 year ago
Thanks!
@poonamjaiswal1210 3 years ago
How can I sort by date? No changes appear in the dataset with your code.
@dataschool 3 years ago
Sorry, I don't understand your question, could you clarify what you are asking about? Thanks!
@Grand-Mono 3 years ago
You mean fit_transform for xtrain and ytrain? and transform for xtest and ytest? I'm new in DS sorry >_
@dataschool 3 years ago
Exactly!
@lovejazzbass 4 years ago
Kevin, could you show this on a notebook?
@dataschool 4 years ago
Sure! See here: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/07_handle_unknown_categories.ipynb Watch out for the tip 7 video, which will be published on Tuesday (November 3).
@lovejazzbass 4 years ago
@dataschool Kevin, I saw the Jupyter notebook you sent. I saw how you transformed the new_data using OneHotEncoder. This is what I usually do: suppose I have three different preprocessing methods that I have applied to my dataset (OneHotEncoder, PCA and StandardScaler). I normally transform the data using the above methods and fit it using the pipeline. Then I check how the model performs on the validation data (the remaining 30%). To test how well my model works, I usually fit the new dataset (a completely new test set) to the model or the pipeline. Did you mean that I shouldn't fit the data at this stage? Please help!
@dataschool 4 years ago
If you're using a pipeline that encapsulates your preprocessing and model building, the only thing you need to do with a new test set is to run the predict method with the fitted pipeline. Does that help?
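A minimal sketch of the reply above, assuming synthetic numeric data and a pipeline of StandardScaler, PCA and LogisticRegression (two of the steps named in the question, with the model chosen here only for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic training data and a separate batch of new rows
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_new = rng.normal(size=(5, 4))

# Preprocessing and model live in one pipeline
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

# fit is called once, on the training data only
pipe.fit(X_train, y_train)

# For new data, predict runs transform on each preprocessing step
# and then predict on the model. Nothing is refit.
preds = pipe.predict(X_new)
print(preds)
```

Calling pipe.fit on the new test set would relearn the scaling, the PCA components and the model from that data, which is what the reply advises against.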
@Camila-fv9qj 1 year ago
I still don't understand why we use only transform on the test data. what is q
@dataschool 1 year ago
It's because you don't want to learn from the test data, you just want to apply what you learned from the training data. Hope that helps!