Use "fit_transform" on training data, but "transform" (only) on testing/new data

18,435 views

Comments: 74

@dataschool 4 years ago

If you're not already familiar with the difference between "fit" and "transform", go back and watch tip #3 here: kzbin.info/www/bejne/nWO7pI2arMd2edU Thanks for watching, and let me know if you have any questions! 🙌

@neutroniumus 4 years ago

What if the "Embarked" column of the testing data has more than three values? Should I still use only "transform"?

@dataschool 4 years ago

Great question! Yes, you would still only use "transform" on the testing data. However, you would want to set the handle_unknown='ignore' parameter when creating an instance of OneHotEncoder, so that running the transform on the testing data does not cause an error. (Stay tuned for tip #7 in which I'll explain the handle_unknown parameter!)
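
A minimal sketch of the reply above, assuming a toy "Embarked" column in which the testing data contains a value ("X") that never appears in training. The data and variable names are made up for illustration; only OneHotEncoder and its handle_unknown parameter come from the thread.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three categories in training, one unseen category in testing.
X_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
X_test = pd.DataFrame({"Embarked": ["C", "X"]})  # "X" was never seen during fit

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit_transform(X_train)                      # learn the categories from training data only
encoded_test = ohe.transform(X_test).toarray()  # transform only, no refit

print(ohe.categories_[0])  # categories learned from training: C, Q, S
print(encoded_test)        # row for "C" is [1, 0, 0]; row for "X" is [0, 0, 0]
```

With the default handle_unknown="error", the same transform call would raise an error on the unseen value.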

@anandvyavahare2031 3 years ago

This is one of the most practical tips, and no one tells you about it when teaching ML/DS. Wow!

@dataschool 3 years ago

I agree, it's quite important! Thanks Anand!

@anandvyavahare2031 3 years ago

@dataschool Thanks Kevin 🙌

@mlkgpta2869 1 year ago

Just 4 minutes of listening gave me really good clarity. I had been searching the internet and could not understand it, but this made me understand.

@dataschool 1 year ago

Great to hear!

@randomdude79404 3 years ago

I was looking for this explanation everywhere. Extremely clear, concise, and to the point. Thank you!

@dataschool 3 years ago

You're very welcome! I'm glad to hear it was helpful to you!

@AnilChauhan-xm5wk 10 months ago

This is by far the best explanation of fit, transform I have come across. All the explanations that I have seen before made me all confused. But this is certainly the best. Thanks a lot.

@dataschool 8 months ago

Glad it was helpful!

@ninjaduck3534 3 years ago

Super important point, very well explained. Thank you!

@dataschool 3 years ago

You're very welcome! Glad it was helpful to you!

@levon9 3 years ago

Super clear explanation - thanks!

@dataschool 3 years ago

Thanks for your kind words!

@eygmrt 3 years ago

This is the clearest answer about that topic, thanks

@dataschool 3 years ago

You're very welcome! Glad it was helpful to you!

@jaikishank 4 years ago

It was a very essential and important tip for modeling. Thank you very much.

@dataschool 4 years ago

Glad it was helpful!

@thisismuchbetter2194 4 years ago

Why am I not subscribed to this channel yet? Good stuff. Really, really good stuff that many people ignore. Thank you.

@dataschool 4 years ago

You're very welcome! Thanks for your kind words!

@Chillos100 3 years ago

Genius!! Thnx a lot! I was struggling with these concepts

@dataschool 3 years ago

You're very welcome! Glad it was helpful to you!

@geekyprogrammer4831 2 years ago

Perfect Explanation!

@dataschool 2 years ago

Thank you!

@DanielMak1234 2 years ago

I am sorry, but I don't quite understand. Suppose I instantiate imp_A = SimpleImputer() and imp_B = SimpleImputer(), and then run imp_A.fit_transform(train) and imp_B.fit_transform(test), i.e. I am fitting two separate imputers, one on train and one on test. For the model, logreg = LogisticRegression(), I run logreg.fit(train_X, train_y) and logreg.predict(test). Here, logreg is trained entirely on train, so I can't see how information leakage would happen. Presumably, even in real life we would still need to preprocess the new data before we can generate predictions for it, right?

@dataschool 2 years ago

I'm sorry, I'm not 100% clear on your question! One comment I do have is that you would not create both imp_A and imp_B. Instead, you would create imp, then imp.fit_transform(train), then imp.transform(test). Hope that helps!
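
A minimal sketch of the single-imputer pattern from the reply above, using made-up one-column arrays. It also fits a second imputer on the testing data (the imp_B idea from the question, named imp_b here) purely to show that the fill value would differ; the numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the observed training values are 1.0 and 3.0 (mean 2.0).
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imp = SimpleImputer()                    # default strategy="mean"
train_filled = imp.fit_transform(train)  # learn the mean from training data and fill it in
test_filled = imp.transform(test)        # reuse the training mean on the testing data

# For contrast only: a second imputer fit on the testing data learns a different mean.
imp_b = SimpleImputer().fit(test)

print(imp.statistics_)    # [2.]  (training mean)
print(test_filled)        # missing test value filled with 2.0
print(imp_b.statistics_)  # [10.] (testing mean, not what the model was trained with)
```

With a single new sample there would be no testing mean to compute at all, which is one practical reason to reuse the fitted imputer.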

@ramanenb4773 11 months ago

these 2 videos were very useful in clarifying this concept. Thanks :D

@dataschool 11 months ago

Glad it was helpful!

@vahajqureshi 1 year ago

Subscribing to this channel. What a fantastic explanation

@dataschool 1 year ago

Thank you so much!

@tymothylim6550 3 years ago

Thank you very much for this video! It is very helpful and I learnt something really important regarding this!

@dataschool 3 years ago

You're very welcome!

@ashabhumza3394 2 years ago

Perfect, perfect, perfect... I am a big fan 👍

@dataschool 2 years ago

Thank you!

@aimanzaidi474 3 years ago

Nice explanation and great example. Now I can distinguish it well.

@dataschool 3 years ago

Great to hear!

@kumaransekar5941 2 years ago

Excellent Explanation!!!! Thanks for keeping it very simple.....

@dataschool 2 years ago

You're very welcome!

@gnaneshgn8341 2 years ago

Superb explanation. The best and clearest one of all the videos I've seen. Thanks, Kevin. I'm waiting for the new series.

@dataschool 2 years ago

Thank you so much!

@johnanih56 4 years ago

Sounds simple but a golden rule!

@dataschool 4 years ago

Agreed!

@reshaknarayan3944 3 years ago

Well explained. For better reach, using a whiteboard and marker would get better responses. I much appreciate your consistent efforts.

@dataschool 3 years ago

Thanks for your suggestion!

@aaronsarinanatoledo688 3 years ago

Thanks for the video. I still have a doubt. If we only use "transform" in the test data we will be using the mean and the standard deviation that we calculate with the training data, correct? That is, we are using information that comes from the training data, is this not a kind of data leakage?

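@dataschool 3 years ago

Yes, you will be using the values that you calculate from the training data. No, that's not data leakage. Leakage would mean information from the testing data influencing what is learned from the training data; here the information flows only from training to testing, which mirrors how you would treat genuinely new data. Hope that helps!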

@lonewolf234 11 months ago

I have the exact same doubt pal. In that case why can't they do the scaling and then split the data?
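
A minimal sketch relating to the two questions above, assuming StandardScaler as the transformer and made-up one-column data. It compares the mean learned from training data alone with the mean learned when scaling is done before splitting, in which case the testing value feeds into the statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: a training column with mean 2.0 and one extreme testing value.
X_train = np.array([[0.0], [2.0], [4.0]])
X_test = np.array([[100.0]])

# Recommended pattern: learn the mean and standard deviation from training data only.
scaler = StandardScaler()
scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # uses the training statistics
train_only_mean = scaler.mean_[0]         # 2.0

# Scaling before splitting: the testing value leaks into the learned statistics.
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))
leaky_mean = leaky.mean_[0]               # (0 + 2 + 4 + 100) / 4 = 26.5

print(train_only_mean, leaky_mean)
```

In the second case the training rows would be scaled with a mean that already "knows" about the testing value, so the testing data no longer simulates unseen data.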

@issabarack8590 3 years ago

Thanks a lot for this very helpful video! One question: is it possible to fit the encoder on the entire dataset (train + test) and then transform the test set, so that the encoding covers all possible values?

@dataschool 3 years ago

Excellent question! No, that's not allowed, because the whole point of a test set is to simulate the future, and if you look into the test set in order to inform your encoding, then you are not properly simulating the future. Hope that helps!

@kshitijbhargava2750 3 years ago

Thanks, beautifully described. Can you please create a video on the naive Bayes algorithm with examples?

@dataschool 3 years ago

Thanks for your kind words and suggestion!

@RaviKumar-vy3qt 3 years ago

I have a question. I have a dataset and have done the train/test split, and there are NaN values in both the training and testing sets. Should I compute the training mean to replace the NaN values in the training set, and separately compute the testing mean to replace the NaN values in the testing set? Or should I replace the NaN values in the testing set with the training mean?

@dataschool 3 years ago

You replace the NaN values of the testing set with the training mean.

@blink4037 2 years ago

@dataschool Yes, but an interesting question came up, Kevin, and I'd be happy if you could clarify. Since imputation is usually done as preprocessing, before the train/test split, doesn't it cause a small leak in a process like k-fold CV, where we have no new data and need to estimate model performance on the full available set? During k-fold, any fold containing filled values can serve as the test data, and yet that fold was already used to compute the mean, so data that should be unseen has been used. Isn't that a kind of leakage from test to train, which would make the whole k-fold process suspect, at least conceptually? (In k-fold CV we don't do the train/test split manually.)
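
One common way to address the k-fold concern above is to put the imputer inside a pipeline and cross-validate the pipeline, so the imputer is refit on the training portion of each fold. This is a minimal sketch under that assumption, with randomly generated data; it is not taken from the video.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data with roughly 10% of the feature values missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)          # target computed before values are removed
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so within each fold it is fit on that fold's
# training rows only and then applied to the held-out rows.
pipe = make_pipeline(SimpleImputer(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)

print(scores.mean())
```

Imputing the whole dataset first and cross-validating only the model would let each held-out fold contribute to the imputation mean, which is the leak described in the question.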

@debatradas9268 3 years ago

thank you so much

@dataschool 3 years ago

You're welcome!

@Karthik_info_vlogs 4 years ago

When we prepare scoring code, how can we use the transform function without fitting, given that only the new data will be available in that session?

@dataschool 4 years ago

A transformer object (such as a OneHotEncoder) has a transform method, and you pass it the testing/new data, for example: ohe.transform(X_new). That will only work if you have first run fit (or fit_transform) on the training data, for example: ohe.fit_transform(X). Does that help to answer your question?
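
For the case where scoring runs in a separate session, one option is to save the fitted transformer to disk and load it again later. The sketch below assumes joblib for persistence, a temporary file path, and toy data; none of these come from the thread, which only names ohe, X, and X_new.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# --- Training session: fit the encoder on training data and save it. ---
X = pd.DataFrame({"Embarked": ["S", "C", "Q"]})  # hypothetical training data
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit_transform(X)
path = os.path.join(tempfile.mkdtemp(), "ohe.joblib")  # illustrative path
joblib.dump(ohe, path)

# --- Scoring session: only new data is available, so load the fitted encoder. ---
X_new = pd.DataFrame({"Embarked": ["Q"]})
ohe_loaded = joblib.load(path)
encoded_new = ohe_loaded.transform(X_new).toarray()  # transform only, no fit

print(encoded_new)  # [[0. 1. 0.]] with categories ordered C, Q, S
```

The loaded object carries the categories learned during training, so no fitting happens in the scoring session.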

@biswajitmahalik4134 1 year ago

loved it

@dataschool 1 year ago

Thanks!

@poonamjaiswal1210 3 years ago

How can I sort by date? No changes appear in the dataset with your code.

@dataschool 3 years ago

Sorry, I don't understand your question, could you clarify what you are asking about? Thanks!

@Grand-Mono 3 years ago

You mean fit_transform for xtrain and ytrain, and transform for xtest and ytest? I'm new to DS, sorry.

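@dataschool 3 years ago

Almost! Use fit_transform on X_train and transform on X_test. The target values (y_train and y_test) don't go through these feature transformers.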

@lovejazzbass 4 years ago

Kevin, could you show this on a notebook?

@dataschool 4 years ago

Sure! See here: nbviewer.jupyter.org/github/justmarkham/scikit-learn-tips/blob/master/notebooks/07_handle_unknown_categories.ipynb Watch out for the tip 7 video, which will be published on Tuesday (November 3).

@lovejazzbass 4 years ago

@dataschool Kevin, I saw the Jupyter notebook you sent, and I saw how you transformed the new data using OneHotEncoder. This is what I usually do: suppose I have three different preprocessing methods that I have applied to my dataset (OneHotEncoder, PCA, and StandardScaler). I normally transform the data using these methods and fit it using the pipeline. Then I check how the model performs on validation data (the remaining 30%). To test how well my model works, I usually fit the new dataset (a completely new test set) to the model or the pipeline. Did you mean that I shouldn't fit the data at this stage? Please help!

@dataschool 4 years ago

If you're using a pipeline that encapsulates your preprocessing and model building, the only thing you need to do with a new test set is to run the predict method with the fitted pipeline. Does that help?
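
A minimal sketch of the reply above, assuming a pipeline of StandardScaler, PCA, and LogisticRegression on randomly generated numeric data. The steps and data are illustrative choices, not the commenter's actual setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic training data and a completely new test set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_new = rng.normal(size=(5, 4))

# Fit once, on training data: each preprocessing step runs fit_transform
# and the final model runs fit.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
pipe.fit(X_train, y_train)

# New data: predict only. Each preprocessing step runs transform, never fit.
predictions = pipe.predict(X_new)

print(predictions)
```

Calling fit on the new test set would relearn the scaling, the components, and the model from that data, which is the step the reply says to avoid.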