What is Data Leakage In Machine Learning?

Views: 41,580

1 day ago

Comments: 103

@JS-gg4px 4 years ago

You are presenting like you really want everyone to know the significance of DL. I like this attitude! Thank you for sharing.

@nitinpatil6611 3 years ago

Jjj

@erfanm5276 3 years ago

I love your passion in teaching, man, well done to you. It was a great lesson, concise and precise

@mthandenimathabelacap5466 2 months ago

Very intuitive, and good that you bring knowledge about the implications for the production stage of the model. Great.

@aminderkhungura7929 3 years ago

Thanks!

@mahdikhalilnejad 4 years ago

Hello Thanks for your excellent explanation. I watched several sites and videos until finally your rescue video helped me understand this ... don't get tired, you put a lot of energy into it.

@sdc8591 5 years ago

Well that was very quick. Just sometime back in Live Q&A someone requested for this information and it is here within just few hours !! 👍

@ismailyt6627 2 years ago

That was what I was looking for, an excellent explanation with simplicity and schemas, thanks a lot

@kianestrera-hr5vt 7 months ago

Wow, that's it: most of my questions and the points I needed clarified about data leakage were answered within 10 minutes. Thank you so much, I wish your channel grows much further

@saurabhmukherjee850 5 years ago

I'm a big fan of yours, dear Krish... Your content and way of presenting are simply flawless... Keep on doing the good work... God bless you

@FrancisKyalo-hb5pu 2 years ago

I really enjoyed how well you explained the concept of data leakage, Thanks Krish Naik

@Neojs565 4 years ago

Kaggle brought me to this video....wow man...really eye-opening..!

@GurtDovletov 9 months ago

Thank you so much, such a good video. It is easy to understand and explained with easy examples

@shivanikhandelwal8324 3 years ago

Thank you. This video solved all my doubts...

@4urandy 3 years ago

This is so legit ... Absolutely production knowledge

@TuanKhai298 1 year ago

best teacher ever, thank youuu so much

@muralikrishna7617 5 years ago

Hello sir, just now I came across one of your videos, and I do find that what you are giving is good

@manishagyaan 3 years ago

Great explanation Krish... You are doing great work... Stay blessed always

@Jobic-10 1 year ago

Take these medals 🏅🥇🏅 🎉 Subscribed ❤

@victorahaji9717 1 year ago

You are a wonderful instructor.

@faustopf-. 1 year ago

Super informative and very well explained!

@nitayg1326 5 years ago

3 valid observations - never thought about it!

@vinaytomar5827 5 years ago

Thank you sir for these important instructions 🙌👌

@alexanderdushenin7035 4 years ago

2 days ago I ran into this problem in a competition. Thanks for the explanation)

@claudineievangelistanascim2562 4 years ago

Thanks 🙏 for sharing. Very Important Point ! Very much helpful !

@chaitanyakrishna5873 5 years ago

Excellent Krish

@laythherzallah3493 2 years ago

Thank you... very fruitful information

@vaibhavkhobragade9773 3 years ago

Thank you! This is a very important topic you've covered.

@sanjaytallolli 5 years ago

Very much helpful...

@lotmoretolearn-dataanalyti9312 3 years ago

Very nice explanation krish.. Easy to understand. I always follow ur videos. Please do more such videos on data science and data engineering as well.

@khadidjabenchaira6004 3 years ago

Thank you so much from Algeria ...

@ameermohamedr4982 5 years ago

Hello Krish, please make one video regarding data leakage with simple dataset for deep understanding for us

@marekdudzik6779 3 years ago

Best explanation!!! THX

@saurabhtripathi62 5 years ago

very imp point thanks a lot

@sudehashrafi7040 2 years ago

thank you so much, you explained it very well😍

@sandeeppanchal8615 3 years ago

Another scenario: before splitting the data, suppose we 'fit' a vectorizer like BoW or TF-IDF on the whole data set, and after splitting we 'transform' both the train and test sets. Since the vectorizer has already been fitted on the entire data, transforming the test set after the split means data leakage. 'fit_transform' should be done on the train set only after splitting, and only 'transform' on the test set (and on the CV set, if one is created).
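The scenario in the comment above can be sketched in a few lines. This is only an illustration in plain Python, not the video's code: `fit_vocabulary` and `transform` are hypothetical helpers standing in for a vectorizer's fit and transform steps, and the documents are made up.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Learn a token -> column index mapping from the given documents only."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Turn documents into count vectors; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # dicts keep insertion order, so columns follow the fitted index order
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


# Hypothetical data, already split into train and test
train_docs = ["the model overfits", "the data leaks"]
test_docs = ["the test data leaks badly"]

vocab = fit_vocabulary(train_docs)      # fit on the training split only
X_train = transform(train_docs, vocab)  # the "fit_transform" step on train
X_test = transform(test_docs, vocab)    # "transform" only on test
```

Words that appear only in the test documents ("test", "badly") never enter the vocabulary, so nothing about the test split shapes the features the model is trained on.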

@xianglongchen3088 3 years ago

Thank you for sharing.

@kevinmatip9416 2 years ago

The future lies with projects like the Utopia ecosystem because there is a lot of demand for privacy and security of personal data online.

@jijie133 2 years ago

Great video!

@shubhamurolagin5466 4 years ago

Sir, lovely explanation

@Abhi-qi6wm 3 years ago

Good video. An alternate title could be Train-Test contamination.

@nakshatranahar 5 years ago

Great Bro

@jackyfree5756 3 years ago

Thanks for sharing sir!

@robertpatterson6445 3 years ago

Thank you, very helpful!

@bhavyaparikh6933 5 years ago

eagerly waiting for next video!!!!

@Mad20240 5 years ago

my discovery of the moment

@MathPhysicsFunwithGus 10 months ago

Great video!!

@Abhishek-jy4ul 5 years ago

This was very informative thank you sir!!

@bhawnajyoti2000 4 years ago

Nicely explained

@Raja-tt4ll 4 years ago

Very nice video

@ArunKoundinyaParasa 1 year ago

super explanation

@nicolasmarcos5996 3 years ago

Wonderful explanation. Congratulations! I'll subscribe to the channel.

@rushikeshbulbule8120 5 years ago

Excellent👍👏😆

@MrMandarpriya 1 year ago

Thank you very much, Sir, for the explanation. So then what could be the recipe? In Finance/Economics we have the full dataset, so is it better to divide the data based on the time series (date) and then do the usual, or is there another approach?

@thulasikumar770 4 years ago

Well cleared my doubt

@afzalhasan6494 2 years ago

u r legend ❤

@heenagirdher6443 3 years ago

@krish pls explain the concept of data leakage in deep learning

@madhureddy5328 5 years ago

I am a new learner in a data science course. In time series analysis, when we split the data, is the test data chosen in any particular order? My doubt is that when we split the data randomly, there is a chance of getting old data in the test set and new data in the training set, which may lead to a decrease in accuracy, right? Is there a way to split the data so that the test set covers the whole dataset evenly, or so that it covers the latest data, rather than covering only the latest or only the old data? As you are saying, we use the latest data for testing. Can we split in this way? If possible, can you make a video about splitting data?

@dhonidamaka5013 3 years ago

One suggestion is that the test set should be chosen with respect to time, since a time series is not random but temporal. So the split should also happen temporally. For example, if you have built a model on data for Jan-Oct, the test set should be Nov and Dec.
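A minimal sketch of the temporal split suggested above, in plain Python. The monthly rows and the `temporal_split` helper are assumptions made up for the illustration; the cutoff of 1 November mirrors the Jan-Oct / Nov-Dec example in the comment.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Rows dated before `cutoff` go to train; rows on or after it go to test."""
    ordered = sorted(rows, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Hypothetical monthly observations for one year: (date, value)
rows = [(date(2023, m, 1), float(m)) for m in range(1, 13)]

# Train on Jan-Oct, test on Nov-Dec
train, test = temporal_split(rows, cutoff=date(2023, 11, 1))
```

Because every training row is older than every test row, the model is never evaluated on data that precedes what it was trained on, which is the situation a random shuffle can create.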

@AshrafVideos 5 years ago

Hi. What are the differences between data leakage, overfitting of the model, and bias in the training data?

@BiancaAguglia 5 years ago

1. Data leakage happens when your train set and your test set share information they shouldn't, such as a common value for a specific feature that was computed from both. Krish explained it perfectly with the mean example. Let's say you try to create a model that predicts the score a senior high-school student will get on his SAT. Your dataset is the senior-year grades of 1,000 students from different schools. For some of those students, you have missing grades, so you replace the missing values with the average grade of all 1,000 students. After that you separate the dataset into training and test. See how that will be a problem? All the students in the train set and the test set have the same value for their missing grades. Depending on how many missing values you had (in other words, how often you had to use the average to replace missing values), and how important the feature that had missing values was in the training of your model, data leakage can make your model perform very poorly when it is fed new data.
2. Bias is when your model isn't able to pick up all the true patterns. It usually happens because the model is too simple. As a real world example, let's say you're trying to predict if a candidate is good for a ML engineer position. A very simple model would just use the age as a feature and predict that a person over 50 is not a good candidate. That's high bias. 😊 An unbiased model would train on more features.
3. Overfitting is when your model picks up the true patterns in the data but also all the noise and all the outliers. It usually happens because your model is too complex. Using the example above, such a model would train on all the features related to a candidate. I'll be silly and exaggerate here just to make a point, but let's say that, in your train set, you have data like what books the candidates read, what movies they like, whether or not they drink coffee. Most of the data from those features is noise. 😊

@kunadha 2 years ago

@@BiancaAguglia Thank you for your time. you explained so well.

@yashodhansatellite1 5 years ago

Salute

@shalakamhetre8093 4 years ago

Good topic

@haneulkim4902 1 year ago

Thanks for the great video. So when I have great performance on all of the train, valid, and even test datasets, is data leakage the only possible problem? Are there other problems that lead to this? Currently my model is doing too well on all of train, valid, and test; however, I know from experience that performance on the test set shouldn't be this good. I've checked all the data preprocessing steps: I split the data in temporal fashion, and scaled and filled missing values for each of the train, valid, and test sets independently. I'm still getting too high performance on the test set. What other issues can there be?

@SonalAtale 11 months ago

Hi Krish, I have a doubt about time series. I have a dataframe with 2000 datapoints (spectral data, Xs), of which I have Y values for only 30 datapoints, which I am using for model training and testing along with 3 or 4 more similar subsets. Now if I try predicting the remaining 2970 values, will the model be biased, as it is time series data? Or can I consider it as good as an unbiased test set?

@Actanonverba01 4 years ago

Thank you

@user-cw7yi1ew1z 1 month ago

So there is no way to prevent data leakage on cross validation then?

@adeboolusegun1857 5 years ago

What's the difference between data leakage and a data lake? Also, I have not seen any video where you make use of feature selection in practice. I watched the video where you mentioned all the types of feature selection, but there was no practical project. I would like to see how you perform feature selection on a dataset with about 150 features. I beg for this, please do a video on it, sir.

@sabyasachidas347 2 years ago

Thank you :)

@rusiano-kq9ws 3 years ago

I think that in the test set you should still replace missing values with the mean of the train set, not of the test set. Isn't that so? The same applies to scaling: if you applied, for example, standard scaling to the train set, then in the test set you apply the same scaling using the mean and the std of the train set, no?

@nijagunadarshan2529 3 years ago

are you saying we should not preprocess the test data?!

@rusiano-kq9ws 3 years ago

@@nijagunadarshan2529 No. What I am saying is that if in the train set you applied some transformation that took into account the values of other samples (say normalization, or null imputation with the mean, for example), then when you get your hands on the test set you must apply those transformations using values from the train set. This means, for example, filling nulls with the train mean or normalizing using the train mean and std.

@vamsikrishnagannamaneni912 3 years ago

yea true
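The rule discussed in this thread (impute and scale the test set with statistics taken from the train set) can be sketched as follows. This is a plain-Python illustration under simplifying assumptions: the numbers are made up, `fit_stats` and `impute_and_scale` are hypothetical helpers, and the standard deviation is computed from the observed training values before imputation.

```python
from statistics import mean, pstdev


def fit_stats(values):
    """Compute mean and standard deviation from the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return mean(observed), pstdev(observed)


def impute_and_scale(values, mu, sigma):
    """Fill missing entries with mu, then standardise every entry with (mu, sigma)."""
    filled = [mu if v is None else v for v in values]
    return [(v - mu) / sigma for v in filled]


# Hypothetical feature column, already split; None marks a missing value
train = [2.0, 4.0, None, 6.0]
test = [None, 10.0]

mu, sigma = fit_stats(train)                        # statistics come from train only
train_scaled = impute_and_scale(train, mu, sigma)
test_scaled = impute_and_scale(test, mu, sigma)     # test reuses the train statistics
```

The test value 10.0 never influences `mu` or `sigma`, so the evaluation reflects how the same preprocessing would behave on data the model has not seen.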

@ankit689 1 year ago

Shouldn't you replace with train mean rather than test mean for test missing?

@kunalpandya8468 3 years ago

So, outliers handling, feature selection everything should be done after split...????

@ThePentanol 4 years ago

Thanks

@SR-ng5kw 4 years ago

I have a doubt here: what if the train mean and the test mean are totally different? Won't the model learn from the train mean and then fail on the test data? I believe the universal mean is a better value for missing value imputation, as it represents the global distribution. If your test data is totally different from your train data, your model will fail in production anyhow.

@gamer-shub 5 years ago

sir please make practical videos on recurrent neural network

@avinashreddy1911 4 years ago

Nice observations. But what about encoding for categories? If the train data categories (A, B, C) don't contain the test data categories (L, M), we usually combine the train and test data to get all the categories. What should we do if this is a kind of data leakage?

@ashishsaman 2 years ago

Stratified sampling? Also, there is no issue if we perform One hot encoding on the entire data set.
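One alternative to combining train and test for encoding, sketched here under stated assumptions: learn the category list from the train split only and map anything unseen to an all-zero vector. `fit_categories` and `one_hot` are hypothetical helpers written for this illustration; scikit-learn's `OneHotEncoder` offers a comparable option through `handle_unknown="ignore"`.

```python
def fit_categories(values):
    """Record the categories seen in the training split, in sorted order."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 vector; unseen categories become all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical splits matching the comment: train has A, B, C; test also has L, M
train_cats = ["A", "B", "C", "A"]
test_cats = ["B", "L", "M"]

categories = fit_categories(train_cats)   # learned from train only
X_train = one_hot(train_cats, categories)
X_test = one_hot(test_cats, categories)   # L and M map to [0, 0, 0]
```

This keeps the column layout fixed by the training data, at the cost of treating every unseen category identically; whether that trade-off is acceptable depends on the problem.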

@sulagnanandi2024 3 years ago

Hi , can anyone tell me does feature encoding like OneHot encoding and ordinal encoding on the whole data set lead to data leakage?

@manikandannatarajan9665 4 years ago

Krish, thanks for your video. I have one doubt. During model selection and finalization we split the data and find the model. But for deployment, we combine all the data to create the newly trained model. Will data leakage happen there?

@AshishBaidyanathan 2 years ago

I guess K fold cross validation should take care of that

@HappyBelly10 2 years ago

Really informative. Hilarious that you purposely used data leakage to improve Kaggle results. Shows we shouldn't necessarily use winner's code.

@harinishre664 3 years ago

Should we do EDA only on the train set, after the split?

@antifragile01 4 years ago

Don't you think that global values are better representatives when it comes to imputation? Hence the question: how is an initial split justified?

@apoorvshrivastava3544 4 years ago

Sir, I have a big doubt: cross validation can also lead to data leakage, because we have already seen that data in the first iteration, and in the second iteration of K-fold we are given that data again in the test set

@gauthamsreekumar5358 4 years ago

In cross validation, we train K different models (K as in K-fold), each with its own train-valid split. That is, in each iteration we train a new model, predict on the validation set, and get the accuracy. That's why we get K accuracies as output (the length of the output array is K). Then we take the average of all those values.
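The reply above can be made concrete with a small sketch. It shows K contiguous folds in plain Python, with the imputation mean recomputed from each fold's training rows so that no validation value feeds into it. The data and the `k_fold_indices` helper are assumptions made up for the illustration, and model training is left as a comment.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, valid
        start += size


# Hypothetical feature column with gaps; None marks a missing value
values = [1.0, None, 3.0, 4.0, None, 6.0]

fold_means = []
for train_idx, valid_idx in k_fold_indices(len(values), k=3):
    # The imputation statistic is recomputed from this fold's training rows only
    fill = mean(values[i] for i in train_idx if values[i] is not None)
    fold_means.append(fill)
    valid_filled = [fill if values[i] is None else values[i] for i in valid_idx]
    # A fresh model would be trained on the training rows and scored on valid_filled here
```

Each fold gets its own fill value, which is the per-fold equivalent of "split first, then preprocess". Libraries such as scikit-learn achieve the same effect by wrapping the preprocessing and the model in a single pipeline that is refitted for every fold.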

@mehrdadkazemi3969 3 years ago

thnks

@bintangmuhammad7082 3 years ago

So data leakage on missing values will happen if I'm doing imputation to replace the null values, right? What if I just remove the rows with missing values (if it's only a small percentage)? Will data leakage not happen (because I don't do replacements), or will it still happen?

@pritampatra6077 3 years ago

then u may lose some important information if the missing values are more in number.

@shubhammishra-ht2oo 4 years ago

Hello sir , what is data mismatch in machine learning?

@hardikpatel-go2ko 4 years ago

I am a new learner. I used to do missing value handling first and then the train test split, which was the wrong method... so please can you tell us the exact steps from 1 to the last...

@favourchiemelaonoh9869 2 years ago

split first, then handle missing values.

@AdityaBhagavatula-i1w 29 days ago

god tier stuff

@welcometooaudioland7877 1 year ago

This guy deserves an award. If you only knew how difficult it was to find this topic and someone who nails the issue on the head.

@336_saranyamaity8 3 years ago

so thats how some people getting 0 MAE -_- in Kaggle , damn

@raghavendra2096 4 years ago

There is no better way to explain data leakage!!!!!!!!!!!!!!!!!!

@jawad237fx 9 months ago

so I'm supposed to do things the wrong way in Kaggle, hmm, interesting

@devkumaracharyaiitbombay5341 1 month ago

Who all came here from the CNN implementation video in the DL playlist?

@muralikrishna7617 5 years ago

Sir, can I expect a video or a simple message from you about a free website where we can learn hacking? I would be very happy if you reply to this comment, thank you sir