You are presenting like you really want everyone to know the significance of DL. I like this attitude! Thank you for sharing.
@nitinpatil66113 жыл бұрын
Jjj
@erfanm52763 жыл бұрын
I love your passion in teaching, man, well done to you. It was a great lesson: concise and precise.
@mthandenimathabelacap54662 ай бұрын
Very intuitive, and good that you bring in knowledge about the implications for the production stage of the model. Great.
@aminderkhungura79293 жыл бұрын
Thanks!
@mahdikhalilnejad4 жыл бұрын
Hello, thanks for your excellent explanation. I watched several sites and videos until finally your video came to the rescue and helped me understand this... Don't get tired; you put a lot of energy into it.
@sdc85915 жыл бұрын
Well, that was very quick. Just some time back in the live Q&A someone requested this information, and it is here within just a few hours!! 👍
@ismailyt66272 жыл бұрын
That was what I was looking for: an excellent explanation with simplicity and diagrams. Thanks a lot.
@kianestrera-hr5vt7 ай бұрын
Wow, and that's it: most of my questions and the things I needed to clarify about data leakage are answered within 10 minutes. Thank you so much; I wish your channel grows much further.
@saurabhmukherjee8505 жыл бұрын
I'm a big fan of yours, dear Krish... Your content and way of presentation are simply flawless... Keep on doing the good work... God bless you.
@FrancisKyalo-hb5pu2 жыл бұрын
I really enjoyed how well you explained the concept of data leakage, Thanks Krish Naik
@Neojs5654 жыл бұрын
Kaggle brought me to this video....wow man...really eye-opening..!
@GurtDovletov9 ай бұрын
Thank you so much for such a good video; it is easy to understand and explained with easy examples.
@shivanikhandelwal83243 жыл бұрын
Thank you. This video solved all my doubts...
@4urandy3 жыл бұрын
This is so legit ... Absolutely production knowledge
@TuanKhai298 Жыл бұрын
best teacher ever, thank youuu so much
@muralikrishna76175 жыл бұрын
Hello sir, just now I came across one of your videos, and what you are giving is good.
@manishagyaan3 жыл бұрын
Great explanation Krish... You are doing great work... Stay blessed always
@Jobic-10 Жыл бұрын
Take these medals 🏅🥇🏅 🎉 Subscribed ❤
@victorahaji9717 Жыл бұрын
You are a wonderful instructor.
@faustopf-. Жыл бұрын
Super informative and very well explained!
@nitayg13265 жыл бұрын
3 valid observations - never thought about it!
@vinaytomar58275 жыл бұрын
Thank you sir for these important instructions 🙌👌
@alexanderdushenin70354 жыл бұрын
Two days ago I ran into this problem in a competition. Thanks for the explanation :)
@claudineievangelistanascim25624 жыл бұрын
Thanks 🙏 for sharing. Very Important Point ! Very much helpful !
@chaitanyakrishna58735 жыл бұрын
Excellent Krish
@laythherzallah34932 жыл бұрын
Thank you... very fruitful information.
@vaibhavkhobragade97733 жыл бұрын
Thank you! This is a very important topic you've covered.
@sanjaytallolli5 жыл бұрын
Very much helpful...
@lotmoretolearn-dataanalyti93123 жыл бұрын
Very nice explanation, Krish... Easy to understand. I always follow your videos. Please do more such videos on data science, and on data engineering as well.
@khadidjabenchaira60043 жыл бұрын
Thank you so much from Algeria ...
@ameermohamedr49825 жыл бұрын
Hello Krish, please make one video regarding data leakage with a simple dataset, for our deeper understanding.
@marekdudzik67793 жыл бұрын
Best explanation!!! THX
@saurabhtripathi625 жыл бұрын
Very important point, thanks a lot.
@sudehashrafi70402 жыл бұрын
thank you so much, you explained it very well😍
@sandeeppanchal86153 жыл бұрын
Another scenario: before splitting the data, suppose we 'fit' a vectorizer like BoW or TF-IDF on the whole dataset, and after splitting we 'transform' both the train and test sets. Since the entire data has already been fitted to the vectorizer, transforming the test set after the split means data leakage. Only after splitting should 'fit_transform' be done on the train set and 'transform' on the test set (and also on the CV set, if created).
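A minimal sketch of that point, assuming scikit-learn's TfidfVectorizer and train_test_split (the toy documents and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = ["free money now", "meeting at noon", "win a free prize", "lunch at noon"]
labels = [1, 0, 1, 0]

# Split first, so the test documents never influence the vocabulary or the IDF weights.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=42
)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)  # fit_transform on the train split only
X_test_vec = vectorizer.transform(X_test)        # transform (never fit) on the test split
```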
@xianglongchen30883 жыл бұрын
Thank you for sharing.
@jijie1332 жыл бұрын
Great video!
@shubhamurolagin54664 жыл бұрын
Sir, lovely explanation.
@Abhi-qi6wm3 жыл бұрын
Good video. An alternate title could be Train-Test contamination.
@nakshatranahar5 жыл бұрын
Great Bro
@jackyfree57563 жыл бұрын
Thanks for sharing sir!
@robertpatterson64453 жыл бұрын
Thank you, very helpful!
@bhavyaparikh69335 жыл бұрын
eagerly waiting for next video!!!!
@Mad202405 жыл бұрын
my discovery of the moment
@MathPhysicsFunwithGus10 ай бұрын
Great video!!
@Abhishek-jy4ul5 жыл бұрын
This was very informative thank you sir!!
@bhawnajyoti20004 жыл бұрын
Nicely explained
@Raja-tt4ll4 жыл бұрын
Very nice video
@ArunKoundinyaParasa Жыл бұрын
super explanation
@nicolasmarcos59963 жыл бұрын
Wonderful explanation. Congratulations! I'll subscribe to the channel.
@rushikeshbulbule81205 жыл бұрын
Excellent👍👏😆
@MrMandarpriya Жыл бұрын
Thank you very much, sir, for the explanation. So then what could be the recipe? In finance/economics we have the full dataset, so is it better to divide the data based on the time series (date) and then do the usual, or is there another approach?
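One common recipe for that situation is a time-ordered split rather than a random one; here is a small sketch using scikit-learn's TimeSeriesSplit (the toy series is hypothetical):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 24 time-ordered observations; in practice these would be dated rows sorted by date.
X = np.arange(24).reshape(-1, 1)
y = np.arange(24)

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index, so no future information leaks into training.
    print(train_idx.max(), "<", test_idx.min())
```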
@thulasikumar7704 жыл бұрын
Well cleared my doubt
@afzalhasan64942 жыл бұрын
You are a legend ❤
@heenagirdher64433 жыл бұрын
@krish pls explain the concept of data leakage in deep learning
@madhureddy53285 жыл бұрын
I am a new learner in a data science course. In time series analysis, when we split the data, is the test data split as required? My doubt is that when we split the data randomly, there is a chance of getting old data in the test set and new data in the training set, and that may lead to a decrease in accuracy, right? Is there a way to split the data so that it is covered evenly across the whole dataset, or so that the latest data is used as the test data, rather than mixing only old or only new data? You are saying we should keep the latest data for testing; can we split it that way? If possible, can you make a video about splitting data?
@dhonidamaka50133 жыл бұрын
One suggestion: the test set should be chosen with respect to time. A time series is not random but temporal, so the split should also happen temporally. For example, if you have built a model on data from January to October, the test set should be November and December.
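A small illustration of that temporal split, assuming a pandas DataFrame with a date column (the column names and numbers are made up):

```python
import pandas as pd

# Hypothetical monthly data for one year.
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=12, freq="MS"),
    "sales": range(12),
})
df = df.sort_values("date")

# Train on Jan-Oct, test on Nov-Dec: the split follows time instead of a random shuffle.
cutoff = pd.Timestamp("2023-11-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
```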
@AshrafVideos5 жыл бұрын
Hi. What are the differences between data leakage, overfitting of the model, and bias in the training data?
@BiancaAguglia5 жыл бұрын
1. Data leakage happens when your train set and your test set share a common value for a specific feature. Krish explained it perfectly with the mean example. Let's say you try to create a model that predicts the score a senior high-school student will get on the SAT. Your dataset is the senior-year grades of 1,000 students from different schools. For some of those students you have missing grades, so you replace the missing values with the average grade of all 1,000 students. After that you separate the dataset into training and test. See how that will be a problem? All the students in the train set and the test set have the same value for their missing grades. Depending on how many missing values you had (in other words, how often you had to use the average to replace missing values), and how important the feature with missing values was in the training of your model, data leakage can make your model perform very poorly when it is fed new data.

2. Bias is when your model isn't able to pick up all the true patterns. It usually happens because the model is too simple. As a real-world example, let's say you're trying to predict whether a candidate is good for an ML engineer position. A very simple model would just use age as a feature and predict that a person over 50 is not a good candidate. That's high bias. 😊 An unbiased model would train on more features.

3. Overfitting is when your model picks up the true patterns in the data but also all the noise and all the outliers. It usually happens because your model is too complex. Using the example above, a model would train on all the features related to a candidate. I'll be silly and exaggerate here just to make a point, but let's say that, in your train set, you have data like what books the candidates read, what movies they like, whether or not they drink coffee. Most of the data from those features is noise. 😊
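A minimal sketch of point 1 using scikit-learn's SimpleImputer, contrasting the leaky and the safe order of operations (the toy grades are invented):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[80.0], [np.nan], [60.0], [np.nan], [90.0], [70.0]])  # toy grades with gaps
y = np.array([1, 0, 1, 0, 1, 1])

# Leaky order: fitting the imputer on all rows lets test rows shape the mean used in training.
# leaky = SimpleImputer(strategy="mean").fit(X)

# Safe order: split first, fit the imputer on the train split only, then reuse it on the test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
```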
@kunadha2 жыл бұрын
@@BiancaAguglia Thank you for your time. You explained it so well.
@yashodhansatellite15 жыл бұрын
Salute
@shalakamhetre80934 жыл бұрын
Good topic
@haneulkim4902 Жыл бұрын
Thanks for the great video. When I have great performance on the train, validation, and even test datasets, is data leakage the only possible problem? Are there other problems that lead to such an issue? Currently my model is doing too well on all of the train, validation, and test datasets, but I know from experience that performance on the test set shouldn't be this good. I've checked all the data preprocessing steps: I split the data in a temporal fashion and scaled or filled missing values for the train, validation, and test sets independently. I'm still getting too-high performance on the test set. What other issues could there be?
@SonalAtale11 ай бұрын
Hi Krish, I had a doubt about time series. I have a dataframe with 2000 datapoints (spectral data, Xs), of which I have Y values for only 30 datapoints, which I am using for model training and testing along with 3-4 more similar subsets. Now, if I try predicting the remaining 2970 values, will the model be biased because it is time series data? Or can I consider it as good as an unbiased test set?
@Actanonverba014 жыл бұрын
Thank you
@user-cw7yi1ew1zАй бұрын
So there is no way to prevent data leakage on cross validation then?
@adeboolusegun18575 жыл бұрын
What's the difference between data leakage and a data lake? Also, I have not seen any video where you use feature selection in practice. I watched the video where you mentioned all the types of feature selection, but there was no practical project. I would like to see how you perform feature selection on a dataset with about 150 features. I beg for this; please do a video on it, sir.
@sabyasachidas3472 жыл бұрын
Thank you :)
@rusiano-kq9ws3 жыл бұрын
I think that in the test set you should still replace with the mean of the train set, not the test set. Isn't that so? The same applies to scaling: if you applied, for example, standard scaling to the train set, then on the test set you would apply the same scaling using the mean and the std of the train set, no?
@nijagunadarshan25293 жыл бұрын
are you saying we should not preprocess the test data?!
@rusiano-kq9ws3 жыл бұрын
@@nijagunadarshan2529 No. What I am saying is that if, on the train set, you applied some transformation that involved or took into account the values of other samples (say normalization, or null imputation with the mean), then when you get your hands on the test set you must apply those transformations using values from the train set. This means, for example, filling nulls with the train mean, or normalizing using the train mean and std.
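A short sketch of that idea with scikit-learn's StandardScaler; the toy numbers are invented, and the point is only where the statistics come from:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0]])  # deliberately outside the training range

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std are learned from the train split only
X_test_scaled = scaler.transform(X_test)        # the same train statistics are reused on the test split
```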
@vamsikrishnagannamaneni9123 жыл бұрын
yea true
@ankit689 Жыл бұрын
Shouldn't you replace missing values in the test set with the train mean rather than the test mean?
@kunalpandya84683 жыл бұрын
So outlier handling, feature selection, everything should be done after the split?
@ThePentanol4 жыл бұрын
Thanks
@SR-ng5kw4 жыл бұрын
I have a doubt here: what if the train mean and the test mean are totally different? Won't the model learn from the train mean and then fail on the test data? I believe the overall mean is a better value for missing-value imputation, as it represents the global distribution. If your test data is totally different from your train data, your model will fail in production anyhow.
@gamer-shub5 жыл бұрын
Sir, please make practical videos on recurrent neural networks.
@avinashreddy19114 жыл бұрын
Nice observations. But what about encoding for categories? If the train data categories (A, B, C) don't contain the test data categories (L, M), we usually combine the train and test data to get all the categories. What should we do? Is this a kind of data leakage?
@ashishsaman2 жыл бұрын
Stratified sampling? Also, there is no issue if we perform one-hot encoding on the entire dataset.
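One way to handle categories unseen in training without fitting on train + test together is scikit-learn's handle_unknown option; a sketch assuming a recent scikit-learn (where the dense-output flag is named sparse_output):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({"city": ["A", "B", "C", "A"]})
test_df = pd.DataFrame({"city": ["L", "M", "B"]})  # L and M never appear in the train split

# Fit on the train split only; unseen test categories become all-zero rows instead of errors.
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
train_enc = encoder.fit_transform(train_df[["city"]])
test_enc = encoder.transform(test_df[["city"]])
```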
@sulagnanandi20243 жыл бұрын
Hi, can anyone tell me whether feature encoding like one-hot encoding and ordinal encoding on the whole dataset leads to data leakage?
@manikandannatarajan96654 жыл бұрын
Krish, thanks for your video. I have one doubt. During model selection and finalization we split the data and find the model, but for deployment we combine all the data to create the newly trained model. Will data leakage happen there?
@AshishBaidyanathan2 жыл бұрын
I guess K-fold cross-validation should take care of that.
@HappyBelly102 жыл бұрын
Really informative. Hilarious that you purposely used data leakage to improve Kaggle results. Shows we shouldn't necessarily use winner's code.
@harinishre6643 жыл бұрын
Should we do EDA only on the train set, after the split?
@antifragile014 жыл бұрын
Don't you think that global values are better representatives when it comes to imputation? Hence the question: how is an initial split justified?
@apoorvshrivastava35444 жыл бұрын
Sir, I have a big doubt: cross-validation can also lead to data leakage, because we have already seen that data in the first iteration, and in a later iteration of K-fold we put that same data in the test set.
@gauthamsreekumar53584 жыл бұрын
In cross-validation, we train K different models (in K-fold), each with its own train-validation split. That is, in each iteration we train a new model, predict on the validation set, and get the accuracy. That's why we get K accuracies (the length of the output array is K) as output. Then we take the average of all those values.
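To keep cross-validation itself leak-free, one option is to wrap the preprocessing in a Pipeline so it is re-fit inside every fold; a sketch using scikit-learn's built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so each fold re-fits it on that fold's training
# portion only; the held-out fold never influences the scaling statistics.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```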
@mehrdadkazemi39693 жыл бұрын
Thanks.
@bintangmuhammad70823 жыл бұрын
So data leakage with missing values happens if I'm doing imputation to replace the null values, right? What if I just remove the rows with missing values (if it's only a small percentage)? Will data leakage then not happen (because I don't do replacements), or will it still happen?
@pritampatra60773 жыл бұрын
Then you may lose some important information, if the missing values are large in number.
@shubhammishra-ht2oo4 жыл бұрын
Hello sir, what is data mismatch in machine learning?
@hardikpatel-go2ko4 жыл бұрын
I am a new learner. I was doing missing-value handling first and then the train-test split, which was the wrong method. So please, can you tell us the exact steps from first to last?
@favourchiemelaonoh98692 жыл бұрын
split first, then handle missing values.
@AdityaBhagavatula-i1w29 күн бұрын
god tier stuff
@welcometooaudioland7877 Жыл бұрын
This guy deserves an award. If you only knew how difficult it was to find this topic and someone who nails the issue on the head.
@336_saranyamaity83 жыл бұрын
So that's how some people are getting 0 MAE -_- on Kaggle, damn.
@raghavendra20964 жыл бұрын
There is no better way to explain data leakage!
@jawad237fx9 ай бұрын
So I'm supposed to do things the wrong way on Kaggle? Hmm, interesting.
@devkumaracharyaiitbombay5341Ай бұрын
Who else came here from the CNN implementation in the DL playlist?
@muralikrishna76175 жыл бұрын
Sir, can I expect a video or a simple message from you about a free website where we can learn hacking? I would be very happy if you replied to this comment. Thank you, sir.