Data Leakage in Machine Learning | Understanding Data Leakage with example in machine learning

Views: 6,258

Unfold Data Science

Days ago

Comments: 41

@rezamanekia8974 8 months ago

very clear and detailed explanation 🥰

@soumyagupta9301 3 years ago

I have a question. With imbalanced data, we generally upsample, i.e., repeat some of the samples of the minority class. In that case, there is a high chance of the same rows being present in both the train and test data. What should we do in such cases?

@UnfoldDataScience 3 years ago

Good question. That is why upsampling is not a recommended approach in industry. One quick solution is to first split into test and train, and then upsample only the train set.
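
The "split first, then upsample train" idea can be sketched as below. This is a minimal illustration using only the Python standard library, not the channel's own code; the function name `split_then_upsample`, the toy `(row_id, label)` rows, and the 25% test fraction are all assumptions made for the example.

```python
import random

def split_then_upsample(rows, label_of, test_fraction=0.25, seed=0):
    """Split first, then upsample smaller classes in the training part only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # Group the training rows by class label.
    by_class = {}
    for row in train:
        by_class.setdefault(label_of(row), []).append(row)

    # Repeat randomly chosen rows of each smaller class until it
    # matches the size of the largest class. The test set is untouched.
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced, test

# Hypothetical toy data: (row_id, label), 90 negatives and 10 positives.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]
train, test = split_then_upsample(data, label_of=lambda r: r[1])
```

Because the repeated rows are drawn only from the training part, no original row can appear on both sides of the split, which is the leak the question describes.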

@soumyagupta9301 3 years ago

@UnfoldDataScience thanks for the reply... Keep up your good work 👍

@soumyaranjansethi1790 3 years ago

Amazing video, sir, thank you. I had a doubt about this before; now it's clear.

@UnfoldDataScience 3 years ago

Welcome Soumya.

@mouleshm210 3 years ago

Thank you so much for accepting my request and making a video on this topic, sir 👍😇

@UnfoldDataScience 3 years ago

Welcome Moulesh.

@krishnab6444 2 years ago

Simple language and very well explained, sir. Thank you so much!

@UnfoldDataScience 2 years ago

Always welcome

@umeshtiwari800 2 years ago

Very informative

@5minutesfacts876 3 years ago

Nice information, thank you.

@UnfoldDataScience 3 years ago

Welcome :)

@shreyjain6447 3 years ago

So drop_duplicates should be done before splitting, and other kinds of preprocessing should be done after splitting?

@UnfoldDataScience 3 years ago

drop_duplicates: at the beginning. For other kinds of preprocessing, it depends on what we are doing.

@elvykamunyokomanunebo1441 a year ago

@UnfoldDataScience What if I have repeated measures, e.g. blood samples taken at given intervals, where the measurements are at times exactly the same for a given patient? Wouldn't removing these as duplicates delete valuable information that isn't actually duplicated?
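
One way to look at the repeated-measures concern is sketched below with pandas. The column names `patient_id`, `hour`, and `glucose` and the values are made up for illustration. The assumption is that the table keeps a column identifying when each measurement was taken; in that case a plain `drop_duplicates()` compares whole rows and removes only true re-entries, while restricting the comparison with `subset=` to the measurement columns would discard genuine repeats.

```python
import pandas as pd

# Hypothetical repeated-measures table: one row per patient per time point.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "hour":       [0, 6, 6, 0, 0],
    "glucose":    [5.1, 5.1, 5.1, 6.0, 6.0],
})

# Whole-row comparison: only rows identical in every column are dropped,
# so patient 1's equal readings at hour 0 and hour 6 both survive.
deduped = df.drop_duplicates()

# Ignoring the time column treats equal readings as duplicates and
# throws away a genuine repeated measurement.
too_aggressive = df.drop_duplicates(subset=["patient_id", "glucose"])

print(len(df), len(deduped), len(too_aggressive))
```

If the data has no time or visit column at all, identical rows cannot be told apart from accidental re-entries, and whether to drop them becomes a domain decision rather than a mechanical step.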

@RamanKumar-ss2ro 3 years ago

Very nice content.

@UnfoldDataScience 3 years ago

Thanks.

@jesuswords5814 2 years ago

Good

@BangaruAmarnath 3 years ago

So, in scenario 2, the scaling would have been different if we had done the splitting first and then the preprocessing step. But if we scale the entire dataset before splitting, the test/validation data influences the train dataset, right?

@UnfoldDataScience 3 years ago

Yes true.
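
The difference between the two orderings can be seen in a small standard-library sketch. The numbers and the helper names `fit_scaler` and `transform` are assumptions for illustration. The scaler's mean and standard deviation are "fitted" on the training values only and then reused for the test values; fitting on train and test together gives different statistics, which is how test information reaches the training data.

```python
import statistics

def fit_scaler(values):
    """Learn the scaling statistics (mean, population std) from the given values."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously learned statistics (assumes std > 0)."""
    return [(v - mean) / std for v in values]

# Hypothetical feature values after the split.
train = [10.0, 12.0, 14.0, 16.0]
test = [100.0, 120.0]

# Leak-free: statistics come from train only and are reused for test.
mean_train, std_train = fit_scaler(train)
train_scaled = transform(train, mean_train, std_train)
test_scaled = transform(test, mean_train, std_train)

# Leaky: statistics computed on train + test shift toward the test values.
mean_all, std_all = fit_scaler(train + test)

print(mean_train, mean_all)
```

With scikit-learn the same pattern is usually written as calling `fit` on the training split and `transform` on both splits.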

@elvykamunyokomanunebo1441 a year ago

@UnfoldDataScience The full dataset is a sample that is said to be representative of the population under study, so scaling on the full dataset would give the best approximation of the population's scaled values, while scaling on the train dataset alone would give a noisier approximation of the true population values, resulting in a poorer model that generalizes badly to unseen data. Would you agree?

@elvykamunyokomanunebo1441 a year ago

Hi there. In production, models deteriorate! When they do, no one says it's because of data leakage. Instead we say it is because of concept drift or data drift. Then we retrain the model, adjusting for the new data or answering a different question. Is data leakage just fast model deterioration?

@sadhnarai8757 3 years ago

Good video.

@UnfoldDataScience 3 years ago

Thank you.

@sandipansarkar9211 3 years ago

FINISHED WATCHING

@harshyadav1190 3 years ago

Data leakage and overfitting look similar. Is there any difference between them?

@UnfoldDataScience 3 years ago

When data leakage occurs, it may lead to overfitting. It is a cause-and-effect relationship.

@colabwork1910 3 years ago

Any daily life example of data leakage?

@UnfoldDataScience 3 years ago

If you take data where there is a lot of variation, this may happen. For example, the household income of people in Bangalore.

@prajwalbhandary17 3 years ago

Hi Aman, I had a question: how do you check for bias in data (not selection bias)?

@UnfoldDataScience 3 years ago

That's a good question. Let me try covering it in the next video, Prajwal.

@prajwalbhandary17 3 years ago

@UnfoldDataScience thanks for responding

@heenagirdher6443 3 years ago

Please tell me how to check for data leakage in image classification where we already have separate folders of train and test data.

@UnfoldDataScience 3 years ago

You can't measure data leakage directly. You can only review the steps you are performing.
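
One narrow check that can still be automated for separate train and test folders is looking for files with identical bytes on both sides. The sketch below is an illustration under stated assumptions, not a leakage detector: the helper names `file_digest` and `find_shared_files` and the demo folder layout are hypothetical, and the check only catches exact copies (it misses resized, re-encoded, or near-duplicate images).

```python
import hashlib
import tempfile
from pathlib import Path

def file_digest(path):
    """SHA-256 of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_shared_files(train_dir, test_dir):
    """Return (test_file, [matching_train_files]) for byte-identical files."""
    train_hashes = {}
    for path in sorted(Path(train_dir).rglob("*")):
        if path.is_file():
            train_hashes.setdefault(file_digest(path), []).append(path)

    overlaps = []
    for path in sorted(Path(test_dir).rglob("*")):
        if path.is_file():
            digest = file_digest(path)
            if digest in train_hashes:
                overlaps.append((path, train_hashes[digest]))
    return overlaps

# Demo on a throwaway folder tree with made-up file contents.
with tempfile.TemporaryDirectory() as root:
    train_dir = Path(root, "train", "cats")
    test_dir = Path(root, "test", "cats")
    train_dir.mkdir(parents=True)
    test_dir.mkdir(parents=True)
    (train_dir / "a.jpg").write_bytes(b"image-one")
    (train_dir / "b.jpg").write_bytes(b"image-two")
    (test_dir / "c.jpg").write_bytes(b"image-one")    # same bytes as a.jpg
    (test_dir / "d.jpg").write_bytes(b"image-three")

    overlaps = find_shared_files(Path(root, "train"), Path(root, "test"))
    leaked_names = [(t.name, [p.name for p in matches]) for t, matches in overlaps]

print(leaked_names)
```

An empty result does not prove the split is clean; reviewing how the folders were produced, as the reply says, is still needed.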

@denzelomondi6421 2 years ago

👏

@peacefullmusic8374 3 years ago

And how do you determine which element is causing the data leakage?

@peacefullmusic8374 3 years ago

How do you tackle data leakage?

@mayurgite4050 2 years ago

1. If the data has duplicates, drop them at the start.
2. Break the dataset into train and test, then go for the preprocessing step.
3. Keep watch on model training so that no data gets exchanged between the train and test phases.
4. The best way to measure data leakage/overfitting/underfitting is to perform k-fold cross-validation.
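
Steps 2 to 4 can be combined by putting the preprocessing inside a scikit-learn pipeline and cross-validating the pipeline as a whole. This is a minimal sketch under assumptions: the synthetic dataset from `make_classification`, the choice of `StandardScaler` plus `LogisticRegression`, and 5 folds are all illustrative. The point is that the scaler is refitted on the training part of every fold, so the held-out part never influences it. Cross-validation does not detect leakage by itself; it only gives an honest estimate when every fitted step lives inside the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling is part of the model, so it is fitted per fold on training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(scores.mean(), scores.std())
```

A large gap between these fold scores and the score on a truly untouched test set is one practical hint that something leaked in an earlier step.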

@rahuljaiswal141 3 years ago

Hi Aman, a small request. Can you make a video on business use cases for various kinds of data, such as e-commerce, insurance, and pharmaceutical data? That is, what business use cases can be developed around these kinds of data in the real world?

@UnfoldDataScience 3 years ago

Sure, there are some videos already. You can search for "use case in customer and retail analytics", "use case in manufacturing", etc.