I have a question. With imbalanced data, we generally upsample, i.e. repeat some of the samples of the minority class. In such cases, there is a high chance of the same data being present in both the train and test sets. What should we do then?
@UnfoldDataScience 3 years ago
Good question. That is why upsampling is not a recommended approach in industry. One quick solution is to first split the data into test and train, and then upsample only the train set.
@soumyagupta9301 3 years ago
@UnfoldDataScience Thanks for the reply. Keep up the good work 👍
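The "split first, then upsample train" suggestion in this thread could be sketched as follows. This is only an illustration under assumptions: the toy data, the 80/20 split, and the choice of scikit-learn's `train_test_split` and `resample` are not from the video.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (made up for illustration): 90 negatives, 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# 1. Split first, so the test set only ever contains original rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Upsample the minority class inside the training set only.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_majority, random_state=0,
)

# 3. Recombine into a balanced training set; the test set is untouched.
X_train_bal = np.vstack([X_train[~minority], X_min_up])
y_train_bal = np.concatenate([y_train[~minority], y_min_up])
```

Because the repeated rows are drawn only from `X_train`, no copy of a test row can end up in the training data.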
@soumyaranjansethi1790 3 years ago
Amazing video, sir, thank you. I had a doubt about this before, and now it's clear.
@UnfoldDataScience 3 years ago
Welcome Soumya.
@mouleshm210 3 years ago
Thank you so much for accepting my request and making a video on this topic, sir 👍😇
@UnfoldDataScience 3 years ago
Welcome Moulesh.
@krishnab6444 2 years ago
Simple language and very well explained, sir. Thank you so much!
@UnfoldDataScience 2 years ago
Always welcome
@umeshtiwari800 2 years ago
Very informative
@5minutesfacts876 3 years ago
Nice information, thank you.
@UnfoldDataScience 3 years ago
Welcome :)
@shreyjain6447 3 years ago
So drop_duplicates should be done before splitting, and other kinds of preprocessing should be done after splitting?
@UnfoldDataScience 3 years ago
drop_duplicates: at the beginning. Other kinds of preprocessing: it depends on what we are doing.
@elvykamunyokomanunebo1441 1 year ago
@UnfoldDataScience What if I have repeated measures, e.g. blood samples taken at given intervals, where the measurements are at times exactly the same for a given patient? Wouldn't removing these as duplicates delete valuable information, since they aren't really duplicates?
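One possible way to handle the repeated-measures concern above is to deduplicate on the full record key rather than on the measured value. A minimal sketch, assuming a hypothetical table whose column names (`patient_id`, `visit_time`, `glucose`) are invented for illustration:

```python
import pandas as pd

# Hypothetical repeated-measures table; column names and values are made up.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_time": ["08:00", "12:00", "12:00", "08:00", "12:00"],
    "glucose":    [5.4, 5.4, 5.4, 6.1, 6.3],
})

# Deduplicating on the measurement alone also discards a genuine repeat reading
# (patient 1 had the same value at 08:00 and at 12:00).
by_value = df.drop_duplicates(subset=["patient_id", "glucose"])

# Deduplicating on the full record key removes only the row entered twice.
by_record = df.drop_duplicates(subset=["patient_id", "visit_time", "glucose"])

print(len(df), len(by_value), len(by_record))
```

Here the 12:00 reading for patient 1 appears twice and is a true duplicate, while the identical 08:00 value is a separate measurement that `by_record` keeps.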
@RamanKumar-ss2ro 3 years ago
Very nice content.
@UnfoldDataScience 3 years ago
Thanks.
@jesuswords5814 2 years ago
Good
@BangaruAmarnath 3 years ago
So, in scenario 2, the scaling would have been different if we had done the splitting first and then the preprocessing step. But if we scale the entire dataset before splitting, the test/validation data ends up influencing the train dataset, right?
@UnfoldDataScience 3 years ago
Yes true.
@elvykamunyokomanunebo1441 1 year ago
@UnfoldDataScience The full dataset is a sample that is said to be representative of the population under study, so scaling on the full dataset would give the best approximation of the population's scaled values. Scaling on the train dataset alone gives a noisier approximation of the true population values, which would result in a poorer model that generalizes badly to unseen data. Would you agree?
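The scaling point discussed in this thread can be illustrated with a small sketch. It assumes scikit-learn's `StandardScaler` and synthetic data, neither of which comes from the video; it only shows where the scaling statistics come from in each approach.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Split first: the mean and standard deviation come from the train rows only,
# and the same fitted scaler is then applied to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Scale first: the statistics also depend on the rows that later become test data.
leaky_scaler = StandardScaler().fit(X)

print("train-only mean:", scaler.mean_)
print("full-data mean: ", leaky_scaler.mean_)
```

The two sets of statistics differ slightly. Fitting on the train split keeps the test rows in the same position as genuinely unseen data, so the evaluation reflects what the model would face after deployment.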
@elvykamunyokomanunebo1441 1 year ago
Hi there. In production, models deteriorate! When they do, no one says it is because of data leakage. Instead we say it is because of concept drift or data drift. Then we retrain the model, adjusting for the new data or answering a different question. Is data leakage just fast model deterioration?
@sadhnarai8757 3 years ago
Good video.
@UnfoldDataScience 3 years ago
Thank you.
@sandipansarkar9211 3 years ago
FINISHED WATCHING
@harshyadav1190 3 years ago
Data leakage and overfitting look similar. Is there any difference between them?
@UnfoldDataScience 3 years ago
When data leakage occurs, it may lead to overfitting. It is a cause-and-effect relationship.
@colabwork1910 3 years ago
Any daily life example of data leakage?
@UnfoldDataScience 3 years ago
If you take data where there is a lot of variation, this may happen. For example, the household income of people in Bangalore.
@prajwalbhandary17 3 years ago
Hi Aman, I had a question: how do we check for bias in data (not selection bias)?
@UnfoldDataScience 3 years ago
That's a good question. Let me try covering it in the next video, Prajwal.
@prajwalbhandary17 3 years ago
@UnfoldDataScience Thanks for responding.
@heenagirdher6443 3 years ago
Please tell me how to check for data leakage in image classification, where we already have separate folders for train and test data.
@UnfoldDataScience 3 years ago
You can't measure data leakage directly. You just observe the steps you are doing.
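One step that can be checked mechanically for the image-folder case is whether the same file sits in both folders. The sketch below is an assumption-laden illustration, not a method from the video: the helper names `file_hashes` and `shared_files` are hypothetical, and it only catches byte-identical files, not resized or re-encoded copies.

```python
import hashlib
from pathlib import Path


def file_hashes(folder):
    """Map SHA-256 digest -> list of files under `folder` with that content."""
    hashes = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            hashes.setdefault(digest, []).append(path)
    return hashes


def shared_files(train_dir, test_dir):
    """Return (train_paths, test_paths) pairs for byte-identical files."""
    train = file_hashes(train_dir)
    test = file_hashes(test_dir)
    return [(train[h], test[h]) for h in train.keys() & test.keys()]
```

Calling `shared_files("data/train", "data/test")` (paths are placeholders) returns an empty list when no exact copies are shared between the two folders.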
@denzelomondi6421 2 years ago
👏
@peacefullmusic8374 3 years ago
And how do we determine the data leakage element?
@peacefullmusic8374 3 years ago
How do we tackle data leakage?
@mayurgite4050 2 years ago
1. If the data has duplicates, drop them at the start.
2. Split the dataset into train and test, then go on to the preprocessing steps.
3. Keep watch during model training so that no data gets exchanged between the train and test phases.
4. The best way to measure data leakage/overfitting/underfitting is to perform k-fold cross-validation.
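Steps 2 to 4 above can be combined by putting the preprocessing inside a pipeline, so that each cross-validation fold fits the scaler on its own training part only. This is a sketch under assumptions: the synthetic data, the logistic regression model, and the five folds are illustrative choices, not from the comment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, made up for illustration.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler is part of the model, so it is refit on the training part of every fold
# and never sees that fold's held-out rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("fold accuracies:", scores)
print("mean accuracy:  ", scores.mean())
```

Note that cross-validation estimates generalization; it will not by itself reveal leakage that is already baked into the features, such as a column derived from the target.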
@rahuljaiswal141 3 years ago
Hi Aman, a small request. Can you make a video on business use cases for various kinds of data, like e-commerce, insurance, and pharmaceutical data? That is, what business use cases can be developed around these kinds of data in the real world?
@UnfoldDataScience 3 years ago
Sure, there are some videos already. You can search for "use case in customer and retail analytics", "use case in manufacturing", etc.