148 - 7 techniques to work with imbalanced data for machine learning in Python

13,967 views

DigitalSreeni

Imbalanced data is part of life! With proper knowledge of the dataset and a few techniques from this video, imbalanced data can be managed easily.
Prerequisites: Pick the right metrics, as overall accuracy does not provide information about the accuracy of individual classes. Look at the confusion matrix and ROC AUC.
Technique 0: Collect more data, if possible.
Technique 1: Pick decision-tree-based approaches, as they work better than logistic regression or SVM. Random Forest is a good algorithm to try, but beware of overfitting.
Technique 2: Up-sample the minority class.
Technique 3: Down-sample the majority class.
Technique 4: Combine over- and under-sampling.
Technique 5: Use penalized learning algorithms that increase the cost of classification mistakes on the minority classes.
Technique 6: Generate synthetic data (SMOTE, ADASYN).
Technique 7: Add appropriate class weights to your deep learning model.
(A short code sketch illustrating a few of these techniques follows the references and download link below.)
References:
imbalanced-learn.org/stable/o...
scikit-learn.org/stable/modul...
Code generated in the video can be downloaded from here: github.com/bnsreenu/python_fo...
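
Below is a minimal sketch of Techniques 2, 3 and 5 on a toy dataset. It is not the exact code from the video: make_classification stands in for the CSV used there, and all variable and column names are illustrative.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced dataset: roughly 95% class 0, 5% class 1.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)
df = pd.DataFrame(X)
df["label"] = y

# Split first and rebalance only the training fold (see the comments below on data leakage).
train_df, test_df = train_test_split(df, test_size=0.25, stratify=df["label"], random_state=42)
majority = train_df[train_df["label"] == 0]
minority = train_df[train_df["label"] == 1]

# Technique 2: up-sample the minority class with replacement.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
train_upsampled = pd.concat([majority, minority_up])

# Technique 3: down-sample the majority class without replacement.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
train_downsampled = pd.concat([majority_down, minority])

# Technique 5 (penalized learning): weight mistakes on the minority class more heavily.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
clf.fit(train_df.drop("label", axis=1), train_df["label"])
print(clf.score(test_df.drop("label", axis=1), test_df["label"]))

Technique 6 (SMOTE/ADASYN) follows the same pattern via imbalanced-learn's fit_resample, applied only to the training fold; a sketch appears further down in the comments.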

Comments: 35
@mayukh_ · 3 years ago
Sorry if I am saying something wrong, but I found some logical errors in the code. Generally, upsampling produces a highly overfit model, especially when used with Random Forest. What you have here is something called data leakage; I have seen it in the code of a lot of experienced developers. Let me explain: you are upsampling the actual dataframe and then splitting it into train and test. What you should do is first split the data into train and test, then perform all the upsampling on the train set only, build the model, and then test it on the test data. You don't touch or modify the test data. This way you actually prevent the data leakage. Because you upsampled first and then split, your model obviously looks better for every class, but I don't think it is a true generalized representation. You can also use stratified splitting when splitting the data, so that all the class labels are represented.
@DigitalSreeni · 3 years ago
Please don't feel sorry for helping people by correcting any mistakes in my videos. I do rely on viewers like you to shed some new perspective on my topics. Balancing classes after splitting does make sense. I should admit I never performed that experiment but it makes logical sense. Thanks for the tip.
@mingzhang4200 · 2 years ago
Fantastic presentation on how to handle imbalanced data in modeling!
@DigitalSreeni · 2 years ago
Glad you enjoyed it!
@himanshu8006 · 3 years ago
Thanks a lot, you were quick and on the spot throughout the video.
@DigitalSreeni · 3 years ago
Glad it helped
@hik381 · 3 years ago
Hey, I have a multi-label classification task (images of different landscapes). I'm wondering how I can (or should) balance my dataset. Some of the labels are in general more common than others, as are various label combinations; e.g. (mountains and desert) appears more often than (forest and desert), and the same goes for combinations of three labels, e.g. (mountains, forest, snow). By the way, I want to use traditional machine learning algorithms (kNN classifier, Logistic Regression with sklearn's MultiOutputClassifier). Do you have any advice?
@matancadeporco · 2 years ago
Hi sir, thank you in advance for all these videos, they are helping me a lot. I'm trying to do semantic segmentation on leaf images whose labels contain 0 for background, 1 for leaf and 2 for the plague. I'm trying to apply class_weight but I'm getting an error when fitting the model (ValueError: `class_weight` not supported for 3+ dimensional targets). How can I overcome this? I already tried sample weights and defining weights manually, but I still receive this error. Thank you.
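
(Not an answer from the video, just a common workaround: Keras only accepts class_weight for 1D/2D targets, so for per-pixel segmentation masks one option is a custom weighted loss. A rough sketch, assuming one-hot encoded masks of shape (batch, height, width, n_classes); the weight values below are purely illustrative.)

import numpy as np
import tensorflow as tf

def weighted_categorical_crossentropy(class_weights):
    # class_weights: one weight per class, e.g. [0.5, 1.0, 5.0] for background/leaf/plague (illustrative)
    w = tf.constant(np.asarray(class_weights, dtype="float32"))
    def loss(y_true, y_pred):
        # y_true and y_pred have shape (batch, height, width, n_classes); y_true is one-hot.
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        return -tf.reduce_mean(tf.reduce_sum(w * y_true * tf.math.log(y_pred), axis=-1))
    return loss

# model.compile(optimizer="adam", loss=weighted_categorical_crossentropy([0.5, 1.0, 5.0]))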
@junaidlatif2881 · 1 year ago
Amazing... Can we use this for regression?
@zeeshankhanyousafzai5229 · 1 year ago
Please make a video on Data Augmentation with GANs
@user-gb6py4re3o · 4 years ago
Thanks for explaining easily.
@DigitalSreeni · 4 years ago
My pleasure
@huanwangyang4458 · 2 years ago
Hi, this is a good video, but I have to say the model trained on the resampled data is set up incorrectly. You should resample X_train & y_train, then train the model, and finally apply this model to the untouched original test data (X_test).
@imadsaddik · 2 months ago
Thank you
@stevenmr1215 · 4 years ago
Thanks for sharing. We know you train first, then predict on a new sample. Here you train a classifier on all the pixels of an image with a segmentation label. Do you then use this trained model to predict another image, or are you just comparing the accuracy of different classifiers here?
@DigitalSreeni · 4 years ago
I use a few labels to train a model and predict a single image accurately. If it is not accurate, I will add more labels. Once I am happy with my segmented image then I use it to train another model and then segment a few more images. If I am happy with them then I use them as masks to segment even more images. Finally, I use all these segmented images as training masks for deep learning. If you want to annotate your labels: www.apeer.com/annotate (It is free)
@muazimran · 3 years ago
How can we do this with 3 RGB channels instead of grayscale?
@yuepengliu664 · 2 years ago
Helps a lot.
@nafaszareee6871 · 2 years ago
Hi, thanks for the video. My image datasets are imbalanced and I used a deep learning method, but I get this error: ValueError: `class_weight` not supported for 3+ dimensional targets. Help me solve the problem 😓😓
@JesseThings · 2 years ago
Hello! Thanks for the good video! I have a question about the train test split with the resampling. In this example you have the train_test_split after resampling. Wouldn't that create bias in testing, and the same with validation? Shouldn't I do the train test split first and then resample only the training data? Thanks!
@GARUDA1992152 · 2 years ago
From my experience, you are right: we should only resample the train data and keep the validation data as is.
@hEmZoRz · 1 year ago
This was bugging me as well, and you're 100% correct. Performing a simple upsampling before the train-test split is particularly problematic, as your test set will (most likely) contain the exact same samples that were used for training the model. Not surprisingly, this will inflate the measured model performance. A similar logic applies to using SMOTE prior to the train-test split: your test set will likely contain synthetic samples that are affected by samples in the train set. That is, information has once again leaked from the train set to the test set. A simple undersampling of the majority class will not have such issues, though.
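
A sketch of the leak-free order described above, using SMOTE (Technique 6) on a toy dataset; variable names and the toy data are illustrative, not from the video:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Split first so the test fold never sees synthetic samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

# Oversample only the training fold.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_train_res))

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_res, y_train_res)
print(classification_report(y_test, clf.predict(X_test)))  # evaluated on the untouched test fold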
@mihretdesta9153 · 1 year ago
Sir, is data augmentation the best technique for imbalanced data?
@smartitbyeng.nareman1573 · 3 years ago
thank you so much
@DigitalSreeni · 3 years ago
You're welcome!
@user-sh5hn2gn1k · 8 months ago
@DigitalSreeni Hi there! Do these libraries (SMOTE, etc.) work for image (computer vision) data?
@DigitalSreeni · 7 months ago
SMOTE is for structured data, not for image data. For images, you can try image augmentation methods.
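
(A rough sketch of image augmentation for an under-represented class with Keras, not from the video; the array path and folder names are placeholders, and the target folder must already exist.)

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

minority_images = np.load("minority_class_images.npy")  # hypothetical (N, H, W, 3) array

aug = ImageDataGenerator(rotation_range=20, width_shift_range=0.1,
                         height_shift_range=0.1, zoom_range=0.1,
                         horizontal_flip=True)

# Write augmented copies to disk until the minority class is roughly balanced.
gen = aug.flow(minority_images, batch_size=32, save_to_dir="augmented/minority",
               save_prefix="aug", save_format="png")
for _ in range(10):   # each call saves one augmented batch
    next(gen)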
@user-sh5hn2gn1k · 7 months ago
@@DigitalSreeni Thanks!
@surajshah4317 · 4 years ago
Dear sir, I have a question: how do I classify a gray mask image into 5 classes and label them for multiclass segmentation?
@DigitalSreeni · 4 years ago
I have videos on multiclass, not sure if you are referring to multiclass. I suspect you are interested in segmenting every pixel of an image into one of 5 regions. If that is the case, look for U-Net segmentation for a deep learning approach. But if you do not have many labels then check out my videos 67 and 67b. For labeling, you can use www.apeer.com/annotate (it is easy and free).
@surajshah4317 · 3 years ago
@@DigitalSreeni Thank you so much sir, yes you understood my concern. I have a mask image with values from 0-4 in each pixel. Should I use one-hot encoding?
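
(If one-hot encoding is needed, e.g. for a softmax/categorical-crossentropy U-Net, a minimal sketch with Keras' to_categorical; the random mask is a stand-in for a real label mask.)

import numpy as np
from tensorflow.keras.utils import to_categorical

mask = np.random.randint(0, 5, size=(256, 256))      # integer-coded mask with values 0-4
mask_one_hot = to_categorical(mask, num_classes=5)   # shape (256, 256, 5)
print(mask_one_hot.shape)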
@eventhatsme · 3 years ago
I have seen that stratified k-fold cross validation is used for imbalanced datasets with traditional machine learning. How will this work out of the box compared to these techniques? Would you say that a combination of over- and undersampling will work the best also for deep learning models?
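
(For context: out of the box, stratified k-fold only preserves the class ratio in every fold; it does not rebalance anything, so it is usually combined with the techniques above rather than replacing them. A toy sketch, not from the video:)

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="roc_auc")
print(scores.mean())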
@arienugroho430 · 2 years ago
I would like to ask, why does n_samples equal 400000? Why not more? Thanks
@DigitalSreeni · 2 years ago
Why not less?
@arienugroho430 · 2 years ago
@@DigitalSreeni Is there a rule or a way to determine n_samples, or should we just match the average class size?