I ran into the same dataset-imbalance problem and researched different methods to tackle it. Here is the shortlist I put together, which might help you.

Possible solutions:
1. Make changes in the algorithm
• Adjust the class weights so the model becomes more sensitive to the minority class
• Adjust the decision threshold (we can choose it from the PR curve)
• Penalize errors on the minority class by setting class_weight='balanced'
2. Set the minority examples aside and treat the data as a single class
• Here we treat the problem as "anomaly detection" instead of classification. For anomaly detection, "Isolation Forest" tends to give promising results
3. Balance the dataset by sampling
• Undersample
• Oversample & SMOTE
4. Ensemble learning with downsampling
• It bootstraps different samples, each time balancing the classes by undersampling the majority class, and then aggregates the results by voting
5. Use other techniques
• Methods such as Tomek links (which remove the majority-class example from each pair of nearest neighbours belonging to opposite classes, to sharpen the class boundary)
• Focal loss

I also looked through Kaggle notebooks, where people have found that XGBoost slightly outperforms other algorithms, although it may still need different class weights.
- This was my cheat sheet of the 5 ways. Share your thoughts!!
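For the class-weight idea in point 1, here is a minimal sketch of how "balanced" weights are commonly computed: n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight='balanced'. The function name and the toy labels are made up for illustration; this is not taken from the video.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy labels: 90 majority (0) and 10 minority (1) examples.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)  # the minority class receives the larger weight
```

A mistake on a minority example then costs more in the loss than a mistake on a majority example, which is what makes the model more sensitive to the minority class.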
@UnfoldDataScience 2 years ago
Very good explanation, and thanks for putting your learning here. I will pin this comment on top for others' benefit. My view: Data Science is all about trying, experimenting, failing and learning. Then something very good comes up.
@enchanted_swiftie 2 years ago
@UnfoldDataScience I won't lie, when I started watching your videos, your explanations made things much simpler. You know, I used to freak out (sorry for the words) on hearing about DBSCAN, hierarchical clustering and what not, but when I see those topics explained by you, I feel comfortable that I will now understand them. You explain so simply yet accurately, without missing the important things. PS: I was introduced to the assumptions of linear regression by your channel. Before that I knew the model, but only then came to know that there is something called "assumptions" and how important they are!! They were totally missed by the instruction in online courses! Your channel is a huge contribution to the data science community on YT.
@KastijitBabar 7 months ago
You are the best Data Science And Machine Learning Teacher I have ever seen. Thanks a lot!!
@UnfoldDataScience 7 months ago
You are welcome!
@sreebvmcreation9388 5 months ago
Thank you sir, I was searching for methods for imbalanced data, and I finally found them in your video. Thank you so much once again. Of all the methods, which one is the best?
@karthebans248 2 years ago
Learned new things about the balancing of data sets for Imbalanced data sets. Thanks.
@UnfoldDataScience 2 years ago
Welcome.
@zahedinima732 2 years ago
Such a clear and concise explanation. Thank you, Aman!
@UnfoldDataScience 2 years ago
Thanks a lot.
@nivednambiar6845 2 years ago
An important concept when dealing with classification. Thanks for sharing, Aman 👍👍
@UnfoldDataScience 2 years ago
Thanks Nived.
@mamataparab9803 2 years ago
Hello Aman, this is the third time I have watched this video, simply to learn your way of explaining things. Is it possible for you to create a video or give us some notes so we can find all the important questions for ensembling techniques?
@UnfoldDataScience 2 years ago
Thanks Mamata, I do keep sharing on Instagram, please follow "unfolddatascience" on Instagram.
@mamataparab9803 2 years ago
Sure, Aman. Thank you
@atod2572 2 years ago
Awesome explanation. Can you please tell us when to use which technique? I mean with an example dataset and the choice of sampling technique for it.
@ayushparihar5989 1 year ago
Good explanation
@riva.4484 1 year ago
Thank you so much! This video helped me a lot. I have a question: how can we decide which approach is the best fit for our imbalanced dataset?
@UnfoldDataScience 1 year ago
It's always trial and error.
@bijaynayak6473 2 years ago
Very Nice explanation kudos
@UnfoldDataScience 2 years ago
Thanks for liking Bijay
@Samtoosoon 2 months ago
Undersampling, oversampling minority class, combo, ensemble random forest, batch selection
@dd3371 2 years ago
Thanks very much for sharing and explaining. What are your thoughts on logistic regression? Would imbalanced data still be a problem if you build the model as a GLM using logistic regression?
@younesgasmi8518 1 year ago
Can I use oversampling or undersampling before splitting the dataset into training and testing?
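On this question, the usual advice is to split first and resample only the training part, so that duplicated minority rows never end up in the test set. A rough sketch under that assumption, using plain random oversampling (not SMOTE) and the standard library only; the helper names and toy data are hypothetical, and the split is not stratified:

```python
import random


def split_indices(n, test_fraction, rng):
    """Shuffle row indices and cut them into train and test parts."""
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def random_oversample(rows, labels, rng):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


rng = random.Random(0)
rows = [[i] for i in range(100)]                    # toy feature rows
labels = [1 if i < 10 else 0 for i in range(100)]   # 10 minority, 90 majority

# 1) Split first, so the test set keeps the original class distribution.
train_idx, test_idx = split_indices(len(rows), 0.2, rng)
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Oversample the training part only.
X_train_bal, y_train_bal = random_oversample(X_train, y_train, rng)
print(len(X_train), len(X_train_bal), len(X_test))
```

If the resampling were done before the split, copies of the same minority row could land in both train and test, and the test score would look better than it really is.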
@avikdinda7827 6 months ago
Does oversampling the full dataset cause data leakage? And when I apply SMOTE to the training data after the train-test split, it gives poor precision for the minority class, although recall is OK. What can I do to improve the precision of the minority class?
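One option already mentioned in the pinned comment is moving the decision threshold: raising it usually trades some recall for better precision. A small sketch of that trade-off; the scores and labels below are invented toy values, not output from any real model, and the function name is hypothetical:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall of the positive class when predicting 1 for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted probabilities for the minority class, with true labels.
scores = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for t in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data precision rises and recall falls as the threshold goes up, so the threshold should be picked on a validation set according to which error matters more.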
@nagarajsundar7931 2 years ago
Hi Aman, thanks for explaining the various methods. One question: when should we use which method?
@UnfoldDataScience 2 years ago
Thanks Naga, there can't be a one-to-one rule for this. There are some pointers which I can cover in a different video. Thanks for asking.
@NeeRaja_Sweet_Home 2 years ago
Hi Aman, most videos cover imbalanced datasets for classification problems, but how do we check for and handle an imbalanced dataset in a regression problem? Thanks,
@dilshadmuhammed8224 1 year ago
In my case I have more than 2 classes, and those classes are text labels, e.g. well being, business analytics, etc. How do I balance such classes?
@swapnilgiram1355 4 months ago
Can we use the SMOTE technique?
@snehalvaidya5843 2 years ago
Thanks for sharing knowledge 🙂, please share how to explain PCA in front of an interviewer.
@UnfoldDataScience 2 years ago
kzbin.info/www/bejne/paTKooSvbq2lbtU
@dhanushraj3697 2 years ago
The video was good, but I request you to add some extra information and explanation for each method.
@sadhnarai8757 2 years ago
Very nice Aman
@UnfoldDataScience 2 years ago
Thank you
@tharindumadusanka3038 2 years ago
I am doing market basket analysis (MBA) with the apriori algorithm in Google Colab. The problem is that when I use more than 20 rows of CSV transaction data, it displays an error. With fewer than 20 rows, I get the expected result.
@UnfoldDataScience 2 years ago
That's not a number-of-rows problem; there may be some hidden issue, probably with row number 21. I am just guessing.
@chalmerilexus2072 2 years ago
Which method is preferable?
@UnfoldDataScience 2 years ago
This is discussed towards end.
@mihretdesta9153 1 year ago
Hey sir, how about imbalanced image data for deep learning?
@UnfoldDataScience 1 year ago
Data augmentation is one option.
@maasahebbiustad8514 2 years ago
Hello sir, how do I solve a classification problem in which the training data has only one class? I get: 'This solver needs samples of at least 2 classes in the data, but the data contains only one class: 1'. Please help me out.