This is why you should care about unbalanced data .. as a data scientist

17,550 views

ritvikmath

Comments: 25

@jessibenzel243 3 years ago

We just talked about this in my machine learning course this week!! Great timing! This video is very helpful.

@pgbpro20 3 years ago

ritvikmath coming with a video of one of my favorite topics - instant like!

@haneulkim4902 3 years ago

Great content, this kind of practical content is gold. Thank you :)

@tech-n-data 2 years ago

Thank you so much for all you do.

@JessWLStuart 1 year ago

Well presented!

@igorbreeze3734 2 years ago

Hi! Great video. Is there any chance you would create a full in-depth CatBoost tutorial on some random data? Would be super useful.

@joelrubinson9973 3 years ago

Very interesting. AdTech modeling of conversions caused by advertising always suffers from imbalance. (Conversion rates are usually in the low-to-mid single digits.)

@d.a.k.o.s9163 1 year ago

Great video! But don't you think that with such an unbalanced dataset it would be better to go for an anomaly detection algorithm instead of a classification algorithm?

@aghazi94 3 years ago

you are seriously so underrated

@danielwiczew 3 years ago

Okay, but with oversampling, how do you use cross-validation? If you run it on the oversampled dataset, you'll have a data leak.

@ritvikmath 3 years ago

I think you'd want to define the folds on the original data and then oversample only the training folds, holding the validation fold fixed. Example: 3-fold CV.
- Split the original data into 3 folds (A, B, C).
- Consider (A, B) as training data -> oversample that data -> validate using C.
- Repeat using A and then B as the validation set.
- Note that there is no data leak in this case.
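The fold-then-oversample procedure described in this reply can be sketched in plain Python. This is a minimal illustration, not the video's code: the helper names (`make_folds`, `oversample`, `cv_splits`) and the toy labels are assumptions, the folds are not stratified, and the oversampling is simple random duplication rather than SMOTE.

```python
import random

def make_folds(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def oversample(train_idx, labels, rng):
    """Duplicate rows of the smaller classes at random until every class
    present in train_idx has as many rows as the largest one."""
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        balanced.extend(rng.choices(rows, k=target - len(rows)))
    return balanced

def cv_splits(labels, k=3, seed=0):
    """Yield (oversampled_train_idx, untouched_val_idx) for each fold."""
    rng = random.Random(seed)
    folds = make_folds(len(labels), k, seed)
    for held_out in range(k):
        val_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        # Oversampling sees only the training folds, so no copy of a
        # validation row can end up in the training data.
        yield oversample(train_idx, labels, rng), val_idx

# Hypothetical toy data: 6 positives, 54 negatives.
labels = [1] * 6 + [0] * 54
splits = list(cv_splits(labels, k=3))
```

Because the duplicates are drawn after the split, the validation fold keeps the original class ratio, which is what the model will face in practice.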

@bmebri1 3 years ago

Excellent video! One question though: are certain classification models immune from class imbalance? Thanks!

@LanNguyen-eq6lf 3 years ago

To my knowledge, no classification model is immune to an imbalanced dataset, because they are all data-driven. However, you can still get very good accuracy on an imbalanced dataset. That happens when inter-class separability is very high; for example, detection of water bodies (often a minority class) over a large area is often quite accurate.

@Sameerahmed373 3 years ago

Can we customise the loss function? For example, more weight for misclassifying the true minority class and less weight for the other error?
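One common way to do what this question describes is a class-weighted log loss, where errors on the minority class are multiplied by a larger factor. The sketch below is only an illustration under assumptions: the function name `weighted_log_loss`, the weights, and the toy predictions are all hypothetical.

```python
import math

def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-12):
    """Binary cross-entropy where each example's term is scaled by the
    weight of its true class (w_pos for label 1, w_neg for label 0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        if y == 1:
            total += -w_pos * math.log(p)
        else:
            total += -w_neg * math.log(1 - p)
    return total / len(y_true)

# Hypothetical example: one minority-class row that the model scores low.
y_true = [1, 0, 0, 0]
p_pred = [0.2, 0.2, 0.2, 0.2]

plain = weighted_log_loss(y_true, p_pred)
upweighted = weighted_log_loss(y_true, p_pred, w_pos=5.0)
```

With `w_pos=5.0` the missed positive contributes five times as much to the loss, so an optimizer minimizing it is pushed harder toward getting minority rows right.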

@davidzhang4825 2 years ago

Great video. For other ML algorithms like logistic regression, SVM, KNN, etc., can we implement the first method (upweighting the minority class)? Or is this only applicable to decision trees?
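Class weighting is not specific to trees: any model trained by minimizing a per-example loss can scale each example's loss by a class weight. A common heuristic sets each weight to `n_samples / (n_classes * class_count)`. The sketch below computes those weights in plain Python; the function name and toy labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Hypothetical toy labels: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90
weights = balanced_class_weights(labels)
```

With these weights, both classes carry the same total weight in the loss. Distance-based methods such as KNN have no training loss to reweight, so resampling is usually the more direct option there.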

@zahrashekarchi6139 2 years ago

Great demo! Just one thought: why did you not talk about downsampling the majority class and what its impact would be?
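For reference, random downsampling of the majority class can be sketched as follows. This is a minimal illustration with hypothetical names and toy data, not something shown in the video.

```python
import random

def downsample_majority(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        out_rows.extend(rng.sample(group, target))  # sample without replacement
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Hypothetical toy data: rows 0-4 are positives, rows 5-99 are negatives.
rows = list(range(100))
labels = [1 if i < 5 else 0 for i in rows]
small_rows, small_labels = downsample_majority(rows, labels)
```

The trade-off is visible in the toy numbers: balancing this way keeps all 5 positives but throws away 90 of the 95 negatives, so it tends to suit cases where the majority class is large enough to spare.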

@douwe7493 10 months ago

This is something I am wondering about too!

@chenxiaodu2557 8 months ago

It should be "imbalanced data" instead of "unbalanced data"

@brenoingwersen784 7 months ago

Lol 😂

@mrirror2277 3 years ago

Hi, just wondering if SMOTE is applicable to image data? I saw only one article on it online, so I am not sure if it even works, since generating synthetic images is likely much harder.

@shahrinnakkhatra2857 1 year ago

That's where image augmentation comes into play. You can create different variations of an image by rotating, flipping, and applying various other transformations.
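The flips and rotations mentioned here can be sketched on a tiny image stored as a nested list of pixel values. The function names and the 2x3 example image are assumptions for illustration; real pipelines would normally use an image library instead.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]

def rotate_90_clockwise(img):
    """Rotate by 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(col) for col in zip(*img[::-1])]

# Hypothetical 2x3 grayscale "image" from the minority class.
img = [[1, 2, 3],
       [4, 5, 6]]

# Each transform yields an extra training example with the same label.
augmented = [flip_horizontal(img), flip_vertical(img), rotate_90_clockwise(img)]
```

Unlike SMOTE, which interpolates between a minority sample and its nearest minority neighbours, each augmented copy is a transformed version of one real image, so it stays a plausible image as long as the label is unchanged by the transformation.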

@bernardfinucane2061 3 years ago

You could predict that aircraft engines NEVER fail and almost always be right.
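This point is easy to check numerically. In the sketch below the failure counts are made up for illustration (2 failures in 1,000 flights is an assumption, not real data): a model that always predicts "no failure" scores very high accuracy while catching none of the failures.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in actual) / len(actual)

# Hypothetical data: 2 engine failures (label 1) in 1,000 flights.
y_true = [1] * 2 + [0] * 998
y_pred = [0] * 1000  # the "engines never fail" model

acc = accuracy(y_true, y_pred)  # 998 of 1000 correct
rec = recall(y_true, y_pred)    # 0 of 2 failures caught
```

This is why metrics such as recall or precision on the minority class are usually more informative than accuracy on imbalanced problems.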

@Septumsempra8818 3 years ago

Are you familiar with latent vectors in network analysis? S/o from South Africa

@junkbingo4482 3 years ago

Hi. When people have problems with unbalanced data, it's just proof that they do not understand what they are doing. When I was young (a long time ago), our teachers wanted us to do things step by step, to be (nearly) sure we knew what we were calculating. As that's not the case anymore, yes, people don't get the methodology and the maths but still practice data science, which is sad.

@junkbingo4482 3 years ago

ups, nuance wrote 'yes'!!; thx to lstm, i did not check my post, sorry! ;-)