Tutorial 45 - Handling Imbalanced Dataset Using Python - Part 1

132,711 views


Machine learning algorithms tend to produce unsatisfactory classifiers when faced with imbalanced datasets. For any imbalanced dataset, if the event to be predicted belongs to the minority class and the event rate is less than 5%, it is usually referred to as a rare event.
If you like my efforts, please subscribe to the channel and share it with your friends.
GitHub URL: github.com/kri...
Data Science Interview Question playlist: • Complete Life Cycle of...
Data Science Projects playlist: • Generative Adversarial...
NLP playlist: • Natural Language Proce...
Statistics Playlist: • Population vs Sample i...
Feature Engineering playlist: • Feature Engineering in...
Computer Vision playlist: • OpenCV Installation | ...
You can buy my book on Finance with Machine Learning and Deep Learning from the URL below.
Amazon URL: www.amazon.in/...

Comments: 98

@rajeshreddy3133 4 years ago

@6:53 You can get rid of the for loop / list comprehension with: columns = data.columns.to_list(); target = columns[-1]; columns = columns[:-1]

@drishtisharma6843 4 years ago

yes... :)

@mohammedameen3249 3 years ago

X = df.drop(labels='Class', axis=1)
y = df['Class']
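Both suggestions above come down to separating the feature columns from the target column. A minimal sketch on a tiny made-up DataFrame (the `V1` and `Amount` columns and all values are invented for illustration; only the `Class` column name comes from the comment above):

```python
import pandas as pd

# Toy stand-in for the real data: two feature columns and a binary target.
df = pd.DataFrame(
    {
        "V1": [0.1, -1.2, 0.7, 2.3],
        "Amount": [10.0, 250.0, 3.5, 99.9],
        "Class": [0, 0, 0, 1],
    }
)

# Everything except the target column becomes the feature matrix.
X = df.drop(columns="Class")
# The target column on its own becomes the label vector.
y = df["Class"]

print(X.columns.tolist())
print(y.value_counts().to_dict())
```

Dropping the target by name avoids relying on it being the last column, which the slicing approach assumes.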

@ruchitaphalak3603 3 years ago

Hello there. Firstly, a very big thank you, Krish Naik, for these videos. A major part of my ML coursework for my master's degree was possible almost entirely due to your theory and practical sessions. You are doing such good work. Can't thank you enough. These videos are a big saviour. No kidding, none of my master's professors could match your way of teaching. Alas! I have to pay so much to the university only to learn from YouTube. Thank you once again :)

@alricliew714 3 years ago

Suggestion: It would be convincing if you include the before-after performance comparison.

@btsislife9659 5 months ago

Superb and amazing explanation... the teacher has explained it very well, really really amazing 🎉

@smrahman4595 2 years ago

Hey Krish Naik, your explanation is just awesome. Salute.. 👍👍👍

@tahfim37 4 years ago

Thanks, the explanations are easy to understand, constructive and well organized. Keep doing it 💯💯👏👏

5 years ago

Krish Super fast express has just arrived in KZbin station!

@BiranchiNarayanNayak 5 years ago

Nice tutorial on imbalanced dataset.

@sandipansarkar9211 4 years ago

Thanks Krish. Great Explanation

@dipeoluayomide5097 1 year ago

You're the best Krish

@boringhuman9427 3 years ago

Error: cannot unpack non-iterable NearMiss object. Code: # Implementing Undersampling for Handling Imbalanced; nm = NearMiss(); X_res, y_res = nm.fit_resample(X, Y); y_res.value_counts()

@samisoomro1324 5 years ago

Can you please make a video on image classification for multiple classes with imbalanced classes?

@sandeepdesale6395 3 years ago

TypeError: __init__() got an unexpected keyword argument 'random_state'. This error occurs when I am using NearMiss. What should I do?

@MOHITBARTHWAL 3 years ago

First of all, thanks for bringing such nice video lectures on machine learning. I have two queries regarding downsampling. 1. As we have drastically downsampled the data belonging to one class, will it not result in the loss of a significant amount of information? 2. How do we decide how far to downsample? Here in the video we downsampled the dominant class and made it equal in size to the other one.

@satyamshukla4209 3 years ago

It will. This is a major disadvantage of undersampling: you may lose valuable information. Undersampling is done on the majority class. Oversampling is done on the minority class.
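To make the trade-off in this exchange concrete, here is a hand-rolled random undersampler in plain Python. It is a sketch under stated assumptions, not the NearMiss method used in the video: the function name `random_undersample`, the `ratio` parameter (desired minority/majority ratio after resampling) and the 950/50 toy labels are all invented for illustration, and it only handles two classes.

```python
import random
from collections import Counter


def random_undersample(rows, labels, ratio=1.0, seed=42):
    """Keep every minority row and a random subset of the majority rows.

    ratio is the desired minority/majority count after resampling
    (1.0 means fully balanced). Binary labels only.
    """
    counts = Counter(labels)
    (majority, n_major), (_, n_minor) = counts.most_common(2)
    # Never ask for more majority rows than actually exist.
    n_keep = min(n_major, int(round(n_minor / ratio)))
    rng = random.Random(seed)
    major_idx = [i for i, lab in enumerate(labels) if lab == majority]
    keep = set(rng.sample(major_idx, n_keep))
    kept = [i for i, lab in enumerate(labels) if lab != majority or i in keep]
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data: 950 "genuine" (0) and 50 "fraud" (1) rows.
labels = [0] * 950 + [1] * 50
rows = list(range(len(labels)))

X_bal, y_bal = random_undersample(rows, labels, ratio=1.0)
X_half, y_half = random_undersample(rows, labels, ratio=0.5)

print("balanced:", Counter(y_bal))
print("ratio 0.5:", Counter(y_half))
print(f"rows discarded when fully balancing: {1 - len(y_bal) / len(labels):.0%}")
```

Fully balancing this toy set throws away most of the majority rows, which is the information loss mentioned above. A milder ratio keeps more of the data at the cost of leaving some imbalance, so the ratio is usually treated as something to tune against validation metrics rather than fixed at 1.0.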

@kesavae9552 4 years ago

I have a doubt: by performing up- or downsampling we are actually making the probability of fraud almost equal to that of genuine in the dataset, which is not true. Then the model learns that both fraud and genuine are equally likely events...?? If I'm wrong, please tell me why.

@leeroyndlovu7983 4 years ago

Hmm 🤔 interesting point, but that’s actually not true. What we’re interested in is learning the characteristics of defaults and non defaults. So we’re not saying all are equally likely possibilities. We’re saying what attributes does a defaulter have and what attributes does a non defaulter have so when we see a new dataset we’d be able to properly classify it

@kesavae9552 4 years ago

@@leeroyndlovu7983 thanks budd, really helpful ✌️
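One way to reconcile the two views in this thread: resampling does change the class prior the model sees, so raw probabilities from a model trained on balanced data tend to be inflated for the rare class, even though the model can still learn what separates the classes. If the majority class was kept at random with rate beta, the usual prior-shift correction maps a probability p_s learned on the resampled data back to p = beta * p_s / (beta * p_s - p_s + 1). The sketch below assumes purely random undersampling (not NearMiss, which selects samples by distance) and a model that is well calibrated on the resampled data; the function name and the toy numbers are hypothetical.

```python
def correct_probability(p_sampled, beta):
    """Map a probability learned on undersampled data back to the original prior.

    p_sampled: minority-class probability from a model trained after
               random undersampling of the majority class.
    beta:      fraction of majority-class rows that were kept (0 < beta <= 1).
    """
    return beta * p_sampled / (beta * p_sampled - p_sampled + 1)


# Toy setting: 1% fraud originally, majority class undersampled until the
# classes are balanced, so beta = 0.01 / 0.99 = 1/99.
beta = 1 / 99
for p_s in (0.1, 0.5, 0.9):
    print(p_s, "->", round(correct_probability(p_s, beta), 4))
```

In this toy setting a score of 0.5 on the balanced data maps back to roughly 0.01, the original fraud rate, which matches the intuition that "no evidence either way" should return the base rate.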

@reshamsundar6941 4 years ago

Sir, could you please tell me whether, for an imbalanced dataset, it is better to do train_test_split before performing under- or oversampling, or after?

@Arjun147gtk 4 years ago

you can do it before. That will be better.

@shubhamgalande3315 3 years ago

If you apply train_test_split to an imbalanced dataset, it might take only the negative or only the positive class into your training data.

@mohammedameen3249 3 years ago

Yes, it is better to split the data first and then apply under- or oversampling on the training data only, to avoid data leakage.

@aguyfromparalleluniverse 1 year ago

Thank you, Krish sir

@thealgorithm7633 5 years ago

Very helpful

@rajivrsrivastava6379 4 years ago

It was really nice. Could you please make a video on sentiment analysis?

@ameyakaranjkar9359 4 years ago

Hi sir, I am just a beginner in ML, so this question might be very naive, but in general is over-sampling better than under-sampling, as in under-sampling we are essentially decreasing the size of our data set which in turn would affect our accuracy?

@Ayra_Is_Cool_lol 3 years ago

He just said that you should undersample only when you have a large dataset, and oversample when your dataset is small.

@drishtisharma6843 4 years ago

superb! so helpful... thank you so much :)

@louerleseigneur4532 3 years ago

Thanks Krish

@babbibaljeethkaur3995 5 years ago

Thank you so much Krish 🙏, very clearly explained. I searched the whole web for this and finally found the functions. Could you please make a video on the "bias and variance tradeoff" using Python functions? I'll be grateful :)

@learnenglish699 2 years ago

Hello Bibi, have you got the job? It's been two years now, where are you?

@TrNz21 3 years ago

I had an error like this: 'NearMiss' object has no attribute 'fit_sample'. Please tell me what to do.

@saitejagoud4143 3 years ago

I'm also facing same problem

@mohammedameen3249 3 years ago

nm.fit_resample(X,y)

@vinayaksharma-ys3ip 1 year ago

thanks a lot Sir!!

@eduardodimperio 2 years ago

What's the difference between this and simply slicing as many rows from class 0 as exist in class 1?

@shreyasb.s3819 3 years ago

Nice work.

@Alhamdou_lilah 4 years ago

Hello, thank you for the video! Tell me please, how can I plot a histogram for the new balanced data? And how can I get the new X and Y? Thank you!

@thanveen 3 years ago

Suppose we have a dataset which is imbalanced and also has missing values. Which should be treated first?

@CatBlack01 3 years ago

I have the same question. Can anyone answer it? My instinct is to try both ways and see if I get different results.

@priyanshugupta2104 1 year ago

What's the problem with this? You can handle missing values.

@priyankadurairajan887 4 years ago

Is this technique applicable to multilabel classification or not?

@kennytuwiro7561 3 years ago

nice one

@yasserothman4023 3 years ago

Why is the undersampling applied before the train-test split and not on the training dataset alone?

@rahulsharma-dk5jf 3 years ago

Why is random_state set to 42? Is there a specific reason for that?

@94fuckmylife 3 years ago

This helps in reproducibility. If you run the code in your terminal with random_state = 42, it will give the same results as the one he has got in the video.
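The reproducibility point can be illustrated with the standard library alone. This is a sketch of the general idea of seeding a random number generator, not of any particular scikit-learn or imbalanced-learn internals; the helper name `sample_rows` is made up for the example.

```python
import random


def sample_rows(n_rows, k, seed):
    """Pick k row indices at random from range(n_rows), using a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)


a = sample_rows(1000, 5, seed=42)
b = sample_rows(1000, 5, seed=42)
c = sample_rows(1000, 5, seed=7)

print("same seed, same rows:", a == b)
print("seed 42:", a)
print("seed 7: ", c)
```

The value 42 itself has no special meaning. Any fixed integer gives repeatable results; a different integer simply gives a different, equally repeatable, selection.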

@guganr9321 2 years ago

I am getting this error, sir, could you help? 'NearMiss' object has no attribute 'fit_sample'

@sandipansarkar9211 4 years ago

I did my practice in a Jupyter notebook but could not import NearMiss from the imblearn library. Please guide.

@chdhc9922 2 years ago

Can undersampling lead to underfitting?

@sreeramsaravanan8132 4 years ago

Krish, please make a video on multicollinearity.

@yuvrajverma6832 3 years ago

fit_resample gives an error: MemoryError: Unable to allocate 1.00 GiB for an array with shape (236, 568630) and data type float64. How can I solve it?

@samriddhlakhmani284 4 years ago

Haha, I had already worked on this data set.

@magelauditore333 4 years ago

Sir, what if 90% of the data in an independent feature column belongs to the same category? Should I drop the column or not?

@abhishekjn3390 4 years ago

Why did you set random_state to 42 in NearMiss?

@winyourself553 3 years ago

12:13 What about the remaining original data, after keeping 492 in each class?

@renuverma5633 4 years ago

I am not able to import NearMiss from imblearn.under_sampling. ImportError: cannot import name 'six'. What should I do?

@ashwinchavan6391 1 year ago

I am getting this error, what should I do: TypeError: __init__() got an unexpected keyword argument 'random_state'

@ahmedhelal920 3 years ago

nm = NearMiss(random_state = 42) gives me the error: __init__ unexpected keyword argument 'random_state'

@AhmedKhan-fh8fq 3 years ago

Hello sir, what kind of algorithm should I apply to convert the diabetes dataset into a readable format? Please guide me.

@anupamborah2551 4 years ago

One question, At which stage do we use imblearn? Should we use it after feature engineering or right at the start and then go for feature engineering, feature selection, model training etc?

@shanbhag003 3 years ago

I think just before train test split

@GuruCharanT-w9t 1 year ago

Could you please share the dataset? It is not available at the link.

@HITD47 4 years ago

Now how will the train-test split work on that?

@abhinavpratapsingh4445 4 years ago

_init_ got an unexpected error

@prajothshetty6848 4 years ago

Sir, should I do undersampling if I am using the random forest algorithm for a classification problem?

@patelajay1010 3 years ago

I have one doubt. What if the data contains NaN values and you want to do undersampling? If you impute the NaN values with the mean, there will be information leakage, because we impute the data before splitting it into train and test datasets. Could you please tell me what a possible solution would be in this case?
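One common answer to the leakage concern raised here is to reorder the steps: split first, compute the imputation value from the training rows only, apply that same value to both parts, and only then resample the training part. The sketch below shows just the imputation step with the standard library; the helper names `fit_mean` and `impute` and the toy values are invented for illustration, with `None` standing in for NaN. scikit-learn's `SimpleImputer` follows the same fit-on-train, transform-both pattern.

```python
from statistics import mean


def fit_mean(values):
    """Learn the fill value from the observed (non-missing) training values."""
    return mean(v for v in values if v is not None)


def impute(values, fill):
    """Replace missing entries with a fill value learned elsewhere."""
    return [fill if v is None else v for v in values]


# One feature column, already split into train and test parts.
train = [10.0, None, 30.0, 20.0]
test = [None, 1000.0]

# The fill value comes from the training rows only, so the extreme test
# value (1000.0) cannot influence it.
fill = fit_mean(train)

train_filled = impute(train, fill)
test_filled = impute(test, fill)

print("fill value:", fill)
print("train:", train_filled)
print("test: ", test_filled)
```

Had the mean been computed over all six values, the test row's 1000.0 would have shifted the value used to fill the training data, which is exactly the leakage described in the question.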

@AmeerulIslam 4 years ago

Will it work for a regression problem?

@DonutTechBites 4 years ago

Sir, where is tutorial 44? Please do upload it.

@saurabhsharma-tm3co 4 years ago

Hi, I am unable to use the imblearn package. I tried to install it with:

    conda install -c conda-forge imbalanced-learn
    conda install -c conda-forge/label/gcc7 imbalanced-learn
    conda install -c conda-forge/label/cf201901 imbalanced-learn

The above packages installed successfully using the Anaconda prompt, but I am still getting the error: ModuleNotFoundError: No module named 'imblearn'. Please help.

@ishitagupta6584 4 years ago

I am also facing this error. Can you please share the solution if you got it. Thanks.

@abhishekrai6803 3 years ago

Same for me
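A frequent cause of "installed successfully but still ModuleNotFoundError" is that the package went into a different Python environment than the one the notebook kernel is running. That is an assumption about this case, not something the thread confirms, but it is cheap to check. The sketch below uses only the standard library; `is_installed` is a made-up helper name.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# The interpreter the notebook (or script) is actually running on.
print("Interpreter:", sys.executable)

# The package is installed as "imbalanced-learn" but imported as "imblearn".
print("imblearn importable:", is_installed("imblearn"))
```

If this prints `False` from inside the notebook, installing from a notebook cell with `%pip install imbalanced-learn` (or running `python -m pip install imbalanced-learn` with the interpreter path printed above) targets the kernel's own environment. Restart the kernel afterwards.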

@pasalapravalli5968 7 months ago

Where is the creditcard csv file?

@souravsaha7751 2 years ago

Sir imblearn is not installing.

@imayushthakur 4 years ago

I couldn't find the related data anywhere for practicing, not even at the provided links, so please share the source.

@jayjagani5998 3 years ago

Dataset Link: www.kaggle.com/mlg-ulb/creditcardfraud

@harshiths5140 1 year ago

Is there any contact information for joining the subscription plan for projects? Please let me know.

@akshatrailaddha5900 1 year ago

I'm getting a No module named 'imblearn' error even after installing it in the Anaconda prompt. Can anyone assist me with this?

@akshatrailaddha5900 1 year ago

Even after successfully installing the library in the command prompt and updating Python, it is still not working.

@shishirdixit5996 4 years ago

In my dependent categorical variable, I have 80% 0s and 20% 1s out of 3400 records. Will it be considered an imbalanced dataset?

@bytblaster 4 years ago

yes

@apurvakulkarni7725 2 years ago

Hey Krish, thanks for the video, but while implementing undersampling for handling the imbalance I am getting the error "TypeError: __init__() got an unexpected keyword argument 'random_state'" and am unable to find a solution. It would be great if you could assist.

@sahiltamboli7371 2 years ago

Did you find a solution?

@omsonawane2848 1 year ago

@@sahiltamboli7371 Actually, the function to use is nm.fit_resample. Also remove the random_state argument from the NearMiss constructor.

@kajalkapasiya193 5 years ago

Hi Krish, can you please provide the data file as well?

@generationwolves 4 years ago

www.kaggle.com/mlg-ulb/creditcardfraud

@abhinandanpawar.7880 2 years ago

How do I get this data?

@mohammadkaif2534 9 months ago

From 5000 to 1 million!!

@riteshtripathi8626 4 years ago

__init__() got an unexpected keyword argument 'random_state'. Any expert out there to help me here? I'm using a Mac and executing:

    # Implementing Undersampling for Handling Imbalanced
    nm = NearMiss(random_state=42)
    X_res, y_res = nm.fit_sample(X, Y)

@yashrajadventures 4 years ago

I am also getting the same error, please help me.

@yashrajadventures 4 years ago

The error is that you haven't installed the imblearn library; instead you may have installed imbalanced-learn. The two are different.

@riteshtripathi8626 4 years ago

@@yashrajadventures Thanks for the suggestion, though it hasn't helped yet. I have the imblearn library preinstalled, and when I ran pip install imblearn again it said requirement already satisfied. Did yours work?

@yashrajadventures 4 years ago

@@riteshtripathi8626 Yes, please install via the command "pip install imblearn"; other commands don't work. After installing, close every kernel and re-run all the cells.

@riteshtripathi8626 4 years ago

@@yashrajadventures Nope, it didn't work for me. I tried again and it says: requirement already satisfied. Looks like something's amiss. Anyway, I am using RStudio to handle imbalanced datasets for my day-to-day work; I will keep digging. Thanks.

@jagupatigolguri1875 1 year ago

Sometimes this code does not work:

    from imblearn.under_sampling import NearMiss
    nm = NearMiss()
    X_res, y_res = nm.fit_sample(X, y)

so we can also use this, where we can adjust the sample size:

    from imblearn.under_sampling import RandomUnderSampler
    from collections import Counter

    # Instantiate RandomUnderSampler with the desired sampling strategy
    rus = RandomUnderSampler(sampling_strategy={0: 4900, 1: 250})
    # Perform under-sampling
    X_res, y_res = rus.fit_resample(X, y)
    # Check the class distribution after under-sampling
    print(Counter(y_res))