4.7. How to Handle Imbalanced Dataset | Data Pre-Processing | Machine Learning Course

Views: 21,705


Hi! I will be conducting one-on-one discussions with all channel members. Check out the perks and join the membership if interested: / @siddhardhan Check membership perks: / @siddhardhan
This video is about how to handle an imbalanced dataset in Python for Machine Learning and Data Science. This is one of the important data pre-processing steps.
All presentation files for the Machine Learning course as PDF for as low as ₹200 (INR): Drop a mail to siddhardhans2317@gmail.com
Enroll at One Neuron to learn from 100 courses in one subscription with 5% discount: courses.ineuro...
Machine Learning Projects Playlist: • Machine Learning Projects
Machine Learning Course with Python Playlist: • Machine Learning Cours...
Hello everyone! I am setting up a donation campaign for my KZbin Channel. If you like my videos and wish to support me financially, you can donate through the following means:
From India 👉 UPI ID : siddhardhselvam2317@oksbi
Outside of India? 👉 Paypal id: siddhardhselvam2317@gmail.com
(No donation is small. Every penny counts)
Thanks in advance!
Hi guys! I am Siddhardhan. I work in the field of Data Science and Machine Learning. It all started with my curiosity to learn about Artificial Intelligence and the ability of AI to solve several Real Life Problems. I worked on several Machine Learning & Deep Learning projects involving Computer Vision.
I am on this journey to empower as many students & working professionals as possible with the knowledge of Machine Learning and Artificial Intelligence.
Let's build a Community of Machine Learning experts! Kindly Subscribe here👉 tinyurl.com/md...
I am making a "Hands-on Machine Learning Course with Python" in KZbin. I'll be posting 3 videos per week: Monday Evening; Wednesday Evening; Friday Evening.
Dataset file: drive.google.c...
Colab File Link: colab.research...
Download the Course Curriculum File from here: drive.google.c...
LinkedIn: / siddhardhan-s-741652207
Telegram Group: t.me/siddhardhan
Facebook group: www.facebook.c...
Instagram: / siddhardhan23

Comments: 50

@unvatopregunta1572 1 year ago

Excuse me sir, I have a problem when uploading my CSV file to Colab. It does not load completely; some data does not appear. I noticed this when I looked at the class distribution, which came out different from yours. Is there a way to load the entire CSV? Keep up the great content! A hug from Mexico!

@jitendratrivedi7889 2 years ago

Sid, you are doing a very good job. Please keep posting videos ...

@younesgasmi8518 10 months ago

Thank you, mister, for this amazing work. Do we use oversampling before or after splitting the dataset into training and testing sets?
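
One common way to approach this question, sketched under assumptions: split first, then oversample only the training portion, so that duplicated minority rows never end up in the test set. The data below is synthetic and the column names (`feature`, `Class`) are hypothetical, not taken from the video.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical data: 950 rows of class 0 and 50 rows of class 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Class": [0] * 950 + [1] * 50,
})

# 1. Split first, so the test set keeps the original class ratio.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Class"], random_state=2
)

# 2. Oversample the minority class inside the training set only.
majority = train_df[train_df["Class"] == 0]
minority = train_df[train_df["Class"] == 1]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=2)
train_balanced = pd.concat([majority, minority_upsampled], axis=0)

print(train_balanced["Class"].value_counts())  # both classes now have equal counts
print(test_df["Class"].value_counts())         # test set is left untouched
```

Oversampling before the split would let copies of the same minority row appear in both the training and the test set, which tends to inflate the test score.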

@prawinselvam9102 3 years ago

Now my life is balanced.🙏

@Siddhardhan 3 years ago

😂😂 great

@rajeshmidha4243 2 years ago

Hello sir, here we are creating a balanced dataset by combining only 492 legit data points, out of more than 2 lakh (200,000), with the 492 fraudulent data points. Don't you think we will lose valuable information by ignoring the other 2 lakh rows?

@shahbazkhalilli8593 1 year ago

The same question is on my mind.
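
The under-sampling step discussed in this thread can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the video's notebook: the `Amount` column, the 20,000-row majority class, and the random seed are assumptions, while the 492 fraud rows mirror the figure quoted in the comment. It also prints how much of the majority class is kept, which is the information loss being asked about.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit card data (sizes and columns are assumptions).
rng = np.random.default_rng(0)
n_legit, n_fraud = 20_000, 492
df = pd.DataFrame({
    "Amount": rng.exponential(scale=50.0, size=n_legit + n_fraud),
    "Class": [0] * n_legit + [1] * n_fraud,
})

legit = df[df["Class"] == 0]
fraud = df[df["Class"] == 1]

# Randomly keep as many legit rows as there are fraud rows.
legit_sample = legit.sample(n=len(fraud), random_state=2)
balanced = pd.concat([legit_sample, fraud], axis=0)

kept_fraction = len(legit_sample) / len(legit)
print(balanced["Class"].value_counts())
print(f"Legit rows kept: {kept_fraction:.1%}")
```

The discarded majority rows are simply never seen by the model, so the concern in the comments is valid; whether it matters in practice has to be checked on held-out data.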

@nakkapraneethreddy3378 3 years ago

Can you do end-to-end projects with deployment?

@Siddhardhan 3 years ago

Hi! I am planning to cover the deployment topics after completing the algorithm concepts and other topics.

@lionspacedanger4060 5 months ago

I loved the content, man!

@ryidgilani8987 10 months ago

How can I proceed to use PyCaret from here?

@saurabhsingh5472 3 years ago

Does using stratify = y while doing train_test_split solve the imbalance problem?

@Siddhardhan 3 years ago

Hi! No, it won't. It would be best to use stratify when the distribution is even, not the other way around.
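
A small check, under assumed synthetic data, of what `stratify=y` actually does in `train_test_split`: it keeps the class proportions the same in the training and test splits, and it does not change those proportions. The array sizes and the random seed are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 900 of class 0 and 100 of class 1 (a 90:10 imbalance).
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Share of class 1 in each split; both stay at the original 10%.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Both splits remain 90:10, so the training set is exactly as imbalanced as before the split.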

@melodylyricskannada 3 years ago

Why do we have to do sampling here? Can't we compare with the full data? Why did you reduce it to 93?

@Siddhardhan 3 years ago

Hi! We cannot train the model with an imbalanced dataset. That's the reason.

@DSlayer007 1 year ago

What percentage split should be considered imbalanced data? Is 75:25 an imbalanced dataset?

@madhumithasanyal4628 5 months ago

In the case of a 75:25 ratio, the majority class is 3 times larger than the minority class (75/25). This falls into the "High Imbalance" category.
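
The 75/25 arithmetic above can be reproduced on any label column with a short check like the one below. The label values are made up for illustration, and the code only computes the ratio; it does not decide which category the ratio belongs to.

```python
import pandas as pd

# Hypothetical labels with a 75:25 split.
labels = pd.Series([0] * 75 + [1] * 25, name="Class")

counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()  # majority count / minority count

print(counts)
print(imbalance_ratio)  # 3.0
```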

@ekaagustina9010 3 years ago

Sir, after I do the sampling and run the prediction, the accuracy score decreases. Without sampling, the accuracy score is higher.

@Siddhardhan 3 years ago

Hi! That is because the model will predict only one value as output. Consider this: there are 2 labels (0, 1). In our training data, 90 values have label 1 and 10 values have label 0. When you train your model on this data, it will predict only 1. Let's say the test data has 50 data points, of which 40 have label 1 and 10 have label 0. The model will predict 1 for all 50 data points, so the accuracy score will be 80%. But we cannot say that the model is making good predictions. This happens because the model is trained on imbalanced data. Do you understand this?

@Siddhardhan 3 years ago

Print the values predicted by the model. It would almost certainly have predicted only the predominant value.
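
The numbers in the reply above (50 test points, 40 with label 1 and 10 with label 0, and a model that always predicts 1) can be checked directly. This sketch hard-codes the always-1 predictions instead of training a model, which is an assumption made to keep the example short.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Test labels from the example: 40 points of class 1, 10 points of class 0.
y_test = np.array([1] * 40 + [0] * 10)

# A model that has collapsed to the majority class predicts 1 everywhere.
y_pred = np.ones_like(y_test)

accuracy = accuracy_score(y_test, y_pred)
recall_class_0 = recall_score(y_test, y_pred, pos_label=0)

print(np.unique(y_pred))  # [1] -> only one value is ever predicted
print(accuracy)           # 0.8
print(recall_class_0)     # 0.0 -> no class-0 point is ever found
```

Accuracy alone hides the problem here, which is why printing the predicted values, or looking at per-class recall, is a useful sanity check.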

@sanjanapanisetty3854 3 years ago

Hello sir, based on what parameters do we know that fraud has occurred? How do we identify it?

@sanjanapanisetty3854 3 years ago

Class = 0 and 1: how are we defining them?

@Siddhardhan 3 years ago

Hi! Labelling is done manually in most datasets.

@Siddhardhan 3 years ago

The data curator would have done it in this case.

@ishfaqueshaku8246 1 year ago

I have classes labelled 0-7. How do I solve it?

@adiityabairwa3444 3 years ago

Thanks a lot... Sir..

@Siddhardhan 3 years ago

Most welcome😇

@saurabhsingh5472 3 years ago

Reducing legit from 2 lakh to 492 might lead to loss of information. What is your opinion?

@Siddhardhan 3 years ago

There's a possibility. But this is a better way to do it. If the model had not found the patterns well, the accuracy would have been very low. That is not the case here; we got a good accuracy score.

@arunkumarr2302 1 year ago

@Siddhardhan A heart disease dataset contains 253,680 rows: heart disease yes = 229,787 and heart disease no = 23,893. As you said, I tried under-sampling and got an accuracy of 70%. But with the original data I get 80% with naive Bayes. In this case we need not use under-sampling, am I correct, sir? Please reply, sir.

@bosszz1282 2 years ago

Auto subtitles for 4.7, please.

@halfbloodprince1788 3 years ago

Subtitles?

@Tony-vo9ok 1 month ago

Amazing

@sachinvithubone4278 3 years ago

Thanks Siddhardhan. It's great and very simple to understand: imbalanced dataset 👍

@Siddhardhan 3 years ago

My pleasure😇

@jehanbhathena6270 3 years ago

Hello, what is the difference between feature engineering and data pre-processing?

@Siddhardhan 3 years ago

Hi! Kindly check the following site for detailed information: www.quora.com/What-is-the-difference-between-data-pre-processing-and-feature-engineering-in-machine-learning

@rahulpurbey1362 3 years ago

Great video

@Siddhardhan 3 years ago

Thanks 😇

@mihretdesta9153 1 year ago

Hey sir, how do we balance imbalanced data for deep learning classification?

@theodorejagrup5210 3 years ago

Can you add an example with over-sampling, so that class 1 will have 284,315 records?
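
A rough sketch of the over-sampling variant requested here, using `sklearn.utils.resample` to draw minority rows with replacement until they match the majority count. The class sizes (284,315 and 492) are the figures quoted in the comments; the `Amount` values are synthetic and the random seed is arbitrary, so this is an illustration and not the video's notebook.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic stand-in with the class sizes mentioned in the comments.
rng = np.random.default_rng(0)
n_legit, n_fraud = 284_315, 492
df = pd.DataFrame({
    "Amount": rng.exponential(scale=50.0, size=n_legit + n_fraud),
    "Class": np.repeat([0, 1], [n_legit, n_fraud]),
})

legit = df[df["Class"] == 0]
fraud = df[df["Class"] == 1]

# Draw fraud rows with replacement until there are as many as legit rows.
fraud_upsampled = resample(
    fraud, replace=True, n_samples=len(legit), random_state=2
)
oversampled = pd.concat([legit, fraud_upsampled], axis=0)

print(oversampled["Class"].value_counts())
```

The extra class-1 rows are exact copies of the original 492, so this adds no new information; it only changes how often the model sees the minority class during training.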

@NeeRaja_Sweet_Home 2 years ago

Hi Siddhardhan, we mostly see imbalanced datasets in classification problems, but how do we check for and handle an imbalanced dataset in a regression problem? Thanks.

@shahbazkhalilli8593 1 year ago

please answer

@gowthami712 3 years ago

How can we convert a .mat file to a .csv file?

@Siddhardhan 3 years ago

Hi! Refer to this page: www.mathworks.com/matlabcentral/answers/195151-how-to-convert-a-mat-file-into-a-csv-file
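
For readers working in Python instead of MATLAB, one possible route is `scipy.io.loadmat` followed by pandas. This is a sketch under assumptions: the file names and the variable name `data` are hypothetical, the example first writes its own tiny .mat file so it can run, and `loadmat` does not read MATLAB v7.3 files.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy.io import loadmat, savemat

with tempfile.TemporaryDirectory() as tmp:
    mat_path = os.path.join(tmp, "example.mat")
    csv_path = os.path.join(tmp, "example.csv")

    # Create a tiny .mat file so the sketch is runnable ("data" is a made-up name).
    savemat(mat_path, {"data": np.arange(6, dtype=float).reshape(3, 2)})

    # loadmat returns a dict; keys starting with "__" are file metadata.
    mat = loadmat(mat_path)
    variable_names = [key for key in mat if not key.startswith("__")]

    # Write the 2-D array stored under "data" to CSV.
    array = mat["data"]
    pd.DataFrame(array).to_csv(csv_path, index=False, header=False)

    # Read the CSV back to confirm the values survived the conversion.
    round_trip = pd.read_csv(csv_path, header=None).to_numpy()

print(variable_names)
print(round_trip)
```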

@Crecky_21 3 years ago

Sir, I have some questions. 1. How do we know which evaluation metrics to use for our regression problem statements? 2. What are the pros and cons of each metric? 3. When is each of the regression models best, and when does it fail?

@Siddhardhan 3 years ago

Hi! We generally use mean absolute error, root mean squared error, and mean squared error. It's hard to explain all the details in this comment. I already have these topics in my course curriculum. Kindly stay tuned.
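
The three metrics named in the reply can be computed by hand with NumPy. The target and prediction values below are invented purely to make the arithmetic easy to follow.

```python
import numpy as np

# Made-up true values and predictions for a regression task.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [ 1.,  0., -2., -1.]

mae = np.mean(np.abs(errors))     # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)        # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)               # square root of the MSE, in the target's units

print(mae, mse, rmse)
```

Because MSE and RMSE square the errors, the single error of 2 contributes more to them than to MAE, so they penalise large misses more heavily.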

@Crecky_21 3 years ago

@Siddhardhan Thank you, sir.

@Siddhardhan 3 years ago

You're welcome