Credit Card Fraud Detection using Machine Learning from Kaggle

Views: 219,153

Krish Naik

1 day ago

Comments: 132

@snehalbogar6326 2 years ago

Thanks!

@karthikdeepan1998 4 years ago

As this is an imbalanced dataset, the accuracy metric doesn't work. We have to use the confusion matrix and recall as metrics.

@muhammadwajahatali2286 1 year ago

Great bro❤

@swapniljena8684 4 years ago

Hey, I have a suggestion for improving recall. Since this is class-imbalanced data for detecting fraudulent transactions, accuracy is not the metric to count on. The best recall score you got is 0.27. I got a recall of 0.74 for class "1" (i.e. fraud) using the random forest ensemble method. When I then used the decision stumps of the random forest and trained them using AdaBoost, I got an even better recall of 0.76. Thank you for the video, I got to learn a lot.

@dineshramachandran1961 4 years ago

@Puneet Rajput, I'm impressed with your knowledge. I am an aspiring DS. Can I get your email ID please?

@akshats5996 2 years ago

So true!

@manthanrathod1046 2 years ago

I've never heard of this method of using decision stumps and then boosting them. I used plain SMOTE-ENN and got an 82% F1 score, though. Can you point me to a video or site? I need to study this decision stump and AdaBoost technique. Hope you will reply. Thanks :)

@divyanshumishra5993 1 year ago

@swapnil jena Can you please help me create the front end (GUI) for this project as a web application?

@snehalbogar6326 2 years ago

Thanks for explaining things in really simple language, yet covering all the complex theory.

@ayush51379 5 years ago

Thanks a lot for sharing this very useful video :-) I would like to add that we tested the model here only on the training data set itself. While this is an important step, the next step is to test the model on a separate testing data set. Splitting the data set into 80% training data and 20% testing data will probably help. Also, one should perform error analysis and hyper-parameter tuning to get the best results, taking care to keep the model general enough to handle the entire population of data, of which the data set given to us is only a part. Have a nice day! :-)
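The 80/20 split suggested above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the helper name `stratified_split` and the toy labels are assumptions made for the example. It splits each class separately so that the rare fraud class keeps the same share in both parts.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one example of each class in the test part.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 990 normal (0) and 10 fraud (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)

fraud_in_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), fraud_in_test)
```

With real data, a library routine that supports stratification is usually the more convenient choice; the sketch only shows what stratifying means.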

@divyanshumishra5993 1 year ago

Can you please help me create the front end (GUI) for this project as a web application?

@sankarsai5054 2 years ago

Finally found my... final year project on YouTube

@adityatiwari2488 2 years ago

Are you going to create a web application for the project?

@sankarsai5054 2 years ago

@adityatiwari2488 no

@adityatiwari2488 2 years ago

@sankarsai5054 bro, I am facing a name error

@adityatiwari2488 2 years ago

In Jupyter notebook: NameError: name 'Fraud' is not defined. Can you please help me with what I should do?

@sankarsai5054 2 years ago

@adityatiwari2488 in just..buy my project 🤗

@otmaneelaloi7926 4 years ago

Accuracy score isn't a good metric for this task since the data is very imbalanced; you can check the precision and recall you got for Isolation Forest and Local Outlier Factor. A good approach for comparing these algorithms on this specific task would be to use auc_score. Thanks for your videos.
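To make the AUC suggestion concrete, here is a small sketch that computes ROC AUC directly from its definition: the probability that a randomly chosen fraud case gets a higher anomaly score than a randomly chosen normal case. The function name `roc_auc` and the toy scores are assumptions for illustration only.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): higher score means "more likely fraud".
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
auc = roc_auc(y_true, scores)
print(auc)  # 7 of the 8 pairs are ordered correctly -> 0.875
```

The pairwise loop is quadratic, so on the full dataset a library implementation would be the practical option. Unlike accuracy, this number does not improve when a model simply labels everything as normal, because such a model gives every transaction the same score and lands at 0.5.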

@magicmushroom9670 3 years ago

exactly.

@kimchi6284 9 months ago

Hey, I have a project on this topic and I don't have any idea about it. Could you help, please?

@nashgaming2761 4 years ago

Is there a dataset that shows the variables used, without confidentiality issues?

@Han-ve8uh 2 years ago

11:31 says "root node will be selected in such a way that the outlier will be splitted". This doesn't sound like random selection of the split point. Doesn't that contradict what the text in the notebook says: "by randomly selecting a feature and then randomly selecting a split value"?

@lanashin2631 3 years ago

Wow, you explain it so well. I learned a lot from you, thank you!

@baidash3104 3 years ago

Fraud data: 492. Normal data: 284315. If we use a dumb model and set it to predict that all transactions are normal, we get an accuracy of 99.83%, which I believe is what both models are doing internally, based on the precision and recall of the +ve class label.

@piyapiyagill 2 months ago

Which dumb model?

@baidash3104 2 months ago

@piyapiyagill We won't need a model. Given any query, we just say it's not fraud.
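The arithmetic behind this thread can be checked in a few lines. The class counts are the ones quoted in the comment above (492 fraud, 284315 normal); the rest is a plain confusion-matrix calculation for a rule that always answers "not fraud".

```python
# Class counts quoted in the thread above.
n_fraud = 492
n_normal = 284315
total = n_fraud + n_normal

# Confusion matrix of a rule that labels every transaction as normal.
true_positives = 0          # fraud correctly flagged
false_negatives = n_fraud   # fraud missed
true_negatives = n_normal   # normal correctly passed
false_positives = 0         # normal wrongly flagged

accuracy = (true_positives + true_negatives) / total
fraud_recall = true_positives / (true_positives + false_negatives)

print(f"accuracy     = {accuracy:.2%}")      # 99.83%
print(f"fraud recall = {fraud_recall:.2%}")  # 0.00%
```

The rule catches no fraud at all and still reaches 99.83% accuracy, which is why the commenters keep asking for recall, precision, or a precision-recall based score instead.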

@jimmywhite5458 2 years ago

Great explanation, super interesting, thank you

@jammulanarendar9910 4 years ago

Hi Krish, I have a small doubt. If I look at the classification reports of Isolation Forest, Local Outlier Factor and SVM, the recall is very low for class 1 (i.e. fraud). Can we say the model is accurate just by looking at the accuracy? As far as I know, accuracy is not a good metric for class-imbalanced data: say a sample of 100 records has a 90:10 ratio of non-fraud to fraud records; even if the model blindly says everything is non-fraud, the accuracy would be 90%. That's why the Kaggle people recommend using the area under the precision-recall curve (AUPR) as the metric for checking performance. Thanks in advance.

@pawansapkota3970 2 years ago

Yes, the area under the precision-recall curve is the best metric for this dataset. I am also doing a project on the same dataset; balancing the dataset using SMOTE and then classifying with Random Forest yields a better result.
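One common way to summarise the precision-recall curve is average precision: walk down the transactions from the highest score to the lowest and average the precision measured at each fraud case that is found. The sketch below is a minimal version under stated assumptions: the function name and toy data are made up for illustration, and tied scores are simply taken in sorted order rather than grouped.

```python
def average_precision(y_true, scores):
    """Mean of the precision values observed at each true positive, ranked by score."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")

    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    return precision_sum / n_pos


# Toy data (hypothetical): higher score means "more likely fraud".
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
ap = average_precision(y_true, scores)
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

Because the score is built only from the fraud class, a detector that ranks fraud no better than chance ends up near the fraud rate itself (well under 1% for this dataset) instead of near 99.8%.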

@divyanshumishra5993 1 year ago

Can you please help me create the front end (GUI) for this project as a web application?

@thisaintarf 4 years ago

Why don't you use sensitivity and F1 score instead of accuracy to evaluate model performance?

@Amina-xu8uj 2 years ago

Thank you, nice explanation

@mbmk92 5 years ago

Thank you for this approach to the imbalanced dataset issue in fraud detection. However, what is your opinion on using oversampling with the Synthetic Minority Over-sampling Technique (SMOTE) together with edited nearest neighbours (ENN) instead of the anomaly detection approach?

@fekiyounes5181 1 year ago

I think that if you were able to generate data, you would be able to classify it... :D Undersampling is the other side of the coin: you are creating a boundary by eliminating the most important and useful data, which you would probably misclassify if it were not deleted... In my opinion the best options are sampling and creating a voting classifier, or autoencoders and loss thresholding...

@meetmeraj2000 4 years ago

Why didn't you use SMOTE for upsampling the data?

@mishralucky 2 years ago

How can authorised push payment fraud prevention and detection be implemented using ML/DL? Please share a video if possible, Krish.

@jigneshkhandare321 9 months ago

I want to make a project on this topic using a different algorithm. Can you help?

@ANJALIVERMA-d4c 1 year ago

I wanted to know why everyone is using only this dataset, as a lot of datasets are available on the Kaggle website under the same name, "credit card fraud detection", but with different data. Does anyone know whether I can make my project on a different one as well?

@rishabhjain2559 3 years ago

How can we say Isolation Forest and LOF are performing well? Recall values for class 1 are very low in both cases. The model is not able to detect even 50% of the outliers. Accuracy won't be a good metric because of the imbalanced data.

@motivational_19171 2 years ago

In[27]: I am getting this error: 'not all arguments converted during string formatting'
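The cell behind In[27] is not shown, so the exact cause cannot be known. A typical way to get this message in Python is to apply the `%` operator to a string that has fewer placeholders than values. The sketch below reproduces the error with a made-up message and value, then shows two ways to write the line so that it works.

```python
n_errors = 73  # illustrative value only

# The string has no placeholder such as %d, so the value cannot be consumed.
try:
    message = "Number of errors: " % n_errors
except TypeError as exc:
    error_text = str(exc)

print(error_text)  # not all arguments converted during string formatting

# Fix 1: add a placeholder for every value passed to %.
fixed_percent = "Number of errors: %d" % n_errors

# Fix 2: use str.format (or an f-string) with matching braces.
fixed_format = "Number of errors: {}".format(n_errors)

print(fixed_percent)
print(fixed_format)
```

Checking the failing cell for a `%` that follows a string, and counting placeholders against values, is usually enough to locate the problem.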

@shikharsaxena9989 5 years ago

Thanks for clearing up my concept

@swaniketchowdhury 5 years ago

This may sound crazy, but how can we check whether a transaction is fraud or not by using some other variables like card number or something like that?

@doomsday7699 4 years ago

If you can convert such variables into numerical values that have meaning, then you can try to make a correlation table and see if the output is dependent on such variables. That is one way of doing it. Another way is, if you feel that a variable is not useful, you could directly drop it and then see the accuracy. Or there might be another statistical way that I do not know about.

@MrChudhi 2 years ago

Just a question: which metric is the best for detecting credit card fraud? I believe it should not be accuracy, so it should be recall. Am I correct?

@omkarr8282 5 years ago

Hats off man!! I think this is one of the best videos on imbalanced datasets. I was tired of reading about SMOTE every time I wanted to look for ways to deal with target variable imbalance! Thanks for sharing :) I was wondering, did you stratify the data for training? Does it really matter whether we stratify the data or not, given that it is already imbalanced?

@sgracem2863 3 years ago

SMOTE actually seems a lot simpler than this though?

@sneha2502 3 years ago

The last part of the code is not working, sir.. 😬.. it is showing that random_state is an unexpected keyword argument... What changes do I have to make? Please let me know, sir.. please

@ajoychatterjee3105 3 years ago

Accuracy is not the factor of concern here. Precision and recall are 100% for non-fraud, but only 0.26 and 0.27 for fraud transactions. Is that a good number to consider? Just curious to know what real-world businesses accept.

@meghasmita152 3 years ago

This is on 0.1 percent of the data; results could vary with other data samples. How do we sample effectively in such situations?

@tanzinahossain8179 1 year ago

Hello, can someone explain V1 to V28? And how many features are taken? And also the heat map...

@ashwinimagar4822 5 years ago

Thanks for the explanation. Why does recall in your result look really low?

@krishnaik06 5 years ago

It is an imbalanced dataset.

@ashwinimagar4822 5 years ago

@krishnaik06 Yes, I got that, but generally the cost associated with false negatives in fraud detection is very high. Marking fraudulent transactions as non-fraudulent is expensive, and hence recall for the fraud class should be high. What is your cost function here?

@prashantsolanki007 5 years ago

@ashwinimagar4822 Yeah, his whole metrics setup is messed up. That model is no better than a dumb model giving all negatives.

@quaziharisahmed6172 4 years ago

@krishnaik06 Sir, whether the classes are imbalanced or balanced, what is the use of the model if it is not able to predict properly? In this case random forest performs much better than isolation forest.

@quaziharisahmed6172 4 years ago

@prashantsolanki007 true

@naveenmami7438 5 years ago

Thanks for the inputs! Will it exclude the anomalies detected before building the model with isolation forest and calculating the accuracy? Please elaborate, sir.

@divyanshumishra5993 1 year ago

Can you please help me create the front end (GUI) for this project as a web application?

@hariharangr4045 2 years ago

Thanks for the explanation. Can you explain how to remove the detected outliers from the dataset?

@eswarsaipallapolu7484 3 years ago

from pylab import rcParams
rcParams['figure.figsize'] = 14, 8
RANDOM_SEED = 42
LABELS = ["Normal", "Fraud"]

For what purpose is this block of code used? Can anyone explain it line by line?

@divyanshumishra5993 1 year ago

Can anyone tell me how to create the front end of this project?

@divyanshumishra5993 1 year ago

Please... help, if you can..

@muditrustagi5775 4 years ago

Sir, I have a doubt regarding a code cell that you have provided.

@bharathkulkarni4207 2 years ago

How will it really reduce fraud? Even the user himself can report an unnoticed transaction to the bank; what action will be taken next? I'm literally unable to understand how this project will reduce fraud; it will just identify fraud that can be detected by the user himself. Anyone, please reply.

@rakeshvanga5514 2 years ago

Would you please provide me the link to the dataset?

@TT-ds7kb 3 years ago

And what will it be considered as, I mean, as software or an application?

@gungunalewithsakshi7579 4 years ago

Could you please explain how, given any transaction, we will get to know that it is a fake transaction if we don't have a feature like Class (0, 1)?

@vivekkumar-dn2vb 3 years ago

How to resolve this error, sir? Please help. In[25]: __init__() got an unexpected keyword argument 'random_state'

@kislayanupam1027 3 years ago

Set 'random_state' = None or remove it.
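An "unexpected keyword argument" error means the constructor being called has no parameter of that name in the installed library version, so passing `random_state=None` would most likely fail the same way, and leaving the argument out is the change that helps. A likely cause here (an assumption, since the cell is not shown) is that one of the notebook's estimators does not accept `random_state` in the reader's scikit-learn version. The sketch below uses only the standard library and a hypothetical stand-in class, `ToyDetector`, to show how to check which keywords a constructor accepts before calling it.

```python
import inspect


def unsupported_kwargs(cls, kwargs):
    """Return the keyword names that cls's constructor would reject."""
    params = inspect.signature(cls).parameters
    # A constructor with **kwargs accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]


class ToyDetector:
    """Hypothetical stand-in for an estimator without a random_state parameter."""

    def __init__(self, n_neighbors=20, contamination=0.1):
        self.n_neighbors = n_neighbors
        self.contamination = contamination


wanted = {"n_neighbors": 20, "contamination": 0.0017, "random_state": 42}

rejected = unsupported_kwargs(ToyDetector, wanted)
print(rejected)  # ['random_state']

# Build the object with only the keywords it supports.
accepted = {k: v for k, v in wanted.items() if k not in rejected}
detector = ToyDetector(**accepted)
```

The same check can be pointed at the real estimator classes in the notebook to see which one is raising the error.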

@thecryptotradingclub 5 years ago

Nice video. Thanks for sharing!

@bogdanilie6152 4 years ago

Why are you using SVM? Isn't this an algorithm for supervised learning? Given that we have unlabeled data, I assume we should not use it. I'd greatly appreciate your explanation. Many thanks.

@kushagrasharma4075 3 years ago

It is labeled.

@shreyasbhosale5651 1 year ago

Where do I get the dataset for the code?

@sakshimishra6450 4 months ago

On Kaggle

@satyabansahoo6075 4 years ago

Sir, can I fork your projects and work on that?

@lizmathew1481 4 years ago

Sir, where can I find the source code for this project?

@chiragrana5039 4 years ago

I still feel that if a bank has a fraud of a very high amount, it won't include that in the dataset, as it might create a huge problem for the bank, so how much to trust the data is difficult to say.

@tyitb156shubhampakale5 2 years ago

Where can I get the creditcard.csv file?

@aniketgaikwad1157 5 years ago

Do we have to fill the null values before applying the ML algos in this video?

@doomsday7699 4 years ago

Yup. Otherwise they will mess up the entire algorithm. It might even throw exceptions, depending on the library you use.

@rh334 3 years ago

Code not working. Declare input variable first.

@jen_1105 4 years ago

Sir, where can we download this notebook file?

@louerleseigneur4532 3 years ago

Thanks

@TT-ds7kb 3 years ago

Sir, where will we deploy the project?

@nashgaming2761 4 years ago

Can I get a dataset with principal parameters such as transaction location and user pattern (behaviour), like daily transaction amount etc.?

@swapniljena8684 4 years ago

All the credit card transaction datasets on Kaggle have had PCA performed on them.

@nashgaming2761 4 years ago

@swapniljena8684 Is there any other site where I can get one?

@swapniljena8684 4 years ago

@nashgaming2761 Well, you can try searching for raw information, but I don't think it is easy to find. No bank would post such information without encryption.

@prakritsinha3094 3 years ago

Can anyone explain why the metrics for class '1' are so low? E.g. in the isolation forest algorithm, why are the precision, recall and F1 score 0.26, 0.27 and 0.26 respectively? Is that the desired result?
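The three numbers in the question are tied together: F1 is the harmonic mean of precision and recall, so it cannot be high when both of them are low. A short check using the values quoted above (the function name `f1_from` is just a label chosen for this sketch):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the comment above for the fraud class.
precision = 0.26  # share of flagged transactions that really are fraud
recall = 0.27     # share of fraud transactions that were flagged

f1 = f1_from(precision, recall)
print(round(f1, 2))  # 0.26
```

Read plainly, a recall of 0.27 means roughly three out of four fraud cases in the evaluated sample were missed, and a precision of 0.26 means roughly three out of four alerts were false alarms.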

@annapoornayaligar4187 4 years ago

Sir, this was very useful for my presentation, but can you please tell me how you will predict whether the card is defrauded: by scanning the card or by using transaction history? Please reply, it's very important to my presentation.

@datatorture3086 3 years ago

The dataset already defines which transactions are fraudulent, but sir has used unsupervised learning algorithms to carve out the relevant information from the dataset. You should explore the PyCaret module, which simplifies all the preprocessing steps.

@bhanuprakashreddyvennapusa7790 4 years ago

For checking a transaction, where can we get the transaction details?

@bhanuprakashreddyvennapusa7790 4 years ago

@Krish Naik Can you reply to my comment and question?

@shahalam-ei4cm 5 years ago

Sir, where do we get the CSV file from? Please... tell.

@vinuyesudas5396 5 years ago

Kaggle.com is the best site for various datasets

@sgracem2863 3 years ago

www.kaggle.com/mlg-ulb/creditcardfraud?select=creditcard.csv

@shubhangisakarkar9532 4 years ago

Hello sir, on what basis is the dataset imbalanced? Please elaborate.

@doomsday7699 4 years ago

The number of examples per class in the dataset. The number of examples for the fraudulent case is much, much lower than for the non-fraudulent case. A balanced dataset means an equal or close number of examples per class.

@sandipansarkar9211 3 years ago

good

@jaysoni7812 4 years ago

Why does our classifier return pred values -1 and 1 instead of 0 and 1?

@chamodmaduranga3866 4 years ago

It's the default return. If you want to change to 1's for fraud and 0's for non-fraud, use the map function.

@jaysoni7812 4 years ago

@chamodmaduranga3866 But why? For other problems it gives the same values as in y. Is there anything special in this case?
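On the "but why": scikit-learn's outlier detectors are fitted without using the labels in y, so they cannot return the dataset's own class values. Their `predict` (or `fit_predict`) output follows a fixed convention of +1 for inliers and -1 for outliers. To compare against a 0/1 `Class` column, the predictions have to be relabelled. A minimal sketch with made-up values:

```python
def to_class_labels(raw_predictions):
    """Map detector output (+1 inlier, -1 outlier) to 0 = normal, 1 = fraud."""
    return [1 if p == -1 else 0 for p in raw_predictions]


# Hypothetical detector output and ground-truth labels for five transactions.
raw_predictions = [1, 1, -1, 1, -1]
y_true = [0, 0, 1, 1, 0]

y_pred = to_class_labels(raw_predictions)
n_errors = sum(t != p for t, p in zip(y_true, y_pred))

print(y_pred)    # [0, 0, 1, 0, 1]
print(n_errors)  # 2
```

Note that the error count mixes both kinds of mistake: here one missed fraud and one false alarm.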

@ashwinikattimani9298 4 years ago

How to train on this data using TensorFlow Lite?

@ashwinikattimani9298 4 years ago

And how to add it to an Android app?

@mithunkumar7063 5 years ago

How to remove those outliers from the data?

@__ALahari 2 years ago

How do we know whether it is fraud or not?

@sujithpawan7246 5 years ago

How to download the ipynb of your data?

@ramanjeet1111 3 years ago

Companies are still using manual methods.

@redditinside.8722 5 years ago

Bro, I need a description of this project.

@magicmushroom9670 3 years ago

kzbin.info/www/bejne/nKOwkGqLgqmSY6M At this point it gives you 73 errors, but those errors come from both fraud and normal transactions. I researched this, and in my opinion it is not a very good metric for checking the accuracy of an anomaly detection system. You didn't explain recall, which is the most important part of this whole video. Apart from that, it would be more helpful if you explained the ROC and threshold in this scenario, which is what is needed, not just the implementation of an algo. BTW, I have great respect for your work and love what you are doing.

@shrenikaadsul1154 3 years ago

data=pd.read_csv('creditcard.csv') data.head() This command gives a syntax error.
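If the two statements were really typed on one line, as they appear in the comment, Python rejects the cell before running anything, because two statements cannot share a line without a separator. The sketch below checks this with the built-in `compile`, which only parses the text and does not need pandas or the CSV file; the helper name `is_valid_python` is an assumption for the example.

```python
def is_valid_python(source):
    """Return True if the source text parses, without executing it."""
    try:
        compile(source, "<cell>", "exec")
        return True
    except SyntaxError:
        return False


one_line = "data = pd.read_csv('creditcard.csv') data.head()"
two_lines = "data = pd.read_csv('creditcard.csv')\ndata.head()"

print(is_valid_python(one_line))   # False
print(is_valid_python(two_lines))  # True
```

Putting `data.head()` on its own line, as the last line of the cell, is the form that parses.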

@ruthwikakumari3622 4 years ago

Can I have a video on random-tree-based and CART-based algorithms for credit card fraud detection?

@divyanshumishra5993 1 year ago

Can you please help me create the front end (GUI) for this project as a web application?

@akshats5996 2 years ago

This is a classic imbalanced dataset problem! You lost the plot when you focused on accuracy instead of recall. Recall is very poor in the outputs you have shown.

@varshatn2537 5 years ago

@krish naik

@sudarshankadge7945 4 years ago

How to detect fraud transactions with the same file without using the "Class" column?

@jeevanchavan143 4 years ago

Using unsupervised learning, clustering

@swapniljena8684 4 years ago

First we need to know how we classify a transaction as a fraudulent one; only then will we be able to know how to detect them.

@swapniljena8684 4 years ago

@jeevanchavan143 How do we know which cluster belongs to class "1", i.e. fraud?

@shravanKumar-yc9cj 4 years ago

After clustering on the known data, you can use it on the file with unknown class.

@varshatn2537 5 years ago

Hello sir, we are currently working on this project... We want to get in touch with you for more help and details... Could you please provide us with your email?

@vinuyesudas5396 5 years ago

How can we analyze whether a transaction is fraud or not without Class? This dataset already contains Class as info on whether the transaction is fraud or not, right?

@manankalra7978 4 years ago

@vinuyesudas5396 I guess knowing the Class variable has no role in training those models. Class here is just used to calculate the accuracy of our predictions.

@chamodmaduranga3866 4 years ago

@manankalra7978 The contamination in isolation forest is set based on the Class variable.
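This last point can be made concrete. If the outlier fraction passed as `contamination` is derived from the counts of the two classes (an assumption about the notebook, which is not shown here), then the labels do influence the detector, even though they are not used as a training target. The sketch below uses the full-dataset counts quoted earlier in the comments and shows two ways such a fraction might be computed.

```python
from collections import Counter

# Labels rebuilt from the counts quoted earlier in the comments:
# 284315 normal (0) and 492 fraud (1) transactions.
labels = [0] * 284315 + [1] * 492
counts = Counter(labels)

n_fraud = counts[1]
n_valid = counts[0]

# Ratio of fraud to valid transactions.
ratio_to_valid = n_fraud / n_valid

# Share of fraud in the whole dataset, which matches scikit-learn's
# description of contamination as the proportion of outliers in the data.
share_of_all = n_fraud / (n_fraud + n_valid)

print(f"{ratio_to_valid:.6f}")  # 0.001730
print(f"{share_of_all:.6f}")    # 0.001727
```

The two values are nearly identical for data this imbalanced. In a setting with no labels at all, this fraction would have to come from domain knowledge or be treated as a tuning parameter.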