Implementing a Spam Classifier in Python | Natural Language Processing

Views: 115,363

Krish Naik

Comments: 170
@priyasinha2251 4 years ago
I am not a girl who generally comments on YouTube videos, but I am learning from your videos, and this is my genuine comment: you are amazing, and your data science concepts are very clear and to the point. I am very happy that a teacher like you is here. Superb job, sir!
@amankukar7586 2 years ago
Who asked you whether you generally comment or not?
@unknownfacts3716 2 years ago
@amankukar7586 good one bro
@unknownfacts3716 2 years ago
Get out of here at the first opportunity; don't be so formal here.
@moindalvs 2 years ago
"I am not a girl": okay, can't say these days. "Who generally comments on YouTube videos": first of all, YouTube doesn't have any comment history data to prove this. Second, how dare you call this just another YouTube video? How dare you generalise an educational video that is free of cost, while people pay a hefty price for such content? Shame on you!
@vipinbansal8886 4 years ago
I had been trying to understand NLP concepts from various books and videos for the last two months, but the concepts were not clear to me. This explanation is really awesome and explained in a very easy way. Thanks, Krish.
@yonasbabulet3836 2 years ago
I have seen a lot of YouTube tutorials, but I can't find tutorials like yours, which are clear and more precise. Keep going.
@piyushaneja7168 4 years ago
You are great, sir. It's very difficult to find a good channel that explains the code line by line ❤💥👏
@navrozlamba 4 years ago
I would say that to prevent leakage we should split our data before we call fit_transform on the corpus. Otherwise we are teaching the vocabulary of the whole dataset to our model, which defeats the purpose of splitting into train and test afterwards. The whole purpose of the test set is to test our model on data that it has never seen before. Please correct me if I am wrong! Cheers!!
@cristianovivk4935 4 years ago
I agree, we should split before fit_transform to prevent leakage.
@iEntertainmentFunShorts 4 years ago
+1
@tejashshah5202 3 years ago
Agree, split before getting the BOW.
@КаратэПацан-я6б 2 years ago
Hi. The CountVectorizer is not an ML model; it just converts text to vectors (a matrix of numbers).
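The split-before-fit idea discussed in this thread can be illustrated with a small dependency-free sketch. The helper names `fit_vocabulary` and `transform`, the toy corpus, and the 75/25 split are assumptions made for illustration, not code from the video. Even though a vectorizer is not a classifier, its vocabulary is still learned from data, so here it is learned from the training messages only.

```python
import random

def fit_vocabulary(docs):
    """Learn a word -> column index mapping from the given documents only."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Turn documents into count vectors; words outside vocab are ignored."""
    rows = []
    for d in docs:
        row = [0] * len(vocab)
        for w in d.split():
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows

# Toy corpus (hypothetical messages, 1 = spam, 0 = ham).
corpus = ["free prize call now", "see you at lunch",
          "win free cash now", "call me after lunch"]
labels = [1, 0, 1, 0]

# 1) Split first, so the test messages stay unseen.
rng = random.Random(0)
idx = list(range(len(corpus)))
rng.shuffle(idx)
cut = int(0.75 * len(idx))
train_docs = [corpus[i] for i in idx[:cut]]
test_docs = [corpus[i] for i in idx[cut:]]

# 2) Fit the vocabulary on the training split only.
vocab = fit_vocabulary(train_docs)

# 3) Reuse that vocabulary for both splits.
X_train = transform(train_docs, vocab)
X_test = transform(test_docs, vocab)
```

With scikit-learn the same order would be `train_test_split` on the raw corpus, then `CountVectorizer.fit_transform` on the training text and `transform` on the test text.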
@utkar1 5 years ago
Thank you, the whole NLP playlist is very helpful!
@arjyabasu1311 4 years ago
Exactly
@dipakwaghmare1228 a year ago
Sir, watching this lecture of yours has fulfilled my penance ❤️ Thank you so so so so so much, sir ❤️❤️❤️❤️
@mansoorbaig9232 4 years ago
Great work, Krish. You have a knack for explaining things in a pretty simple manner.
@billyerickson353 11 months ago
🎯 Key Takeaways for quick navigation:
00:00 📚 *Introduction to Spam Classifier Project*
- Creating a spam classifier using natural language processing.
- Overview of the dataset from UCI's SMS Spam Collection.
- Reading and understanding the dataset structure.
01:47 📂 *Exploring the Dataset and Data Preprocessing*
- Explanation of the SMS spam collection dataset.
- Reading the dataset using pandas and handling tab-separated values.
- Data cleaning and preprocessing steps using regular expressions and NLTK.
05:46 🧹 *Text Cleaning and Preprocessing*
- Using regular expressions to remove unnecessary characters.
- Lowercasing all words to avoid duplicates.
- Tokenizing sentences, removing stop words, and applying stemming.
13:52 🎒 *Creating the Bag of Words*
- Introduction to bag-of-words representation.
- Implementation of count vectorization using sklearn's CountVectorizer.
- Selecting the top 5,000 most frequent words as features.
17:27 📊 *Preparing the Output Data*
- Converting the categorical labels (ham and spam) into dummy variables.
- Finalizing the output data with one column representing the spam category.
- Overview of the preprocessed data for training the machine learning model.
21:04 📊 *Data Preparation for Spam Classification*
- Data preparation involves creating independent (X) and dependent (Y) features.
- Explanation of dummy variable trap in categorical features.
- Introduction to the train-test split for model training.
22:30 🛠️ *Addressing Class Imbalance and Train Spam Classifier*
- Discussion on class imbalance issue in the data.
- Introduction to Naive Bayes classification technique.
- Implementation of the Naive Bayes classifier using multinomial Naive Bayes.
24:22 📈 *Evaluating Spam Classifier Performance*
- Explanation of the prediction process using the trained model.
- Introduction to confusion matrix for model evaluation.
- Calculation of accuracy score for the spam classifier (98% accuracy).
27:50 🔄 *Improving Spam Classifier Accuracy*
- Suggestions for improving accuracy, including the use of lemmatization.
- Mention of addressing class imbalance for better performance.
- Recommendation to explore TF-IDF model as an alternative to count vectorization.
Made with HARPA AI
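The cleaning steps listed in these takeaways (regex filtering, lowercasing, stop-word removal) can be sketched roughly as follows. The tiny `STOP_WORDS` set and the `clean_message` helper are assumptions for illustration; the video uses NLTK's English stop-word list and also stems each token, which is left out here to keep the sketch dependency-free.

```python
import re

# Tiny illustrative stop-word list (assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "have", "for", "in", "of", "and"}

def clean_message(text):
    """Keep letters only, lowercase, drop stop words, and rejoin the tokens."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # digits, punctuation, symbols -> space
    tokens = letters_only.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)

print(clean_message("WINNER!! You have won a £1000 prize. Call 09061701461 now"))
# prints: winner won prize call now
```

A stemmer such as NLTK's `PorterStemmer` would slot into the list comprehension, applied to each kept token.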
@matanakhni 3 years ago
Best NLP videos of all time. A complete gist, and mind you, not for the faint-hearted. Excellent job, Krish. Initially I had given up on NLP completely, but now I have renewed vigour after such exemplary teaching.
@niksvp93 3 years ago
The best possible tutorial on Data Science/Machine Learning on KZbin. Cheers to you, brother! :D
@javiermarti_author 5 years ago
You are an excellent teacher. Thanks for making/uploading these videos
@sivabalaram4962 2 years ago
You are a genius at explanation, Krish Naik ji. You're the best 👍👌👌👌
@dheerajkumar9857 3 years ago
Excellent, very happy to see this type of explanation @Krissh Naik, we will definitely do good.
@sauravkumar-cw5bm 3 years ago
I used lemmatization and TF-IDF in text preprocessing and got an accuracy score of 0.971.
@ashishn.c.7913 4 years ago
I am getting these accuracy values for different combinations:
- Stemming and CountVectorizer: accuracy = 98.5650%
- Lemmatization and CountVectorizer: accuracy = 98.29596%
- Lemmatization and TfidfVectorizer: accuracy = 97.9372197309417%
- Stemming and TfidfVectorizer: accuracy = 97.9372197309417% (same as Lemmatization and TfidfVectorizer)
@sanandapodder5027 4 years ago
Thank you very much, sir. Your videos are really very helpful; I am learning NLP for the first time from your channel. I don't know machine learning, which is why I am facing a little trouble.
@mohammedsohilshaikh6831 3 years ago
I am so addicted to his videos that I sometimes even forget to like the video. 😂
@ABHINAVARYA 3 years ago
Best playlist to learn NLP. Thank you Krish.. 🙂
@ushirranjan6713 4 years ago
It's really a fantastic video, sir. You explained many things in a way that is very easy to understand. Thanks a lot, sir!!!
@lifebytesss 4 years ago
Just amazing, sir. Can't comment enough, very useful sessions. Thank you.
@DhananjayKumar-oh2hh 3 years ago
You are really great, sir. You have explained each and every topic very well. Hats off to you.
@mbmathematicsacademic7038 3 months ago
I used logistic regression (multiclass was specified) and I achieved 94.3% accuracy on test data and 95.7% accuracy on test data
@vinimator 4 years ago
Hi Krish, I am the newest subscriber to your channel, and I hope this video will help me complete a project of my own. Thank you so much. I will continue to learn.
@mandeep8696 3 years ago
Thank you, Krish, for sharing the knowledge.
@rahuljaiswal9379 5 years ago
You are an awesome teacher; it is really helpful for me... God bless you.
@debatradas1597 3 years ago
Thank you so much Krish Sir...!!!
@Thebeautyoftheworld1111 a year ago
Keep up the good work. Thanks
@tarung7088 4 years ago
Here the dataset is highly imbalanced (i.e. ham: 4825, spam: 747), so we got the high accuracy.
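To put the imbalance in numbers, here is a small sketch using the class counts quoted in the comment above. The confusion-matrix counts passed to `precision_recall` are hypothetical and are not results from the video; they only show how precision and recall for the spam class would be computed alongside accuracy.

```python
ham, spam = 4825, 747  # class counts quoted in the comment above

# A "classifier" that always answers ham is already right this often.
baseline_accuracy = ham / (ham + spam)
print(f"majority-class baseline: {baseline_accuracy:.3f}")  # prints 0.866

def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (spam) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts for the spam class, for illustration only.
p, r = precision_recall(tp=130, fp=5, fn=20)
print(f"precision={p:.3f} recall={r:.3f}")
```

Since always predicting ham already scores about 87%, accuracy figures are more informative when read against that baseline and together with spam recall.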
@gauravpardeshi6056 2 years ago
Very good video, sir... thank you
@ManiKandan-ol9gm 3 years ago
Really, no words to describe you... lots of love, sir ❤️ Thank you so much, sir, it means a lot.
@Skandawin78 5 years ago
Good job, Krish, with the NLP playlist
@chandrakanthshalivahana1417 5 years ago
Hello, sir. I am very happy that you are making videos. Please make more videos on Kaggle competitions.
@AdityaKumar-cr9mc 2 years ago
You are simply amazing
@usaikiran96 11 months ago
How do we decide when to use CountVectorizer or TF-IDF? How do we decide whether/when to use stemming or lemmatization? In this example, why didn't you use TF-IDF instead of bag of words? And why was lemmatization not used instead of stemming?
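As background for the count-versus-TF-IDF question, the sketch below computes the textbook TF-IDF weight, term frequency times log(N / document frequency), on a toy corpus. The `tf_idf` helper and the example messages are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed and normalized variant, so its numbers would differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights

docs = ["call free prize now", "call me now", "call for free cash"]
weights = tf_idf(docs)

# "call" occurs in every message, so it carries no weight;
# "me" occurs in only one message, so it is weighted highest there.
print(weights[1])
```

Raw counts treat a word that appears everywhere the same as a rare one, while TF-IDF down-weights it. Which representation scores better is an empirical question for a given dataset, as the accuracy figures reported in other comments suggest.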
@mujeebrahman5282 4 years ago
Sometimes the error is good for health😂
@nehasrivastava8927 4 years ago
Thank you, sir, for the wonderful explanation
@AltafAnsari-tf9nl 3 years ago
Thank you so much for sharing your knowledge with us
@jinks6887 2 years ago
You are bhagwaan (God) for me, sir
@nehamanpreet1044 5 years ago
Please make videos on word embeddings like word2vec/GloVe/BERT/ELMo/GPT/XLNet etc.
@farnazfarhand5957 3 years ago
It was so clear and helpful, thank you so much
@sandipansarkar9211 4 years ago
Thanks, Krish. Superb explanation once again. All my concepts about NLP are crystal clear now. I know a career in NLP is superb, but can you explain its exact value in terms of a data science career? Please guide me and feel free to reply, as I am eagerly waiting. Thanks once again.
@ranjanjena2996 5 years ago
I have created the model and saved it using joblib. I don't understand how to use the model for prediction. Is there any way I can pass the email text to the model so that it can detect spam or ham? I am a newbie, please help. Thanks
@aleenajames7609 5 years ago
Have you figured out how to do it? If yes, please let me know too.
@soumyadev100 3 years ago
Hi Krish, good session. I have one comment. For the test corpus, the better practice may be to use transform only: fit_transform on train and only transform on test. And the train-test split should be done before we build the corpus. Let me know what you think.
@shahariarsarkar3433 2 years ago
Brother, you are making helpful content for us. Can you tell me how to remove the stopwords of other languages like Bangla or Hindi?
@suvarnadeore8810 3 years ago
Thank you, Krish sir
@afaqueumer7968 3 years ago
Hello, sir. Can you please make a video on topic analysis with LDA? There aren't any clear-cut videos like yours on YouTube yet.
@sathishk8685 5 years ago
Hi Krish, Excellent explanation
@nehamanpreet1044 5 years ago
Sir, please make videos on LDA, NMF, SVD and Word2Vec models
@amruthasankar3453 a year ago
Thank you, sir ❤️🔥
@roshankumarsharma8725 4 years ago
Sir, why have we used MultinomialNB in this model and not BernoulliNB? Can we use BernoulliNB instead of MultinomialNB?
@emajhugroo109 4 years ago
Hello sir, I would like to know how to classify a new message as ham or spam after building the NB model.
@yogeshprajapati7107 4 years ago
You can do it like this.
    import re
    import pandas as pd
    from nltk.corpus import stopwords

    # cv, ps and spam_detect_model are the fitted CountVectorizer, the
    # PorterStemmer and the trained classifier from the video.
    df = pd.DataFrame(['this message is a spam'], columns=['message'])
    corpus = []
    for i in range(0, len(df)):
        review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
        review = review.lower()
        review = review.split()
        review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
        review = ' '.join(review)
        corpus.append(review)
    # Reuse the already fitted vectorizer: transform, not fit_transform.
    X_new = cv.transform(corpus).toarray()
    pred = spam_detect_model.predict(X_new)
    label = pred[0]
    if label == 1:
        print('Spam')
    else:
        print('Ham')
@joelkhaung 3 years ago
@yogeshprajapati7107 How does the model handle the 2500 features when predicting? I believe there will be a mismatch between the number of features from the new message and the number of features from the trained model. Can you share how to overcome this?
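On the feature-mismatch question: a bag-of-words vectorizer fixes its vocabulary when it is fitted, so transforming a new message always yields a vector with the same number of columns, and words it has never seen are simply dropped. The `BagOfWords` class below is a hypothetical dependency-free stand-in that mimics this behaviour, including a `max_features` cap; it is not scikit-learn's implementation.

```python
from collections import Counter

class BagOfWords:
    """Minimal count vectorizer with an optional cap on vocabulary size."""

    def __init__(self, max_features=None):
        self.max_features = max_features
        self.vocab = {}

    def fit(self, docs):
        # Keep only the most frequent words, then freeze the column order.
        counts = Counter(w for d in docs for w in d.split())
        top = [w for w, _ in counts.most_common(self.max_features)]
        self.vocab = {w: i for i, w in enumerate(sorted(top))}
        return self

    def transform(self, docs):
        rows = []
        for d in docs:
            row = [0] * len(self.vocab)   # length is fixed by the fitted vocabulary
            for w in d.split():
                if w in self.vocab:       # unseen words are ignored
                    row[self.vocab[w]] += 1
            rows.append(row)
        return rows

bow = BagOfWords(max_features=3).fit(["free prize now", "free cash now", "call now"])
new_vector = bow.transform(["claim your free bitcoin now now"])[0]
print(len(new_vector), new_vector)
```

This is why the snippet above calls `cv.transform` on the new message rather than `fit_transform`: refitting would build a different vocabulary, and the column count would no longer match what the classifier was trained on.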
@indian-inshorts5786 4 years ago
Sir, you are too good
@saratht8223 3 months ago
Hi Krish, suppose we need to implement spam identification from scratch. How can we come up with a solution? The sample data used here already has messages tagged as spam and ham by someone, sometime, somewhere. In practice, do we need to have sample data upfront? Can you please advise?
@puttacse 5 years ago
Hi Krish, why are we hard-coding max_features=5000? What if this code is migrated to production as-is and faces more tokens/features in live data (e.g. if live data has 0.1 million (1 lakh) features)? In that scenario, does our model fail?
@rajarshidgp2003 2 years ago
Instead of pd.get_dummies, sklearn.preprocessing.LabelEncoder can be used.
@tapabratacse 2 years ago
Why didn't you use a label encoder for the target column spam/ham?
@nikhilsharma6218 4 years ago
I have 2 questions. First: why only MultinomialNB? Is there a specific reason? Can't we use BernoulliNB or GaussianNB? Second: if the dataset is imbalanced we would use ComplementNB, but how do we know whether the dataset is balanced or imbalanced?
@manikhindwan6790 4 years ago
BernoulliNB - used when the features are binary, i.e. if 'X' is present then 'spam', else 'not spam'
GaussianNB - used when the feature values are continuous
MultinomialNB - uses the presence of words and their frequency of occurrence to decide the decision boundary
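To make the multinomial case concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing over word counts. The function names, the `alpha` parameter, and the toy messages are assumptions for illustration; the video uses scikit-learn's `MultinomialNB` instead.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for y in class_docs:
        model["log_prior"][y] = math.log(class_docs[y] / len(docs))
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        model["log_likelihood"][y] = {
            w: math.log((word_counts[y][w] + alpha) / total) for w in vocab
        }
    return model

def predict(model, doc):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for y, prior in model["log_prior"].items():
        scores[y] = prior + sum(
            model["log_likelihood"][y][w]
            for w in doc.split() if w in model["vocab"]
        )
    return max(scores, key=scores.get)

docs = ["free prize now", "win free cash", "see you at lunch", "call me after lunch"]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(docs, labels)
print(predict(model, "free cash prize"))   # prints: spam
print(predict(model, "lunch after call"))  # prints: ham
```

Each occurrence of a word adds its log-likelihood again, so word frequency matters. A Bernoulli variant would instead score only whether each vocabulary word is present or absent.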
@babyyoda5140 4 years ago
Boss, please also add sentiment analysis and topic modelling to your already wonderful repertoire!
@juanelnino 2 years ago
I have an error: it says unhashable type: 'list', even though all the steps are the same.
@gowrisancts 5 years ago
Good one... actually you may need to use the Bernoulli naive Bayes model, as it deals with binary values 0 and 1. Correct me if I am wrong.
@arjyabasu1311 4 years ago
Awesome work sir !!
@aninditadas832 3 years ago
Hello sir, why have we not used lemmatization here? Stemming may or may not give meaningful words, but we need meaningful words here, right?
@avinashsingh7698 4 years ago
Sir, can you please make a video on 'Generate paraphrase from the text using NLP'?
@yogeshprajapati7107 4 years ago
To predict whether a new message is spam or ham, write this code.
    import re
    import pandas as pd
    from nltk.corpus import stopwords

    # cv, ps and spam_detect_model are the fitted CountVectorizer, the
    # PorterStemmer and the trained classifier from the video.
    df = pd.DataFrame(['this message is a spam'], columns=['message'])
    corpus = []
    for i in range(0, len(df)):
        review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
        review = review.lower()
        review = review.split()
        review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
        review = ' '.join(review)
        corpus.append(review)
    # Reuse the already fitted vectorizer: transform, not fit_transform.
    X_new = cv.transform(corpus).toarray()
    pred = spam_detect_model.predict(X_new)
    label = pred[0]
    if label == 1:
        print('Spam')
    else:
        print('Ham')
@maYYidtS 5 years ago
Excellent... Sir, instead of using the max_features parameter at 16:43, what if we apply PCA or LDA on all the columns?
@Anurag_077 3 years ago
Wonderful
@ashishgeorge2766 4 years ago
Can we apply a label encoder instead of one-hot encoding on the label column?
@awaisniaz5300 4 years ago
Yes, we can apply it when the feature has two categories.
@JoshDenesly 5 years ago
Hi Krish, please make a project relating to bigrams and unigrams too. Thank you
@krishnaik06 5 years ago
Sure I will do that
@tarunsubramanian9792 3 years ago
TypeError: cannot use a string pattern on a bytes-like object. This error is shown when lines 17-20 are executed. How do I rectify it? Someone help, please.
@efefmichelle 3 years ago
I get the same error!! What did you do?
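One common cause of this TypeError, offered here as an assumption since the commenters' code is not shown, is that the message column holds `bytes` rather than `str` (for example after reading the file in binary mode). The sketch below reproduces the error and decodes the value before applying the regex; `to_text` is a hypothetical helper and UTF-8 is an assumed encoding.

```python
import re

def to_text(value, encoding="utf-8"):
    """Return a str whether the input is bytes or already text."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return str(value)

raw = b"Free entry!! Text WIN to 80082"  # bytes, e.g. read from a file opened with "rb"

try:
    re.sub("[^a-zA-Z]", " ", raw)        # str pattern applied to bytes
except TypeError as exc:
    print("TypeError:", exc)

cleaned = re.sub("[^a-zA-Z]", " ", to_text(raw))
print(cleaned.split())
```

Checking `type(messages['message'][0])` on the loaded data would show whether the values are `bytes` in the first place.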
@yogeshwarshendye4857 3 years ago
How do you deploy sklearn models?
@AarushiMishra-x3w a year ago
Can't we make this code work in Jupyter Notebook instead of Spyder? I can't really see any output in Spyder.
@Lijoperumpuzhackal 5 years ago
I have gone through the 7 videos in the playlist. Everything is well explained in every video. Can you please tell me how to implement this program in a real scenario? Everyone ends their videos after building only the model. So please try to explain how we can use this model. If I have a text message, how do I find out whether it is spam or not using this model?
@krishnaik06 5 years ago
Check my deployment playlist, you will get to know.
@pradeepvaranasi 2 years ago
Can we just use an if-else condition on the label column to derive the 0-1 (spam-ham) column? What is the purpose of using the get_dummies function for a binary class column?
@dushyanthande1556 a year ago
Sir, I tried running the code, but the shapes of X and y are not the same, so train_test_split is not working. It says: Found input variables with inconsistent numbers of samples: [11144, 5572]
@aishwaryabhargava6554 2 years ago
How can we check the model on user-provided input?
@pinkalshah5237 4 years ago
With lemmatization and max_features, accuracy is 97%
@akashr9973 3 years ago
Hi sir, please correct me if I'm wrong. In line 30 you are applying the transform function to the whole data. Won't that be data leakage? The transform has to be applied after splitting the data, right? Thank you.
@КаратэПацан-я6б 2 years ago
Hi. The CountVectorizer is not an ML model; it just converts text to vectors (a matrix of numbers).
@abhishekpurohit3442 4 years ago
Sir, is it necessary to learn deep learning before coming to this playlist (as I see Keras and LSTM in the last videos)?
@ashwinbj 3 years ago
Practically, how do we check whether a message is spam or ham? That is, how do we pass the message to the model?
@premranjan4440 3 years ago
We could have used drop_first in get_dummies for the label instead of iterating over the whole array.
@abhishekpurohit3442 4 years ago
Sir, why did we go for bag of words and not TF-IDF? Is TF-IDF only used for sentiment analysis?
@salmankhan-vq7pc 3 years ago
Hi, nice lecture. I have a dataset with 1.3 million rows. I used your code, and when I build the bag of words my Google Colab crashes. Any solution?
@gemrose455 4 years ago
I don't have a dependent variable like "spam" in my imported document. How will I train on the dataset?
@sardar92 5 years ago
Very nice. Kindly post new videos.
@techbenchers69 4 years ago
Sir, what is the reason behind choosing the naive Bayes classifier? Why not another classifier?
@thunuguntlaruparani2058 4 years ago
Wouldn't there be a data leakage problem if we use fit_transform on the entire data?
@Sonu-bc7sm 5 years ago
Hi Krish, nice video. Just had a question: what if I put the model in production and a new message has words that are not part of my training dataset? Won't the features fail to match and the model give an error?
@yashwanthsrinivas4590 4 years ago
Hello Krish, how can we handle multiple-label classification problems?
@suhailhafizkhan9800 5 years ago
How can we visualize the actual result for clarification? Thanks
@deepakjoshi4699 2 years ago
I tried TF-IDF, but my score is better with bag of words. Is that possible, or am I making some mistake?
@monicameduri9692 2 years ago
Thanks a lot!
@kanishkapatel9077 4 years ago
How to make a GUI for this project? Any idea about it? It would be of great help!
@Rohan-cw9gn 3 years ago
You can use the Streamlit framework; without any knowledge of HTML or CSS you can make beautiful web apps.
@Anandasys 5 years ago
Hello sir, if we have a different number of labels or categories, such as business, sports, entertainment, category, politics, tech, history, then how can we get the dummy variables and bag of words, and how do we find which labels are present?
@furkhanmehdi6405 4 years ago
Legend ❤️
@praddhumnasoni3364 3 years ago
Sir, how do we predict on real-world text (meaning text from Gmail or something)?
@parimalbhoyar8579 4 years ago
Very helpful...!!!
@karthikelangovan5006 4 years ago
I have typed the code you explained in this video, "Implementing a Spam classifier in python| Natural Language Processing", but I am not getting the corpus list. I get an empty corpus, i.e. [' ', ' ', ' ', ...]
@datascience3008 2 years ago
Awesome
@insightworld9910 5 years ago
By using the lemmatization method we get an accuracy of 97.6