Text Representation Using Bag Of Words (BOW): NLP Tutorial For Beginners - S2 E3

52,749 views

1 day ago

Comments: 39

@codebasics 2 years ago

Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

@harshalbhoir8986 1 year ago

This was such a cool explanation. Thank you so much!!

@mohammadriyaz5586 9 months ago

Thank you so much, sir, for the easiest explanation.

@apurav363 1 month ago

Very helpful

@tomernx5 1 year ago

Great video! Thanks!

@AdityaKumar-w2t1z 1 year ago

Great work! Keep it up.

@napoleanbonaparte9225 5 months ago

So, sir, what we take from this video is that we can find the precision (%) for spam and its count in the list of emails. We can do that in two ways: 1. using train_test_split and MultinomialNB; 2. using CountVectorizer directly inside a Pipeline. So bag of words simply means collecting the same words together, as we did for spam here.

@AmitKumar-BIDSP 1 year ago

Great presentation. Thank you!

@Kaafirpeado54-6ayesha 1 month ago

Thanks 👍

@surinder3677 8 months ago

Notes: 16:40 CountVectorizer

@ashishpanchal701 2 years ago

Hello sir, thank you for being such a wonderful teacher!!! I just had a doubt about the Naive Bayes model built in this video: where have we used stemming or lemmatization, which would make it an NLP problem? If we have not used them, then won't it be a simple Naive Bayes machine learning problem?

@codebasics 2 years ago

You don't have to do stemming etc. for it to count as an NLP problem. Here, as you can see, we got pretty good accuracy without stemming etc. Hence you can consider this both an NLP and a machine learning problem. In fact, ML is used to solve NLP problems, so NLP is at a higher level. You can definitely try stemming etc.; I would request you to build that out and see how the model performance improves.

@nareshkumarvanga3127 6 months ago

Thank you Guruji

@pragtisood3239 1 year ago

Doubt: where can we get the CSV file?

@pineappleld 10 months ago

In this example, I can see the goal was to classify a message as either ham or spam. Is it possible to build something similar that classifies into four options?

@vyaspadala8468 2 years ago

Sir, please provide us with big data engineering and data science courses.

@arvindatmuri5604 2 years ago

Have a look at the NumPy, pandas, Matplotlib, and machine learning courses; data science is almost entirely covered by these topics.

@swanandAragade 1 year ago

Even if we give a spam mail as input, it isn't detected as spam; it only predicts ham for every mail.

@ayushgupta80 9 months ago

Bag of words: the size of the vector equals the size of the vocabulary (all elements are 0 except those for words present in the statement). Sparse representation: it may consume too much memory and compute resources.
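
To make the point about vector size concrete, here is a minimal pure-Python sketch. The three-message corpus and the helper names (`build_vocabulary`, `dense_vector`, `sparse_vector`) are made up for illustration and are not from the video; the whitespace tokenizer is also a simplification of what CountVectorizer does.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def dense_vector(document, vocabulary):
    """Full-length count vector: one slot per vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]

def sparse_vector(document, vocabulary):
    """Keep only the non-zero entries as {column_index: count}."""
    counts = Counter(document.lower().split())
    return {vocabulary[tok]: n for tok, n in counts.items() if tok in vocabulary}

# Hypothetical toy corpus (an assumption, not the spam.csv data).
corpus = [
    "win a free prize now",
    "are we meeting for lunch today",
    "free free entry in a weekly draw",
]

vocab = build_vocabulary(corpus)
dense = dense_vector(corpus[2], vocab)    # length equals vocabulary size
sparse = sparse_vector(corpus[2], vocab)  # only the words actually present

print(len(vocab), len(dense), len(sparse))
```

With a real vocabulary of thousands of words, almost every slot in the dense vector is 0, which is why storing only the non-zero entries saves memory.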

@siddhanthardikar2468 1 year ago

Hello sir, I had one doubt. Are fit_transform and transform the same? I ask because to transform X_train you used v.fit_transform(X_train.values) and for X_test you used v.transform(X_test). I hope you can clear this doubt for me. Thank you.

@siddharthrox 1 year ago

I don't know if you've figured this out by now or not. I'll share my understanding anyway. fit_transform will try to learn the vocabulary from the training data. After that it will create a matrix representation based on what it learned. But in transform, it is assumed that the learning has already happened and only a matrix representation needs to be generated. That is why you see that fit_transform is used with training data and transform is used with test data.

@datahead_girl 4 months ago

@@siddharthrox Yes, I think that transform and not fit_transform, i.e. v.transform(X_test), must be used with the test data. fit_transform: the transformer learns the necessary parameters from the training data. transform: ensures that the test data are transformed in the same way as the training set, without altering the learned parameters. Correct me if I'm wrong, but I think this is the whole point of using fit_transform with the training set and transform with the test set.
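
The distinction discussed in this thread can be sketched with a tiny stand-in class. `TinyCountVectorizer` is hypothetical and heavily simplified; it is not scikit-learn's implementation, and it only mirrors the `fit` / `transform` / `fit_transform` method names to show which step learns the vocabulary and which step merely reuses it. The sample messages are invented.

```python
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, documents):
        # Learning step: build the vocabulary from the given documents only.
        tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, documents):
        # No learning here: count words against the already learned vocabulary.
        rows = []
        for doc in documents:
            counts = Counter(doc.lower().split())
            rows.append([counts.get(tok, 0) for tok in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        # Convenience: learn the vocabulary, then vectorize the same documents.
        return self.fit(documents).transform(documents)

X_train = ["free prize inside", "lunch at noon"]  # made-up training messages
X_test = ["free lunch tomorrow"]                  # made-up test message

v = TinyCountVectorizer()
train_matrix = v.fit_transform(X_train)  # vocabulary is learned here
test_matrix = v.transform(X_test)        # same columns; "tomorrow" is ignored

print(v.vocabulary_)
print(test_matrix)
```

Because the test matrix is built with the vocabulary learned from the training messages, both matrices have the same columns, which is what a model trained on `train_matrix` expects.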

@JakeThalacker 2 years ago

Is there a simple way to edit this to use bigrams instead of single words?

@roopeshn9394 2 years ago

Hi sir, if you can assist me with my issue in any way, please do. I have four or five columns of data in dict format with keys and values. I need to make a sentence or narrative from this data. Is it possible or not? If possible, please guide me, sir. Input: { "source": "Sanju", "type": "message.cloud.display.AUTO_SCALING", "value": "1" } Output: Sanju has value with this type of "message.cloud.display.AUTO_SCALING"

@hsekar6701 1 year ago

I'm unable to download the en_core_web_sm pipeline..! So could anyone please help me....!

@vivekjha9952 2 years ago

Hi Dhaval sir, I want to learn data science technology from you and be mentored too. Could you please provide guidance for an IT professional with 8 years of experience who wants to transition to data science? I am not able to figure out which institute to select.

@nemsingh6035 2 years ago

Follow codebasics

@bhaskarbsarkar5232 2 years ago

Doubt: when vectorizing, we are taking X_train. According to my understanding, the vectorization builds a vocabulary from the data given. So, is it better to take the whole X instead of X_train to build the vocab, and then split into train and test? There is a possibility that some words would be in the test data and not in the train data. And when I took X for vectorization, the vocab size increased. So, what is the correct method here?

@codebasics 2 years ago

Excellent question, Bhaskar. In our case, we had more than 4k samples in the training set, which probably covered the majority of the vocab in the test samples too. The right way would be to create a CountVectorizer and call .fit (instead of fit_transform) on the entire dataset. After that, just call .transform on the training set and later on the test set.

@malshininissanka4106 1 year ago

@@codebasics Shouldn't we treat the test data as unseen data? If we fit CountVectorizer on the entire dataset, might data leakage happen?
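
The trade-off debated in this thread can be seen on a toy example. This is only a sketch: the messages and the `vocabulary` helper are invented for illustration, and a plain whitespace split stands in for CountVectorizer's tokenizer.

```python
def vocabulary(documents):
    """Set of distinct lowercase tokens found in the documents."""
    return {tok for doc in documents for tok in doc.lower().split()}

# Hypothetical split (assumed data, not the spam.csv file).
X_train = ["claim your free prize", "see you at lunch"]
X_test = ["free lottery prize"]

train_vocab = vocabulary(X_train)          # what fitting on X_train alone learns
full_vocab = vocabulary(X_train + X_test)  # what fitting on the whole X learns
only_in_test = full_vocab - train_vocab    # columns that exist only because of test data

print(len(train_vocab), len(full_vocab), sorted(only_in_test))
```

Fitting on the whole dataset grows the vocabulary, as observed above, but the extra columns come entirely from test messages, which is the source of the leakage concern. Fitting on the training set alone keeps the test data unseen, at the cost of ignoring words such as "lottery" at prediction time.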

@gopalpawar7352 2 years ago

Sir, please create videos and a playlist for a full stack developer course...

@changeorbeextinct 2 years ago

If any email has Nigeria and prince, then it is authentic... NOT :) BTW, great videos.

@debarghabhattacharjee4000 2 years ago

Please provide the spam.csv file....

@bhaskarbsarkar5232 2 years ago

It's in the git repo itself.

@kinghezzy 1 year ago

I can't find it there.

@uptoolate1896 2 years ago

And that was the moment that ignoring his suggested prerequisites finally caught up with me.

@rinkisingh5529 1 year ago

Why does your wife use your account to watch CID? And you have mentioned this in at least 2 of your videos. Are you trying to cover something up? Someone call the CID to investigate :)

@ramandeepbains862 2 years ago

Solution for the bug "AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names_out'": use get_feature_names() instead of get_feature_names_out(). It's a version issue. Sample code: v.get_feature_names()[790:800]
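
Since `get_feature_names_out` was added in scikit-learn 1.0 and the older `get_feature_names` was later removed, a small helper can cover both cases. This is a sketch: the `feature_names` helper and the two stand-in classes are hypothetical and exist only so the snippet runs without scikit-learn installed; with a real fitted CountVectorizer `v`, you would call `feature_names(v)`.

```python
def feature_names(vectorizer):
    """Return feature names as a list, using whichever method this version offers."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn versions (1.0 and later).
        return list(vectorizer.get_feature_names_out())
    # Older versions that only provide get_feature_names().
    return list(vectorizer.get_feature_names())

# Hypothetical stand-ins that mimic the old and new method names.
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["free", "prize"]

class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["free", "prize"]

old_names = feature_names(OldStyleVectorizer())
new_names = feature_names(NewStyleVectorizer())
print(old_names, new_names)
```

Slicing works the same way on the returned list, for example `feature_names(v)[790:800]`.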