Text Representation Using Bag Of Words (BOW): NLP Tutorial For Beginners - S2 E3

Views: 48,264

1 day ago

Comments: 37

@codebasics 2 years ago

Check out our premium machine learning course with 2 Industry projects: codebasics.io/courses/machine-learning-for-data-science-beginners-to-advanced

@mohammadriyaz5586 6 months ago

Thank you so much, sir, for the easiest explanation.

@harshalbhoir8986 1 year ago

This was such a cool explanation. Thank you so much!!

@tomernx5 1 year ago

Great video! thanks!

@AmitKumar-BIDSP 1 year ago

Great presentation; Thank you

@ashishpanchal701 2 years ago

Hello Sir, thank you for being such a wonderful teacher!!! I had a doubt about the Naive Bayes model built in this video: where have we used stemming or lemmatization, which would make it an NLP problem? If we have not used them, isn't it just a simple Naive Bayes machine learning problem?

@codebasics 2 years ago

You do not have to do stemming etc. for it to count as an NLP problem. Here, as you can see, we got pretty good accuracy without stemming. So you can consider this both an NLP and a machine learning problem. In fact, ML is used to solve NLP problems, so NLP is at a higher level. You can definitely try stemming etc.; I would encourage you to build that out and see how the model performance improves.

@AdityaKumar-w2t1z 1 year ago

Great work! Keep it up.

@napoleanbonaparte9225 2 months ago

So sir, what we get from this video is that we can find the precision % for spam and the number of spam emails in the list. We can do it in two ways: 1. using train_test_split and MultinomialNB with a separate CountVectorizer step; 2. chaining CountVectorizer and the classifier with Pipeline(). So bag of words simply means a collection of the same words together, as we collected spam here.
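The two approaches described in this comment can be sketched roughly as follows. This is a minimal illustration, not the notebook from the video: the toy messages and labels are invented here (they are not spam.csv), and the step names "vectorizer" and "nb" are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy messages invented for illustration (not the spam.csv from the video)
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "are we meeting for lunch today",
    "see you at the office tomorrow",
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# Approach 1: vectorize and fit the classifier as two separate steps
v = CountVectorizer()
X_train_cv = v.fit_transform(train_texts)
model = MultinomialNB()
model.fit(X_train_cv, train_labels)

# Approach 2: chain the same two steps in a Pipeline
clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)

new_texts = ["free prize offer", "lunch at the office"]
manual_pred = model.predict(v.transform(new_texts))  # must transform first
pipeline_pred = clf.predict(new_texts)               # raw text goes straight in
```

With the pipeline, raw strings are passed to predict directly, so there is no way to forget the transform step on new emails.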

@nareshkumarvanga3127 3 months ago

Thank you Guruji

@vyaspadala8468 2 years ago

Sir, please provide us with a big data engineer and data science course.

@arvindatmuri5604 2 years ago

Have a look at the NumPy, pandas, Matplotlib, and machine learning courses. Data science is almost entirely covered by these topics.

@pineappleld 7 months ago

In this example I can see the goal was to classify each email as either ham or spam. Is it possible to build something similar that classifies into four options?
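On the four-class question: scikit-learn's MultinomialNB accepts any number of label values, so the same bag-of-words setup should carry over. The sketch below assumes four made-up categories and invented example messages; none of this comes from the video.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data with four hypothetical categories
texts = [
    "win a free prize now",
    "free cash offer click now",
    "your invoice is attached payment due",
    "payment received invoice closed",
    "team meeting moved to monday",
    "project meeting agenda for the team",
    "dinner with family this weekend",
    "family trip photos from the weekend",
]
labels = [
    "spam", "spam",
    "billing", "billing",
    "work", "work",
    "personal", "personal",
]

# Same bag-of-words + Naive Bayes chain; only the label set changes
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

pred = clf.predict(["invoice payment reminder"])
classes = list(clf.classes_)  # the four labels learned from the data
```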

@vivekjha9952 2 years ago

Hi Dhaval sir, I want to learn data science technology from you and be mentored too. Could you please provide guidance for an IT professional with 8 years of experience who wants to transition to data science? I am not able to figure out which institute to select.

@nemsingh6035 2 years ago

Follow codebasics

@pragtisood3239 1 year ago

Doubt: Where can we get the CSV file from?

@bhaskarbsarkar5232 2 years ago

Doubt: When vectorizing, we are using X_train. As I understand it, vectorization builds a vocabulary from the data it is given. So is it better to use the whole X instead of X_train to build the vocabulary, and split into train and test afterwards? There is a possibility that some words appear in the test data but not in the train data. When I used X for vectorization, the vocabulary size increased. So what is the correct method here?

@codebasics 2 years ago

Excellent question, Bhaskar. In our case, we had more than 4k samples in the training set, which probably covered the majority of the vocabulary in the test samples as well. The right way would be to create a CountVectorizer and call .fit (instead of fit_transform) on the entire dataset. After that, just call .transform on the training set and later on the test set.

@malshininissanka4106 1 year ago

@@codebasics Shouldn't we treat the test data as unseen data? If we fit CountVectorizer on the entire dataset, might data leakage happen?
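To make the trade-off in this thread concrete, here is a small sketch of what happens when the vocabulary is learned from the training texts only. The texts are invented for illustration. A word that appears only in the test text is simply dropped by transform, which is the cost of keeping the test data unseen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy texts; "lottery" never appears in the training set
train_texts = ["free prize now", "lunch at noon"]
test_texts = ["free lottery prize"]

v = CountVectorizer()
v.fit(train_texts)                # vocabulary comes from training data only
X_test = v.transform(test_texts)  # unseen words are silently ignored

train_vocab = sorted(v.vocabulary_)
unseen = [w for w in test_texts[0].split() if w not in v.vocabulary_]
counted = int(X_test.sum())       # only "free" and "prize" are counted
```

Fitting on the full dataset instead would add "lottery" to the vocabulary, but the vectorizer would then have seen the test data before evaluation.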

@swanandAragade 1 year ago

Even if we give a spam mail as input, it does not detect it as spam; it only shows ham for all mail.

@siddhanthardikar2468 11 months ago

Hello sir, I had one doubt. Are fit_transform and transform the same? I ask because to transform X_train you used v.fit_transform(X_train.values), and for X_test you used v.transform(X_test). I hope you can clear this doubt for me. Thank you.

@siddharthrox 10 months ago

I don't know if you've figured this out by now or not. I'll share my understanding anyway. fit_transform will try to learn the vocabulary from the training data. After that it will create a matrix representation based on what it learned. But in transform, it is assumed that the learning has already happened and only a matrix representation needs to be generated. That is why you see that fit_transform is used with training data and transform is used with test data.

@datahead_girl 1 month ago

@@siddharthrox Yes, I think transform, not fit_transform, must be used with the test data: v.transform(X_test). With fit_transform, the transformer learns the necessary parameters from the training data. With transform, the test data are transformed in the same way as the training set, without altering the learned parameters. Correct me if I'm wrong, but I think this is the whole point of using fit_transform with the training set and transform with the test set.
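The distinction discussed in this thread can be checked directly. The sketch below uses invented texts and compares fit_transform against a separate fit followed by transform, then confirms that the test matrix has the same columns as the training matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy texts for illustration
X_train = ["free prize now", "lunch at noon", "free lunch today"]
X_test = ["prize at noon"]

# fit_transform = learn the vocabulary, then build the count matrix
v = CountVectorizer()
X_train_cv = v.fit_transform(X_train)

# transform = reuse the learned vocabulary, no new learning
X_test_cv = v.transform(X_test)

# Doing fit and transform as two calls gives the same training matrix
v2 = CountVectorizer().fit(X_train)
two_step = v2.transform(X_train)
same = bool((X_train_cv.toarray() == two_step.toarray()).all())
```

Calling fit_transform on X_test instead would learn a new, different vocabulary, so the columns would no longer line up with what the model was trained on.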

@ayushgupta80 6 months ago

Bag of words: the size of the vector equals the size of the vocabulary (all elements are 0 except for the words present in the statement). Sparse representation: it may consume too much memory and compute resources.
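The point in this note can be seen on a tiny example. The two documents below are invented. CountVectorizer returns a SciPy sparse matrix, which stores only the non-zero counts, so the zeros are not held in memory unless the matrix is converted to a dense array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents for illustration
docs = ["free prize now", "lunch at noon today"]

v = CountVectorizer()
X = v.fit_transform(docs)       # SciPy sparse matrix of word counts

vocab_size = len(v.vocabulary_)
n_rows, n_cols = X.shape        # one column per vocabulary word
stored_values = X.nnz           # non-zero entries actually stored
dense = X.toarray()             # dense form materializes every zero
zeros_in_dense = int((dense == 0).sum())
```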

@roopeshn9394 2 years ago

Hi sir, if you can help with my issue, please do. I have four or five columns of data in dict format with keys and values, and I need to generate a sentence or narrative from this data. Is that possible? If so, please guide me. Input: { "source": "Sanju", "type": "message.cloud.display.AUTO_SCALING", "value": "1" } Output: Sanju has value with this type of "message.cloud.display.AUTO_SCALING"
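For a fixed sentence shape like the one in this question, plain string formatting may be enough. The helper name record_to_sentence and the exact template wording below are assumptions made for illustration; they are not from the video.

```python
def record_to_sentence(record):
    """Hypothetical helper: fill a fixed sentence template from one record."""
    template = '{source} has value {value} with type "{type}"'
    return template.format(**record)


record = {
    "source": "Sanju",
    "type": "message.cloud.display.AUTO_SCALING",
    "value": "1",
}
sentence = record_to_sentence(record)
```

For free-form narrative rather than a fixed template, a text generation model would be needed, which is outside the scope of this bag-of-words tutorial.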

@gopalpawar7352 2 years ago

Sir, please create videos and playlists for a full stack developer course ...

@JakeThalacker 1 year ago

Is there a simple way to edit this to use bigrams instead of single words?
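One likely answer to the bigram question is the ngram_range parameter of CountVectorizer. A minimal sketch with an invented example sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now"]  # invented example text

# (2, 2) keeps only bigrams
bigram_only = CountVectorizer(ngram_range=(2, 2)).fit(docs)
bigram_features = sorted(bigram_only.vocabulary_)

# (1, 2) keeps single words and bigrams together
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)
mixed_features = sorted(uni_and_bi.vocabulary_)
```

The rest of the training code should stay the same, since the vectorizer still produces a count matrix; only the vocabulary grows.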

@surinder3677 6 months ago

Notes: 16:40 CountVectorizer

@hsekar6701 1 year ago

I'm unable to download the en_core_web_sm pipeline! Could anyone please help me?

@changeorbeextinct 1 year ago

If any email has Nigeria and prince, then it is authentic... NOT :) BTW, great videos.

@uptoolate1896 2 years ago

And that was the moment that ignoring his suggested prerequisites finally caught up with me.

@debarghabhattacharjee4000 2 years ago

Please provide the spam.csv file....

@bhaskarbsarkar5232 2 years ago

It's in the git repo itself.

@kinghezzy 1 year ago

I can't find it there.

@rinkisingh5529 1 year ago

Why does your wife use your account to watch CID? And you have mentioned this in at least 2 of your videos, are you trying to cover something up.. Someone call the CID to investigate :)

@ramandeepbains862 2 years ago

Fix for the error AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names_out': use get_feature_names() instead of get_feature_names_out(). It is a version issue: scikit-learn releases before 1.0 only have get_feature_names(). Sample code: v.get_feature_names()[790:800]
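Since get_feature_names_out() was added in scikit-learn 1.0 and the older get_feature_names() was later removed, code that has to run on both old and new installs can check which method exists. A small sketch, with an invented one-line corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

v = CountVectorizer()
v.fit(["free prize now"])  # invented example text

# Pick whichever accessor this scikit-learn version provides
if hasattr(v, "get_feature_names_out"):
    names = [str(n) for n in v.get_feature_names_out()]  # newer releases
else:
    names = list(v.get_feature_names())                  # older releases
```

Upgrading scikit-learn is the other option, in which case the code from the video should work unchanged.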