Feature Extraction from Text (USING PYTHON)

  77,807 views

Machine Learning TV


Hi. In this lecture we will transform tokens into features, and the best way to do that is Bag of Words. Let's count the occurrences of a particular token in our text. The motivation is the following: we are looking for marker words like "excellent" or "disappointed", and we want to detect those words and make decisions based on the absence or presence of a particular word. Here is how it might work. Let's take an example of three reviews: "good movie", "not a good movie", "did not like". Let's take all the possible words or tokens that we have in our documents, and for each such token let's introduce a new feature, or column, that corresponds to that particular word. That gives us a pretty huge matrix of numbers, and we translate each text into a vector, a row in that matrix. So, let's take for example the "good movie" review. We have the word "good", which is present in our text, so we put a one in the column that corresponds to that word. Then comes the word "movie", and we put a one in the second column to show that this word is also seen in our text. We don't have any other words, so all the rest are zeroes. The result is a really long vector which is sparse, in the sense that it has a lot of zeroes. For "not a good movie", it will have four ones, and all the rest zeroes, and so forth. This process is called text vectorization, because we replace the text with a huge vector of numbers, and each dimension of that vector corresponds to a certain token in our database. You can see that this representation has some problems. The first one is that we lose word order: we can shuffle the words, and the representation on the right will stay the same. That is why it is called a bag of words: it's a bag, so the words are not ordered and can come up in any order. A different problem is that the counters are not normalized. Let's solve these two problems, starting with preserving some ordering. So how can we do that?
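To make the counting described above concrete, here is a minimal pure-Python sketch using the three example reviews; the function and variable names are just for illustration, and in practice a library class such as scikit-learn's CountVectorizer does the same job:

```python
# Bag-of-words sketch for the three example reviews.
reviews = ["good movie", "not a good movie", "did not like"]

# Build the vocabulary: one feature (column) per distinct token.
vocab = sorted({tok for r in reviews for tok in r.split()})

def vectorize(text):
    # Count how many times each vocabulary token occurs in the text.
    counts = {}
    for tok in text.split():
        counts[tok] = counts.get(tok, 0) + 1
    return [counts.get(tok, 0) for tok in vocab]

vectors = [vectorize(r) for r in reviews]
print(vocab)       # ['a', 'did', 'good', 'like', 'movie', 'not']
print(vectors[0])  # "good movie" -> [0, 0, 1, 0, 1, 0]
print(vectors[1])  # "not a good movie" -> four ones, the rest zeroes
```

Note how sparse the vectors already are with a six-word vocabulary; with a realistic vocabulary of tens of thousands of words, almost every entry is zero.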
You can easily come to the idea of looking at token pairs, triplets, or other combinations. This approach is also called extracting n-grams: a 1-gram stands for a single token, a 2-gram stands for a token pair, and so forth. So let's look at how it might work. We have the same three reviews, and now we don't only have columns that correspond to tokens, we also have columns that correspond to, let's say, token pairs. Our "good movie" review now translates into a vector which has a one in the column corresponding to the token pair "good movie", one for "movie", one for "good", and so forth. This way, we preserve some local word order, and we hope that this will help us analyze the text better. The problems are obvious though: this representation can have too many features. Let's say you have 100,000 words in your database; if you take pairs of those words, you can come up with a huge number of features, one that grows exponentially with the number of consecutive words you want to analyze. That is a problem, and to overcome it we can remove some n-grams from our features based on their occurrence frequency in the documents of our corpus. For both high-frequency and low-frequency n-grams, we can show why we don't need them. High-frequency n-grams are seen in almost all of the documents; for English those would be articles, prepositions, and the like, which are there for grammatical structure and don't carry much meaning. These are called stop-words; they won't help us discriminate between texts, and we can pretty easily remove them. Low-frequency n-grams are a different story: if you look at them, you find typos, because people type with mistakes, and rare n-grams that are not seen in any other reviews.
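The earlier sketch extends naturally to n-grams. Below is an illustrative helper that adds bigram columns alongside the unigram ones, as described above (in scikit-learn's CountVectorizer this corresponds to passing `ngram_range=(1, 2)`); the names are hypothetical:

```python
# Sketch: extend the feature set with n-grams (here unigrams + bigrams).
def ngrams(tokens, n):
    # All consecutive runs of n tokens, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reviews = ["good movie", "not a good movie", "did not like"]

# One feature per distinct unigram or bigram seen in the corpus.
features = sorted({g for r in reviews for n in (1, 2) for g in ngrams(r.split(), n)})

def vectorize(text):
    toks = text.split()
    grams = ngrams(toks, 1) + ngrams(toks, 2)
    return [grams.count(f) for f in features]

print(features)              # single tokens plus pairs like 'good movie', 'not a', ...
print(vectorize("good movie"))  # ones for 'good', 'movie', and 'good movie'
```

Even on this toy corpus the feature count nearly doubles, which illustrates the blow-up the lecture warns about.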
And both of them are bad for our model. If we don't remove these tokens, we will very likely overfit, because such a token would look like a very good feature to our future classifier: it can see that a review contains a typo, we had only, say, two reviews with that typo, and from those it is pretty clear whether the review is positive or negative. So the classifier learns dependencies that are not actually there, and we don't really need them. That leaves the medium-frequency n-grams, and those are really good n-grams, because they are not stop-words and not typos; they are the ones we actually want to look at. The problem is that there are a lot of medium-frequency n-grams. It proved useful to look at n-gram frequency in our corpus for filtering out bad n-grams. What if we could use the same frequency to rank the medium-frequency n-grams?
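The frequency-based filtering can be sketched as follows: compute the document frequency of each token (how many documents it appears in) and keep only the "medium frequency" ones, dropping tokens seen in almost every document (stop-word-like) and tokens seen in only one (typos, rare words). The corpus and thresholds here are illustrative; scikit-learn's vectorizers expose them as the `max_df` and `min_df` parameters:

```python
from collections import Counter

# A tiny illustrative corpus.
docs = ["a good movie", "not a good movie", "did not like a movie at all"]

# Document frequency: count each token at most once per document.
df = Counter()
for d in docs:
    for tok in set(d.split()):
        df[tok] += 1

n_docs = len(docs)
# Keep "medium frequency" tokens: drop those that occur in every
# document and those that occur in only one.
kept = sorted(t for t, c in df.items() if 1 < c < n_docs)
print(kept)  # 'a' and 'movie' are too frequent; 'did', 'like', ... too rare
```

On real data the cutoffs would be tuned (for example, drop tokens in more than 90% of documents or in fewer than 5), rather than the extreme bounds used here.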

Comments: 32
@ijeffking · 4 years ago
Very very well explained. Thank you so much.
@akshaysoni6877 · 5 years ago
amazing clarity in your explanation.
@elenabarrows182 · 4 years ago
very clearly explained, thanks!
@aghileslounis · 4 years ago
best explanation ever and very calm
@desolate_tunes__ · 2 years ago
So easy to understand. Great explanation and a nice demonstration.
@aiinabox1260 · 3 years ago
The sequence of topics covered, especially the order, is very good. It helps us understand, for example, what BoW is and its demerits; the next topic then talks about the solution... great job.
@sau002 · 4 years ago
Nice one
@fardaddanish8113 · 4 years ago
I appreciate your beautiful work! Please make a video on gender text classification using Python, e.g. given a text, the machine should tell whether the writer is male or female. I would be very thankful to you for this act of kindness.
@Jay-su5yo · 2 years ago
Very nice
@shahi_gautam · 5 years ago
Can you give an example where you have passed BOW with n-gram features? Take n=2, will it create an extra number of columns for the n-grams?
@zakiraza7399 · 4 years ago
When I don't pass any parameters to TfIdfVectorizer(), the tf-idf result obtained is different from what is achieved using the formula. Can you explain why?
@santoshbehera6994 · 5 years ago
Can you please upload the continuation of this lecture? Thank you.
@MachineLearningTV · 5 years ago
We have already uploaded the next lecture. You may watch it here: kzbin.info/www/bejne/nl63oZWVl9GqmJI
@shivomsharma8663 · 5 years ago
Please make a video where we set the zero and one values for the "movie" or "good movie" labels.
@martand_05 · 4 years ago
At 11 minutes 15 sec, the video is hilarious.
@DanielWeikert · 6 years ago
nice. Can you upload the continuation?
@MachineLearningTV · 6 years ago
We will upload the rest of this course within the next few days, so stay tuned! So far, you can take a look at the previous lecture of this course here: kzbin.info/www/bejne/pKnLdKybh7dqa7M If you found this video interesting, please like it, because that helps us! Thanks
@DanielWeikert · 6 years ago
Sure thing. Thanks.
@hirutabera1947 · 4 years ago
What are some easy topics that can be done on a local language using NLP?
@nuraisyah9509 · 3 years ago
Does anyone know how to extract the word-spacing feature from a handwriting image? Really need help here.
@aiinabox1260 · 3 years ago
BoW is the frequency of words, isn't it?
@rajuvarma8191 · 4 years ago
Please provide the code in the description, so we can practice.
@MrSchlechtes · 4 years ago
To be honest... I didn't get the example with the "term frequency" weighting scheme. First I count the frequency of my token, and then I divide it by what? The sum of all frequencies of all tokens? 😓😭 Divide tf by what? Please help! Explain it, please!
@sharif47 · 4 years ago
If I am not mistaken, this is how it goes: term frequency for a word (or phrase) is document specific. So, for word B in document C, the term frequency is the number of times B appears in C, divided by the number of words in C.
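The formula described in this reply can be sketched in a few lines (the function name is illustrative):

```python
# tf(word, doc) = count of word in doc / total number of words in doc.
def term_frequency(word, doc):
    tokens = doc.split()
    return tokens.count(word) / len(tokens)

print(term_frequency("good", "good movie good"))  # 2/3
```

Note that library implementations often apply extra steps on top of this raw count (smoothing, normalization), which is one reason results from e.g. scikit-learn's TfidfVectorizer with default parameters can differ from the hand-computed formula.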
@dileswarsahu2498 · 3 years ago
Everything is fine, but one thing is lacking, i.e. the document file. Can you please provide the document file?
@rohansalvi2927 · 4 years ago
You look like an animated cartoon.
@BCSYASHWANTHSAI · 3 years ago
Could you share this PPT?
@ledestonilo7274 · 3 years ago
Too many ads?
@MachineLearningTV · 3 years ago
Dear Led, KZbin puts the ads there, not us. We ourselves do not like these ads.
@charlieangkor8649 · 4 years ago
ad=dislike