NLP - Linear Models for Text Sentiment Analysis

Machine Learning TV

In this video, we will talk about the first text classification model, built on top of the features that we have described.
Let's continue with sentiment classification. We can take the IMDB movie reviews dataset, which is freely available to download. It contains 25,000 positive and 25,000 negative reviews. How did that dataset appear? If you look at the IMDB website, you can see that people write reviews there and also provide a number of stars, from one star to ten stars. They rate the movie and write the review. If you take all those reviews from the IMDB website, you can use them as a dataset for text classification, because you have a text and a number of stars, and you can think of the stars as sentiment. If a review has at least seven stars, you can label it as positive sentiment. If it has at most four stars, that means the movie was bad for that particular person, and that is negative sentiment. That's how you get a dataset for sentiment classification for free. It contains at most 30 reviews per movie, to make it less biased toward any particular movie.
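The star-to-sentiment rule described above can be sketched in a few lines of Python. The thresholds (at least seven stars, at most four stars) come from the talk. The function name `label_from_stars`, the toy reviews, and the choice to leave out 5 and 6 star reviews are assumptions made for illustration.

```python
from typing import Optional


def label_from_stars(stars: int) -> Optional[str]:
    """Map a 1-10 star rating to a sentiment label (thresholds from the talk)."""
    if not 1 <= stars <= 10:
        raise ValueError("stars must be between 1 and 10")
    if stars >= 7:
        return "positive"
    if stars <= 4:
        return "negative"
    # Assumption: 5-6 star reviews are treated as neutral and left out.
    return None


# Hypothetical (text, stars) pairs, not taken from the real dataset.
reviews = [("Loved every minute", 9), ("It was fine", 5), ("Awful, disgusting", 2)]

# Keep only the reviews that receive a clear label.
labeled = [
    (text, label_from_stars(stars))
    for text, stars in reviews
    if label_from_stars(stars) is not None
]
```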
This dataset also provides a 50/50 train/test split, so that future researchers can use the same split, reproduce earlier results, and improve on the model. For evaluation, you can use accuracy, because we have the same number of positive and negative reviews. Our dataset is balanced in terms of class sizes, so accuracy is a meaningful metric here.
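To see why balance matters for accuracy, here is a minimal sketch. The `accuracy` helper and the toy labels are assumptions for illustration. On a balanced set, a classifier that always answers "positive" scores only 0.5, so any accuracy well above that reflects something the model learned.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy balanced labels: 1 = positive, 0 = negative.
y_true = [1] * 5 + [0] * 5

# A trivial baseline that ignores the text and always predicts "positive".
always_positive = [1] * len(y_true)
baseline = accuracy(y_true, always_positive)
```

With imbalanced classes the same trivial baseline would score higher than 0.5, which is why accuracy alone is less informative in that setting.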
Okay, so let's start with the first model. For features, let's take a bag of 1-grams with TF-IDF values. As a result, we will have a feature matrix with 25,000 rows and 75,000 columns, which is pretty huge. What is more, it is extremely sparse: if you look at how many zeros there are, you will see that 99.8% of all values in that matrix are zeros. That imposes some restrictions on the models that we can use on top of these features.
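As a rough illustration of where the sparsity comes from, the sketch below builds a tiny bag of 1-grams with TF-IDF values in plain Python and counts the zeros. The three documents and the exact TF-IDF formula (term frequency times log(N / document frequency)) are assumptions. Libraries such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant by default, so their numbers would differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real dataset has 25,000 training reviews.
docs = [
    "a great movie with a great cast",
    "an awful movie",
    "great fun",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each token appears.
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))


def tfidf_row(tokens):
    """One dense row of TF-IDF values, one column per vocabulary entry."""
    counts = Counter(tokens)
    return [
        (counts[tok] / len(tokens)) * math.log(n_docs / doc_freq[tok])
        for tok in vocab
    ]


matrix = [tfidf_row(doc) for doc in tokenized]

# Each review uses only a few vocabulary words, so most entries are zero.
zeros = sum(value == 0.0 for row in matrix for value in row)
sparsity = zeros / (len(matrix) * len(vocab))
```

Even in this toy corpus more than half of the entries are zero. With tens of thousands of columns and short reviews, the fraction gets much closer to 1, which is why such a matrix is normally stored in a sparse format rather than as dense lists.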
A model that is usable for these features is logistic regression, which works as follows. It tries to predict the probability of a review being positive given the features that we gave the model for that particular review. The features that we use, let me remind you, are the vector of TF-IDF values. What you can do is find a weight for every feature of that bag-of-words representation. You multiply each TF-IDF value by its weight, sum all of those products, and pass the sum through a sigmoid activation function. That's how you get the logistic regression model.
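The prediction step just described (weighted sum of TF-IDF values, then a sigmoid) can be sketched as follows. The weights, the bias term, and the feature values are hypothetical numbers chosen for illustration, not learned from data. Features are stored as a dictionary of nonzero entries only, which is one simple way to exploit sparsity.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_positive_probability(features, weights, bias=0.0):
    """P(review is positive) for a sparse feature dict {token: tfidf_value}."""
    # Only nonzero features are stored, so the sum skips all the zeros.
    z = bias + sum(weights.get(tok, 0.0) * val for tok, val in features.items())
    return sigmoid(z)


# Hypothetical weights: positive for "great", negative for "awful".
weights = {"great": 2.0, "awful": -3.0}

# Hypothetical TF-IDF values for one review; "movie" has no weight here.
review = {"great": 0.5, "movie": 0.1}
p_positive = predict_positive_probability(review, weights)
```

Because the sum only touches the nonzero entries, the cost of one prediction depends on the length of the review, not on the 75,000 columns of the full feature matrix.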
It's a linear classification model, and what's good about that is that, since it's linear, it can handle sparse data. It's really fast to train and, what's more, the weights that we get after training can be interpreted.
Let's look at the sigmoid graph at the bottom of the slide. If the linear combination is close to 0, the sigmoid outputs about 0.5, so the probability of the review being positive is 0.5 and we really don't know whether it's positive or negative. But as that linear combination in the argument of the sigmoid becomes more and more positive, moving further away from zero, the probability of the review being positive grows really fast. That means that if we take the features whose weights are positive, those weights will likely correspond to words that are positive. And if you take negative weights, they will correspond to words that are negative, like disgusting or awful.
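To make the link between weight signs and word sentiment concrete, here is a minimal sketch that fits logistic regression with plain batch gradient descent on a toy dataset and then sorts the words by weight. The four documents, the binary features (used instead of TF-IDF for brevity), the learning rate, and the number of steps are all assumptions. In practice one would normally use a library implementation such as scikit-learn's `LogisticRegression`.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Toy bag-of-words features; columns follow the order of `vocab`.
vocab = ["great", "awful", "movie"]
X = [
    [1, 0, 1],  # "great movie" -> positive
    [1, 0, 0],  # "great"       -> positive
    [0, 1, 1],  # "awful movie" -> negative
    [0, 1, 0],  # "awful"       -> negative
]
y = [1, 1, 0, 0]

weights = [0.0] * len(vocab)
bias = 0.0
learning_rate = 0.5

# Batch gradient descent on the average log loss.
for _ in range(500):
    grad_w = [0.0] * len(vocab)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(bias + sum(w * x for w, x in zip(weights, xi)))
        error = p - yi  # derivative of the log loss w.r.t. the linear score
        for j, x in enumerate(xi):
            grad_w[j] += error * x
        grad_b += error
    weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / len(X)

# Most negative weight first, most positive weight last.
ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1])
```

In this toy run the word that appears only in negative examples ends up with a negative weight, the word that appears only in positive examples gets a positive weight, and the word that appears equally in both stays close to zero. Nothing forces the signs by hand; they fall out of minimizing the loss.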

Comments: 7
@saivineeshsuryadevara3006 5 years ago
How Can we get the weights negative for the Negative words ?
@arjunpukale3310 4 years ago
Is there option for stemming in sklearn ?
@PradanaFiqih 3 years ago
But where is the code? :(
@ZEANUWOE 2 years ago
How to scrape data twitter by user location? Is there any one who can suggest any way or site to find scrapping twitter data by user location?
@nischalpokharel1910 4 years ago
can you provide code for traning moleds
@tonyennis1787 2 years ago
When I see that the best we can do is 92% accuracy, what I read is that 8% of the population can't write a clear and concise sentence.