Wow, I never thought of Naive Bayes as an online classifier. Dear Professor, you rock!
@hamedalipour1012 · 5 years ago
I don't have the words to thank you sir
@ahindrakandarpa7910 · 3 years ago
Love your passion for the subject sir. I absolutely enjoy it when you teach.
@rbhambriiit · 2 years ago
great teaching style!
@nilslorentzon863 · 5 years ago
I love your lectures! Is there any way to get the Matlab code and name files, or maybe post them in the description? I would really appreciate it!
@doyourealise · 3 years ago
amazing and funny :)
@abhisheksingla2260 · 4 years ago
Hahaha. Great classification at 40:06
@StevenSarasin · 1 year ago
Another awesome video! But the superscript for the product around the 16-minute mark is written as m when it should be d: there are d words to take the product over, while m is the number of words in the email.
@prince839 · 4 years ago
Can you please tell me what would happen in sentiment analysis with a naive Bayes classifier when the test set contains a word that never occurred in the training set?
@kilianweinberger698 · 4 years ago
Well, typically you just drop it in that case.
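For context, a minimal sketch of what "just drop it" looks like at prediction time (hypothetical numbers and variable names; the real model would estimate these from training counts with smoothing):

```python
import math

# Hypothetical smoothed word probabilities over a shared training vocabulary.
vocab = {"free", "viagra", "meeting"}
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_theta = {
    "spam": {"free": math.log(0.4), "viagra": math.log(0.5), "meeting": math.log(0.1)},
    "ham":  {"free": math.log(0.2), "viagra": math.log(0.1), "meeting": math.log(0.7)},
}

def predict(words):
    # Words never seen during training are simply dropped from the product.
    words = [w for w in words if w in vocab]
    return max(log_prior,
               key=lambda c: log_prior[c] + sum(log_theta[c][w] for w in words))

print(predict(["free", "viagra", "blockchain"]))  # "blockchain" is ignored -> spam
```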
@abunapha · 5 years ago
Starts at 1:20
@itachi4alltime · 5 years ago
11:59: shouldn't it be "d" instead of "m"?
@kilianweinberger698 · 5 years ago
Yes! You are right, alpha ranges from 1 .. d
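For reference, the corrected product (as I understand the lecture's notation; a sketch that drops the multinomial coefficient, which doesn't affect the classification):

```latex
P(\mathbf{x} \mid y = c) \;\propto\; \prod_{\alpha=1}^{d} \theta_{\alpha c}^{\,x_\alpha},
\qquad m = \sum_{\alpha=1}^{d} x_\alpha,
```

where d is the vocabulary size, x_α the count of word α in the email, and m the total number of words in the email.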
@KimAnh-de7xb · 1 year ago
Thank you professor for your awesome lecture. I have a question regarding the Gaussian naive Bayes classifier: can we use it when the distribution is not approximated well by a Gaussian distribution? Wish you the best!
@travohuy · 2 years ago
Dear Prof. Weinberger, I have a small question. To derive the formula for \hat{\theta}_{\alpha c} at 15:42, it seems that we implicitly assume emails are independent of each other. Hence, implicitly, we construct one bag of words for all spam emails and one bag of words for all non-spam emails (not one bag per email), and then estimate theta from there. In other words, there are two assumptions: the order of words in each spam email does not matter, and how the words are distributed across the spam emails does not matter (i.e. a word can appear many times in one email and not at all in others, or equally often in each email). Does my thinking make sense? Thanks.
@xiaoweidu4667 · 3 years ago
Algorithms are only about assumptions, assumptions, and damn assumptions. Assumptions are philosophical in nature.
@user-kf9tp2qv9j · 3 years ago
Hi Kilian, for the cases at the end of the video where NB doesn't classify perfectly: what would happen if both x and o were drawn from multivariate Gaussian distributions whose two dimensions are correlated? That actually breaks the independence assumption; would it still work? Best wishes, and waiting for your opinion.
@kilianweinberger698 · 3 years ago
Yes, it would. However, NB works surprisingly well even if this assumption is broken. In practice it might just over-weight those correlated features. As an example, assume I classify news articles by topic. Given the topic/label is "politics" the words "President" and "Biden" are both likely, but they are not independent. NB would assume they are conditionally independent and say that given that both words are present, it is now extremely likely that the label is "politics". However in practice, given that you have "President" in your text, the word "Biden" won't add much information, because they are so correlated, so NB is over-confident.
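The over-confidence effect can be reproduced numerically: if a perfectly correlated copy of a feature is treated as independent, the posterior is pushed toward the extreme (illustrative numbers, not from the lecture):

```python
def posterior(likelihoods, prior=0.5):
    """Naive-Bayes posterior for class "politics" vs "other",
    treating every likelihood factor as independent."""
    p, q = prior, 1 - prior
    for l_pol, l_other in likelihoods:
        p *= l_pol
        q *= l_other
    return p / (p + q)

# Suppose P(word | politics) = 0.8 and P(word | other) = 0.2.
one_word = posterior([(0.8, 0.2)])               # 0.8
# A perfectly correlated duplicate ("President" next to "Biden") adds no
# real information, but NB multiplies it in again anyway:
two_words = posterior([(0.8, 0.2), (0.8, 0.2)])  # ~0.94, over-confident
print(one_word, two_words)
```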
@user-kf9tp2qv9j · 3 years ago
@@kilianweinberger698 wow, it’s amazing, thanks for your advice
@yizhiwang9632 · 3 years ago
Hi professor, I have a question regarding Gaussian Naive Bayes. In the lecture you use the pdf of an estimated Gaussian distribution to represent P(X | y = c). Isn't the probability that X takes any particular value x equal to 0 if X follows a continuous distribution? Are we using the pdf here simply because we just want to know what P(X | y = c) is proportional to?
@kilianweinberger698 · 3 years ago
Yes, sorry, I was probably a little sloppy there. For discrete distributions, Maximum likelihood maximizes the probability of the data. For continuous distributions, it maximizes the density of the data (the probability is always 0). The math is exactly the same, but the terminology has to change ...
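Concretely, the quantities involved can be sketched as follows (consistent with the reply above; for a single continuous feature, with n_c examples in class c):

```latex
\hat{\mu}_c = \frac{1}{n_c} \sum_{i : y_i = c} x_i, \qquad
\hat{\sigma}_c^2 = \frac{1}{n_c} \sum_{i : y_i = c} (x_i - \hat{\mu}_c)^2, \qquad
p(x \mid y = c) = \frac{1}{\sqrt{2\pi \hat{\sigma}_c^2}}
    \exp\!\left(-\frac{(x - \hat{\mu}_c)^2}{2 \hat{\sigma}_c^2}\right).
```

At test time only the relative sizes of these densities across classes matter, so the fact that any point probability is zero causes no problem.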
@user-or7ji5hv8y · 3 years ago
I wonder if the notations can be made easier to follow.
@joonho0 · 4 years ago
If we use a hash function rather than a pre-defined dictionary, the smoothing constant 'd' would be infinite. How should we handle this in practice?
@BrunsterCoelho · 4 years ago
I guess you don't need the smoothing if you have the hash function (at least you don't need the smoothing for dealing with unseen words - you might still want it for data acquisition/count reasons).
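One common resolution (a sketch, not from the video): with the hashing trick, words are mapped into a fixed number of buckets B, so the effective dimensionality is B rather than unbounded, and +1 smoothing over B buckets remains well defined:

```python
import hashlib

B = 1024  # fixed number of hash buckets; plays the role of d

def bucket(word: str) -> int:
    # A stable hash, so train and test agree on the mapping.
    return int(hashlib.md5(word.encode()).hexdigest(), 16) % B

def smoothed_probs(words):
    counts = [1] * B  # +1 Laplace smoothing over B buckets, not over all possible strings
    for w in words:
        counts[bucket(w)] += 1
    total = sum(counts)
    return [c / total for c in counts]

probs = smoothed_probs(["free", "free", "meeting"])
print(sum(probs))  # a valid distribution over the B buckets
```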
@jiviteshsharma1021 · 4 years ago
Hi Kilian, for your projects, do the students build their own models and classifiers using the actual math and formulas, or are they allowed to use libraries with built-in models such as sklearn?
@kilianweinberger698 · 4 years ago
They are not allowed to use sklearn, just basic data structures / commands from numpy. My experience is that until you implement an algorithm yourself from the ground up, you don’t really understand it... (The assignments do guide them through it, though.)
@jiviteshsharma1021 · 4 years ago
@@kilianweinberger698 Thank you professor, I will be implementing them from scratch myself now :))
@jiviteshsharma1021 · 4 years ago
@@kilianweinberger698 One last thing professor, just to clarify: by the assignments you mean the homeworks, right? Thank you
@ayushmalik7093 · 2 years ago
Hi Professor, after removing stopwords, can we use the naive Bayes probabilities for result interpretation?
@kilianweinberger698 · 2 years ago
Yes, the NB probabilities give you a reasonable explanation why a classifier made a certain prediction. It can also reveal how NB over-emphasizes correlated features (because of its class conditional independence assumption that is not met). E.g. if it classifies a document as being about politics, it may reach this conclusion because the three words “President Joe Biden” are all in the document, and all three are predictive of politics - ignoring the fact that they are highly correlated and “Biden” is often surrounded by “President” even in non-political documents.
@esakkiponraj.e5224 · 3 years ago
Hello Kilian, could you let me know how I can use NB if my data contains both categorical & continuous features?
@kilianweinberger698 · 3 years ago
Yes, you just need to use different distributions to model these different dimensions.
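A hybrid model along those lines can be sketched like this (hypothetical numbers; a smoothed frequency estimate for the categorical dimension, a per-class Gaussian for the continuous one, and the per-dimension log-densities simply add up under the independence assumption):

```python
import math

def gauss_logpdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# Per class: P(category | c) for the discrete dimension,
# (mu, var) for the continuous dimension, and a class prior.
model = {
    "a": {"cat": {"red": 0.7, "blue": 0.3}, "mu": 0.0, "var": 1.0, "prior": 0.5},
    "b": {"cat": {"red": 0.2, "blue": 0.8}, "mu": 3.0, "var": 1.0, "prior": 0.5},
}

def log_score(c, color, value):
    m = model[c]
    # Independence: log P(y) + log P(color | y) + log p(value | y)
    return (math.log(m["prior"]) + math.log(m["cat"][color])
            + gauss_logpdf(value, m["mu"], m["var"]))

def predict(color, value):
    return max(model, key=lambda c: log_score(c, color, value))

print(predict("red", 0.2))   # -> "a"
print(predict("blue", 2.9))  # -> "b"
```

The "multiplication" from the SO answers is exactly the sum of log terms here: each dimension contributes its own (probability or density) factor to the same class-conditional product.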
@esakkiponraj.e5224 · 3 years ago
@@kilianweinberger698 Thanks for your reply. After modelling with different distributions, how can one combine them into a single model that can be used for prediction? I found some SO answers related to this question, and some suggested multiplying the predicted probability values from the different models. I can't understand this multiplication of the predicted probabilities. Could you please explain whether this is right, and if it is, what the real reason is?
@omalve9454 · 1 year ago
Greetings Professor! What features did you use for the gender classifier?
@jachawkvr · 4 years ago
How do we decide if naive bayes is a good choice given a dataset? The algorithm seems to work well even if the assumption does not hold, so testing the assumption doesn't really help us decide this.
@kilianweinberger698 · 4 years ago
Good question! If the naive Bayes assumption doesn't hold, the classifier typically still does well when the following condition holds: features indicative towards one class stay indicative towards that class, independent of all the other features given in the instance. Imagine a spam vs non-spam email classifier. Naive Bayes assumes that certain words are more likely given that the email is spam vs non-spam, i.e. these words are then indicative towards spam and their log(P(word|spam)) are positive (if class spam=+1). At the end, when we do the classification, we sum up the log probabilities log(P(word|y)) for each y (spam / not-spam). Because the naive Bayes assumption doesn't actually hold, these log-probabilities may be a little too high or a little too low, but it is unlikely that a spammy word is suddenly estimated as indicative towards not-spam. The classifier breaks when features interact in a way that makes them change which class they support, i.e. if word W together with word V is highly indicative towards spam, but word W without word V is highly indicative towards non-spam. You can actually test that for pairs of words empirically. For larger sets it quickly becomes infeasible to check. Hope this helps.
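The pairwise check mentioned at the end might be sketched empirically like this (a toy helper with made-up data, not the lecture's code): compare whether word W is associated with spam above the base rate both with and without word V present.

```python
def flips_class(emails, labels, w, v):
    """Empirically check whether word w supports different classes
    depending on whether word v is present (hypothetical helper)."""
    def spam_rate(keep):
        idx = [i for i, e in enumerate(emails) if keep(e)]
        return sum(labels[i] for i in idx) / len(idx) if idx else None

    with_v = spam_rate(lambda e: w in e and v in e)
    without_v = spam_rate(lambda e: w in e and v not in e)
    base = sum(labels) / len(labels)
    if with_v is None or without_v is None:
        return False  # not enough data to compare the two contexts
    # w "supports spam" in a context if spam there is more likely than the base rate.
    return (with_v > base) != (without_v > base)

emails = [{"cheap", "watch"}, {"cheap"}, {"watch"}, {"meeting"}]
labels = [1, 0, 0, 0]  # 1 = spam
print(flips_class(emails, labels, "cheap", "watch"))  # "cheap" flips with "watch"
```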
@jachawkvr · 4 years ago
Thank you so much for explaining this! This helps me understand the naive bayes algorithm a little better as well.
@anantbansal5901 · 5 months ago
@@kilianweinberger698 'these words are then indicative towards spam and their log(P(word|spam)) are positive (if class spam=+1)' - how can log(P(word|spam)) be positive, assuming it is a natural log and noting that P(.) is always less than or equal to 1?
@anantbansal5901 · 5 months ago
Oh, I guess I got the catch here: it should be log(P(word|spam)/P(word|not spam)), right?
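For the record, the quantity that can change sign is indeed the log-ratio (a sketch of the resolution, as I read the thread):

```latex
\log \frac{P(\text{word} \mid \text{spam})}{P(\text{word} \mid \text{not spam})} > 0
\iff P(\text{word} \mid \text{spam}) > P(\text{word} \mid \text{not spam}),
```

while each individual log P(word|spam) is always at most 0.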
@bharasiva96 · 4 years ago
At 32:00, shouldn't the summation go from 1 to n_c? I don't follow what n represents here.
@kilianweinberger698 · 4 years ago
Yes that should be n_c. Thanks for pointing it out!
@coffeenerd4932 · 1 year ago
@@kilianweinberger698 Isn't your original summation from 1..n correct, because the indicator function filters out all elements from other classes? If we go from 1..n_c, the index would no longer match up with the training examples...