Thank you for the video! In slide 5 it should be (1-Vj) instead of (1-Vk) right?
@JordanBoydGraber 4 years ago
Someone asked for a full derivation but deleted their comment, so here's a link: www.ncbi.nlm.nih.gov/pmc/articles/PMC6583910/
@YT-yt-yt-3 3 months ago
Do you update the distribution parameters (alpha and the base distribution) too as part of Gibbs sampling and reassignments?
@JordanBoydGraber 3 months ago
There are a couple of ways to do it, but I always liked using slice sampling: people.cs.umass.edu/~cxl/cs691bm/lec08.html The nice thing is that you can implement it so it works generally for (most) hyperparameters for Bayesian models.
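In case it helps, here is a rough Python sketch of that idea: a generic univariate slice sampler (stepping-out plus shrinkage, in the style of Neal 2003) applied to a made-up log-posterior for the concentration parameter alpha. The `log_post_alpha` function is only an illustrative stand-in (a Gamma(1,1) prior times the usual alpha^K Gamma(alpha)/Gamma(alpha+n) CRP term), not the code from the linked lecture.

```python
import math
import random

def slice_sample(x0, log_p, w=1.0, max_steps=50):
    """One slice-sampling update for a 1-D density proportional to exp(log_p(x))."""
    # 1. Draw the auxiliary "height" under the density at the current point.
    log_y = log_p(x0) + math.log(1.0 - random.random())
    # 2. Step out to find an interval [left, right] that contains the slice.
    left = x0 - w * random.random()
    right = left + w
    steps = max_steps
    while steps > 0 and log_p(left) > log_y:
        left -= w
        steps -= 1
    steps = max_steps
    while steps > 0 and log_p(right) > log_y:
        right += w
        steps -= 1
    # 3. Shrink the interval until a point inside the slice is accepted.
    while True:
        x1 = left + random.random() * (right - left)
        if log_p(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

def log_post_alpha(alpha, K=5, n=100):
    """Hypothetical log-posterior for alpha: Gamma(1,1) prior times the CRP term."""
    if alpha <= 0:
        return -math.inf
    return -alpha + K * math.log(alpha) + math.lgamma(alpha) - math.lgamma(alpha + n)

alpha = 1.0
for _ in range(100):
    alpha = slice_sample(alpha, log_post_alpha)
print("resampled alpha:", alpha)
```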
7 years ago
Hello, at 16:46 I'd like to know in what reading I can see the step between equation 6 and equation 7. Thanks
@JordanBoydGraber 7 years ago
This is replacing the general likelihood distribution with specifically a normal distribution (could be any base distribution).
@yiwendong7000 4 years ago
@@JordanBoydGraber Thanks for the hint! May I ask for the link to the full derivation? I tried the derivation but failed to integrate the normal distribution formula...
@JordanBoydGraber 4 years ago
@@yiwendong7000 This should be helpful! www.ncbi.nlm.nih.gov/pmc/articles/PMC6583910/
@Blaze098890 4 years ago
Not sure if the Gibbs sampler makes sense. To my understanding, Gibbs sampling gives samples from the joint distribution; from there we can marginalise over a single variable, divide the two out (the joint and the marginal without that variable), and arrive at the posterior for the variable we marginalised over. This leads me to believe that what you say at 14:20 is incorrect, but I might be wrong. Equation 4 to 5 also does not make sense to me, as there is no joint to apply the chain rule to.
@JordanBoydGraber 4 years ago
Eq 4 to 5 is breaking apart one conditional to two; there's an intermediate step of explicitly writing out the joint that I omitted. I'm not quite sure what you're referring to at 14:20. The individual Gibbs draws are not from the joint and I didn't give the proof. Radford Neal gives a good treatment: www.cs.toronto.edu/~radford/ftp/review.pdf
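If it helps make that concrete, here is a toy sketch (my own example, not something from the video): Gibbs sampling a bivariate normal with correlation 0.8. Every update only ever uses a full conditional, yet the collected (x, y) pairs behave like correlated draws from the joint, which you can check via the empirical correlation.

```python
import random

rho = 0.8          # correlation of the target bivariate normal
x, y = 0.0, 0.0    # arbitrary starting point
samples = []

for t in range(20000):
    # Full conditional of x given y: N(rho * y, 1 - rho^2)
    x = random.gauss(rho * y, (1 - rho ** 2) ** 0.5)
    # Full conditional of y given x: N(rho * x, 1 - rho^2)
    y = random.gauss(rho * x, (1 - rho ** 2) ** 0.5)
    if t > 1000:              # drop burn-in
        samples.append((x, y))

# The empirical correlation should be close to rho = 0.8.
n = len(samples)
mx = sum(s[0] for s in samples) / n
my = sum(s[1] for s in samples) / n
cov = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
vx = sum((s[0] - mx) ** 2 for s in samples) / n
vy = sum((s[1] - my) ** 2 for s in samples) / n
print("empirical correlation:", cov / (vx * vy) ** 0.5)
```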
@Blaze098890 4 years ago
@@JordanBoydGraber My misunderstanding may lie in how Gibbs sampling works. Say for p(x,y) we sequentially make the draws p(x|y) and p(y|x). Since the samples are correlated (we condition on what we previously sampled for the other variable), this can't be considered a sample from the true posterior (which may be what I misunderstand); however, it is a draw from the joint. So after a single iteration of p(x|y) and p(y|x) we have one sample from the joint p(x,y) rather than a sample from each conditional. Is it then implied in Gibbs sampling that although the samples are from the joint, one can also consider them samples from the posterior, since the conditional and the joint are proportional to one another?
@amandalevenberg841 8 years ago
THANK YOU FOR THIS VIDEO
@a.a3265 3 years ago
Thank you for the video. If I have a linear model and I want to find the prior distribution for the parameters via a Dirichlet process mixture, what form will the DPM prior distribution take?
@JordanBoydGraber 3 years ago
It's a little trickier, as you need to fit your linear model *given* the table assignments of the CRP. Once you've done that you need to compute the probability of a table assignment from the DP prior and the linear model posterior and multiply those two terms together to sample a new table assignment.
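Very roughly, one reassignment step could look like the sketch below. This is my own illustration with made-up helper names, not code from the video, and it cuts corners: a fully Bayesian version would use the regression's marginal likelihood rather than the plug-in least-squares fit and fixed noise variance used here.

```python
import math
import random
import numpy as np

def gaussian_loglik(y, mean, var=1.0):
    """Log density of y under N(mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (y - mean) ** 2 / (2 * var)

def reassign_point(i, X, y, tables, alpha, base_var=10.0):
    """Resample the table of point i given all other assignments.

    X: (n, d) numpy array of covariates, y: (n,) numpy array of responses,
    tables: dict mapping integer table id -> list of point indices.
    """
    log_scores, candidates = [], []
    for t, members in tables.items():
        others = [j for j in members if j != i]
        if not others:            # removing point i would leave the table empty
            continue
        # CRP prior: proportional to the number of other customers at the table.
        log_prior = math.log(len(others))
        # Plug-in least-squares fit on the table's current members (an approximation).
        beta, *_ = np.linalg.lstsq(X[others], y[others], rcond=None)
        log_scores.append(log_prior + gaussian_loglik(y[i], X[i] @ beta))
        candidates.append(t)
    # New table: CRP prior alpha, likelihood under a broad base distribution.
    log_scores.append(math.log(alpha) + gaussian_loglik(y[i], 0.0, base_var))
    candidates.append(max(tables) + 1)
    # Normalize in log space and sample a table.
    m = max(log_scores)
    probs = [math.exp(s - m) for s in log_scores]
    r, acc = random.random() * sum(probs), 0.0
    for t, p in zip(candidates, probs):
        acc += p
        if r <= acc:
            return t
    return candidates[-1]
```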
@Kaassap 1 year ago
This was very helpful tyvm!
@thegreatlazydazz 5 years ago
I might be mistaken, but in eq (3), (4), (5) you are not conditioning on the values \theta_i that the Gaussian is throwing up, you are just conditioning on the parameters of G. This is why you integrate against \theta in (6) and (7). I mean, if you write | \theta, this integration would not make sense.
@ahmedstatistics2838 2 years ago
Hi, is the base distribution the same as the data distribution?
@JordanBoydGraber 2 years ago
Almost certainly not, as the idea is to model the data distribution with the DP. So you typically choose a much simpler distribution (e.g., a Gaussian with wide variance) that describes how new clusters form.
@ahmedstatistics2838 2 years ago
@@JordanBoydGraber thank you so much
@Vb2489 7 years ago
What are the prerequisites to learn DPMM and Gibbs sampling? I have to learn this in one week; is that possible? Can someone please guide me?
@JordanBoydGraber 7 years ago
Look at "Gibbs Sampling for the Uninitiated" by Resnik and Hardisty.
@Jack-lg9mq 7 years ago
I think you were 5 months too late!
@pablotano352 9 years ago
Great explanation!!
@ahmedstatistics2838 2 years ago
How can I use the Dirichlet process in panel data models? Thank you in advance.
@amineounajim9818 2 years ago
Look up the hierarchical Dirichlet process.
@ahmedstatistics2838 2 years ago
@@amineounajim9818 Is the Dirichlet process considered an alternative within the Bayesian framework? I mean, with no need to find the maximum likelihood, is the Dirichlet process alone enough to give a Bayesian approach for nonparametric models?
@ejaz629 7 years ago
Thank you for the video. What I understand is that in a DPMM you move from parameters to observations (the generative model), and in inference you infer z_i (and then the cluster parameters) from the observations. Now suppose my dataset is composed of two features (assume normal), i.e. N(5,1) and N(10,1). Can you please explain what the base distribution would be (would it still be normal with zero mean and unit variance?), and in that case what the goal of inference would be, since our real dataset is different from data generated by the DPMM?
@ahmedstatistics2838 2 years ago
Many thanks
@jasontappan3565 2 years ago
I have absolutely no idea how you applied the chain rule at 15:30.
@jasontappan3565 2 years ago
It looks like there are a few independence assumptions that I am missing.
@jasontappan3565 2 years ago
It also looks like the = should rather be a proportional sign. Unless I am missing something completely.
@JordanBoydGraber 2 years ago
@@jasontappan3565 The full derivation is here: arxiv.org/pdf/1106.2697.pdf And you're right, because I don't have a Dirichlet normalizer, it should be proportional to.
@jasontappan3565 2 years ago
@@JordanBoydGraber Thank you very much, it makes perfect sense now. Thank you for the video. My wife is busy with her Master's and is doing her dissertation on topic modelling. These videos help a lot.
@ahmedalsaleh339 3 years ago
What does the base distribution mean, and how can I get it?
@JordanBoydGraber 3 years ago
You typically assume it in the model (e.g., a uniform multinomial distribution), but you could also assume it's the unigram distribution inferred from a corpus (e.g., count all the words and divide by the total number of words).
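As a toy sketch of that second option (the corpus here is made up, and the dictionary is just an illustrative representation of the unigram base distribution):

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat"]   # toy stand-in corpus
counts = Counter(w for doc in corpus for w in doc.split())
total = sum(counts.values())
# Unigram distribution: count each word and divide by the total number of words.
unigram = {w: c / total for w, c in counts.items()}
print(unigram)   # e.g. {'the': 0.333..., 'sat': 0.222..., ...}
```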
@tobias3112 9 years ago
Can you clarify the meaning of the notation N(x, mu, sd)? Later in the video, when you do the actual calculations, you end up with N(x, mu, sd) = (x_i - mu_k)^2 to calculate P(z_i | z_-i); that step was a bit confusing for me. Also, you say that if you form a new cluster you draw mu_0 from the prior; is the prior here just N(0, 1)?
@JordanBoydGraber 9 years ago
+Tobias Something Yes, there's a simplification step to assume unit variance that causes the variance to go away. Then for new clusters, we assume a standard normal base distribution.
@tobias3112 9 years ago
+Jordan Boyd-Graber Is the mean of each cluster drawn from a Normal distribution, or do you just initialize the mean as the value for each point?
@JordanBoydGraber 9 years ago
+Tobias Something The posterior predictive distribution is a normal distribution that includes the effect of the current points assigned to the cluster *plus* the prior distribution for each cluster's mean (in this case, zero).
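For concreteness, here is a small sketch of that predictive density under the assumptions used in the video (unit-variance clusters, standard normal N(0, 1) base distribution); the function and numbers are my own illustration rather than the video's code. With n points in a cluster, the posterior over the cluster mean is N(sum(x)/(n+1), 1/(n+1)), and the predictive adds the unit observation variance on top.

```python
import math

def posterior_predictive(x, members):
    """Predictive density of a new point x for a cluster containing `members`."""
    n = len(members)
    # Posterior over the cluster mean: N(sum(members)/(n+1), 1/(n+1)).
    # With no members (a brand-new cluster) this reduces to the prior N(0, 1).
    post_mean = sum(members) / (n + 1)
    post_var = 1.0 / (n + 1)
    # Predictive variance adds the unit observation noise to the posterior variance.
    pred_var = 1.0 + post_var
    return math.exp(-(x - post_mean) ** 2 / (2 * pred_var)) / math.sqrt(2 * math.pi * pred_var)

print(posterior_predictive(0.3, []))          # new cluster: N(0, 2) density at 0.3
print(posterior_predictive(0.3, [0.1, 0.5]))  # existing cluster with two points
```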
@jacobmoore8734 3 years ago
What I got: a major improvement over the EM algorithm, because you don't need to supply the number of clusters a priori as a hyperparameter; the DP figures it out.
@cuenta4384 6 years ago
Random question! I wonder why there aren't many people working with Expectation Propagation?
@JordanBoydGraber 5 years ago
I dunno. There are certainly some people (e.g., Jason Eisner). I think it's a little less friendly for stochastic gradient descent, which is more popular these days thanks to things like Pytorch and Tensorflow.
@Abood123441 7 years ago
Thank you for this video, it's really helpful. But I have questions: can we use the CRP instead of LDA to discover an optimal number of topics? What are the disadvantages of the CRP?
@JordanBoydGraber 7 years ago
The CRP is still sensitive to the Dirichlet process parameter, so in some ways you're selecting a topic number via that parameter. But it does find a good number of topics with respect to likelihood. Disadvantage: the CRP is much slower than LDA.
@Abood123441 7 years ago
Thank you for your quick response. Another question, please: I believe the CRP tends to assign a customer to a table with many customers rather than few. But if we use the CRP to cluster similar words together, don't you think this will produce clusters with irrelevant words? Thank you in advance.
@JordanBoydGraber 4 years ago
@@Abood123441 Yes, it usually does! You probably only want to look at the top words of a cluster.
@liclaclec 1 year ago
"The Delta is like the Indicator" my heart broke
@JordanBoydGraber 1 year ago
This is why it's useful to have someone next to you while recording these things. My notation isn't always clear, and it's good to have a reality check.
@TheSaintsVEVO 4 years ago
I'm glad you made the video but it is NOT clear or understandable to a beginner. I'm glad it helped everyone else, though.
@JordanBoydGraber 4 years ago
Did you watch the previous videos, especially the one on mixture models? kzbin.info/www/bejne/p4ukf5iGaLWmqpo It's part of this course, which has even more context: users.umiacs.umd.edu/~jbg/teaching/CMSC_726/