Best lecture on word2vec. It covers everything the papers are ambiguous about: the notation, and the explanation of what to optimize and why.
@autripat 8 years ago
The Skip-gram model discussion starts at 17:20 (we transition away from the "intractable" continuous bag of words model). The Skip-gram training objective is to learn word vector representations that are good at predicting nearby words (context). The GloVe (Global Vectors for Word Representation) model starts at 54:36.
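For reference, here is the Skip-gram objective as stated in the word2vec papers. The symbols (u for output vectors, v for input vectors, window size m) are a commonly used convention and an assumption here; the lecture's exact notation may differ.

```latex
% Skip-gram (Mikolov et al., 2013): maximize the average log-probability
% of the context words within a window of size m around each center word.
\max \; \frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}}
  \log p(w_{t+j} \mid w_t),
\qquad
p(c \mid w) = \frac{\exp\!\left(u_c^{\top} v_w\right)}
                   {\sum_{w'=1}^{|V|} \exp\!\left(u_{w'}^{\top} v_w\right)}
```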
@niteshroyal30 8 years ago
Thanks, Professor, for such a wonderful lecture on word2vec.
@wanminghuang1722 8 years ago
Thank you so much. This is much easier to understand.
@paolofreuli1686 7 years ago
Awesome lecture!
@m.farahmand7440 8 years ago
Thanks for the informative lecture. At 7:26, shouldn't it be gradient ascent? After all, we are trying to maximize the likelihood function.
@yangli7741 6 years ago
I think 7:26 is just gradient descent, and the person who suggested that the summation sign shouldn't be there actually understood it wrong, because Prof. Ghodsi may have used confusing notation. In the log-likelihood, the summation over $w$ runs over every word in the training set (the word to predict given context $c$). However, when taking the derivative, the vector we differentiate with respect to can be any word in the vocabulary, i.e., any column of the weight matrix $W'$ to be learned, so we should use a different notation for it, e.g., $v_{w^*}$. Accordingly, the summation over $w$ should exist in the first place, because $w^*$ and $w$ are not the same thing. The later removal of the summation in the update rule, $v_{w^*} \leftarrow v_{w^*} - \eta \left(1 - p(w \mid c)\right) \frac{\partial\, v_c^{\top} v_w}{\partial v_{w^*}}$, can be seen as changing from GD to SGD. The only reason the final result doesn't go wrong is that the partial derivative $\frac{\partial\, v_c^{\top} v_w}{\partial v_{w^*}}$ is zero whenever $w^* \neq w$; that is, during the SGD step, only $v_w$ is updated.
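For readers following along, here is a minimal worked version of this argument, assuming the standard softmax parameterization $p(w \mid c) = \exp(v_c^{\top} v_w) / \sum_{w'} \exp(v_c^{\top} v_{w'})$ (the lecture's exact symbols may differ). It also bears on the ascent-vs-descent question above: descending the negative log-likelihood is the same as ascending the log-likelihood.

```latex
% Log-likelihood of one (context c, target w) pair under the softmax:
\log p(w \mid c) = v_c^{\top} v_w
    - \log \sum_{w'} \exp\!\left(v_c^{\top} v_{w'}\right)

% Derivative with respect to an arbitrary output vector v_{w^*}
% (w^* need not equal w, hence the distinct symbol):
\frac{\partial \log p(w \mid c)}{\partial v_{w^*}}
  = \mathbb{1}[w^* = w]\, v_c - p(w^* \mid c)\, v_c
  = \left(\mathbb{1}[w^* = w] - p(w^* \mid c)\right) v_c

% The first term vanishes whenever w^* \ne w, which is why dropping
% the summation over w in a single SGD step still gives a valid update.
```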
@cem9927 6 years ago
If we have 4 words in the dictionary, we will have 4 v_w values, and in the gradient descent update we will update each v_w separately, right? (See the sketch at the end of this thread.)
@tejasduseja 4 years ago
@@yangli7741 Thanks, I had the same confusion in mind.
@imanshojaei7784 4 years ago
@@yangli7741 Aren't the labels (i.e., the empirical probabilities) also missing from the formulation?
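Picking up @cem9927's question above: a minimal numpy sketch (illustrative names, not the lecture's code) of one full-softmax SGD step on a toy 4-word vocabulary. With the full softmax, every output vector receives the gradient (p(w'|c) - 1[w'=w]) v_c, so yes, each v_w is updated separately.

```python
# Minimal sketch of one SGD step for the softmax Skip-gram gradient
# above, on a toy 4-word vocabulary. All names (V, d, lr, W_in, W_out)
# are assumptions for illustration, not taken from the lecture.
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 4, 3, 0.1             # vocab size, embedding dim, learning rate

W_in = rng.normal(size=(V, d))   # context (input) vectors v_c, one per row
W_out = rng.normal(size=(V, d))  # output vectors v_w, one per row

c, w = 1, 2                      # one observed (context, target) pair

scores = W_out @ W_in[c]                 # v_c . v_{w'} for every word w'
p = np.exp(scores - scores.max())        # numerically stable softmax
p /= p.sum()                             # p(w' | c) for every word w'

# Gradient of -log p(w|c): each output vector w' gets the update
# direction (p(w'|c) - 1[w'=w]) * v_c, so all four rows of W_out
# move in general -- each one separately, as asked above.
onehot = np.eye(V)[w]
grad_out = np.outer(p - onehot, W_in[c])   # shape (V, d)
grad_in_c = W_out.T @ (p - onehot)         # gradient for v_c, shape (d,)

W_out -= lr * grad_out
W_in[c] -= lr * grad_in_c

new_p = np.exp(W_out @ W_in[c])
new_p /= new_p.sum()
print("p(w|c) before: %.3f  after: %.3f" % (p[w], new_p[w]))
```

Running this should show p(w|c) increase after the step, since we descended the negative log-likelihood of the observed pair.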
@stolzenable 8 years ago
Thank you for this lecture! It is very understandable. I wonder if the slides from this lecture are available somewhere.
@stolzenable 8 years ago
+Alexey Grigorev With a bit of googling, I found them here: uwaterloo.ca/data-science/deep-learning
@aseefzahir3977 6 years ago
says "page not found"
@srujohn652 3 years ago
@Alexey Grigorev The page is still not found.
@rajupowers 8 years ago
@7:20 - how can we factor out v_c? There is no summation in the right term.
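A possible resolution, assuming the slide at 7:20 shows the derivative with respect to an output vector $v_{w^*}$ (the exact expression on the slide isn't quoted here): both terms of that derivative are scalar multiples of $v_c$, which is what lets $v_c$ factor out without any summation remaining.

```latex
% Derivative w.r.t. an output vector: both terms are multiples of v_c,
% so v_c factors out of the whole expression.
\frac{\partial \log p(w \mid c)}{\partial v_{w^*}}
  = \mathbb{1}[w^* = w]\, v_c - p(w^* \mid c)\, v_c
  = \underbrace{\left(\mathbb{1}[w^* = w] - p(w^* \mid c)\right)}_{\text{scalar}}\, v_c

% By contrast, the derivative w.r.t. the context vector v_c keeps its
% summation, because each term carries a different vector v_{w'}:
\frac{\partial \log p(w \mid c)}{\partial v_c}
  = v_w - \sum_{w'} p(w' \mid c)\, v_{w'}
```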