Stanford CS224W: ML with Graphs | 2021 | Lecture 3.2-Random Walk Approaches for Node Embeddings

78,981 views

Stanford Online

Comments: 21
@ananthakrishnank3208 · 10 months ago
I do not understand why we optimize this particular expression, 'sum over all nodes u of log P(N_R(u) | z_u)'. I see that the encoder learns f(u) = z_u, the node embedding for node u. But what does the probability of the neighborhood of u (collected by some random walk strategy R), given the node embedding of u, even mean? And if this is done by the decoder, isn't the decoder meant to predict a node from a node embedding rather than a "neighborhood"?
@michal3141 · 9 months ago
This notation confuses me a lot, to be honest. My take: we run random walks from each node u in V and keep track of every occurrence of the event that node v was reached starting from u. We associate a probability P(v|u) (node v was reached from u) with each such event and sum over all of these events across all random walks. The correspondence between P(v|u) and the node embeddings is defined so that the higher P(v|u) is, the higher the similarity between the embeddings of v and u. The intuition: if we run enough random walks, a node v that is frequently reachable from u contributes a term k·P(v|u) to the total sum, where k is a relatively large count, so optimizing this objective pushes the embeddings of v and u toward high similarity. Long story short: we build a log-likelihood objective that encodes how often we get from u to v in the random walks, and after maximizing it, the similarity between the embeddings of u and v reflects the likelihood of getting from node u to v in a random walk.
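A minimal Python sketch of the co-occurrence counting described above, assuming uniform (unbiased) random walks over a graph given as adjacency lists; the function names and the toy graph are illustrative, not taken from the lecture.

import random
from collections import defaultdict

def random_walk(adj, start, length):
    # Unbiased random walk: at each step move to a uniformly chosen neighbor.
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

def cooccurrence_counts(adj, walks_per_node=10, walk_length=5):
    # counts[(u, v)]: how many times v was visited on walks started from u;
    # normalizing over v would give the empirical P(v | u) described above.
    counts = defaultdict(int)
    for u in adj:
        for _ in range(walks_per_node):
            for v in random_walk(adj, u, walk_length)[1:]:
                counts[(u, v)] += 1
    return counts

# Toy graph as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(dict(cooccurrence_counts(adj)))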
@paul_tee · 7 months ago
In the first slide, he defines P(v | z_u) as the model-predicted probability of reaching v from u. So I assume that if N_R(u) = {v_1, ..., v_k}, then P(N_R(u) | z_u) = P(v_1, ..., v_k | z_u) = Π_{i=1}^{k} P(v_i | z_u). Taking the log of this product should explain the inner sum in L. However, I don't see where the negative sign in L comes from.
@paul_tee · 7 months ago
I guess this maximization is done with respect to a fixed random walk strategy R (to be discussed later). Given R, the goal is to find an embedding f so that the model can accurately predict that u will reach all of its neighbors N_R(u). Why can we post-compose with log? log is strictly monotone increasing, and post-composing with any strictly monotone increasing function preserves critical points and extrema, so it is a safe thing to do. Why it is helpful, I am not sure yet.
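A minimal NumPy sketch of the objective discussed in this thread, assuming the softmax parameterization from the lecture, P(v | z_u) = exp(z_u · z_v) / Σ_{n ∈ V} exp(z_u · z_n); the minus sign in L comes from writing the maximization of the log-likelihood as the minimization of its negative.

import numpy as np

def neg_log_likelihood(Z, walk_pairs):
    # Z: (num_nodes, dim) embedding matrix; walk_pairs: (u, v) pairs with v in N_R(u).
    # L = sum over (u, v) of -log( exp(z_u . z_v) / sum_n exp(z_u . z_n) )
    loss = 0.0
    for u, v in walk_pairs:
        scores = Z @ Z[u]                                    # z_u . z_n for every node n
        log_softmax = scores - np.log(np.exp(scores).sum())  # log P(. | z_u)
        loss += -log_softmax[v]
    return loss

# Toy usage: 4 nodes, 2-dimensional embeddings, pairs collected from random walks.
Z = np.random.randn(4, 2)
print(neg_log_likelihood(Z, [(0, 1), (0, 2), (2, 3)]))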
@shexiaogui · 1 year ago
I didn't understand the "drawbacks of node2vec" part. Why do we need to learn each node's embedding individually?
@igorgordiy7709 · 1 year ago
I guess it is because every entry of each embedding vector z_v is itself a free parameter with respect to which the gradient is calculated, so the embedding table is the entire set of learnable parameters and nothing is shared between nodes.
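A small sketch of what "each node embedding is its own parameter" means in practice, assuming a plain lookup-table embedding trained by gradient steps; the squared-error loss used here is a toy stand-in, not the lecture's objective.

import numpy as np

num_nodes, dim = 4, 2
Z = np.random.randn(num_nodes, dim)   # the parameter set IS this table: one row per node, |V| * d values

# Toy gradient step for one observed pair (u, v) with loss 0.5 * (z_u . z_v - 1)^2:
# only rows u and v are updated, so a node never visited during training never
# receives a useful embedding, and the parameter count grows with the number of nodes.
u, v, lr = 0, 1, 0.1
score = Z[u] @ Z[v]
grad_u = (score - 1.0) * Z[v]
grad_v = (score - 1.0) * Z[u]
Z[u] -= lr * grad_u
Z[v] -= lr * grad_v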
@isaacgonzalez8476 · 1 year ago
Why can we use maximum likelihood estimation here? The observed data should be independent for that, and it clearly is not, since the probability of visiting node v depends on the previously visited nodes.
@Simon-ed6zc · 1 year ago
I had a quick question: do you generate the negative samples once per epoch and then use them for every node, or do you generate new negative samples for every single node you sample in every iteration?
@sarangak.mahanta6168 · 1 year ago
What do you mean by the negative examples in negative sampling?
@vadimshatov9935 · 1 year ago
This is exactly the same technique as the one used in word2vec.
@bihanbanerjee · 1 year ago
Handpicking some nodes that are relevant to the context of the node of interest on the basis of some factor such as degree. In a large network, a particular node can have a large number of neighbouring nodes. In that case, to make the optimisation more computationally efficient, we discard some of the neighbouring nodes and choose only a few, preferably 5 to 20, that are important or relevant. Their degree can be a measure of their relevance: as a node's degree increases, the probability that a random walk passes through it also increases.
@ahamuffin4747 · 1 year ago
@bihanbanerjee Can you tell me why "higher k in sampling corresponds to higher bias on negative events" (at around 12:30)?
@yoctometer2045 · 1 year ago
@bihanbanerjee I don't think that's entirely correct: neighbouring nodes are positive samples, just as in word2vec, where context words are the "positive samples". Negative sampling samples from the negative examples, in this case nodes that are NOT neighbors of the central node.
@yoctometer2045 · 1 year ago
@ahamuffin4747 To my understanding, the loss has two terms: (1) a term in (z_u, z_v) for the positive example (the similarity between the central node and its neighbor), and (2) terms in (z_u, z_{n_i}), i = 1, ..., k, for the k selected negative samples. The objective improves either by increasing term 1 or by decreasing term 2. If k is increased, the sum in term 2 has more elements and therefore more weight, so stronger emphasis is put on the negative samples: the embedding is adjusted so that similarity to the negative examples becomes smaller. In that sense there is a higher bias toward negative events, since the loss punishes high similarity to non-neighbor nodes more strongly than it rewards high similarity to neighbor nodes.
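A minimal NumPy sketch of the negative-sampling objective being discussed, assuming the per-pair form log σ(z_u · z_v) − Σ_{i=1}^{k} log σ(z_u · z_{n_i}) from the lecture; the negatives are drawn uniformly here for simplicity, whereas the lecture samples them with probability proportional to degree, and in practice one would also exclude u and v from the draw.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_term(Z, u, v, k, rng):
    # One term of the objective for a positive pair (u, v):
    #   log sigma(z_u . z_v) - sum_{i=1..k} log sigma(z_u . z_{n_i})
    # Negatives are drawn uniformly here; excluding u and v is omitted for brevity.
    negatives = rng.choice(len(Z), size=k, replace=False)
    positive = np.log(sigmoid(Z[u] @ Z[v]))
    negative = sum(np.log(sigmoid(Z[u] @ Z[n])) for n in negatives)
    return positive - negative

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 8))        # 10 nodes, 8-dimensional embeddings
print(negative_sampling_term(Z, u=0, v=3, k=5, rng=rng))

With a larger k, the second sum has more terms, which is the "higher bias on negative events" mentioned above.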
@arthurpenndragon6434 · 1 year ago
Why should theta between embeddings be directly proportional to the co-occurrence of the two nodes? Shouldn't the embeddings be close together if the nodes co-occur often (implying similarity), not further apart? Shouldn't it be inversely proportional, i.e. the more often they occur on the same walk, the closer they are in angle? I must be missing something obvious here.
@LuckySunda · 1 year ago
Hi, actually cos theta is what is directly proportional to similarity, not theta itself: cos 0° = 1, where similarity is maximal, and cos 90° = 0, where similarity is zero.
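A small worked example of the point above, using made-up 2-D embeddings: the cosine of the angle (not the angle itself) is what measures similarity, so a smaller angle gives a value closer to 1.

import numpy as np

def cos_sim(a, b):
    # cos(theta) between two embedding vectors
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

z_u = np.array([1.0, 0.0])
z_close = np.array([0.9, 0.1])   # small angle to z_u  -> cos(theta) near 1
z_far = np.array([0.0, 1.0])     # 90 degrees to z_u   -> cos(theta) = 0
print(cos_sim(z_u, z_close), cos_sim(z_u, z_far))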
@AmanSingh-u8y8m · 11 months ago
Can we get the slides?
@mohammed-elkasri · 9 months ago
The slides are provided on their website.
@alisalloum629 · 9 months ago
Thank you kind sir