Comments
@fcw1310 15 hours ago
Thanks for such an amazing illustration of diffusion. One question about the equation in the slide @ 13:16: how do you get to t-2 and t-3?
x_t = sqrt(a_t)*x_{t-1} + sqrt(1-a_t)*e
x_{t-1} = sqrt(a_{t-1})*x_{t-2} + sqrt(1-a_{t-1})*e
x_t = sqrt(a_t)*[sqrt(a_{t-1})*x_{t-2} + sqrt(1-a_{t-1})*e] + sqrt(1-a_t)*e = sqrt(a_t*a_{t-1})*x_{t-2} + [sqrt(a_t - a_t*a_{t-1}) + sqrt(1-a_t)]*e
The rightmost term doesn't equal, or even come close to, sqrt(1 - a_t*a_{t-1})*e. Did I misunderstand something? Thanks again. @Outlier
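A short worked step that may answer this (the standard DDPM argument, not taken from the video itself): the two noise samples are independent Gaussians, so their variances add rather than their standard deviations; writing both as the same symbol e conflates them.

```latex
\sqrt{a_t(1-a_{t-1})}\,\varepsilon_1 + \sqrt{1-a_t}\,\varepsilon_2
\;\sim\; \mathcal{N}\!\bigl(0,\;[\,a_t(1-a_{t-1}) + (1-a_t)\,]\,I\bigr)
\;=\; \mathcal{N}\!\bigl(0,\;(1 - a_t a_{t-1})\,I\bigr)
% so the merged term can be rewritten as sqrt(1 - a_t*a_{t-1}) * e with a fresh
% noise sample e: variances of independent Gaussians add, standard deviations do not.
```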
@subashchandrapakhrin3537 1 day ago
Very Bad Video
@outliier 1 day ago
:(
@user-hm6sh6pl7r 1 day ago
Thanks for the explanation, it's awesome! But I have a question. In cross-attention, if we set the text as V, the final attention matrix could be viewed as a weighted sum of each word in V itself (the "weighted" part comes from the Q, K similarity). If I understand correctly, the final attention matrix should contain values in the text domain, so why can we multiply by a W_out projection and get a result in the image domain (and add it to the original image)? Would it make more sense if we set the text condition as Q, and the image as K, V?
@outliier 1 day ago
If the text conditioning were Q, then the output would not have the same shape as your image. So Q needs to be the image.
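A minimal shape sketch of this point (illustrative tensor sizes, not the video's exact code): with Q from the image tokens and K, V from the text tokens, the attention output keeps one row per image token, so it can be projected by W_out and added back onto the image features.

```python
import torch

B, N_img, N_txt, d = 2, 64, 77, 32            # hypothetical sizes
img_tokens = torch.randn(B, N_img, d)          # queries come from the image
txt_tokens = torch.randn(B, N_txt, d)          # keys/values come from the text

Q, K, V = img_tokens, txt_tokens, txt_tokens
attn = torch.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)  # (B, N_img, N_txt)
out = attn @ V                                  # (B, N_img, d): one row per image token
W_out = torch.nn.Linear(d, d)
print((img_tokens + W_out(out)).shape)          # residual add works: shapes match
```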
@mousamustafa1042 5 days ago
Really liked that you showed the derivation in an understandable way.
@raphaelfeigl1209 7 days ago
Amazing explanation, thanks a lot! Minor improvement suggestion: add a pop filter to your microphone :)
@tomasjavurek1030 9 days ago
I think the statement N(mu, sigma) = mu + sigma*N(0, 1) is not exactly true as written. Just try that transformation; mu plays the role of a translation along the value axis. What is correct, however, is that sampling from the left side acts the same as sampling from the right side. I am pointing this out because I got stuck on it for a while. But I still might have it completely wrong.
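For reference, the reparameterization trick the slide presumably relies on is an equality in distribution, not an identity between functions (a standard fact, not specific to the video):

```latex
x = \mu + \sigma\,\varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)
\;\;\Longrightarrow\;\;
x \sim \mathcal{N}(\mu, \sigma^2)
% Sampling x directly from N(mu, sigma^2), or sampling eps and then shifting/scaling,
% produce the same distribution; "N(mu, sigma) = mu + sigma*N(0,1)" is shorthand
% for this, not a literal equation between values.
```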
@tomasjavurek1030 9 days ago
Also, later, when working with the alphas, there is probably just an approximate equality, restricted to the first order of the Taylor expansion.
@EvanSpades 11 days ago
Love this - what a fantastic achievement!
@mtolgacangoz 13 days ago
Brilliant work!
@shojintam4206 13 days ago
11:57
@jefersongallo8033 14 days ago
This is a really great video, thanks for the big effort you put into explaining it!
@akkokagari7255 14 days ago
Wonderful explanation! Not sure if this is in the original papers, but I find it very odd that there is no nonlinear function after V and before W_out. It seems like a waste to me, since Attention@V is itself a linear function, so W_out won't necessarily change the content of the data beyond what Attention@V already would have done through training.
@akkokagari7255 14 days ago
Whoops, I mean the similarity matrix, not Attention.
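A minimal sketch of the point being raised (single head; the softmax acts on the similarity matrix, not on V): with nothing nonlinear between the V projection and W_out, the two compose into a single linear map.

```python
import torch

d, n = 8, 5
W_v = torch.randn(d, d)
W_out = torch.randn(d, d)
x = torch.randn(n, d)                                    # token features
sim = torch.softmax(torch.randn(n, n), dim=-1)           # stand-in similarity matrix

y_two_maps = (sim @ (x @ W_v)) @ W_out                   # V projection, then W_out
y_one_map = sim @ (x @ (W_v @ W_out))                    # a single fused linear map
print(torch.allclose(y_two_maps, y_one_map, atol=1e-5))  # True: same transformation
```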
@JeavanCooper 16 days ago
The strange pattern in the reconstructed image and the generated image is likely caused by the perceptual loss. I have no idea why, but it disappears when I take the perceptual loss away.
@ChristProg 17 days ago
Thank you so much. But please, I would prefer that you go through the maths and operations in more detail regarding the training of Würstchen 🎉🎉 thank you
@RyanHelios 17 days ago
really nice video, helps me understand a lot❗
@mtolgacangoz 18 days ago
Great video!! At 13:34, is multiplying with a_0 correct?
@user-kx1nm3vw5s 21 days ago
best explanation
@siddharthshah9316 22 days ago
This is an amazing video 🔥
@gintonic6204 24 days ago
12:12 Does anyone understand why, when \beta is linear, \sqrt{1-\beta} is linear as well?
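One likely explanation (a standard first-order Taylor argument, consistent with the "approximately equal" remark earlier in the thread; not taken from the video): the β values are small, so the square root is approximately linear in β.

```latex
% For small beta (e.g. 1e-4 .. 2e-2 in typical DDPM schedules),
\sqrt{1-\beta} \;\approx\; 1 - \tfrac{\beta}{2}
% so if beta increases linearly in t, sqrt(1-beta) decreases approximately
% linearly in t; the curvature from the square root is second order and tiny.
```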
@sciencerz7460 27 days ago
The statement at 15:33 isn't right... is it? Because I have a counterexample: f(x) = x^2, g(x) = -x^2. Here f(x) >= g(x), but their derivatives are negatives of each other. Please help, I don't really understand the concept of the ELBO.
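For context on the ELBO itself (the standard decomposition, not verified against the exact statement at 15:33): maximizing the lower bound is justified because the gap to the true log-likelihood is a non-negative KL term, not because the bound shares its derivative.

```latex
\log p_\theta(x)
 = \underbrace{\mathbb{E}_{q(z\mid x)}\!\left[\log \tfrac{p_\theta(x, z)}{q(z\mid x)}\right]}_{\text{ELBO}}
 \;+\; D_{\mathrm{KL}}\!\bigl(q(z\mid x)\,\|\,p_\theta(z\mid x)\bigr)
 \;\;\ge\;\; \text{ELBO}
% since the KL divergence is always non-negative.
```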
@KienLe-md9yv 27 days ago
At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized to discrete latents (the discrete latents are chosen from the codebook, by finding the codebook vector nearest to each vector in the continuous latents). But the output of Stage B is continuous latents, and the output of Stage B goes directly into the input of Stage A... is that right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and the Würstchen paper, and it is not clear. Please help me with that. Thank you
@outliier 27 days ago
The VQGAN decoder can also decode continuous latents. It‘s as easy as that.
@KienLe-md9yv 27 days ago
So, apparently, it sounds like Würstchen is exactly Stage C. Am I right?
@outliier 27 days ago
What do you mean exactly?
@readbyname 28 days ago
Hey, great video. Can you tell me why random sampling of codebook vectors doesn't generate meaningful images? In a VAE we randomly sample from a standard Gaussian, so why doesn't the same work for VQ autoencoders?
@outliier 28 days ago
Because in a VAE you only predict a mean and a standard deviation, and sampling from that is easier. Sampling the codebook vectors happens independently, and this is why it doesn't give a meaningful output.
@Bhllllll 1 month ago
How did you manage to get 128 A100s for 3 weeks? I think the cost is about 100k USD for one run. Assuming you did multiple iterations, the overall cost could easily be 200k for this project.
@ashimdahal182 1 month ago
Just completed writing a 24-page handwritten note based on this video and a few other sources.
@outliier 1 month ago
Wanna share it? :D
@TheSlepBoi 1 month ago
Amazing explanation and thank you for taking the time to properly visualize everything
@Gruell 1 month ago
Sorry if I am misunderstanding, but at 19:10, shouldn't the code be "uncond_predicted_noise = model(x, t, None)" instead of "uncond_predicted_noise = model(x, labels, None)"?

Also, according to the CFG paper's formula, shouldn't the next line be "predicted_noise = torch.lerp(predicted_noise, uncond_predicted_noise, -cfg_scale)" under the definition of lerp?

One last question: have you tried using L1Loss instead of MSELoss? On my implementation, L1 loss performs much better (although my implementation is different from yours). I know the ELBO term expands to essentially an MSE term wrt the predicted noise, so I am confused as to why L1 loss performs better for my model. Thank you for your time.
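For reference, a minimal sketch of the classifier-free guidance combination being discussed (one common convention; variable names are illustrative, and torch.lerp(start, end, w) = start + w*(end - start)):

```python
import torch

def cfg_combine(cond_noise: torch.Tensor, uncond_noise: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    # eps = eps_uncond + s * (eps_cond - eps_uncond); s > 1 pushes further toward the condition
    return torch.lerp(uncond_noise, cond_noise, cfg_scale)

# usage with hypothetical model outputs
x = torch.randn(4, 3, 64, 64)
cond_noise, uncond_noise = torch.randn_like(x), torch.randn_like(x)
predicted_noise = cfg_combine(cond_noise, uncond_noise, cfg_scale=3.0)
```

The form suggested in the comment, torch.lerp(predicted_noise, uncond_predicted_noise, -cfg_scale), works out to (1+s)*eps_cond - s*eps_uncond, which is the convention written in the CFG paper; both conventions appear in practice.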
@Gruell 1 month ago
Great videos by the way
@Gruell 1 month ago
Ah, I see you already fixed the first question in the codebase
@duduwe8071 1 month ago
Hey @Outlier, at 12:44 it looks like you mistakenly use "a" instead of the "alpha" symbol inside the product (Pi) notation, since you show the example multiplication below it using alpha notation, e.g. t = 8: "alpha_8" = "alpha_1" x "alpha_2" x "alpha_3" x "alpha_4" x "alpha_5" x "alpha_6" x "alpha_7" x "alpha_8". Is it intentional, though? Please let me know. Thanks
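For reference, the quantity being multiplied out is the cumulative product of the per-step alphas (standard DDPM notation; written here with a bar, as mentioned elsewhere in the thread, though some texts use a hat instead):

```latex
\bar{\alpha}_t \;=\; \prod_{s=1}^{t} \alpha_s ,
\qquad\text{e.g.}\quad
\bar{\alpha}_8 \;=\; \alpha_1\,\alpha_2\,\alpha_3\,\alpha_4\,\alpha_5\,\alpha_6\,\alpha_7\,\alpha_8 .
```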
@coy457 1 month ago
This is dumb, but can anyone explain why when beta increases linearly, the square root of 1 - beta decreases linearly, at 12:13? Shouldn't it have some curve to it, given the square root?
@antongolles8896 1 month ago
@22:32 you're missing a bar over the alpha on the bottom line. Please correct me if I'm wrong.
@outliier 1 month ago
You are probably right 🤔
@UnbelievableRam 1 month ago
Hi! Can you please explain why the output is getting two stitched images?
@outliier 1 month ago
What do you mean by two stitched images?
@arka-h274 1 month ago
How did the KL divergence expand to log(q/p)? You yourself mentioned it to be the integral of q*log(q/p) for D_KL(q||p). Perhaps that is too much of a simplification?
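For reference, the definition that likely reconciles the two forms (a standard identity, not checked against the exact slide): the integral against q is just an expectation, so log(q/p) written inside an expectation over q is the same object.

```latex
D_{\mathrm{KL}}\bigl(q \,\|\, p\bigr)
 \;=\; \int q(x)\,\log\frac{q(x)}{p(x)}\,dx
 \;=\; \mathbb{E}_{x\sim q}\!\left[\log\frac{q(x)}{p(x)}\right]
% so a bare log(q/p) on a slide usually stands for this expectation over q.
```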
@khoakirokun217 1 month ago
I would really love to understand this diffusion thing, but I literally just say "WTF" ...
@HearinCantMeow 1 month ago
What a wonderful and thoughtful way to deliver the whole landscape of the diffusion model! Nice video! 👍
@jasdeepsinghgrover2470 1 month ago
Just out of curiosity... did they derive everything and then get amazing results, or the other way around? I am pretty sure someone got amazing results with this and then thought, how should I put it in a paper without being rejected 😂😂😂
@shubhamtrehan8753 1 month ago
As a PhD student who also struggles with notations, THANK YOU!!
@pallavd3 1 month ago
@outliier at 12:49, I am wondering how, in the second line, the beta in beta*I got put under a square root? Please explain the assumption taken here...
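A likely explanation (the standard reparameterization of the forward step, not verified against the exact slide): beta_t*I is a covariance, so when the Gaussian is rewritten as a sample, the standard deviation sqrt(beta_t) appears.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\; \sqrt{1-\beta_t}\,x_{t-1},\; \beta_t I\bigr)
\;\;\Longleftrightarrow\;\;
x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\varepsilon,
\quad \varepsilon \sim \mathcal{N}(0, I)
% i.e. the variance beta_t becomes the standard deviation sqrt(beta_t)
% once the sampling is written via the reparameterization trick.
```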
@ankanderia4999 1 month ago
`x = torch.randn((n, 3, self.img_size, self.img_size)).to(self.device)`
`predicted_noise = model(x, t)`
In the diffusion class, why do you create noise and pass that noise into the model to predict noise? ... Please explain.
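A minimal sketch of the kind of sampling loop that snippet appears in (assuming, as is standard for DDPM, that model(x, t) predicts the noise contained in x; the schedule tensor names alpha, alpha_hat, beta are illustrative): the randn call creates the fully-noised starting image x_T, not the noise the model is asked to predict.

```python
import torch

@torch.no_grad()
def sample(model, n, img_size, noise_steps, alpha, alpha_hat, beta, device="cpu"):
    # Start from pure Gaussian noise: this is x_T, the "image" to be denoised,
    # not the noise target the model predicts.
    x = torch.randn((n, 3, img_size, img_size), device=device)
    for i in reversed(range(1, noise_steps)):
        t = torch.full((n,), i, device=device, dtype=torch.long)
        predicted_noise = model(x, t)          # model's estimate of the noise inside x
        a = alpha[t][:, None, None, None]
        ah = alpha_hat[t][:, None, None, None]
        b = beta[t][:, None, None, None]
        noise = torch.randn_like(x) if i > 1 else torch.zeros_like(x)
        # one reverse-diffusion (denoising) step of the DDPM update rule
        x = (1 / torch.sqrt(a)) * (x - ((1 - a) / torch.sqrt(1 - ah)) * predicted_noise) \
            + torch.sqrt(b) * noise
    return x
```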
@matthewprestifilippo7673 2 months ago
Great content
@jeffg4686 2 months ago
Nice !
@user-rl9px9jg5e 2 months ago
Thank you so much!!! I was finally able to understand Attention with your video!!!
@fouriertransformationsucks438 2 months ago
Incredible explanation. My Bayesian analysis course would be so much better if it were explained this way. *Little suggestion: use \left( blabla \right) for brackets so that they fit vertically.
@nikhilsaharan 2 months ago
Thank you so much for this
@paktv858 2 months ago
Why do you use self-attention in the UNet architecture instead of other kinds of attention in the UNet? How does it work here?
@jayhu2296 2 months ago
Thanks for the crystal clear explanation!
@autkarsh8830 2 months ago
Thanks, the video was really helpful. It gave me such a great time understanding diffusion models. Kudos, and keep on making such quality content!
@sureshmaliofficial 2 months ago
This is really great. It makes it so easy to grasp such complex stuff.
@Soso65929 2 months ago
So the process of adding noise and removing it happens in a loop
@harshaldharpure9921 2 months ago
I have two features: x -> text feature, y -> image feature, and additionally rag -> rag_feature (this is an extra feature). I want to apply cross-attention between rag and (x/y). How should I apply it?
@AdmMusicc 2 months ago
This was the best ML paper review I have ever seen. You stopped making videos, but I would really love to see you go through more of this for more research in the field, man! Hats off to you.
@uslessfella 2 months ago
19:57 How can it be a KL divergence? There has to be a term outside of the log for it to be a KL divergence, right?? Can you explain this?