Thanks for such an amazing illustration of Diffusion. One question about the equation in the slide @ 13:16: how do you get to t-2 and t-3?
x_t = sqrt(a_t)*x_{t-1} + sqrt(1 - a_t)*e
x_{t-1} = sqrt(a_{t-1})*x_{t-2} + sqrt(1 - a_{t-1})*e
x_t = sqrt(a_t)*[sqrt(a_{t-1})*x_{t-2} + sqrt(1 - a_{t-1})*e] + sqrt(1 - a_t)*e = sqrt(a_t*a_{t-1})*x_{t-2} + [sqrt(a_t - a_t*a_{t-1}) + sqrt(1 - a_t)]*e
The rightmost term doesn't equal, or come close to, sqrt(1 - a_t*a_{t-1})*e. Did I misunderstand something? Thanks again. @Outlier
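The apparent mismatch comes from adding the noise coefficients directly. The two e terms are independent Gaussian draws, so their variances add, not their standard deviations: a_t(1 - a_{t-1}) + (1 - a_t) = 1 - a_t*a_{t-1}, which yields the single sqrt(1 - a_t*a_{t-1})*e term. A quick numerical check of this identity (alpha values chosen arbitrarily for illustration):

```python
import numpy as np

# Independent zero-mean Gaussians: Var[c1*e1 + c2*e2] = c1^2 + c2^2
# for e1, e2 ~ N(0, 1). Here c1 = sqrt(a_t*(1 - a_{t-1})), c2 = sqrt(1 - a_t),
# so the merged variance is a_t - a_t*a_{t-1} + 1 - a_t = 1 - a_t*a_{t-1}.
a_t, a_tm1 = 0.95, 0.97  # arbitrary example values in (0, 1)

combined_var = a_t * (1 - a_tm1) + (1 - a_t)
assert abs(combined_var - (1 - a_t * a_tm1)) < 1e-12

# Monte Carlo confirmation: merge two independent noise draws, measure the std
rng = np.random.default_rng(0)
e1 = rng.standard_normal(1_000_000)
e2 = rng.standard_normal(1_000_000)
merged = np.sqrt(a_t * (1 - a_tm1)) * e1 + np.sqrt(1 - a_t) * e2
print(merged.std(), np.sqrt(1 - a_t * a_tm1))  # both ≈ 0.28
```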
@subashchandrapakhrin3537 · a day ago
Very Bad Video
@outliier · a day ago
:(
@user-hm6sh6pl7r · a day ago
Thanks for the explanation, it's awesome! But I have a question. In cross attention, if we set the text as V, the final attention matrix could be viewed as a weighted sum of the words in V themselves (the "weighted" part comes from the Q, K similarity). If I understand correctly, the final attention matrix should contain values in the text domain, so why can we multiply by a W_out projection and get a result in the image domain (added to the original image)? Would it make more sense to set the text condition as Q, and the image as K, V?
@outliier · a day ago
If the text conditioning is Q, then the output would not have the same shape as your image. So Q needs to be the image.
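A shape-level sketch of this reply (plain numpy, toy sizes chosen for illustration): attention output always has the query's sequence length, so with the image as Q the result is one vector per image token, which can then be projected by W_out and added back to the image features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_img, n_txt, d = 64, 8, 32          # 64 image tokens, 8 text tokens

q = rng.standard_normal((n_img, d))  # queries from the image
k = rng.standard_normal((n_txt, d))  # keys from the text
v = rng.standard_normal((n_txt, d))  # values from the text

attn = softmax(q @ k.T / np.sqrt(d)) @ v
print(attn.shape)  # (64, 32): one output vector per image token
```

Swapping the roles (text as Q) would give an (8, 32) output, which no longer matches the image's 64 tokens and so could not be added residually to the image features.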
@mousamustafa1042 · 5 days ago
I really liked that you showed the derivation in an understandable way
@raphaelfeigl1209 · 7 days ago
Amazing explanation, thanks a lot! Minor improvement suggestion: add a pop filter to your microphone :)
@tomasjavurek1030 · 9 days ago
I think the statement N(mu, sigma) = mu + sigma*N(0, 1) is not exactly true. Just try that transformation: mu plays the role of a translation along the value axis. What is correct is that sampling from the left side acts the same as sampling from the right side. I am pointing this out because I got stuck on it for a while. But I might also have gotten it completely wrong.
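The resolution the commenter reaches is the right one: the equality is between distributions, not functions. If z ~ N(0, 1), then mu + sigma*z is distributed as N(mu, sigma^2), which is the reparameterization trick. A seeded numerical check (mu and sigma chosen arbitrarily):

```python
import numpy as np

mu, sigma = 3.0, 0.5
rng = np.random.default_rng(42)
z = rng.standard_normal(1_000_000)  # samples from N(0, 1)
x = mu + sigma * z                  # reparameterized samples

# The sample mean and std match the target N(mu, sigma^2)
# up to Monte Carlo error.
print(x.mean(), x.std())  # ≈ 3.0 and ≈ 0.5
```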
@tomasjavurek1030 · 9 days ago
Also, later, when working with the alphas, there's probably just an approximately-equal relation, restricted to the first order of the Taylor expansion.
@EvanSpades · 11 days ago
Love this - what a fantastic achievement!
@mtolgacangoz · 13 days ago
Brilliant work!
@shojintam4206 · 13 days ago
11:57
@jefersongallo8033 · 14 days ago
This is a really great video, thanks for your big effort explaining!
@akkokagari7255 · 14 days ago
Wonderful explanation! Not sure if this is in the original papers, but I find it very odd that there is no nonlinear function after V and before W_out. It seems like a waste to me, since Attention@V is itself a linear function, so W_out won't necessarily change the content of the data beyond what Attention@V already would have done through training.
@akkokagari7255 · 14 days ago
Whoops, I meant the similarity matrix, not Attention
@JeavanCooper · 16 days ago
The strange pattern in the reconstructed image and the generated image is likely caused by the perceptual loss. I have no idea why, but it disappears when I take the perceptual loss away.
@ChristProg · 17 days ago
Thank you so much. But please, I would prefer that you go through the maths and operations in more detail for the training of Würstchen 🎉🎉 thank you
@RyanHelios · 17 days ago
really nice video, helps me understand a lot❗
@mtolgacangoz · 18 days ago
Great video!! At 13:34, is multiplying with a_0 correct?
@user-kx1nm3vw5s · 21 days ago
best explanation
@siddharthshah9316 · 22 days ago
This is an amazing video 🔥
@gintonic6204 · 24 days ago
12:12 Does anyone understand here why, when \beta is linear, \sqrt{1-\beta} is linear as well?
@sciencerz7460 · 27 days ago
The statement at 15:33 isn't right... is it? Because I have a counterexample: f(x) = x^2, g(x) = -x^2. Here f(x) >= g(x), but their derivatives are negatives of each other. Please help, I don't really understand the concept of the ELBO.
@KienLe-md9yv · 27 days ago
At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized to discrete latents (each discrete latent is chosen from the codebook as the codebook vector nearest to the continuous one). But the output of Stage B is continuous latents, and the output of Stage B goes directly into the input of Stage A... is that right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and this Würstchen paper; it is not clear. Please help me with that. Thank you
@outliier · 27 days ago
The VQGAN decoder can also decode continuous latents. It's as easy as that.
@KienLe-md9yv · 27 days ago
So, apparently, it sounds like Würstchen is exactly at Stage C. Am I right?
@outliier · 27 days ago
What do you mean exactly?
@readbyname · 28 days ago
Hey, great video. Can you tell me why random sampling of codebook vectors doesn't generate meaningful images? In a VAE we randomly sample from a standard Gaussian; why doesn't the same work for VQ autoencoders?
@outliier · 28 days ago
Because in a VAE you only predict mean and standard deviation, sampling is easier. The codebook vectors are sampled independently of each other, and this is why the output isn't meaningful.
@Bhllllll · a month ago
How did you manage to get 128 A100s for 3 weeks? I think the cost is about 100k USD for one run. Assuming you did multiple iterations, the overall cost could easily be 200k for this project.
@ashimdahal182 · a month ago
Just completed writing a 24-page handwritten note based on this video and a few other sources
@outliier · a month ago
Wanna share it? :D
@TheSlepBoi · a month ago
Amazing explanation and thank you for taking the time to properly visualize everything
@Gruell · a month ago
Sorry if I am misunderstanding, but at 19:10, shouldn't the code be: "uncond_predicted_noise = model(x, t, None)" instead of "uncond_predicted_noise = model(x, labels, None)" Also, according to the CFG paper's formula, shouldn't the next line be: "predicted_noise = torch.lerp(predicted_noise, uncond_predicted_noise, -cfg_scale)" under the definition of lerp? One last question: have you tried using L1Loss instead of MSELoss? On my implementation, L1 Loss performs much better (although my implementation is different than yours). I know the ELBO term expands to essentially an MSE term wrt predicted noise, so I am confused as to why L1 Loss performs better for my model. Thank you for your time.
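On the lerp ordering: torch.lerp(a, b, w) computes a + w*(b - a), so the two variants are both valid classifier-free guidance, just with differently parameterized guidance scales. A small numpy sketch of the algebra (illustrative values, not the video's actual code):

```python
import numpy as np

def lerp(a, b, w):
    # Same definition as torch.lerp: a + w * (b - a)
    return a + w * (b - a)

rng = np.random.default_rng(0)
uncond = rng.standard_normal(4)  # stand-in unconditional noise prediction
cond = rng.standard_normal(4)    # stand-in conditional noise prediction
s = 3.0                          # guidance scale

# Video's form: start at uncond, move toward (and past) cond by factor s
guided_a = lerp(uncond, cond, s)
assert np.allclose(guided_a, uncond + s * (cond - uncond))

# Commenter's form: lerp(cond, uncond, -s) = (1 + s)*cond - s*uncond,
# i.e. the (1 + w) parameterization of the CFG paper with w = s
guided_b = lerp(cond, uncond, -s)
assert np.allclose(guided_b, (1 + s) * cond - s * uncond)
```

Both push the prediction away from the unconditional output and beyond the conditional one; only the meaning of the scale differs.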
@Gruell · a month ago
Great videos by the way
@Gruell · a month ago
Ah, I see you already fixed the first question in the codebase
@duduwe8071 · a month ago
Hey @Outlier, at 12:44 it looks like you mistakenly use "a" instead of the "alpha" symbol inside the product (Pi) notation, since the example multiplication below it uses the alpha notation, e.g. for t = 8: "alpha_8" = "alpha_1" x "alpha_2" x "alpha_3" x "alpha_4" x "alpha_5" x "alpha_6" x "alpha_7" x "alpha_8". Is it intentional, though? Please let me know. Thanks
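For reference, the cumulative product the slide intends, in the standard DDPM notation where the bar denotes the product:

```latex
\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,
\qquad \text{e.g.} \quad
\bar{\alpha}_8 = \alpha_1 \alpha_2 \alpha_3 \alpha_4 \alpha_5 \alpha_6 \alpha_7 \alpha_8 .
```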
@coy457 · a month ago
This is dumb, but can anyone explain why, when beta increases linearly, the square root of 1 - beta decreases linearly, at 12:13? Shouldn't it have some curve to it, given the square root?
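It does curve, just imperceptibly: for the small betas of the usual schedule, the first-order Taylor expansion sqrt(1 - b) ≈ 1 - b/2 is almost exact, and 1 - beta/2 is linear whenever beta is. A quick check (assuming the common 1e-4 to 0.02 linear schedule):

```python
import numpy as np

beta = np.linspace(1e-4, 0.02, 1000)  # common linear DDPM-style schedule
y = np.sqrt(1 - beta)

# First-order Taylor expansion around beta = 0: sqrt(1 - b) ≈ 1 - b/2.
# Since beta is linear in t, 1 - beta/2 is linear in t as well.
approx = 1 - beta / 2

print(np.abs(y - approx).max())  # worst-case deviation, on the order of 5e-5
```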
@antongolles8896 · a month ago
@22:32 you're missing a bar over the alpha on the bottom line. Please correct me if I'm wrong
@outliier · a month ago
You are probably right 🤔
@UnbelievableRam · a month ago
Hi! Can you please explain why the output is getting two stitched images?
@outliier · a month ago
What do you mean with two stitched images?
@arka-h274 · a month ago
How did the KL divergence expand to log(q/p)? You yourself mentioned it to be the integral of q*log(q/p) for D_KL(q||p). Perhaps too much of a simplification?
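The expectation is implicit: when log(q/p) is written inside an expectation over q, it is exactly the integral form of the KL divergence:

```latex
D_{\mathrm{KL}}\big(q \,\|\, p\big)
  = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]
  = \int q(x) \log \frac{q(x)}{p(x)} \, dx .
```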
@khoakirokun217 · a month ago
I would really love to understand this Diffusion thing, but I literally keep saying "WTF" ...
@HearinCantMeow · a month ago
What a wonderful and thoughtful way to deliver the whole landscape of the diffusion model! Nice video! 👍
@jasdeepsinghgrover2470 · a month ago
Just out of curiosity... did they derive everything and then get amazing results, or the other way around? I am pretty sure someone got amazing results with this and then thought, how should I put it in a paper without being rejected 😂😂😂
@shubhamtrehan8753 · a month ago
As a PhD student who also struggles with notations, THANK YOU!!
@pallavd3 · a month ago
@outliier at 12:49, I'm wondering how, in the second line, the beta in beta*I got put under a square root. Please explain the assumption taken here...
@ankanderia4999 · a month ago
`x = torch.randn((n, 3, self.img_size, self.img_size)).to(self.device)`
`predicted_noise = model(x, t)`
In the diffusion class, why do you create noise and pass that noise into the model to predict noise? ... please explain
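The randn tensor is the starting point of the reverse process: sampling begins from pure Gaussian noise x_T ~ N(0, I), and at every step the model predicts the noise still contained in x so a bit of it can be removed. A minimal numpy sketch of that loop with a stand-in model (not the video's actual network):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
beta = np.linspace(1e-4, 0.02, T)  # noise schedule
alpha = 1 - beta
alpha_bar = np.cumprod(alpha)

def model(x, t):
    # Stand-in for the trained noise-prediction UNet,
    # which would be conditioned on the timestep t.
    return np.zeros_like(x)

x = rng.standard_normal((1, 3, 8, 8))  # x_T: pure noise, like torch.randn
for t in range(T - 1, -1, -1):
    predicted_noise = model(x, t)
    noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
    # Standard DDPM update: strip the predicted noise, re-add scaled fresh noise
    x = (x - beta[t] / np.sqrt(1 - alpha_bar[t]) * predicted_noise) \
        / np.sqrt(alpha[t]) + np.sqrt(beta[t]) * noise

print(x.shape)  # (1, 3, 8, 8): a generated sample
```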
@matthewprestifilippo7673 · 2 months ago
Great content
@jeffg4686 · 2 months ago
Nice !
@user-rl9px9jg5e · 2 months ago
Thank you so much!!! I was finally able to understand Attention with your video!!!
@fouriertransformationsucks438 · 2 months ago
Incredible explanation. My Bayesian analysis course would be so much better if it were explained this way. *Little suggestion: use \left( blabla \right) for brackets so that they fit vertically.
@nikhilsaharan · 2 months ago
Thank you so much for this
@paktv858 · 2 months ago
Why do you use self-attention in the UNet architecture instead of other kinds of attention? How does it work here?
@jayhu2296 · 2 months ago
Thanks for the crystal clear explanation!
@autkarsh8830 · 2 months ago
Thanks, the video was really helpful, it gave me such a great time in understanding diffusion models, kudos and keep on making such quality content!
@sureshmaliofficial · 2 months ago
This is really great. It makes such complex stuff so easy to grasp.
@Soso65929 · 2 months ago
So the process of adding noise and removing it happens in a loop?
@harshaldharpure9921 · 2 months ago
I have two features, x → text feature and y → image feature, plus rag → rag feature (an extra feature). I want to apply cross-attention between rag and (x/y). How should I apply it?
@AdmMusicc · 2 months ago
This was the best ML paper review I have ever seen. You stopped making videos, but I would really love to see you go through more research in the field, man! Hats off to you.
@uslessfella · 2 months ago
19:57 How can it be a KL divergence? Doesn't there have to be a term outside of the log for it to be a KL divergence?? Can you explain this?