DDPM - Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained)

Рет қаралды 156,714

Күн бұрын

Пікірлер: 127

@YannicKilcher 3 жыл бұрын

OUTLINE: 0:00 - Intro & Overview 4:10 - Denoising Diffusion Probabilistic Models 11:30 - Formal derivation of the training loss 23:00 - Training in practice 27:55 - Learning the covariance 31:25 - Improving the noise schedule 33:35 - Reducing the loss gradient noise 40:35 - Classifier guidance 52:50 - Experimental Results

@TechyBen 3 жыл бұрын

Will you cover Nvidias or Intels "AI photorealism" examples for game images to photorealism? IIRC a new paper was just released on it. Still early work, but is having better progress, as it no longer fails the temporal or hallucination (artifacts/errors) problems.

@CosmiaNebula 3 жыл бұрын

Summary: self-supervised learning. Given dataset of good images, keep adding Gaussian noise to it to create sequences of increasingly noisy images. Let the network learn to denoise images based on that. Then the network can "denoise" completely Gaussian random pictures into real pictures. To do: learn some latent space (like VAEGAN does) so that it can smoothly interpolate between generated pictures and create nightmare arts.

@ahmedalshenoudy1766 3 жыл бұрын

Thanks a lot for the thorough explanation! It's helping me figure out a topic for my master's degree. Much much appreciated ^^

@SamanthaTries 3 жыл бұрын

My boyfriend wrote these papers. Go Alex Nichol!

@taylan5376 3 жыл бұрын

And i already felt sorry for your bf

@LatinDanceVideos 3 жыл бұрын

You’ll have to compete for his attention with all the coding fanbois. Either way, lucky girl. Hold onto that guy.

@luisfable 3 жыл бұрын

With every great person, there is a great partner

@TheRyulord 2 жыл бұрын

@@LatinDanceVideos KZbin says her name is Samantha Nichol now so I guess she took your advice.

@cedricvillani8502 2 жыл бұрын

Lose 10 pounds by cutting your head off??? 😂😂

@pedrogorilla483 Жыл бұрын

Historic video! Fun to see it now and compare it to the current state of image generation. I’ll check it again in two years to see how far we’ve got.

@ShresthShukla-h9n Жыл бұрын

lol :)

@scottmiller2591 3 жыл бұрын

That notation \mathcal{N}(x_t;sqrt{1-\beta_t}x_{t-1},\beta_t \mathbf{I}) sets my teeth on edge. Doing this with P, a general PDF, is fine, but I would always write x_t ~ \mathcal{N}(sqrt{1-\beta_t}x_{t-1},\beta_t \mathbf{I}), since \mathcal{N} is the Gaussian _distribution_ with a defined parameterization. BTW, the reason for sqrt{1-\beta_t}x_{t-1} is to keep the energy of x_{t-1} approximately the same as the energy for x_t; otherwise, the image would explode to a variance of T*\beta after T iterations. It's probably a good idea to keep the neural network inputs to about the same range every time.

@cedricvillani8502 2 жыл бұрын

Don’t forget to edit your text next time you paste it in😮

@andrewcarr3703 3 жыл бұрын

Love it!! It's called the "number line" in english. Keep up the great work

@impromptu3155 2 жыл бұрын

Just Amazing. I guess I might read this paper for another whole day if I missed your video. Grateful!

@sshatabda 3 жыл бұрын

Great video! I was surprised to see this after the latest paper just a fews days back! Thanks for the great explanations!

@binjianxin7830 2 жыл бұрын

18:46 I guess it’s very likely to be related to Shannon’s Sampling theorem, reconstructing the data distribution by sampling with the well defined normal distribution. The number of time steps and Beta closely related to the band width of the data distribution.

@mariafernandadavila8332 Жыл бұрын

Amazing explanation. Saved me a lot of time!! Thank you!

@videowatching9576 2 жыл бұрын

Fascinating, incredible video! Really appreciate the walkthrough! Such as the cosine vs linear approach to make sure each step in diffusion is useful - very interesting!

@linminhtoo 3 жыл бұрын

yannic, thanks for the video. the audio is a little soft even at max volume (unless I'm wearing my headphones). is it possible to make it a bit louder?

@YannicKilcher 3 жыл бұрын

Thanks a lot! Can't change this one, but I'll pay attention in the future

@abdalazizrashid 3 жыл бұрын

Yup, correct most of your videos has a quite low volume

@JurekOK 2 жыл бұрын

maybe this is just correct -- it's a regular hifi audiophile loudness level. In here, there is no need for hyper compression filters like in commercials and cheap music videos.

@ShawnFumo 2 жыл бұрын

@@JurekOK Maybe, but in practice using my laptop speakers with Windows and KZbin volumes maxed out, it is still pretty low volume. I had to put subtitles on to make sure I didn't miss things here and there, and this was in a fairly quiet room.

@MrBOB-hj8jq 3 жыл бұрын

Can you please make a video about SNN's and latest research on SNN's?

@galinilinn 2 жыл бұрын

Your videos are amazing Yannic, keep it up. Much love

@bg2junge 3 жыл бұрын

Any results(images) from generative models should be accompanied by the nearest neighbor(vgg latent, etc) from the training dataset. I am going to train it on mnist🏋

@alexnichol3138 3 жыл бұрын

There are nearest neighbors in the beginning of the appendix!

@bg2junge 3 жыл бұрын

@@alexnichol3138 i retract my statement.

@48956l 3 жыл бұрын

@@bg2junge I demand seppuku

@kxdy8yg8 3 жыл бұрын

Great materials ! Honestly, I really enjoy your content !! Keep it up 👏👏

@bertobertoberto242 Жыл бұрын

I would say that the sqrt(1-B) is used to converge to a N(0,sigma), mainly in it's "mu", othersize adding gaussian noise would just (in expectation) have X0 as mu, instead of 0

@JamesAwokeKnowing 3 жыл бұрын

This makes me think that instead of super res from lower res image it could be even more effective to store a sparse pixel array (with high res positioning). You could even have another net 'learn' a way of choosing eg which 1000 pivels of a high res image to store (pixels providing most information for reconstruction).

@vidret 3 жыл бұрын

yes... yeeeeeesssssssssssss

@Champignon1000 3 жыл бұрын

wow thats a really great idea actually!

@proinn2593 3 жыл бұрын

There is this step wise generation in GAN's, not based on steps from noise to image, but based on the size of the image, like in Pro-GAN and MSG-GAN. In these models you have discriminators for different sizes of the image, kind of.

@gustavboye6691 3 жыл бұрын

yes that should be the same right?

@cedricvillani8502 2 жыл бұрын

Are you saying it’s not the size of your GAN that matters, but how You use it? 😂

@JTMoustache 3 жыл бұрын

44:14: p(a|b,c) = p(a,c|b) / p(c|b) = p(a|b) * p(c|b,a) / p(c|b) = Z * p(a|b) * p(c|a,b) and if c independant of b given a = Z * p(a|b) * p(c|a) But Z = p(c|b) So given that c independant of b given a, p(a|b,c) = p(a|b) * p(c|a) / p(c|b) Here a = xt, b = xt+1, c=y, Z= 1 / p(y|xt+1) .. Then they probably consider y independent of xt+1 given xt. Problem is, if they consider y indep of xt+1 given xt, they should probably consider y indepedent of xt given xt+1 which would basically say p(xt|xt+1,y) = p(xt|xt+1). But I guess it is the whole point to say that actually no, xt contains more information about y than xt+1 so it y is not independant of xt given a more noisy version of xt (xt+1).

@nahakuma 3 жыл бұрын

I think it is more natural to do your derivation with a=x_t, b=y, c=x_{t+1}. In this way, a fitting probabilistic graph model would be y -> x_{t} - > x_{t+1}. So, the class label y clearly determines the distribution of your image at any step, but given the current image x_{t} you already have a well defined noise process that tell you how x_{t+1} will be obtained from x_{t} and the label then becomes irrelevant.

@TechyBen 3 жыл бұрын

Detecting signal inside the noise. Wow. It's like a super cheat for cheat sheets. And it works! :D

@Kerrosene 3 жыл бұрын

Reminds of normalising flows..the direction of the flow leads to a normal form through multiple invertible transformations...

@PlancksOcean 3 жыл бұрын

It looks like it but the transformation (adding some noise) is stochastic and non invertible

@nisargshah467 3 жыл бұрын

I was waiting for this.. so not read the paper.. thanks yannic

@princeofexcess 3 жыл бұрын

Great video. Could you possibly up the volume level for the next video. I notice this video is much quieter than other videos I watch.

@chaerinkong5303 3 жыл бұрын

Thanks a lot for this awesome video. I really needed it

@stephanebeauregard4083 3 жыл бұрын

I´ve only listened to 11 minutes so far but DDPMs remind me a lot of Compressed (or Compressive) Sensing ...

@thirtysixnanoseconds1086 3 жыл бұрын

same, the Steve brunton videos :D

@easyBob100 2 жыл бұрын

Another question. If the network is predicting the noise added to a noisy image, what do you then do with that prediction? Subtract it from the noisy image? Do you then run it back through the network to again, predict noise? When you train this network, do you train it to only predict the small amount of noise added to the image between the forward process steps? Or does it try to predict all the noise added to the image from that point? Or maybe it's more like the forward process? Starting with latent x_T as input to the network, the network gives you an 'image' that it thinks is on the manifold (x_T-1). At this point, it most likely isn't, but, you can move 1/T towards it like we did moving towards the Gaussian noise to get to x_T. Then, repeat....? More examples and less math always helps...

@furrry6056 2 жыл бұрын

Yes, it's a step by step approach. Thus, when 'destroying' the image, the image at Ti = image at Ti-1 + noise step. You just keep adding / stacking noise, adding a bit more noise (to the previous noise) at each new step. It isn't really 'constant' though. The variance / amount of noise added, depends on the time step and the schedule. A Linear schedule would be constant (adding same amount of noise at each Ti), but if you look at the images (de)generated doing so, you get a quite long tail of images that contain nearly only noise. Therefore a cosine schedule is used, meaning the variance differs per Ti, and also ending up with more information left in the images at the latter time steps. The timestep is actually encoded into the model. Thus, the parameters that are learned to predict the noise 'shift' depending on T. (At least.. In my understanding / words. I'm just a dumb linguist - I don't know any maths either 😅.) Perhaps a better way to explain it, is to imagine that at small Ti, the model can depend on all kinds of visual features (edges, corners, etc.) learned to predict noise. At large T, those features / params get less informative, thus you rely on other features to estimate where the noise is. (Thus its probably not the features that shift depended on T, but their weights.) When generating a new image, you start at Tmax. Thus, pure random noise only. The model first reconstructs to Tmax-1. Removing a little noise.. Then, taking this image, you again remove a bit more noise, etc. It's an iterative process.

@luke.perkin.online 3 жыл бұрын

I wonder if multiscale noise would work better. It'd fit more with convolutions. Instead of 0% to 100% noise, it could disturb from pixels to the whole image.

@arnabdey7019 9 ай бұрын

Please make explanation videos on Yang Song's papers too

@austin99299 Жыл бұрын

16:55 denoising depends on the entire data distribution sizes because adding random noise in one step can be done independent of all previous steps; just add a bit of noise wherever you like. But removing noise (the reverse) has to assume there was noise added in some number of previous steps. Thus, in the example of denoising a small child's drawing, it's not that we're removing ALL the noise. Instead, The dependence problem arises in simply taking a single step towards a denoised picture. Can anyone clarify/confirm?

@G12GilbertProduction 3 жыл бұрын

Diffusing noise with a foward sampling is really more entropian in context accumulation of sharing data by the transformer, but visual autoencoders is thinny for this Gaussian / or / Bayes-Gauss mixture, without a one transformer for a layer. EDIT: I thought is only the prescriptive sense of this upper statement, not evenmore.

@romagluskin5133 Жыл бұрын

50:24 "Distribution shmistribution" 🤩

@CristianGarcia 3 жыл бұрын

This is me being lazy and not looking it up, but if they predict the noise instead of the image, to actually get the image they subtract the predicted noise from the noisy image iteratively until they get a clean image?

@YannicKilcher 3 жыл бұрын

Yes, pretty much, except doing this in a probabilistic way where you try to keep track of the distribution of the less and less noisy images.

@herp_derpingson 3 жыл бұрын

The audio is a bit quiet in this video. . 0:00 I didnt realize any of these were generated. Totally fooled my brain's discriminator. . 29:00 How can the noise be lesser than the accumulated noise upto that point? Are we taking into account that some noise added later might undo the previously added noise? . 50:00 I am not sure how to take the learnings to GANs from diffusion models. The only thing I can think of is pre-training the discriminator with real image and noised real image, but that sounds so obvious I am sure 100s of papers have already done that. . All in all I would love to see more papers which make the neural networks output weird things like probability distributions instead of simple images or word tokens.

@johongo 3 жыл бұрын

This paper is really well-written.

@brandomiranda6703 3 жыл бұрын

What is the main take away?

@YannicKilcher 3 жыл бұрын

make data into noise, learn to revert that process

@herp_derpingson 3 жыл бұрын

Train a denoiser but dont add or remove all the noise in one step.

@zephyrsails5871 3 жыл бұрын

Thank you Yannic for the video. QQ: why would we adding Gaussian noise for image requires multivariate Gaussian instead of just 1d Gaussian? Is the extra dimension used for different color channel?

@PlancksOcean 3 жыл бұрын

1dimension per pixel 🙂

@jakubsvehla9698 2 жыл бұрын

awesome video, thanks!

@JTchen-sq6gs 3 жыл бұрын

It seems you used a tool to concatenate two paper PDFs togather? It is cool, would you mind telling me which tool?

@erniechu3254 3 жыл бұрын

If you're on Mac, there's a native script for that. /System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py. Or you can just use Preview.app lol

@mohamedrashad7845 3 жыл бұрын

What software and hardware you use to make this video (drawing tables, adobe reader, others) ?

@easyBob100 2 жыл бұрын

Can someone explain the noising process with some pseudocode? Is the noise constantly added(based on t) or blended (based on percent of T)? And of course, does it make a difference and why? EDIT: Nevermind. I always figure it out after asking. :) (I generate some noise, and either blend or lerp towards it, as they are the same)

@samernoureddine 3 жыл бұрын

Lightning

@nahakuma 3 жыл бұрын

By the way, these DDPM models seem very related (a practical simplification?) to the Neural Autoregressive Flows, where each layer is invertible and each layer performs a small distrbution perturbation which vanishes with enough layers

@gooblepls3985 3 жыл бұрын

True! I think the important difference (implementational simplification) is that you have no a-priori restrictions on the DNN architecture here, i.e., the layers do not need to be invertible, and the idea is almost agnostic to what exact DNN architecture you use

@mikegro3138 2 жыл бұрын

Hi, i watched the video, but this is not a topic i am familiar with. Could anyone pleas describe in a few sentences how this works. Especially how disco diffusion works. Where does it gets the graphical Elements for the images, how does it connect keywords from the prompt with the artists, the style etc. It seems i can use every Keyword i want, but if there is a database, it should be limited. Is it trained somehow to learn what the different styles look like? What if i pick an uncommon keyword? So much questions to understand this incredible Software. Thanks

@natanielruiz818 2 жыл бұрын

Amazing video.

@soumyanasipuri 2 жыл бұрын

Can anyone tell me what do we mean by x0 ~ q(x0)? In terms of pictures, what is x0 and what is the data distribution? Thank you.

@CristianGarcia 3 жыл бұрын

Can you use this technique to erase adversarial attacks?

@herp_derpingson 3 жыл бұрын

Thats an interesting idea. Although I think we will have to train the network specifically on adversarial noise. Might not though. Not sure, but good idea regardless.

@jg9193 3 жыл бұрын

You'd have to be careful, because this technique relies on neural networks that can potentially be attacked

@daniilchesakov6010 3 жыл бұрын

Hi! Amazing video, thank you a lot! But I'm a bit confused with one detail and have a stupid question. As soon as we train our model to predict epsilon, using x_t and t, and also we have a formula x_t = \sqrt{\bar{\alpha}_t } * x_0 + \sqrt{1 - \bar{\alpha}_t }*eps We can get that x_0 = (x_t - \sqrt{1 - \bar{\alpha}_t }*eps) / \sqrt{\bar{\alpha}_t } And here we know alphas coz they are constants, also we know x_t (just some noise) and we know eps as it is the output of our model -- why can't we calculate the answer in just one step? Would be very grateful for answer!

@idenemmy 2 жыл бұрын

I have the same question. My hypothesis is that such x0 would be very bad. Have you found the answer to this question?

@nahakuma 3 жыл бұрын

I wonder why they state that the undefined norm ||.|| of the covariance tends to 0. Doesn't it tend to whatever is the norm of a uniform covariance matrix?

@herp_derpingson 3 жыл бұрын

Isnt the norm of uniform cov matrix, with mean=0, std=1, zero?

@nahakuma 3 жыл бұрын

@@herp_derpingson As far as I know, the norm of a matrix A is typically defined as the maximum norm of the vector x^TAx, with x^Tx = 1. In the case of a normal distribution you would have x^TAx=1 for any x and so the norm of the covariance would be 1. Am I wrong?

@herp_derpingson 3 жыл бұрын

@@nahakuma Nah, I am a bit out of touch with math. You are probably right.

@이상윤-n7d 3 жыл бұрын

22:31 Can someone explain how eq 12 acquired?

@hudewei7166 3 жыл бұрын

It is approximated by the product of two Gaussian distribution q(x_t|x_{t-1}) and q(x_{t-1}|x_0). If the chain rule is applied on eq.(12), then you can get q(x_{t-1}|x_t,x_0)=q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)/q(x_t|x_0). They also approximate q(x_t|x_{t-1},x_0)=q(x_t|x_{t-1}). Then if the normalization term is ignored, you will get the expression of (10) and (11).

@Farhad6th 3 жыл бұрын

The voice has a problem, it is very low. Please in the next videos fix that. Great video. Thank you.

@cedricvillani8502 2 жыл бұрын

Turn volume up?

@GuanlinLi-l8j 2 жыл бұрын

how about explain the code of this paper

@johnsnow9925 3 жыл бұрын

It wouldn't be OpenAI if they actually released their pretrained models

@PaulanerStudios 3 жыл бұрын

ClosedAI

@herp_derpingson 3 жыл бұрын

@@PaulanerStudios BURN

@ShawnFumo 2 жыл бұрын

Well it's a bit of a moot point now that Stable Diffusion has released theirs. Maybe it isn't matching DALL-E 2 in all areas yet, but is coming pretty close, especially the 1.5 model (already on DreamStudio, though not available for download quite yet).

@lllcinematography 2 жыл бұрын

your audio recording volume is too low. i have to increase my volume like 4x compared to other videos. thanks for the content.

@piotr780 Жыл бұрын

but this random image at the end does not contain any information !

@НикитаДробышев-ж4т 3 жыл бұрын

Hi! Please do something with your mic, because the video is so silent

@ProfessionalTycoons 3 жыл бұрын

super dope

@shynie4986 2 жыл бұрын

what' the purpose of the covariance matrix? or covariance and why is it important to us?

@PeterIsza 3 жыл бұрын

Video starts at 4:28.

@herp_derpingson 3 жыл бұрын

Video ends at 54:33

@bertchristiaens6355 3 жыл бұрын

If you add noise from a standard normal distribution thousands of times, isn't the average noise (expected value) added close to zero, resulting in the same image?

@samernoureddine 3 жыл бұрын

Even if they were using standard Gaussians (they aren't), the sum of just two standard Gaussians X and Y is not a standard Gaussian (the variances add up)

@trevoryap7558 3 жыл бұрын

But the variance will increase so significantly that it will be just noise. (Assuming that the noise are all independent)

@bertchristiaens6355 3 жыл бұрын

@@samernoureddine Thank you! I assumed that it was equivalent of sampling 1000 times (for example) from the same distribution N(0,var). Since these samples approximate the distribution of N(0,var) the mean of these values were 0 I thought. But I should rather see it as a sample from N(0, var+var+..+var), right? (since we add up the samples)

@samernoureddine 3 жыл бұрын

@@bertchristiaens6355 that would be right if they just wanted the noise distribution at some time t (and if the mean were zero: it isn't). But they want the noise distribution to evolve with time, and so the total noise at time t+1 is not independent from the total noise at time t

@cerebralm 3 жыл бұрын

It's like a random walk, the more random choices you make, the further you get from where you started (but unpredictably so)

@awesomealex001 2 жыл бұрын

And now these models are used in DALL·E 2

@donfeto7636 2 жыл бұрын

paper after updates become so complex to read with math

@XX-vu5jo 3 жыл бұрын

The problem with these solutions are their computing cost. I think they should focus more on that instead and they rely too much on data.

@peterthegreat7125 Жыл бұрын

Still confused about the math theory

@akashchadha1 3 жыл бұрын

Schmiduber enters chat.

@amansinghal5908 Жыл бұрын

why even do it when you'd do it in such a hand wavy manner?

@fast_harmonic_psychedelic 3 жыл бұрын

why dont they just use clip as a classifier. does nobody know about this? lol

@twobob 2 жыл бұрын

Not too long

@cedricvillani8502 2 жыл бұрын

The better you are detecting bullshit, the better you are at creating bullshit😂 none of my work would ever be public facing until I was sure I could always identify it and manipulate it and I’m sure that’s true for any company or skilled researcher. ❤😢

@XX-vu5jo 3 жыл бұрын

You can’t explain the equations. 🙄

@DanFrederiksen 3 жыл бұрын

This seemed much too long. For instance you don't need to labor the notion of denoising for minutes. Noise reduction should be in people's vocabulary at this level. I'd suggest going directly to what diffusion models are and try to prepare succinct explanations instead of just going for an hour.

@nahakuma 3 жыл бұрын

Or you could simply skip the parts you already understand ;)

@frankd1156 3 жыл бұрын

This free knowledge...as try to criticize nicely or move on to another resource

@banknote501 3 жыл бұрын

Ok, if it is so easy, just do a video yourself. We need videos about AI topics for viewers of all skill levels.

@mgostIH 3 жыл бұрын

This seems like fair criticism, I don't see why they are being hostile with you

@DanFrederiksen 3 жыл бұрын

@@mgostIH I understand that some feel defensive, but it wasn't meant as an attack but empowering observation. communication is vastly more potent the more concise and clear it is.