8:39 What I meant to say is VQ-VAE, not VQGAN! Thanks to Luis Cunha for spotting this!
@j2325138 1 year ago
Thanks!
@AICoffeeBreak 1 year ago
Omg, thanks! 😇
@RemiStardust 1 year ago
I'm going to have to watch this twice. So much dense info!
@BjoernBu 2 years ago
Awesome explanation! Short & to the point, with super useful visualizations throughout all parts. Thank you very much for the video.
@CK-vc3kk 2 years ago
I have to comment that your Erratum is amazing. Thank you. It clearly shows the greatest challenge in ML research: the *smallest*, most innocent-seeming detail can in fact be the *entire reason a model doesn't work* (or at least, not work anywhere near how it should).
@chyldstudios 2 years ago
Perfect timing, I was just reading about latent diffusion models!
@AICoffeeBreak 2 years ago
Yeasss! ⏲ 😅
@Micetticat 2 years ago
Latent space is where all the magic happens!
@AICoffeeBreak 2 years ago
🪄
@DerPylz 2 years ago
Welcome back from your deserved holiday break! 😊
@AICoffeeBreak 2 years ago
Thank you! 😁
@CristianYones 2 years ago
Best explanation so far! And I've seen a lot of videos...
@TenzinT 2 years ago
Great, best explanation of Diffusion Models so far!
@obseraft 2 years ago
Your accent is great, thanks for your explanation.
@ManuXit 2 years ago
super happy I found your awesome channel
@AICoffeeBreak 2 years ago
Glad you found us!
@niofer7247 2 years ago
Always love the clarity of your explanations. Keep it up!!!
@ShivamMamgain 2 years ago
Great explanations. Ty
@paulcurry8383 2 years ago
I’m curious how inpainting works with these models. Does the removed part of the image get filled with noise and then copied to the output with only the noise prediction from the removed part used to de-noise the removed part?
@blackrack2008 2 years ago
There are options to keep the removed part of the image and just reinterpret it with a given strength (which works like img2img), or to just replace it with noise.
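To make the second option concrete, here is a minimal numpy sketch of the masked blending step commonly used in diffusion inpainting. This is a toy illustration, not any particular implementation: `inpaint_blend` is a hypothetical helper name, and in a real sampler the "current latents" would come from the U-Net's denoising loop.

```python
import numpy as np

def inpaint_blend(latents, original_latents, mask, noise, alpha_bar_t):
    """One inpainting step: pin the unmasked region to the (re-noised)
    original image and let the sampler's output fill the hole.
    mask == 1 marks the region to regenerate."""
    # Re-noise the original latents to the current timestep, so the kept
    # region matches the noise level of the region being generated.
    noised_original = (np.sqrt(alpha_bar_t) * original_latents
                       + np.sqrt(1.0 - alpha_bar_t) * noise)
    # Masked region comes from the sampler; the rest from the original.
    return mask * latents + (1.0 - mask) * noised_original

rng = np.random.default_rng(0)
orig = rng.standard_normal((4, 8, 8))   # toy latent tensor of the input image
cur = rng.standard_normal((4, 8, 8))    # toy current sampler state
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                 # square region to be regenerated
out = inpaint_blend(cur, orig, mask, rng.standard_normal((4, 8, 8)), 0.5)
```

The img2img-style alternative the comment mentions would instead start the whole image from a partially noised version of the original, with the "strength" controlling how much noise is added.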
@mpjstuff 2 years ago
I still feel like we are missing a bit of understanding. In your last image that you end by using img2img (though this applies to the entire recognition process): the baby ice cream robot has a balloon floating above its head in the drawing, with a loose string. In the output image, it's echoing the daddy balloon and pulled by the wind. Now, how does the natural language model know that blob is a balloon, and that since it floats, both would be pulled by the same wind -- is that in a prompt? Even so, how does the language engine get the concept of "pulled by the wind" when a Google search would present results that would create a huge mess? Or is that a writing prompt as well? Even still, how does that incorporate into the image conceptually? If I'm extrapolating from the artist's drawings, I suppose there is some idea of a bipedal and symmetrical structure, and "beauty" is perhaps going to result in a certain relationship in size and structure once a pattern emerges, but I don't see how random images in a collection coalesce into being "recognized" from the noise without a lot of mutation. Is it making inferences at each step, jumping forward, and then culling the branches that lead to "nonsensical or monstrous" arrangements via a learning engine? Maybe this will be explained if I spend another few hours on videos ;-) -- anyway, thank you Letitia for your wonderful work.
@felipe_ai 2 years ago
I love your videos, can't believe I didn't find you sooner lol, thank you for all the content :)
@AICoffeeBreak 2 years ago
Glad we reached you. :)
@satpalsinghrathore2665 2 years ago
Easily explained! Kudos.
@deeplearner2634 2 years ago
Man, I was also misunderstanding the concept. It sounds just insane to even think of trying to predict the *entire* noise from an intermediate product.
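A short numpy sketch of the standard DDPM forward process shows why predicting the *entire* noise is even a well-posed target: x_t is a closed-form mix of the clean image and a single Gaussian sample, so the noise that produced any intermediate x_t is exactly defined (toy tensors; the real model of course only *estimates* eps).

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))     # toy "clean" image
eps = rng.standard_normal((8, 8))    # the noise the network must predict

alpha_bar_t = 0.3                    # cumulative noise schedule at step t
# Closed-form forward process: x_t mixes x0 with ONE Gaussian sample.
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# The training target is eps, not the denoised image itself. Given a
# perfect eps prediction, x0 can be recovered exactly:
x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
```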
@WhatsAI 2 years ago
Amazing video and explanation as always!
@AICoffeeBreak 2 years ago
Can only recommend your video on this! :)
@user__214 1 year ago
Thank you for this! Best simple explanation of stable diffusion that I've come across.
@alphacat4927 2 years ago
I was literally mentally thinking of sitting in a hammock when you said that.
@AICoffeeBreak 2 years ago
🏖️🏕️
@kajika135bis 2 years ago
Amazing explanations, this model looks very accessible. I hope it is not filled with undocumented tricks like I experienced with Soft Actor-Critic. Thank you again, this video is the final push for me to go and run this incredible network.
@maryguty1705 11 months ago
Just fabulously done! Wonderful!
@AICoffeeBreak 11 months ago
@Neptutron 2 years ago
Yay, you're back! =D Also this is perfect timing for me lol - I was planning on trying out a project with this model
@AICoffeeBreak 2 years ago
Is this project idea anything you want to share? :) No worries if you would rather not talk about it, especially if you plan to publish a paper about it and do not want to be scooped. 😅
@Neptutron 2 years ago
@@AICoffeeBreak Hey:) So I'm a PhD student, and I was thinking about making a paper of it if it works, so I won't share the entire idea here because you have a good point. But the gist is that I want to make 3d models from Stable Diffusion's outputs by combining it with a differentiable rendering backend
@AICoffeeBreak 2 years ago
@@Neptutron 🤞hope that you succeed!
@Neptutron 2 years ago
@@AICoffeeBreak Thank you Letitia!!
@Neptutron 2 years ago
Hi Letitia! I sent you an email with a small update:)
@coderaven1107 2 years ago
LOVE your content!!
@L33TNINJA51 2 years ago
So are any transformers used in Stable Diffusion?
@L33TNINJA51 2 years ago
Oh it uses transformers (cross attention) at every step. The paper cleared it up for me.
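For anyone wondering what that cross-attention looks like mechanically, here is a toy single-head numpy sketch: queries come from the image latents, keys and values from the text-encoder tokens. The projection matrices are random stand-ins for learned weights, and the dimensions are illustrative only.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens, d_k):
    """Toy single-head cross-attention: image features attend to text."""
    rng = np.random.default_rng(0)
    # Learned projections (random here, just to show the wiring).
    Wq = rng.standard_normal((image_tokens.shape[-1], d_k))
    Wk = rng.standard_normal((text_tokens.shape[-1], d_k))
    Wv = rng.standard_normal((text_tokens.shape[-1], d_k))
    Q, K, V = image_tokens @ Wq, text_tokens @ Wk, text_tokens @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the text tokens for each image position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V        # text-informed image features

img = np.random.default_rng(1).standard_normal((16, 32))  # 16 latent "pixels"
txt = np.random.default_rng(2).standard_normal((5, 64))   # 5 text tokens
out = cross_attention(img, txt, d_k=32)
```

In the actual U-Net this block is inserted at several resolutions, which is how the text prompt steers every denoising step.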
@harumambaru 2 years ago
If you are really using the sponsor to transcribe your videos, it is doing an amazing job compared to the standard YouTube autogenerated text. Punctuation signs are so life-changing! I saw an exclamation mark in subtitles for the first time in 15 years of using YouTube!!!
@AICoffeeBreak 2 years ago
The baked-in subtitles in the sponsor spot are transcribed. The rest is not. I copy-paste my script into YouTube for the captions in general.
@HalkerVeil 2 years ago
They used to say the camera would ruin art too.
@gunnar_langemark 2 years ago
I really enjoyed your video. Keep up the good work, please. :)
@Harduex 1 year ago
Very good and easy to understand explanation! Thank you for your content and keep up with the good work 🙌
@XuhanQian 2 years ago
love ur work! great explanation!
@HemangJoshi 2 years ago
Miss coffee bean is so sweet... ❤️❤️❤️❤️
@vikaspoddar001 2 years ago
Everything is happening in great sync 🙂 After watching Andrej Karpathy's stable diffusion video, now getting an explanation by Miss Coffee Bean herself 🎉🎉🎉
@AICoffeeBreak 2 years ago
Haha, I feel like an imposter just being in the same sentence with Karpathy. 😅
@maryguty1705 11 months ago
Have you ever done videos on all the weird and interesting tricks applied in DL and AI to achieve quite intriguing results?
@AkairoAoihonoSama 9 months ago
Great video! I just found your channel and it's fantastic ❤ Could you please explain why AI images often have a strange wavy pattern that doesn't represent anything and looks like shapes melting into other shapes? Is it correlated to the noise added?
@AICoffeeBreak 6 months ago
Hi, wonderful question! It can have something to do with noisy training data, but mostly it is because interpolating within the latent space of generative models can produce smooth transitions that appear as melting shapes. In the latent space, you cram many dimensions into a few, but there are still places in the latent space that were unsampled during training. Small changes in the latent vector inherently lead to continuous changes in the output image, but sometimes these are strange, nonsensical melting shapes.
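The interpolation effect described in that answer can be sketched directly: any point between two latents is a perfectly valid vector that a decoder will map to pixels, even though nothing guarantees the training data ever covered that region. A toy numpy sketch (the "cat"/"dog" latents are hypothetical placeholders, and the decoder is omitted):

```python
import numpy as np

def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors, t in [0, 1]."""
    return (1 - t) * z_a + t * z_b

rng = np.random.default_rng(0)
z_cat = rng.standard_normal(64)   # toy latent of a "cat" image
z_dog = rng.standard_normal(64)   # toy latent of a "dog" image

# Every midpoint decodes to *something*, but the in-between points may lie
# in regions the training data never sampled -> "melting" intermediates.
path = [lerp(z_cat, z_dog, t) for t in np.linspace(0, 1, 5)]
```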
@Handelsbilanzdefizit 2 years ago
Are skip-connections and residual-connections the same?
@johnclapperton8211 2 years ago
What constitutes a valid starting image? Is it arbitrary? If I ask for "a frog on a carrot" is the starting image as likely to be a spaceship or hamburger as either a frog or carrot? What do we mean by an "image"? It's an arrangement of pixels, but by what criteria do humans call it an image - just some spatial gradients which human brains have learned to recognise as meaningful in some way?
@pvlr1788 2 years ago
So the sentence encoder and image encoder-decoder are pre-trained and non-trainable during the training process of the LDM?
@Phenix66 2 years ago
Yup - otherwise, things would be super complex again if you'd take the full image. That's what makes it elegant. And training the text encoder along with the diffusion model is also not a great idea (Imagen from Google basically just leaves that out, and things get better).
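That division of labour (frozen pretrained autoencoder and text encoder; only the latent denoiser gets trained) can be summarised in a toy training-step skeleton. Everything here is a stand-in: `encode_image` and `encode_text` represent the frozen VAE/text encoder, and the zero prediction is a placeholder for the U-Net call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, pretrained components (toy stand-ins; in the real LDM these are
# an image autoencoder and a text encoder).
encode_image = lambda img: img.reshape(-1)[:64]           # toy "VAE encoder"
encode_text = lambda tokens: rng.standard_normal((len(tokens), 32))

def training_step(image, prompt_tokens, alpha_bar_t):
    """One sketched LDM training step: only the denoiser would get gradients."""
    z0 = encode_image(image)              # frozen image encoder
    cond = encode_text(prompt_tokens)     # frozen text encoder (conditioning)
    eps = rng.standard_normal(z0.shape)
    # Noise the *latent*, not the full-resolution image.
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    eps_hat = np.zeros_like(eps)          # placeholder for U-Net(z_t, t, cond)
    return np.mean((eps - eps_hat) ** 2)  # simple eps-prediction loss

loss = training_step(rng.standard_normal((8, 8, 3)), ["a", "cat"], 0.5)
```

Working in the small latent space instead of pixel space is exactly what keeps the diffusion part cheap.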
It's funny how DALL-E is made by OpenAI and isn't open source.
@robbiero368 2 years ago
Apparently they've invented this newfangled thing called a "camera", so all artists are now out of a job. I don't think you can really call yourself an artist if you can't imagine the creative explosion this sort of thing can promote.
@StuninRub 2 years ago
When the camera became popular, a lot of painters did indeed lose their jobs.
@robbiero368 2 years ago
@@StuninRub I would argue the benefit was that it eventually allowed the vast majority of the world to create images and not a handful of elites who could afford to pay for a painted portrait. Plus you can still have your portrait painted if you have the money. Plus you can send your photo to an artist and have them paint your portrait more cheaply. So there are probably even more portrait artists now in total than before.
@StuninRub 2 years ago
@@robbiero368 A lot fewer portrait artists. The camera put a lot of painters out of business, and AI art will do the same to digital artists.
@robbiero368 2 years ago
@@StuninRub can you supply some numbers?
@joedalton77 2 years ago
Is the time T an input to the network?
@AICoffeeBreak 2 years ago
The image at diffusion step T, yes. So with a certain amount of noise.
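In typical DDPM-style implementations, the scalar timestep itself is also passed to the network, usually as a sinusoidal embedding so the model knows how noisy its input currently is. A small numpy sketch of that standard transformer-style encoding (dimensions are illustrative):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep t, commonly
    used to tell the denoiser which noise level it is operating at."""
    half = dim // 2
    # Geometrically spaced frequencies, as in the transformer position encoding.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = timestep_embedding(t=500, dim=128)
```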
@TimScarfe 2 years ago
Amazeballs
@AICoffeeBreak 2 years ago
😅 Thanks, Tim!
@HalkerVeil 2 years ago
Plot twist. This video and the cute girl talking is AI generated...
@matejcigale8840 2 years ago
Does anybody understand the difference between "Disco Diffusion" and "Stable Diffusion"? I can't seem to find anything useful. Both seem to stem from the code of the same person, if I understand correctly. Is it just the models?
@maltimoto 1 year ago
So how can an AI image generator combine two objects, for example a man holding an apple in his hand? The AI must (!) have a 3D model in order to render the scene correctly. But how does the AI get the 3D model? This is not explained in any of the videos on YouTube.
@AICoffeeBreak 1 year ago
It does not have a 3D model in the sense we have it. It learned from hundreds of millions of 2D images how apple pixels and man pixels "rhyme" with each other, which also includes a sometimes accurate, sometimes inaccurate fit of an inferred 3D model. It's exactly how ChatGPT talks about physics and reality by just having read text, without ever having seen pictures or touched anything.
@anonymous-vf2pr 2 years ago
Why are we adding noise instead of directly taking a noisy image?
@REDINKmysteries 2 years ago
i like turtles!!!!!
@ellanleicher8657 2 years ago
So how does a diffusion network generate a new image? From what I understand, it tries to reconstruct the noisy image.
@Epsellis 2 years ago
No, artists aren't complaining because it's easy. Just go watch any respected art YouTuber.
@AICoffeeBreak 2 years ago
I'm interested. Could you point me to a specific video?
@maxicornejo9675 2 years ago
@@AICoffeeBreak This guy explains it very briefly: kzbin.info/www/bejne/gqKymX2OhKajfas Basically, it's about exploitation and AI ethics, something that every tech fetishist doesn't care about. Sadly.
@crapadopalese 2 years ago
Yeah you got it wrong again in explaining what noise is being predicted. But I appreciate you not getting flustered by not getting it right and continuing to speak with complete confidence about things you don't have a grasp on.
@AICoffeeBreak 2 years ago
Thanks for sharing your opinion! If you would tell us what is wrong, we could point it out (for example by pinning your comment).
@crapadopalese 2 years ago
@@AICoffeeBreak thanks for your suggestion but I think that it's not an iterative process. Every video you post where you get it wrong disseminates misinformation. Do thorough research and be confident that you understand what you're explaining!
@maxicornejo9675 2 years ago
@@crapadopalese what a helpful comment, thank you.
@crapadopalese 2 years ago
@@maxicornejo9675 I think criticizing people that provide misinformation is important.
@maxicornejo9675 2 years ago
@@crapadopalese Constructive criticism is more valuable than mere criticism. >Constructive criticism is a feedback method that offers specific, actionable recommendations.
@81neuron 2 years ago
great video! just a bit too much cynicism
@chengong388 2 years ago
I tried it, and it totally sucks. The results are often messy, with different features smeared together, very much like those earlier AI "dream" images. It's just nowhere near as natural as DALL-E, and I can't help but feel that an optimized algorithm shouldn't require so much VRAM for a 512x512 image.
@imapimplykindapimp 2 years ago
It's hard to optimise a neural network; it's not like a regular human-made algorithm.
@ДмитроПрищепа-д3я 2 years ago
Have you added a negative prompt? It's quite important to tell the network what not to do. As for the VRAM usage, yeah, it has to fit the entire model in memory to run inference on it, and the model is massive. There's some work being done to optimize it, however. For one, the model you were using is likely 2 times smaller than the original already due to pruning. Stability AI is currently working on distilling the model down further to speed things up even more.
@Schinken_ 2 years ago
First you make text-to-image and then do image-to-image.