8:39 What I meant to say is VQ-VAE, not VQGAN! Thanks to Luis Cunha for spotting this!
@j2325138 1 year ago
Thanks!
@AICoffeeBreak 1 year ago
Omg, thanks! 😇
@RemiStardust 1 year ago
I'm going to have to watch this twice. So much dense info!
@BjoernBu 2 years ago
Awesome explanation! Short & to the point, with super useful visualizations throughout all parts. Thank you very much for the video.
@CK-vc3kk 2 years ago
I have to comment that your Erratum is amazing. Thank you. It clearly shows the greatest challenge in ML research: the *smallest*, most innocent-seeming detail can in fact be the *entire reason a model doesn't work* (or at least, not work anywhere near how it should).
@chyldstudios 2 years ago
Perfect timing, I was just reading about latent diffusion models!
@AICoffeeBreak 2 years ago
Yeasss! ⏲ 😅
@Micetticat 2 years ago
Latent space is where all the magic happens!
@AICoffeeBreak 2 years ago
🪄
@DerPylz 2 years ago
Welcome back from your deserved holiday break! 😊
@AICoffeeBreak 2 years ago
Thank you! 😁
@CristianYones 2 years ago
Best explanation so far! And I've seen a lot of videos...
@TenzinT 2 years ago
Great, best explanation of Diffusion Models so far!
@obseraft 2 years ago
Your accent is great, thanks for your explanation.
@ManuXit 2 years ago
super happy I found your awesome channel
@AICoffeeBreak 2 years ago
Glad you found us!
@niofer7247 2 years ago
Always love the clarity of your explanations. Keep it up!!!
@ShivamMamgain 2 years ago
Great explanations. Ty
@paulcurry8383 2 years ago
I’m curious how inpainting works with these models. Does the removed part of the image get filled with noise and then copied to the output with only the noise prediction from the removed part used to de-noise the removed part?
@blackrack2008 2 years ago
There are options to keep the removed part of the image and just reinterpret it with a given strength (which works like img2img), or to just replace it with noise.
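To make the second option concrete, here is a minimal numpy sketch of the masked blending step commonly used in diffusion inpainting. This is a toy illustration, not any particular implementation: `inpaint_blend` is a hypothetical helper name, and in a real sampler the "current latents" would come from the U-Net's denoising loop.

```python
import numpy as np

def inpaint_blend(latents, original_latents, mask, noise, alpha_bar_t):
    """One inpainting step: pin the unmasked region to the (re-noised)
    original image and let the sampler's output fill the hole.
    mask == 1 marks the region to regenerate."""
    # Re-noise the original latents to the current timestep, so the kept
    # region matches the noise level of the region being generated.
    noised_original = (np.sqrt(alpha_bar_t) * original_latents
                       + np.sqrt(1.0 - alpha_bar_t) * noise)
    # Masked region comes from the sampler; the rest from the original.
    return mask * latents + (1.0 - mask) * noised_original

rng = np.random.default_rng(0)
orig = rng.standard_normal((4, 8, 8))   # toy latent tensor of the input image
cur = rng.standard_normal((4, 8, 8))    # toy current sampler state
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                 # square region to be regenerated
out = inpaint_blend(cur, orig, mask, rng.standard_normal((4, 8, 8)), 0.5)
```

The img2img-style alternative the comment mentions would instead start the whole image from a partially noised version of the original, with the "strength" controlling how much noise is added.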
@mpjstuff 2 years ago
I still feel like we are missing a bit of understanding. In your last image that you end by using img2img (though this applies to the entire recognition process): the baby ice cream robot has a balloon floating above its head in the drawing, with a loose string. In the output image, it's echoing the daddy balloon and pulled by the wind. Now, how does the natural language model know that blob is a balloon, and that since it floats, both would be pulled by the same wind -- is that in a prompt? Even so, how does the language engine get the concept of "pulled by the wind" when a Google search would present results that would create a huge mess? Or is that a writing prompt as well? Even still, how does that incorporate into the image conceptually? If I'm extrapolating from the artist's drawings, I suppose there is some idea of a bipedal and symmetrical structure, and "beauty" is perhaps going to result in a certain relationship in size and structure once a pattern emerges, but I don't see how random images in a collection coalesce into being "recognized" from the noise without a lot of mutation. Is it making inferences at each step, jumping forward, and then culling the branches that lead to "nonsensical or monstrous" arrangements via a learning engine? Maybe this will be explained if I spend another few hours on videos ;-) -- anyway, thank you Letitia for your wonderful work.
@felipe_ai 2 years ago
I love your videos, can't believe I didn't find you sooner lol, thank you for all the content :)
@AICoffeeBreak 2 years ago
Glad we reached you. :)
@satpalsinghrathore2665 2 years ago
Easily explained! Kudos.
@deeplearner2634 2 years ago
Man, I was also misunderstanding the concept. It sounds just insane to even think of trying to predict the *entire* noise from an intermediate product.
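A short numpy sketch of the standard DDPM forward process shows why predicting the *entire* noise is even a well-posed target: x_t is a closed-form mix of the clean image and a single Gaussian sample, so the noise that produced any intermediate x_t is exactly defined (toy tensors; the real model of course only *estimates* eps).

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))     # toy "clean" image
eps = rng.standard_normal((8, 8))    # the noise the network must predict

alpha_bar_t = 0.3                    # cumulative noise schedule at step t
# Closed-form forward process: x_t mixes x0 with ONE Gaussian sample.
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# The training target is eps, not the denoised image itself. Given a
# perfect eps prediction, x0 can be recovered exactly:
x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
```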
@WhatsAI 2 years ago
Amazing video and explanation as always!
@AICoffeeBreak 2 years ago
Can only recommend your video on this! :)
@user__214 1 year ago
Thank you for this! Best simple explanation of stable diffusion that I've come across.
@alphacat4927 2 years ago
I was literally mentally thinking of sitting in a hammock when you said that.
@AICoffeeBreak 2 years ago
🏖️🏕️
@kajika135bis 2 years ago
Amazing explanations, this model looks very accessible. I hope it is not filled with undocumented tricks like I experienced with Soft Actor-Critic. Thank you again, this video is the final push for me to go and run this incredible network.
@maryguty1705 11 months ago
Just fabulously done! Wonderful!
@AICoffeeBreak 11 months ago
@Neptutron 2 years ago
Yay, you're back! =D Also this is perfect timing for me lol - I was planning on trying out a project with this model
@AICoffeeBreak 2 years ago
Is this project idea anything you want to share? :) No worries if you would rather not talk about it, especially if you plan to publish a paper about it and do not want to be scooped. 😅
@Neptutron 2 years ago
@@AICoffeeBreak Hey:) So I'm a PhD student, and I was thinking about making a paper of it if it works, so I won't share the entire idea here because you have a good point. But the gist is that I want to make 3d models from Stable Diffusion's outputs by combining it with a differentiable rendering backend
@AICoffeeBreak 2 years ago
@@Neptutron 🤞hope that you succeed!
@Neptutron 2 years ago
@@AICoffeeBreak Thank you Letitia!!
@Neptutron 2 years ago
Hi Letitia! I sent you an email with a small update:)
@coderaven1107 2 years ago
LOVE your content!!
@L33TNINJA51 2 years ago
So are any transformers used in Stable Diffusion?
@L33TNINJA51 2 years ago
Oh it uses transformers (cross attention) at every step. The paper cleared it up for me.
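For anyone wondering what that cross-attention looks like mechanically, here is a toy single-head numpy sketch: queries come from the image latents, keys and values from the text-encoder tokens. The projection matrices are random stand-ins for learned weights, and the dimensions are illustrative only.

```python
import numpy as np

def cross_attention(image_tokens, text_tokens, d_k):
    """Toy single-head cross-attention: image features attend to text."""
    rng = np.random.default_rng(0)
    # Learned projections (random here, just to show the wiring).
    Wq = rng.standard_normal((image_tokens.shape[-1], d_k))
    Wk = rng.standard_normal((text_tokens.shape[-1], d_k))
    Wv = rng.standard_normal((text_tokens.shape[-1], d_k))
    Q, K, V = image_tokens @ Wq, text_tokens @ Wk, text_tokens @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the text tokens for each image position.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V        # text-informed image features

img = np.random.default_rng(1).standard_normal((16, 32))  # 16 latent "pixels"
txt = np.random.default_rng(2).standard_normal((5, 64))   # 5 text tokens
out = cross_attention(img, txt, d_k=32)
```

In the actual U-Net this block is inserted at several resolutions, which is how the text prompt steers every denoising step.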
@harumambaru 2 years ago
If you are really using the sponsor to transcribe your videos, it is doing an amazing job compared to the standard YouTube autogenerated text. Punctuation signs are so life-changing! I saw an exclamation mark in subtitles for the first time in 15 years of using YouTube!!!
@AICoffeeBreak 2 years ago
The baked-in subtitles in the sponsor spot are transcribed. The rest is not. I copy-paste my script into YouTube for the captions in general.
@HalkerVeil 2 years ago
They used to say the camera would ruin art too.
@gunnar_langemark 2 years ago
I really enjoyed your video. Keep up the good work, please. :)
@Harduex 1 year ago
Very good and easy to understand explanation! Thank you for your content and keep up with the good work 🙌
@XuhanQian 2 years ago
love ur work! great explanation!
@HemangJoshi 2 years ago
Miss coffee bean is so sweet... ❤️❤️❤️❤️
@vikaspoddar001 2 years ago
Everything is happening in great sync 🙂 After watching Andrej Karpathy's stable diffusion video, now getting an explanation by Miss Coffee Bean herself 🎉🎉🎉
@AICoffeeBreak 2 years ago
Haha, I feel like an imposter just being in the same sentence with Karpathy. 😅
@maryguty1705 11 months ago
Have you ever done videos on all the weird and interesting tricks applied in DL and AI to achieve quite intriguing results?
@AkairoAoihonoSama 9 months ago
Great video! I just found your channel and it's fantastic ❤ Could you please explain why AI images often have a strange wavy pattern that doesn't represent anything and looks like shapes melting into other shapes? Is it correlated to the noise added?
@AICoffeeBreak 6 months ago
Hi, wonderful question! It can have something to do with noisy training data, but mostly it is because interpolating within the latent space of generative models can produce smooth transitions that appear as melting shapes. In the latent space, you cram many dimensions into a few, but there are still places in the latent space that were unsampled during training. Small changes in the latent vector inherently lead to continuous changes in the output image, but sometimes these are strange, nonsensical melting shapes.
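The interpolation effect described in that answer can be sketched directly: any point between two latents is a perfectly valid vector that a decoder will map to pixels, even though nothing guarantees the training data ever covered that region. A toy numpy sketch (the "cat"/"dog" latents are hypothetical placeholders, and the decoder is omitted):

```python
import numpy as np

def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors, t in [0, 1]."""
    return (1 - t) * z_a + t * z_b

rng = np.random.default_rng(0)
z_cat = rng.standard_normal(64)   # toy latent of a "cat" image
z_dog = rng.standard_normal(64)   # toy latent of a "dog" image

# Every midpoint decodes to *something*, but the in-between points may lie
# in regions the training data never sampled -> "melting" intermediates.
path = [lerp(z_cat, z_dog, t) for t in np.linspace(0, 1, 5)]
```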
@Handelsbilanzdefizit 2 years ago
Are skip-connections and residual-connections the same?
@johnclapperton8211 2 years ago
What constitutes a valid starting image? Is it arbitrary? If I ask for "a frog on a carrot" is the starting image as likely to be a spaceship or hamburger as either a frog or carrot? What do we mean by an "image"? It's an arrangement of pixels, but by what criteria do humans call it an image - just some spatial gradients which human brains have learned to recognise as meaningful in some way?
@pvlr1788 2 years ago
So the sentence encoder and image encoder-decoder are pre-trained and non-trainable during the training process of the LDM?
@Phenix66 2 years ago
Yup - otherwise, things would be super complex again if you'd take the full image. That's what makes it elegant. And training the text encoder along with the diffusion model is also not a great idea (Imagen from Google basically just leaves that out, and things get better).
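That division of labour (frozen pretrained autoencoder and text encoder; only the latent denoiser gets trained) can be summarised in a toy training-step skeleton. Everything here is a stand-in: `encode_image` and `encode_text` represent the frozen VAE/text encoder, and the zero prediction is a placeholder for the U-Net call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, pretrained components (toy stand-ins; in the real LDM these are
# an image autoencoder and a text encoder).
encode_image = lambda img: img.reshape(-1)[:64]           # toy "VAE encoder"
encode_text = lambda tokens: rng.standard_normal((len(tokens), 32))

def training_step(image, prompt_tokens, alpha_bar_t):
    """One sketched LDM training step: only the denoiser would get gradients."""
    z0 = encode_image(image)              # frozen image encoder
    cond = encode_text(prompt_tokens)     # frozen text encoder (conditioning)
    eps = rng.standard_normal(z0.shape)
    # Noise the *latent*, not the full-resolution image.
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    eps_hat = np.zeros_like(eps)          # placeholder for U-Net(z_t, t, cond)
    return np.mean((eps - eps_hat) ** 2)  # simple eps-prediction loss

loss = training_step(rng.standard_normal((8, 8, 3)), ["a", "cat"], 0.5)
```

Working in the small latent space instead of pixel space is exactly what keeps the diffusion part cheap.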
It's funny how DALL-E is made by OpenAI and isn't open source.
@robbiero368 2 years ago
Apparently they've invented this newfangled thing called a "camera", so all artists are now out of a job. I don't think you can really call yourself an artist if you can't imagine the creative explosion this sort of thing can promote.
@StuninRub 2 years ago
When the camera became popular, a lot of painters did indeed lose their jobs.
@robbiero368 2 years ago
@@StuninRub I would argue the benefit was that it eventually allowed the vast majority of the world to create images and not a handful of elites who could afford to pay for a painted portrait. Plus you can still have your portrait painted if you have the money. Plus you can send your photo to an artist and have them paint your portrait more cheaply. So there are probably even more portrait artists now in total than before.
@StuninRub 2 years ago
@@robbiero368 A lot fewer portrait artists. The camera put a lot of painters out of business, and AI art will do the same to digital artists.
@robbiero368 2 years ago
@@StuninRub can you supply some numbers?
@joedalton77 2 years ago
Is the time T an input to the network?
@AICoffeeBreak 2 years ago
The image at diffusion step T, yes. So with a certain amount of noise.
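In typical DDPM-style implementations, the scalar timestep itself is also passed to the network, usually as a sinusoidal embedding so the model knows how noisy its input currently is. A small numpy sketch of that standard transformer-style encoding (dimensions are illustrative):

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep t, commonly
    used to tell the denoiser which noise level it is operating at."""
    half = dim // 2
    # Geometrically spaced frequencies, as in the transformer position encoding.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = timestep_embedding(t=500, dim=128)
```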
@TimScarfe 2 years ago
Amazeballs
@AICoffeeBreak 2 years ago
😅 Thanks, Tim!
@HalkerVeil 2 years ago
Plot twist. This video and the cute girl talking is AI generated...
@matejcigale8840 2 years ago
Does anybody understand the difference between "Disco Diffusion" and "Stable Diffusion"? I can't seem to find anything useful. Both seem to stem from the code of the same person, if I understand correctly. Is it just the models?
@maltimoto 1 year ago
So how can an AI image generator combine two objects, for example a man holding an apple in his hand? The AI must (!) have a 3D model in order to render the scene correctly. But how does the AI get the 3D model? This is not explained in any of the videos on YouTube.
@AICoffeeBreak 1 year ago
It does not have a 3D model in the sense we have it. It learned from hundreds of millions of 2D images how apple pixels and man pixels "rhyme" with each other, which also includes a sometimes accurate, sometimes inaccurate fit of an inferred 3D model. It's exactly how ChatGPT talks about physics and reality by just having read text, without ever having seen pictures or touched anything.
@anonymous-vf2pr 2 years ago
Why are we adding noise instead of directly taking a noisy image?
@REDINKmysteries 2 years ago
i like turtles!!!!!
@ellanleicher8657 2 years ago
So how does a diffusion network generate a new image? From what I understand, it tries to reconstruct the noisy image.
@Epsellis 2 years ago
No, artists aren't complaining because it's easy. Just go watch any respected art YouTuber.
@AICoffeeBreak 2 years ago
I'm interested. Could you point me to a specific video?
@maxicornejo9675 2 years ago
@@AICoffeeBreak This guy explains it very briefly: kzbin.info/www/bejne/gqKymX2OhKajfas Basically, it's about exploitation and AI ethics, something that every tech fetishist doesn't care about. Sadly.
@crapadopalese 2 years ago
Yeah you got it wrong again in explaining what noise is being predicted. But I appreciate you not getting flustered by not getting it right and continuing to speak with complete confidence about things you don't have a grasp on.
@AICoffeeBreak 2 years ago
Thanks for sharing your opinion! If you would tell us what is wrong, we could point it out (for example by pinning your comment).
@crapadopalese 2 years ago
@@AICoffeeBreak thanks for your suggestion but I think that it's not an iterative process. Every video you post where you get it wrong disseminates misinformation. Do thorough research and be confident that you understand what you're explaining!
@maxicornejo9675 2 years ago
@@crapadopalese what a helpful comment, thank you.
@crapadopalese 2 years ago
@@maxicornejo9675 I think criticizing people that provide misinformation is important.
@maxicornejo9675 2 years ago
@@crapadopalese Constructive criticism is more valuable than mere criticism. >Constructive criticism is a feedback method that offers specific, actionable recommendations.
@81neuron 2 years ago
great video! just a bit too much cynicism
@chengong388 2 years ago
I tried it, and it totally sucks. The results are often messy, with different features smeared together, very much like those earlier AI "dream" images. It's just nowhere near as natural as DALL-E, and I can't help but feel that an optimized algorithm shouldn't require so much VRAM for a 512x512 image.
@imapimplykindapimp 2 years ago
It's hard to optimise a neural network; it's not like a regular human-made algorithm.
@ДмитроПрищепа-д3я 2 years ago
Have you added a negative prompt? It's quite important to tell the network what not to do. As for the VRAM usage, yeah, it has to fit the entire model in memory to run inference on it, and the model is massive. There's some work being done to optimize it, however. For one, the model you were using is likely 2 times smaller than the original already due to pruning. Stability AI is currently working on distilling the model down further to speed things up even more.
@Schinken_ 2 years ago
First you make text-to-image and then do image-to-image.