Lesson 9A 2022 - Stable Diffusion deep dive

30,490 views

Johno shows us what is happening behind the scenes when we create an image with Stable Diffusion, looking at the different components and processes and how each can be modified for further control over the generation process. The notebook is available in this repository: github.com/fastai/diffusion-nbs
This was made as a companion to lesson 9 of the FastAI 2022 course by Jonathan Whitaker (his channel: kzbin.info/door/P6gT9X2...).
00:00 - Introduction
00:40 - Replicating the sampling loop
01:17 - The Auto-Encoder
03:55 - Adding Noise and image-to-image
08:43 - The Text Encoding Process
15:15 - Textual Inversion
18:36 - The UNET and classifier free guidance
24:41 - Sampling explanation
36:30 - Additional guidance
Errata: there should be some scaling done to the model inputs for the unet demo in cell 49 (19 minutes in) - see scheduler.scale_model_input in all the loops for the code that is missing. And in the autoencoder part the 'compression' isn't exactly 64 times since there are 4 channels in the latent representation and only 3 in the input.
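The compression point in the errata can be checked with a little arithmetic. The sketch below assumes a 3x512x512 input image and a 4x64x64 latent; those sizes are an illustrative assumption, not something stated in the errata, which only mentions the channel counts.

```python
def num_values(channels, height, width):
    """Total number of scalar values in a (channels, height, width) tensor."""
    return channels * height * width

# Assumed shapes for illustration: RGB image in, 4-channel latent out.
image_shape = (3, 512, 512)
latent_shape = (4, 64, 64)

# Ratio counting only spatial positions (height * width): the "64 times" figure.
spatial_ratio = (image_shape[1] * image_shape[2]) / (latent_shape[1] * latent_shape[2])

# Ratio counting every value, channels included: 3 channels in, 4 channels out.
total_ratio = num_values(*image_shape) / num_values(*latent_shape)

print(f"spatial reduction: {spatial_ratio:.0f}x")
print(f"total reduction:   {total_ratio:.0f}x")
```

Under these assumptions, the 64x figure counts spatial positions only; once the channel counts are included, the reduction in stored values is smaller.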

Comments: 15

@markhopkins8731 1 year ago

Love your simple explanation of a manifold Jonathan. It's the first time it's made sense to me. Looking forward to the coming lectures.

@al3030 1 year ago

Thank you for this deep dive. The sampling explanation especially was helpful to try to get an intuition for what the model does.

@timandersen8030 1 year ago

Appreciate this supplemental deep dive into code of stable diffusion!

@spider853 1 year ago

I finally understand the schedulers! Thank you!

@saidmoglu 1 year ago

pretty good video to further understand SD!

@climez 1 year ago

This is useful, but I wish you had gone into more detail here and there. Is some CLIP or similar model included in the Stable Diffusion implementation? If so, are precomputed weights of the CLIP model used to calculate noise_prediction in each step? I.e. do we pass the current noisy image (in latent space) and the text embedding to CLIP and then calculate the gradient for each voxel of the image so that something (semantic similarity?) is maximized? I wish you would say what happens during training of the model and what then happens during inference :).

@alexrichmonkey7845 1 year ago

Please explain the ancestral samplers.

@adityagupta-hm2vs 9 days ago

Also, are we treating the latents like weights here? We are subtracting gradients from the latents, which is what we typically do to the weights in a conventional NN.

@jaivalani4609 1 year ago

How can it perform a custom action? Basically, how can we fine-tune it on our own input and target images so that it does what our text asks?

@JohnSmith-he5xg 1 year ago

Why do you "sample()" from the latents? Does this mean the latents are not the same between runs?

@adityagupta-hm2vs 9 days ago

How do we decide the scaling factor in the VAE part, i.e. 0.18215? Any hint on how to choose it? I tried changing it and could see a different output, but what's a good way to pick a value?

@AM-yk5yd 1 year ago

I'm surprised how much the complexity(?) has ramped up. It's the second day and I'm only at the 4th minute; I spent 30 minutes debugging my code-along session (I wrote rand_like instead of randn_like and my parrot photo went green instead of scrambled).

@offchan 1 year ago

rand is uniform whereas randn is normal (Gaussian)
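A quick way to see the difference is to compare summary statistics of the two kinds of noise. This is a minimal sketch that uses Python's standard library as a stand-in for torch.rand_like (uniform on [0, 1)) and torch.randn_like (standard normal); the sample count and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000

# Stand-in for torch.rand_like: uniform samples in [0, 1).
uniform_noise = [random.random() for _ in range(n)]

# Stand-in for torch.randn_like: samples from a normal with mean 0, std 1.
gaussian_noise = [random.gauss(0.0, 1.0) for _ in range(n)]

uniform_mean = statistics.fmean(uniform_noise)
gaussian_mean = statistics.fmean(gaussian_noise)
uniform_min = min(uniform_noise)
gaussian_min = min(gaussian_noise)

print(f"uniform:  mean={uniform_mean:.3f}, min={uniform_min:.3f}")
print(f"gaussian: mean={gaussian_mean:.3f}, min={gaussian_min:.3f}")
```

Uniform noise is never negative and averages about 0.5, so adding it pushes every value in the same direction, whereas Gaussian noise is centred on zero. That would be consistent with an image shifting in colour rather than looking scrambled, though the exact effect depends on the rest of the pipeline.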

@howardjeremyp 1 year ago

Feel free to skip over lessons 9A and 9B if you don't feel ready for them just yet - they're optional extras for those looking to dig deeper.