(All lesson resources are available at course.fast.ai.) In this lesson, we dive into accelerated stochastic gradient descent (SGD) approaches such as momentum, RMSProp, and Adam. We start by experimenting with these techniques in Microsoft Excel, setting up a simple linear regression problem and applying each approach to solve it. We also introduce learning rate annealing and show how to implement it in Excel. Next, we explore learning rate schedulers in PyTorch, focusing on cosine annealing and how to work with PyTorch optimizers. We create a learner with a single-batch callback and fit the model to obtain an optimizer, then examine the optimizer's attributes and explain the concept of parameter groups.
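As a rough illustration of the PyTorch pieces mentioned above (a minimal sketch, not the lesson's own notebook code), the snippet below builds a tiny model, inspects the optimizer's param_groups, and steps a CosineAnnealingLR scheduler once per (hypothetical) batch to record how the learning rate decays:

```python
# Minimal sketch: a PyTorch optimizer's parameter groups and a cosine-annealing scheduler.
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
opt = optim.Adam(model.parameters(), lr=1e-2)

# An optimizer holds a list of parameter groups; each group carries its own
# hyperparameters (lr, betas, weight decay, ...) plus the tensors it updates.
print(len(opt.param_groups))        # 1 group by default
print(opt.param_groups[0]['lr'])    # 0.01

# T_max is the number of scheduler steps over which the LR anneals towards eta_min.
sched = CosineAnnealingLR(opt, T_max=100)

lrs = []
for step in range(100):
    # ... forward pass, loss.backward(), opt.step(), opt.zero_grad() would go here ...
    sched.step()                    # anneal once per batch in this sketch
    lrs.append(opt.param_groups[0]['lr'])

print(lrs[0], lrs[-1])              # the LR follows a half-cosine curve down to ~0
```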
We continue by using PyTorch's OneCycleLR scheduler, which adjusts both the learning rate and momentum during training. We also discuss how to improve the architecture of a neural network by making it deeper and wider, introducing ResNets and the concept of residual connections. Finally, we explore various ResNet architectures from the PyTorch Image Models (timm) library and experiment with data augmentation techniques such as random erasing and test-time augmentation.
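To make the residual-connection idea concrete, here is a minimal sketch of a ResNet-style block (an assumed simplification, not the architecture built in the lesson): the block only has to learn a correction on top of its input, because the input is added back to the block's output.

```python
# Minimal sketch of a residual block (assumed simplification, not the lesson's code).
import torch
from torch import nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, ni, nf):
        super().__init__()
        self.conv1 = nn.Conv2d(ni, nf, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(nf, nf, kernel_size=3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm2d(nf), nn.BatchNorm2d(nf)
        # If the channel count changes, project the input with a 1x1 conv so shapes match.
        self.idconv = nn.Identity() if ni == nf else nn.Conv2d(ni, nf, kernel_size=1)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The residual connection: add the (possibly projected) input back in.
        return F.relu(out + self.idconv(x))

x = torch.randn(8, 16, 32, 32)
print(ResBlock(16, 32)(x).shape)    # torch.Size([8, 32, 32, 32])
```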
0:00:00 - Accelerated SGD done in Excel
0:01:35 - Basic SGD
0:10:56 - Momentum
0:15:37 - RMSProp
0:16:35 - Adam
0:20:11 - Adam with annealing tab
0:23:02 - Learning Rate Annealing in PyTorch
0:26:34 - How do PyTorch’s optimizers work?
0:32:44 - How do schedulers work?
0:34:32 - Plotting learning rates from a scheduler
0:36:36 - Creating a scheduler callback
0:40:03 - Training with Cosine Annealing
0:42:18 - 1-Cycle learning rate
0:48:26 - HasLearnCB - passing learn as parameter
0:51:01 - Changes from last week, /compare in GitHub
0:52:40 - fastcore’s patch to the Learner with lr_find
0:55:11 - New fit() parameters
0:56:38 - ResNets
1:17:44 - Training the ResNet
1:21:17 - ResNets from timm
1:23:48 - Going wider
1:26:02 - Pooling
1:31:15 - Reducing the number of parameters and megaFLOPS
1:35:34 - Training for longer
1:38:06 - Data Augmentation
1:45:56 - Test Time Augmentation
1:49:22 - Random Erasing
1:55:55 - Random Copying
1:58:52 - Ensembling
2:00:54 - Wrap-up and homework
Many thanks to Francisco Mussari for timestamps and transcription.