Hope you guys find this implementation video useful! This video assumes you're familiar with the basics of GANs; if not, follow the previous videos in this playlist. If you have recommendations for GAN topics you think would make this an even better resource for people wanting to learn about GANs, let me know in the comments below and I'll try to cover them! I learned a lot and was inspired to make these GAN videos by the GAN specialization on Coursera, which I recommend. Below you'll find both affiliate and non-affiliate links; the pricing for you is the same, but a small commission goes back to the channel if you buy through the affiliate link.
affiliate: bit.ly/2OECviQ
non-affiliate: bit.ly/3bvr9qy
Here's the outline for the video:
0:00 - Introduction
0:27 - Understanding WGAN
6:53 - WGAN Implementation details
9:15 - Coding WGAN
15:50 - Understanding WGAN-GP
18:48 - Coding WGAN-GP
25:29 - Ending
@AladdinPersson (4 years ago)
@nerd I don't know, there are too many variables in play to say anything with confidence about that question. I don't spend much time thinking about what would look good from an employer's point of view either. I think the focus one should have is just to improve and do what you enjoy; whether someone likes that or not is up to them.
@foobar1231 (3 years ago)
@@AladdinPersson 12:20 -- It is not important and not an error; however, it is better to apply *zero_grad()* to the optimizer rather than to the model itself. It is not an error because USUALLY (not always) optimizer.zero_grad() zeroes the gradients of the model parameters it was given. In some cases, like the Neural Style Transfer task, *optimizer.zero_grad()* is applied to the image, not to the model. So it is better to write *optimizer.zero_grad()* rather than model.zero_grad(). Also, if you put *nn.Flatten()* at the end of the Critic class, you won't need to call .reshape(-1) during training.
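For anyone who wants to see both suggestions in code, here is a minimal sketch; the layer sizes and the toy loss are made up for illustration and are not the exact ones from the video:

import torch
import torch.nn as nn
import torch.optim as optim

# A toy critic that ends with nn.Flatten(), so its output already has
# shape (batch, 1) and no .reshape(-1) is needed in the training loop.
critic = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),  # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    nn.Conv2d(16, 1, kernel_size=32),                       # 32x32 -> 1x1
    nn.Flatten(),                                            # (N, 1, 1, 1) -> (N, 1)
)
opt_critic = optim.RMSprop(critic.parameters(), lr=5e-5)

real = torch.randn(8, 1, 64, 64)      # dummy batch
loss_critic = -critic(real).mean()    # toy loss, just to show the update pattern

opt_critic.zero_grad()                # zero the grads through the optimizer, as suggested above
loss_critic.backward()
opt_critic.step()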
@mkt9191 (3 years ago)
Thanks for the great video, that's a big help. I implemented it for my dataset, but I'm not sure if I am on the right track. May I know if you have any input for me? The mean discriminator loss began at 7.4, decreased to about -15, and then increased to roughly -1.38, where it seems to be stuck. 1. Does the discriminator loss actually become zero, or can any small negative value be considered a termination point? 2. Does the generator loss give us any information about convergence? Thank you
@Alex-wx2vd (8 months ago)
Awesome video! Thanks so much for posting an easy-to-follow WGAN guide. A couple of parts in the non-GP WGAN training loop jumped out to me as unusual, so I'll quickly explain what I figured out in case anyone else notices them (there's a sketch of the resulting loop below): 1) Use of `retain_graph=True` in the discriminator updates: This line isn't necessary, because we perform a full forward pass through both the generator and discriminator each time and use that compute graph only once to calculate the gradients. We then perform another full forward pass for the next set of inputs. `retain_graph=True` should be used when we need to perform multiple backward passes on the same graph, but since we're only using each graph once and then discarding it, we don't need to retain it. Retaining it will increase memory usage but won't cause any issues with training. 2) Using the same real images during the discriminator training: In the original WGAN paper the authors use different real images for each of the five discriminator updates. Doing so increases the diversity of images the discriminator sees, which improves generalisation faster, meaning the model should converge quicker. Not doing so won't break anything, but training will take a bit longer. 3) Not detaching the output of the generator: The fake images fed to the discriminator during discriminator training should be detached from the compute graph using .detach(). Not doing so means that when you call loss.backward(), you also compute gradients for all the parameters in the generator network. Since we then only call the discriminator's optimiser step, the generator's parameters are not actually updated, but we waste time calculating gradients, which slows things down. All the best Aladdin, thanks for the great videos :)
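A rough sketch of what that critic loop looks like with points 2) and 3) applied; names such as critic, gen, opt_critic, opt_gen, loader, device, Z_DIM, BATCH_SIZE, CRITIC_ITERATIONS and WEIGHT_CLIP are assumed to be defined as in a typical WGAN setup, so treat this as an illustration rather than the video's exact code:

import torch

data_iter = iter(loader)

for _ in range(CRITIC_ITERATIONS):
    real, _ = next(data_iter)                   # 2) a fresh batch of real images every iteration
    real = real.to(device)
    noise = torch.randn(real.size(0), Z_DIM, 1, 1, device=device)
    fake = gen(noise).detach()                  # 3) no gradients flow back into the generator
    loss_critic = -(critic(real).mean() - critic(fake).mean())

    opt_critic.zero_grad()
    loss_critic.backward()                      # 1) no retain_graph=True needed here
    opt_critic.step()

    for p in critic.parameters():               # weight clipping (plain WGAN, no gradient penalty)
        p.data.clamp_(-WEIGHT_CLIP, WEIGHT_CLIP)

# Generator update: a fresh forward pass, this time keeping the graph.
noise = torch.randn(BATCH_SIZE, Z_DIM, 1, 1, device=device)
loss_gen = -critic(gen(noise)).mean()
opt_gen.zero_grad()
loss_gen.backward()
opt_gen.step()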
@miladaghajohari2308 (3 years ago)
This video is awesome. I already knew what WGAN and WGAN-GP were and had implemented them before; I watched this video just for the fun of it, and I can say it really showcases the implementation clearly.
@elahehshahmir322 (3 years ago)
I would really appreciate it if you presented a video about the code implementation of any type of GAN for tabular data, describing the concepts and dimensions in detail, since it is a bit confusing.
@youtubeadventurer1881 (3 years ago)
Not sure if this is an issue, but the algorithm in the WGAN paper draws new samples of real images each time in the inner critic update loop, whereas in this video you use the same batch of real images in each of the 5 inner loop iterations.
@khoivoinh-1099 (1 year ago)
I'm confused about the same thing.
@philwhln (3 years ago)
Another great video on GANs! I found the celeb dataset also needed transforms.CenterCrop(64), as transforms.Resize(64) resulted in a 64x78 image. Not sure why there was a difference there.
@AladdinPersson (3 years ago)
Yeah my bad, it should be transforms.Resize((64,64)), I think I had already pre-resized all of them or something and that's why I didn't get an error.
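For reference, a small sketch of the two transform pipelines being discussed, assuming torchvision and 64x64 three-channel CelebA-style images:

import torchvision.transforms as transforms

IMG_SIZE = 64

# Variant from the comment above: resize the shorter side to 64 (keeping the
# aspect ratio, e.g. 64x78), then cut out the central 64x64 patch.
transform_crop = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.CenterCrop(IMG_SIZE),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

# Variant from the reply: force both sides to 64 directly (may distort the aspect ratio).
transform_resize = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])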
@philwhln (3 years ago)
@@AladdinPersson Cool, thanks for the follow-up! That makes sense. I was looking through the changelog of torchvision to see if the behaviour of Resize had changed :D
@touchyto (2 years ago)
@@philwhln Why did you use Resize((64,64)) and then CenterCrop(64) with the same size values? I thought that if you resize an input image to size 64 it is not necessary to do a CenterCrop with the same size; it makes no sense. Am I right? Thank you!
@aadarshraj1890 (4 years ago)
Awesome, your teaching style is next level.
@benyoo2257 (2 years ago)
Great videos! I have to say I love you YouTubers sharing these high-quality tutorials!
@garyzhai9540 (2 years ago)
An absolutely phenomenal, informative, and insightful video
@АнварГаниев-ж7н (3 years ago)
Thanks for the video! :) I found out that instance norm and layer norm are not the same, but pretty similar. The difference is that instance norm normalizes across each channel in each training example, instead of normalizing across all input features of a training example.
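A small sketch of that difference in PyTorch terms; the shapes here are illustrative:

import torch
import torch.nn as nn

x = torch.randn(8, 64, 16, 16)  # (batch, channels, height, width)

# InstanceNorm2d: statistics are computed per sample AND per channel,
# i.e. over the 16x16 spatial positions of each channel separately.
inst = nn.InstanceNorm2d(64, affine=True)

# LayerNorm: statistics are computed per sample over all the given dimensions,
# here over channels and spatial positions together.
layer = nn.LayerNorm([64, 16, 16])

print(inst(x).shape, layer(x).shape)  # both keep the input shape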
@leojr8781 (2 years ago)
Hi Aladdin, thank you for sharing this amazing video. It would be great if you dove into the paper "Improving the Improved WGAN-GP": the proposed CT-GAN is more stable in training and effective both at generating samples and in semi-supervised settings.
@RovisoDominator (4 years ago)
Love this series on GANs... Will you do some videos on audio as well in the future?
@AladdinPersson (4 years ago)
Most likely, anything in particular you think would be useful? :)
@FLLCI (3 years ago)
Hi Aladdin, it'd be great if you made a quick video on how to implement spectral normalization in GANs using PyTorch's spectral_norm. I don't think it'd take much time. Thank you again!
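In case it helps anyone in the meantime, applying PyTorch's built-in spectral normalization to a layer is a one-liner; the layer sizes below are just examples:

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrap each weight layer of the critic/discriminator with spectral_norm.
# It constrains the layer's spectral norm, which is another way to encourage
# the Lipschitz constraint instead of weight clipping or a gradient penalty.
disc_block = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
)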
@Axcellaful (2 years ago)
Would be great to see a Keras implementation!
@aymensekhri2133 (3 years ago)
Thank you very much for this amazing explanation
@zitopuxpuxpu6971 (3 years ago)
Thank you so much for this insightful video!
@pratikkorat790 (4 years ago)
Goooood, you're great... I tried almost every GAN, but this Wasserstein architecture never converged...
@clagro7338 (3 years ago)
Will you ever make a ProGAN video?
@usama57926 (2 years ago)
Very nice explanation, but I'm confused about the loss function. How does it work?
@HERiTAGE-ew7pf (4 years ago)
Awesome stuff!
@wolfisraging (4 years ago)
Awesome explanation 👍
@AladdinPersson (4 years ago)
Appreciate you Wolf!
@starlite5097 (2 years ago)
Thanks for the awesome video
@mikhaeldito (4 years ago)
I have officially joined the club :). Would you make some content on autoencoders in the future?
@AladdinPersson (4 years ago)
That's one good-looking badge you got 😍 Appreciate you supporting the channel! I have plans for autoencoders and VAEs, but for now the only focus is GANs.
@mikhaeldito (4 years ago)
@@AladdinPersson I look forward to it!
@samyalayse8945 (1 year ago)
I really need a video with this kind of explanation for GANs on 1D data 😭 but I can't find one anywhere.
@surayuthpintawong8332 (3 years ago)
Thank you so much.
@polimetakrylanmetylu2483 (3 years ago)
Hey, I did an experiment. Apparently changing Adam to SGD with Nesterov momentum causes the model to learn cross-hatching.
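For anyone who wants to reproduce that experiment, swapping the optimizers is a small change; this is just a sketch, and the learning rate here is a guess rather than a tuned value:

import torch.optim as optim

# Replace the usual Adam/RMSprop optimizers with SGD + Nesterov momentum.
# gen and critic are assumed to be the generator and critic models.
opt_gen = optim.SGD(gen.parameters(), lr=1e-3, momentum=0.9, nesterov=True)
opt_critic = optim.SGD(critic.parameters(), lr=1e-3, momentum=0.9, nesterov=True)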
@thiagoviek342 (4 years ago)
Great Video!
@UTTAMKUMAR-ic3di (2 years ago)
Hi @Aladdin Persson, why does it take so much time to train the critic and generator? It has taken more than 1 hour and is still running; I am using the MNIST dataset.
@srikanthramakrishna1073 (3 years ago)
nice video!! Thanks
@putteneersjoris (4 years ago)
Thank you!
@ViduzTube (1 year ago)
At 11:33, on line 70, I think you should do critic(fake.detach()), since it is inside the critic update phase and the generator weights must stay frozen, right?
@aaryannakhat1004 (1 year ago)
He uses retain_graph=True on line 73, which essentially does the same thing!
@ioanacretu3770 (1 year ago)
Does this retain the weights of the critic for the entire training?
@Woollzable (1 year ago)
@@aaryannakhat1004 No, they are not the same: using retain_graph=True is not the same as .detach(). Carlo Aironi is correct. By not detaching the generator output in the discriminator/critic update phase (11:33, line 70), you end up back-propagating through both networks, while you should only back-propagate through the critic/discriminator in the n_critic_iterations loop. Furthermore, retain_graph=True is not necessary in this WGAN implementation: in each loop you are supposed to sample new noise and a new batch of data from the dataset. In conclusion, during each loop you are updating the weights of the discriminator AND sampling both new noise and new data, so "retain_graph=True" is not necessary in this case. The typical use case for retain_graph=True is when you want to perform several backward passes on the same graph within a single iteration of the training loop. This is typically not the case in WGAN, as you usually compute the generator and critic losses separately and update their parameters separately, even though the critic is updated multiple times per generator update. In WGAN-GP, however, you need retain_graph=True, because the discriminator/critic's gradient norm is penalized if it deviates from 1.
@Woollzable (1 year ago)
@@ioanacretu3770 Please see my response to Aaryan_Nakhat :) I haven't checked the version with gradient penalty, only the plain WGAN (not WGAN-GP). When you call loss.backward(), it backpropagates through the entire graph. Since the fake images came from the generator, gradients will also be computed for the generator's weights, which should not be done at this stage. By using fake.detach(), you are telling PyTorch not to backpropagate through the generator here: backpropagation only goes back as far as fake = gen(z). The generator is updated in a separate step.
@baohuynh5462 (3 years ago)
Many thanks
@hackercop (3 years ago)
Sir, at 17:53 why is it that "if the norm is 1 for all interpolations then the Lipschitz constraint is satisfied"? To satisfy a Lipschitz constant of 1, doesn't the gradient only have to be a value between -1 and 1, not exactly equal to 1?
@MorisonMs (3 years ago)
20:07 The .repeat method is redundant, i.e. broadcasting should do it anyway, right?
@philwhln (3 years ago)
I thought the same. I've tested it and both result in the same tensor, so both should work. Since epsilon is used twice, I'm not sure if there's any optimization in using repeat() once instead of broadcasting twice, but the difference would be small.
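A quick way to check that both versions produce the same interpolated images, using dummy tensors:

import torch

real = torch.randn(4, 3, 64, 64)
fake = torch.randn(4, 3, 64, 64)
epsilon = torch.rand(4, 1, 1, 1)

# Version with an explicit repeat up to the full image shape.
eps_full = epsilon.repeat(1, 3, 64, 64)
interp_repeat = real * eps_full + fake * (1 - eps_full)

# Broadcasting version: (4, 1, 1, 1) broadcasts against (4, 3, 64, 64) automatically.
interp_broadcast = real * epsilon + fake * (1 - epsilon)

print(torch.allclose(interp_repeat, interp_broadcast))  # True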
@MorisonMs (3 years ago)
21:15 Question regarding autograd.grad: in this case the function output is a scalar, so why do we need to specify grad_outputs? (Maybe I don't understand autograd.grad correctly.) Thanks in advance, my brother.
@sumukhbhat1641 (3 years ago)
Great tutorial! One question though: what is the memory capacity of your GPU? Mine is 2 GB and it ran out of memory when I added the extra critic_iters loop.
@madhuvarun2790 (3 years ago)
I have a doubt. At 12:23 the code says critic.zero_grad(). Could you please explain what that line does? I thought we used zero_grad() only on optimizers.
@augustinestephens471 (3 years ago)
I think model.zero_grad() is the same as using optim.zero_grad() for the same model
@augustinestephens471 (3 years ago)
model.zero_grad() and optimizer.zero_grad() are the same IF all your model parameters are in that optimizer. I found it is safer to call model.zero_grad() to make sure all grads are zero, e.g. if you have two or more optimizers for one model.
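A tiny sketch of the situation described here, where one model's parameters are split across two optimizers (the split is made up for illustration):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Two optimizers, each owning only part of the model's parameters.
opt_a = optim.Adam(model[0].parameters(), lr=1e-3)
opt_b = optim.Adam(model[2].parameters(), lr=1e-4)

# opt_a.zero_grad() would only clear the first layer's gradients;
# model.zero_grad() clears the gradients of every parameter in the model.
model.zero_grad()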
@jianningdeng3645 (3 years ago)
Hello, I've noticed that the critic loss is a negative number. Is this correct? And with the minimization, it goes down to an even larger negative number.
@dy6697 (3 years ago)
Great content
@mohdkashif7295 (3 years ago)
At 20:26, why did you expand the dims of epsilon using repeat? Doesn't multiplying the image by a scalar value work?
@faridalijani1578 (9 months ago)
This must be a wrong explanation in the video! One can simply do: epsilon = torch.rand(len(real), 1, 1, 1, device=device, requires_grad=True)
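For context, here is a compact sketch of the whole gradient-penalty computation being discussed; the naming is illustrative and may differ from the actual repo code:

import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    # Push the critic's gradient norm at points between real and fake images towards 1.
    batch_size = real.shape[0]
    # One epsilon per image; requires_grad=True (as suggested above) keeps the
    # interpolated images differentiable even if fake was detached earlier.
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device, requires_grad=True)
    interpolated = real * epsilon + fake * (1 - epsilon)

    mixed_scores = critic(interpolated)

    # Gradient of the critic scores w.r.t. the interpolated images. grad_outputs
    # is needed because mixed_scores is a batch of scores, not a single scalar.
    gradient = torch.autograd.grad(
        outputs=mixed_scores,
        inputs=interpolated,
        grad_outputs=torch.ones_like(mixed_scores),
        create_graph=True,   # keep the graph so the penalty itself can be backpropagated
        retain_graph=True,
    )[0]

    gradient = gradient.view(batch_size, -1)
    gradient_norm = gradient.norm(2, dim=1)
    return torch.mean((gradient_norm - 1) ** 2)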
@mostechroom9780 (3 years ago)
Why couldn't you put critic.zero_grad() outside the critic iterations loop? Wouldn't placing it inside reset the gradients to zero within each iteration?
@arjunpukale3310 (4 years ago)
Please make videos on Pix2Pix and CycleGAN.
@AladdinPersson (4 years ago)
I also think the GAN specialization is great. I will also make videos on Pix2Pix and CycleGAN, but these things take time, so patience 🙏
@gtalpc59 (3 years ago)
@@riis08 I came from that course to these videos. They don't explain anything about the code implementation and only cover the theory basics. GAN implementation is tricky for beginners. Aladdin's videos are super useful, to the point, and very clear.
@riis08 (3 years ago)
@@gtalpc59 It's not about good or bad... @Aladdin Persson's lectures are amazing... I just said that the deeplearning.ai specialization is also available, and it helps a lot in understanding the concepts... In my point of view, that specialization is the best.
@generichuman_ (2 years ago)
Just confirming: for this implementation, are the losses for the critic and the generator supposed to approach 0? Mine are hovering around -60 for the critic and 300 for the generator. The image quality is constantly improving over time, but these loss values make me nervous. I know there's not much to go on here, but would lowering the learning rate be my best bet?
@EvanRickettsUk (1 year ago)
Did you ever sort this out? I have a similar problem with strange loss values.
@freghidelgrifo (1 month ago)
@@EvanRickettsUk Hi guys, do you have updates?
@suryagaur4363 (4 years ago)
Dude, please make a video on CycleGAN.
@Alinawazhusain (11 months ago)
You and I have the same code and the same dataset, but why is my output nothing like yours? It's not even 1 percent similar.
@jup4929 (4 years ago)
Thanks so much for the video! Do you know any good setups for local (non-cloud) computing on Windows, like IDEs for PyTorch, or are cloud computing platforms the way to go?
@AladdinPersson (4 years ago)
You could use Jupyter notebooks; personally I like using PyCharm (and notebooks sometimes), but as for IDEs I know many also like Visual Studio. I don't think cloud computing is the way to go long term, but it could work if you seldom train large models, etc.
@zulfiqaribrahim8750 (3 years ago)
Hi Aladdin, do you plan to code StyleGAN2?
@polimetakrylanmetylu2483 (3 years ago)
Hey, I don't understand why we put retain_graph in the backward call. In the simple GAN it doesn't make sense to do it; we even detach the fake images there. What's so different in WGANs that we don't do it, apart from it breaking the autograd.grad calculation?
@alexanderhaas564 (2 years ago)
Any update on this?
@shaikansarbasha4169 (3 years ago)
Hi bro, for the disc, features_g = 128, but you took 64.
@philosdata (2 years ago)
What VS Code theme is that?
@arhamkhan9489 (3 years ago)
Hi, I have an implementation of WGAN using weight clipping. I've copied over the models from a DCGAN implementation that works, just removing the sigmoid from the discriminator, similar to how your implementation works. I was wondering how you'd interpret the critic loss gradually decreasing and stabilizing around -6? I was led to believe that the critic loss should tend towards zero, and while the image quality does improve with training, it is quite bad. I'm training on a set of 3x256x256 images. I've tried increasing the critic training iterations and the learning rate but with similar results each time. I've also replaced BatchNorm in the critic with LayerNorm, keeping BatchNorm in the generator, but the same thing seems to happen. Do you have any suggestions for debugging?
@AladdinPersson (3 years ago)
I'd try with smaller image sizes to start with, and the most important parameter here is the learning rate.
@arhamkhan9489 (3 years ago)
@@AladdinPersson Thanks for the reply! The reason I'm a bit confused is that I've trained the corresponding DCGAN on the same image size as above. How would one adjust the architecture to account for different image sizes? I've read online that people have had problems adjusting for different image sizes. Also, how would you go about tuning the learning rate, would you try to make it lower? I've experimented with lr = 5e-5 to 1e-4.
@AladdinPersson (3 years ago)
@@arhamkhan9489 Do I understand it correctly that you are using the code from the implementation in the video and you set the image size to 256x256 and it's not working?
@AladdinPersson (3 years ago)
@@arhamkhan9489 Ok, I misunderstood. Honestly, it's extremely difficult to know what (if anything) could be wrong in your implementation. It could just come down to hyperparameter search, and then my tip is to have quick cycles so you can test things (which is why I recommended lowering the image size so it runs faster). Hopefully those hyperparameters then also scale when you increase the image size.
@arhamkhan9489 (3 years ago)
@@AladdinPersson The implementation is from a DCGAN implementation I wrote myself that did work. Thanks for the help, I'll have to continue to debug by tuning hyperparameters.
@ChrisOffner (3 years ago)
At ~7:15 you say that _"while theta has not converged"_ is equivalent to _"when the _*_critic's loss_*_ is very close to 0"._ Am I misunderstanding something or should it be _"when the _*_generator's loss_*_ is very close to 0",_ since the generator is the one who wants to push the Wasserstein distance to 0? 🤔 Thank you for this playlist, very helpful! :)
@AladdinPersson (3 years ago)
I believe it should be the critic's loss, because when we obtain a critic loss of 0 it means it can no longer distinguish between fakes and reals. Then the generator isn't getting any more useful information and we should stop training.
@ChrisOffner (3 years ago)
@@AladdinPersson Really? Then I may be misunderstanding how the critic's loss is calculated. Intuitively, a critic's loss of *0* would suggest to me that the critic does its job _perfectly_, i.e. that it can perfectly distinguish between fake and real samples.
@AladdinPersson (3 years ago)
@@ChrisOffner The critic wants to maximize the distance between the two distributions, so if the difference between the expectations is 0, it would mean it can't distinguish at all. Perhaps I'm missing something.
@ChrisOffner (3 years ago)
@@AladdinPersson Right, I guess we're just hung up on terminology here. I just associate "loss" with something to be minimised. So if the critic wants to _maximise_ the Wasserstein metric, I wouldn't call that the "critic loss".
@AladdinPersson (3 years ago)
@@ChrisOffner You're right, that was a bit confusing.
@sahil-7473 (3 years ago)
Damn! It won't even work with 4 GB of GPU memory when implemented exactly as per the pseudo-code. Well, I will move on. Great video! Edit 1: With the same code as yours, the only difference being swapping the loop over the batch and n_critic and running for 5 epochs, I was not getting any good image results on the MNIST dataset. I went to the paper and found that for some datasets they run for 200k epochs XD. It's no wonder I was not getting any results. Any guess how many epochs I should run for the MNIST dataset with WGAN-GP? Edit 2: I had miscoded the loss_critic line. I thought the whole expression is negated to maximize, but no, only (real - fake) is negated, plus lambda*gp. Damn! Even a small mistake takes you in a different direction.
@spikewong9740 (3 years ago)
Hi, one question: is it critic.zero_grad() or opt_critic.zero_grad()? Or are they the same? Can anyone tell me? Thanks.
@ssssssstssssssss (3 years ago)
They are not the same. One will zero the gradients of everything in the network, and the other will zero the gradients of the parameters passed into the optimizer, which aren't necessarily the same. If I'm not mistaken, opt_critic.zero_grad() is faster too.
@spikewong9740 (3 years ago)
@@ssssssstssssssss Thanks, so does it have any influence on the result?
@TheRealRoySadaka (3 years ago)
Great explanation, thank you! In WGAN-GP, the loss in the paper is: D(x') - D(x) + gp, so to minimize it I would have expected: -(D(x') - D(x) + gp), but in your implementation you did: -(D(x') - D(x)) + gp. Is this a bug, or is the code correct? Thank you in advance.
@AladdinPersson (3 years ago)
Hey Roy, very sorry for the delayed response. In the paper, equation 2 on page 2 specifies that the goal of the critic is to maximize E[D(x)] - E[D(x_tilde)], which essentially means that it's able to distinguish between the two. The notation here is that x is a real image and x_tilde is a fake generated image. You're right that in the code it looks a bit different, which is because we need to convert the maximization into a minimization problem, and we do that by simply multiplying by -1. There's obviously a simple way to cancel out the negative sign, but to stay close to the paper's representation I wrote it as -(E[D(x)] - E[D(x_tilde)]). For the gradient-penalty term we want to add it, because our goal (the optimal scenario) is for it to be 0. If we instead multiplied that part by a negative sign it would contradict our goal: instead of pushing the norm of the gradient towards 1, it would push it to be incredibly large and make the total loss very negative. This is also clearer in Algorithm 1 and equation 3 on page 4, where I believe they've already turned it into a minimization problem and therefore write it as a loss term.
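To make the signs concrete, here is roughly what the two loss lines being discussed look like in code; critic, real, fake, gradient_penalty, LAMBDA_GP and device are assumed to be defined elsewhere, so this is a sketch of the idea rather than a verbatim copy of the video's code:

import torch

# Critic: maximize E[D(real)] - E[D(fake)], i.e. minimize the negation,
# and ADD the gradient penalty, whose own optimum is 0 (so it is not negated).
gp = gradient_penalty(critic, real, fake, device=device)
loss_critic = -(torch.mean(critic(real)) - torch.mean(critic(fake))) + LAMBDA_GP * gp

# Generator: maximize E[D(fake)], i.e. minimize -E[D(fake)].
loss_gen = -torch.mean(critic(fake))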
@NinjTsax (1 year ago)
Thanks for the great video! I am trying to train on 256x256 images and I have tried using kernel size 14 for the first layer in the generator to reach 256x256 at the last layer. It seems to work quite well, with decent results from epoch 70. Is it better to add 2 extra layers instead of a large kernel? Like this:
# Input: N x channels_noise x 1 x 1
self._block(channels_noise, features_g * 16, 4, 1, 0),  # img: 4x4
self._block(features_g * 16, features_g * 8, 4, 2, 1),  # img: 8x8
self._block(features_g * 8, features_g * 4, 4, 2, 1),  # img: 16x16
self._block(features_g * 4, features_g * 2, 4, 2, 1),  # img: 32x32
self._block(features_g * 4, features_g * 2, 4, 2, 1),  # img: 64x64
self._block(features_g * 4, features_g * 2, 4, 2, 1),  # img: 128x128
nn.ConvTranspose2d(
    features_g * 2, channels_img, kernel_size=4, stride=2, padding=1
),  # Output: N x channels_img x 256 x 256
Should I increase the features_g multiplier so it goes up to 64 in the first layer?
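For comparison, one common way to extend the channel progression to 256x256 with extra 4x4, stride-2 blocks is to keep halving the feature multiplier at every upsampling step. The sketch below is just that pattern with made-up sizes, not a recommendation tuned for any particular dataset:

import torch
import torch.nn as nn

class Generator256(nn.Module):
    def __init__(self, z_dim=100, channels_img=3, features_g=64):
        super().__init__()

        def block(in_c, out_c, k=4, s=2, p=1):
            return nn.Sequential(
                nn.ConvTranspose2d(in_c, out_c, k, s, p, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(),
            )

        self.net = nn.Sequential(
            # Input: N x z_dim x 1 x 1
            block(z_dim, features_g * 16, 4, 1, 0),    # 4x4
            block(features_g * 16, features_g * 8),    # 8x8
            block(features_g * 8, features_g * 4),     # 16x16
            block(features_g * 4, features_g * 2),     # 32x32
            block(features_g * 2, features_g),         # 64x64
            block(features_g, features_g // 2),        # 128x128
            nn.ConvTranspose2d(
                features_g // 2, channels_img, kernel_size=4, stride=2, padding=1
            ),  # Output: N x channels_img x 256 x 256
            nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

# Quick shape check with dummy noise.
if __name__ == "__main__":
    z = torch.randn(2, 100, 1, 1)
    print(Generator256()(z).shape)  # torch.Size([2, 3, 256, 256])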