Hope you guys find this implementation video useful! This video assumes you're familiar with the basics of GANs; if not, follow the previous videos in this playlist. If you have recommendations for GAN topics you think would make this an even better resource for people wanting to learn about GANs, let me know in the comments below and I'll try to cover them! I learned a lot and was inspired to make these GAN videos by the GAN specialization on Coursera, which I recommend. Below you'll find both affiliate and non-affiliate links; the pricing for you is the same, but a small commission goes back to the channel if you buy through the affiliate link.
affiliate: bit.ly/2OECviQ
non-affiliate: bit.ly/3bvr9qy
Here's the outline for the video:
0:00 - Introduction
0:27 - Understanding WGAN
6:53 - WGAN Implementation details
9:15 - Coding WGAN
15:50 - Understanding WGAN-GP
18:48 - Coding WGAN-GP
25:29 - Ending
@AladdinPersson (4 years ago)
@nerd I don't know, there are too many variables in play to say anything with confidence about that question. I don't spend much time thinking about what would look good from an employer's point of view either. I think the focus one should have is just to improve and do what you enjoy; whether someone likes that or not is up to them.
@foobar1231 (3 years ago)
@@AladdinPersson 12:20 -- It is not important and not an error; however, it is better to apply *zero_grad()* to the optimizer rather than to the model itself. It is not an error because USUALLY (not always) optimizer.zero_grad() zeroes the gradients of the model parameters it was given. In some cases, like the Neural Style Transfer task, *optimizer.zero_grad()* is applied to the image, not to the model. So it is better to write *optimizer.zero_grad()* rather than model.zero_grad(). Also, if you put *nn.Flatten()* at the end of the Critic class, you won't need to call .reshape(-1) during training.
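For anyone who wants to see both suggestions in code, here is a minimal sketch; the layer sizes and the toy loss are made up for illustration and are not the exact ones from the video:

import torch
import torch.nn as nn
import torch.optim as optim

# A toy critic that ends with nn.Flatten(), so its output already has
# shape (batch, 1) and no .reshape(-1) is needed in the training loop.
critic = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),  # 64x64 -> 32x32
    nn.LeakyReLU(0.2),
    nn.Conv2d(16, 1, kernel_size=32),                       # 32x32 -> 1x1
    nn.Flatten(),                                            # (N, 1, 1, 1) -> (N, 1)
)
opt_critic = optim.RMSprop(critic.parameters(), lr=5e-5)

real = torch.randn(8, 1, 64, 64)      # dummy batch
loss_critic = -critic(real).mean()    # toy loss, just to show the update pattern

opt_critic.zero_grad()                # zero the grads through the optimizer, as suggested above
loss_critic.backward()
opt_critic.step()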
@mkt9191 (3 years ago)
Thanks for the great video, that's a big help. I implemented it for my dataset, but I'm not sure if I am on the right track. May I know if you have any input for me? The mean discriminator loss began at 7.4, decreased to about -15, and then increased to roughly -1.38, where it seems to be stuck. 1. Does the discriminator loss actually become zero, or can any small negative value be considered a termination point? 2. Does the generator loss give us any information about convergence? Thank you
@Alex-wx2vd (8 months ago)
Awesome video! Thanks so much for posting an easy-to-follow WGAN guide. A couple of parts in the non-GP WGAN training loop jumped out to me as unusual, so I'll quickly explain what I figured out in case anyone else notices them (there's a sketch of the resulting loop below): 1) Use of `retain_graph=True` in the discriminator updates: This line isn't necessary, because we perform a full forward pass through both the generator and discriminator each time and use that compute graph only once to calculate the gradients. We then perform another full forward pass for the next set of inputs. `retain_graph=True` should be used when we need to perform multiple backward passes on the same graph, but since we're only using each graph once and then discarding it, we don't need to retain it. Retaining it will increase memory usage but won't cause any issues with training. 2) Using the same real images during the discriminator training: In the original WGAN paper the authors use different real images for each of the five discriminator updates. Doing so increases the diversity of images the discriminator sees, which improves generalisation faster, meaning the model should converge quicker. Not doing so won't break anything, but training will take a bit longer. 3) Not detaching the output of the generator: The fake images fed to the discriminator during discriminator training should be detached from the compute graph using .detach(). Not doing so means that when you call loss.backward(), you also compute gradients for all the parameters in the generator network. Since we then only call the discriminator's optimiser step, the generator's parameters are not actually updated, but we waste time calculating gradients, which slows things down. All the best Aladdin, thanks for the great videos :)
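A rough sketch of what that critic loop looks like with points 2) and 3) applied; names such as critic, gen, opt_critic, opt_gen, loader, device, Z_DIM, BATCH_SIZE, CRITIC_ITERATIONS and WEIGHT_CLIP are assumed to be defined as in a typical WGAN setup, so treat this as an illustration rather than the video's exact code:

import torch

data_iter = iter(loader)

for _ in range(CRITIC_ITERATIONS):
    real, _ = next(data_iter)                   # 2) a fresh batch of real images every iteration
    real = real.to(device)
    noise = torch.randn(real.size(0), Z_DIM, 1, 1, device=device)
    fake = gen(noise).detach()                  # 3) no gradients flow back into the generator
    loss_critic = -(critic(real).mean() - critic(fake).mean())

    opt_critic.zero_grad()
    loss_critic.backward()                      # 1) no retain_graph=True needed here
    opt_critic.step()

    for p in critic.parameters():               # weight clipping (plain WGAN, no gradient penalty)
        p.data.clamp_(-WEIGHT_CLIP, WEIGHT_CLIP)

# Generator update: a fresh forward pass, this time keeping the graph.
noise = torch.randn(BATCH_SIZE, Z_DIM, 1, 1, device=device)
loss_gen = -critic(gen(noise)).mean()
opt_gen.zero_grad()
loss_gen.backward()
opt_gen.step()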
@miladaghajohari2308 (3 years ago)
This video is awesome. I already knew what WGAN and WGAN-GP were and had implemented them before; I watched this video just for the fun of it, and I can say it really showcases the implementation clearly.
@elahehshahmir322 (3 years ago)
I would really appreciate it if you presented a video about the code implementation of any type of GAN for tabular data, describing the concepts and dimensions in detail, since it is a bit confusing.
@youtubeadventurer1881 (3 years ago)
Not sure if this is an issue, but the algorithm in the WGAN paper draws new samples of real images each time in the inner critic update loop, whereas in this video you use the same batch of real images in each of the 5 inner loop iterations.
@khoivoinh-1099 (1 year ago)
I'm confused about the same thing.
@philwhln (3 years ago)
Another great video on GANs! I found the celeb dataset also needed transforms.CenterCrop(64), as transforms.Resize(64) resulted in a 64x78 image. Not sure why there was a difference there.
@AladdinPersson (3 years ago)
Yeah my bad, it should be transforms.Resize((64,64)), I think I had already pre-resized all of them or something and that's why I didn't get an error.
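For reference, a small sketch of the two transform pipelines being discussed, assuming torchvision and 64x64 three-channel CelebA-style images:

import torchvision.transforms as transforms

IMG_SIZE = 64

# Variant from the comment above: resize the shorter side to 64 (keeping the
# aspect ratio, e.g. 64x78), then cut out the central 64x64 patch.
transform_crop = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.CenterCrop(IMG_SIZE),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

# Variant from the reply: force both sides to 64 directly (may distort the aspect ratio).
transform_resize = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])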
@philwhln (3 years ago)
@@AladdinPersson Cool, thanks for the follow-up! That makes sense. I was looking through the changelog of torchvision to see if the behaviour of Resize had changed :D
@touchyto (2 years ago)
@@philwhln Why did you use Resize((64,64)) and then CenterCrop(64) with the same size values? I thought that if you resize an input image to size 64 it is not necessary to do a CenterCrop with the same size; it makes no sense. Am I right? Thank you!
@aadarshraj1890 (4 years ago)
Awesome, your teaching style is next level.
@benyoo2257 (2 years ago)
Great videos! I have to say I love you YouTubers sharing these high-quality tutorials!
@garyzhai9540 (2 years ago)
An absolutely phenomenal, informative, and insightful video
@АнварГаниев-ж7н (3 years ago)
Thanks for the video! :) I found out that instance norm and layer norm are not the same, but pretty similar. The difference is that instance norm normalizes across each channel in each training example, instead of normalizing across all input features of a training example.
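A small sketch of that difference in PyTorch terms; the shapes here are illustrative:

import torch
import torch.nn as nn

x = torch.randn(8, 64, 16, 16)  # (batch, channels, height, width)

# InstanceNorm2d: statistics are computed per sample AND per channel,
# i.e. over the 16x16 spatial positions of each channel separately.
inst = nn.InstanceNorm2d(64, affine=True)

# LayerNorm: statistics are computed per sample over all the given dimensions,
# here over channels and spatial positions together.
layer = nn.LayerNorm([64, 16, 16])

print(inst(x).shape, layer(x).shape)  # both keep the input shape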
@leojr8781 (2 years ago)
Hi Aladdin, thank you for sharing this amazing video. It would be great if you dove into the paper "Improving the Improved WGAN-GP": the proposed CT-GAN is more stable in training and effective both at generating samples and in semi-supervised settings.
@RovisoDominator (4 years ago)
Love this series on GANs... Will you do some videos on audio as well in the future?
@AladdinPersson (4 years ago)
Most likely, anything in particular you think would be useful? :)
@FLLCI (3 years ago)
Hi Aladdin, it'd be great if you made a quick video on how to implement spectral normalization in GANs using PyTorch's spectral_norm. I don't think it'd take much time. Thank you again!
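In case it helps anyone in the meantime, applying PyTorch's built-in spectral normalization to a layer is a one-liner; the layer sizes below are just examples:

import torch.nn as nn
from torch.nn.utils import spectral_norm

# Wrap each weight layer of the critic/discriminator with spectral_norm.
# It constrains the layer's spectral norm, which is another way to encourage
# the Lipschitz constraint instead of weight clipping or a gradient penalty.
disc_block = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
)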
@Axcellaful (2 years ago)
Would be great to see a Keras implementation!
@aymensekhri2133 (3 years ago)
Thank you very much for this amazing explanation
@zitopuxpuxpu6971 (3 years ago)
Thank you so much for this insightful video!
@pratikkorat790 (4 years ago)
Goooood, you're great... I tried almost every GAN, but this Wasserstein architecture never converged...
@clagro7338 (3 years ago)
Will you ever make a ProGAN video?
@usama57926 (2 years ago)
Very nice explanation, but I'm confused about the loss function. How does it work?
@HERiTAGE-ew7pf (4 years ago)
Awesome stuff!
@wolfisraging (4 years ago)
Awesome explanation 👍
@AladdinPersson (4 years ago)
Appreciate you Wolf!
@starlite5097 (2 years ago)
Thanks for the awesome video
@mikhaeldito (4 years ago)
I have officially joined the club :). Would you make some content on autoencoders in the future?
@AladdinPersson (4 years ago)
That's one good-looking badge you got 😍 Appreciate you supporting the channel! I have plans for autoencoders and VAEs, but for now the only focus is GANs.
@mikhaeldito (4 years ago)
@@AladdinPersson I look forward to it!
@samyalayse8945 (1 year ago)
I really need a video with this kind of explanation for GANs on 1D data 😭 but I can't find one anywhere.
@surayuthpintawong8332 (3 years ago)
Thank you so much.
@polimetakrylanmetylu2483 (3 years ago)
Hey, I did an experiment. Apparently changing Adam to SGD with Nesterov momentum causes the model to learn cross-hatching.
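For anyone who wants to reproduce that experiment, swapping the optimizers is a small change; this is just a sketch, and the learning rate here is a guess rather than a tuned value:

import torch.optim as optim

# Replace the usual Adam/RMSprop optimizers with SGD + Nesterov momentum.
# gen and critic are assumed to be the generator and critic models.
opt_gen = optim.SGD(gen.parameters(), lr=1e-3, momentum=0.9, nesterov=True)
opt_critic = optim.SGD(critic.parameters(), lr=1e-3, momentum=0.9, nesterov=True)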
@thiagoviek342 (4 years ago)
Great Video!
@UTTAMKUMAR-ic3di (2 years ago)
Hi @Aladdin Persson, why does it take so much time to train the critic and generator? It has taken more than 1 hour and is still running; I am using the MNIST dataset.
@srikanthramakrishna1073 (3 years ago)
nice video!! Thanks
@putteneersjoris (4 years ago)
Thank you!
@ViduzTube (1 year ago)
At 11:33, on line 70, I think you should do critic(fake.detach()), since it is inside the critic update phase and the generator weights must stay frozen, right?
@aaryannakhat1004 (1 year ago)
He uses retain_graph=True on line 73, which essentially does the same thing!
@ioanacretu3770 (1 year ago)
Does this retain the weights of the critic for the entire training?
@Woollzable (1 year ago)
@@aaryannakhat1004 No, they are not the same: using retain_graph=True is not the same as .detach(). Carlo Aironi is correct. By not detaching the generator output in the discriminator/critic update phase (11:33, line 70), you end up back-propagating through both networks, while you should only back-propagate through the critic/discriminator in the n_critic_iterations loop. Furthermore, retain_graph=True is not necessary in this WGAN implementation: in each loop you are supposed to sample new noise and a new batch of data from the dataset. In conclusion, during each loop you are updating the weights of the discriminator AND sampling both new noise and new data, so "retain_graph=True" is not necessary in this case. The typical use case for retain_graph=True is when you want to perform several backward passes on the same graph within a single iteration of the training loop. This is typically not the case in WGAN, as you usually compute the generator and critic losses separately and update their parameters separately, even though the critic is updated multiple times per generator update. In WGAN-GP, however, you need retain_graph=True, because the discriminator/critic's gradient norm is penalized if it deviates from 1.
@Woollzable (1 year ago)
@@ioanacretu3770 Please see my response to Aaryan_Nakhat :) I haven't checked the version with gradient penalty, only the plain WGAN (not WGAN-GP). When you call loss.backward(), it backpropagates through the entire graph. Since the fake images came from the generator, gradients will also be computed for the generator's weights, which should not be done at this stage. By using fake.detach(), you are telling PyTorch not to backpropagate through the generator here: backpropagation only goes back as far as fake = gen(z). The generator is updated in a separate step.
@baohuynh5462 (3 years ago)
Many thanks
@hackercop (3 years ago)
Sir, at 17:53 why is it that "if the norm is 1 for all interpolations then the Lipschitz constraint is satisfied"? To satisfy a Lipschitz constant of 1, doesn't the gradient only have to be a value between -1 and 1, not exactly equal to 1?
@MorisonMs (3 years ago)
20:07 The .repeat method is redundant, i.e. broadcasting should do it anyway, right?
@philwhln (3 years ago)
I thought the same. I've tested it and both result in the same tensor, so both should work. Since epsilon is used twice, I'm not sure if there's any optimization in using repeat() once instead of broadcasting twice, but the difference would be small.
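A quick way to check that both versions produce the same interpolated images, using dummy tensors:

import torch

real = torch.randn(4, 3, 64, 64)
fake = torch.randn(4, 3, 64, 64)
epsilon = torch.rand(4, 1, 1, 1)

# Version with an explicit repeat up to the full image shape.
eps_full = epsilon.repeat(1, 3, 64, 64)
interp_repeat = real * eps_full + fake * (1 - eps_full)

# Broadcasting version: (4, 1, 1, 1) broadcasts against (4, 3, 64, 64) automatically.
interp_broadcast = real * epsilon + fake * (1 - epsilon)

print(torch.allclose(interp_repeat, interp_broadcast))  # True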
@MorisonMs (3 years ago)
21:15 Question regarding autograd.grad: in this case the function output is a scalar, so why do we need to specify grad_outputs? (Maybe I don't understand autograd.grad correctly.) Thanks in advance, my brother.
@sumukhbhat1641 (3 years ago)
Great tutorial! One question though: what is the memory capacity of your GPU? Mine is 2 GB and it ran out of memory when I added the extra critic_iters loop.
@madhuvarun2790 (3 years ago)
I have a doubt. At 12:23 the code says critic.zero_grad(). Could you please explain what that line does? I thought we used zero_grad() only on optimizers.
@augustinestephens471 (3 years ago)
I think model.zero_grad() is the same as using optim.zero_grad() for the same model
@augustinestephens471 (3 years ago)
model.zero_grad() and optimizer.zero_grad() are the same IF all your model parameters are in that optimizer. I found it is safer to call model.zero_grad() to make sure all grads are zero, e.g. if you have two or more optimizers for one model.
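A tiny sketch of the situation described here, where one model's parameters are split across two optimizers (the split is made up for illustration):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))

# Two optimizers, each owning only part of the model's parameters.
opt_a = optim.Adam(model[0].parameters(), lr=1e-3)
opt_b = optim.Adam(model[2].parameters(), lr=1e-4)

# opt_a.zero_grad() would only clear the first layer's gradients;
# model.zero_grad() clears the gradients of every parameter in the model.
model.zero_grad()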
@jianningdeng3645 (3 years ago)
Hello, I've noticed that the critic loss is a negative number. Is this correct? And with the minimization, it goes down to an even larger negative number.
@dy6697 (3 years ago)
Great content
@mohdkashif7295 (3 years ago)
At 20:26, why did you expand the dims of epsilon using repeat? Doesn't multiplying the image by a scalar value work?
@faridalijani1578 (9 months ago)
This must be a wrong explanation in the video! One can simply do: epsilon = torch.rand(len(real), 1, 1, 1, device=device, requires_grad=True)
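For context, here is a compact sketch of the whole gradient-penalty computation being discussed; the naming is illustrative and may differ from the actual repo code:

import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    # Push the critic's gradient norm at points between real and fake images towards 1.
    batch_size = real.shape[0]
    # One epsilon per image; requires_grad=True (as suggested above) keeps the
    # interpolated images differentiable even if fake was detached earlier.
    epsilon = torch.rand(batch_size, 1, 1, 1, device=device, requires_grad=True)
    interpolated = real * epsilon + fake * (1 - epsilon)

    mixed_scores = critic(interpolated)

    # Gradient of the critic scores w.r.t. the interpolated images. grad_outputs
    # is needed because mixed_scores is a batch of scores, not a single scalar.
    gradient = torch.autograd.grad(
        outputs=mixed_scores,
        inputs=interpolated,
        grad_outputs=torch.ones_like(mixed_scores),
        create_graph=True,   # keep the graph so the penalty itself can be backpropagated
        retain_graph=True,
    )[0]

    gradient = gradient.view(batch_size, -1)
    gradient_norm = gradient.norm(2, dim=1)
    return torch.mean((gradient_norm - 1) ** 2)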
@mostechroom9780 (3 years ago)
Why couldn't you put critic.zero_grad() outside the critic iterations loop? Wouldn't placing it inside reset the gradients to zero within each iteration?
@arjunpukale3310 (4 years ago)
Please make videos on Pix2Pix and CycleGAN.
@AladdinPersson (4 years ago)
I also think the GAN specialization is great. I will also make videos on Pix2Pix and CycleGAN, but these things take time, so patience 🙏
@gtalpc59 (3 years ago)
@@riis08 I came from that course to these videos. They don't explain anything about the code implementation and only cover the theory basics. GAN implementation is tricky for beginners. Aladdin's videos are super useful, to the point, and very clear.
@riis08 (3 years ago)
@@gtalpc59 It's not about good or bad... @Aladdin Persson's lectures are amazing... I just said that the deeplearning.ai specialization is also available, and it helps a lot in understanding the concepts... In my point of view, that specialization is the best.
@generichuman_ (2 years ago)
Just confirming: for this implementation, are the losses for the critic and the generator supposed to approach 0? Mine are hovering around -60 for the critic and 300 for the generator. The image quality is constantly improving over time, but these loss values make me nervous. I know there's not much to go on here, but would lowering the learning rate be my best bet?
@EvanRickettsUk (1 year ago)
Did you ever sort this out? I have a similar problem with strange loss values.
@freghidelgrifo (1 month ago)
@@EvanRickettsUk Hi guys, do you have updates?
@suryagaur4363 (4 years ago)
Dude, please make a video on CycleGAN.
@Alinawazhusain (11 months ago)
You and I have the same code and the same dataset, but why is my output nothing like yours? It's not even 1 percent similar.
@jup4929 (4 years ago)
Thanks so much for the video! Do you know any good setups for local (non-cloud) computing on Windows, like IDEs for PyTorch, or are cloud computing platforms the way to go?
@AladdinPersson (4 years ago)
You could use Jupyter notebooks; personally I like using PyCharm (and notebooks sometimes), but as for IDEs I know many also like Visual Studio. I don't think cloud computing is the way to go long term, but it could work if you seldom train large models, etc.
@zulfiqaribrahim8750 (3 years ago)
Hi Aladdin, do you plan to code StyleGAN2?
@polimetakrylanmetylu2483 (3 years ago)
Hey, I don't understand why we put retain_graph in the backward call. In the simple GAN it doesn't make sense to do it; we even detach the fake images there. What's so different in WGANs that we don't do it, apart from it breaking the autograd.grad calculation?
@alexanderhaas564 (2 years ago)
Any update on this?
@shaikansarbasha4169 (3 years ago)
Hi bro, for the disc, features_g = 128, but you took 64.
@philosdata (2 years ago)
What VS Code theme is that?
@arhamkhan9489 (3 years ago)
Hi, I have an implementation of WGAN using weight clipping. I've copied over the models from a DCGAN implementation that works, just removing the sigmoid from the discriminator, similar to how your implementation works. I was wondering how you'd interpret the critic loss gradually decreasing and stabilizing around -6? I was led to believe that the critic loss should tend towards zero, and while the image quality does improve with training, it is quite bad. I'm training on a set of 3x256x256 images. I've tried increasing the critic training iterations and the learning rate but with similar results each time. I've also replaced BatchNorm in the critic with LayerNorm, keeping BatchNorm in the generator, but the same thing seems to happen. Do you have any suggestions for debugging?
@AladdinPersson (3 years ago)
I'd try with smaller image sizes to start with, and the most important parameter here is the learning rate.
@arhamkhan9489 (3 years ago)
@@AladdinPersson Thanks for the reply! The reason I'm a bit confused is that I've trained the corresponding DCGAN on the same image size as above. How would one adjust the architecture to account for different image sizes? I've read online that people have had problems adjusting for different image sizes. Also, how would you go about tuning the learning rate, would you try to make it lower? I've experimented with lr = 5e-5 to 1e-4.
@AladdinPersson (3 years ago)
@@arhamkhan9489 Do I understand it correctly that you are using the code from the implementation in the video and you set the image size to 256x256 and it's not working?
@AladdinPersson (3 years ago)
@@arhamkhan9489 Ok, I misunderstood. Honestly, it's extremely difficult to know what (if anything) could be wrong in your implementation. It could just come down to hyperparameter search, and then my tip is to have quick cycles so you can test things (which is why I recommended lowering the image size so it runs faster). Hopefully those hyperparameters then also scale when you increase the image size.
@arhamkhan9489 (3 years ago)
@@AladdinPersson The implementation is from a DCGAN implementation I wrote myself that did work. Thanks for the help, I'll have to continue to debug by tuning hyperparameters.
@ChrisOffner (3 years ago)
At ~7:15 you say that _"while theta has not converged"_ is equivalent to _"when the _*_critic's loss_*_ is very close to 0"._ Am I misunderstanding something or should it be _"when the _*_generator's loss_*_ is very close to 0",_ since the generator is the one who wants to push the Wasserstein distance to 0? 🤔 Thank you for this playlist, very helpful! :)
@AladdinPersson (3 years ago)
I believe it should be the critic's loss, because when we obtain a critic loss of 0 it means it can no longer distinguish between fakes and reals. Then the generator isn't getting any more useful information and we should stop training.
@ChrisOffner (3 years ago)
@@AladdinPersson Really? Then I may be misunderstanding how the critic's loss is calculated. Intuitively, a critic's loss of *0* would suggest to me that the critic does its job _perfectly_, i.e. that it can perfectly distinguish between fake and real samples.
@AladdinPersson (3 years ago)
@@ChrisOffner The critic wants to maximize the distance between the two distributions, so if the difference between the expectations is 0, it would mean it can't distinguish at all. Perhaps I'm missing something.
@ChrisOffner (3 years ago)
@@AladdinPersson Right, I guess we're just hung up on terminology here. I just associate "loss" with something to be minimised. So if the critic wants to _maximise_ the Wasserstein metric, I wouldn't call that the "critic loss".
@AladdinPersson (3 years ago)
@@ChrisOffner You're right, that was a bit confusing.
@sahil-7473 (3 years ago)
Damn! It won't even work with 4 GB of GPU memory when implemented exactly as per the pseudo-code. Well, I will move on. Great video! Edit 1: With the same code as yours, the only difference being swapping the loop over the batch and n_critic and running for 5 epochs, I was not getting any good image results on the MNIST dataset. I went to the paper and found that for some datasets they run for 200k epochs XD. It's no wonder I was not getting any results. Any guess how many epochs I should run for the MNIST dataset with WGAN-GP? Edit 2: I had miscoded the loss_critic line. I thought the whole expression is negated to maximize, but no, only (real - fake) is negated, plus lambda*gp. Damn! Even a small mistake takes you in a different direction.
@spikewong9740 (3 years ago)
Hi, one question: is it critic.zero_grad() or opt_critic.zero_grad()? Or are they the same? Can anyone tell me? Thanks.
@ssssssstssssssss (3 years ago)
They are not the same. One will zero the gradients of everything in the network, and the other will zero the gradients of the parameters passed into the optimizer, which aren't necessarily the same. If I'm not mistaken, opt_critic.zero_grad() is faster too.
@spikewong9740 (3 years ago)
@@ssssssstssssssss Thanks, so does it have any influence on the result?
@TheRealRoySadaka (3 years ago)
Great explanation, thank you! In WGAN-GP, the loss in the paper is: D(x') - D(x) + gp, so to minimize it I would have expected: -(D(x') - D(x) + gp), but in your implementation you did: -(D(x') - D(x)) + gp. Is this a bug, or is the code correct? Thank you in advance.
@AladdinPersson (3 years ago)
Hey Roy, very sorry for the delayed response. In the paper, equation 2 on page 2 specifies that the goal of the critic is to maximize E[D(x)] - E[D(x_tilde)], which essentially means that it's able to distinguish between the two. The notation here is that x is a real image and x_tilde is a fake generated image. You're right that in the code it looks a bit different, which is because we need to convert the maximization into a minimization problem, and we do that by simply multiplying by -1. There's obviously a simple way to cancel out the negative sign, but to stay close to the paper's representation I wrote it as -(E[D(x)] - E[D(x_tilde)]). For the gradient-penalty term we want to add it, because our goal (the optimal scenario) is for it to be 0. If we instead multiplied that part by a negative sign it would contradict our goal: instead of pushing the norm of the gradient towards 1, it would push it to be incredibly large and make the total loss very negative. This is also clearer in Algorithm 1 and equation 3 on page 4, where I believe they've already turned it into a minimization problem and therefore write it as a loss term.
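To make the signs concrete, here is roughly what the two loss lines being discussed look like in code; critic, real, fake, gradient_penalty, LAMBDA_GP and device are assumed to be defined elsewhere, so this is a sketch of the idea rather than a verbatim copy of the video's code:

import torch

# Critic: maximize E[D(real)] - E[D(fake)], i.e. minimize the negation,
# and ADD the gradient penalty, whose own optimum is 0 (so it is not negated).
gp = gradient_penalty(critic, real, fake, device=device)
loss_critic = -(torch.mean(critic(real)) - torch.mean(critic(fake))) + LAMBDA_GP * gp

# Generator: maximize E[D(fake)], i.e. minimize -E[D(fake)].
loss_gen = -torch.mean(critic(fake))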
@NinjTsax (1 year ago)
Thanks for the great video! I am trying to train on 256x256 images and I have tried using kernel size 14 for the first layer in the generator to reach 256x256 at the last layer. It seems to work quite well, with decent results from epoch 70. Is it better to add 2 extra layers instead of a large kernel? Like this:
# Input: N x channels_noise x 1 x 1
self._block(channels_noise, features_g * 16, 4, 1, 0),  # img: 4x4
self._block(features_g * 16, features_g * 8, 4, 2, 1),  # img: 8x8
self._block(features_g * 8, features_g * 4, 4, 2, 1),  # img: 16x16
self._block(features_g * 4, features_g * 2, 4, 2, 1),  # img: 32x32
self._block(features_g * 4, features_g * 2, 4, 2, 1),  # img: 64x64
self._block(features_g * 4, features_g * 2, 4, 2, 1),  # img: 128x128
nn.ConvTranspose2d(
    features_g * 2, channels_img, kernel_size=4, stride=2, padding=1
),  # Output: N x channels_img x 256 x 256
Should I increase the features_g multiplier so it goes up to 64 in the first layer?
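For comparison, one common way to extend the channel progression to 256x256 with extra 4x4, stride-2 blocks is to keep halving the feature multiplier at every upsampling step. The sketch below is just that pattern with made-up sizes, not a recommendation tuned for any particular dataset:

import torch
import torch.nn as nn

class Generator256(nn.Module):
    def __init__(self, z_dim=100, channels_img=3, features_g=64):
        super().__init__()

        def block(in_c, out_c, k=4, s=2, p=1):
            return nn.Sequential(
                nn.ConvTranspose2d(in_c, out_c, k, s, p, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(),
            )

        self.net = nn.Sequential(
            # Input: N x z_dim x 1 x 1
            block(z_dim, features_g * 16, 4, 1, 0),    # 4x4
            block(features_g * 16, features_g * 8),    # 8x8
            block(features_g * 8, features_g * 4),     # 16x16
            block(features_g * 4, features_g * 2),     # 32x32
            block(features_g * 2, features_g),         # 64x64
            block(features_g, features_g // 2),        # 128x128
            nn.ConvTranspose2d(
                features_g // 2, channels_img, kernel_size=4, stride=2, padding=1
            ),  # Output: N x channels_img x 256 x 256
            nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

# Quick shape check with dummy noise.
if __name__ == "__main__":
    z = torch.randn(2, 100, 1, 1)
    print(Generator256()(z).shape)  # torch.Size([2, 3, 256, 256])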