*My takeaways:*
1. History of neural networks 5:56
2. Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Parametric ReLU, ELU, Maxout 'Neuron' 14:04
3. Data preprocessing 33:56
3.1 For images 35:13
4. Weight initialization: Xavier, He 36:57
5. Batch normalization 51:36
6. Babysitting the learning process 59:27
7. Hyperparameter optimization 1:05:56
8. Summary 1:17:05
@mojidayo750 7 years ago
I can't stay as attentive in any lecture as I can in Andrej Karpathy's! Just perfect! I used to sleep in class or lose attention after 5 minutes. This guy is a genius. Thanks, this is gold!!!!
@mostinho7 4 years ago
1:40 You don't usually train a convolutional neural network from scratch; instead you do transfer learning, fine-tuning a pretrained network on your own data.
History (skip) until 14:00
14:30 Activation functions and their issues: gradients are killed during backprop because the sigmoid has a very small gradient when x is very large or very small.
17:00 Why we want our data (the inputs x) to be zero-centered.
34:00 Data preprocessing
38:00 Weight initialization
38:10 Multiplying values sampled from a unit Gaussian by a constant multiplies the standard deviation of the resulting data by that constant.
45:00 How to initialize the weights: with tanh you can use Xavier initialization; with ReLU, use the modified version of Xavier initialization at 48:30 (He initialization; see the sketch below).
TODO: rewatch batch normalization and what it does
51:30 Batch normalization
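A minimal numpy sketch of the two initialization schemes mentioned in these notes (the layer sizes are made up for illustration):

    import numpy as np

    fan_in, fan_out = 512, 256

    # Xavier/Glorot initialization: keeps the variance of activations roughly
    # constant across layers for activations that are ~linear around 0 (e.g. tanh).
    W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

    # He initialization: ReLU zeroes out half of its inputs, so the extra factor
    # of 2 compensates for the halved second moment of the activations.
    W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)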
@xuang8870 4 years ago
In case anyone is looking for vanishing and exploding gradients, they are covered under weight initialization at 37:00.
@ThienPham-hv8kx 3 years ago
In summary: use ReLU as the activation function, for safety. Data preprocessing: center the image data at zero; don't scale it too aggressively, or the input will lose some of the features inside the image. Initialize the weights carefully before training; the best initialization depends on how big the neural network is and what range of hidden-layer inputs and outputs you expect. If the initial weights are too tiny, they change slowly because the gradients are small. If they are too big, the neurons saturate, and the weights stay the same and are never updated. We can apply batch normalization to reduce the dependence on weight initialization. Try to find a good learning rate on small examples first.
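For the preprocessing point, a rough sketch of the centering typically used for images (the arrays below are small random stand-ins, not real data): subtract a mean computed on the training set, either a full mean image or a per-channel mean.

    import numpy as np

    X_train = np.random.rand(5_000, 32, 32, 3).astype(np.float32)  # stand-in images
    X_test = np.random.rand(1_000, 32, 32, 3).astype(np.float32)

    mean_image = X_train.mean(axis=0)          # one mean value per pixel (AlexNet-style)
    X_train_centered = X_train - mean_image
    X_test_centered = X_test - mean_image      # always reuse training-set statistics

    channel_mean = X_train.mean(axis=(0, 1, 2))  # per-channel mean (VGG-style alternative)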
@jingkangY 9 years ago
Meaty course; I feel like I read several core papers during the lecture!
@JingWang-q3t 3 months ago
You actually do, at least the core parts of those papers!
@kentasuzuki4522 7 years ago
1:07:42: why does not log-scaling the value (just sampling from a uniform distribution) end up in a bad region? I guess the "multiplicative effect" means the parameter value (the learning rate) is multiplied by the gradient and subtracted from the weights, but I'm not sure why that effect is the rationale for log-scaling.
@padenzimmermann1892 1 year ago
When this video was recorded I could not even factor a quadratic equation. Now I can watch this and follow the math with relative ease. Wow.
@bashhwu 7 years ago
One of the great lecturers. Thank you very much
@420lomo 4 years ago
54:43 If batch norm calculates the mean and variance of each batch while training, how can those be learnable parameters? Unlike the network weights, which are fixed, the values of gamma and beta will vary depending on the batch. Unless the means can be expected to be constant (which I'm sure is unreasonable), that doesn't make sense to me.
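For anyone with the same confusion, a minimal numpy sketch of the split (the names are illustrative): gamma and beta are ordinary parameters learned by gradient descent, while the per-batch mean and variance are just statistics, replaced by running averages at test time.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                          train=True, momentum=0.9, eps=1e-5):
        # x: (N, D) mini-batch; gamma, beta: (D,) learned like weights,
        # not recomputed per batch.
        if train:
            mu = x.mean(axis=0)       # batch statistics, not learned
            var = x.var(axis=0)
            running_mean = momentum * running_mean + (1 - momentum) * mu
            running_var = momentum * running_var + (1 - momentum) * var
        else:
            mu, var = running_mean, running_var   # fixed statistics at test time
        x_hat = (x - mu) / np.sqrt(var + eps)
        out = gamma * x_hat + beta    # learned scale and shift
        return out, running_mean, running_var

    x = np.random.randn(64, 10)
    out, rm, rv = batchnorm_forward(x, np.ones(10), np.zeros(10),
                                    np.zeros(10), np.ones(10))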
@sedthh 4 years ago
Please upload more of your lectures. These videos should be the go-to material for learning neural networks, instead of all the shallow summaries that get a lot of views on YouTube.
@adosar7261 1 year ago
Can someone explain why the variance is reduced by 1/2 when using ReLU? Take for example a sample of size=100_000 from a normal distribution N(0, a), pass it through the ReLU and then calculate its variance. Would it be a/2? Moreover, at 45:30, why does setting larger weights change the shape of the distribution of activations compared to Xavier? I would expect a flatter distribution, not that shape with peaks at the boundaries. Finally, how are these distributions of activations computed? By passing many samples through the network with fixed weights?
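A quick empirical check of the first question (a sketch with a standard normal input): the plain variance of the ReLU output is not a/2; it is the second moment E[x^2], which is what the He-initialization derivation tracks, that gets exactly halved.

    import numpy as np

    x = np.random.randn(100_000)         # N(0, 1), variance 1
    r = np.maximum(0.0, x)               # ReLU

    print(r.var())                       # ~0.34 (= 1/2 - 1/(2*pi)), not 1/2
    print((r ** 2).mean())               # ~0.50: the second moment is halved by symmetry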
@thedarkknight579 2 years ago
This is Pure Gold!!!!
@wardo5840 5 years ago
Can someone clarify 1:07:20 for me, where he explains that it's best to optimize in log space? I can't see the presentation.
@jklasfjkl 3 years ago
Late reply, but log space is most often used to combat underflow or zero issues, i.e. when you multiply many things together, a single 0 makes the whole expression 0. In log space you add everything instead of multiplying, so that problem disappears.
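For 1:07:20 specifically, the lecture's point is about sampling hyperparameters such as the learning rate, which act multiplicatively: sampling the exponent uniformly spreads candidates evenly across orders of magnitude, whereas sampling the raw value uniformly wastes almost all samples near the top of the range. A small sketch (the range is made up):

    import numpy as np

    # Uniform in raw space: ~90% of samples land in [1e-4, 1e-3],
    # and the decades 1e-7 ... 1e-5 are almost never explored.
    lr_uniform = np.random.uniform(1e-7, 1e-3, size=10)

    # Uniform in log space: each decade from 1e-7 to 1e-3 is sampled equally often.
    lr_log = 10 ** np.random.uniform(-7, -3, size=10)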
@hnkulkarni 8 years ago
Fantastic lecture!!
@helenazeng6793 7 years ago
Thank you! This is so much better than my professor's lectures
@stevetj926 5 years ago
great lecture! Thanks Andrej
@bingeltube 6 years ago
Highly recommended! Lots of excellent, practical tips! Great discussion of important subjects like the pros & cons of activation functions, weight initialization, etc.
@brandomiranda6703 8 years ago
At 1:05:02 the speaker explains why the loss stays (roughly) the same while the accuracy increases. Is this because the loss doesn't directly measure the 0-1 loss but the softmax cross-entropy, whereas accuracy is exactly a 0-1 measure, so the accuracy can change much more than the loss? Or what is the reason?
@hnmusac9299 7 years ago
I think in this example the loss is dominated by the (activations of the) wrong labels. Even though some test examples are switching from wrong to correct labels, that is not enough to have a visible impact on the loss. This loss is more 'cruel' about your mistakes than it is rewarding of your successes.
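A tiny numeric illustration of that effect (the numbers are made up): a small shift in the predicted probabilities can flip the argmax, and so the accuracy, while the cross-entropy loss barely moves.

    import numpy as np

    def xent(p, y):
        return -np.log(p[y])

    # Two-class example; the true class is 0.
    before = np.array([0.49, 0.51])   # argmax is wrong -> counted as an error
    after = np.array([0.51, 0.49])    # argmax is right -> counted as correct

    print(xent(before, 0), xent(after, 0))   # 0.713 vs 0.673: the loss barely changes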
@AlqGo 7 years ago
55:32 "Batch normalisation can act as an identity function..." I think he means an inverse function rather than an identity function, because an identity function cannot undo anything.
@AlqGo 7 years ago
Ah right...forgot that the sigma and beta are implemented in the same layer as the unit normalization
@7688abinash 7 years ago
Enjoyed it whole-heartedly.... finally an awesome lecture...
@AllanZelener 9 years ago
It would be nice to mention Bayesian optimization as a line of research for automating hyperparameter search. I appreciate all the little tips on current best practices sprinkled throughout the lecture and would like to know your thoughts on this as well.
@Vulcapyro 9 years ago
+Allan Zelener This is mentioned briefly in the course notes :)
@HrishikeshKamath 6 years ago
At 1:01:30, why does the loss increase as the regularization strength increases?
@sokhibtukhtaev9693 6 years ago
It's because the regularization term is added to the loss on purpose, so the reported loss goes up; the gradients on the W's then change the weights accordingly to push the loss back down.
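Concretely, the number being plotted is the data loss plus an L2 penalty, so a larger regularization strength adds a larger positive term even when the data loss is unchanged. A minimal sketch (the names are made up):

    import numpy as np

    def total_loss(data_loss, W, reg):
        # reg is the regularization strength (lambda); increasing it directly
        # increases the reported loss.
        return data_loss + 0.5 * reg * np.sum(W * W)

    W = np.random.randn(10, 10)
    print(total_loss(1.0, W, reg=0.0))
    print(total_loss(1.0, W, reg=1e-2))   # strictly larger for the same data loss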
@fahmikahlaoui7717 7 years ago
Please upload this year's lectures. Thanks so much for sharing the knowledge :) :) Respect
@WahranRai 6 years ago
7:31 I disagree, because in the update rule they are updating the weights with a part of the error (the difference between the desired class and the output class): it is like a gradient!!!
@SabaRahimi 3 years ago
wow! This is so good!!!
@ProfessionalTycoons 6 years ago
amazing lecture!!!!
@jessehao590 7 years ago
I love this accent and the pace, don't you think?
@ShangDaili 7 years ago
Slide 35: I am totally lost. Why are the gradients on w always all positive or all negative?
@BhupinderSinghj 7 years ago
Because wi*xi is a multiply gate, it acts as a gradient switcher: the gradient at x is w*(upstream gradient), and the gradient at w is x*(upstream gradient). So if the x's are all positive, the gradients on a neuron's w's are either all positive (if the upstream gradient is positive) or all negative (if it's negative).
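A small numeric sketch of that argument for a single neuron (the values are made up): with all-positive inputs, every component of dL/dw shares the sign of the single upstream gradient.

    import numpy as np

    x = np.array([0.2, 1.3, 0.7])   # all-positive inputs (e.g. sigmoid outputs)
    upstream = -0.8                 # dL/d(output) flowing back into w.x

    dw = upstream * x               # dL/dw_i = upstream * x_i
    print(dw)                       # all entries negative: in one step the w's can
                                    # only all increase or all decrease together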
@IbrahimSobh 8 years ago
I was wondering: why not use the simple tanh activation with some learned parameter(s) that stretch the zero-centered "S" shape horizontally so it behaves more like ReLU? Do you think that would be a good idea? (Still zero-centered, not easy to saturate, learnable, but one extra parameter per neuron.)
@realGBx64 6 years ago
If you consider how it is done, multiplying the inputs by the W matrix already effectively stretches the function a lot along the x axis, while the bias shifts the center left and right.
@fgfanta 7 years ago
What works, what doesn't, and how to make it work; this is gold! Stuff I typically don't find in papers and books.
@perlaz1166 5 years ago
25:34 What does "knocked off the data manifold" mean?
@miladaghajohari2308 4 years ago
It means that the ReLU may end up never activated on the entire dataset. Remember that a ReLU receives activations from the previous layer; if its input is negative on every training example, no gradient flows backward through it to the previous layers. That's what the data manifold refers to: it is the region of the input space that the data is actually distributed on, and the dead ReLU's active region no longer overlaps it. See www.deeplearningbook.org/contents/ml.html , chapter 5, and search for "manifold"; there is a nice explanation there. I hope this helps.
@ChundeHuang 7 years ago
Thanks very much, it's very helpful!
@saltrocklamp199 6 years ago
If the non-zero-centered output of the sigmoid is a problem, why don't they just subtract 0.5 from it? The gradient would be the same but the output would be centered.
@ta6847 5 years ago
tanh is effectively a sigmoid rescaled to (-1, 1). If mean-centered data is so important, why is ReLU the standard (which gives either all-positive or all-zero weight gradients)? Maybe with sigmoids the gradients accumulate to be increasingly positive for some neurons and increasingly negative for others, making the entire network less stable.
@akzsh 9 months ago
In 2015, the world didn't know about the problem of batch normalization
@nguyenthanhdat93 8 years ago
Why do we need to zero-center the data during the preprocessing step? How does it benefit the network's performance? Thanks
@AhmedKachkach 7 years ago
If you're using something like the sigmoid activation, zero-centering data means that you'll be in the zone where the gradient takes the biggest values, reaching convergence faster. If your data is on the edges of the sigmoid's range, the gradients will be very small and might end up being (almost) equal to zero. In practice, normalising data is not easy since you do not know the range of your data at training time (since you train batch by batch), and the mean of your data might actually be valuable (+ some activation functions do not have gradients that max at 0), but these problems are solved by batch norm.
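A quick numeric sketch of the first point (just an illustration): the sigmoid's gradient is largest around 0 and collapses for large-magnitude inputs, which is why zero-centered inputs propagate gradients better.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [0.0, 2.0, 5.0, 10.0]:
        print(x, sigmoid_grad(x))   # 0.25, 0.105, 0.0066, 4.5e-05: saturation kills the gradient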
@AhmedKachkach 7 years ago
Just watched the part where he discusses this. I think he gives many arguments for what's desirable (beyond what I said).
@zeyadezzat8893 5 years ago
I had a hard time with this; that's a clear explanation, thank you.
@joshuazastrow5866 7 years ago
In Reference to Batch Normalization: Is it recommended to use batch normalization on the final layer? (Final fully connected layer -> Batch Normalization -> softmax/svm -> output?)
@adamb.2390 9 years ago
lossfunctions will be my favorite tumblr now :)
@wiiiiktor 6 years ago
kzbin.info/www/bejne/nYrToH2DocysjqM The grid on the right can simply be skewed. Additionally, the grid can move +/-1% up & down and left & right (so 5 computations of the grid in total) to estimate a local gradient, which can be used to find a local min/max.
@kyuhyoungchoi 8 years ago
On slide 47 it says "TLDR". What does that mean?
@souslicer 8 years ago
+kyuhyoung choi choi..
@nguyenthanhdat93 8 years ago
Too Long; Didn't Read
@leavexu 7 years ago
Awesome!
@LouisChiaki 7 years ago
Very, very helpful lecture on tuning neural networks!!
@tisblaze 4 years ago
Great lecture by Justin Timberlake. Very well organized too.
@79oose 3 years ago
that's a funny comment hhhhh
@tisblaze 3 years ago
@@79oose 😂
@Age_of_Apocalypse 6 years ago
This guy is talking fast ... LOL. I really love his videos on AI: Thank You!
@aueret 3 years ago
lol, I am listening to it at 0.5x speed :)
@aleksandarabas6618 2 years ago
I wasn't blindfolded :P
@rezaghoddoosian1 8 years ago
Guys, we learn the W's (weights), but what about the bias (b)? If we don't learn it, wouldn't that cause a problem?
@dzianish6223 8 years ago
The bias is also learned. You can think of b as just another parameter (like w); the only difference is that it multiplies a constant 1 instead of a varying x.
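A tiny sketch of that view (the numbers are made up): append a constant 1 to the input and the bias becomes an ordinary weight, updated by the same gradient step as the others.

    import numpy as np

    x = np.array([0.5, -1.2, 2.0])
    w = np.array([0.1, 0.4, -0.3])
    b = 0.2

    # The same neuron written two equivalent ways:
    out1 = w @ x + b
    out2 = np.append(w, b) @ np.append(x, 1.0)   # bias = weight on a constant-1 input
    print(np.isclose(out1, out2))                # True, so b is learned like any weight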
@bayesianlee6447 6 years ago
It was definitely very hard to follow everything he says at first, but now I'm kind of enjoying it lol
@tai94bn 9 years ago
Do students learn programming skills in this class?
@uidsuivn101 8 years ago
Actually working on the assignments and project takes a fair bit of effort and programming work, but the objective of this course isn't to teach a student programming.
@ziweixu1573 7 years ago
Confused at kzbin.info/www/bejne/nYrToH2DocysjqM . How is sampling in log space better than sampling directly from a uniform range, given that the learning rate and regularization act multiplicatively?
@mojidayo750 7 years ago
same for me
@priojeetpriyom 6 years ago
What is a unit Gaussian?
@idealistdev 5 years ago
A Gaussian (normal) distribution with mean=0 and std=1.
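A quick sanity check of that definition (numpy's randn draws from exactly this distribution):

    import numpy as np

    samples = np.random.randn(1_000_000)   # unit Gaussian, N(0, 1)
    print(samples.mean(), samples.std())   # ~0.0 and ~1.0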