CS231n Winter 2016: Lecture 5: Neural Networks Part 2

185,275 views

Andrej Karpathy

A day ago

Comments: 84
@leixun 4 years ago
*My takeaways:*
1. History of neural networks 5:56
2. Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Parametric ReLU, ELU, Maxout "neuron" 14:04
3. Data preprocessing 33:56
3.1 For images 35:13
4. Weight initialization: Xavier, He 36:57
5. Batch normalization 51:36
6. Babysitting the learning process 59:27
7. Hyperparameter optimization 1:05:56
8. Summary 1:17:05
@mojidayo750 7 years ago
I can't stay attentive to any lecture the way I can with Andrej Karpathy's. Just perfect! I used to sleep in class or lose attention after 5 minutes. This guy is a genius. Thanks, this is gold!
@mostinho7 4 years ago
1:40 You don't train a convolutional neural network from scratch; instead you do transfer learning, fine-tuning a pretrained network on your own data.
History (skip) until 14:00
14:30 Activation functions and their issues: gradients get killed during backprop because the sigmoid function has a very small gradient when x is large or small.
17:00 Why we want our data (the inputs x) to be zero-centered.
34:00 Data preprocessing
38:00 Weight initialization
38:10 Multiplying values sampled from a unit Gaussian by a constant multiplies the standard deviation of the resulting data by that constant.
45:00 How to initialize the weights: with tanh you can use Xavier initialization, but with ReLU use the modified version of Xavier initialization shown at 48:30.
TODO: rewatch batch normalization and what it does
51:30 Batch normalization
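A small NumPy sketch of the initialization points in those notes (my own code, not from the lecture): scaling unit-Gaussian samples scales their standard deviation, and Xavier/He just choose that scale from the fan-in.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 500, 500

# Scaling unit-Gaussian samples by a constant c multiplies their std by c (38:10).
w = rng.standard_normal((fan_in, fan_out))
print(w.std(), (0.01 * w).std())          # ~1.0, ~0.01

# Xavier initialization (reasonable with tanh): std = 1/sqrt(fan_in)
w_xavier = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# He initialization (the ReLU correction at 48:30): std = sqrt(2/fan_in)
w_he = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
```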
@xuang8870 4 years ago
In case anyone is looking for vanishing and exploding gradients, they are covered under weight initialization at 37:00.
@ThienPham-hv8kx 3 years ago
In summary: use ReLU as the activation function to be safe. Data preprocessing: center image data at zero, and don't rescale it too aggressively or the input will lose some of the features in the image. Initialize the weights carefully before training; the best initialization depends on how big the network is and on the expected distributions of the hidden-layer inputs and outputs. If the weights are too small, they change slowly because the gradients are small. If the weights are too big, the neurons saturate and the weights stay nearly the same and never get properly updated. We can apply batch normalization to reduce the dependence on weight initialization. Try to find a good learning rate on small examples first.
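A minimal sketch of the image-centering step mentioned above (my own code; the shape of `X_train` is an assumption, e.g. flattened CIFAR-10 images):

```python
import numpy as np

# X_train: (N, D) array of flattened image pixels (assumed shape)
X_train = np.random.default_rng(0).uniform(0, 255, size=(50_000, 3072))

mean_image = X_train.mean(axis=0)     # per-pixel mean over the training set
X_centered = X_train - mean_image     # zero-center; the same mean_image is
                                      # subtracted from val/test images later

# Per the lecture, images are usually only centered, not heavily rescaled
# or whitened, since all pixels already share a common scale.
```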
@jingkangY 9 years ago
Meaty course, and I feel like I read several core papers during the lecture!
@JingWang-q3t 3 months ago
You actually do, at least the core parts of those papers!
@kentasuzuki4522 7 years ago
1:07:42 Why does not log-scaling the values end up sampling a bad region (i.e., just sampling from a uniform distribution)? I guess the multiplicative effect means the parameter value (the learning rate) multiplies the gradient that gets subtracted from the weights, but I'm not sure why that effect is the rationale for log-scaling.
@padenzimmermann1892 1 year ago
When this video was recorded I could not even factor a quadratic equation. Now I can watch this and follow the math with relative ease. Wow.
@bashhwu 7 years ago
One of the great lecturers. Thank you very much.
@420lomo 4 years ago
54:43 If batch norm calculates the mean and variance of each batch while training, how can those be learnable parameters? Unlike the network weights, which are fixed, the values of gamma and beta would vary depending on the batch. Unless the means can be expected to be constant (which I'm sure is unreasonable), that doesn't make sense to me.
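A sketch of the batch-norm forward pass that may clear this up (my own simplified code, not the lecture's): the per-batch mean and variance are computed, not learned; only gamma and beta are parameters trained alongside the weights, and running averages of the batch statistics are what get used at test time.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.9, eps=1e-5, train=True):
    """x: (N, D) activations. gamma, beta: (D,) learned parameters."""
    if train:
        mu = x.mean(axis=0)                  # batch statistics: computed, not learned
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)
        # keep running averages of the batch statistics for test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    out = gamma * x_hat + beta               # gamma/beta: updated by backprop like weights
    return out, running_mean, running_var
```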
@sedthh 4 years ago
Please upload more of your lectures. These videos should be the go-to material for learning neural networks, instead of all the shallow summaries that get a lot of views on YouTube.
@adosar7261 1 year ago
Can someone explain why the variance is reduced by 1/2 when using ReLU? Take, for example, a sample of size 100,000 from a normal distribution N(0, a), pass it through the ReLU, and then calculate its variance: would it be a/2? Moreover, at 45:30, why does setting larger weights change the shape of the activation distribution compared to Xavier? I would expect a flatter distribution than with Xavier, not that shape with peaks at the boundaries. Finally, how are these distributions of activations calculated? By passing many samples through the network with fixed weights?
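A quick numerical check of the first question (my own snippet, not from the lecture): the variance of the ReLU output is not exactly a/2, but the second moment is, and the second moment is the quantity the He-initialization argument halves.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 4.0                                   # variance of the input Gaussian
x = rng.normal(0.0, np.sqrt(a), 100_000)  # sample ~ N(0, a)
y = np.maximum(0.0, x)                    # ReLU

print(np.var(y) / a)       # ~0.34: Var[ReLU(x)] = a * (1/2 - 1/(2*pi))
print(np.mean(y**2) / a)   # ~0.5: the *second moment* is halved, which is
                           # what motivates the extra factor of 2 in He init
```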
@thedarkknight579 2 years ago
This is pure gold!
@wardo5840 5 years ago
Can someone clarify 1:07:20, where he explains that it's best to optimize in log space? I can't see the presentation.
@jklasfjkl 3 years ago
Late reply, but log space is most often used to combat underflow or zero issues, i.e. when you multiply many things, a single 0 makes the whole expression 0. In log space you add everything instead of multiplying, so the problem disappears there.
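For the specific point at 1:07:20, the lecture's argument is about sampling hyperparameters: the learning rate and regularization strength act multiplicatively on the updates, so it is more natural to sample their exponent uniformly. A minimal sketch of that log-uniform sampling (variable names and ranges are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Log-uniform: sample the exponent, then exponentiate. Every decade
# (1e-6..1e-5, 1e-5..1e-4, ...) gets roughly the same number of trials.
lr  = 10 ** rng.uniform(-6, -1)   # learning rate in [1e-6, 1e-1]
reg = 10 ** rng.uniform(-5, 0)    # regularization strength in [1e-5, 1]

# Naive uniform sampling spends almost all trials near the top of the range:
# ~99% of uniform draws from [1e-6, 1e-1] land above 1e-3.
bad_lr = rng.uniform(1e-6, 1e-1)
```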
@hnkulkarni 8 years ago
Fantastic lecture!!
@helenazeng6793 7 years ago
Thank you! This is so much better than my professor's lectures.
@stevetj926 5 years ago
Great lecture! Thanks, Andrej.
@bingeltube 6 years ago
Highly recommended! Lots of excellent, practical tips, and a great discussion of important subjects like the pros and cons of activation functions, weight initialization, etc.
@brandomiranda6703 8 years ago
At 1:05:02 the speaker explains why the loss remains (roughly) the same while the accuracy increases. Is this because the loss doesn't directly measure the 0-1 loss but the softmax loss instead, while accuracy is exactly a 0-1 measure, so the accuracy can change more than the loss? Or why is it?
@hnmusac9299 7 years ago
I think in this example the loss is dominated by the (activations of the) wrong labels. Even though some test examples switch from wrong to correct, that is not enough to make a visible impact on the loss. This loss is more 'cruel' about your mistakes than it is rewarding of your successes.
@AlqGo 7 years ago
55:32 "Batch normalisation can act as an identity function..." I think he means an inverse function instead of an identity function, because an identity function cannot undo anything.
@AlqGo 7 years ago
Ah right... I forgot that the gamma and beta are implemented in the same layer as the unit normalization, so the layer as a whole can learn to undo it.
@7688abinash 7 years ago
Enjoyed it wholeheartedly... finally an awesome lecture.
@AllanZelener 9 years ago
It would be nice to mention Bayesian optimization as a line of research for automating hyperparameter search. I appreciate all the little tips on current best practices sprinkled throughout the lecture and would like to know your thoughts on this as well.
@Vulcapyro 9 years ago
+Allan Zelener This is mentioned briefly in the course notes :)
@HrishikeshKamath 6 years ago
At 1:01:30, why does the loss increase as the regularization strength increases?
@sokhibtukhtaev9693 6 years ago
It's because the regularization penalty is added directly to the loss, so a larger regularization strength makes the total loss larger; the gradients of the W's then push the weights toward values that bring the loss back down.
@fahmikahlaoui7717 7 years ago
Please upload this year's lectures, and thanks so much for sharing the knowledge :) :) Respect
@WahranRai 6 years ago
7:31 I disagree, because in the update rule they are updating the weights with a portion of the error (the difference between the desired class and the output class): it is like a gradient!
@SabaRahimi 3 years ago
Wow! This is so good!!!
@ProfessionalTycoons 6 years ago
Amazing lecture!
@jessehao590 7 years ago
I love this accent and speed, don't you?
@ShangDaili 7 years ago
Slide 35: I am totally lost. Why are the gradients on w always all positive or all negative?
@BhupinderSinghj 7 years ago
Because wi*xi is a multiply gate, it acts as a gradient switcher: the gradient at x is w times the upstream gradient, and the gradient at w is x times the upstream gradient. So if the x's are all positive, the gradients on the w's are all positive (if the upstream gradient is positive) or all negative (if it's negative).
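A tiny numeric illustration of that "gradient switcher" point (my own toy example, values invented):

```python
import numpy as np

x = np.array([0.5, 1.2, 2.0])        # all-positive inputs to one neuron
w = np.array([0.3, -0.7, 0.1])
upstream = -1.5                       # dL/dz flowing back into z = w.x

dz_dw = x                             # local gradient of z = w.x w.r.t. w
dL_dw = upstream * dz_dw              # [-0.75, -1.8, -3.0]: all the same sign

# Because x > 0 elementwise, the sign of every dL/dw_i equals the sign of the
# upstream gradient, so the weight vector can only move in "all +" or "all -"
# directions for this example -- the zig-zag path on the slide.
print(dL_dw)
```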
@IbrahimSobh 8 years ago
I was thinking: why not use the simple tanh activation function with some learnable parameter(s) that stretch the zero-centered "S" shape horizontally so it behaves more like ReLU? Do you think it would be a good idea? (Still zero-centered, not easy to saturate, and learnable, at the cost of one extra parameter per neuron.)
@realGBx64 6 years ago
If you consider how it is done, the multiplication of the inputs by the W matrix already stretches the function along the x axis, while the bias shifts the center left and right.
@fgfanta 7 years ago
What works, what doesn't, and how to make it work; this is gold! Stuff I typically don't find in papers and books.
@perlaz1166 5 years ago
25:34 What does "knocked off the data manifold" mean?
@miladaghajohari2308 4 years ago
It means that the ReLU may stop activating on the entire dataset. Remember, a ReLU receives activations from the previous layer; if all of those pre-activations are negative, the gradient will not flow backward to the previous layers, and the ReLU keeps receiving negative inputs on all of the training examples. That's what he means by the data manifold: the data manifold is a region in the input space that the data is distributed on. See www.deeplearningbook.org/contents/ml.html, chapter 5, and search for "manifold"; there is a nice explanation there. I hope that helps.
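A small sketch of the "dead ReLU" situation being described (my own toy example, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))          # toy dataset

w = rng.normal(size=10)
b = -100.0                               # bias knocked far negative by a bad update

z = X @ w + b                            # pre-activations: negative for every example
a = np.maximum(0.0, z)                   # ReLU output is 0 everywhere

# The ReLU's local gradient is 0 wherever z <= 0, so no gradient ever reaches
# w or b again: the unit is "dead" for the whole dataset.
print((z > 0).sum(), a.sum())            # 0 0.0
```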
@ChundeHuang 7 years ago
Thanks very much, it's very helpful!
@saltrocklamp199 6 years ago
If the sigmoid output being non-centered is a problem, why not just subtract 0.5 from it? The gradient would be the same but the output would be centered.
@ta6847 5 years ago
tanh is effectively a sigmoid rescaled to (-1, 1), which is what that would give you. But if mean-centered outputs are so important, why is ReLU the standard, when its outputs are all positive or zero and therefore give same-sign weight gradients downstream? Maybe with sigmoids the gradients accumulate to be increasingly positive for some neurons and increasingly negative for others, making the whole network less stable.
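On the first point, the exact relationship is tanh(x) = 2·sigmoid(2x) − 1, i.e. a shifted-and-rescaled sigmoid; a quick check (my own snippet):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# tanh is a sigmoid rescaled to (-1, 1): tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```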
@akzsh 9 months ago
In 2015, the world didn't know about the problems of batch normalization.
@nguyenthanhdat93 8 years ago
Why do we need to zero-center the data during the preprocessing step? How does it benefit network performance? Thanks
@AhmedKachkach 7 years ago
If you're using something like the sigmoid activation, zero-centering the data means you'll be in the zone where the gradient takes its biggest values, so you reach convergence faster. If your data sits at the edges of the sigmoid's range, the gradients will be very small and might end up (almost) equal to zero. In practice, normalizing the data is not easy since you do not know its full range at training time (you train batch by batch), and the mean of your data might actually be informative (plus some activation functions do not have gradients that peak at 0), but these problems are addressed by batch norm.
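A quick numeric illustration of the gradient-size point above (my own snippet): the sigmoid's local gradient σ(x)(1 − σ(x)) peaks at x = 0 and collapses away from it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    grad = s * (1 - s)        # local gradient of the sigmoid at x
    print(x, grad)            # 0.25, ~0.105, ~0.0066, ~0.000045
```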
@AhmedKachkach 7 years ago
Just watched the part where he discusses this; he gives several more arguments for why it's desirable, beyond what I said.
@zeyadezzat8893 5 years ago
I've had a hard time with this; that's a clear explanation, thank you.
@joshuazastrow5866 7 years ago
In reference to batch normalization: is it recommended to use it on the final layer? (Final fully connected layer -> batch normalization -> softmax/SVM -> output?)
@adamb.2390 9 years ago
lossfunctions will be my favorite Tumblr now :)
@wiiiiktor 6 years ago
kzbin.info/www/bejne/nYrToH2DocysjqM The grid on the right could simply be skewed. Additionally, the grid could be shifted +/-1% up and down and left and right (so five evaluations of the grid in total) to estimate a local gradient, which could be used to find a local min/max.
@kyuhyoungchoi 8 years ago
On the 47th slide it says "TLDR". What does that mean?
@souslicer 8 years ago
+kyuhyoung choi choi..
@nguyenthanhdat93 8 years ago
Too Long; Didn't Read
@leavexu 7 years ago
Awesome!
@LouisChiaki 7 years ago
Very, very helpful lecture on tuning neural networks!!
@tisblaze 4 years ago
Great lecture by Justin Timberlake. Very well organized too.
@79oose 3 years ago
That's a funny comment, hahaha.
@tisblaze 3 years ago
@79oose 😂
@Age_of_Apocalypse 6 years ago
This guy talks fast... LOL. I really love his videos on AI. Thank you!
@aueret 3 years ago
Lol, I am listening to it at 0.5x speed :)
@aleksandarabas6618 2 years ago
I wasn't blindfolded :P
@rezaghoddoosian1 8 years ago
Guys, we learn the Ws (weights), but what about the bias terms (b)? If we don't learn them, wouldn't that cause a problem?
@dzianish6223 8 years ago
The bias is also learned. You can think of b as just another parameter (like w); the only difference is that it multiplies a constant 1 instead of a varying x.
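A minimal sketch of that point (my own toy example): the bias gets its own gradient and update, and the "multiplies a constant 1" view is exactly the bias trick of appending a 1 to the input.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])
b = 0.05

# Forward: z = w.x + b, with a toy squared-error loss L = 0.5*(z - t)^2
t = 1.0
z = w @ x + b
dz = z - t                     # dL/dz

# Backward: both w and b receive gradients and get updated
dw = dz * x                    # dL/dw
db = dz * 1.0                  # dL/db -- the "constant 1" input
lr = 0.1
w -= lr * dw
b -= lr * db

# Equivalent "bias trick": append 1 to x and fold b into w
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)
print(np.isclose(w_aug @ x_aug, w @ x + b))   # True
```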
@bayesianlee6447 6 years ago
It was definitely very hard to follow everything he says at first, but now I'm kind of enjoying it, lol.
@tai94bn 9 years ago
Do students learn programming skills in the class?
@uidsuivn101 8 years ago
Working on the assignments and project takes a fair bit of effort and programming work, but the objective of this course isn't to teach programming.
@ziweixu1573 7 years ago
Confused at kzbin.info/www/bejne/nYrToH2DocysjqM. How is sampling in log space better than sampling directly from uniform space, given that the learning rate and regularization act multiplicatively?
@mojidayo750 7 years ago
Same for me.
@priojeetpriyom 6 years ago
What is a unit Gaussian?
@idealistdev 5 years ago
A Gaussian (normal) distribution with mean 0 and standard deviation 1.
@techonlyguo9788 9 years ago
Thanks!
@ahmettavli4205 3 years ago
14:31 tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
@realGBx64 6 years ago
I really love this lecture, but somehow I got stuck on the fact that this guy has a Russian first name, a Hungarian surname, and a perfect American accent.
@smyk1975 5 years ago
Because he is Slovakian. He has also lived in Canada since the age of 15.
@dmitry926 7 years ago
The batch normalization part is a complete mess...
@taopaille-paille4992 7 years ago
I find Andrej Karpathy's way of speaking a little bit annoying, but he is a good teacher.