33. Neural Nets and the Learning Function

40,828 views

MIT OpenCourseWare

5 years ago

MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018
Instructor: Gilbert Strang
View the complete course: ocw.mit.edu/18-065S18
YouTube Playlist: • MIT 18.065 Matrix Meth...
This lecture focuses on the construction of the learning function F, which is optimized by stochastic gradient descent and applied to the training data to minimize the loss. Professor Strang also begins his review of distance matrices.
License: Creative Commons BY-NC-SA
More information at ocw.mit.edu/terms
More courses at ocw.mit.edu
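As a rough illustration of the learning function described above (a sketch only: the layer sizes, random weights, and the ReLU nonlinearity are assumptions, not taken from the lecture), F chains affine maps A_k v + b_k with a nonlinearity in between, and stochastic gradient descent would adjust the (A_k, b_k) to reduce the loss on the training data:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def F(v, weights):
    """Learning function: a chain of affine maps A_k v + b_k with ReLU between them."""
    *hidden, (A_last, b_last) = weights
    for A, b in hidden:
        v = relu(A @ v + b)
    return A_last @ v + b_last

# Hypothetical sizes: inputs in R^4, one hidden layer of width 6, scalar output.
rng = np.random.default_rng(0)
weights = [(rng.standard_normal((6, 4)), rng.standard_normal(6)),
           (rng.standard_normal((1, 6)), rng.standard_normal(1))]

v = rng.standard_normal(4)
print(F(v, weights))   # one forward pass; SGD would adjust (A_k, b_k) to shrink the loss
```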

Comments: 30
@tchappyha4034 5 years ago
Again Prof. Gilbert Strang! Thank you very much!
@brendawilliams8062 2 years ago
He is certainly right that even the best would be hard pressed to add anything. To have the pleasure of his lectures is more than gold.
@mihalisgolias1078 4 years ago
I have yet to see one of his lectures and not be amazed. He is simply awesome!
@brendawilliams8062 2 years ago
I will have to review and review from beginning to end many times. You are most clear in your explanations.
@bettymontoya784 2 years ago
I like your intelligent, advanced lectures. They are very challenging. Strang is the smartest. Thank you.
@roshansinghbisht280 3 years ago
Thank you Prof Strang!
@user-ym3oq3ks6j 4 years ago
I really enjoy your course.
@enkhbatlop 5 years ago
Thank you Gilbert!
@gtsmeg3474 4 years ago
Many thanks, MIT, for sharing such knowledge.
@stefan-ls7yd 4 years ago
Such a nice prof!
@brendawilliams8062 2 years ago
He sounds like the film star Jimmy Stewart. It may be the accent.
@georgesadler7830 2 years ago
Professor Strang, thank you for an awesome lecture on distance matrices, the structure of neural nets, and the learning function. All these mathematical concepts improve my understanding of machine learning.
@gonzalopolo2612 4 years ago
Hats off to Professor Gil Strang and MIT for these amazing classes. One question: I may be missing something. In the last part, where he says you can obtain the matrix G from D with this formula, I think that is not correct. You don't have the squared norms of the vectors, and Professor Strang assumes you have them on the diagonal of D, but the diagonal of D is all zeros. Am I right, or am I misunderstanding something? Again, thank you very much!
@clearmind9817 4 years ago
You are right, the diagonal entries d_{i,i} are all zeros. He explains it in a slightly complicated way (also in his book!). I think d_i and d_j here are not meant as the diagonal entries of D; they are the diagonal entries of G, which enter through the formula at 42:53. I hope this answers your question :)
@JokerSpeaker 1 year ago
Can someone please explain what happens at 34:10? Is this some sort of identity? I can't understand what it actually means.
@dcrespin 1 year ago
For BPP, deep learning, and the general structure of neural networks, the following comments may be useful. To begin with, note that instead of partial derivatives one can work with derivatives as the linear transformations they really are. It is also possible to look at the networks in a more structured manner. The basic ideas of BPP can then be applied in much more general cases. Several steps are involved.

1. More general processing units. Any continuously differentiable function of inputs and weights will do; these inputs and weights can belong, beyond Euclidean spaces, to any Hilbert space. Derivatives are linear transformations, and the derivative of a neural processing unit is the direct sum of its partial derivatives with respect to the inputs and with respect to the weights; this is a linear transformation expressed as the sum of its restrictions to a pair of complementary subspaces.

2. More general layers (any number of units). Single-unit layers can create a bottleneck that renders the whole network useless. Putting together several units in a single layer is equivalent to taking their product (as functions, in the sense of set theory). The layers are functions of the inputs and of the weights of the totality of the units. The derivative of a layer is then the product of the derivatives of the units; this is a product of linear transformations.

3. Networks with any number of layers. A network is the composition (as functions, in the set-theoretical sense) of its layers. By the chain rule, the derivative of the network is the composition of the derivatives of the layers; this is a composition of linear transformations.

4. Quadratic error of a function. ...

Since this comment is becoming too long I will stop here. The point is that a very general viewpoint clarifies many aspects of BPP. If you are interested in the full story and have some familiarity with Hilbert spaces, please google for papers dealing with backpropagation in Hilbert spaces. A related article with matrix formulas for backpropagation on semilinear networks is also available. For a glimpse into a completely new deep learning algorithm which is orders of magnitude more efficient, controllable, and faster than BPP, search on this platform for a video about deep learning without backpropagation; in its description there are links to demo software. The new algorithm is based on the following very general and powerful result (google it): polyhedrons and perceptrons are functionally equivalent. For the elementary conceptual basis of NNs, see the article Neural Network Formalism.

Daniel Crespin
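A minimal numerical sketch of point 3 above (a toy example of my own, not taken from the comment or the lecture; the layer sizes and the tanh nonlinearity are assumptions): the derivative of the composed network is the composition, i.e. the matrix product, of the layer derivatives, which a finite-difference check confirms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy layers: an affine map R^4 -> R^3 followed by componentwise tanh.
A, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
layer1 = lambda v: A @ v + b
layer2 = np.tanh
network = lambda v: layer2(layer1(v))

v0 = rng.standard_normal(4)

# Derivatives of the layers, as linear maps (Jacobian matrices) at v0.
J1 = A                                      # derivative of an affine map is A itself
J2 = np.diag(1.0 - np.tanh(layer1(v0))**2)  # diagonal Jacobian of tanh
J_net = J2 @ J1                             # chain rule: compose the layer derivatives

# Finite-difference check of the composed derivative.
eps = 1e-6
J_fd = np.column_stack([(network(v0 + eps*e) - network(v0 - eps*e)) / (2*eps)
                        for e in np.eye(4)])
print(np.allclose(J_net, J_fd, atol=1e-5))  # expected: True
```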
@wagsman9999 1 year ago
Love him. ❤
@allyourcode 3 years ago
@47:29 But where do we get the d vector in the formula for the G = X^T * X matrix?
@ky8920 2 years ago
The vector d consists of x_i^T x_i for 1 ≤ i ≤ n.
@user-tb4sg6lo8f 2 years ago
You can take any d vector that keeps X^T X positive semidefinite.
@stephenshefsky5201 1 year ago
Good question.
@yangpiao3071 8 months ago
The detail is given on page 260 of the reference book. To make it brief: placing the first point x1 at the origin, namely x1 = the zero vector, makes xi - x1 = xi. Therefore the first column of the distance matrix D is x1^2, x2^2, ..., xn^2, which is the diagonal of X^T X and also the vector d. I recommend looking it up in the original book to fully get the main idea.
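A small numerical check of this point (a sketch under the assumption that D holds squared distances, as in the book; the random points are made up): after translating the points so that x1 sits at the origin, the first column of D contains the squared norms, so it can play the role of d, and G = (d·1^T + 1·d^T − D)/2 recovers X^T X from D alone.

```python
import numpy as np

rng = np.random.default_rng(2)
n, dim = 6, 3
X = rng.standard_normal((dim, n))          # columns are the points x_1, ..., x_n
X = X - X[:, [0]]                          # rigid translation so that x_1 = 0

G_true = X.T @ X                           # Gram matrix we hope to recover
sq = np.diag(G_true)                       # squared norms ||x_i||^2
D = sq[:, None] + sq[None, :] - 2*G_true   # squared-distance matrix

# With x_1 at the origin, column 1 of D already holds ||x_i||^2, so it serves as d.
d = D[:, 0]
G = 0.5 * (d[:, None] + d[None, :] - D)

print(np.allclose(G, G_true))              # expected: True
```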
@somerandomguy8361 3 years ago
6:03 A_1, b_1 -> A_l, b_l
@leiqin5756 3 years ago
A_1, b_1 -> A_k-1, b_k-1
@yushpi 4 years ago
was he always old? :P
@AKSHAYAKUMARSEC 3 years ago
Old is gold
@plekkchand 3 years ago
Your remark is idiotic.
@freeeagle6074 1 year ago
Young people may be very old. Old people may be very young.
@craighalpin1917 1 year ago
Legends speak of a time when he was but a mere child... But we here do not condone such nonsense.
@eliavrad2845 1 year ago
I don't think I understood what the D matrix is in the distance problem. I tried getting a similar expression, but I'm not sure it's the same or correct.

Consider dᵢⱼ = |xᵢ − xⱼ|² = xᵢ² + xⱼ² − 2xᵢxⱼ; we want an expression for XᵀX, whose entries are xᵢxⱼ. We can do a rigid translation, so we can limit ourselves to xs that are centered, i.e. the average position is 0, i.e. ⟨x⟩ = 0.

Now, if we average dᵢⱼ over one of the indices, say j, we get ⟨dᵢⱼ⟩ⱼ = xᵢ² + ⟨x²⟩ − 2xᵢ⟨x⟩ = xᵢ² + ⟨x²⟩ + 0. Denoting the second moment ⟨x²⟩ = σ², this is ⟨dᵢⱼ⟩ⱼ = xᵢ² + σ². Averaging dᵢⱼ again, this time over the i index, gives ⟨⟨dᵢⱼ⟩ⱼ⟩ᵢ = ⟨xᵢ² + σ²⟩ᵢ = 2σ².

We can use this to rewrite the xᵢ² terms using averages over dᵢⱼ:
xᵢ² + xⱼ² = ⟨dᵢⱼ⟩ᵢ + ⟨dᵢⱼ⟩ⱼ − ⟨dᵢⱼ⟩ᵢⱼ = xᵢ² + σ² + xⱼ² + σ² − 2σ²
and get
2xᵢxⱼ = ⟨dᵢⱼ⟩ᵢ + ⟨dᵢⱼ⟩ⱼ − ⟨dᵢⱼ⟩ᵢⱼ − dᵢⱼ.

I think these averages over i and j correspond to some of the weird column tricks [1 1 1 1 1][D], but I'm not sure.
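For what it's worth, a quick numerical check of that final identity for centered points (my own sketch; the random data and sizes are assumptions), taking the averages as row, column, and grand means of the squared-distance matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, dim = 5, 2
X = rng.standard_normal((dim, n))
X = X - X.mean(axis=1, keepdims=True)    # center the points: <x> = 0

G = X.T @ X                              # entries x_i . x_j
sq = np.diag(G)
D = sq[:, None] + sq[None, :] - 2*G      # squared distances d_ij

row = D.mean(axis=1, keepdims=True)      # <d_ij>_j, one value per i
col = D.mean(axis=0, keepdims=True)      # <d_ij>_i, one value per j
grand = D.mean()                         # <d_ij>_ij

# 2 x_i . x_j = <d>_i + <d>_j - <d>_ij - d_ij   (double-centering identity)
print(np.allclose(2*G, row + col - grand - D))  # expected: True
```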