ERRATA:
- The difference between loss and energy: energy is for inference, loss is for training.
- The R(z) term is a regularizer that restricts the capacity of the latent variable. I think I said both of those things, but never together.
- The way I explain why BERT is contrastive is wrong. I haven't figured out why just yet, though :)

OUTLINE:
0:00 - Intro & Overview
1:15 - Supervised Learning, Self-Supervised Learning, and Common Sense
7:35 - Predicting Hidden Parts from Observed Parts
17:50 - Self-Supervised Learning for Language vs Vision
26:50 - Energy-Based Models
30:15 - Joint-Embedding Models
35:45 - Contrastive Methods
43:45 - Latent-Variable Predictive Models and GANs
55:00 - Summary & Conclusion
@bzqp23 жыл бұрын
Can you perhaps pin this to the top? Thanks.
@ThichMauXanh3 жыл бұрын
Every DNN with loss = some_distance(y, pred) is indeed an energy-based model, as you said. But not every energy-based model has the form loss = some_distance(y, pred), where pred = f(x) is an explicit part of the model. So by "energy-based model", Yann means a generalization of the traditional formulation, one that lets us escape the problem of multiple valid y for a single x. The blog post needs to make this distinction clearer.
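A minimal sketch of that distinction (toy scalar functions; `f`, `F`, and the numbers are all made up for illustration): a predictive model commits to a single pred = f(x), while an energy F(x, y) scores any pair, so it can assign low energy to several different y for the same x.

```python
# Predictive formulation: the model commits to one prediction per x.
def f(x):
    return 2.0 * x  # toy predictor

def predictive_loss(x, y):
    return (y - f(x)) ** 2  # some_distance(y, pred)

# Energy-based formulation: F(x, y) scores *any* pair; several y can be
# compatible with the same x (here both roots of y^2 = x get low energy).
def F(x, y):
    return (y ** 2 - x) ** 2

x = 4.0
for y in [-2.0, 2.0, 0.0]:
    print(y, predictive_loss(x, y), F(x, y))
# F assigns low energy to both y = -2 and y = +2, which no single f(x) can do.
```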
@baskaisimkalmamisti3 жыл бұрын
Thanks to you, I can watch YouTube and keep up with the research at the same time.
@brendawilliams80622 жыл бұрын
I get tired of it.
@falconeagle36553 жыл бұрын
Congrats!! Yann LeCun sent me to your video.
@sheggle3 жыл бұрын
We love us some content that doesn't chase SOTA. Thank you as always, Yannic!
@aniruddhadatta9253 жыл бұрын
Because it mostly just stacks up hardware
@QuadraticPerplexity Жыл бұрын
14:05 Regarding whether the third kind of masking could be used for NLP: if the word embedding is good, probably you could mask out a subset of the dimensions.
@sehbanomer81513 жыл бұрын
13:00 Am I the only one who found that question mark really satisfying?
@ProjectsWithRed3 жыл бұрын
I was just about to comment this haha.
@NextFuckingLevel3 жыл бұрын
Another proof that Yannic is an android.
@rockapedra11303 жыл бұрын
Best question mark ever!
@CalvinJKu Жыл бұрын
Didn’t like the video thumbnail at first sight but the content is king! Subscribed!
@rogerfreitas73233 жыл бұрын
So far the best channel on YouTube.
@WhatsAI3 жыл бұрын
Awesome video as always! And I completely agree, I feel like they are kind of trying to "set their terminology" on already existing concepts, but it was still an interesting read, and even better to hear your point of view on it!
@norabelrose1983 жыл бұрын
I think "energy based model" more precisely is supposed to refer to models that output unnormalized scores as opposed to (log-) probabilities. LeCun has said that he doesn't like approaches that are specifically designed to output valid probabilities or approximations of probabilities (i.e. normalizing flows, traditional VAEs) when arguably some other non-probability based approach would work better. But confusingly he also seems to lump even probability based models into the EBM category when he feels like it.
@ruroruro3 жыл бұрын
Agreed. Extremely hand-wavy and non-specific. Also, I wonder how you even determine whether some model is approximating a probability distribution or not. I'm pretty sure that for any score function you can produce a monotonic mapping of that score to [0, 1] that gives you a pretty good approximation of the underlying probability distribution.
@lucathiede92383 жыл бұрын
Normalizing flows output valid probabilities (or more precisely, likelihoods), yes, but VAEs don't; they only output a sample without the associated likelihood.
@norabelrose1983 жыл бұрын
@@lucathiede9238 The loss function for VAEs is the negative ELBO, and the ELBO is provably a lower bound on the true log-probability of the data.
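For reference, the bound being invoked here, in standard VAE notation (not specific to any implementation in this thread):

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right) \;=\; \mathrm{ELBO}(x)
```

so minimizing the negative ELBO pushes up a lower bound on log p(x) without ever computing p(x) exactly.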
@norabelrose1983 жыл бұрын
@@ruroruro Yeah, I think the main problem with mapping arbitrary score functions to probability distributions is computing the normalization constant to ensure the integral of the score over all possible inputs is equal to 1. That’s not tractable in a lot of cases. Some people try hard to figure out ways to compute or approximate the normalizing constant, and LeCun’s approach seems to just be, forget about it, don’t normalize the scores at all.
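In symbols, the normalization problem described above: an energy only becomes a probability after dividing by the partition function,

```latex
p_\theta(x) \;=\; \frac{e^{-E_\theta(x)}}{Z_\theta}, \qquad Z_\theta \;=\; \int e^{-E_\theta(x')}\,dx'
```

and it is that integral over all possible inputs that is generally intractable; the "don't normalize" stance keeps E and simply never computes Z.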
@lucathiede92383 жыл бұрын
@@norabelrose198 Not quite, the encoder gives you the pdf of a *latent variable*, conditioned on a sample. It will not give you the probability of the sample itself. At no point in the VAE you can actually get p(sample), which is what you are usually trying to approximate in energy-based models afaik.
@emransaleh95353 жыл бұрын
Good topic to tackle in this time. I will enjoy watching the video.
@membershipyuji3 жыл бұрын
Very helpful video. I was able to fill in many gaps present in the post.
@brendawilliams80622 жыл бұрын
Thankyou. Informative and nicely explained.
@lucathiede92383 жыл бұрын
There is a difference between energy functions and objective functions. In physics, an energy is a scalar potential, and the force field derived from it (its negative gradient) is conservative, with curl = 0 everywhere (which is important, because otherwise the integral over a closed path could be nonzero, violating conservation of energy). In ML, there are objectives whose gradient fields are not conservative; the best-known example is the GAN objective.
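Stated as standard vector-calculus facts (for reference, not from the post): if the field is the gradient of a scalar energy, it is automatically curl-free and closed-path integrals vanish,

```latex
\mathbf{F} = -\nabla E \;\Rightarrow\; \nabla \times \mathbf{F} = 0 \;\Rightarrow\; \oint_{\gamma} \mathbf{F} \cdot d\boldsymbol{\ell} = 0 \ \text{for every closed path } \gamma
```

whereas the coupled generator/discriminator updates of a GAN are not the gradient of any single scalar objective, so the corresponding vector field need not be conservative.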
@jackdkendall3 жыл бұрын
Yes, and energy-based models are also linked to probability distributions in an explicit way which general loss functions are not. An energy function is an unnormalized probability distribution where you can explicitly get relative probabilities. To say that an energy function is the same thing as a loss function is inaccurate.
@lucathiede92383 жыл бұрын
@@jackdkendall Correct, there is always an explicit link between energy and probability distribution. However, I would not go quite as far as saying energy functions are unnormalized pdfs. For example, in thermodynamics the likelihood of finding a system in a given state is described by the Boltzmann distribution of that state's energy, not just the energy normalized. And this is also only exact for the ideal-gas model; with higher-order interactions it becomes non-analytical. But the connection higher energy -> lower likelihood always remains, as far as I know.
@jackdkendall3 жыл бұрын
@@lucathiede9238 As far as I'm aware, in thermodynamics the probability of a state is the Boltzmann factor of its energy divided by the partition function. The Boltzmann distribution is a direct consequence of this. In physics, the Arrhenius equation always gives relative probabilities in terms of activation energy, which is just the energy difference between two states. In ML it's similarly defined: probability is the exponentiated negative energy divided by the partition function, aka the normalization term.
@jackdkendall3 жыл бұрын
Actually, scratch that: the Boltzmann distribution is a result of entropy maximization under a constraint on the average energy.
@lucathiede92383 жыл бұрын
@@jackdkendall Mh, I am not sure, maybe I have to freshen up my thermodynamics knowledge. What I am very sure of, though, is that the energy (whether we need to take the Boltzmann distribution of it or not) only describes the likelihood of a state, not the probability. To get the probability we need to consider the Helmholtz free energy of a macrostate, which essentially takes the entropy and the temperature into account. I know this is a bit nitpicky, but it is often messed up and very important, for example for protein folding, since the protein does not just fold into the state of lowest energy but of lowest free energy.
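For reference, the standard equilibrium relations this sub-thread is circling around (textbook statistical mechanics, not from the post): the probability of a microstate with energy E_i at temperature T, and the free energy that governs macrostates, are

```latex
p_i \;=\; \frac{e^{-E_i / k_B T}}{Z}, \qquad Z \;=\; \sum_j e^{-E_j / k_B T}, \qquad F \;=\; U - TS
```

so states are weighted by the exponential of negative energy rather than by the energy itself, and for macrostates it is the free energy, which includes entropy, that sets relative probabilities.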
@Arthurein3 жыл бұрын
I am likely wrong, so please correct me if that's the case. In a standard classification problem, the loss function is the objective for which we optimize the neural network. In an energy-based setting, the energy function *is* the network. The concepts really can be interchanged since, after all, the output of a classifier will describe something akin to a probability distribution over the possible categories. But the cool part about the energy function is that it does not require any sort of normalization, and thus it can just be a black box that gives you a small number if things make sense and a big number if they don't (or vice versa).
@ocifka3 жыл бұрын
52:20 AFAIK the latent variable _z_ and the "embedding" are actually sort of the same thing. (The embedding is just a realization of that random variable I would say.) The confusion probably comes from the fact that there are different distributions over _z_ involved: _p(z)_ and _q(z|x)_ - the latter is what the encoder outputs, including the reparametrization trick.
@PabloHuijseHeise3 жыл бұрын
Totally right. Also the "making the latent variable fuzzy" bit refers to constraining q(z|x) to be close to p(z), the latter typically being a standard gaussian distribution
@frenchmarty74462 жыл бұрын
@@PabloHuijseHeise Actually no. Making the latent variable "fuzzy" refers to adding noise and a KL-divergence term to p(z|x), thereby making p(z) smoother and easier to sample from. This has the side effect of making p(z) closer to a Gaussian than it otherwise would be, but enforcing p(z) to be Gaussian is a separate problem addressed by things like Adversarial Autoencoders, Factor-VAE, etc.
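A minimal sketch of the pieces named in this sub-thread (toy NumPy; the linear "encoder"/"decoder" and all shapes are made up for illustration): the encoder outputs the parameters of q(z|x), one realization of z is drawn with the reparameterization trick, and the loss adds a KL term pulling q(z|x) toward the prior p(z) = N(0, I).

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 8, 2
W_mu = rng.normal(size=(d_z, d_x))      # toy "encoder" weights for the mean
W_logvar = rng.normal(size=(d_z, d_x))  # toy "encoder" weights for the log-variance
W_dec = rng.normal(size=(d_x, d_z))     # toy "decoder" weights

def encode(x):                        # q(z|x): a diagonal Gaussian per input
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar):       # z = mu + sigma * eps, eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):   # KL(q(z|x) || N(0, I)), closed form
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

x = rng.normal(size=d_x)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)           # one realization of the latent variable
x_hat = W_dec @ z                        # decoder p(x|z), here just linear
neg_elbo = np.sum((x - x_hat) ** 2) + kl_to_standard_normal(mu, logvar)
print(z, neg_elbo)
```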
@CosmiaNebula2 жыл бұрын
"energy-based method", as defined in the paper, is *extensionally* the same as machine learning using a loss function, but is *intensionally* different. The problem is that LeCun didn't describe it rigorously using the language of symmetry, though in his subconscious (and in the subconscious of every physicist who reads the paper), the "energy function" is intended to be "energy function that has good symmetries". I will explain. ## Feynmann's unworldly equation Consider, for example, Feynman's "unworldliness equation" U = 0, where U = f^2 + g^2 + ..., and each f, g, ... is a scalar equation of nature. This equation is of course entirely correct, but it is trivial. However, this does not make every equation trivial. Some equations really are more substantial than others. What is the substance? It is *symmetry*, or invariance under transformations. When Maxwell wrote down the Maxwell equations, he used 20 scalar equations. In 4-vector notation, there are just 2 equations. Why such a great simplification? It is not the trivial kind of simplification as in U = 0, but a deep simplification -- all equations written in 4-vector notation are necessarily invariant under Lorentz transforms. Because the proper "home" of Maxwell equations is a universe that is invariant under Lorentz transforms, it's no wonder that they are more elegant when in 4-vector notations. Conversely, when you notice how elegant the equations are in 4-vector form, you realize that the universe should probably be invariant under Lorentz transforms. Modern theoretical physics is basically a game of inventing new transforms, then constructing equations invariant under the transforms, then publish it. > So the “beautifully simple” law in Eq. (25.32) is equivalent to the whole series of equations that you originally wrote down. It is therefore absolutely obvious that a simple notation that just hides the complexity in the definitions of symbols is not real simplicity. It is just a trick. The beauty that appears in Eq. (25.32)-just from the fact that several equations are hidden within it-is no more than a trick. When you unwrap the whole thing, you get back where you were before. > However, there is more to the simplicity of the laws of electromagnetism written in the form of Eq. (25.29). It means more, just as a theory of vector analysis means more. The fact that the electromagnetic equations can be written in a very particular notation which was designed for the four-dimensional geometry of the Lorentz transformations-in other words, as a vector equation in the four-space-means that it is invariant under the Lorentz transformations. It is because the Maxwell equations are invariant under those transformations that they can be written in a beautiful form. (Feynman 2, 25:6) www.feynmanlectures.caltech.edu/II_25.html#Ch25-S6 ## Energy-based methods, from the POV of ### a ML scientist Extensionally, any machine learning problem defined using an energy function is equivalent to one defined using a loss function. And conversely, any ML problem defined by a loss function is equivalent to one defined by an energy function. Intensionally, if you start with any loss function, and find its equivalent energy function, you would almost certainly get an energy function with no good symmetry at all. Energy-based method is a principled way to convert symmetries in the problem into good priors over your neural network. 
Instead of using arbitrary loss functions constructed ad-hoc, or perhaps meta-learn a loss function, we impose the prior over the space of loss functions that respect the symmetries. Writing down an energy that respects the symmetries is just an efficient, implicit way to impose the prior. ### a physicist Energy-based methods provide a principled way to write down equations that are invariant under physically relevant symmetries, such as translation (R^n), rotation (SO(n)), reflections (E(n)), volume-preserving maps (SL(n)), and so on. It also allows us to use gauge theory for ML. Not only that, it also allows one to enforce only local interactions, by writing the energy as a sum of local interactions (such as E = x1 x2 + x2 x3 + x3 x4 + ...), bringing statistical mechanics and renormalization techniques to the table. Not only does this allow you to import the greatest hits of modern physics and make ML as abstract as string theory, it also imposes good priors. A ML model for physical processes should probably only consider models that are invariant under the symmetries of nature, such as translation, rotation, reflection, etc. ### a mathematician Energy-based methods is the Erlangen program for high-dimensional probability. All hail Felix Klein, the felicitous king of symmetry. ### a linguist Extensional definition and intensional definitions often diverge, and it's more important to discover the intension and making it explicit, than to focus on the extension and quibble. For example, extensionally, an "activation function" is *any* function of type R^n → R, but that's the extensional definition. When you actually say "activation function" you mean any function of this type that has been profitably used in a neural network.
@cem_kaya2 жыл бұрын
thanks for the explanation
@CharlesVanNoland3 жыл бұрын
Regarding the object permanence / gravity thing: they did experiments with cats, raising them in environments that just had vertical stripes all over everything, effectively denying them the opportunity to see horizontal lines. When they matured, they would put the cats in more natural, conventional environments, and the cats had no concept of the danger of heights or falling, because they couldn't perceive the ledges they were approaching.
@ekjotnanda68322 жыл бұрын
Very nice explanation 👍🏻
@shengyaozhuang37483 жыл бұрын
looking forward to "Barlow Twins: Self-Supervised Learning via Redundancy Reduction" review, also from Yann's group. BYOL like method but without momentum updates!
@HaykTarkhanyan2 жыл бұрын
Thanks, that was very helpful
@bsdjns3 жыл бұрын
Great video, please keep them coming! I actually didn't know you're German until you mentioned the eierlegende Wollmilchsau (the "egg-laying wool-milk-sow", i.e. a do-it-all solution) :D
@jeroenput95643 жыл бұрын
Love your critical thinking!
@learnstochastic3 жыл бұрын
Hey Yannic, I don't know if you already have it. Is it possible to make these available as a podcast on Spotify or anywhere? Your approaches are really good, and it would be a lot easier to just plug in earphones and go for a walk with your explanations on. Thanks.
@PavelChernov3 жыл бұрын
Thank you! Very interesting topic.
@apollozou98092 жыл бұрын
Thank you!
@alvarohenriquez4973 жыл бұрын
Really enjoy your videos. Hope that you do one on the TimeSformer soon. Thanks.
@alexandervlasov67463 жыл бұрын
As far as I understood, energy-based learning is a non-probabilistic counterpart to maximum likelihood (or MAP) estimation. Similar to SVM and logistic regression: both are used as classifiers and have the same classifier form, but LR has probabilistic roots while the SVM does not. Basically, one uses the same idea but avoids the probability constraints (non-negative, sums to one). In that context, an energy function is not the same as a loss function. Yann LeCun uses the energy function for inference (y_pred = argmin_y F(x,y)). However, a training/validation loss can be constructed in a different way. Often the loss is (mis-)prediction based, but it's not necessarily so; e.g. one can use a structurally similar but different argmin approach to obtain loss/feedback information. A similar difference exists between a probability mass/density function and a (log-)likelihood function.
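A sketch of what "inference = energy minimization" looks like concretely (the energy, the step size, and the starting points below are all made-up toys): instead of computing f(x) in a single forward pass, y is found by descending F(x, ·), and different starting points can land on different, equally compatible answers.

```python
def F(x, y):                  # toy energy: low when y^2 is close to x
    return (y ** 2 - x) ** 2

def dF_dy(x, y):              # analytic gradient of the toy energy w.r.t. y
    return 4.0 * y * (y ** 2 - x)

def infer(x, y0, lr=0.01, steps=500):
    y = y0
    for _ in range(steps):    # y_pred = argmin_y F(x, y), by gradient descent
        y -= lr * dF_dy(x, y)
    return y

print(infer(4.0, y0=1.0))     # converges to about +2
print(infer(4.0, y0=-1.0))    # converges to about -2: a second valid answer
```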
@InquilineKea2 жыл бұрын
If you could only watch 4 Yannic videos this would be one of them
@finlayl25053 жыл бұрын
I wonder if you trained a model on filling in video first, could you then transfer that to object recognition and get better results? Because in the video training it would no doubt get a better understanding of 3d space which may help it in object recognition.
@brendawilliams80622 жыл бұрын
A spark gap unleashed a zoo
@georgelomia47243 жыл бұрын
To the point you brought up around minute 38:15: would an unsupervised clustering method help us group images together according to their similarity? If not, why not?
@ProjectsWithRed3 жыл бұрын
Am I not correct in saying that the number of possible missing patches/crops in an image is not infinite, just very high? Because let's say the crop (hidden part) is 100x100 and, for example, there are 3 colour channels, so 100x100x3. We know each colour channel has a range from 0-255, so we can technically enumerate all possible crops that go in that image, i.e. all combinations of pixel values, which is discrete and not infinite, and one of those combinations will be the exact patch from the whole image. It might be infinite in real life, but it is discrete in terms of CV.
@frenchmarty74462 жыл бұрын
A better way of putting it is ill-posed. There are more possible answers than we have data to learn from; therefore learning the parameters of a probability distribution is impossible. There are infinitely many good *parameters* of our model that fit the data perfectly well.
@EternalKernel3 жыл бұрын
Thank you, Yannic!
@sean_vikoren3 жыл бұрын
To find the number of possible arrangements, you add up the bits across all the pixels and raise 2 to that power; that is a specific number, determined by the number of bytes that represent the image. So not infinite.
@waleedawad45203 жыл бұрын
That was very helpful
@_kkaai2 жыл бұрын
Very helpful
@Markste-in3 жыл бұрын
My question is: what keeps the model from just ignoring the latent variable and just using the input (x)? Wouldn't that be easier for the model than trying to handle a jittery latent variable (z)?
@frenchmarty74462 жыл бұрын
1.) The model is actually two components, the encoder and the decoder. The decoder doesn't get to see 'X'. 2.) Modeling an intermediate latent variable forces the model to learn a more generalizable strategy. If you mapped from X* -> X it's impossible to know what the model will do with unseen values. Whereas if the encoder always maps to the same constricted latent space, the information that passes through the constriction is more likely to be meaningful. It's the same logic behind something like LASSO in standard regression. Better the model learns some important information than absorb everything. Also the beauty of using a latent intermediate representation is that you can dramatically change how the model "thinks" about the data by imposing restrictions on what "Z" can look like (see Factor-VAE, Causal-VAE, cluster-VAE).
@DucNguyen-wy1ir3 жыл бұрын
Hi @yannic. Thanks for the fantastic video. Regarding your confusion about limiting the capacity of the latent variable `z` in a VAE: I think what the post means is that in a VAE the authors use a unimodal Gaussian distribution, which is of limited capacity, because they could have used other distributions of higher capacity, like a Gaussian mixture model or some multimodal distribution (in fact, as far as I can remember, there is a paper doing it). What do you think?
@moormanjean56362 жыл бұрын
I think what they mean is that by introducing noise to the latent variable, the network learns to rely less on the latent part of the network and more on the autoencoder.
@frenchmarty74462 жыл бұрын
I'm not sure the information capacity of the distribution itself is related to how VAEs constrict information. Those are two different kinds of information. The decoder doesn't receive the entire distribution; it is given one sample from the latent distribution. The information in the distribution itself is lost. A multivariate Cauchy distribution contains more information than a multivariate Gaussian distribution, but a random sample from the former would actually contain less information* (more dimensions of my vector will be extreme outliers that are unrepresentative of the mode). It's more accurate to say that a non-Gaussian distribution *implies* more information (I have to know more as an observer to model something as non-Gaussian; a Cauchy distribution is a less expected guess than a Gaussian a priori); the distribution doesn't pass this information on unless I take multiple samples and relearn the distribution. A vector of n dimensions is still just a vector of n dimensions. Here is a good explanation of why VAEs usually use Gaussian noise instead of anything else: stats.stackexchange.com/questions/517467/is-it-possible-to-use-variational-autoencoders-with-non-gaussian-data This article has a good explanation of why variational autoencoders are variational in the first place, with a good visualization of how Gaussian noise "smooths" the latent space (which is useful for generative sampling): www.jeremyjordan.me/variational-autoencoders/ Gaussian noise has the side effect of restricting information flow, but this wasn't a problem for autoencoders to begin with, because we have complete control over the dimensionality of "Z". In fact, if you wanted to use noise to restrict information flow, you would actually generate noise from a more "informative" distribution like Cauchy noise. Remember that the decoder is ultimately trying to reconstruct what was fed into the encoder to start with; if "Z" is more complicated, it is that much harder for the decoder to guess what "X" produced it from a single sample of "Z". *This should actually be expected: if a distribution implies more information, then by extension any particular sample must be relatively less informative. It's not an accident that highly non-Gaussian distributions are very sample-inefficient to parameterize and/or require special, less efficient estimators. Also, there is a difference between pushing individual samples to be Gaussian and pushing the entire latent space to be Gaussian. The latter is what things like Adversarial Autoencoders (arxiv.org/abs/1511.05644) and beta-VAE (openreview.net/forum?id=Sy2fzU9gl) try to do and is closer to what I think you mean.
@frenchmarty74462 жыл бұрын
@@moormanjean5636 I'm not sure what you mean by "rely less on the latent part and more on the autoencoder". The latent variable is the output of the encoder [p(z|x)] and is the only input the decoder [p(x|z)] receives. Gaussian noise is added to make the decoder more robust and the latent distribution smoother, both of which make it easier to sample new values and to generalize to new inputs.
@aamir122a3 жыл бұрын
At time index 45 minutes, when you say mix x with z, what do you mean (add, multiply, divide, subtract, etc.)? As a suggestion, it would be best to explain the math operations or functions as you go along.
@sg22r3 жыл бұрын
How is self supervision different than augmentations?
@sebastiangerard95483 жыл бұрын
The blog post says: "An EBM is a trainable system that, given two inputs, x and y, tells us how incompatible they are with each other." I think that models trained with loss functions are not EBMs, since the task that the model is solving is e.g. image classification, not to output the loss between the ground truth and the input image. That's just something you use to find the parameters of the model, but not the task that the model solves. Maybe I misunderstood you, but it seemed like you wanted to argue that using a loss function during training is enough to qualify ML models as EBMs. You could try to argue that e.g. an image classifier is an EBM, since it takes as input an image x and then outputs compatibility scores with each of the classes. In that case you would need to define your second input y as being constant and representing all the classes, since your model outputs needs to be a compatibility measure between x and y. However, I would argue that it is then not an EBM, since it cannot make predictions for varying values of y. Following the definition above, it would need to be able to indicate how incompatible x and y are, for any x and y, to qualify as an EBM. Happy to hear any counterpoints. I don't have any previous experience with EBM or their origin in physics, but this is how the definition would make sense to me. Predicting the compatibility is defined as the central task that the model is solving.
@jean-baptistedelabroise53913 жыл бұрын
Hmm, if you scrape images from a search engine, isn't it possible to get harder negatives? For example: two images returned by a search for chess pieces are much more likely to be hard negatives than two images taken at random.
@frenchmarty74462 жыл бұрын
Yes, and that is mostly what people do in practice in unsupervised contrastive learning. The problem is you have no guarantees or bounds on how different the random images are, if at all. I might have a picture of a chess board and my random "negatives" are: another chess board, another board game, a picture of a dog and a child's sketch. My negatives all vary in their degree of similarity, but I have no way of telling the model that. In practice, this kind of solves itself. The model learns to tolerate some "negatives" being closer to the "positives" because that minimizes the loss overall, and generally that will cluster like things together. The point, I think, is that this is less data-efficient than self-supervised methods. For the same amount of data, the model can learn more by reconstructing the data than by clustering it.
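A sketch of the contrastive objective being described (an InfoNCE-style loss; the embeddings below are random vectors standing in for real image embeddings, so everything here is illustrative): the anchor is pulled toward its positive and pushed away from the negatives, however similar or dissimilar those negatives happen to be.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def info_nce(anchor, positive, negatives, temperature=0.1):
    # cosine similarity of the anchor to the positive and to each negative
    sims = normalize(np.vstack([positive, negatives])) @ normalize(anchor)
    logits = sims / temperature
    # cross-entropy with the positive at index 0: -log softmax(logits)[0]
    return -(logits[0] - np.log(np.sum(np.exp(logits))))

anchor    = rng.normal(size=16)                 # embedding of, say, a chessboard photo
positive  = anchor + 0.1 * rng.normal(size=16)  # augmented view of the same photo
negatives = rng.normal(size=(8, 16))            # random images; some may be near-duplicates,
                                                # and the loss has no way of knowing that
print(info_nce(anchor, positive, negatives))
```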
@wenxue81553 жыл бұрын
Isn't there already a paper on that: "Learning to Predict Without Looking Ahead: World Models Without Forward Prediction"? The whole idea is very similar to World Models.
@abekang36233 жыл бұрын
I was wondering if VAEs were considered fuzzy in reference to regular autoencoders, which do not have the distribution and sampling in the center. Also, because many autoencoders scale down their latent representation at the sampling point (center), would that be another reason why VAEs constrict or limit their representation?
@frenchmarty74462 жыл бұрын
I believe the "fuzziness" is in reference to the output, which is caused by the loss function (mean squared error) which encourages blurry outputs. VAEs that use a discriminator like GANs ("VAE-GAN") do not have the same problem. But you are right on both points. Though VAEs actually give better outputs when sampling outside of the data distribution, the injection of noise in the latent space forces the decoder to learn to give acceptable outputs over a wider and smoother range of outputs.
@bdennyw13 жыл бұрын
Yannic did a great job as always, but I'm not sure what the point of the paper was. Seems like a rehash of everything from the last couple of years.
@larsojinnaka3 жыл бұрын
The smiley face at 28:10
@hannesstark50243 жыл бұрын
The way I understand the VAE statements is that we have the gaussian as latent variable and restrict it via the encoder.
@frenchmarty74462 жыл бұрын
It would be closer to say it's the other way around. The encoder generates the latent variable and we constrict the latent space in several different ways (reduced dimensionality, KL divergence loss, Gaussian noise, etc). The decoder is tasked with guessing what went into the encoder to produce the latent variable (p(x | z)) hence a game in which the model learns the most important information to pass through the constriction. The result is we get a latent space that is dense with information, generalizes well and meets whatever other requirements we impose (easy to sample from, etc). We take this and do other useful tasks much easier.
@Blattealkiller3 жыл бұрын
Agree with you about the energy based model terminology ahah
@AlexanderMath3 жыл бұрын
You beat me to it. I was looking for anyone that took up the challenge, I thought exactly the same with energy based model.
@zrmsraggot3 жыл бұрын
At 24:00, don't we all assume the most likely hidden thing is the most 'usual' thing, until something else makes us think differently? For example, if you saw a face shot of me typing in my chair but couldn't see below the table I'm at, with no other info you would have to assume I'm wearing jeans, right? But if I had a bowl of Cheerios next to me, then you could suppose all I'm wearing is some pants.
@zeamon49323 жыл бұрын
Humans tend to stereotype as they grow up. Is that why children imagine more creatively?
@chriscanal9993 жыл бұрын
Lmao “my raspberry pi has that capacity”
@G12GilbertProduction3 жыл бұрын
My capacity is better than yours. ×P
@RohitKumarSingh253 жыл бұрын
Recently, many papers using an approach similar to BYOL (keeping a moving average of the network instead of the exact network, with lots of augmented images during training) have solved the negative-sampling problem of the contrastive-loss approach.
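A sketch of the moving-average (momentum) target network mentioned here, with made-up parameter dictionaries standing in for the online and target encoders; this is the EMA update that BYOL-style methods use in place of explicit negatives:

```python
import numpy as np

rng = np.random.default_rng(0)
online = {"w": rng.normal(size=(4, 4)), "b": np.zeros(4)}
target = {k: v.copy() for k, v in online.items()}    # target starts as a copy

def ema_update(target, online, tau=0.996):
    # target <- tau * target + (1 - tau) * online, applied after every training step;
    # only the online network receives gradients, the target just trails behind it.
    for k in target:
        target[k] = tau * target[k] + (1.0 - tau) * online[k]

online["w"] += 0.1 * rng.normal(size=(4, 4))          # pretend a gradient step happened
ema_update(target, online)
print(np.abs(target["w"] - online["w"]).mean())       # the target lags slightly behind
```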
@terguunzoregtiin87913 жыл бұрын
Hi, do you think "masked auto encoders are scalable vision learners" follows this blog's idea?
@thegistofcalculus3 жыл бұрын
I understand the motivation behind Siamese networks, but why not a giant one-hot vector target where each input picture corresponds to a node in the giant one-hot output? (It was done before; you can just search the title "Unsupervised Feature Learning via Non-Parametric Instance Discrimination".)
@dwhdai3 жыл бұрын
are there any good examples of SSLs on timeseries data?
@zoltanczesznak9763 жыл бұрын
Don't take it as a paper but as a tutorial with a catchy name, so it is sort of high-end popular science; big crowds won't start reading statistical learning theory. In addition, you are also a good educator, I believe, so it served its purpose.
@marc-andrepiche18093 жыл бұрын
An energy model is only a model where the lowest loss is 0 (some models maximize and/or can be negative).
@lamiaalsalloom1881 Жыл бұрын
thank you, you are a saint
@XX-vu5jo3 жыл бұрын
I am making an implementation of this with slight variations; I will share it on my GitHub soon.
@adityakane56693 жыл бұрын
Great video! Just some food for thought. Assume you have many random images of the world, from mountains to seas to cities and whatnot. Then you clip such a part (say 100x100) and ask the model to predict that. You use a sliding window approach to this, by which I mean you ask the model to predict many clips of the same image. Now you shuffle all these (image, clip) pairs and train the model. What will the model learn? Will it even learn anything? Or will it have a good understanding of the world as the authors suggest?
@frenchmarty74462 жыл бұрын
What exactly do you mean by shuffling the image/clip pairs? Do you mean pairing a clipped image with a non-matching clip and learning to discriminate good matches from poor matches? If you do this correctly, then yes, the model will learn quite a bit about the world. It will learn a latent representation that is semantically meaningful (like clustered with like, etc.). I don't know where the line for "good understanding" is drawn, but we do know that these latent representations are useful for many other unrelated tasks downstream with fine-tuning. I can use the same model to add labels to images, for example, with a small number of labeled training examples. It is faster to learn the relationship between latents and labels than between raw data and labels, so the latent space is capturing some important information in a compressed form.
@adityakane56692 жыл бұрын
@@frenchmarty7446 By clips I mean cut-outs of the image. The self-supervised task is to predict the contents of the cutout. I'm sorry for the confusion earlier. I now know that such efforts have been made in papers like MAE.
@LuisAldamiz2 жыл бұрын
Imagine a baby that does not have five senses but only a connection to the Internet which it reads in ones and zeroes and has to learn everything from that. It would not work, so the AI is quite amazing: it understands binary and extracts a lot of info from it, something we can't easily do.
@GuillermoValleCosmos3 жыл бұрын
I don't get why we need to limit the capacity of z. Isn't adding conditioning to the discriminator of a GAN enough to force it to attend to the conditioning?
@frenchmarty74462 жыл бұрын
We create an information bottleneck to force the model to learn meaningful representations without labels. We could very easily make "Z" have the same dimensions as "X" but it won't learn anything interesting. A conditional GAN is something else entirely where we already have labels and want to generate new samples that match our labels (as opposed to just being completely random). GANs don't learn representations on their own they only generate new samples.
@oraz.3 жыл бұрын
I'm sure LeCun will put this to good use for his employers, Facebook and Instagram.
@mdfeatherwx3 жыл бұрын
This video deserves my 1 hour of time.
@mar-a-lagofbibug88333 жыл бұрын
Thank you.
@IqweoR3 жыл бұрын
29:58 So basically the energy-based model predicts this F(x,y); this function becomes another learnable component, which in this case would be what you'd call the 'loss function'. How do you train it exactly? I don't know, but in theory, if you train it on some real videos, there's a chance it will overfit to those videos and will probably assign a high value to anything your main model outputs. For example, your model predicts a cat that wears a hat; that's ridiculous at first glance compared to real videos of cats, but we can actually think of this: we've seen bloggers doing this for the lols :) And if you somehow pair them (your self-supervised model and this F model) and train them together, this F should represent not just 'how real the output is' but 'how well it fits the data'; it wouldn't matter that much if a cat wears a hat, if that's appropriate to the context. But how to do this training properly is still an open question.
@junhanouyang65933 жыл бұрын
I may be really stupid, but for latent-variable predictive models, if we want to limit the capacity of z and make sure our model focuses on the Pred(x) decoding and doesn't care about z, then what is the purpose of z here?
@frenchmarty74462 жыл бұрын
We do care about "Z", it just isn't (always) part of our loss term. A constricted latent space forces the model to retain only important and (hopefully) meaningful information and thereby generalize much better. If Z was unconstrained, the model would just learn the identity function (x ≈ z) and learn nothing interesting.
@dimitriognibene89453 жыл бұрын
Is this any different from predictive coding? I find it offensive and unfair to rename concepts without giving credit to the related people, like Mumford, Ballard, Rao, Friston...
@MIbra962 жыл бұрын
23:54 With only a 32x32 8-bit greyscale image you have a total of 256^(32*32) possible images. That is ridiculously huge. xD
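The count is easy to sanity-check, since Python integers are arbitrary-precision (a one-off illustration, not from the video):

```python
n = 256 ** (32 * 32)   # number of distinct 8-bit grayscale 32x32 images
print(len(str(n)))     # 2467 decimal digits: finite, but astronomically large
```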
@owlmaster15283 жыл бұрын
In your video about Multimodal Neurons - that is my comment there: I didn't know that Picasso was connected to the AI. Now we know on what trip he went. We need an answer of just what exactly AI he was connected to (in trance or whatever) so he would made all this images. Do we have more proof for the Matrix now?
@YEASTY_COMMIE3 жыл бұрын
bro idk what you on but I want some of that
@owlmaster15283 жыл бұрын
@@YEASTY_COMMIE Keep calm, breath in and compare them :) You earned one like from me :)
@citizizen3 жыл бұрын
I want to build a brain... First, thanks for this channel! I think that all the datasets can be put together in chunks (kinds of reservoirs, as it were), represented in interesting ways. Future gene pools. In essence, my idea is that when we get multiple datasets, and each of those is represented as a special kind of information, we build something grand in the end. I guess that intelligence might emerge more effectively like this... Intelligence is about a lot of repetitions. Regards, Justin
@moormanjean56362 жыл бұрын
what you are talking about sounds a lot like transfer learning, you should look into this maybe
@Gauloi0073 жыл бұрын
I think the energy model is not just an equivalent reformulation; it's a reformulation that allows you to do unsupervised learning, by having a loss function (the energy function) that you learn, somewhat like an adversarial model. I think not having read all the preceding papers (which you were talking about in the other video) is why you still don't understand.
@reginaphalange25633 жыл бұрын
23:55 a nice glass of wine indeed
@silberlinie3 жыл бұрын
Is Yannic constantly getting closer and closer in appearance to Agent Smith from The Matrix? If so, what can we expect from him? If not, why do I ask?
@tinyentropy3 жыл бұрын
Somehow disappointing takeaways from a big title like this one. Good explanation, though. Thanks!!!
@ahmedtrabelsi29363 жыл бұрын
Hello, great work. Could you have a look at neuromorphic articles, human-brain-inspired articles (SNN algorithms...)?
@wiktormigaszewski86843 жыл бұрын
Why can't you generate negative examples from the same photo, by just taking a more distant fragment of it?
@andrewcutler45993 жыл бұрын
I forget the paper, but there is one that breaks an image into chunks and then uses an auxiliary loss to encourage the model to produce similar embeddings from each chunk. So that's an example of positive examples from the same photo. Not sure why they would be negative. You're saying neighboring portions of the image should be more similar than distant portions?
@wiktormigaszewski86843 жыл бұрын
@@andrewcutler4599 sure, this is how it works in reality! :)
@zeamon49323 жыл бұрын
@@wiktormigaszewski8684 I don't think so. Think about copying an apple to get N*N apples to generate a single image.
@jonseltzer3213 жыл бұрын
Not sure I follow the connection between the example of '...a cow lying on a beach' and kids being able to identify the cow because they have a model of the world. There's nothing common about a cow lying on a beach. I think it's more likely that humans are modeling and are able to exclude the beach and still see the cow, whereas machine learning is cheating and considering context when doing the analysis, making the cow not just the cow but also grass, clouds, cows eating grass, etc. And when confronted with a cow in the wrong context, they fail. The machine learning system has flattened the data, whereas the child has a hierarchical picture of the data.
@frenchmarty74462 жыл бұрын
That is what is meant by a world model. Your world model tells you what can and cannot be disentangled. A cow lying on a beach being a rare sight is a fact of the data distribution, not of the world model per se. We would say that a good world model is robust to this kind of confounding, whereas a poor world model is naive to confounding (it "cheats" with spurious relationships).
@duncanmays683 жыл бұрын
4:45 Insulting cows is very Swiss
@jvboid3 жыл бұрын
Thank you. I wanted to add that Chollet also defines intelligence as "skill acquisition efficiency with respect to … information"
@خالد_الشيباني3 жыл бұрын
I think this is the standard definition of intelligence in human-centric fields such as psychology as well.
@willd1mindmind6393 жыл бұрын
Most of the tasks for deep learning are associated with making sense of the buckets of binary information sitting on a computer server. The rise of machine learning is associated with the rise of big data, as they have a natural synergy (aka Google). But it really isn't intelligence. Intelligence is being able to derive understanding about the world from the data provided. In terms of visual intelligence, that means learning what light is, what shadows are, what surfaces are, what textures are, what perspective is, what near and far is, left and right, up versus down, and so forth at a base level (most creatures with vision can do that). Then, on top of that basic core of visual comprehension, there is the higher-order reasoning, learning, and understanding that humans have evolved, but neural networks can't do. Even with that limitation, the reason it works in so many applications of modern industry and computing is that a lot of business functioning is based around statistical models anyway. Using statistics-based models to help model behaviors and generate predictions fits well with a large number of business tasks. And of course, data stored in silicon as a result of human activity online is growing exponentially. So these neural network models work reasonably well for a large set of business-related use cases and scenarios on general-purpose computer hardware...
@jamiekawabata71013 жыл бұрын
I have a new method I call "simulated annealing". Oh crap that's already taken.
@harrywoods97842 жыл бұрын
Just a thought: in my mind, embodiment is the key to self-learning. That's why Tesla's humanoid Optimus robot is exciting. 🤔 IMO
@samernoureddine3 жыл бұрын
28:10 Amazon logo
@eelcohoogendoorn80443 жыл бұрын
Better than OpenAI? They did release some pretrained CLIP models, so not so fast! But yeah, got to agree on the energy thing. Indeed it seems to mean 'loss function for people with physics envy'. C'mon Yann, you are too well paid for that.
@davidk9913 жыл бұрын
loved the german :D
@hoaxuan70743 жыл бұрын
There is an argument to go the other way and extensively use human labeled data. To make up for the 'fact' that neural network training algorithms are only able to search for statistical solutions, not explore the full solution space.
@ZakkeryDiaz3 жыл бұрын
The space would be confined to the labels provided
@hoaxuan70743 жыл бұрын
@@ZakkeryDiaz The net output could be augmented with extensive labels or short descriptive sentences. You might train the net to produce an image output and a descriptive sentence. To train a net to do that, you are forcing in human concepts via the training sentences. Anyway, it would have to make the net better at its basic task of producing images. Proof needed.
@ZakkeryDiaz3 жыл бұрын
@@hoaxuan7074 I think there's two things here. One is the desire to reduce the cost of data acquisition; requiring human intervention is very expensive. The second point is that I think there will be more sophisticated structures in the future (sparse networks, inhibition of subsystems, etc.) that will give rise to emergent properties and explore some of those spaces we can't find purely with neural networks. We lose the opportunity for other paths in the network space because the labels prematurely close them off before they prove useful in a situation not covered by your test/training data.
@hoaxuan70743 жыл бұрын
@@ZakkeryDiaz Sufficiently sparse or small neural networks are no longer chained to statistics. There are many examples of small nets trained by evolution to play games, on YT, that are not statistical. I just had an idea of jobs for most of the population helping neural networks improve. That people have employment is important; the less valuable you are, the greater the chance you will be misused.
@hoaxuan70743 жыл бұрын
@@ZakkeryDiaz You know there are Fast Transform fixed-filter-bank neural networks? They are really fast and work nicely for many problems, yet by construction they are 100% statistical in behavior. And that is quite in contrast to Numenta's sparse neural network, where ReLU is replaced by top-k magnitude selection. They are certainly both unconventional neural networks. Which is better?
@rnoro3 жыл бұрын
I agree with most of the comments. It looks like "self-supervised learning" = "unsupervised learning". They just renamed the loss function to energy function.
@Hypotemused3 жыл бұрын
Speaking of ‘energy functions’ - any idea how much these models cost to train and the energy they consume ? GPT3, SEER , etc -- the performance metrics are always published but they never say what it cost or how much energy they consume. Is it so negligible that it’s not even worth mentioning?
@dr.mikeybee3 жыл бұрын
You need a Hadoop cluster of Raspberry Pis for that.
@beans28743 жыл бұрын
54:13 - 69 vs 96
@GuillermoValleCosmos3 жыл бұрын
nice vs cine
@hoaxuan70743 жыл бұрын
A paper to look at is Numenta's sparse neural net. A random mask is used to get rid of, say, 95% of the weights. Top-k magnitude selection is used on each layer of dot products, and only the selected units connect forward. I pointed out that maximum magnitude is correlated with minimum angle between the input vector and the weight vector, and minimum angle is correlated with reduced noise sensitivity. You can see the Numenta Discourse forum for the argument. In fact, a dot product has a critical zone within which it displays error correction. The two factors are angle and distribution. A single non-zero input is not distributed; all inputs equal and non-zero is fully distributed. That there is no talk of this in the neural network books suggests poor scientific methodology and castles being built on foundations of sand.
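A sketch of the top-k (k-winners-take-all) selection described here, replacing ReLU; the layer sizes and weights are hypothetical, and this illustrates only the selection step, not Numenta's full model:

```python
import numpy as np

def topk_activation(pre_activations, k):
    # Keep only the k units with the largest |activation|; zero out the rest.
    out = np.zeros_like(pre_activations)
    idx = np.argsort(np.abs(pre_activations))[-k:]
    out[idx] = pre_activations[idx]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=32)
W = rng.normal(size=(64, 32)) * np.sqrt(1.0 / 32)
h = topk_activation(W @ x, k=8)   # only 8 of the 64 units propagate forward
print(np.count_nonzero(h))        # 8
```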
@citizizen3 жыл бұрын
If it is hard to do visual stuff, a computer can do analysis in different forms... like verbal methods for visual patterns, like a language. Couldn't it be that some visual stuff has a verbal side to it? Not sure here. Perhaps energy-based models and language-based techniques can be combined. So energy might be like the 'emotion' of the objects used, and this energy with verbal material... Perhaps a joke... => p2p, internet, sharing resources by any computing device...
@NeoShameMan3 жыл бұрын
Humans are probably few-shot learners, as shown by the subliminal effect (i.e. one input (1 frame) isn't enough), and the human brain is full of feedback loops that act as working memory; few-shot with memory is probably what is happening, with continuous learning to bake in the data. Also, human training is quite long: babies don't wake up talking and walking on day one, AND there are a bunch of autonomous functions baked in that are overridden by training over time.
@XX-vu5jo3 жыл бұрын
54:17 he Wants it!!!
@seanreynoldscs2 жыл бұрын
No, object permanence is learned. That is why peekaboo is so much fun for kids.
@Zantorc3 жыл бұрын
This is _the_ cat; This is _our_ cat; This is _his_ cat; This is _her_ cat; This is _their_ cat; This is _somebodies_ cat; This is _ones_ cat; This is _zir/hir/eir/vis/LGBTQ+ pronouns_ cat; This is _Yannic's_ cat;...
@charlesquarra50503 жыл бұрын
Liking how, at 27:30, Yannic cuts through all the BS terminology on our behalf. Yes, "energy" in an ML context is just another notch of upward-trending terminology entropy for the usual concept of a loss. Nothing special or mystical about it. Thanks Yannic for fighting the good fight of terminology-entropy reduction.
@jackdkendall3 жыл бұрын
Energy based models have been around far longer than deep learning. Not to mention that it's not even the same thing as a loss function. These distinctions exist for a reason.
@lucathiede92383 жыл бұрын
No, it is not: energies explicitly have a rotation-free (conservative) gradient field, as opposed to, for example, the GAN objective. Also, as another commenter pointed out, another way to look at it is that energies are explicitly linked to probabilities: the lower the energy, the higher the probability. This concept simply does not make sense for GAN objectives. (This comment obviously refers to the OG comment, not to the previous answer.)
@Cl0udn1n3 Жыл бұрын
A F Of XY energy based model. Foxy energy is what I always look for when on a train. #NyuWorldOrder #SoylentGarchBurgerQlists #yannie-chan