Yannic is awesome at “bottom-lining” things. Cuts through the abstruse mathematical fog and says “this is all it’s REALLY doing”. This channel is HUGELY valuable to me. There are too many papers that veer off into the implementation maths, IMHO. Yannic helps you filter out all the irrelevancies.
@good_user35 4 years ago
It's great to learn why the transformer works so well (Theorem 4) and how the three vectors (K, Q and V) translate into Hopfield networks. The analysis of layers for patterns reminds me of many BERTology studies in NLP. In one of those papers, I remember it was reported that most syntactic processing seems to occur by the middle of the 12 layers. It's interesting, and it seems there is still a lot to be learned in the future. Thanks!
@samanthaqiu3416 3 years ago
I'm still confused by this paper: the original Krotov energy for binary pattern retrieval keeps the weights as *sums* over all stored patterns, which means constant storage... this LSE energy and update rule seem to keep the entire list of stored patterns around... that looks like cheating to me... I am probably missing something
@maxkho00 a year ago
@@samanthaqiu3416 I kept asking myself this throughout the entire video. Surely I must be missing something? The version of Hopfield "network" described by Yannic in this video just seems like regular CPU storage with a slightly more intelligent retrieval system.
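For readers puzzling over the same thing: the energy and one-step update rule under discussion (sketched here from memory of the paper, so the notation may differ slightly) make it explicit that the full pattern matrix X = (x_1, ..., x_N) is kept around:

```latex
E(\xi) = -\operatorname{lse}\!\big(\beta,\, X^{\top}\xi\big) + \tfrac{1}{2}\,\xi^{\top}\xi + \text{const},
\qquad
\operatorname{lse}(\beta, z) = \beta^{-1}\log\sum_{i=1}^{N} e^{\beta z_i}

\xi^{\text{new}} = X\,\operatorname{softmax}\!\big(\beta\, X^{\top}\xi\big)
```

So yes: unlike the classical binary Hopfield net, whose patterns are compressed into a fixed weight matrix, this formulation stores the raw patterns explicitly, and retrieval is a softmax-weighted average over them. That is also the K/Q/V translation mentioned above: with ξ playing the role of the query and (projections of) X playing keys and values, one update step is essentially a transformer attention step once the learned projections are added.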
@centar1595 4 years ago
THANK YOU! I actually asked specifically for this one - and man that was fast :)
@lucasl1047 28 days ago
This video is soon gonna boom lol
@chochona019 4 years ago
Damn man, the amount of great papers you review is amazing. Great work.
@samanthaqiu3416 4 years ago
Regarding Theorem 3: c has a lower bound that is exponential in d^-1, hence the guarantee that N will grow exponentially seems optimistic. If you include the lower bound on c, it seems that the lower bound on N has no exponential dependence on d at all.
@Deathflyer 2 years ago
If I understand the proof in the appendix correctly, this is just phrased weirdly. Looking at the actual formula for c, its asymptotic behaviour as d → ∞ is just a constant.
@jesseshakarji9241 4 years ago
I loved how he drew a pentagram
@mgostIH 4 years ago
Wow, I found out about your channel a few days ago, today I saw this paper and got interested in it, and now I see you just uploaded! Your channel has been very informative and detailed, quite rare compared to many others which just gloss over the details.
@dylanmenzies3973 20 days ago
I read about Hopfield nets, thought "why can't they be continuous?", and bang, straight into the cutting edge.
@emuccino 4 years ago
Linear algebra is all you need
@alvarofrancescbudriafernan2005 2 years ago
Can you train Hopfield networks via gradient descent? Can you integrate a Hopfield module inside a typical backprop-trained network?
@revimfadli4666 a year ago
I guess Fast Weights can do those
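In case it helps: yes on both counts, as far as I can tell, because one retrieval step is just (differentiable) attention. Below is a minimal PyTorch-style sketch of a Hopfield-like module with learnable stored patterns, so gradients flow into it like any other layer; the class and parameter names are my own invention, not an official API (I believe the authors also published a reference implementation, ml-jku/hopfield-layers on GitHub):

```python
import torch
import torch.nn as nn

class ToyHopfieldLayer(nn.Module):
    """Hypothetical minimal Hopfield-style layer: learnable stored patterns
    are queried with the input state; one retrieval step is just softmax
    attention over the patterns, so the whole module trains with backprop."""
    def __init__(self, dim: int, n_patterns: int, beta: float = 1.0):
        super().__init__()
        # Stored patterns as ordinary trainable parameters (n_patterns x dim).
        self.patterns = nn.Parameter(torch.randn(n_patterns, dim) / dim ** 0.5)
        self.beta = beta

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) -> similarity to each stored pattern: (batch, n_patterns)
        scores = self.beta * query @ self.patterns.t()
        weights = torch.softmax(scores, dim=-1)
        # Retrieved state is a convex combination of the stored patterns.
        return weights @ self.patterns

# Drop it into an ordinary backprop-trained model:
layer = ToyHopfieldLayer(dim=64, n_patterns=16)
out = layer(torch.randn(8, 64))   # (8, 64); gradients reach layer.patterns
```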
@sthk1998 4 years ago
If exponentially many patterns can be embedded within these Hopfield networks, does that mean that this is a good architecture type to use in a reinforcement learning task?
@YannicKilcher 4 years ago
possibly yes
@sthk1998 4 years ago
@@YannicKilcher how would one transfer the model representation of e.g. BERT or some other transformer model to an RL framework?
@jaakjpn 4 years ago
@@sthk1998 You can use Hopfield networks (and transformers) for the episodic memory of the agent. DeepMind has used similar transformer-like attention mechanisms in their latest RL methods, e.g., Agent57.
@revimfadli4666 a year ago
Also, how resistant would it be to catastrophic forgetting?
@revimfadli4666 a year ago
@@jaakjpn I wonder if the ontogenic equivalent of the Baldwin effect played a part
@woooka 3 years ago
Cool work, great to get more insights about Transformer attention!
@rock_sheep4241 4 years ago
You are indeed the most amazing neural network ever :))
@rock_sheep4241 4 years ago
A quick Sunday night film :))
@0MVR_0 4 years ago
Personhood goes well beyond simulated predictions with evaluative mechanics.
@Irbdmakrtb 4 years ago
Great video Yannic!
@Imboredas 2 years ago
I think this paper is pretty solid, just wondering why it was not accepted in any of the major conferences.
@dylanmenzies3973 20 days ago
Hang on a sec: n nodes, therefore n^2 weights (ish). The weights contain the information for the stored patterns; that's not exponential in n, more like storage of n patterns of n bits each at best. Continuous is different... each real number can hold unlimited information, depending on the output accuracy required.
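A quick side-by-side, in case it helps (the classical capacity figure is the standard textbook result; the exponent for the continuous case is Theorem 3 as I remember it, so treat it as approximate):

```latex
% Classical binary Hopfield net: N patterns compressed into d^2 Hebbian weights,
W = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \qquad \text{capacity} \approx 0.138\, d .

% Continuous net in this paper: no compression; the pattern matrix
% X = (x_1,\dots,x_N) \in \mathbb{R}^{d \times N} is kept explicitly, so memory grows with N.
% The "exponential" statement concerns retrieval of well-separated patterns on the
% sphere of radius M, roughly N \sim c^{(d-1)/4}.
```

So the exponential claim is not "exponentially many bits squeezed into n^2 weights"; it is "exponentially many sufficiently separated continuous patterns can still be told apart and retrieved", with storage that scales linearly in the number of patterns.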
@AbgezocktXD 4 years ago
These spheres (32:00) are just like in coding theory. Very cool
@konghong3885 4 years ago
Not gonna lie, I have been waiting for this video so I don't have to read it myself :D
@umutcoskun4247 4 years ago
Lol, I was looking for a YouTube video about this paper just 30 min ago and was sad to see that you had not uploaded a video about it yet... I was 15 min too early, I guess :D
@Xiineet 4 years ago
"its not also higher, it is also wider" LMAO
@bzqp2 26 days ago
Hopfield Networks is All You Need To Get A Nobel Prize in Physics.
@cptechno 4 years ago
Love your work! I'm interested in the research journals that you regularly scan. Can you give a list of them? Maybe you can classify them as 1) very often cited, 2) less often cited, ...
@YannicKilcher 4 years ago
There's not really a system to this
@revimfadli4666 a year ago
@@YannicKilcher so just the PhD "one paper per day" stuff?
@rockapedra1130 4 years ago
Very clear! Great job!
@luke.perkin.inventor 4 years ago
It looks great, but equivalently expressive networks aren't always equally trainable, are they? Can anyone recommend a paper that tackles measuring the learnability of data and the trainability of networks, maybe linking P=NP and computational complexity? I understand ill-posed problems, but, for example, with cracking encryption no size of network or quantity of training data will help... because the patterns are too recursive, too deeply buried, and so unlearnable? How is this measured?
@davidhsv2 4 years ago
So the ALBERT architecture, with its parameter sharing, can be described as a Hopfield network with 12 iterations? ALBERT is a single transformer encoder iterated 12 times.
@YannicKilcher 4 years ago
It's probably more complicated, because transformer layers contain more than just the attention mechanism
@nitsanbh a year ago
Would love some pseudo-code! Both for training and for retrieval.
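Seconded; in the meantime, here is a rough numpy sketch of the two operations, based on the update rule in the paper (the function names are made up, and this is a toy illustration rather than the authors' code):

```python
import numpy as np

def store(patterns):
    """'Training' in this continuous Hopfield view is just stacking the
    patterns as columns of X (shape d x N)."""
    return np.stack(patterns, axis=1)

def retrieve(X, xi, beta=8.0, n_steps=1):
    """Iterate xi <- X softmax(beta * X^T xi); one step is usually enough
    to land very close to the nearest stored pattern."""
    for _ in range(n_steps):
        scores = beta * X.T @ xi              # similarity to each stored pattern
        p = np.exp(scores - scores.max())
        p /= p.sum()                          # softmax over stored patterns
        xi = X @ p                            # convex combination of patterns
    return xi

# Example: store 3 random unit-norm patterns, query with a noisy copy of one.
rng = np.random.default_rng(0)
d = 32
pats = [v / np.linalg.norm(v) for v in rng.standard_normal((3, d))]
X = store(pats)
noisy = pats[0] + 0.1 * rng.standard_normal(d)
print(np.linalg.norm(retrieve(X, noisy) - pats[0]))   # should be tiny
```

For "training" in the deep-learning sense (a Hopfield layer inside a network), the patterns and projections simply become learnable parameters and you backprop through the softmax, as discussed in another thread above.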
@siuuuuuuuuuuuu12 24 days ago
Doesn't this network take the form of a hub?
@jeanphilippe9141 2 years ago
Hey! Amazing video, love your work. I'm a beginner in all of this, but I have this question: can increasing the number of dimensions of the problem lower its "perplexity"? Higher dimensions mean more information, meaning tighter or more specific "spheres" around a pattern. My guess is "yes", but sometimes the dimensions are fixed in a problem, so this way of lowering perplexity is impossible. Does the paper say anything about that, or do you have an educated guess on what the answer could be? :) If my question is stupid, just say so, I really don't mind! Thanks for any answer and thank you for your videos. I'm hoping to make this an activity for high school students to promote science, so thanks a lot!
@burntroses1 2 years ago
It is a breakthrough in understanding immunity and cancer
@0MVR_0 4 years ago
It is time to stop giving academia the 'all you need' ultimate.
@seraphim9723 4 years ago
Modesty is all you need!
@TheGroundskeeper 4 years ago
Hey man. I literally sit and argue about AI for a job, and I often find myself relying on info or ideas either fully explained or at least lightly touched on by you. This is a great example. It'd be a sin to ever stop. It's obvious to me that training was in no way done, and the constant activity in the middle does not indicate that the same items are going back and forth about the same things.
@gamefaq 4 years ago
Great overview! Definition 1 for stored and retrieved patterns was a little confusing to me. I'm not sure if they meant that the patterns are "on" the surface of the sphere or if they were "inside" the actual sphere. Usually in mathematics, when we say "sphere" we mean just the surface of the sphere and when we say "ball" we mean all points inside the volume that the sphere surrounds. Since they said "sphere" and they used the "element of" symbol, I assume they meant that the patterns should exist on the surface of the sphere itself and not in the volume inside the sphere. They also use the wording "on the sphere" in the text following the definition and in Theorem 3. Assuming that's the intended interpretation, I think the pictures drawn at 33:42 are a bit misleading.
@YannicKilcher 4 years ago
I think I even mention that my pictures are not exactly correct when I draw them :)
@sacramentofwilderness6656 4 years ago
Concerning these spheres: do they span the entire parameter space, or are there some regions not belonging to any particular pattern? There were theorems claiming that the algorithm has to converge; in that case, does getting caught by a particular cluster depend on the initialisation of the weights?
@YannicKilcher 4 years ago
Yes, they are only around the patterns. Each pattern has a sphere.
@valthorhalldorsson9300 4 years ago
Fascinating paper, fantastic video.
@DamianReloaded 4 years ago
It'd be cool to see the code running on some data set.
@martinrenaudin7415 4 years ago
If queries, keys and values are of the same embedding size, how do you retrieve a pattern of a bigger size in your introduction?
@YannicKilcher 4 years ago
good point. you'd need to change the architecture in that case.
@mathmagic9333 4 years ago
At this point in the video kzbin.info/www/bejne/pKeZoHl6pZulhLM you state that if you increase the dimension by 1, the storage capacity increases by a factor of 3. However, it increases by c^{1/4}, so by about 1.316 and not 3, correct?
@YannicKilcher 4 years ago
True.
@nicolasPi_ 3 years ago
@@YannicKilcher It seems that c is not a constant and depends on d. Given their examples with d=20 and d=75, we get N>7 and N>10 respectively, which looks like quite a slow capacity increase, or did I miss something?
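Just to spell out the arithmetic behind this exchange: if the capacity bound scales like c^{(d-1)/4} (which matches the c^{1/4}-per-dimension reading above), then with the video's c ≈ 3,

```latex
c^{1/4} = 3^{1/4} \approx 1.316 ,
```

i.e. roughly a 32% increase per extra dimension rather than a factor of 3; and, as noted, c is not a plain constant, so plugging in the paper's worked examples gives much more modest numbers.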
@tripzero0 4 years ago
I want to see attentionGAN (or op-GAN). Does attention work the same way in GANs?
@pastrop2003 4 years ago
Isn't it fair to say that if we have one sentence in the attention mechanism, meaning that each word in the sentence is attending to the words from the same sentence, the strongest signal will always come from a word attending to itself, because in this case the query is identical to the key? Am I missing something here?
@charlesfoster6326 4 years ago
Not necessarily, in the case of the Transformer: for example, if the K matrix is the negative of the Q matrix, then the attention will be lowest for a position attending to itself.
@pastrop2003 4 years ago
@@charlesfoster6326 True, although based on what I've read on transformers, in the case of a single sentence K == Q. If so, we are multiplying a vector by itself. This is not the case when there are 2 sentences (a translation task is a good example of that). I haven't seen the case where K == -Q.
@charlesfoster6326 4 years ago
@@pastrop2003 I don't know why that would be. To clarify, what I'm calling Q and K are the linear transforms you multiply the token embeddings with prior to performing attention. So q_i = tok_i * Q and k_i = tok_i * K. Then q_i and k_i will only be equal if Q and K are equal. But these are two different matrices, which will get different gradient updates during training.
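A compact way to state the point of this thread, in standard transformer notation (W_Q and W_K are the two learned projection matrices):

```latex
q_i = x_i W_Q, \qquad k_j = x_j W_K, \qquad s_{ij} = \frac{q_i k_j^{\top}}{\sqrt{d_k}} .
```

The self-score s_{ii} is only guaranteed to dominate its row when W_Q W_K^T behaves like a positive multiple of the identity; since the two projections are separate matrices receiving separate gradient updates, a token need not attend most strongly to itself, even within a single sentence.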
@sergiomanuel2206 4 years ago
You are a genius man!
@ChocolateMilkCultLeader 4 years ago
Thanks for sharing. This is very interesting
@zerotwo7319 a year ago
Why don't they say 'attractor'? Much easier than 'circles'.
@ArrakisMusicOfficial 4 years ago
I am wondering, so how many patterns does each transformer head actually store?
@YannicKilcher 4 years ago
good point, it seems that depends on what exactly you mean by pattern and store
@shabamee9809 4 years ago
Maximum width achieved
@rameshravula8340 4 years ago
Lots of math in the paper. Got lost in the mathematics portion. Got the gist of it, however.
@josephchan9414 4 years ago
thx!!
@004307ec 4 years ago
As an ex-PhD student in neuroscience, I am quite interested in such research.
@conduit242 3 years ago
So kNN is all you need
@FOLLOWJESUSJohn316-n8x 4 years ago
😀👍👊🏻🎉
@quebono100 4 years ago
Subscribers to the moon!
@ruffianeo3418 a year ago
What really bugs me about all "modern AI" "explanations" is that they do not enable you to actually code it. If you refer to one source, e.g. this paper, you are none the wiser. If you refer to multiple sources, you end up confused because they do not appear to describe the same thing. So it is not rocket science, but people seem to be fond of making it sound like rocket science, maybe to stop people from just implementing it? Here are a few points that are not clear (at least to me) at all:
1. Can a modern Hopfield network (the one with the exp) be trained step by step, without (externally) retaining the original patterns it learned?
2. Some sources say there are 2 (or more) layers (a feature layer and a memory layer). This paper says nothing about that.
3. What are the methods to artificially "enlarge" a network if a problem has more states to store than the natural encoding of a pattern allows (2 ^ number of nodes < number of features to store)?
4. What is the actual algorithm to compute the weights if you want to teach a network a new feature vector?
Both the paper and the video seem to fall short on all those points.
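On points 1 and 4 at least, my understanding (a sketch, not an official answer): in the continuous formulation of this paper there is no separate Hebbian weight-update step at all; the stored patterns themselves play the role of the weights, so "teaching" a new pattern is just appending it, and incremental learning is possible precisely because the patterns are retained:

```python
import numpy as np

def teach(X, new_pattern):
    """Add one pattern to a continuous Hopfield memory: append it as a new
    column of the pattern matrix X (d x N). No weight computation needed,
    but memory grows with every stored pattern."""
    return np.column_stack([X, new_pattern])
```

The "feature layer / memory layer" framing in other sources corresponds, as far as I can tell, to making the pattern matrix and the query projection learnable parameters of a network layer instead of fixed data.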
@tanaykale1571 4 years ago
Hey, can you explain this research paper: CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning (arxiv.org/abs/1903.02351)? It is related to image segmentation. I am having trouble understanding it.
@444haluk 3 years ago
This is dumb. Floating-point numbers are already represented with 32 bits. THEY ARE BITS! The beauty of Hopfield networks is that I can change every bit independently of the other bits to store a novel representation. If you multiply a floating-point number by 2, the bits all shift to the left; you've just killed many kinds of operations/degrees of freedom due to linearity. With 10K bits I can represent many patterns, FAR more than the number of atoms in the universe. I can represent far more with 96 bits than with 3 floats. This paper's network is a very narrow-minded update to the original network.