Yannic is awesome at “bottom-lining” things. Cuts through the abstruse mathematical fog and says “this is all it’s REALLY doing”. This channel is HUGELY valuable to me. There are too many papers that veer off into the implementation maths, IMHO. Yannic helps you filter out all the irrelevancies.
@good_user35 4 years ago
It's great to learn why the transformer works so well (Theorem 4) and how the three vectors (K, Q and V) translate into Hopfield networks. The analysis of layers for patterns reminds me of many BERTology studies in NLP. In one of those papers, I remember it was reported that most syntactic processing seems to occur by the middle of the 12 layers. It's interesting, and it seems there is still a lot to be learned in the future. Thanks!
@samanthaqiu3416 3 years ago
I'm still confused by this paper: the original Krotov energy for binary pattern retrieval keeps the weights as *sums* over all stored patterns, which means constant storage... this LSE energy and update rule seem to keep the entire list of stored patterns around... that looks like cheating to me... I am probably missing something
@maxkho00 a year ago
@@samanthaqiu3416 I kept asking myself this throughout the entire video. Surely I must be missing something? The version of Hopfield "network" described by Yannic in this video just seems like regular CPU storage with a slightly more intelligent retrieval system.
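For readers puzzling over the same thing: the energy and one-step update rule under discussion (sketched here from memory of the paper, so the notation may differ slightly) make it explicit that the full pattern matrix X = (x_1, ..., x_N) is kept around:

```latex
E(\xi) = -\operatorname{lse}\!\big(\beta,\, X^{\top}\xi\big) + \tfrac{1}{2}\,\xi^{\top}\xi + \text{const},
\qquad
\operatorname{lse}(\beta, z) = \beta^{-1}\log\sum_{i=1}^{N} e^{\beta z_i}

\xi^{\text{new}} = X\,\operatorname{softmax}\!\big(\beta\, X^{\top}\xi\big)
```

So yes: unlike the classical binary Hopfield net, whose patterns are compressed into a fixed weight matrix, this formulation stores the raw patterns explicitly, and retrieval is a softmax-weighted average over them. That is also the K/Q/V translation mentioned above: with ξ playing the role of the query and (projections of) X playing keys and values, one update step is essentially a transformer attention step once the learned projections are added.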
@centar1595 4 years ago
THANK YOU! I actually asked specifically for this one - and man that was fast :)
@lucasl1047 28 days ago
This video is soon gonna boom lol
@chochona019 4 years ago
Damn man, the amount of great papers you review is amazing. Great work.
@samanthaqiu3416 4 years ago
Regarding Theorem 3: c has a lower bound that is exponential in d^-1, hence the guarantee that N will grow exponentially seems optimistic. If you include the lower bound on c, it seems that the lower bound on N has no exponential dependence on d at all.
@Deathflyer 2 years ago
If I understand the proof in the appendix correctly, this is just phrased weirdly. Looking at the actual formula for c, its asymptotic behaviour as d → ∞ is just a constant.
@jesseshakarji9241 4 years ago
I loved how he drew a pentagram
@mgostIH 4 years ago
Wow, I found out about your channel a few days ago, today I saw this paper and got interested in it, and now I see you just uploaded! Your channel has been very informative and detailed, quite rare compared to many others which just gloss over the details.
@dylanmenzies3973 20 days ago
I read about Hopfield nets, thought "why can't they be continuous?", and bang, straight into the cutting edge.
@emuccino 4 years ago
Linear algebra is all you need
@alvarofrancescbudriafernan2005 2 years ago
Can you train Hopfield networks via gradient descent? Can you integrate a Hopfield module inside a typical backprop-trained network?
@revimfadli4666 a year ago
I guess Fast Weights can do those
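In case it helps: yes on both counts, as far as I can tell, because one retrieval step is just (differentiable) attention. Below is a minimal PyTorch-style sketch of a Hopfield-like module with learnable stored patterns, so gradients flow into it like any other layer; the class and parameter names are my own invention, not an official API (I believe the authors also published a reference implementation, ml-jku/hopfield-layers on GitHub):

```python
import torch
import torch.nn as nn

class ToyHopfieldLayer(nn.Module):
    """Hypothetical minimal Hopfield-style layer: learnable stored patterns
    are queried with the input state; one retrieval step is just softmax
    attention over the patterns, so the whole module trains with backprop."""
    def __init__(self, dim: int, n_patterns: int, beta: float = 1.0):
        super().__init__()
        # Stored patterns as ordinary trainable parameters (n_patterns x dim).
        self.patterns = nn.Parameter(torch.randn(n_patterns, dim) / dim ** 0.5)
        self.beta = beta

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (batch, dim) -> similarity to each stored pattern: (batch, n_patterns)
        scores = self.beta * query @ self.patterns.t()
        weights = torch.softmax(scores, dim=-1)
        # Retrieved state is a convex combination of the stored patterns.
        return weights @ self.patterns

# Drop it into an ordinary backprop-trained model:
layer = ToyHopfieldLayer(dim=64, n_patterns=16)
out = layer(torch.randn(8, 64))   # (8, 64); gradients reach layer.patterns
```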
@sthk1998 4 years ago
If exponentially many patterns can be embedded within these Hopfield networks, does that mean that this is a good architecture type to use in a reinforcement learning task?
@YannicKilcher 4 years ago
possibly yes
@sthk1998 4 years ago
@@YannicKilcher how would one transfer the model representation of e.g. BERT or some other transformer model to an RL framework?
@jaakjpn 4 years ago
@@sthk1998 You can use Hopfield networks (and transformers) for the episodic memory of the agent. DeepMind has used similar transformer-like attention mechanisms in their latest RL methods, e.g., Agent57.
@revimfadli4666 a year ago
Also, how resistant would it be to catastrophic forgetting?
@revimfadli4666 a year ago
@@jaakjpn I wonder if the ontogenic equivalent of the Baldwin effect played a part
@woooka 3 years ago
Cool work, great to get more insights about Transformer attention!
@rock_sheep4241 4 years ago
You are indeed the most amazing neural network ever :))
@rock_sheep4241 4 years ago
A quick Sunday night film :))
@0MVR_0 4 years ago
Personhood goes well beyond simulated predictions with evaluative mechanics.
@Irbdmakrtb 4 years ago
Great video Yannic!
@Imboredas 2 years ago
I think this paper is pretty solid, just wondering why it was not accepted in any of the major conferences.
@dylanmenzies3973 20 days ago
Hang on a sec: n nodes, therefore n^2 weights (ish). The weights contain the information for the stored patterns; that's not exponential in n, more like storage of n patterns of n bits each at best. Continuous is different... each real number can hold unlimited information, depending on the output accuracy required.
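A quick side-by-side, in case it helps (the classical capacity figure is the standard textbook result; the exponent for the continuous case is Theorem 3 as I remember it, so treat it as approximate):

```latex
% Classical binary Hopfield net: N patterns compressed into d^2 Hebbian weights,
W = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \qquad \text{capacity} \approx 0.138\, d .

% Continuous net in this paper: no compression; the pattern matrix
% X = (x_1,\dots,x_N) \in \mathbb{R}^{d \times N} is kept explicitly, so memory grows with N.
% The "exponential" statement concerns retrieval of well-separated patterns on the
% sphere of radius M, roughly N \sim c^{(d-1)/4}.
```

So the exponential claim is not "exponentially many bits squeezed into n^2 weights"; it is "exponentially many sufficiently separated continuous patterns can still be told apart and retrieved", with storage that scales linearly in the number of patterns.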
@AbgezocktXD 4 years ago
These spheres (32:00) are just like in coding theory. Very cool
@konghong3885 4 years ago
Not gonna lie, I have been waiting for this video so I don't have to read it myself :D
@umutcoskun4247 4 years ago
Lol, I was looking for a YouTube video about this paper just 30 min ago and was sad to see that you had not uploaded a video about it yet... I was 15 min too early, I guess :D
@Xiineet 4 years ago
"its not also higher, it is also wider" LMAO
@bzqp2 26 days ago
Hopfield Networks is All You Need To Get A Nobel Prize in Physics.
@cptechno 4 years ago
Love your work! I'm interested in the research journals that you regularly scan. Can you give a list of them? Maybe you can classify them as 1) very often cited, 2) less often cited, ...
@YannicKilcher 4 years ago
There's not really a system to this
@revimfadli4666 a year ago
@@YannicKilcher so just the PhD "one paper per day" stuff?
@rockapedra1130 4 years ago
Very clear! Great job!
@luke.perkin.inventor 4 years ago
It looks great, but equivalently expressive networks aren't always equally trainable, are they? Can anyone recommend a paper that tackles measuring the learnability of data and the trainability of networks, maybe linking P=NP and computational complexity? I understand ill-posed problems, but, for example, with cracking encryption no size of network or quantity of training data will help... because the patterns are too recursive, too deeply buried, and so unlearnable? How is this measured?
@davidhsv2 4 years ago
So the ALBERT architecture, with its parameter sharing, can be described as a Hopfield network with 12 iterations? ALBERT is a single transformer encoder iterated 12 times.
@YannicKilcher 4 years ago
It's probably more complicated, because transformer layers contain more than just the attention mechanism
@nitsanbh a year ago
Would love some pseudo-code! Both for training and for retrieval.
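Seconded; in the meantime, here is a rough numpy sketch of the two operations, based on the update rule in the paper (the function names are made up, and this is a toy illustration rather than the authors' code):

```python
import numpy as np

def store(patterns):
    """'Training' in this continuous Hopfield view is just stacking the
    patterns as columns of X (shape d x N)."""
    return np.stack(patterns, axis=1)

def retrieve(X, xi, beta=8.0, n_steps=1):
    """Iterate xi <- X softmax(beta * X^T xi); one step is usually enough
    to land very close to the nearest stored pattern."""
    for _ in range(n_steps):
        scores = beta * X.T @ xi              # similarity to each stored pattern
        p = np.exp(scores - scores.max())
        p /= p.sum()                          # softmax over stored patterns
        xi = X @ p                            # convex combination of patterns
    return xi

# Example: store 3 random unit-norm patterns, query with a noisy copy of one.
rng = np.random.default_rng(0)
d = 32
pats = [v / np.linalg.norm(v) for v in rng.standard_normal((3, d))]
X = store(pats)
noisy = pats[0] + 0.1 * rng.standard_normal(d)
print(np.linalg.norm(retrieve(X, noisy) - pats[0]))   # should be tiny
```

For "training" in the deep-learning sense (a Hopfield layer inside a network), the patterns and projections simply become learnable parameters and you backprop through the softmax, as discussed in another thread above.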
@siuuuuuuuuuuuu12 24 days ago
Doesn't this network take the form of a hub?
@jeanphilippe9141 2 years ago
Hey! Amazing video, love your work. I'm a beginner in all of this, but I have this question: can increasing the number of dimensions of the problem lower its "perplexity"? Higher dimensions mean more information, meaning tighter or more specific "spheres" around a pattern. My guess is "yes", but sometimes the dimensions are fixed in a problem, so this way of lowering perplexity is impossible. Does the paper say anything about that, or do you have an educated guess on what the answer could be? :) If my question is stupid, just say so, I really don't mind! Thanks for any answer and thank you for your videos. I'm hoping to make this an activity for high school students to promote science, so thanks a lot!
@burntroses1 2 years ago
It is a breakthrough in understanding immunity and cancer
@0MVR_0 4 years ago
It is time to stop giving academia the 'all you need' ultimate.
@seraphim9723 4 years ago
Modesty is all you need!
@TheGroundskeeper 4 years ago
Hey man. I literally sit and argue about AI for a job, and I often find myself relying on info or ideas either fully explained or at least lightly touched on by you. This is a great example. It'd be a sin to ever stop. It's obvious to me that training was in no way done, and the constant activity in the middle does not indicate that the same items are going back and forth about the same things.
@gamefaq 4 years ago
Great overview! Definition 1 for stored and retrieved patterns was a little confusing to me. I'm not sure if they meant that the patterns are "on" the surface of the sphere or if they were "inside" the actual sphere. Usually in mathematics, when we say "sphere" we mean just the surface of the sphere and when we say "ball" we mean all points inside the volume that the sphere surrounds. Since they said "sphere" and they used the "element of" symbol, I assume they meant that the patterns should exist on the surface of the sphere itself and not in the volume inside the sphere. They also use the wording "on the sphere" in the text following the definition and in Theorem 3. Assuming that's the intended interpretation, I think the pictures drawn at 33:42 are a bit misleading.
@YannicKilcher 4 years ago
I think I even mention that my pictures are not exactly correct when I draw them :)
@sacramentofwilderness6656 4 years ago
Concerning these spheres: do they span the entire parameter space, or are there some regions not belonging to any particular pattern? There were theorems claiming that the algorithm has to converge; in that case, does getting caught by a particular cluster depend on the initialisation of the weights?
@YannicKilcher 4 years ago
Yes, they are only around the patterns. Each pattern has a sphere.
@valthorhalldorsson9300 4 years ago
Fascinating paper, fantastic video.
@DamianReloaded 4 years ago
It'd be cool to see the code running on some data set.
@martinrenaudin7415 4 years ago
If queries, keys and values are of the same embedding size, how do you retrieve a pattern of a bigger size in your introduction?
@YannicKilcher 4 years ago
good point. you'd need to change the architecture in that case.
@mathmagic9333 4 years ago
At this point in the video kzbin.info/www/bejne/pKeZoHl6pZulhLM you state that if you increase the dimension by 1, the storage capacity increases by a factor of 3. However, it increases by c^{1/4}, so by about 1.316 and not 3, correct?
@YannicKilcher 4 years ago
True.
@nicolasPi_ 3 years ago
@@YannicKilcher It seems that c is not a constant and depends on d. Given their examples with d=20 and d=75, we get N>7 and N>10 respectively, which looks like quite a slow capacity increase, or did I miss something?
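Just to spell out the arithmetic behind this exchange: if the capacity bound scales like c^{(d-1)/4} (which matches the c^{1/4}-per-dimension reading above), then with the video's c ≈ 3,

```latex
c^{1/4} = 3^{1/4} \approx 1.316 ,
```

i.e. roughly a 32% increase per extra dimension rather than a factor of 3; and, as noted, c is not a plain constant, so plugging in the paper's worked examples gives much more modest numbers.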
@tripzero0 4 years ago
I want to see attentionGAN (or op-GAN). Does attention work the same way in GANs?
@pastrop2003 4 years ago
Isn't it fair to say that if we have one sentence in the attention mechanism, meaning that each word in the sentence is attending to the words from the same sentence, the strongest signal will always come from a word attending to itself, because in this case the query is identical to the key? Am I missing something here?
@charlesfoster6326 4 years ago
Not necessarily, in the case of the Transformer: for example, if the K matrix is the negative of the Q matrix, then the attention will be lowest for a position attending to itself.
@pastrop2003 4 years ago
@@charlesfoster6326 True, although based on what I've read on transformers, in the case of a single sentence K == Q. If so, we are multiplying a vector by itself. This is not the case when there are 2 sentences (a translation task is a good example of that). I haven't seen the case where K == -Q.
@charlesfoster6326 4 years ago
@@pastrop2003 I don't know why that would be. To clarify, what I'm calling Q and K are the linear transforms you multiply the token embeddings with prior to performing attention. So q_i = tok_i * Q and k_i = tok_i * K. Then q_i and k_i will only be equal if Q and K are equal. But these are two different matrices, which will get different gradient updates during training.
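A compact way to state the point of this thread, in standard transformer notation (W_Q and W_K are the two learned projection matrices):

```latex
q_i = x_i W_Q, \qquad k_j = x_j W_K, \qquad s_{ij} = \frac{q_i k_j^{\top}}{\sqrt{d_k}} .
```

The self-score s_{ii} is only guaranteed to dominate its row when W_Q W_K^T behaves like a positive multiple of the identity; since the two projections are separate matrices receiving separate gradient updates, a token need not attend most strongly to itself, even within a single sentence.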
@sergiomanuel2206 4 years ago
You are a genius man!
@ChocolateMilkCultLeader 4 years ago
Thanks for sharing. This is very interesting
@zerotwo7319 a year ago
Why don't they say 'attractor'? Much easier than 'circles'.
@ArrakisMusicOfficial 4 years ago
I am wondering, so how many patterns does each transformer head actually store?
@YannicKilcher 4 years ago
good point, it seems that depends on what exactly you mean by pattern and store
@shabamee9809 4 years ago
Maximum width achieved
@rameshravula8340 4 years ago
Lots of math in the paper. Got lost in the mathematics portion. Got the gist of it, however.
@josephchan9414 4 years ago
thx!!
@004307ec 4 years ago
As an ex-PhD student in neuroscience, I am quite interested in such research.
@conduit242 3 years ago
So kNN is all you need
@FOLLOWJESUSJohn316-n8x 4 years ago
😀👍👊🏻🎉
@quebono100 4 years ago
Subscribers to the moon!
@ruffianeo3418 a year ago
What really bugs me about all "modern AI" "explanations" is that they do not enable you to actually code it. If you refer to one source, e.g. this paper, you are none the wiser. If you refer to multiple sources, you end up confused because they do not appear to describe the same thing. So it is not rocket science, but people seem to be fond of making it sound like rocket science, maybe to stop people from just implementing it? Here are a few points that are not clear (at least to me) at all:
1. Can a modern Hopfield network (the one with the exp) be trained step by step, without (externally) retaining the original patterns it learned?
2. Some sources say there are 2 (or more) layers (a feature layer and a memory layer). This paper says nothing about that.
3. What are the methods to artificially "enlarge" a network if a problem has more states to store than the natural encoding of a pattern allows (2 ^ number of nodes < number of features to store)?
4. What is the actual algorithm to compute the weights if you want to teach a network a new feature vector?
Both the paper and the video seem to fall short on all those points.
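On points 1 and 4 at least, my understanding (a sketch, not an official answer): in the continuous formulation of this paper there is no separate Hebbian weight-update step at all; the stored patterns themselves play the role of the weights, so "teaching" a new pattern is just appending it, and incremental learning is possible precisely because the patterns are retained:

```python
import numpy as np

def teach(X, new_pattern):
    """Add one pattern to a continuous Hopfield memory: append it as a new
    column of the pattern matrix X (d x N). No weight computation needed,
    but memory grows with every stored pattern."""
    return np.column_stack([X, new_pattern])
```

The "feature layer / memory layer" framing in other sources corresponds, as far as I can tell, to making the pattern matrix and the query projection learnable parameters of a network layer instead of fixed data.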
@tanaykale1571 4 years ago
Hey, can you explain this research paper: CANet: Class-Agnostic Segmentation Networks with Iterative Refinement and Attentive Few-Shot Learning (arxiv.org/abs/1903.02351)? It is related to image segmentation. I am having trouble understanding it.
@444haluk 3 years ago
This is dumb. Floating-point numbers are already represented with 32 bits. THEY ARE BITS! The beauty of Hopfield networks is that I can change every bit independently of the other bits to store a novel representation. If you multiply a floating-point number by 2, the bits all shift to the left; you've just killed many kinds of operations/degrees of freedom due to linearity. With 10K bits I can represent many patterns, FAR more than the number of atoms in the universe. I can represent far more with 96 bits than with 3 floats. This paper's network is a very narrow-minded update to the original network.