Important correction: There's an error in the scrappy code I was demoing around 19:50, such that in fact not all pairs of vectors end up in that (89°, 91°) range. A few pairs get shot out to have dot products near ±1, hiding in the wings of the plot. I was using a bad cost function that didn't appreciably punish those cases. On closer inspection, I think it's not actually possible to get 100k vectors in 100 dimensions to be as "nearly orthogonal" as this. 100 dimensions seems to be too low, at least for the (89°, 91°) range, for the Johnson-Lindenstrauss lemma to really kick in. The broader point made at that point in the video _is_ true, though, which is that the number of nearly orthogonal vectors you can cram in scales exponentially with the dimension. But when that makes for an appreciable effect is an interesting question. The viewer Nick Yoder wrote in with more explorations on this, with these as a few highlights:
1. 89 degrees is a rather tight constraint. At this angle, things don't get interesting until dimensions > 250,000
2. 85 degrees (a dot product of only 0.087) scales VERY quickly
3. 85 degrees and 12,288 dimensions (GPT-3's embedding space) holds well over 40 billion vectors (likely more than 10^17)
4. 85 degrees and 116,000 dimensions (within reason for GPT-4, GPT-o1, -o3 or similar newer models) holds well over a googol (10^100) vectors, and likely more than 10^160
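For anyone who wants to poke at this themselves, here's a rough sketch of the kind of experiment in question. This is not the actual script from the video; the dimension, vector count, and the choice to skip the tuning step entirely are just illustrative assumptions.

    import numpy as np

    # Sample random unit vectors in a modest dimension and look at how their
    # pairwise dot products distribute, before any optimization.
    rng = np.random.default_rng(0)
    dim, count = 100, 1000            # illustrative sizes, not the 100k from the video
    vecs = rng.normal(size=(count, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    dots = (vecs @ vecs.T)[~np.eye(count, dtype=bool)]
    angles = np.degrees(np.arccos(np.clip(dots, -1, 1)))

    print("typical |dot|:", np.abs(dots).mean())     # roughly 0.08, i.e. about 1/sqrt(dim)
    print("worst   |dot|:", np.abs(dots).max())      # the "wings" mentioned above
    print("share in (89°, 91°):", np.mean((angles > 89) & (angles < 91)))

With purely random vectors in 100 dimensions, typical pairs sit roughly in the 84°-96° band (dot products around ±0.1), which is closer to the 85° regime in Nick's numbers; only a small fraction of pairs land inside the much tighter (89°, 91°) window, and an optimizer has to fight those wings to do better.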
@pics229917 күн бұрын
A googol vectors? Are there even this many nuanced meanings within the limits of a human language? I guess that takes a lot of context into account...
@iaPika15 күн бұрын
It's so great that you would clarify something like this even months after the original video was posted; it's yet another thing that shows you truly care about the content you make! It's fascinating to see the interplay between dimensionality and orthogonality in such depth. The exponential scaling insight is still mind-blowing and makes me want to dig deeper into the Johnson-Lindenstrauss lemma and its practical limits.
@pankajb6413 күн бұрын
10^17 is 7 orders of magnitude more than 40 Billion. Seems a little confusing when you wrote it like that. But thank you for pointing these corrections out and for the whole series indeed.
@thatguyalex283510 күн бұрын
Thank you, 3Blue1Brown. Now I understand some of what happens inside the 12-billion-parameter Mistral AI model on my laptop. :)
@PiercingSight5 ай бұрын
Okay, the superposition bit blew my mind! The idea that you can fit so many perpendicular vectors in higher dimensional spaces is wild to even conceptualize, but so INSANELY useful, not just for embedding spaces, but also possibly for things like compression algorithms! Thank you so much for this truly brilliant series!
@00010110125 ай бұрын
Superposition reminds me of bloom filters
@markwebb71795 ай бұрын
When I first read that Toward Monosemanticity paper, it was like a revelation. I think it's highly likely that encoding information through superposition explains biological signaling models that have remained unexplained for decades. These concepts aren't just making waves in tech.
@galactic_dust425 ай бұрын
Right? It's as if we could "stretch" the space to fit more information. It's crazy!
@user-zz6fk8bc8u5 ай бұрын
But is it really that surprising? I'd actually be amazed if the internal representation were highly structured. Asking where an LLM stores the fact that Michael Jordan is a basketball player is a bit like asking where your brain stores the value of your current age. That's all over the place, not a single "variable" that sits in a group of a few neurons.
@IvanToshkov5 ай бұрын
@@user-zz6fk8bc8u The surprising part was that the number of almost perpendicular directions grows exponentially with the number of dimensions. Without this, a 100-dimensional space would only be able to handle 100 independent concepts.
@srijagangadkar28602 күн бұрын
This is the best video in the world for understanding these complex models. Thanks! Please keep making such videos.
@mchammer50265 ай бұрын
During the whole video I was thinking "ok but it can only encode as many 'ideas' as there are rows in the embedding vector, so 'basketball' or 'Michael' seem oddly specific when we're limited to such a low number". When you went over the superposition idea everything clicked, it makes so much more sense now! Thank you so much for making these videos, Grant!
@AlphaPhoenixChannel5 ай бұрын
I was doing exactly the same thing in my head - i knew there had to be a catch that I couldn't see and then there it was, hiding right behind the 12288-dimensional teapot.
@giacomosimongini54525 ай бұрын
The number does seem low, but it is comparable to the number of "word" tokens in GPT-3's vocabulary (50k), and 'basketball' or 'Michael' are ideas simple enough to possibly be represented by individual tokens. But of course the superposition trick makes it possible to represent much more nuanced and niche ideas.
@beaucoupdecafeine5 ай бұрын
I still don't get it 😭
@mariusfacktor35974 ай бұрын
@@beaucoupdecafeine Let me take you through a simple example. Let's say you have 8 outputs you want to encode for. {dog, cat, table, chair, bucket, mouse, eagle, sock}. A first idea is to use eight numbers (a vector of size 8) and tie each element in the vector to a certain output. For example 10000000 = dog, 01000000 = cat, 00100000 = table, ... This is called one-hot encoding. But let's say you are limited to an output vector of size 3, so like xxx where each x can be a number. Can you still encode all 8 outputs? Yes you can. You can use every combination of 0 and 1. For example 000 = dog, 001 = cat, 010 = table, 011 = chair, ... But now instead of 8 outputs, you have 500. Can you still encode 500 outputs using a vector of size 3? Yes. Just use real numbers instead of binary. For example (0.034, 0.883, -0.403) = dog, (0.913, -0.311, 0.015) = cat, ... (0.664, -0.323, -0.844) = baseball. As far as I know, Large Language Models (as well as most other Natural Language Processing models) use a fixed vocabulary. So an LLM may have a vocabulary size of 32,000 words which each map to a unique vector.
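A tiny sketch of that same idea in code might help (the words, sizes, and numbers here are made up purely for illustration, not taken from any real model):

    import numpy as np

    vocab = ["dog", "cat", "table", "chair", "bucket", "mouse", "eagle", "sock"]

    # One-hot: one dimension per word, every direction exactly perpendicular.
    one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

    # Dense: a short real-valued vector per word, so you can have far more
    # words than dimensions, at the cost of exact perpendicularity.
    rng = np.random.default_rng(0)
    dense = {}
    for w in ["dog", "cat", "baseball"]:
        v = rng.normal(size=3)
        dense[w] = v / np.linalg.norm(v)

    # Decoding a (possibly noisy) vector back to a word is just a nearest-neighbour
    # lookup by dot product.
    def closest(v, table):
        return max(table, key=lambda w: float(np.dot(v, table[w])))

    print(closest(dense["dog"] + 0.01 * rng.normal(size=3), dense))   # prints 'dog'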
@jackbyrnes32234 ай бұрын
Yes, and the dimensions he used were directly aligned with 'Michael' and 'Jordan' - this wouldn't really be the case, as it would be an inefficient use of weights. Michael would instead be a combination of the ~12,300 feature dimensions.
@Gabriel-tp3vc5 ай бұрын
This video is pure gold! This complex topic is just so clearly and correctly explained in this video! I will show this to all my students in AI-related classes. Very highly recommended for everyone wanting to understand AI!
@thatonedynamitecuber5 ай бұрын
How’d you get here so early?
@manuelsuarez75215 ай бұрын
how did you comment to the past?
@tommy_asd5 ай бұрын
@@thatonedynamitecuber Right? Video released less than a minute ago, this was commented an hour ago
@one_in_many5 ай бұрын
That too it says he commented an hour ago 😮
@laylaknopfler59895 ай бұрын
Can you share with us some of the materials you teach? 🙏
@sigmund51053 күн бұрын
Thank you! I am completely stuck on how the K, Q, V matrices are built during training; nobody talks about that, so I'm looking forward to your next video 🙂!
@rigelr53455 ай бұрын
Broo, just watched chapter 4 and re-watched the previous three chapters. Your first three videos had just dropped when I was learning about neural networks in grad school, like perfect timing. Took a couple of years off drifting around. Now, I'm going back into machine learning, hopefully gonna do a PhD, so I was re-watching the video series, then realized this one was rather new and got curious, noticed the last chapter 7 is missing, then check your home page and lo and BEHOLD you've released chapter 7 like 53 minutes ago. Talk about impeccable timing. I feel like you just dropped these just for me to go into machine learning haha... kinda like "welcome back, let's continue". Anyway thank you so much for taking me on this wonderful journey.
@fluffsquirrel5 ай бұрын
Perfect timing! I'm studying it as well and am pleasantly surprised at the incredibly convenient uploads. Good luck on your PhD Rigel!
@srisaisubramanyamdavanam99125 ай бұрын
That's how the universe attracts curious people. There was a task given to me to give a lecture on some ML topic around 4 months ago. Out of instant gratification I chose to speak about GPT architecture. I was literally scared initially and was doing some research, and guess what, just after 2 to 3 days this man out of nowhere suddenly dropped a bomb by starting a series on transformers. I was so happy at that time, and it helped me do a good amount of research, and the seminar also went well.....
@AlphaPhoenixChannel5 ай бұрын
The script you ran with randomly distributed vectors was mind-opening, let alone once tuned - that's incredible. It's such an awesome quirk of high dimensions. I spent a good chunk of yesterday (should have spent a good chunk of today but oh well) working on animations to try to communicate traversing a high dimensional configuration space and why gradient ascent really sucks for one particular problem, so the whole topic couldn't be more top-of-mind. (my script already contains a plug for your previous video with the "directions have meanings" foundation. this series is so good!)
@oncedidactic5 ай бұрын
You know when alpha phoenix upvotes it’s good stuff
@mr_rede_de_stone9165 ай бұрын
Still can't help but be blown away by the quality of pedagogy in these videos...
@ValidatingUsername5 ай бұрын
@@iloveblender8999 Pedagogy is the structure or chronology of education, which tries to make sure all prerequisite knowledge is covered to a competent degree before the next step.
@fluffsquirrel5 ай бұрын
@@iloveblender8999 I think the word was used correctly, but I love Blender too!!
@christophkogler62205 ай бұрын
Oh, oh, I can be more correcter than both of you! 'pedagogy' doesn't actually seem to be an extremely precisely defined word. But, based on a few seconds of search, both of you are QUITE wrong, and OP used the word correctly. Merriam-Webster on pedagogy: 'the art, science, or profession of teaching'. Wikipedia on pedagogy: 'most commonly understood as the approach to teaching, is the theory and practice of learning'. The word pedagogy itself indicates nothing about who is involved, what is taught, or how the teaching occurs.
@ValidatingUsername5 ай бұрын
@@christophkogler6220 “The approach to teaching … theory and practice of learning” If you can’t rationalize someone discussing what is taught and when for best learning and teaching then don’t even google it.
@fluffsquirrel5 ай бұрын
@@christophkogler6220 "More corrector" tho?
@ishtaraletheia98045 ай бұрын
The near-perpendicular embedding is wild! Reminds me of the ball in 10 dimensions.
@prdoyle5 ай бұрын
Incredible that it's exponential!
@Charles-Darwin5 ай бұрын
It is astonishing
@silviavalentine38125 ай бұрын
I know this is such a late response, but I wanted to comment on how amazingly human this process is. Whenever we hear or read words, our brain immediately starts to guess the meaning using the very same process mentioned in this video series. For example, I'm sure many of you have seen the pangram "The quick brown fox jumps over the lazy dog". Now if you come across a sentence that starts with "The quick", your brain might come up with many different ideas, but the pangram would be included. As soon as you interpret the word after "brown", your chances of guessing the pangram go up. I believe the same is true for thinking of "solutions" of words to output as well.
@JohnBerry-q1h5 ай бұрын
The pangram that you cite was a typical repetitive TYPING DRILL, back when my high school (in the 1980s) taught typing on IBM Selectric typewriters.
@JivanPal3 ай бұрын
@@JohnBerry-q1h It has been a handwriting drill for many, many decades, too, and remains so even in the present day. Specific typing drills are not particularly useful, since they result in people exhibiting muscle memory rather than focusing on typing. For example, I can type my five-word computer password extremely quickly just from memory, because it's a simple repetitive task that I do many times a day - sometimes I even briefly forget the actual words that make it up, since I'm subconsciously recalling finger movements, not thinking about the words themselves - but my typing in general is less rapid than that.
@jordanledoux1973 ай бұрын
What's even more crazy is when you look at the studies that have been done on people with a severed corpus callosum. The experiments that were performed suggested that the human brain consists of many "modules" that do not have any kind of consciousness themselves, but are highly specialized at interpretation and representation. It seemed like the way "consciousness" works in humans is that the conclusions of these different "modules" within the brain all arrive at the decision-making or language parts of the brain, and those parts work VERY similarly to these LLMs: they generate a thought process or set of words that explains the input from the modules, and that's it. For example, one module might identify something as dangerous, and another module might identify something as being emotionally significant. Those conclusions, along with many, many other "layers", arrive at the language parts of the brain, and that part of the brain creates a "story" that explains the sum of the "layer" conclusions. Is this the only way to interpret the results of the experiment? Of course not, and it's not the way they were interpreted originally when first performed. But we also hadn't really invented any AI at that point either. The way that these models represent information and process data seems to me to be MUCH closer to how human brains work than I think most people realize. The human brain probably has at least one type of layer (such as attention or MLP) that is not currently represented in modern AIs, and is also even more parallel, allowing for the possibility of "later" MLP layers or attention layers to cross-activate other layers while they are happening.
@danielwoods7325Ай бұрын
The thing I love most about this video is how closely these concepts align with what I was studying in my MA in English. I wrote my master's dissertation borrowing some of Ricoeur's ideas on semantic pertinence - specifically, I said that time and "meaning" (pertinence) are the axes of narrative, and that "understanding" is the movement between points in that co-ordinate system. So to hear that LLMs encode "meaning" as a direction in a higher-dimensional co-ordinate system... it's shockingly similar. It's not the only example (there are some interesting parallels between how an LLM processes an input and classical philosophies on our experience of time) - I really love seeing how concepts that were being talked about thousands of years ago are being mirrored in the bleeding-edge tech of today.
@peterrichards51072 ай бұрын
Thanks for all of these videos, great work!
@owcsc3 күн бұрын
Thank you for donating
@rigelr53455 ай бұрын
I watched the end of this series now, and I'm just blown away by the maths of it all. Like how simple each step is, yet how it all adds up to something so complex and WEIRD to wrap your head around. It's just so fucking fascinating man. Really makes you think about what WE even ARE.
@lesmcfarland24745 ай бұрын
This is simply the most comprehensible explanation of transformers anywhere. Both the script and the visuals are fantastic. I learned a lot. Thank you so much.
@redex688 күн бұрын
This channel keeps blowing my mind with how well and how intuitively some topics can be explained after going to lectures that completely confuse me.
@Vadim-rh3lr5 ай бұрын
There are no words for how good this is and all of the 3Blue1Brown videos are. Thank you, Grant Sanderson, you are providing a uniquely amazing service to humanity.
@dkosolobov14 сағат бұрын
Thank you for these wonderful videos! The clarity, the excellently thought-out visuals, the constant attention to the viewer, everything: the overall level of presentation is astounding. And all this with almost no compromises on the mathematical rigour!
@juliuszkocinski74785 ай бұрын
Grant does such a good job of explaining these in an interesting manner; I bet 3b1b has had a measurable impact on the whole of humanity's grasp of math at this point.
@MahsaBadami3 күн бұрын
I'm a visual learner, and after struggling through various courses trying to grasp the math behind LLMs, your videos have been a game-changer for me. I can't express how easy, intuitive, and calming your teaching style is! Sometimes, I watch an episode, take notes, and then rewatch it while cooking or painting; it really sticks that way. I'm so excited for the rest of the series and can't wait to recommend your videos to others. Thank you for making learning enjoyable!
@SunnyKimDev5 ай бұрын
The combination of a Linear + ReLU + Linear function, with the result added back onto the original, is known as a residual block (as in Residual Networks). As 3b1b demonstrated in this video, the advantage residual networks have over a simple perceptron network is that the layers perturb (nudge) the input vector rather than replace it completely.
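A bare-bones sketch of that block (the sizes here are tiny and purely illustrative; GPT-3's are 12,288 and 4x12,288, and real implementations also include layer norm and careful initialization):

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_hidden = 64, 256            # toy sizes for illustration

    W_up, b_up = rng.normal(0, 0.02, (d_hidden, d_model)), np.zeros(d_hidden)
    W_down, b_down = rng.normal(0, 0.02, (d_model, d_hidden)), np.zeros(d_model)

    def mlp_block(x):
        h = np.maximum(0.0, W_up @ x + b_up)      # Linear, then ReLU
        return x + (W_down @ h + b_down)          # Linear, then add back onto the input

    x = rng.normal(size=d_model)
    print(np.linalg.norm(mlp_block(x) - x))       # small: the block nudges x rather than replacing it

The "+ x" at the end is the residual connection described above; without it, the block would have to reproduce everything about x that later layers still need.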
@fluffsquirrel5 ай бұрын
Thank you so much! This makes sense
@tymczasowy5 ай бұрын
Thanks for the pointer! I remember MLP from 10 years back and I couldn't recall the "adding part". Btw. I'm also puzzled by a different thing -- we call it MLP, i.e., *multilayer* perceptron. But it seems that in a single MLP block there is only one layer. It's called multilayer, because it's replicated. I.e., it's N layers, each consisting of "(multi-head) attention" + "1 layer of residual perceptrons". Is my understanding correct? Do you know whether there are deep nets that actually use multiple layers in a single block? Why or why not would they do it?
@fluffsquirrel5 ай бұрын
@@tymczasowy Sorry I may be completely off on this, but would those be dense nets?
@tymczasowy5 ай бұрын
@@fluffsquirrel I'm not sure I understand your question. AFAIU all these are dense networks. My question was that it seems from the video that in each layer there is a single "sub-layer" of attention followed by a single sub-layer of residual perceptrons. So, the perceptrons are never stacked directly after each other (there's always attention between them). I am wondering whether it would be beneficial to have more consecutive sub-layers of perceptrons (after each attention). Are there deep nets with such architecture? If not, is it because it's *impractical* (too hard to train, too many parameters), or maybe it doesn't really make sense *conceptually* (there is nothing to gain)?
@fluffsquirrel5 ай бұрын
@@tymczasowy I apologize as I am not well-versed in this subject, but I too am interested in this idea. Hopefully somebody will come along with an answer. Good luck!
@Pandando4 ай бұрын
Every time I rewatch this series, I feel an irresistible urge to give it another thumbs-up. It's a shame that's not possible!
@hbhavsiАй бұрын
Cannot thank you enough for the value you have provided to the world through this video series. Could not have found a better or a more engaging starting point to study Neural Networks. I was telling my wife about this video series, and described it as a level of pedagogy that has no match.
@EzequielPalumbo5 ай бұрын
You have explained this topic so well that it almost looks trivial. Amazing.
@simonthehedgehog9285 ай бұрын
Thank you very much for yet another wonderful explanation! While focussing on training feels reasonable, I would also love to learn more about positional encodings. The sinusoidal encoding used in the original paper and the more recent variants would surely make for some interesting visualizations; just from reading the papers I'm lacking the intuition for this.
@3blue1brown5 ай бұрын
I'll consider it for sure. There are many other explainers of positional encoding out there, and right this moment I have an itch for a number of non ML videos that have been on my list.
@simonthehedgehog9285 ай бұрын
@@3blue1brown Thanks for considering :) I totally understand that there are non ML videos to be made - do whatever feels right and I'll enjoy it nonetheless
@a.gholiha68845 ай бұрын
Animation 5/5, Didactics 5/5, Pedagogy 5/5, Voice 5/5, Knowledge 5/5, Uniqueness 4/5. Just beautiful work ❤❤❤ Keep it up. I will send this to everyone who appreciates the work.
@UnderscoreZeroLP2 ай бұрын
damn lost a point on uniqness :,(
@a.gholiha68842 ай бұрын
@@UnderscoreZeroLP hard to be totally unique and somewhat dangerous (during your lifetime) also.
@reocam8918Ай бұрын
@@a.gholiha6884 Do you mind my asking for other similarly unique channels, since uniqueness lost a point?
@UnderscoreZeroLPАй бұрын
@@a.gholiha6884 i think 3b1b is dangerously uniq though 😱
@naninano88135 ай бұрын
"locating and editing factual associations in GPT" is a fun read and an important prior work for this. thanks for posting!
@Veptis5 ай бұрын
I got two relevant references that help answer this question with great examples: 1. Locating and Editing Factual Associations in GPT (2022) Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov 2. Analyzing Transformers in Embedding Space (2022) Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant
@naninano88135 ай бұрын
Yeah, I remember the first paper was even on YT, with Yannic Kilcher interviewing the original authors. Btw, have you guys seen that write-up "What's Really Going On in Machine Learning?" which draws parallels between cellular automata and highly quantized MLPs? Very nice read.
@Veptis5 ай бұрын
@@naninano8813 the Yannic interview was great. I have also been fortunate enough to have Yonatan Belinkov visit my university and give a guest lecture on 3 adjacent papers, which I had quite a few questions to discuss.
@kwew13 ай бұрын
I may have learnt something new here. Grant is saying that the skip connection in the MLP is actually enabling the transformation of the original vector into another vector with enriched contextual meaning. Specifically, at 22:42, he is saying that via the summation of the skip connection, the MLP has somehow learnt the directional vector to be added onto the original "Michael Jordan" vector, to produce a new output vector that adds the "basketball" information. I was originally of the impression that skip connections were only there to combat vanishing gradients and expedite learning. But now Grant is emphasizing that they do much more!
@rishivij19955 ай бұрын
No one else has presented the mathematics on youtube the way you have. Everyone else just throws around fancy words, while you give the essence of how things work. Great video Grant, waiting for more videos from you 😊
@dougunderwood5695 ай бұрын
Thanks!
@pavithran85075 ай бұрын
Hey Grant, I'm sure I can't understand everything from this series so I'm skipping this video, but the purpose of this comment is to thank you for creating the manim Python library, because you took (I would say) "animation for education" to an entirely different level and encouraged many people to do that in your style of animation using manim. Because of you, indirectly, I'm learning many things on youtube. Thanks again, and I wish you more success in your career with your loved ones.
@danfg72155 ай бұрын
I didn't watch the other videos (yet) but I could totally pretend to understand a lot of the things in this video, which was mind blowing. Try it out, you can always come back later to try and understand more.
@Dom-zy1qy5 ай бұрын
Nah, I am certain you can understand this. If you know Python (optional), some calculus, and linear algebra (kind of optional), you're good. It's just overwhelming at first. The hard part for me really is trying to understand wtf researchers are talking about in papers. Can't tell if researchers are pedantic, or I'm just too dumb, or both.
@djmetrokid5 ай бұрын
@@Dom-zy1qy Math is not optional here, dude. I agree that if you just want to learn the theory Python is optional, but saying calculus and linear algebra are kind of optional is bullshit; AI is a mathematical field. As for papers, yes, they can be overwhelming, but reading an academic paper isn't a quick process; you need to read paragraphs repeatedly to understand them properly.
@fgvcosmic6752Ай бұрын
@@djmetrokid Linear Algebra is optional here as long as you know how Vectors and Matrices work, without needing details like fields and spaces.
@thatonedynamitecuber5 ай бұрын
Always happy to see 3B1B uploading
@karanshah16985 ай бұрын
I have been re-watching the Superposition hypothesis for the last 15 minutes. It still blows my mind. Grant, your work is so beautiful, to me you are the Da Vinci of our time.
@laurenceporter49822 ай бұрын
Thanks
@grotmx5 ай бұрын
Fantastic explanation. Also, am I the only one who appreciates his correct use of "me" and "I"? So rare these days.
@kinderbeno565 ай бұрын
Decided on a whim last night to get a refresher on the previous 2 videos in the series. Beautiful timing; great work as usual
@artnibv2 ай бұрын
Thanks!
@johnchessant30125 ай бұрын
18:28 this part is really cool!
@dilyaraarynova66573 ай бұрын
Just watched the whole series, took me two days but it was so insightful and easy to understand compared to other sources. Thank you! Those animations and the approach you take to explain is so helpful!
@threeMetreJimАй бұрын
It's strange that having an MLP store facts used to be called 'over-fitting'. I did some experiments with deliberate over-fitting and storage of separate images in a limited-size MLP, and it turned out the amount of useful information able to be stored was close to the number of weights (parameters) multiplied by their size (8 bytes for a 64-bit float). It was close to the same for JPEG-stored images, but using a very oversized network gave the best quality images. 'Useful information' being related to the frequency content of the images (high-frequency content, i.e. sharp edges, taking more storage space). The MLP performs a kind of compression in an attempt to fit all the data. For images, I was tempted to think in terms of how many planes of information could fit (overlapping planes causing the data compression effect). Not sure if this would translate to an LLM in some way.
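For anyone curious what such an experiment can look like, here is a rough sketch of the general shape of it: a tiny coordinate-to-pixel MLP trained by plain gradient descent to memorize one image. The image, sizes, and learning rate are made-up placeholders, and the original experiment's setup may well have differed.

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((32, 32))                     # stand-in for a real grayscale image
    grid = np.linspace(0, 1, 32)
    xs = np.stack(np.meshgrid(grid, grid), -1).reshape(-1, 2)   # (x, y) coordinates
    ys = img.reshape(-1, 1)                                      # target pixel values

    W1, b1 = rng.normal(0, 0.5, (2, 64)), np.zeros(64)
    W2, b2 = rng.normal(0, 0.5, (64, 1)), np.zeros(1)
    lr = 0.05
    for _ in range(2000):                          # gradient descent on mean squared error
        h = np.tanh(xs @ W1 + b1)
        err = (h @ W2 + b2) - ys
        gW2, gb2 = h.T @ err / len(xs), err.mean(0)
        gh = (err @ W2.T) * (1 - h ** 2)
        gW1, gb1 = xs.T @ gh / len(xs), gh.mean(0)
        W1, b1 = W1 - lr * gW1, b1 - lr * gb1
        W2, b2 = W2 - lr * gW2, b2 - lr * gb2

    n_params = W1.size + b1.size + W2.size + b2.size
    print("parameters:", n_params, "vs pixels to memorize:", img.size)   # 257 vs 1024

Comparing the parameter count to the pixel count is the capacity comparison the comment above describes.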
@JonBarker15 ай бұрын
Thanks!
@DarthMakroth5 ай бұрын
I've been reloading your page everyday since the first in the LLM series, seeing this notification was like Christmas come early
@br27603 күн бұрын
Thanks!
@utsav47515 ай бұрын
I was waiting for this video for so long. Thanks for this!
@hatemsshow1458 күн бұрын
Thanks a lot. I struggled a lot with understanding transformers but you made it super simple. Truly a savior.
@PewrityLab5 ай бұрын
22:10 Holograms coming! :D
@jonasgajdosikas11255 ай бұрын
maybe even a crossover with thought emporium?
@cardosov5 ай бұрын
Thanks
@carlosmorais75225 ай бұрын
That is an amazing video. Watching it makes me think a lot about how "Akinator" most probably works, by "asking questions" about the characters, and how they live in that superposition space that can only be activated when the right set of answers is given.
@aayushjariwala62565 ай бұрын
Umm, I don't think Akinator even works with NLP. It's just a large database which you can think of as a decision tree (yes, no, neutral, probably yes, probably no), where each branch has a question and divides the database into smaller subtrees. But yes, compared to a vector which encodes a particular question, it is doing the same job!
@pafnutiytheartist5 ай бұрын
Akinator works on a much simpler principle, but it can also be expressed in terms of vectors. Each question is an orthogonal direction: 1 is yes, 0 is no, 0.5 is unknown. As you answer the questions you populate the vector components, and the system picks the component that'll be the most valuable for cutting down the space of possibilities. The data is simply stored in a database; no fancy neural networks needed.
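A toy sketch of that idea (the characters and questions are made up for illustration):

    import numpy as np

    # Each question is an axis; each character is a point: 1 = yes, 0 = no, 0.5 = unknown.
    questions = ["Is it a real person?", "Is it an athlete?", "Is it a scientist?"]
    characters = {
        "Michael Jordan":  np.array([1.0, 1.0, 0.0]),
        "Albert Einstein": np.array([1.0, 0.0, 1.0]),
        "Sherlock Holmes": np.array([0.0, 0.0, 0.5]),
    }

    answers = np.full(len(questions), 0.5)    # everything unknown at the start
    answers[1] = 1.0                          # user answered yes to "Is it an athlete?"

    # Keep only the characters consistent with the answers given so far.
    consistent = [name for name, v in characters.items()
                  if all(a == 0.5 or a == x for a, x in zip(answers, v))]
    print(consistent)                         # ['Michael Jordan']

A real system would also pick the next question as the axis that best splits the remaining candidates, which is the "most valuable component" idea above.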
@carlosmorais75225 ай бұрын
@@pafnutiytheartist Exactly. Each answer to those questions can be correlated with a dot product, as in the NLP case. But for me, the mystery is how they populate their database.
@MrDivarАй бұрын
Many thanks for making such fruitful videos. I'm excited for the upcoming chapters for training, fine-tuning and alignment topics!
@vitaliiznak5365 ай бұрын
This is an excellent video! It offers the best intuitions on the transformer architecture that I've seen. However, I'm curious about one aspect that wasn't covered: positional encoding. Specifically, I'm trying to understand why adding positional encoding doesn't disrupt the semantic clustering of the initial word embeddings. How does the model maintain the integrity of these clusters while incorporating positional information? I'd love to see this explored in a future video.
@julian19715 ай бұрын
Thanks😊
@codybattery83705 ай бұрын
AI researcher here, really like the explanation of ReLU! My recent experiments show that dropping all absolute gate values below a threshold leads to universal performance gains across different model architectures!
@rbk000Ай бұрын
Amazing.
@JonnyInChina5 ай бұрын
FINALLYY the video we all waited for!
@Words-.5 ай бұрын
fr!!!
@fluffsquirrel5 ай бұрын
This AND training's gonna be next! Who's excited for that too?!
@frank-lr4 ай бұрын
I bow my head to your ability to explain complex topics in a didactically excellent way without leaving out crucial details.
@prateekgupta24085 ай бұрын
I have been waiting for this video for a long time man!. Good Job and I hope to see the video explaining Training soon! Thank you so much for these :)!
@Freak80MC5 ай бұрын
"How facts are stored remains unsolved" I'm only a minute in but it's kinda wild to think we have invented machines so complex that we can't just know exactly how they work anymore, we basically have to do studies on them the same way you would do a study on a human brain which won't lead to a full understanding, just a partial understanding. Maybe some problems are so complex we can never truly know the complete inner workings of everything. Maybe we are hitting into the limit of human cognition itself. A human being simply can't keep all the facts straight in their head all at once to be able to come to a complete understanding of something this complex. I don't know, just rambling... It's just wild that we invented something that works, but if you ask HOW it works, that's literally an unsolved problem. We flat out don't know exactly how it works.
@iankrasnow53835 ай бұрын
How little do we really understand? Is it naive to assume that we can start to develop more algorithms whose purpose is to analyze the behavior of neural networks? Surely there's some way to map a concept's embedding in a perceptron in some sort of rigorous way.
@academyfortechnologyandcom84735 ай бұрын
But of course, neural nets are a loose imitation of our own brain. Since we don't know where facts are stored in those either, it's not unintuitive that we wouldn't know where they are stored in things that operate similarly. You said that we've invented something, and that's only true to a point. It seems to me that what we're doing is as much "discovering" as it is "inventing".
@lemurpotatoes79884 ай бұрын
They do that in medicine all the time.
@sender14964 ай бұрын
The "mystery" surrounding neural networks is VERY different from that of the brain. Sure, having a high-level understanding of how it structures the different weights is very headache-inducing, but it is not a mystery. We still know exactly everything going on conceptually, and grasping how information is stored, etc. is not a mystery in my opinion, even though it's complicated. Understanding the human brain on the other hand is much more complicated, because for every person, things are "experienced". For instance, there is no mystery in following the process going from photons hitting your retina, to the signal reaching the visual center of your brain. However, how that visual center somehow makes you "see" something is where the mystery is. Questions you can ask yourself to understand this are: Why do you see your red as red and not yellow? Why does cold feel like cold and not warm? And more importantly, why is there any sensation at all? You can also google "p-zombies". In our modelling of neural networks and machines, everything works without "sensation" (or at least there is no proof of any sensation). We do not know what sensation/experience/whatever you like to call it is, and we have no way of modelling it. Surely there is something interesting in trying to understand how machines with similar neural structure store their information and comparing that to the human brain, but simply thinking of humans as "inputs (for instance vision) mapping to outputs (for instance movement)" does not at all deal with the problem mentioned above.
@sender14964 ай бұрын
I think the hint lies in how different reality is when we diverge from what we are used to. Everything makes sense looking at our own scale, but when trying to understand smaller and bigger things (quantum physics, relativity, etc.) it suddenly seems like reality is very different from what we can make sense of. Reality is probably very different from what we think, and as humans, we are very limited, for example by our specific senses. Neural networks are built in a way that we understand each process, at least at a detailed level. We can technically track every decision it makes and understand it. This is clearly less complicated than the human brain which includes sensation, and probably requires a more advanced understanding of what reality actually is.
@Neomadra5 ай бұрын
So glad you touched on interpretability. Anthropic's Towards Monosemanticity paper is one of the most intriguing this year. Using sparse autoencoders to extract monosemantic features is just genius.
@thamiordragonheart86825 ай бұрын
the video was really cool and the mathematical explanation is really good. I do have a semantic quibble though. The model doesn't have any clue that Michael Jordan plays basketball, it knows that sentences that include Michael Jordan, sport, and play all often include the word basketball, so a sentence that includes all 3 is very likely to include basketball. It's a subtle distinction, but I think it's important because it explains how a Large Language Model can "know" really common facts like that Michael Jordan plays basketball, and even what team he's on, but often mess up quantitative answers and spout well-structured nonsense when asked about something even moderately technical.
@3blue1brown5 ай бұрын
I think that’s a valid quibble. It’s very hard to talk about these models without over-philosophizing what we mean by words like “know”, and also without over-anthropomorphizing. Still, the impressive part is how flexible they can seem to be with associations learned, beyond for example what n-gram statistics could ever give, which raises the question of how exactly the associations get stored. The fact that they do well on the Winograd schema challenge, for example, raises a question of how language is represented internally in a manner that seems (at least to me) to much more deserve a description like “understanding” than previous models.
@bornach5 ай бұрын
How well do LLMs perform with Winograd schema submitted after their training cutoff date? A problem of evaluating anything trained on ginormous datasets is ensuring the answers weren't included in the training. Many initially impressive AI results look less impressive when it is revealed that the researchers took less than adequate precautions to avoid cross contamination between training set and evaluation set. When a LLM cannot answer "If Tom Cruise's mother is Mary Lee Pfeiffer, who is Mary Lee Pfeiffer's son?" or only gets the correct answer when it is someone famous but fails with random made-up people, one does question what aspects of a fact has it really learned? Prof Subbarao Kambhampati points to several studies that show LLMs memorise without understanding or reasoning.
@thamiordragonheart86825 ай бұрын
@@3blue1brown The fact that being a little flexible with orthogonality gives you exponentially more dimensions is really interesting and impressive. I had no idea that was possible, though it does make sense and it does raise a lot of interesting questions. I think the information storage is probably less like an encyclopedia and more like the world's most detailed dictionary because it stores correlations and relationships between individual tokens. After reading through some winograd schemas, I do think that they prove the model knows something and has some reasoning ability, but with the level of detail LLMs record, I think they can be answered by reasoning about how language is structured without requiring you to know the physical meaning or underlying logic of the sentence. Given how little of the human brain is devoted to language among other things, I don't think that has very much to do with how most humans store information or would solve Winograd Schemas internally, but it's definitely some kind of knowledge and reasoning, and how that fits into what makes something intelligent is more of a philosophical debate. At the level LLMs work at, all human languages seem to have an extremely similar structure in the embedding space, so I think the most exciting realistic application for LLMs once we understand them a little better is matching or exceeding the best human translators, and eventually decoding lost languages once we figure out how to make a pretrained LLM.
@RedStainedInk5 ай бұрын
@@bornach The GPT-3 Paper goes into great detail how they try to avoid training data contamination like that, you can be sure they thought about that problem.
@mikeblake97615 ай бұрын
So glad to have another vid from 3B1B on this topic, the more I understand about AI the more awestruck I am by it.
@divandrey-u3q5 ай бұрын
Huge thanks for the work you are doing! You are one of the few channels that explain LLMs that well. I also like Andrej Karpathy's videos but yours are more for building intuition, which is also great and super helpful! I'm very curious though what was that thing with the green glass holograms
@pierretetreau74972 ай бұрын
This is gold, as it is the only thing I have seen that gives our brains both a proper overview and, at the same time, detailed information about GPTs...
@thedook-h4g2 ай бұрын
these videos make me feel both dumb and smart.
@awelshphoto4 ай бұрын
When I was giving a talk last year on how transformers worked... I envisioned something like this video in my mind. But of course it was amateur hour in comparison, both visually and in explanation. You are a true professional at this, Grant.
@daantromp51955 ай бұрын
I don't know if this question makes any sense, but after watching all these videos it suddenly popped into my mind: If we consider this higher-dimensional embedding space, in which each direction encodes a specific meaning, each vector in this space represents a certain distinct concept, right? (a vector 'meaning' man, woman, uncle, or aunt, as per the example at 3:44). If so, what would a vector consisting of only 0s represent? In other words, what central concept would be at the origin point of this higher-dimensional embedding space?
@avikmalladi21274 ай бұрын
Great question
@zw22495 ай бұрын
Great video! In case anybody is wondering how to count parameters of the Llama models, use the same math as in 16:47 but keep in mind that Llama has a third projection in its MLP, the 'Gate-projection', of the same size as the Up- or Down-projections.
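As a worked example, a back-of-the-envelope count for one Llama-style MLP block could look like this (the sizes are the commonly reported Llama-2-7B numbers, used here only for illustration):

    d_model, d_ffn, n_layers = 4096, 11008, 32     # commonly reported Llama-2-7B sizes

    # Three weight matrices per MLP block (gate, up, down projections), no biases:
    per_block = d_model * d_ffn + d_model * d_ffn + d_ffn * d_model
    print(per_block, per_block * n_layers)          # ~135M per block, ~4.3B across all layers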
@salmonsushi475 ай бұрын
Babe wake up 3b1b uploaded!
@ScilentE5 ай бұрын
Love your videos! I've been familiar with Transformers and LLMs, but the notion of superposition in high-dimensional space was new to me. Thanks for the knowledge! Cheers!
@timeflex5 ай бұрын
4:10 I wonder, what word sits in the center? What is [0, 0, 0, ..., 0] ?
@graysonking165 ай бұрын
Philosophy
@reocam8918Ай бұрын
he mentioned a Python lib called gensim; it wouldn't hurt to try
@angeldude101Ай бұрын
Probably complete uncertainty, where everything is equally impossible.
@philippeannet4 ай бұрын
Your videos are simply mind-blowing… the effectiveness with which you succeed in making us 'visualise' the mechanics of AI is truly unique! Keep up the good work 👍
@paullin1785 ай бұрын
This is why I pay for internet
@mohammadrabie72585 ай бұрын
I am not gonna talk about the video; it's obvious how good it is!! I wish there were an Oscar for video makers on YouTube; this channel would definitely be among the very top nominees!
@user-pb1ko8il3w5 ай бұрын
wake up babe, new 3b1b video dropped that will completely change your view on math and science and will provide unparalleled intuition for the same
@pro-indicators5 ай бұрын
An invaluable resource shared in this video, therefore requiring an infinite number of thank-yous!
@BHBima5 ай бұрын
The fact that the number of nearly perpendicular vectors increases exponentially as the dimension of the space increases is really interesting and also terrifying.
@hammerth14215 ай бұрын
Yep. Combining this with the scaling laws of computational power makes it scary to think about the capabilities of machine learning in the future.
@rainaldkoch90935 ай бұрын
Thanks!
@rich10514145 ай бұрын
This is precisely why hallucinations are such an issue with LLMs. They don't actually store facts, so hallucinations and facts aren't distinguishable.
@bornach5 ай бұрын
It also might explain the problem of "glitch tokens" in LLMs. A prompt could accidentally send a LLM into a direction for which the linear combination of superposition vectors it was trained on makes absolutely no sense.
@rich10514145 ай бұрын
@@bornach I imagine sometimes there are spurious tokens in pools of otherwise related tokens. If you give someone directions to the store, but you misinterpret one of the turns as right instead of left, you are going to end up in the wrong part of town. Humans, I imagine, would usually realize pretty quick something went wrong and would know they are lost. LLMs keep trucking along and vomit out whatever garbage happens to be near where they end up in the end.
@DarkStar6663 ай бұрын
Do you think your brain "stores facts" somewhere? Seems more like a matter of scale and architecture to me - LLMs aren't the end game, they are the first glimpse of a new era of understanding for us humans
@kitgary5 ай бұрын
You know what, after watching your video I was able to show off to my engineer colleagues; they were amazed by my knowledge! Thanks!
@DelandaBaudLacanian5 ай бұрын
Neel Nanda's mechanistic interpretability work gives me hope that these models aren't the "black boxes" many engineers pretend they are
@jordinne22015 ай бұрын
Neel Nanda is one of the people who thinks neural nets are black boxes (by default). Mechanistic interpretability is not solved, and I don't think any engineers are pretending this for any reason.
@neelnanda24695 ай бұрын
I think they're black boxes by default, we have not YET solved this problem (and may never do), but we're making real progress and I'm optimistic and would love to see more people pushing on this
@simon12425 ай бұрын
corrected by the man himself
@thefacethatstares5 ай бұрын
@@neelnanda2469 hi neel, i remember your super clear complex analysis revision lecture from uni. really happy to see you're making waves out there
@Virtuous_Rogue4 ай бұрын
Another high dimensional fact: If you pick a random point in a unit square, the probability that point is within 0.001 units of the border is 0.4%. In a 10,000 dimension hypercube, that probability becomes 99.999999%! Source: Hands on Machine Learning by Aurelien Geron Also, I've been teaching myself deep learning this week and your series cleared up some confusion I had about attention layers/units/whatever so thanks!
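A quick sanity check of those numbers (assuming "within 0.001 of the border" means within 0.001 of at least one face):

    p_inside_1d = 1 - 2 * 0.001          # one coordinate staying 0.001 away from both ends
    print(1 - p_inside_1d ** 2)          # unit square:        ~0.004   (0.4%)
    print(1 - p_inside_1d ** 10_000)     # 10,000-d hypercube: ~0.999999998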
@kylewood40015 ай бұрын
15:11 ITS REAL
@bijeshshrestha24505 ай бұрын
Can we get much higher
@IAcroniXI5 ай бұрын
THE ONE PIECE IS REAL
@joelhaggis505427 күн бұрын
YOU WANT MY TREASURE? YOU CAN HAVE IT! I LEFT EVERYTHING IN ONE PIECE
@achmadzidan59884 ай бұрын
Unbelievable that such a clear explanation of something this hard even exists. Very excited waiting for the next chapter!!
@martinstu84005 ай бұрын
"reasons" of neural networks will never be solved. Just as Stephen Wolfram said: "Asking 'why' a neural network does something is the same is asking 'why houses are made stones?' It's just because it's something that was available at the moment, lying around to be exploited in the computational universe"
@_Blazing_Inferno_5 ай бұрын
Yes, we won’t be able to figure out the fine details for why a neural network does exactly what it does. However, we can get a big picture or a group of medium-sized pictures. It’s similar to studying the actual human brain and why we do what we do: it’s really difficult since we can’t really start from the smallest details and learn the biggest ones, but we can work our way down to a certain extent, making hypotheses and testing them to figure out how correct our understanding is. Just because the task of understanding models or brains as a whole all the way from neurons to behavior is for all intents and purposes impossible, that doesn’t mean we can’t understand certain aspects of it.
@kyrylosovailo16903 ай бұрын
I love how this series explains the topic without riding the wave of hype. Stay that way!
@makebreakrepeat5 ай бұрын
Weird... my gpt is hallucinating that MJ played golf.
@AaronFleming-nj6vy5 ай бұрын
The walkthrough of the code was so helpful in understanding the dimensionality point. Putting your code into GPT ironically helped solidify my understanding of what you were saying. Thank you!
@chunlingjohnnyliu28895 ай бұрын
0:45 Wait we don't actually know how it works fully?
@Dom-zy1qy5 ай бұрын
We know how to create these networks, but we don't specifically know which combination of parameters of the network corresponds to a specific concept/output. We just feed it a bunch of data, then basically tell it to predict what might come after a specific textual sequence. This prediction is made based on many matrix multiplications and transformations. We then have a loss function, which grades the accuracy of the predictions. During training, we basically compute the gradient of this loss function with respect to all parameters of the model (the values in the matrices which we used to transform our input data). Because so many parameters (billions to trillions) are needed to make these predictions well, it's difficult to really know for sure which parameter(s) "correspond" to an idea or concept. You could in theory do all of this by hand without a computer, but it would take you an eternity. So we use computers, the consequence being that we end up not knowing what our network is "thinking".
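A highly simplified sketch of that loop, assuming a toy one-matrix "next token" model rather than anything resembling a real LLM (names and sizes are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, d = 50, 16
    E = rng.normal(0, 0.1, (vocab, d))        # token embeddings (kept fixed here for brevity)
    W = rng.normal(0, 0.1, (d, vocab))        # the parameters we actually train
    pairs = [(rng.integers(vocab), rng.integers(vocab)) for _ in range(200)]   # (token, next token)

    for _ in range(100):
        grad = np.zeros_like(W)
        for tok, nxt in pairs:
            logits = E[tok] @ W
            p = np.exp(logits - logits.max()); p /= p.sum()   # softmax prediction
            p[nxt] -= 1.0                                     # gradient of cross-entropy wrt logits
            grad += np.outer(E[tok], p)
        W -= 0.5 * grad / len(pairs)          # nudge the parameters against the gradient

The loss here is cross-entropy on the predicted next token, and each update is exactly the "compute the gradient with respect to all parameters, then adjust" step described above.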
@chunlingjohnnyliu28895 ай бұрын
@@Dom-zy1qy thanks for the clarification 🙏🏻🙏🏻
@ernststravoblofeld4 ай бұрын
It's not just that we don't know what the weights end up representing. We don't know which of the dozens of ways a matrix or a vector can represent data the model is using at any part of the process.
@passivehouseaustralia44064 ай бұрын
This is the reason for the often-used "black box" description... You can train the network and understand its performance... but what the weights on the inside do is the "black", unknown bit.
@jellovendigar14 күн бұрын
Thank you so much. The number of embedding dimensions being somewhat limited compared to the gigantic number of possible concepts was bugging me a lot. I intuitively thought the embedding space could not store more concepts that are perpendicular to each other than the embedding space dimension. The exponential increase that you get by just relaxing the exact 90 degrees constraint a little is a big surprise for me and really counter-intuitive. I love your content man, many many thanks
@anispinner5 ай бұрын
Facts.
@josephkclee3 ай бұрын
Thank you!
@samuelgunter4 ай бұрын
15:10 the one piece is real
@user-rm2qj2jh4l5 ай бұрын
This is such a good series, thank you so much!!! Have been waiting at the edge of my seat since April and this video was definitely worth the wait! Thank you for such high-quality, rigorous yet still intuitive content!
@rianantony5 ай бұрын
Why'd they have to name it MLP
@flakmoppen5 ай бұрын
This could be me just living in a bubble, but why wouldn't they name it MLP?
@noaht25 ай бұрын
@@flakmoppen My Little Pony
@flakmoppen5 ай бұрын
@@noaht2 ah. Gotcha.
@Seerinx5 ай бұрын
they were invented ~20 years before my little pony existed (1960s vs 1980s)
@shameekm21465 ай бұрын
Thank you so much for this wonderful video. I am going to watch this multiple times to keep refreshing these topics as long as transformers are the epitome architecture in NLP.
@N0Xa880iUL5 ай бұрын
The quality of 3b1b has declined. The videos keep the same pace (maybe even faster) despite an increase in difficulty, effectively shrinking the viewer base who'd enjoy the process. Basically I'm admitting that the content has progressed beyond me, even though earlier videos were something I could understand. This stuff must be quite challenging.
@Gabriel-tp3vc5 ай бұрын
This is state of the art, bleeding edge, and pretty recent advances, that even the brightest AI researchers didn't understand less than a decade ago. The fact that he can even explain it at all to the public this well is IMHO quite impressive. But yes, it is sometimes a good idea to watch such videos more than once and take a good night's sleep in between 🙂 I mean, no amount of billions of dollars and professors could get you this knowledge a decade ago. Just for perspective.
@N0Xa880iUL5 ай бұрын
@@Gabriel-tp3vc I appreciate it, no doubt.
@jerryanyu8467Ай бұрын
Thank you for the amazing video! We are looking forward to the next one, please!
@makinamiura47083 ай бұрын
It is one of the most striking moments in my life, knowing that the seemingly pure mathematical idea of loosening the notion of "perpendicular" a bit turns out to be a reason for the efficiency of information in how the universe is made and perceived. Life is awe.
@sedthh5 ай бұрын
wow, awesome as always! You can interpret bias + ReLU as the denoising part, which discards all information below a certain threshold and prevents the neuron from firing and passing information on to the subsequent layers
@GustavoMunoz28 күн бұрын
You are a genius. Thanks a lot. Very generous, thanks. Hopefully you can create the following chapters soon 🙂