How ChatGPT is Trained
13:43
1 year ago
What are Diffusion Models?
15:28
2 years ago
What are Transformer Neural Networks?
16:44
Why Do Random Walks Get Lost in 3D?
14:57
What is Automatic Differentiation?
14:25
What are Normalizing Flows?
12:31
4 years ago
Comments
@CristianGutierrez-th1jx
@CristianGutierrez-th1jx 10 days ago
Hands down the best intro to gen models one could ever have.
@tom-sz
@tom-sz 11 days ago
Great video! Where can I learn more about the rounding and truncation errors plot at 2:06? I need to make an analysis of these errors for a project. Thanks :)
@stormzrift4575
@stormzrift4575 11 days ago
Amazing
@TyrionLannister-zz7qb
@TyrionLannister-zz7qb 27 days ago
Are the animations and soundtrack inspired by a channel named 3Blue1Brown?
@Terrial-tf7us
@Terrial-tf7us 1 month ago
You are amazing at explaining this concept in such a simple and understandable manner, mate.
@DamianReloaded
@DamianReloaded 1 month ago
I've been toying with this idea for a while. It would be great to have a language model to which you could give an answer as a prompt, and it would output the deduction leading to that answer, working in reverse. My wild guess is that the domain of things that can lead to something is greater than the domain of things that are the outcome of a group of things. All roads lead to Rome. Causality?
@GordonWade-kw2gj
@GordonWade-kw2gj 1 month ago
Wonderful video. The detailed example helps tremendously. And I think there's an error: at t=6.24, since $v_6 = v_5\times v_4$, shouldn't there be a plus sign in $\dot{v}_6$ where you've got a minus sign?
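For reference, applying the product rule to the relationship as stated in this comment (assuming $v_6 = v_5 \times v_4$) does give a plus sign; a minus sign would instead correspond to a quotient:

```latex
v_6 = v_5\, v_4 \;\Rightarrow\; \dot{v}_6 = \dot{v}_5\, v_4 + v_5\, \dot{v}_4,
\qquad\text{whereas}\qquad
v_6 = \frac{v_5}{v_4} \;\Rightarrow\; \dot{v}_6 = \frac{\dot{v}_5\, v_4 - v_5\, \dot{v}_4}{v_4^2}.
```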
@Billionaire-Odyssey
@Billionaire-Odyssey 1 month ago
Very valuable content, explained with clarity. I wonder why your channel hasn't exploded yet. You've earned a new sub; please continue making videos on such topics.
@fugufish247
@fugufish247 1 month ago
Fantastic explanation!
@WilsonSolofoniaina-dn6ky
@WilsonSolofoniaina-dn6ky 1 month ago
Shingle Magic better than RoofMaxx
@aaronyu2660
@aaronyu2660 1 month ago
I'm not fully sure what you mean by learning backwards in the context of this video, but if I'm understanding it correctly, my guess is that languages which put conjugations at the end of the word (i.e. suffixes) make language learning asymmetric. Because of that, when programmers tokenize the text, it's easier to say "ing" comes after "learn" (predicting forward) than to say "learn" comes before "ing" (predicting backwards). That being said, obviously many root words can come before "ing." On the contrary, there are only so many possibilities that come after "learn". This is one of the asymmetries in the semantic sequencing behind Latin languages, and since Transformers learn and predict from one end to the other, this seems like a plausible reason why GPTs act this way. I also took a quick glance at the paper and saw that they did address tokenization, but I'm not sure whether the method they used addresses what I'm describing.
@mesdamessoleils9869
@mesdamessoleils9869 1 month ago
This explanation is not correct. It is indeed more likely that 'ing' comes after 'learn' than that 'learn' comes before 'ing'; however, the probability of seeing 'ing' at the end of a word is higher than that of seeing 'learn' at the beginning of a word, and if you compute the probabilities one way or the other, you get exactly the same result. So it's not about the probabilities in a language being higher one way or the other (they are *exactly the same* when you multiply out the conditional probabilities); it's about them being harder to learn (or to compute) in one direction. Also, this appears not to be so much about the syntactic structure of the language, since the effect is universal across languages and is mostly apparent when the context window gets large, at which scale the order of the words within a sentence becomes less relevant (the window spans quite a few sentences).
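A worked two-token version of this point (my own illustration, not from the video): the chain rule factors the same joint probability in either direction, so the products of conditionals must agree even when the individual conditionals differ.

```latex
P(w_1, w_2) \;=\; P(w_1)\,P(w_2 \mid w_1) \;=\; P(w_2)\,P(w_1 \mid w_2),
\qquad\text{e.g.}\qquad
P(\text{learn})\,P(\text{ing} \mid \text{learn}) \;=\; P(\text{ing})\,P(\text{learn} \mid \text{ing}).
```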
@aaronyu2660
@aaronyu2660 1 month ago
@mesdamessoleils9869 I'm not quite sure how you computed the probabilities, or why you used that example to make the point, as it might not even be relevant to how a transformer works. Just curious, how much experience do you have with how transformers work inside and out? I myself only learned this recently and I'm not 100% clear on it, nor am I sure of the exact model or setup they used (if I had access to the source code I could probably figure this out rather than just theorize). However, judging by what I know, the model trains on and predicts a token based on the previous tokens passed into it. So even if I know that "ing" happens more often at the end of a word, I would still have more trouble predicting the root word ("learn") behind it than predicting that "ing" comes after "learn". Just imagine it in terms of possible branches: obviously, the more likely words there are to choose from, the lower the probability of predicting the next word correctly. Take the example sentence "I like learning math from my teacher"; the standard tokens are probably "I", "like", "learn", "ing", "math", "from", "my", "teacher". Also assume the model is trained on mostly grammatically correct examples. Once my model has generated up to "I like learn", the next token that keeps things grammatical is easily going to be "ing" (or something of similarly high affinity in a different example). However, going backwards, if I have predicted "math from my teacher", I'm split between whether the previous token should be a suffix ("ing") or a root ("learn"), because each of them sounds grammatically plausible. Predicting backwards with a larger trained model, I could reasonably get either "I like to learn math from my teacher" or "I like learning math from my teacher". As you can see, there are two plausible options when predicting backwards, which basically splits the probability of a confident prediction in two, while going forwards it is much easier to complete the puzzle simply by following the context clues (which is what GPTs are basically made for). I might not have exhausted all the cases, but you can probably see what I mean. From that standpoint, it's generally easier to predict forwards than backwards, as you can see with English, and you can see this even more when you look at English syntax: generally speaking, a good deal of English sentences start with a noun (a subject, or a pronoun such as "I"), while English sentences can end in a plethora of ways (nouns, verbs, words with suffixes that act as tokens of their own), making the number of plausible choices that much larger when trying to predict backwards. This mass of choices lowers the probability (i.e. the certainty) of a confident right answer, hence unequal probabilities forwards and backwards. Moreover, I think the video is making much the same point through various abstract examples, such as at timestamp 6:36: he says it's much harder to predict the factors of a product than to predict the product from the two factors, which completely makes sense. In essence, the models struggle to learn backwards probably because it IS in fact harder to learn/solve backwards, and the metric that shows it is the unequal joint probabilities of the sentence forwards and backwards.
I looked into the paper and video and found that the loss difference IS in fact calculated from the joint probabilities, yet they are not the same. Moreover, the conditional probabilities forwards and backwards are most likely different, given that they are completely different conditional probabilities (NOT EQUAL), and they likely behave the way I described. As for the context size affecting the learning difference: I believe that with a smaller context size (as mentioned in the video) the model spits out equally terrible predictions forwards and backwards, making the loss equally bad in either direction. However, increasing the context size lets the model understand the context a lot more and hit more road bumps along the way, possibly adding up to the bigger disparity between the losses/probabilities. I wish they had tested more languages before claiming it's universal, as we might find a language that is easier to learn backwards than forwards, but I can't think of a language with a sentence structure unusual enough to do so.
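A small numerical sketch of the point both commenters are circling (my own toy example, not from the paper or video): with several roots all followed by "ing", the joint probability of each sequence is identical whichever way you factor it, yet the per-step uncertainty, and hence how hard each prediction step is, differs sharply between the two directions.

```python
import math

# Toy corpus: five two-token sequences, each a root followed by "ing",
# all equally likely. (Hypothetical numbers purely for illustration.)
roots = ["learn", "walk", "eat", "sing", "read"]
joint = {(r, "ing"): 1.0 / len(roots) for r in roots}

# Forward factorization: P(root) * P("ing" | root)
p_root = {r: 1.0 / len(roots) for r in roots}
p_ing_given_root = {r: 1.0 for r in roots}               # the suffix is certain

# Backward factorization: P("ing") * P(root | "ing")
p_ing = 1.0
p_root_given_ing = {r: 1.0 / len(roots) for r in roots}  # the root is uncertain

for r in roots:
    forward = p_root[r] * p_ing_given_root[r]
    backward = p_ing * p_root_given_ing[r]
    assert abs(forward - backward) < 1e-12               # same joint probability
    assert abs(forward - joint[(r, "ing")]) < 1e-12

# Per-step uncertainty differs: 0 bits forward vs log2(5) bits backward.
print("forward  H(ing | root) =", 0.0, "bits")
print("backward H(root | ing) =", math.log2(len(roots)), "bits")
```

The total uncertainty over the whole sequence is the same in both directions (log2(5) bits here); whether a trained model actually reaches that optimum equally well in both directions is the empirical question the paper studies.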
@sethjchandler
@sethjchandler 1 month ago
Great job. Going to show this to my class (Large Language Models for Lawyers, University of Houston Law Center)
@CEBMANURBHAVARYA
@CEBMANURBHAVARYA 1 month ago
Nicely explained, thanks
@GarethDavidson
@GarethDavidson 1 month ago
This is really interesting. I also assumed you just end up with a map and it doesn't matter what order you learn it in. But I guess the sequence is important: a model that learns in the forward order is aligned with both 1) sentences making sense to a listener because of previously given information, and 2) learning to generate sequences the way a system of thinking in words does, forwards in time. Both of those are tied to human brains, which likely have optimizations that facilitate better sentence construction and understanding, so a model learning that way would just have a natural alignment. Video frames generated backwards are likely a mathematically identical problem but perceptually different for similar reasons. Going from broken glass all over the floor to an intact glass in an implausible location is noticeable; an intact glass going to broken glass in odd locations is not. They might be the same hard problem, but the forwards condition is much easier to fake; if you start with something constrained then things can fan out naturally. I guess rather than 20/20 and blind, hindsight is something like `sight/(hind^2)` while foresight is something like `sight/fore!`, and if you flip that around you have the problem in the other direction as the distance and number of insights increase.
@tinkerbrains
@tinkerbrains 1 month ago
Nice explanation!
@stazizov
@stazizov 1 month ago
Hello everyone from 2024, it seems the flow-matching hype has begun
@anchyzas
@anchyzas 2 months ago
I also feel the residual connection is definitely RNN inspired.
@alfcnz
@alfcnz 2 months ago
@Ari, this is really great! 🤩🤩🤩
@ariseffai
@ariseffai 2 months ago
Thanks Alfredo!
@ejkmovies594
@ejkmovies594 2 months ago
Giving me 3Blue1Brown vibes. Amazing video.
@samllanwarne6512
@samllanwarne6512 2 months ago
Great video imagery, good job. Love seeing the flow of information, and the little opinions about why things are happening. It got a bit confusing when it moved away from images to pure equations; could you pair images and boxes with the equations?
@shashanks.k855
@shashanks.k855 2 months ago
Glad you're back!
@ErikWessel
@ErikWessel 2 months ago
There’s an argument that all spoken human languages convey information (in the, like, Shannon sense) at the same rate. If this is roughly true, you can take the words per minute average for a spoken language, and use it as a proxy for information per word. According to my googling, spoken French averages more words per minute than English, so this implies that English may convey more information per word. Then, more salient pieces of information in a block of text would be spread over more tokens in French than in English. Therefore, we would expect sequence reversal to impede French more dramatically than English, since English words contain more intrinsic information which is then shielded from reversal. Have the authors (or anyone else) explored this angle, and does it hold up for other languages?
@ariseffai
@ariseffai 2 months ago
Very nice connection. I wonder how tokenization would affect this analysis. According to Sec. 2.1.1, the authors train a BPE tokenizer from scratch for each language individually. So even if French has a higher rate of spoken words per minute than English, perhaps the relevant statistic here would be "tokens per minute", since the reversal happens at the token level. For example, if it were the case that tokens per minute for the two languages ends up being similar, this would weaken this theory. These details are not examined in the paper but would definitely be interesting to see!
@ErikWessel
@ErikWessel 2 months ago
@@ariseffai Good point! I guess the average number of tokens per word is going to come down to how long French words tend to be compared to English words, and how repetitive the character combinations are, since they used the same total number of tokens when encoding both languages. Could definitely cancel out this effect, or even be the dominant cause of the difference itself!
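One cheap way to poke at the tokens-per-word question is to count tokens for parallel sentences with an off-the-shelf tokenizer. A minimal sketch, assuming the `tiktoken` package and my own hypothetical sentence pairs; note the paper trains a separate BPE tokenizer per language and cl100k_base is English-heavy, so this is only suggestive:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical English/French translation pairs, purely for illustration.
pairs = [
    ("The cat is sitting on the table.", "Le chat est assis sur la table."),
    ("I would like to learn mathematics.", "Je voudrais apprendre les mathématiques."),
]

for en, fr in pairs:
    en_tpw = len(enc.encode(en)) / len(en.split())   # tokens per word, English
    fr_tpw = len(enc.encode(fr)) / len(fr.split())   # tokens per word, French
    print(f"EN {en_tpw:.2f} tokens/word | FR {fr_tpw:.2f} tokens/word")
```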
@MS-cs7gt
@MS-cs7gt 2 months ago
Try Sanskrit
@lorea4749
@lorea4749 2 months ago
Please hear me out: language can reduce entropy by an infinite amount. When we say "table" we convey the idea of a legged object with a planar surface that can hold things, with infinitely many variations of any of its components, yet we all recognize the object when we see it. This way of communicating reduces uncertainty "forward" by the infinite number of variations the concept can hold. But which one of the variations the concept contains is "backward"?
@chalkchalkson5639
@chalkchalkson5639 2 months ago
I think it's interesting that they limited themselves to reversible mappings in the formal-languages section. In classical thermodynamics the arrow of time comes from the observables carrying less information about the exact microstate, with only the microstates following reversible dynamics. If we want an analogy to thermodynamics, the words and sentences of a language would be the observables of the complex act of communication, with only the entire system (including my thoughts, goals, etc.) following a reversible time evolution. But that still leaves the question: why would the average information of a specific microstate given a macrostate increase in natural language? There are some examples where that seems obvious: a joke's setup probably implies the punchline, but not vice versa, and the same goes for the proof of a theorem. But it's just as easy to find examples in the opposite direction: the last chapter of a detective novel probably predicts the rest of it better than vice versa.
@WhyZedIsUnavailable
@WhyZedIsUnavailable 2 months ago
Thank you for a very clear expansion of the paper! But I'm not convinced by the paper itself. After recently watching Andrej Karpathy's tutorial on tokenizers, IMO it's expected that when you take a highly tuned system and significantly change one part, performance drops, because the other pieces are no longer tuned for that task. IMO the paper would be more convincing if it trained on characters with no tokenizer, rather than on "forward" tokens in reverse.
@frotaur
@frotaur 2 months ago
We tried that as well! For one dataset, we trained the BPE tokenizer backwards, then ran the experiment again with this tokenizer. If the asymmetry came from the tokenizer, you would expect the backward model to now do better, but it doesn't! In fact, the loss curves for the 'reverse BPE' and 'forward BPE' tokenizers look almost identical. This means the arrow-of-time effect is not a result of the way you tokenize, but really a feature of the dataset. (BTW, we thought about character-level tokenizers; the problem is that with those it's very costly to train models with attention spanning more than a few sentences, since a single sentence is already hundreds of tokens and compute scales quadratically with attention length.)
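For readers who want to picture that control, a minimal sketch of what "training the BPE tokenizer backwards" could look like, assuming the Hugging Face `tokenizers` package; the corpus and vocabulary size are placeholders, and the paper's actual pipeline may well differ:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = ["the quick brown fox jumps over the lazy dog"]  # placeholder corpus

def train_bpe(texts, vocab_size=300):
    # Train a plain BPE tokenizer on the given texts.
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    tok.train_from_iterator(texts, BpeTrainer(vocab_size=vocab_size,
                                              special_tokens=["[UNK]"]))
    return tok

fwd_tok = train_bpe(corpus)                           # "forward" BPE
bwd_tok = train_bpe([t[::-1] for t in corpus])        # BPE trained on char-reversed text

text = corpus[0]
reversed_fwd_tokens = fwd_tok.encode(text).ids[::-1]  # scheme 1: reverse forward tokens
reverse_bpe_tokens = bwd_tok.encode(text[::-1]).ids   # scheme 2: tokenize reversed text
```

Per the reply above, the backward loss curves reportedly look nearly identical under either scheme.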
@anisingh5437
@anisingh5437 2 months ago
Ari, you have an incredible ability to explain stuff.
@ariseffai
@ariseffai 2 months ago
Thanks! Appreciate the kind words.
@NoNameAtAll2
@NoNameAtAll2 2 months ago
Why reverse the tokens instead of reversing words? Or even letters, to make new tokens?
@Dart_ilder
@Dart_ilder 2 months ago
I work with proteins, and they are very interesting because they have a linear sequential structure and very complex 3D relations after folding. It is becoming obvious that not every structure of information can be learned via GPT. But I suppose we will just see more adapters, as in "Let My Graph Do the Talking" or OmniFusion, for new data modalities. Although, on second thought, even though text has a mostly linear dependency, we think more sporadically. Maybe we will see other architectures acting as "thought hubs". Reminds me a lot of "Arrival".
@SirajRaval
@SirajRaval 2 months ago
best AI tutorial of the year🔥
@necbranduc
@necbranduc 2 months ago
r u 4 r3aL?
@ariseffai
@ariseffai 2 months ago
Not quite sure about that haha
@user-zz6fk8bc8u
@user-zz6fk8bc8u 2 months ago
My guess is that in video prediction it would depend a lot on the video. For example, a video of a ball being thrown around, or of a pendulum, is easily predictable even when time is reversed, but think about a Bob Ross video watched backwards, as he un-mixes color. A good guess is probably that he mixed some color with white, but un-mixing two colors is extremely hard to predict.
@ariseffai
@ariseffai 2 months ago
Neat. Although one might also argue it should be easier in reverse because you can assume you will eventually reach a blank white canvas at the beginning of the video. But going forward, when you start with the blank white canvas, it is difficult to predict the exact scene that will be painted :)
@user-zz6fk8bc8u
@user-zz6fk8bc8u 2 months ago
It feels obvious. Language has a lot of structure that back-references, like pronouns and such, which is straightforward in one direction but practically impossible to reverse.
@ariseffai
@ariseffai 2 months ago
Interesting! For me it was not a priori obvious. I like the sparsity argument from the paper as a potential mechanism, but I'd love to see concretely how one can prove language is "sparser" in the forward direction. The pronoun example has potential, but as @grapesurgeon mentions, there can often be forward references as well.
@user-zz6fk8bc8u
@user-zz6fk8bc8u 2 months ago
@grapesurgeon (Almost) all natural languages developed as spoken languages. It's much harder for the brain to build sentences that reference something you will say three sentences later, because it would require you to already "build" the next three sentences in your head while you say the current one. Let's say you are currently "building" and saying sentence A, and next come B, C, D, etc.: for A to reference B and C you'd also need to have them ready in your brain, and thus B and C can't reference D, E, F, or they would also influence how A is structured. There are some "high level" thoughts that do this, like "I'll tell you the details about that later...", but to say that you don't have to know the exact sentences you'll say later. Language itself (at least it feels obvious to me; I can't back any of this up) looks like it has to have that structure, and I assume the same holds even for an advanced superhuman intelligence, just with a larger time frame/window. I think it boils down to this: predicting the future is harder than storing/remembering the past, so it's simpler to reference something from the "exact" past than from the "exact" future, because it's so hard to predict the "exact" future.
@DavodAta
@DavodAta 2 months ago
We turn off the music when we talk
@MDNQ-ud1ty
@MDNQ-ud1ty 2 months ago
I think the way you explained the probability relationships is a bit poor. For example, p_t(x) = p_t(f_t^(-1)(x)) would imply the obvious desire for f_t to be the identity map. If x is a different r.v. then there is no reason one would make such a claim. The entire point is that the r.v.'s may have different probabilities due to the map (which may not even be injective), so one has to scale the r.v.'s probabilities, which is where the Jacobian comes in (as would a sum over the different branches). It would have been better to start with two different r.v.'s and show how one could transform one into the other and the issues that might creep in; that is how one would normally try to solve the problem from first principles. The way you set it up leaves a lot to be desired: while two r.v.'s can easily take the same value, they can have totally different probabilities, which is the entire point of comparing them in this way. I don't know who would start off thinking two arbitrary r.v.'s have the same probabilities, and sort of implying that and then saying "oh wait, sike!" isn't really a good way to teach it.
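For reference, the standard change-of-variables formula under discussion: if $z \sim p_Z$ and $x = f(z)$ with $f$ invertible and differentiable, the density of $x$ is not $p_Z(f^{-1}(x))$ alone but carries the Jacobian correction (and for non-injective maps one additionally sums over the preimage branches):

```latex
p_X(x) \;=\; p_Z\!\left(f^{-1}(x)\right)\,\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|
\;=\; p_Z(z)\,\left|\det \frac{\partial f(z)}{\partial z}\right|^{-1}
\quad\text{with } z = f^{-1}(x).
```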
@valentinakaramazova1007
@valentinakaramazova1007 3 months ago
Extremely good video, not shying away from the math. More like this is needed.
@seahammer303
@seahammer303 3 months ago
As a beginner I understood nothing. Change the title, because these are not the basics.
@SuperDonalByrne
@SuperDonalByrne 4 months ago
Great video!
@BR-hi6yt
@BR-hi6yt 4 months ago
Sounds like a legal document "which are implemented as linear transformations of the embeddings" - yeah, thanks a lot. Now, back to your autists special needs class.
@lord_of_mysteries
@lord_of_mysteries 4 months ago
It uses a diffusion model... look it up, and you'll then realize why it can't inspect images.
@user-xh9pu2wj6b
@user-xh9pu2wj6b 1 month ago
what are you even talking about?
@enes_duran
@enes_duran 4 months ago
Great work, really. There is a mistake in the equation on the rightmost side at 12:46.
@dullyvampir83
@dullyvampir83 4 months ago
Great video, thank you! Just a question: you said a main problem with symbolic differentiation is that no control-flow operations can be part of the function. Is that in any way different for automatic differentiation?
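Handling control flow is in fact one of the usual selling points of automatic differentiation, since it differentiates whatever branch the program actually executes (with the caveat that the result can be undefined exactly at a branch boundary). A minimal forward-mode sketch with dual numbers, my own illustration rather than code from the video:

```python
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # function value
    dot: float   # derivative with respect to the input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # product rule carried along with the value
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(x: Dual) -> Dual:
    # piecewise function with ordinary Python control flow:
    # x^2 for x < 2, else x^2 + x
    y = x * x
    if x.val < 2.0:
        return y
    return y + x

print(f(Dual(3.0, 1.0)))  # Dual(val=12.0, dot=7.0): d/dx(x^2 + x) = 2x + 1 = 7 at x = 3
```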
@1PercentPure
@1PercentPure 5 months ago
king
@karigucio
@karigucio 5 months ago
Why do we sum the positional and input embeddings? Wouldn't concatenating make more sense? How would that play with the dimensions?
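A quick shape-level sketch of the trade-off (my own illustration; the dimensions are arbitrary): summing keeps the model width fixed, while concatenating would widen every downstream weight matrix or require an extra projection, which is one common argument for summing.

```python
import torch

seq_len, d_model = 16, 512
tok = torch.randn(seq_len, d_model)     # token embeddings
pos = torch.randn(seq_len, d_model)     # positional embeddings

summed = tok + pos                      # (16, 512): width unchanged, so the
                                        # attention/MLP weights keep their shape
concat = torch.cat([tok, pos], dim=-1)  # (16, 1024): every downstream layer would
                                        # need to be twice as wide, or a projection
                                        # back down to d_model would be needed
print(summed.shape, concat.shape)
```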
@pmemoli9299
@pmemoli9299 5 months ago
Badass
@HH-mf8qz
@HH-mf8qz 5 months ago
This is an amazing channel: very instructive, structured, and easy to understand. Instant sub, and I hope you make more videos over the coming months and years.
@HH-mf8qz
@HH-mf8qz 5 months ago
Very good video. Could you maybe make an updated version now that ChatGPT-4 is released and the new Google Gemini is about to come out, covering mixed-input AIs?
@alexistorstenson
@alexistorstenson 5 months ago
Super cool video! I'm watching this in the middle of the night after my math-major friend told me about his research project involving random walks. I've been absolutely fascinated by the concept since I heard of it, and this does a great job of breaking it down!
@oualidzari2176
@oualidzari2176 5 months ago
High-quality video! Thank you!
@khaledsakkaamini4743
@khaledsakkaamini4743 5 months ago
Great video, thank you Ari.
@miguelcampos867
@miguelcampos867 5 months ago
Great video!
@ollllj
@ollllj 5 months ago
On expression swell: one of my proudest computations (and hardest-to-debug pieces of code) is the automatic differentiation of the 3rd derivative of the general quotient rule within [shadertoy ... /WdGfRw ReTrAdUi39], with identical parts already pre-multiplied out according to how often they repeat. WebGL code:
struct d000 { float a; float b; float c; float d; }; // domains t, dt, dt², dt³ -- sure, this could just be a vec4, but I REALLY needed my custom labels for debugging
// autodiff of division, up to 3 derivatives (up to 3 iterations of the quotient rule within the chain rule)
d000 di(d000 a, d000 b) {
  return d000(
    // 0th derivative: simple division
    a.a/b.a,
    // 1st derivative (dx): quotient rule
    (a.b*b.a - a.a*b.b)/(b.a*b.a),
    // 2nd derivative (dxdx)
    ((a.c*b.a + a.b*b.b - a.b*b.b - a.a*b.c)*(b.a*b.a) - 2.*(a.b*b.a - a.a*b.b)*(b.a*b.b))/(b.a*b.a*b.a*b.a),
    // 3rd derivative (dxdxdx) -- the quotient rule sure is something
    ((((a.d*b.a + a.c*b.b + a.c*b.b + a.b*b.c - a.c*b.b - a.b*b.c - a.b*b.c - a.a*b.d)*(b.a*b.a)
        + (a.c*b.a + a.b*b.b - a.b*b.b - a.a*b.c)*(b.b*b.a*b.a*b.b))
      + (-2.*(a.c*b.a + a.b*b.b - a.b*b.b - a.a*b.c)*(b.a*b.b)
        + (a.b*b.a - a.a*b.b)*(b.b*b.b + b.a*b.c)))*(b.a*b.a*b.a*b.a)
      - ((a.c*b.a + a.b*b.b - a.b*b.b - a.a*b.c)*(b.a*b.a)
        - 2.*(a.b*b.a - a.a*b.b)*(b.a*b.b))*4.*(b.b*b.a*b.a*b.a))
    /(b.a*b.a*b.a*b.a*b.a*b.a*b.a*b.a));
}