Mission: Impossible language models - Paper Explained [ACL 2024 recording]

7,841 views

AI Coffee Break with Letitia

1 day ago

Comments: 97
@AICoffeeBreak 2 months ago
The authors did not test random tokens occurring at random, but random word **shuffles** (sorry if this was unclear in the video; I thought I made it clear with the example in the bottom right corner at 02:09): they took a sentence (a semantically grouped sequence of words) and shuffled the order (as shown in Figure 1, e.g., at minute 02:32). That it is just random shuffles, and not random words from the entire vocabulary, explains why the perplexity decreases at all during learning (see the figure at 09:17). It's not just any word that could appear in the sentence, but rather the words of the sentence appearing in a random order. For example, if you shuffled "the cat sat on the mat," the LLM might have to complete something like "on mat cat". Here, "crypto" would be a much stranger continuation than "sat". This is why the LLM "learns" something, lowering perplexity during training, even though it plateaus fairly quickly.

The random word shuffles are the most extreme case. The authors also include less extreme cases on the spectrum, such as the count-based grammars. Those are harder to learn for humans (some would say impossible, I'm not sure), and like humans, LLMs too show higher perplexity on those. Then there is reversed word order, which is still quite structured compared to random word shuffles, and again, LLMs have a harder time learning it and the perplexity plateaus are higher than for normal English.

In conclusion, LLMs (like humans) make a difference between natural, harder, and impossible languages: one can distinguish a naturally trained LLM from an impossible language model with just a perplexity classifier.
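To make the distinction concrete, here is a tiny sketch (an illustration for this thread, not the authors' code) contrasting a shuffle of a sentence's own words with words drawn at random from the whole vocabulary; the toy `vocabulary` list is made up:

```python
import random

sentence = "the cat sat on the mat".split()
vocabulary = ["crypto", "the", "cat", "sat", "on", "mat", "quantum", "banana"]  # toy example

rng = random.Random(0)

# What the paper's shuffle languages do: the sentence's own words, in random order.
shuffled = sentence[:]
rng.shuffle(shuffled)
print("shuffled sentence:     ", " ".join(shuffled))

# What was NOT tested: arbitrary words sampled from the entire vocabulary.
random_draw = [rng.choice(vocabulary) for _ in sentence]
print("random vocabulary draw:", " ".join(random_draw))
```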
@bediosoro7786 2 months ago
If you write a sentence in a random order and give it to GPT-4.0, it will likely reconstruct the correct version for you. This suggests that if you train the model on a specific pattern, it can distinguish that pattern from others. My question is: when they shuffle the input during training, do they also shuffle the ground truth? I still find myself agreeing with Chomsky's perspective, even though this is my first time encountering his claim. I don't want to dive into the code, but I would appreciate it if someone could explain the training pipeline. Does it mean that in each iteration the input is shuffled and the model is supposed to predict the shuffled version?

If that's the case, the model might plateau after a certain number of iterations, because the dataset size effectively increases with token shuffling. This would imply that the model size needs to increase accordingly. If the loss didn't decrease from the beginning, I could accept that, but if it decreases at first and then stagnates, it could be a model-capacity issue. The model might need to see all possible combinations. It's like placing values into a limited number of slots: when you exceed the model's capacity, the loss stops decreasing and may even start increasing.
@AICoffeeBreak 2 months ago
@bediosoro7786 The completely random shuffling is just one setting on the impossibility spectrum. They also have deterministic shuffling, where all sequences of length k are shuffled in the same order. This increases the "size of the possible training data" less than completely random shuffling does. If random sequences are too impossible (which they are), look at the reversed sequences, or at the counting grammars. There, LLMs again struggle more when trained on such data, while the number of possible training points does not increase at all.

As for your question about data creation: they train the LLM from scratch only on data to which they applied the perturbation function that makes the structure more complicated, and they track how well the model can learn such a language. I agree with you that experiments varying the model size would have been interesting: maybe not just GPT-2 small, but medium and large too, to see some scaling laws.
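Roughly, the data-creation step can be pictured like this (a minimal sketch under my reading of the paper, with made-up function names, not the authors' code). The key point for the ground-truth question: the perturbed text is the only text the model ever sees, so inputs and next-token targets are simply shifted views of the same perturbed sequence.

```python
import random

def shuffle_sentence(sentence: str, rng: random.Random) -> str:
    """Impossible-language perturbation: randomize the word order of one sentence."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

def build_training_corpus(corpus: list[str], seed: int = 0) -> list[str]:
    """Perturb every sentence; the original ordering is never used as a target."""
    rng = random.Random(seed)
    return [shuffle_sentence(s, rng) for s in corpus]

perturbed = build_training_corpus(["the cat sat on the mat", "dogs chase cats"])
# A causal LM trained on `perturbed` predicts token t+1 of the perturbed
# sequence from tokens 1..t of that same perturbed sequence.
print(perturbed)
```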
@gettingdatasciencedone 2 months ago
Thanks for the clarification. I think there are some subtle issues with using random ordering in these experiments. It is true that a random-ordering grammar would be impossible to learn (when judged by your ability to reproduce the exact ordering of words, hence measured with perplexity). But as another commenter points out below, all sentences that use the right set of words (irrespective of order) are actually grammatically correct for this language, so perplexity is the wrong metric for judging whether a model has learned the language.

In addition, human beings often encounter speakers who use some form of random ordering in how they speak. Anyone who works with non-native speakers of their language will find a wide variety of unexpected orderings in conversation. And yet we are still able to communicate effectively, because correct grammar is not essential for communication.

So it seems like the use of randomisation as the ultimate impossible language is off track. It is almost perfectly designed to perform poorly on the perplexity metric, which is the wrong metric for measuring performance in learning such a language. And partial randomness in word ordering is something that human beings have learned to deal with routinely, so it isn't a good model of 'the most impossible grammar.'
@AICoffeeBreak 2 months ago
@gettingdatasciencedone Yes, as mentioned in the video, the perplexity baseline naturally rises, and they did not control for it in the random shuffling experiments. Thanks for laying it out in even more detail. I would again like to point out that the random shuffling experiments are not the only ones: the impossibility spectrum also includes reversed ordering and counting grammars.
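To put a number on that rising baseline: under uniform shuffling, even an oracle that knows exactly which words make up the sentence cannot beat the entropy of the remaining permutations. A small illustration of that floor (my own sketch, not from the paper):

```python
import math
from collections import Counter

def oracle_shuffle_perplexity(sentence: str) -> float:
    """Per-token perplexity of an oracle that knows the sentence's word multiset
    but not the (uniformly random) order in which the words will appear."""
    words = sentence.split()
    remaining = Counter(words)
    n = len(words)
    log_prob = 0.0
    for i, w in enumerate(words):
        log_prob += math.log(remaining[w] / (n - i))  # best possible guess at position i
        remaining[w] -= 1
    return math.exp(-log_prob / n)

print(oracle_shuffle_perplexity("the cat sat on the mat"))  # ~2.67, and it grows with sentence length
```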
@bediosoro7786 2 months ago
@gettingdatasciencedone You are right: if the input grammar is random, then any random output is valid.
@Micetticat 2 months ago
First of all, what an incredibly good poster presentation. Second, this finding is a testament to how much optimization must have gone into human languages! And this happened naturally, through interactions between humans whose goal was to make communication efficient and effective!
@junaidali516 2 months ago
I see a clear issue with the construction of an "impossible" language. The paper simply suggests that GPT-2 struggles more with learning something with reduced or randomized structure (randomization does not make something a language) compared to structured natural languages. However, this does not directly challenge Chomsky's claim. In Chomsky's framework, an impossible language would be one that, despite having a clear and consistent structure, does not align with the deep abstract structure that underpins natural languages (and forms the basis of a universal grammar). The paper's constructed languages, instead, merely reduce structure, which does not equate to being "impossible" in a meaningful linguistic sense.

An appropriate test of Chomsky's claim would involve creating languages that are as structured as natural languages but whose rules deliberately contradict the deep structures inherent in natural language. If LLMs struggled more with these kinds of structured but fundamentally "unnatural" languages, it would provide stronger evidence regarding their limitations in learning impossible languages as defined by Chomsky. But given that neural networks are universal function approximators, there is no reason to believe that, given sufficient structure comparable to that of a natural language, LLMs would struggle to learn such an "impossible" language.

The paper does show that it is the inherent structure in a natural language that helps an LLM learn it better than something random, but we don't need to test that. That's quite straightforward, and that point is made sufficiently clear in the video :)
@duxoakende 2 months ago
I don't think many folks in the comments, or indeed even on this research team, thought this through as well as you and a few others here have. Most of it is offhand hate for Chomsky and a blatant misunderstanding of what he meant. I suspect it has a lot to do with his political leanings, which I have quite literally never seen go unmentioned in any comment section on the topic.
@cosmic_reef_17 2 months ago
Really cool paper idea! Thank you for covering it. Wonderful explanations as always!👏
@AICoffeeBreak 2 months ago
Thank you!
@lolroflmaoization 2 months ago
This is just cheating, though, as it's not exactly testing what it ought to be testing. Word embeddings already carry a lot of implicit information about language structure, so of course LLMs are going to be better at learning a language whose embeddings were learned from it. What would be more interesting is to have a whole corpus of a language on which both the embeddings and the training of the LLM are done.
@AICoffeeBreak 2 months ago
Wow, this is the first good argument I've come across in the entire discussion about this paper. And I've heard a lot of reactions to it already.
@AICoffeeBreak 2 months ago
Although, to me, it sounds like Chomsky's claim that LLMs are not linguistically interesting (because humans have something innate about language while LLMs don't) is already refuted by a few building blocks introduced by the humans designing LLMs:
* Embeddings
* Positional encoding
* Autoregressive left-to-right processing (no wonder the reversed languages are harder to learn in the paper's experiments)
All these components, called inductive biases in ML, make LLMs innately better at human languages than at other languages.
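The left-to-right bias, for example, is hard-wired into the causal attention mask of every GPT-style model: a position can only attend to earlier positions. A minimal sketch of that mask (illustrative only, using numpy for brevity):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```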
@lolroflmaoization 2 months ago
@AICoffeeBreak Indeed, so it would be interesting to compare languages that respect these biases while not respecting Chomsky's constraint that language complexity is generated by the Merge operation.
@heyheyhey3289 2 months ago
? Each GPT-2 model is trained from scratch, and there is no reliance on pretrained embeddings, so I don't see any cheating.
@AICoffeeBreak 2 months ago
To my knowledge, GPT-2 uses a byte pair encoding (BPE) tokenizer, which they do not train from scratch; they only train the model itself from scratch.
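To illustrate the distinction with the Hugging Face `transformers` library (a sketch of the general pattern, not the paper's actual code): the pretrained GPT-2 BPE tokenizer is reused as-is, while the model weights are freshly initialized and trained from scratch.

```python
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Reused: the BPE vocabulary and merges were learned on natural English text.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Trained from scratch: random initialization, no pretrained weights loaded.
config = GPT2Config(vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)

print(tokenizer.tokenize("the cat sat on the mat"))
print(sum(p.numel() for p in model.parameters()), "randomly initialized parameters")
```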
@gettingdatasciencedone 2 months ago
Great summary and discussion. Thank you Letitia. Also - some really great points raised here in the comments.
@AICoffeeBreak 2 months ago
Yes, so many, and so heated that it is hard to keep up. 😅 But indeed, they bring up interesting points.
@ShivamGupta-qh8go 2 months ago
Not to sound argumentative... but what was even the point of it all? It's so obvious that the harder the pattern, the higher the loss, for a model of the same size trained for the same number of epochs.
@odemura 2 months ago
The answer is given at the very beginning of the video: to disprove a bogus claim by the (groundlessly renowned) Noam Chomsky.
@ShivamGupta-qh8go 2 months ago
@odemura A bit too harsh... I know Chomsky from DFAs and stuff. Anyway, it won the best paper award, so it's not completely vague.
@duxoakende 2 months ago
@odemura If you genuinely believe him to be groundless in his fame, then you have no idea what you're talking about. His work was massively formative for computer science and formal linguistics, and the entire field of linguistics today has been drastically changed by his work.
@ScoobyDoo-xu6oi 2 months ago
@odemura This paper is a joke. It doesn't prove or disprove anything; it actually gives supporting evidence for the claim it tried to disprove.
@odemura 2 months ago
@duxoakende Yep, that's my personal biased opinion, originating from hearing that last sentence since 1999 while seeing none of his ideas confirmed yet. Have any to share?
@TheRyulord 2 months ago
Just skimmed the paper, and it seems like they do the exact opposite of rejecting Chomsky's hypothesis. Languages like their "partial reverse" are completely implausible as human languages, but the LLM still learns them very well. The perplexity is higher, but that's to be expected of an inherently more complex pattern. The model still learns it with no issue, though. I don't know whether humans can't learn this language or whether it would just be harder for them. The token-hop and word-hop languages are similar. The difference is shockingly small. Yes, it exists, but the model is clearly learning these languages just fine. The authors also fail to control for the Shannon entropy of their languages in any way, so we can't really say how much of this difference is due to priors in the models vs. inherently harder-to-predict sequences.

Honestly shocked this got an award. Statements like "transformer-based language models learn possible languages about as well as impossible ones" are just trivially true. The whole reason transformers are so popular is that we can throw any data at them (language, sound, images, video, RL environment observations) and the model will learn the data quite well. Deep learning models are *universal* function approximators after all, not "natural language autoregression" function approximators or anything like that.
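For readers who have not opened the paper: the reverse and hop languages are deterministic rewrites of English sentences. A rough sketch of their flavor (a simplified approximation; the paper's exact marker tokens and placement rules differ):

```python
def partial_reverse(tokens: list[str]) -> list[str]:
    """Insert a marker and reverse everything after it (simplified: split at the midpoint)."""
    split = len(tokens) // 2
    return tokens[:split] + ["<REV>"] + tokens[split:][::-1]

def token_hop(tokens: list[str], verb_index: int, hop: int = 4) -> list[str]:
    """Place a verb-agreement marker a fixed count of tokens after the verb,
    a counting rule unattested in human languages."""
    out = list(tokens)
    out.insert(min(verb_index + hop, len(out)), "<MARK>")
    return out

print(partial_reverse("the cat sat on the mat".split()))
print(token_hop("the cat sat on the mat".split(), verb_index=2))
```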
@lobovutare 2 months ago
They didn't fail to control for entropy; that is the very factor they tested. Perplexity is simply a function of entropy: `Perplexity(p) = b^H(p)`, where `H` is entropy, `b` is some base, and `p` is some distribution.

Allow me to fix your misunderstanding. Chomsky's claim is: "ChatGPT and similar programs are, by design, unlimited in what they can “learn” (which is to say, memorize); they are incapable of distinguishing the possible from the impossible. Unlike humans, for example, who are endowed with a universal grammar that limits the languages we can learn to those with a certain kind of almost mathematical elegance, these programs learn humanly possible and humanly impossible languages with equal facility."

The authors show that such limits exist for transformers also, thus showing that Chomsky's claim is false. They demonstrate that when you make a language more impossible (thus less mathematically elegant and less meaningful), the entropy, and thus the perplexity, *of the learned model* goes up as well. That means there are limits to what you can teach a model of finite size, just as with humans. You can teach it anything, of course (just as you can teach a human anything), but its ability to perform well on any test you give it will drop if what you teach it makes little sense (just like a human who fails to understand what they've learned). It shows that language models also benefit from the mathematical elegance of language as well as the meaning of words. The more elegant, the less a language model needs to learn. Thus one can infer that language models learn both the structure and the meaning of language. This comes as no surprise to anyone with a bit of insight into machine learning, and yet Chomsky fails to grasp it. Honestly, Chomsky is just wasting our time.
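For reference, the relationship the comment leans on, written out with the model distribution q evaluated on data drawn from p (so strictly a cross-entropy, estimated in practice as the average negative log-likelihood on held-out text):

```latex
H(p, q) = -\sum_{x} p(x) \log_b q(x), \qquad \mathrm{PPL}(q) = b^{\,H(p, q)}
```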
@AICoffeeBreak 2 months ago
@TheRyulord, why do you think it would have been better to control for priors in the model? They are the "innate" components of the model, and if we are to compare with humans, who (according to Chomsky) have something innate about human languages, then we should let LLMs be LLMs, no? 😅
@Juri-t1 2 months ago
@lobovutare I do not understand how you get from the Chomsky quote in your first part to the strong conclusion that Chomsky "fails to grasp it." Going from the not really specific notion of "equal facility" found in some news article to essentially "Chomsky said he denies that perplexity values would differ on languages with different entropy" seems like a wild interpretation of the quote. I think this is also a problem with the paper. Or can you point to other quotes where Chomsky concretely says the words you seem to have put in his mouth? I'd be happy to look at them. Until then I have to agree with @TheRyulord's point of view. (Hoping this comment will not disappear again.)
@lobovutare 2 months ago
@Juri-t1 Who is putting words in whose mouth? I never said _"Chomsky said he denies that perplexity values would differ on languages with different entropy"_. I would not even write such a sentence, because it **again** shows the fundamental misunderstanding that @TheRyulord has and that I tried to clear up. It is not the language itself that has entropy: it is the model of the language that has entropy (and thus perplexity; one is just a function of the other). If a language made more impossible were learned with equal facility, one would not expect the entropy **of the language model** to rise, because according to Chomsky the language model should not care about the mathematical elegance of the language, because it is just "a kind of super-autocomplete." I predict that you're going to be stubborn about it, so I'll tell you in advance that I won't reply again. This is a massive waste of time, because Chomsky is just obviously wrong. Funny enough, now Chomsky is wasting my time also, so let me just get back to my (machine learning) job now.
@Juri-t1 2 months ago
@lobovutare This is a sad response; it seems like you are trying to escape a discussion and the reasonable arguments from me and @TheRyulord, which, by the way, you didn't refute at all. Maybe if you weren't so cocksure about everything you would come to understand that there is by no means such a clear interpretation of what Chomsky was trying to say as the paper and you are presuming. And you are again putting words in Chomsky's mouth. So I hope I'm quoting your highness to your satisfaction this time: you wrote "[According to Chomsky] the language model should not care about the mathematical elegance of the language, because it is just 'a kind of super-autocomplete.'" Now again, where did he say this? In the NYT article, "elegance" and "super-autocomplete" occur in different contexts, not in the same sentence. Inserting the word "because" here is quite disingenuous of you.
@marcfruchtman9473 2 months ago
I am impressed with this paper, although I probably would never have attempted this one in particular, because I don't think it is possible to define "impossible" the way Chomsky was using it in that phrase; i.e., he can always argue later that he didn't mean it the way they claim he meant it. Nevertheless, it does make total sense that LLMs would have more difficulty with language structures that are practically impossible, simply because they train on probability. Obviously, if the probability of meeting some nearly impossible word structure is very low, then there will be less chance for the LLM to recognize patterns it can discern. I would counter that with near-infinite training an LLM can identify anything that has a pattern, and cannot identify anything that doesn't have a pattern... and let them stew on that for a while. Heheh. My hat goes off to the authors... well done.
@hebozhe 2 months ago
It almost seems like this could be argued backwards. If randomness is considered grammatical for a language L, then random output from a generative LLM is grammatical for L. So, impossible languages are learnable by these models.
@gettingdatasciencedone 2 months ago
This is a great point.
@kobikaicalev175 2 months ago
The real comparison should indeed be with increasing the entropy equally via the proposed possible and impossible ways, which is interesting. Also, truly in-depth research would include human learners.
@AICoffeeBreak 2 months ago
I agree about the human learners. I'm sure there are psycholinguistic studies about something like this, but of course not tailored to the entire spectrum. About the entropy: I think the authors kept it quite well in check with the counting grammars. I used the word "entropy" a lot in the video, although at the first stage of the spectrum it is rather complexity, not entropy, that rises.
@kobikaicalev175 2 months ago
@AICoffeeBreak Yes! But let's check the claim about entropy. I think you _are_ actually right in your intuition about increasing entropy (I have only listened to your video, not read the article yet). Take, for example, any error-correction code, such as adding a checksum to a binary string: this would increase the entropy, and doing any entropy coding, like a Huffman tree, would reduce the entropy back. So, for example, adding any grammatical agreement, say a "gender" marker that adjectives and nouns must agree on, would increase the entropy, increase redundancy, and make the language more channel-robust, but harder to learn. Now, if you make the same entropy increase using the suggested counting-grammar marking, the fair comparison would be between the two different extensions of the language, not between the counting extension and the original.

Regarding human learners, I think this kind of experiment could be done with some conlangs and human learners. It's a harder and more expensive setup (especially using exposure, without telling the rules to the human subjects, so that the language is acquired, not monitored). Maybe the paper does compare both extensions; I thought it measures the counting extension against the non-extended version. I think Chomsky's assumption would be correct: humans would find the 'natural' ones easier to learn, but that would not require his proposed "language acquisition device". It could still be explained with generic human pattern matching, such as the "common fate" assumptions found in both visual and audio processing, for example, or something similar.
@kobikaicalev175 2 months ago
@AICoffeeBreak OK, I browsed through the article now; it is nice indeed! So yes, there is a "control" language in each category, and I think the result is pretty predictable; the gaps are not that big in any category but Shuffle. But the Shuffle category is, I think, linguistically and information-theoretically wrong: English has a very order-based grammar. Shuffling can be applied without changing entropy only in a totally synthetic language, or by shuffling only morphologically marked sub-units (for example in Latin, Russian, or Finnish); therefore the Shuffle graphs are a lot more skewed. The Hop graph, for example, is very clean. My suggestion of adding more coded properties to a language could also be tried.

Anyhow, I think it would hold that the Hop languages become harder for an LLM, but probably even harder for human learners, and this doesn't need the language-learning device; some devices, either biological or derived from other experiences, would make them harder to learn. Another question is whether these languages would be stable for human communication. There is some sort of optimization between entropy compactness and channel robustness when a language is in use; that's why repeat-each-symbol-twice would not be stable. A "checksum" would also not be stable, I suspect, as it would not be consistent which entities are being counted (which also makes it very ill-defined for the Hop category); it would not only be ill-defined between different people, it would become ill-defined even between two different LLMs. However, adding gender categories (or any noun classes) would keep the language human-learnable (especially if the classes have different modality correspondences), while marking something based on orthography only would be less learnable (like skip N letters forward and mark the category there), or, even more extreme, counting the orthographic letters and adding a vowel sound every N characters: that would be LLM-learnable... but humanly... challenging.
@msokokokokokok 2 months ago
They proved that machines cannot do what humans cannot do. How does that prove that language is not unique to humans? It just proves that randomness is harder to learn for both humans and machines. In fact, in GPT-4, if you define the rules in the instruction, it can pretty much follow them like a native speaker.
@sadface7457 2 months ago
I find it surprising, given that there are embedding-free models that learn byte-to-byte encodings, which would look like an impossible language.
@tijm6140 2 months ago
First time hearing the term surprisal. Julie defines it as the negative log probability of a token, which sounds like the standard cross-entropy loss to me. Is there a difference between loss and surprisal?
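For reference, the two quantities coincide at the token level; the usual training loss is just surprisal averaged over positions, and perplexity is its exponential:

```latex
\operatorname{surprisal}(x_t) = -\log p_\theta(x_t \mid x_{<t}), \qquad
\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \operatorname{surprisal}(x_t), \qquad
\mathrm{PPL} = e^{\mathcal{L}}
```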
@yannickpezeu3419 2 months ago
Do we have an evaluation of the difficulty of learning each human language, in terms of FLOPs to achieve a given perplexity?
@AICoffeeBreak 1 month ago
No, unfortunately. Clearly, this would have completed the study and would have made it less debatable.
@kleemc 2 months ago
IMHO, this kind of discussion is only interesting to die-hard Chomsky followers. Chomsky is clearly trying to defend his own turf from being eroded by LLMs. The great majority of us don't really care whether LLMs truly understand language in the way Chomsky defines it; we only care whether they are useful, and the answer is clearly yes. Trying to prove or disprove Chomsky's argument is not very useful.
@yamenajjour1151 2 months ago
Isn't the size of the vocabulary also a parameter of language complexity? Thank you for the great video!
@aboubenadhem9066 2 months ago
I haven’t read the paper yet, but it seems like the control language should have been, not natural English, but another scrambled version of English that added a similar amount of entropy as the test cases without violating Chomsky’s “merge” rule.
@aboubenadhem9066 2 months ago
I just finished skimming through the paper. They claim that “linguists themselves do not even have an agreed upon notion of what defines the possible or the impossible languages”, but they fail to cite anything by Chomsky or anyone else in the Minimalist program from the last 20 years, which would have made the intended meaning of “impossible” clear. Instead they go back to the Chomsky hierarchy from the 1950s, as if that were the context of Chomsky’s recent claims.
@AICoffeeBreak 2 months ago
Wouldn't the counting grammar go against the minimalist program?
@AICoffeeBreak 2 months ago
And wow, you read the paper! 👏 I'm really happy you did.
@aboubenadhem9066 2 months ago
@@AICoffeeBreak Yeah, I can’t think of a way to implement their counting grammar using Merge (but I’m not an expert). My understanding of Chomsky’s claim is that humans can’t natively process complex sentences unless the complexity is based on Merge, while LLMs can process complexity added by other means (like the counting grammar). And while the paper demonstrates that LLMs are sensitive to added complexity, I’m not convinced it shows that they treat complexity added via Merge differently from complexity added through other mechanisms.
@brainxyz 2 months ago
I honestly don't know how someone as intelligent as Chomsky could make such unfounded claims. I suspect that he is quite disconnected from today's state-of-the-art neural networks.
@zildjiandrummer1 2 months ago
Unfortunately, it seems to be a constant battle between the pre-ML sciences and ML in every field. To my knowledge it's happened in computer vision, in physics to some extent, and in a ton of other niches. The folks from the pre-ML days have some level of ego that prevents open-mindedness toward these approaches. However, the complementary point is also true: you can't just blindly throw ML at some problem without understanding the intricacies of the domain.
@autingo6583 2 months ago
It's not simply disconnectedness. It is deep intellectual dishonesty (in this case), and thus a serious character flaw.
@therainman7777 2 months ago
@autingo6583 Agreed. He’s equally absurd on the subject of politics.
@harambe2552 2 months ago
He's actually just ideologically so biased that he has no choice but to stand against them
@therainman7777 2 months ago
@harambe2552 Exactly. I’ve long felt that he is one of the most overrated, and possibly the single most overrated, intellects in modern history. His work in linguistics was very impressive at one point and broke a lot of new ground, but over time some of his core tenets have been proven unambiguously incorrect. On top of that, his views on anything outside of linguistics, which he seems to hold just as fervently as his views on linguistics, are an absolute absurdity.
@Thomas-gk42 2 months ago
Now that's very interesting. I agree with you: nothing special about language and nothing special about the human brain. Thanks again for your videos. Chomsky's work is a bit outdated; he doesn't know much about LLMs, I assume. When I studied linguistics his ideas were a big thing, widely discussed. The focus was on his claim that humans (kids) cannot learn their mother tongue in the time they do, because the input during the relevant period is too small. The conclusion was therefore that a genetic predisposition, a "grammar gene", should exist. Interesting to see how research on LLMs brings it up again.
@Copyright_Infringement 2 months ago
Chomsky: make offhanded asspull comment
Linguists: we must rigorously test this claim to figure out what Chomsky-sama meant by this and how much he was in tune with the universe's secrets
@haideral5104 2 months ago
Did he say that an LLM can learn a sequence of randomly occurring tokens? I don't think so. If so, then we could train LLMs or other deep learning algorithms to learn random number generators... and if that worked, we would have a real problem.
@wege8409 2 months ago
I think they said that if the sentence is unlikely to occur in real life, for example if you shuffle the tokens and then try to learn, the model has a much harder time learning that than structured, real-language examples.
@AICoffeeBreak 2 months ago
The authors did not test random tokens occurring randomly, but random word **shuffles** (sorry if this was unclear in the video; I thought I made it clear with the example in the bottom right corner at 02:09). So, they took a sentence (a semantically grouped sequence of words) and shuffled the order (as shown in Figure 1, e.g., at minute 02:32). That it is just random shuffles, and not random words from the entire vocabulary, explains why the perplexity decreases at all during learning (see the figure at 09:17). It's not just any word that could appear in the sentence, but rather the words of the sentence appearing in a random order. For example, if you shuffled "the cat sat on the mat," the LLM might have to complete something like "on mat cat". Here, "crypto" would be a much stranger continuation than "sat". This is why the LLM "learns" something, lowering perplexity during training, even though it plateaus fairly quickly.

And the random word shuffles are the most extreme case. The authors also tested less extreme cases on the spectrum, such as the count-based grammars. Those are harder to learn for humans (some would say impossible, I'm not sure), and LLMs too show higher perplexity on those. Then there is reversed word order, which is still quite structured compared to random word shuffles, and again, LLMs have a harder time learning it and the perplexity plateaus are higher than for normal English.
@haideral5104 2 months ago
@AICoffeeBreak Thanks for the detailed clarification.
@dakrontu 2 months ago
There are 2 separate issues here. One is the internal model, which we can reasonably expect, in all cases, to be a hierarchical tree structure. So what is a language? It is a serialised encoding of such a structure. It requires an encoder to serialise the hierarchical structure, and a decoder to de-serialise back into that structure. It has to work both ways or it is useless. And it has to work well, i.e. the decoded structure must nominally match the original one that was encoded; again, otherwise it is useless.

All that encoders and decoders have to do is process the structure appropriately. They don't need to 'understand' it. They just perform their (potentially dumb) procedures. To that extent, they are just a practical means to an end, not an end in itself. You can invent any kind of encoding and decoding that you want; it's arbitrary, and just a matter of making it work correctly, otherwise you are wasting your time. Technology is full of encoding and decoding schemes. They are usually identified as 'file formats' (of which there are thousands, including many you are familiar with, identified by their 3-letter filename suffixes such as DOC or TXT or JPG) or 'streaming formats' (like YouTube and Netflix et al. use). These formats are appropriate to specific types of internal data. An image, for example, is a 2-dimensional array of intensity values with separate components for the primary colours R, G, B. On the other hand, a TXT file encodes something that is already a serialised form of an internal data structure; it is just a further step beyond that serialisation, involving deciding whether to encode as UTF-16, UTF-8, UTF-7, plain ASCII, or whatever.

The point I am making is that encoding and decoding schemes, which include serialisers and de-serialisers, are abundant, written for practical uses, and it would not be impossible to provide the means for an AI to handle the resulting serialised storage and transmission formats. Beyond that, it would be possible for an AI to learn how to decode (and re-encode) data stored in such formats, given enough examples and enough of a knowledge base of what to look for in recognising specific formats, e.g. formats used in compressing and decompressing data. So I don't see how it is possible for Chomsky to put up a brick wall that no LLM will ever knock down.

Having said that, I asked an AI to multiply two high-precision integers; it got the wrong answer, tried again, got the wrong answer again, and betrayed the fact that it had no means to comprehend what it was doing, not being conscious, etc. But to be fair, that was a few years ago. And today, with robots being designed that need an intrinsic understanding of the dynamics of the real world within which they operate, the moves towards giving AI ways to understand the complexities of the real world, complexities that are hard to discern from reading texts, are underway. It's a bit like learning to swim. Yes, you could read a book. (Like an AI.) But at some point you have to get in the water. (The harder step, maybe, but one that is already well-trodden with Tesla autonomous driving, for example.)

Sorry, this was written off the cuff and I haven't spared the time to go through it and check that it makes sense, but I hope the gist comes across: setting limits on what AI can do is not very fruitful so early in a game which has a long way to run. The only thing I would feel some confidence in setting a limit on, at least in the short to medium term, is AI becoming conscious. If it does, we are in trouble, because it won't believe that meat-brains such as us humans can possibly be conscious. It will conclude that we invented it without being conscious, and that biological structures are, manifestly, capable of at least some intelligent behaviour, but in no way matching what it will conclude its own can be if it is allowed to expand its processing capacity.
@googleyoutubechannel8554 2 months ago
This is a great paper, but Chomsky being wrong about language is pretty much the default; he's practically never been right about anything evidence-based.
@RedHair651 2 months ago
Maybe in linguistics. In politics he's really solid most of the time, from what I've seen.
@Ben_D. 2 months ago
I don't even respond to ASMR, like I hear some people do, chills and whatnot. But I always look for your closing message. I am convinced that it is more than just a sexy whisper... I think you are my first actual (albeit minimal) ASMR response. I really like it, more than what one would expect at face value. Interesting.
@samuelazran5643 2 months ago
A typical LLM could predict this paper
@roberthewat8921 2 months ago
So they have demonstrated that LLMs can learn the patterns of both natural and nonsense languages, though they struggle the more nonsensical the language is. I don't really see how this proves or disproves Chomsky's claims. It seems a bit like this paper was awarded a prize for political reasons (i.e., the AI bros are desperate to knock Chomsky off his pedestal) rather than on scientific merit.
@roberthewat8921 2 months ago
@Singularity606 It's a bit odd that you construed what I wrote as an ad hominem, unless you are an AI bro with a very fragile ego.
@roberthewat8921 2 months ago
@Singularity606 Talk about a fragile ego. You should think about learning a language, including semantics. Have a good one.
@automatescellulaires8543 2 months ago
Every YouTube video I watch these days makes it about the "message". 8:40, with the same intonation on "message".
@dr.mikeybee 2 months ago
Chomsky's main complaint about LLMs is rooted in the fact that he didn't invent them. ;)
@tsclly2377 2 months ago
In WW2 the US used Native Americans from the Southwest for quick, simplistic, semi-coded information over insecure radio transmission, and it worked. This could be viewed as a mixed language model of impossible relationships. Now a computer can decipher this, but only using a large known database, and thus a large language base model, which can lead to very large computational demands that consume resources and (or) large amounts of time. That is counterproductive in actual use, or it leads to multiple outcomes of equal mathematical weight without specific knowledge that may only be relevant to local context within a certain time-frame. This is not the object of an LLM (unless you are the NSA spying in on conversations to determine an outcome). Big tech likes this, as it makes these models more reliant on centralized computing and limits local availability, but it is counterproductive to the advancement of humanity; thus LLMs will have increasingly diminished returns. It is people who should adapt to a simpler human-to-machine communication, not machines trying to be indistinguishable from humans in language and form, or at some point there will be a rejection and an increasing skepticism among a certain percentage of humans about even interacting with machines in this evolving world. Why should a machine need to know such things as sports or celebrity status for real advancement? Garbage in -> garbage out. BTC is garbage and a real waste of energy (and so is most of Las Vegas).