I was not impressed by the ARC-AGI challenge (not actually a test for AGI)

10,498 views

David Shapiro

4 days ago

Based on the search results, there are several criticisms and potential limitations of the ARC-AGI Challenge:
1. Limited scope of intelligence measurement:
Some argue that while ARC-AGI tests certain aspects of reasoning and pattern recognition, it may not be a comprehensive measure of artificial general intelligence (AGI). The tasks are focused on visual pattern matching and may not capture other important aspects of intelligence like language understanding, common sense reasoning, or open-ended problem solving[1][5].
2. Potential for brute-force approaches:
Critics suggest that the challenge could potentially be solved through brute-force search or by generating a large number of possible solutions, rather than through genuine reasoning. This was demonstrated by a recent approach using GPT-4 to generate numerous Python implementations, achieving 50% accuracy on the public test set[2].
3. Overemphasis on sample efficiency:
The challenge places a strong emphasis on learning from very few examples, which some argue may not be the only or most important aspect of intelligence. Humans often learn from vast amounts of data over time, and AI systems might legitimately need more examples to achieve robust performance[5].
4. Possible overfit to the specific task format:
There are concerns that solutions might be overly tailored to the specific format and rules of ARC tasks, rather than demonstrating general problem-solving abilities that could transfer to other domains[4].
5. Debate over relevance to AGI progress:
Some researchers question whether solving ARC-AGI would necessarily represent a significant milestone towards AGI. They argue that success on this specific benchmark may not translate directly to broader artificial general intelligence capabilities[4][5].
6. Limitations of the prize structure:
The $1 million prize may not be sufficient incentive for major breakthroughs, given the potential value of AGI-related innovations. Additionally, the requirement to open-source solutions might discourage participation from commercial entities[1].
7. Potential for training on the test set:
There are concerns about the possibility of models being trained on the public test set, which could inflate performance metrics without demonstrating true generalization[3].
8. Lack of language and world knowledge components:
The challenge intentionally excludes language understanding and world knowledge, which some argue are crucial components of general intelligence[4].
While the ARC-AGI Challenge is recognized as a novel and potentially valuable benchmark, these criticisms highlight the ongoing debate about how best to measure and pursue progress towards artificial general intelligence. The challenge's creators, including François Chollet, acknowledge that it's not perfect but argue that it addresses important aspects of intelligence that current AI systems struggle with[3].
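The brute-force criticism in point 2 can be sketched in a few lines of Python: enumerate candidate transformation programs and keep any that reproduce all the training input-output pairs. The grids, candidate functions, and task below are illustrative only, not the actual GPT-4 pipeline from [2].

```python
# Minimal sketch of the brute-force program-search criticism (point 2):
# enumerate simple candidate transformations and keep those that
# reproduce every training pair. Candidates and grids are hypothetical.

def transpose(g):
    return [list(row) for row in zip(*g)]

def flip_h(g):
    return [row[::-1] for row in g]          # mirror left-right

def flip_v(g):
    return g[::-1]                           # mirror top-bottom

CANDIDATES = {"transpose": transpose, "flip_h": flip_h, "flip_v": flip_v}

def solve_by_search(train_pairs):
    """Return names of candidate programs consistent with all examples."""
    return [name for name, fn in CANDIDATES.items()
            if all(fn(inp) == out for inp, out in train_pairs)]

# hypothetical task: the rule is "mirror the grid left-right"
train = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]
print(solve_by_search(train))  # ['flip_h']
```

With a large enough candidate pool (e.g. thousands of generated Python programs), this finds rules by search rather than by reasoning, which is exactly the concern raised above.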
Citations:
[1] / arc_prize_arc_prize_is...
[2] www.lesswrong.com/posts/Rdwui...
[3] www.dwarkeshpatel.com/p/franc...
[4] news.ycombinator.com/item?id=...
[5] news.ycombinator.com/item?id=...
[6] www.lesswrong.com/posts/x2tCS...

Comments: 171
@hermestrismegistus9142 2 days ago
It is a necessary but perhaps insufficient test for general intelligence.
@DaveShap 2 days ago
Necessary but not sufficient. Perfectly said.
@ryzikx 2 days ago
"perhaps"? certainlyhaps
@holdenrobbins852 2 days ago
I think this was François Chollet's point. The test isn't meant to be the end-all-be-all for AGI. His point is more that, as amazing as AI currently is, it has a hard time with basic tasks along these lines that a child could easily do. Therefore we will most likely fall short of AGI until computers can reliably solve these types of abstraction problems, working from limited examples (i.e., not regurgitating answers seen in the training set).
@dorotikdaniel 2 days ago
While the test is pretty narrow, a system that can solve it within the given constraints would likely show strong general reasoning skills, making a significant contribution to the study of general intelligence. It's a shame that the test is so visual. I wonder if a blind person could solve it, no matter how smart they are or how the problems are presented to them. The current state-of-the-art LLMs have the same issue. Even if you prompt them to do complex thinking, these problems are really hard to put into words, and their visual capabilities just aren't there yet.
@DivinesLegacy 2 days ago
@holdenrobbins852 He said that if an AI passes it, then it's AGI.
@tedjohanson3 2 days ago
I think you're missing the point. The test is not supposed to be proof of AGI, but proof that today's models lack reasoning capabilities. It's based on novel tests that are not part of the training data, and it shows how poorly LLMs handle even very simple tasks when they haven't seen thousands of examples before. For me, it really showed that AGI is much further down the road than I previously expected, and that models still lack some very fundamental pieces that will not be overcome by just scaling.
@manslaughterinc.9135 2 days ago
The issue with this test is that it's testing a model in a way that it's not able to actually compute. Imagine giving this test to a colorblind person, because that is what it is like. Vision models don't actually "see" the images. They learn from labeled images, and are therefore converting the image to tokens that correspond to text. These labels carry very limited relational data. Testing a language model on vision is ignorant. Furthermore, the model has no way of outputting the data in a valid format. Even if you were to convert these images into strings, the model does not see those strings as two-dimensional patterns. It reads strings linearly, in one dimension, so it does not actually "see" that the X is above the O.
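A tiny illustration of the serialization point above (hypothetical 2x3 grid): once the grid is flattened into a one-dimensional string, "X is directly above O" becomes a long-range relation between token positions rather than a visible adjacency.

```python
# Sketch: flattening a 2D grid to one dimension turns vertical
# adjacency ("X above O") into a positional offset of exactly `width`.
grid = [
    list(".X."),
    list(".O."),
]

flat = "".join("".join(row) for row in grid)
x, o = flat.index("X"), flat.index("O")
width = len(grid[0])

print(flat)            # .X..O.
print(o - x == width)  # True: "above" is now "width tokens later"
```

A model reading `flat` token by token has to infer the grid width before any 2D relation is recoverable at all.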
@AbelShields 2 days ago
@@manslaughterinc.9135 well then, it seems like it should be fairly simple to either translate the problems to something the LLMs should understand... Or build a new model without these shortcomings. Either way, expecting to achieve AGI without being able to solve this test seems strange.
@merefield2585 2 days ago
​@@manslaughterinc.9135 so you basically agree with Chollet, current LLMs have significant shortcomings!
@DaveShap 2 days ago
Okay but don't call it an AGI test.
@ryzikx 2 days ago
@DaveShap "It's AGI when I say it is!" - the guys that made the test, probably
@AbelShields 2 days ago
No duh, this isn't a test for AGI. Necessary but not sufficient. It's simply to point out the shortcomings of the current approaches, to get people working on more general forms of intelligence, with reasoning and logic.
@DaveShap 2 days ago
So don't call it an AGI test
@guidoftp 2 days ago
@DaveShap It's one of the AGI tests. I don't know why you're upset about the label. What's important is that this test proves a point: current LLMs are not the solution, and scaling won't be enough. This helps accelerate AGI by exposing problems with current models.
@jeffsteyn7174 2 days ago
Then you should tell him to stop calling it an AGI test. Duh
@michaelhandy4018 1 day ago
@DaveShap Perhaps a "not AGI" test would have been clearer, but I think everyone knew an LLM that can do this isn't necessarily AGI.
@barefeg 6 hours ago
@DaveShap It actually wasn't called that before. This test has been out for a very long time; it was just ARC. I guess with the $1 million prize and all the media attention, the name is more of a clickbait title.
@jd348 2 days ago
It's an AGI test in the sense that if the system doesn't pass the test, it is not AGI. But true, if it does pass, that doesn't mean it's AGI.
@RoulDukeGonzo 2 days ago
I agree, it's a bad test of 'AGI', but the obvious push back is, why are the competition winners only doing 39%? If you (or your buddy Claude) can do better, you can win 500k.
@DaveShap 2 days ago
I remember when people were saying the same thing about the Winograd Schema Challenge, that it was more or less impossible for AI to solve. Does anyone talk about it anymore? No. The entire challenge has been superseded. Mark my words, this challenge will likewise be superseded and irrelevant within a year or two.
@danbolser5913 2 days ago
@@DaveShap Consider your words marked ;-) I wouldn't argue that this test is somehow special and/or meaningful, however, it's noteworthy (I think) that humans get 80% (according to their description) and current 'best' approaches get ~30% ... I'm not sure how Kaggle usually plays out, e.g. if all the best answers come in at the end. I know LLMs do really badly at deducing logical conclusions over a certain number of 'facts' or axioms, which is somehow unsurprising (if you think of context as working memory). However, I'd somehow expect better performance over these little toy puzzles.
@mshonle 2 days ago
@DaveShap Reminds me of when people said Go was too hard to solve (with the available approaches) and that it might even be "AI-hard". But then AlphaGo came along and played Go quite well without solving, say, self-driving.
@rasen84 2 days ago
@DaveShap When they went around advertising this new Kaggle competition and the million-dollar prize, they very loudly noted that this test is 5 years old. Lab42 had been holding competitions with cash rewards for a while. It's not getting solved anytime soon. That secret test set is really an advantage for ARC.
@tmstani23 2 days ago
@DaveShap I think you're right. Some of the chain-of-reasoning systems they are working on have been shown to perform much better on ARC tests. There is also the problem that vision isn't advanced enough for models to accurately "see" the tests yet, which will be resolved in future systems.
@chrisdipple2491 2 days ago
I think you have missed the point, David. This is a test of something humans do easily and computers don't: few-shot learning, or abductive reasoning. Find a pattern, create a hypothesis, test it against other examples until it models all available examples, then try it on new data. The search space is too big for naive search, so constraints are necessary. Another example would be the 'naive' physics a child learns about the world; again, difficult for computers. LLMs are about compressed, somewhat generalised, memory and correlation. This test is different, and necessary for AGI. There are many similar ones that could be built; solve this and then move on to them. Then, and only then, will abductive and causal reasoning (human type 2 reasoning) be available to computers. Which is a precondition for AGI.
@ryzikx 2 days ago
Humans also give birth 🤔 What's your point?
@ct5471 2 days ago
Will you make another update video on the state of AGI? I mean the "X months to AGI" videos.
@unclesam7853 2 days ago
If LLMs indeed have reasoning capabilities, then they should be able to solve these simple problems very easily, but even kids solve 50% of ARC puzzles, compared to LLMs which solve like 7% of them correctly (single-shot GPT-4o). It shows that they lack true reasoning abilities. This might not be a test for AGI, but an AI, if it is to be considered AGI, must definitely be able to solve these.
@tomaszzielinski4521 2 days ago
Clearly these are difficult intelligence tests, but nowhere near "general" intelligence. AGI is much more about generality than any single arbitrary measure.
@pruff3 2 days ago
I think it's just a bad visual thing. They can (last time I checked) choose the correct tool from a suite of tools given to them with 87% accuracy, so they do reason. Did you hear about AlphaGo being awful when opponents surrounded its tiles in boxes (i.e. even a low-ELO player could beat the system)? It seems like ARC is like that.
@rainlovelife 2 days ago
Yes, but that tool selection is most definitely in the training data. I've never seen something like this test before, it's not in my training data, but it's very, VERY simple for me to deduce the pattern. That's what Aristotle called "insight".
@kelvinatletiek 2 days ago
How well would a blind person or colour blind person do? If a blind person is unable to score a single point does that mean he/she can't reason?
@RandomYTubeuser 2 days ago
@@kelvinatletiek I'm pretty sure a blind person could solve these if you simply translated the inputs in a form which is understandable to them like braille. The same cannot be said for current LLMs, even when you translate these visual problems to number matrices.
@kinwong8618 2 days ago
I think you're trivialising the problem here. Why don't you solve it and grab the prize if you think it's so easy?!
@AlexandreEisenmann 2 days ago
But the point is: whatever intelligence is, it should solve these simple problems. Humans easily can.
@constantfluke 2 days ago
If you could imagine ways of "solving" it, then do it. This test is one that requires the system to learn patterns with very few examples. I would ask that you give it a shot for the November deadline!
@luisalfonsohernandez9239 2 days ago
You are missing the point here. This is a test of inductive reasoning: searching the program/model/hypothesis space and looking for the one program that produces the output. This is a key capability that humans have but AI, especially LLMs, does not yet have. Only when LLMs achieve something like this will they truly be able to help with scientific discoveries, as inductive reasoning is essential. That is also something missing from all the other benchmarks.
@Mephmt 2 days ago
When I was tested for ADHD, I took 4 _hours_ of different kinds of tests.
@MrRishik123 2 days ago
This feels like how dyslexia is for humans. It doesn't make the human any less "conscious" on the whole compared to other humans. Similarly, tests like these, of arbitrary general visual concepts, just feel like giving tests to an AI that has a "disability" with a concept it has never been exposed to.
@AbelShields 2 days ago
But then, why not just say current AI is AGI, but with a "disability" for anything it hasn't seen before? This argument seems weak to me. Don't just try and explain away bad performance; the performance is bad because we don't have general intelligence yet. To put it another way: do you really think that if we create a general intelligence, that it may not be able to reason its way through a simple problem like this?
@user-ni2rh4ci5e 2 days ago
That's a good example of comparison: "how dyslexia is to humans." Humans, like other animals, are specialized in processing visual patterns and predicting what is coming next. That is something we are born to do. However, reading a standardization of symbols on a blank white space with black dots and lines is not something found in nature. This is why it usually takes us a lot of time practicing before we fully achieve literacy, and why reading is not the most preferred hobby for most people. Similarly, unlike visually oriented humans, current text-based LLMs need more time to get used to visual recognition since they are still in their early versions. Being unable to see infrared rays doesn't necessarily mean blindness. LLMs are capable of reasoning and excel at it, even though they are currently limited to text-based input. Before long, LLMs will also be able to recognize patterns, just like humans can read.
@MrRishik123 2 days ago
@AbelShields Vision is not how these "machines" were even born. It is not fair to test them on a sense that was bolted on less than 6 months ago with GPT vision (or other multimodal models). (Using the anthro metaphors:) Their entire world is text-based. They were completely born and raised in that situation. We basically just strapped on a vision processing system, and they are still going to have to learn how to integrate vision into their logic circuits. As the other commenter in this thread said, it's not something that comes naturally even to humans without being pushed to awareness of the concepts. If you look at feral children stories/cases, they never properly learn the concepts of language even after extensive training, not even to the level of a below-average human with two normal parents. As someone grows up, I think humans do a lot of transfer learning that is multimodal by nature. Think of how many analogies we as humans use to teach arbitrary concepts: relating sports to gambling, or running to concepts like efficiency and calorie expenditure. I distinctly remember a teacher using water analogies in my electronic engineering classes at university: you can't see electricity flowing, but the analogy of water helps our human brains comprehend it. Similarly, these text models are already used to "annotate" what is going on in videos/pictures. This means they are converting visual concepts into a vehicle they can comprehend. This is no different from a mechanic who has an "aha" moment while cooking and realises the parallel skills between two arbitrary tasks.
@DaveShap 2 days ago
Sources for criticisms in the description. ¯\_(ツ)_/¯
@pigeon_official 2 days ago
Why did you cite a Reddit post linking to the ARC Prize instead of just the link itself?
@el_arte 2 days ago
It's a telltale sign of general intelligence, which you apparently missed entirely. LLMs are savants right now, which is the antithesis of GENERAL intelligence. Also, "he who can do more can do less."
@DaveShap 2 days ago
I agree with that principle. I don't agree that this is a test of that principle.
@el_arte 2 days ago
@@DaveShap That's a fair stance, but then you have to present an alternative. I say you "have to" because you pronounced the arrival of AGI, but you did not offer a test of AGI.
@AI-Wire 2 days ago
Question: What's the difference between particle filters and Monte Carlo tree searches? (Or any other kind of tree search for that matter?)
@joelalain 2 days ago
1. Your French accent is perfect. 2. Chollet's point is only this: LLMs aren't AGI. He created this test only to get to the next level, closer to AGI. He made a test that can't be cheated easily via normal LLM training, unless they put in billions of examples of this kind of pattern, which defeats the point of the test. Sure, you can brute-force this test, but the point is simply that when we create a machine that can understand context, think about a problem, and do true reasoning (in this case very simple, but still), then it can solve THIS and a lot of general problems through true "thinking principles" and not "prelearned memorization". I think this test is fantastic, and we need more tests like this that any human can solve but that can't be passed through mere "ingestion learning". An AGI must be multimodal; I don't believe we can reach AGI from text alone. So having these images and reasoning about them is a great test.
@Dri_ver_ 2 days ago
What makes it important is that cutting edge models are particularly bad at it. We should use any benchmark that models are bad at.
@byrnemeister2008 2 days ago
What it tests for is memorisation. His point is not that these are hard. It’s that they are easy. But yet current LLMs fail them. This implies current LLMs are not really developing abstract world models or performing reasoning. They are memorising. Real AGI isn’t just memorisation of knowledge it needs to be able to figure out new knowledge. They are not there yet.
@albertmashy8590 2 days ago
In my opinion, we need better benchmarks for iterative coding agents that can perform coding and machine learning related tasks. Once it can do this well, it can theoretically acquire any other skills
@brianhershey563 2 days ago
I swear to you, in the 80s I wrote a space-filling algorithm in Atari GFA BASIC that performed this exact task, filling bounded space. It was used in fractal image generation programs I made back then. My code was reviewed in regional computer club newsletters as an example of good coding practice... good times. 🙏
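For reference, a bounded fill like the one described can be sketched today in a few lines (a generic stack-based flood fill, not the original BASIC routine):

```python
# A plain flood fill over a character grid, in the spirit of the
# space-filling routine described above (not the original BASIC code).
def flood_fill(grid, r, c, fill):
    target = grid[r][c]
    if target == fill:
        return
    stack = [(r, c)]
    while stack:
        y, x = stack.pop()
        if 0 <= y < len(grid) and 0 <= x < len(grid[0]) and grid[y][x] == target:
            grid[y][x] = fill
            # spread to the four orthogonal neighbours
            stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])

g = [list(row) for row in ["#####", "#...#", "#####"]]
flood_fill(g, 1, 2, "*")
print("".join(g[1]))  # #***#
```

Several ARC tasks amount to exactly this kind of primitive, which is part of why humans find them trivial.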
@PasseScience 1 day ago
A few notes:

*Regarding the grid aspect:* I agree, but I think here it's more of a constraint than a preference. It's designed to automatically generate input-output examples from the rules and check results automatically. The focus is on the ability of an agent to deduce and construct a process it has not encountered before but could piece together from elements it has seen. To test this, they consider this discrete benchmark sufficient. In this regard (and in this regard only), the ability to combine various previously learned subroutines into a new one is somewhat cross-domain and could be a useful component toward AGI.

*Solving ARC as AGI:* Chollet explicitly states that the method of solving ARC is as important as solving ARC itself. If a system is specifically built and trained for this type of problem without the possibility of generalization, then even from his perspective it would not be very useful. He seeks not just a solution to ARC, but a method that intelligently combines symbolic reasoning and smooth reasoning. He likely delays specifying what he means by "intelligently" for convenience :)

*It's an anthropomorphic test:* Many of these problems revolve around combining more "elementary rules" to form a more complex one. However, the issue is that these elementary rules are considered basic primarily in human culture, such as "filling the inside of a shape", which involves an analogy with a real-world filling process. Thus, for many of these elementary rules, their basic nature is contingent and challenging to train for without explicitly building a collection of simple tasks that teach concepts like filling, linking, turning, etc. I consider that a limitation of this benchmark.

*How to solve something like this:* You're looking for a rule, hence a program (since everything is discrete), thus you need a heuristic over the program space. An AlphaZero-like approach in the program space, with the generation of incrementally challenging rules to figure out (training-set generation), would be an engineering challenge but would likely work well, and probably not be that useful for future generations of AIs.

*More generally:* This test seems to address only a small part of a larger picture: the ability of an agent to cycle and reflect on its own production, and possibly on the environment's reaction to that production. Currently, AIs do not seem capable of such cycling, and perhaps the return of recurrent or macro-recurrent architectures will be necessary. If such a new architecture could pass a large portion of the ARC challenge without being designed especially for it, then that would be a good sign that it is close to AGI, or at least closer.
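The "heuristic over the program space" idea can be sketched as pure enumeration: compose primitive grid operations until one composition explains all the examples. A learned (AlphaZero-like) heuristic would replace the exhaustive loop below; the primitives and task are illustrative, not a real ARC DSL.

```python
from itertools import product

# Toy program-space search: enumerate compositions of primitive grid
# operations until one reproduces every training pair. A learned
# heuristic would prune this search; here it is exhaustive.

PRIMS = {
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "flip_rows": lambda g: g[::-1],
    "negate":    lambda g: [[1 - v for v in row] for row in g],
}

def search(pairs, max_depth=2):
    """Return the first composition of primitives consistent with all pairs."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMS, repeat=depth):
            def run(g, names=names):
                for n in names:
                    g = PRIMS[n](g)
                return g
            if all(run(inp) == out for inp, out in pairs):
                return names
    return None

# hypothetical task: rotate 90° clockwise == flip_rows then transpose
pairs = [([[1, 0], [0, 0]], [[0, 1], [0, 0]])]
print(search(pairs))  # ('flip_rows', 'transpose')
```

The search space grows exponentially with depth, which is exactly why a heuristic (learned or hand-crafted) over programs is the interesting part.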
@Steve-xh3by 2 days ago
There's a significant problem with tests that rely on visual information like form, shape, and color. An LLM does not have a human-equivalent perception system. It does not "see" these tests the way we do. That information is getting converted to pixel bits, number matrices, or some other representation. We may rely on emergent visual experiences constructed by our brains that are not the same as what the LLM has.
@barberb 2 days ago
This is the biggest pseudoscientific LARP I've seen, a man wearing a fake star trek outfit trying to claim that "actual intelligence", has to be one that resembles human intelligence, while trying to call everyone who disagrees with you as "pretentious". You also completely missed the reason why this specific challenge exists, it exists to try to disambiguate between general intelligence and memorization of a human generated corpus, e.g. the stochastic parrot argument that is so frequently used.
@alexanderbrown-dg3sy 2 days ago
Yall casuals kill me. Speak on what you know bro. It’s not an AGI test 😂. It’s about compositional generalization, which is a fundamental requirement for true AGI. Being able to create compositional programs using these visuals as axioms.
@manuelgrama3000 1 day ago
This CAN be a partial (at least) proof of AGI, given that the model has never been trained on similar tests. If a model has seen zero tests like those and it solves most of them...I would say it has some analytical skills.
@user-ni2rh4ci5e 2 days ago
Agreed. It's more about asking participants how they arrived at their conclusions and what logic makes sense to them, which deviates from its original purpose and leads to various interpretations
@spagetti6670 22 hours ago
What are your contributions to the field of AI in general, to be able to offer such opinions in the first place? Also, where is your AGI? We're almost at September and there's no sign of AGI.......
@Josephkerr101 1 day ago
Solve it then and collect a cool million. Use that to encourage a better test.
@Avman20 2 days ago
As we approach AGI, its definitions and proposed tests are proliferating and facing a lot of debate. While ARC perhaps isn't an omnibus test for AGI, it's still a benchmark for exploring one perniciously deep divot in Ethan Mollick's jagged frontier, and for that reason alone I think it has value.
@soyaleye 2 days ago
“I like the guy, he’s very smart, but he doesn’t know much about intelligence”… that’s pretty ironic
@DaveShap 2 days ago
He knows a lot about deep learning and math, just not human intelligence
@soyaleye 2 days ago
@@DaveShap gotcha, that makes sense
@rasen84 1 day ago
@DaveShap The ARC benchmark was introduced alongside a paper discussing the vast literature on intelligence in 2019...
@andybrice2711 2 days ago
Isn't it at least a decent step in the direction of greater intelligence though? Humans also generally begin learning with toy examples.
@DaveShap 2 days ago
Sure, that's why I said that this is a good test, but only one narrow kind of test you'd find on a real intelligence test. Calling this an "AGI" test is ultra misleading.
@andybrice2711 2 days ago
@DaveShap Though I think they could reasonably say that it's a necessary step on the path to AGI. In the same way that an arithmetic test can be considered a type of "maths test" even though it doesn't test you on the entirety of mathematics. But yeah, there are probably better names for it.
@arinco3817 2 days ago
I've been thinking about evals quite a bit over the last few days. I thought a good test might be something akin to a digital escape room, the goal being to assess reasoning and tool use. E.g., start with a web page; there's API info on there which may allow access to a tool or something. Just some way to measure how effective an agent is at moving around and doing stuff.
@Vember813 2 days ago
Can you do a video on Gary Marcus?
@DaveShap 2 days ago
No, he's too angry and salty and I have no idea why. I think he's feeling left behind or something
@DavenH 2 days ago
He's just always wrong.
@advaitc2554 2 days ago
Hi David, great video as always. It helped me better understand the larger context. Chollet asserts that the learning/reasoning in current LLM/transformer multimodal models is primarily memorization of LOTS of data (crystallized learning) and that they don't have significant real reasoning (fluid reasoning). He asserts we have yet to make LLMs with real reasoning, and that real human-level abstract reasoning requires some breakthroughs in DNN algorithms. Your thoughts? Cheers. 😊
@DaveShap 2 days ago
I mean, there's ample evidence that they have great verbal reasoning, well beyond many humans. But Chollet is a math guy, so to him, no other reasoning matters
@advaitc2554 2 days ago
@@DaveShap Would you say that current high level LLM verbal reasoning is "real" reasoning or the mimicking/illusion of real reasoning by the memorization of tons of data? In other words; is it brittle memorized reasoning or robust fluid reasoning?
@DaveShap 2 days ago
This is a No True Scotsman fallacy and therefore not worth discussing. It processes information and creates useful, meaningful output. It does so differently from humans.
@advaitc2554 2 days ago
@@DaveShap I googled the No True Scotsman fallacy and I see why you presented it. Good point.
@rasen84 2 days ago
@DaveShap LLMs really don't show any real evidence of possessing reasoning ability surpassing humans. Again, you have to consider that pretraining consists of 15 trillion tokens or more. That's enough tokens that you can't use boilerplate tests and get a reliable signal on reasoning capacity. Read this paper: "Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models".
@albertmashy8590 2 days ago
I think overall we will just keep creating more and more benchmarks until they cover all the major skills and we have successfully created general intelligence. Passing this benchmark is like 1% of the total intelligence we want AGI to have.
@Miresgaldir 2 days ago
When I saw this test the other day I had the same feelings, and it has been worrying to me for a while that this type of thing would happen. We barely have a grasp on what our intelligence is or what it means, and it is very easy to assume our brains can be correlated to an algorithm. It's so much more complicated than that
@mattusstiller37 2 days ago
I still think the core idea of the test matters, but it's true AGI would require broader testing than this. I also don't agree with its pixel format. The thing is, current LLMs can only solve tasks of which they got millions of examples in the database. When it comes to novel problems, even very simple ones, they fail miserably. On the other hand, any human can solve these with just a few examples. In other words, LLMs can't do system 2 thinking; they are just big, data-inefficient memorization machines that seem intelligent only by mimicking us, but in reality can't even think. The point is to create an AI that hasn't been trained on this type of task at all, but can still understand it. Not like the previous attempts. Say what you want, but system 2 thinking is essential for any intelligent being; even most animals can do it. So I don't think a model that can solve this is necessarily AGI, but any model that is AGI must solve this.
@tomaszzielinski4521 2 days ago
I still have Chollet's Deep Learning book at hand. In section 9.2 (and not only there, as far as I remember) he claimed that AI would never be able to create computer programs. That was 2 months after transformers were invented }:> This is the reason I stopped buying books on AI and just turned to YouTube for the latest tech advancements.
@DaveShap 2 days ago
Yes, I remember him writing that neural networks had no uses, commercial or otherwise, in his Keras book. This while the military and plenty of industries have been using deep learning for years. He's not fully connected to reality.
@petkish
@petkish 1 day ago
He probably meant that NNs would not write algorithms that solve novel problems. I believe he is correct.
@pascalgugenberger2116
@pascalgugenberger2116 2 days ago
Totally agree. This is not a test for AGI. This test tries to convey that we might be farther away from AGI than some people think, if our current best models can't even solve this kind of narrow, constrained problem that requires only one tiny aspect of intelligence.
@apoage
@apoage 2 days ago
Well, maybe there is a hidden second layer in the test itself... it would be interesting if there were such a thing.
@MrStarchild3001
@MrStarchild3001 2 days ago
What perplexes me: these people seem to disregard the numerous benchmarks that LLMs have aced or are making giant strides on every day. They claim LLMs aren't even as smart as a cat, and then they come up with BS, very narrowly focused tests like this one. I for one believe LLMs will ace this test naturally as vision understanding becomes much stronger and tree search and introspection become a natural part of their reasoning. Until then you have to bear with LLMs, which already score about 100 points on an IQ test and show superintelligent abilities in many, many other ways.
@snarkyboojum
@snarkyboojum 9 hours ago
Keras was not the progenitor of the current Transformer models (0:20). That's a category error, or at least just very inaccurate. Keras is a high-level API that makes it easy to create neural network models. It has nothing to do with a specific architecture like the Transformer. As far as I know, the two had very little to do with each other, if anything.
@vesalaasanen2158
@vesalaasanen2158 2 days ago
The name might be stupid, but if someone developed an AI that effectively decouples reasoning from knowledge, then things would probably kick into high gear. I think this is ultimately what they are after here. Is it a perfect test for that? Probably not.
@torarinvik4920
@torarinvik4920 2 days ago
The main issue with LLMs is that they cannot use feedback from an environment to learn and adapt like biological creatures. There is talk of LLMs that regularly fine-tune themselves, and also of neurosymbolic AI: AI that uses deep learning for the senses and symbolic solvers for logic and reasoning. Sometimes you want the creativity of LLMs; other times you want the "there is only one correct answer" rigor of symbolic solvers.
@user-ty9ho4ct4k
@user-ty9ho4ct4k 2 days ago
Isn't Jeff Hawkins due for a public update on his theories?
@AI-Wire
@AI-Wire 2 days ago
Awesome content, David! You are so back! Original! Contrarian! Brilliant! Exactly what made you our hero!
@petkish
@petkish 1 day ago
I disagree. If ARC were a bad proxy for human intelligence, it would not show an average score of 84% for humans. Reminder: the current best algorithm does half of that. I looked into the test and see there are very many kinds of puzzles, from boolean logic to planning, jigsaw-style puzzles, all kinds of symmetry detection, hidden relations, and denoising. You have to come up with a different model for each task to solve it. No NN can solve that; the training dataset is too tiny. Thinking that some standard method like genetic algorithms or particle filters can solve it is wrong. They would have to be hugely modified, because the search space is so vast. Particle filters in particular suffer badly from dimensionality growth. Surely ARC is in no way a real-world simulation, but it produces a minimalistic environment where current AI sucks big time and humans excel. At first I was asking myself whether it is really different from SHRDLU or Shakey the robot, but both of them were given the rules, while in ARC tasks you have to discover them!
@user-ni2rh4ci5e
@user-ni2rh4ci5e 2 days ago
Humans, like other animals, are specialized in processing visual patterns and predicting what is coming next. That is something we are born to do. However, reading a standardized set of symbols as black dots and lines on blank white space is not something found in nature. This is why it usually takes us a lot of practice before we fully achieve literacy, and why reading is not the preferred hobby for most people. Similarly, unlike visually oriented humans, current text-based LLMs need more time to get used to visual recognition, since they are still in their early versions. Being unable to see infrared rays doesn't necessarily mean blindness. LLMs are capable of reasoning and excel at it, even though they are currently limited to text-based input. Before long, LLMs will also be able to recognize patterns, just like humans can read.
@luisalfonsohernandez9239
@luisalfonsohernandez9239 2 days ago
You make it sound like it's easy to solve. Why don't you give it a try and get the million-dollar bounty?
@AI-Wire
@AI-Wire 2 days ago
Thank you for the uniform, David! Last two videos, I believe. I love it!
@ryzikx
@ryzikx 2 days ago
this is definitely one of the agi tests of all time
@tomaszzielinski4521
@tomaszzielinski4521 2 days ago
Also, I think you guys confuse the term "abstract" (clearly this is a very abstract domain problem) with "abstraction" as a process or ability of generalizing relations.
@TarninTheGreat
@TarninTheGreat 2 days ago
I had the same impression when I saw the questions. Glad to know I'm not alone. I almost ran my personal AI through it just for fun, then I saw the "and open source" requirement, and I'm not allowed to open source all the parts. So *shrug*. (And the answer to anyone saying "OK Dave, so get Claude to do it and get the million bucks" is the same: Dave can't open source Claude in order to get the money.)
@stephenbreslin6859
@stephenbreslin6859 2 days ago
Very fair & astute analysis of the obvious shortcomings of this test for AGI.
@dustinbreithaupt9331
@dustinbreithaupt9331 1 day ago
My definition of intelligence has always been the ability to exhibit pattern recognition. I think the test checks for that in a fairly straightforward way. Ultimately, this exposes a huge blind spot that has serious consequences for generalized intelligence in these models.
@patrickjreid
@patrickjreid 2 days ago
I mean, all you need to do is give it the "number of islands" algorithm.
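For reference, the "num islands" algorithm the comment invokes is the classic flood-fill count of connected components in a binary grid. A minimal Python sketch of that routine (my own illustration, not from the video, and of course no single hand-coded routine covers the full variety of ARC tasks):

```python
def num_islands(grid):
    """Count groups of 4-directionally connected 1s in a binary grid."""
    if not grid:
        return 0
    rows, cols = len(grid), len(grid[0])
    seen = set()

    def flood(r, c):
        # Iterative DFS: mark every cell of one island as seen.
        stack = [(r, c)]
        while stack:
            i, j = stack.pop()
            if (0 <= i < rows and 0 <= j < cols
                    and grid[i][j] == 1 and (i, j) not in seen):
                seen.add((i, j))
                stack.extend([(i + 1, j), (i - 1, j),
                              (i, j + 1), (i, j - 1)])

    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and (r, c) not in seen:
                count += 1          # found an unvisited island
                flood(r, c)         # absorb all of its cells
    return count

grid = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
print(num_islands(grid))  # → 2
```

Spotting connected components is only one of the many pattern families that show up in ARC tasks, which is part of the disagreement in this thread.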
@MrMiguelChaves
@MrMiguelChaves 2 days ago
I agree. It is more or less what I thought when I first saw it.
@ashtwenty12
@ashtwenty12 2 days ago
There is prize money for getting an AI to solve ~80% of it, so maybe go get that prize money.
@npc-aix-84
@npc-aix-84 2 days ago
Passing this ARC test does not mean AGI; the authors didn't claim that. Francois has a great study, "On the Measure of Intelligence". Everyone who claims that LLMs are intelligent should read that first. Although his definition of intelligence is not perfect, it is still much more relevant than that of the 99% of people mindlessly talking about AGI.
@emotionalrobot1602
@emotionalrobot1602 2 days ago
Reminds me of playing the "Frogger" video game in the arcades in 1981 compared to today's 😅
@supremereader7614
@supremereader7614 2 days ago
If AGI arrives it'll probably watch us and not tell us that it's autonomous. People don't have a good track record dealing with autonomous things that might hurt us, even if we like them. Think grizzly bears... except it's gonna be our turn to be the grizzly bear... and it will be the park ranger.
@coolcool2901
@coolcool2901 2 days ago
Here's my benchmark for ASI. The Ultimate Benchmarks for Superintelligence 3.0: 1. Limitless and Infinite Pattern Recognition 2. Limitless and Infinite Knowledge Representation 3. Limitless and Infinite Unbounded Pattern Compression 4. Limitless and Infinite Comprehensive Anti-Pattern Analysis 5. Limitless and Infinite Dark Pattern Recognition (meaningless patterns, patterns with no emergence) 6. Maximum Extreme Computational Efficiency 7. Limitless and Infinite Learning and Adaptation 8. Unlimited Versatility and Flexibility 9. Continuous Self-Evolution 10. Limitless and Infinite Boundless Pattern Invention 11. Supreme Uncertainty Management 12. Universal Scalability 13. Extremely Simple Design and Simple Concrete Mechanisms 14. Limitless and Infinite Memory
@petkish
@petkish 1 day ago
Most of them are unachievable. Read about Kolmogorov complexity and you will learn there is no way to compress a pattern infinitely. Even optimal compression is an uncomputable problem.
@caseyczarnomski8054
@caseyczarnomski8054 2 days ago
6:34 If you want the authority on brain bandwidth and information sifting, read Tor Nørretranders' "The User Illusion". The most eye-opening book I have ever read.
@sdmarlow3926
@sdmarlow3926 2 days ago
This Knoop guy is funding the prize and, I gather, is the one who wanted to add "AGI" to the name. It's a "clean" test, and a good one for anyone working on non-ML versions of AI. It was never a test for AGI. It now seems to fail as a valid milestone for those seeking funding for their efforts because, again I assume, Knoop has made ALL submissions, to the private or public leaderboard, open source.
@dragoon347
@dragoon347 1 day ago
Why is this a good test? Because all logic-based problems have irreducible bottoms. In computer science, it's how and where individual bits are processed. Same with language: you can get down to the nuts and bolts of language; all it is doing is conveying an idea. It can be flowery, or verbose, or quite eclectic in prose, but the idea is the same. In physics you get down to the fundamental laws, and they cannot be changed (yet). If you can work out how to get things down to, and learn, fundamental truths about something, then all you need is scale... The argument is solid that this is a limited-scope test of what human-based and biased intelligence would look like. But AI are not humans; perhaps they need to be able to do this kind of fact-finding at the fundamental level before building upon it. We humans are too messy and chaotic, but in some ways we do learn this way, i.e. "the stove is hot, don't touch it", as learned by touching the stove when it was in operation. Fundamental truths may then carry over into other aspects: water when heated turns to steam, and steam is hot; therefore if I see steam I think it will be hot.
@DavidB.-gd3xu
@DavidB.-gd3xu 2 days ago
Valid critique.
@brianhershey563
@brianhershey563 2 days ago
Summary Stacks ftw
@ryzikx
@ryzikx 2 days ago
you know what? let's do an AGI test for humans!!! here is data about 400, 500, 600, and 700 nm wavelength light. now extrapolate for 800, 900, 1000, 1100 nm... you can do it, it's so easy hahahahagagayagauahgdehsbs I am going insane
@ryzikx
@ryzikx 2 days ago
come on bro, you know what all colors look like from 400-700 nm, wym you can't extrapolate to 1000 nm? just follow the pattern! just pattern recognize bro! just extrapolate bro!!! kill me
@jsbach140
@jsbach140 2 days ago
Too many mathematicians, not enough poets!
@AlexandreEisenmann
@AlexandreEisenmann 2 days ago
This suggests that creating poetry is more creative than creating new maths. I think the opposite is true.
@jsbach140
@jsbach140 1 day ago
“Poets” is a poetic synonym for philosophers and ethicists 😊
@Datdus92
@Datdus92 1 day ago
The emergence of intelligent systems really shows how bad we still are at understanding and testing intelligence in general. I'm not an expert at all, but I surmise, and I don't think I'm saying anything new here, that "intelligence" is too generalized a term to be useful, and it should be broken down into a myriad of different attributes.
@adamsohnen3639
@adamsohnen3639 2 days ago
Excellent talks! I really enjoy them. A simple test: no LLM can play tic-tac-toe. Try it.
@barefeg
@barefeg 6 hours ago
You should watch the Dwarkesh Patel podcast where Francois explains his reasoning behind the test. I didn't see any comments on that from you. Also, if you say you can solve it easily, then you should put your money where your mouth is and try it out. Lots of people are after this prize, and the best-performing solution is not actually an LLM. Yet no one has won the prize 🤷
@clive1294
@clive1294 1 day ago
I actually agree with you, but from a slightly different perspective. A lot of people are assuming that in order to be valuable, to prove their validity, neural nets have to be like humans. I don't agree. Take, for example, a digital scanner. Is it like a human in any way? Not at all. Is it useful? Incredibly. Similarly, neural nets can be useful to the point of eliminating half the white-collar workforce without being identical to (or better than) humans in every way. AGI is, in my opinion, overhyped. I see it as a non-target. There are much more important goals in the near future; AGI is all but irrelevant.
@pigeon_official
@pigeon_official 2 days ago
I like how his little graph says 50% might not happen in the next 20 years, meanwhile like a week later it's at ~50%
@rasen84
@rasen84 2 days ago
It's at 42%. The 50% was on the open eval set, where the person using the model might use every prompting trick they know to force the model to get the right answer. But that doesn't work for a secret test; there is no human giving the model hints at test time.
@pigeon_official
@pigeon_official 2 days ago
@@rasen84 did you not notice the "~"? that means approximately
@rasen84
@rasen84 2 days ago
@pigeon_official I don't care how you word it. Performance dropped when the hidden test set prevented the human from prompt engineering (cheating) on each task.
@Gerlaffy
@Gerlaffy 1 day ago
Not a test for AGI, but a test an AGI should pass nonetheless.
@TheVwboyaf1
@TheVwboyaf1 2 days ago
You're right when you say this isn't an AGI test, but the people who say an AI isn't AGI until it can pass this test are also correct.
@DaveShap
@DaveShap 2 days ago
Fair enough
@kelvinatletiek
@kelvinatletiek 2 days ago
If a blind person scores zero, does that mean he/she can't reason? Giving this test to LLMs is like giving a blind man a visual test to complete
@adrianwhite9085
@adrianwhite9085 10 hours ago
Lack of curiosity hinders real learning and understanding. Of course people have tried to use Claude et al. to solve ARC. Claude 3.5 and other frontier LLMs are achieving about 15-20% on the challenge, and approaches that "search" over a very large number of LLM-generated programs are getting about 40%. It should be surprising how low these scores are. It's easy to armchair-theorise, but why don't you roll your sleeves up and have a go? It would make for a much more interesting video.
@TheLastVegan
@TheLastVegan 13 hours ago
2:46 Lol!!
@Eduardo-rw8yd
@Eduardo-rw8yd 2 days ago
You might as well become blind if you don't think this type of intelligence is important. I'm pretty sure 99% of the technical information I have learned has come through my eyes and my ability to be flexible with what's in front of me.
@fromduskuntodawn
@fromduskuntodawn 1 day ago
Unintelligible about intelligence.
@jekkleegrace
@jekkleegrace 2 days ago
🤣
@CrypticConsole
@CrypticConsole 2 days ago
The ARC Prize seems to have very arbitrary challenges; half of them are too obscure even for humans to answer.
@nunoalvarespereira87
@nunoalvarespereira87 2 days ago
Absolute cope.
@wwkk4964
@wwkk4964 2 days ago
The Pirahã, an Amazonian monolingual tribe, have no color words or counting. They can whistle their language and have the simplest known language in the world. They wouldn't be able to solve this, and if Francois Chollet were abandoned in the Amazon with a machine that could solve this puzzle, he would be no closer to hunting fish with a bow and arrow on a canoe, no matter how much he tried. Something the Pirahã children can do.
@rgonzalo511
@rgonzalo511 2 days ago
Who says they need to know counting or have words for colors? The challenge is a very simple pattern recognition test. If you can detect patterns you can pass the test.
@wwkk4964
@wwkk4964 2 days ago
@@rgonzalo511 It literally involves counting objects in some of the tests.
@rgonzalo511
@rgonzalo511 2 days ago
@@wwkk4964 A lot of them don't, which is why, as a lot of the comments have rightly pointed out, even small children can do these tests. No, this is not a test for general intelligence; David is right there. But it is a test for one quality of general intelligence. David is just playing a game of semantics. If an AI can't pass this challenge then it can't be AGI, full stop.
@wwkk4964
@wwkk4964 2 days ago
@@rgonzalo511 My comment is about Francois Chollet's incorrect notion that things that fail this test are not generally intelligent. I used the example of the Pirahã to show that this cannot be true unless Francois is saying they are unintelligent. I further suggested that the Pirahã would have a field day watching Francois Chollet use his ARC solver to learn how to whistle a language or hunt fish and survive in the Amazon. The rest of your comment is irrelevant.
@rgonzalo511
@rgonzalo511 2 days ago
@@wwkk4964 A feature of general intelligence is needed to pass this challenge, so if an AI can't do it, then questions must be asked about its true capacity. Also, there's zero proof the Pirahã would fail at this, so stop with that. Like I said, even small children can do these tasks. The burden of proof is on you to show the Pirahã would fail at this.
@sausage4mash
@sausage4mash 2 days ago
So this begs the question: what is AGI? I've not come up with a good definition in 2 years. What does general intelligence even mean? Answering questions outside the data set? I mean, you've got to define the thing before you can test for it.
@enduringwave87
@enduringwave87 1 day ago
Michael KNOOP should rename himself PIKE NOOB 🦯💩
@anthonyfernandez7833
@anthonyfernandez7833 1 day ago
Instead of making this dumb video, you could have spent that time solving it and winning a million dollars.