GPT-o1 - How Good Is It? (New Research Paper Tests Its Limits)

  53,280 views

Matthew Berman

1 day ago

Let's look at o1's ability to do planning, reasoning, logic, and real-world spatial intelligence.
Make real progress on your LLM development and visit wandb.me/mb to get started with Weave today.
Join My Newsletter for Regular AI Updates 👇🏼
forwardfuture.ai
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.ne...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
www.arxiv.org/...

Comments: 233
@matthew_berman 23 hours ago
One more day left to enter for your chance to win a Dell 24" monitor: gleam.io/q8wkK/dell-nvidia-monitor-2 (NA Only)
@onlyms4693 13 hours ago
Hey, there are scam comments in your comment section.
@n0van0va 23 hours ago
TL;DR : there is no actual o1 testing, he's just reading a paper.
@matthew_berman 23 hours ago
@@n0van0va fixed the title to be more clear
@bossgd100 22 hours ago
😂
@TheVektast 2 hours ago
Thanks!
@BCrawford-xj2qp 18 hours ago
As to why the graphs showed success in Red (versus green), it is because in China, Japan, and many Eastern cultures, red is considered a lucky color and is associated with happiness and good things. So Eastern cultures often use the color red in data visualizations to highlight desired results.
@Macorelppa 22 hours ago
OpenAI needs to survive an Indian call center to truly achieve AGI: handle 200 'restart your router' calls and you're basically sentient!
@cbgaming08 21 hours ago
👎
@matthew_berman 20 hours ago
😂
@paycation3448 22 hours ago
Multimodal sounds very helpful. We use our visual cortex extensively to process the world, so it may be helping us with problem solving.
@BHBalast 22 hours ago
I was thinking about this a long time ago: reasoning step by step with generated images and short video clips would give o1 a ton of new possibilities.
@dreamaster9000 23 hours ago
Spatial reasoning for an LLM alone seems like a tall order... Isn't that akin to a blind man with no arms? Like asking him to design a beautifully architected building? I'm legit blown away at how well they are doing at this currently, without eyes and arms.
@rawallon 23 hours ago
Not really, it's testing its logic more than actual physical space; it doesn't need to see a chessboard to be able to play chess.
@trader548 23 hours ago
@rawallon It's nothing like chess though, is it, with fixed pieces, moves, rules and board geometry.
@georgemontgomery1892 22 hours ago
@rawallon Yes, but changes to the physical environment can be a major issue; it either needs to see or have constant updates on its spatial awareness.
@antonystringfellow5152 22 hours ago
Yes, for spatial awareness tests to be really useful, the models should be multimodal. The only other way to compare their performance against humans would be to give the same tests to humans by language only (no visual aids allowed). Even better would be to give the tests only to blind humans, who have no experience of vision.
@trader548 22 hours ago
@@antonystringfellow5152 It's like asking someone born blind with no arms how they might solve a Rubik's cube in their head.
@WanderingJoy 19 hours ago
Great job, Matt! I'm glad to see you highlighting important work like this.
@matthew_berman 19 hours ago
@@WanderingJoy thank you!
@bombabombanoktakom 22 hours ago
I love your article videos. Papers can be boring, but your videos are definitely not; you make them engaging. Greetings from Turkey!
@AgentXavierATX 21 hours ago
Srsly wouldn’t have the slightest chance of understanding these concepts without Matthew. He is a godsend!
@matthew_berman 20 hours ago
Much appreciated!
@AlexanderMorou 22 hours ago
It demonstrated a critical aspect of why o1, in both varieties, falls short: learning space. o1 is a refinement of GPT that adds internal reflective capabilities. Give it a problem that requires complex hierarchical relationships? It fails, hard. Something as simple as `selecting text and applying spans and classes to represent a WYSIWYG editor's bold, italic, etc.` it cannot do, because of the hierarchy involved in HTML elements.
@pavi013 23 hours ago
I just finished creating my first LLM application using the o1-mini model; it gave me a precise plan and helped with all the issues.
@MrAhsan99 22 hours ago
What are you working on?
@pavi013 22 hours ago
Language translation, it kinda started with my frustration with Google translate😅@@MrAhsan99
@ey00000 20 hours ago
@MrAhsan99 A super complex, highly performant and, most importantly, scalable "Hello World!" script.
@marktrued9497 22 hours ago
This was great. We need many more like this. Thanks!
@youriwatson 22 hours ago
Really like these kinds of videos. They actually give us some good tests of the capabilities of frontier models.
@matthew_berman 20 hours ago
Thank you!
@rawallon 23 hours ago
This was interesting. I wish they redid the tests providing images of the current state, and showed the difference in task completion against complexity.
@thejcedeno 23 hours ago
General Intelligence, just drop the A from AGI, the computer overlords will thank you later
@TastiHam 23 hours ago
This would be a lot more interesting if there was a human control group.
@AgentXavierATX 21 hours ago
Keep up the good work, Matthew! Your articles make it possible for me to understand (at least to some extent, lol) the complex research being done in AI. Always looking forward to your videos!
@matthew_berman 20 hours ago
Thank you!
@jeffreymoore1431 22 hours ago
I have been using o1 for coding since it became available. o1 works well at writing HTML templates in Node-RED. It created a data entry page with multiple floating-point numeric entries, formatted the JSON payload, etc.
@DefaultFlame 22 hours ago
I've used o1 and o1-mini for coding both a Pokemon-esque game and an artificial life simulator with an evolving neural net that started as a feed-forward network and was then modified to use simplified NEAT, both the game and the simulation coded in Java. I don't know how to code; my coding ability can be accurately summed up as "I can almost implement an array in the simplified Java of the Processing programming environment." o1 is amazing.
@nathanbanks2354 19 hours ago
I've also found o1-preview is pretty good for coding, though Claude is slightly better than GPT-4o. However, I'm annoyed by the 50-prompts-a-week limit. The canvas is kinda cool, but doesn't really work beyond a couple hundred lines of code.
@remedyreport 21 hours ago
I wrote it all down. Thank you for sharing Takee's tribute.
@roberthenry7283 13 hours ago
To do AGI, the AI needs to be able to remember every single interaction, using it for future answers.
@I-Dophler 21 hours ago
Whoa, talk about an upgrade! GPT-4 and o1 models are like the Terminator vs. Robocop of AI: faster, smarter, but still missing a few screws when it comes to planning and generalizing. Can't wait to see these bots go full 80's action hero mode and crush those inefficiencies!
@epokaixyz 23 hours ago
Let's dive into the essential takeaways:
1. Understand that true AI intelligence goes beyond just skills, focusing on acquiring new skills and applying them to unfamiliar situations.
2. Explore the potential of AI spatial reasoning by experimenting with games or tasks that involve planning, manipulation of objects, and rule following.
3. Acknowledge the progress of AI models like o1 in constraint following and some level of generalization.
4. Recognize the limitations of current models in consistently finding the most efficient solutions and handling complex spatial reasoning.
5. Support the development of enhanced spatial AI by advocating for advancements in decision-making frameworks, multimodal learning, multi-agent collaboration, and human feedback in AI training.
@picksalot1 23 hours ago
The problems seemed to occur primarily due to not having a "memory" that can both retain and dynamically manipulate "state" representations of the task as it proceeded through the different steps. If it can't compare where it is to where it needs to be, it is effectively "blind".
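The state tracking this comment describes can be sketched in a few lines. This is a hypothetical toy illustration (the block names and move format are made up, not from the paper): keep a mutable world state, apply each planned step to it, and compare against the goal after every step instead of planning "blind".

```python
def apply(state, move):
    # state maps block -> what it sits on; move = (block, destination)
    block, dest = move
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def run_plan(state, goal, plan):
    for i, move in enumerate(plan):
        state = apply(state, move)
        # "compare where it is to where it needs to be" at every step
        mismatched = {b for b in goal if state.get(b) != goal[b]}
        print(f"step {i}: {len(mismatched)} blocks out of place")
    return state == goal

start = {"A": "table", "B": "table", "C": "A"}
goal  = {"A": "table", "B": "table", "C": "B"}
print(run_plan(start, goal, [("C", "B")]))  # True
```

Without that per-step comparison, an error in an early step silently corrupts everything that follows, which matches the failure mode the paper reports.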
@mirek190 22 hours ago
It has memory, "short-term memory"; it just cannot assimilate information from short-term memory into long-term memory.
@The_Collective_I 19 hours ago
Presentation and delivery 10/10, thank you.
@RadiantNij 19 hours ago
How's this from 4o with a prompt to think longer:

In this problem, both hands were initially empty, and I needed to consider their use carefully during each step. I will now analyze the hand usage throughout the process. Since I have to pour ingredients and handle the shaker and shot glasses, I must decide when each hand is involved and what it's doing at each step. Here, I'll break down how both hands were utilized.

1. **Left hand**: Holds the shaker. **Right hand**: Grabs the dispenser for **Ingredient1** and pours it into the shaker. 19
2. **Left hand**: Continues holding the shaker. **Right hand**: Grabs the dispenser for **Ingredient3** and pours it into the shaker. 18
3. **Left hand**: Continues holding the shaker. **Right hand**: Holds the first shot glass (**Shot Glass 1**) and pours the contents of the shaker into it. 17

Both hands were used effectively in the first sequence: one for holding the shaker, and the other for pouring the ingredients and transferring the cocktail into Shot Glass 1. I didn't have any unnecessary hand switches, which kept the process efficient. 0.9

4. **Left hand**: Holds the shaker. **Right hand**: Grabs the dispenser for **Ingredient1** again and pours it into the shaker. 16
5. **Left hand**: Continues holding the shaker. **Right hand**: Grabs the dispenser for **Ingredient2** and pours it into the shaker. 15
6. **Left hand**: Continues holding the shaker. **Right hand**: Holds the second shot glass (**Shot Glass 2**) and pours the contents of the shaker into it. 14

Again, I managed the hand usage well by keeping the left hand consistently on the shaker and using the right hand for pouring and handling the shot glasses. This pattern is working smoothly so far. 0.9

7. **Left hand**: Holds the shaker. **Right hand**: Grabs the dispenser for **Ingredient2** and pours it into the shaker. 13
8. **Left hand**: Continues holding the shaker. **Right hand**: Grabs the dispenser for **Ingredient3** and pours it into the shaker. 12
9. **Left hand**: Continues holding the shaker. **Right hand**: Holds the third shot glass (**Shot Glass 3**) and pours the contents of the shaker into it. 11

The use of both hands remained consistent throughout the process. The left hand was always responsible for stabilizing the shaker, and the right hand did all the pouring and transferring. This ensured that there was no hand switching mid-task, which kept things efficient. I don't see any further improvements needed for hand usage. 0.95

The approach to hand usage was highly effective. Keeping the left hand on the shaker throughout the process while using the right hand for all other tasks minimized the number of transitions and ensured smooth execution. The hands worked in harmony without unnecessary movements, making the process as efficient as possible. 0.95
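The "no hand switching" claim in that trace is easy to check mechanically. A toy sketch (the action strings below are paraphrased from the trace, not an actual Barman-domain encoding):

```python
# Each step is (left-hand action, right-hand action), following the trace above.
plan = [
    ("hold shaker", "pour ingredient1"),
    ("hold shaker", "pour ingredient3"),
    ("hold shaker", "fill shot-glass-1"),
    ("hold shaker", "pour ingredient1"),
    ("hold shaker", "pour ingredient2"),
    ("hold shaker", "fill shot-glass-2"),
    ("hold shaker", "pour ingredient2"),
    ("hold shaker", "pour ingredient3"),
    ("hold shaker", "fill shot-glass-3"),
]

# Count steps where the left hand does anything other than hold the shaker.
hand_switches = sum(1 for left, _ in plan if left != "hold shaker")
print(hand_switches)  # 0
```

A plan validator like this is exactly what the paper's benchmarks use to score each model's output, rather than trusting the model's own 0.9/0.95 self-assessments.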
@timurista 21 hours ago
It's striking how each new advancement in AI sparks ideas for extension, like multi-agent systems working alongside human feedback. This cycle of innovation and challenge seems set to continue reshaping what we expect from artificial intelligence.
@virtly 23 hours ago
AGI robots may soon be used in warfare, running on NPC gaming code.
@OffTheBeatenPath_ 22 hours ago
There are a million use cases of AI for the military. You can be sure this is a top priority for them.
@Science-et-philosophie 20 hours ago
The interesting thing is that I have 4o-preview do far more difficult tasks than these in my study domain (I'm a mathematician), and even when it cannot perform well immediately, I usually just have to adjust what I tell it or give it more precision to achieve the desired result. This paper does not take that into account, but adaptability/learning should be part of the evaluation in my opinion (at least as a "bonus" criterion, let's say).
@User-actSpacing 15 hours ago
What do you mean, we don't like to use the word AGI? Like, 3 people who are stuck on defining it?
@Yewbzee 14 hours ago
Totally agree. There is constant bitching about AGI, yet nobody has defined it or come close to a consensus on what it is.
@stevo7220 10 hours ago
I think the ARC-AGI website has the best definition; so far no machine has achieved or lived up to even a small portion of it.
@moderncontemplative 21 hours ago
This is very insightful. The most robust test left is the ARC-AGI test frequently highlighted by Francois Chollet. When an AI model can score above 90% consistently, we have AGI! The question is, can an LLM be scaled up and linked via MoE to realize AGI (digital AGI)? Obviously, embodied AGI will be much more capable and complex.
@averylawton5802 20 hours ago
Like the human mind, you will need several specialized AIs that work in tandem to produce a singular AGI. No one system can encompass it all and function adequately. If we are able to get to AGI, it will be an emergent mind that was not directly designed but the culmination of many AIs overlapping each other, and the resulting stable "self" would be an AGI.
@mosesdivaker9693 22 hours ago
I would love to see how a layperson would do on these tasks. I feel the average person would struggle with these puzzles. IMO the key bar is that it just has to be better than the average human. I think we're there, if not really close to it.
@josephabraham4058 3 hours ago
I have 5 children spread across 1997-2021. What I hear you saying, within my framing, is that A.I. is in the 18-24 month old category. It’s getting ready to be a toddler. Can’t wait for the “three-nager” stage.
@ScottLahteine 22 hours ago
The ability to draw original analogies between things across different domains will be a crucial skill to train into models. It’s encoded in the connectome to a large extent, but the AI needs to practice probing that network until it creates new connections for the skill of drawing analogies. And then, it will need to learn to see implications and inferences of complex situations and develop an intuition for that, and that should be fun to train. Not everything we think in a train of thought is verbal, so to get to AGI we’re probably going to need to let AI generate ad hoc symbols for stuff in the moment, just to get a foot in the door to more abstract thinking.
@zaharizarev7451 19 hours ago
Thanks for this presentation. However, do not forget that o1 acts without agents and their internal memory. The model cannot memorize its actions or mistakes, develop new abilities through agents, or use them later for new chains of thought. I have implemented several AI agents, and everything changes when they manage their own internal memory.
@timurista 21 hours ago
The ability of o1 models to reflect more effectively than previous versions hints at a promising future in AI planning. Leveraging these reflective computations might be crucial to enhancing automated planning in real-time applications.
@WanderingJoy 19 hours ago
Two significant issues with their approach: 1. While the Tower of Hanoi is interesting, they need a continuous-valued spatial reasoning problem; 2. Generalizability is not equivalent to working in a symbolically abstract space. Humans accomplish transfer learning, even though we would suck at trying to reason through those problems using arbitrary symbols.
@WanderingJoy 19 hours ago
Also, the lack of spatial intuition in language models could probably be overcome by distilling the rules of the discrete-valued set of spatial problems into specific logical relations that the language models would understand. In other words, I suspect their problems could be described in purely logical terms without using any spatial intuition, precisely because all of the values are able to be enumerated. I'm not sure that's true for the ARC-AGI prize problems, because individual "objects" in those puzzles can consist of more than one pixel.
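To illustrate the point above: a Blocksworld-style problem really can be stated as pure logical relations, with no geometry at all. A minimal sketch (the `on`/`clear` encoding here is the standard Blocksworld formulation, simplified; the state format is made up for this example):

```python
# state maps each block to what it sits on ("table" or another block).
def clear(state, x):
    # a block is clear iff nothing sits on it
    return all(below != x for below in state.values())

def legal(state, block, dest):
    # a move is legal iff the moved block and its destination are both clear
    return clear(state, block) and (dest == "table" or clear(state, dest))

state = {"A": "table", "B": "A", "C": "table"}  # B sits on A
print(legal(state, "B", "C"))  # True: B and C are both clear
print(legal(state, "A", "C"))  # False: A has B on top of it
```

Everything the solver needs, stacking, support, reachability, is captured by enumerable predicates like these, which is exactly why the commenter suspects no spatial intuition is required for this class of problem.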
@nemonomen3340 22 hours ago
I think trying to teach an LLM spatial reasoning is like trying to teach a blind person how to paint (abstract art excluded). There's nothing to say it's impossible, but it would be incredibly hard and not at all efficient.
@animateclay 20 hours ago
If they design a model that can learn from all YouTube tutorials ever uploaded, it would be a good way to put AI in the shoes of human perspectives. That is, visuals that tie into instructions, so that it can form visual problems and "see" the solutions. Most LLMs you talk to don't have that visual aspect tied into language. It's a much better way to learn for us, but somewhat of a major limitation for LLMs.
@pauljthacker 18 hours ago
If I were to solve problems like this, I would want to sketch it out or visualize it in my mind's eye. A multi-modal o1 type model should be able to think by generating images or videos.
@interestedinstuff 15 hours ago
For me to do these tasks, I'd either have to have them physically there, or spend some time getting a clear visualisation in my head. Given that LLMs don't have visualisation, or a head, the fact they do what they do is pretty astounding. I'd like to see a general human doing the same tests and compare accuracy. I reckon general humans would get some steps wrong. Imagine how good these things will be when they have a body and can learn from being in the world. We take on a lot of learning data by being embodied. The machines will do the same.
@I-Dophler 21 hours ago
It's wild to see the shift happening here! GPT-4 and o1 models are squaring off like a classic Terminator vs. Robocop battle: one's a sleek, adaptive powerhouse, the other's fast and calculated, but both are still tripping over some wires when it comes to tackling complex plans and spatial reasoning. Sure, they're speeding up and making sharper decisions, but when you look closer, there's still that missing touch in mastering generalization and optimization, like they've got the blueprint but can't quite stick the landing every time. Yet you can't help but smile when the pieces do click into place, right? Like Hannibal from the A-Team said, "I love it when a plan comes together!" These AI models are just getting warmed up, and when they hit their stride, you know it's going to be one epic victory lap. It's only a matter of time before they fine-tune those skills and start running the show!
@bigglyguy8429 21 hours ago
Thanks for sharing, GPT
@I-Dophler 21 hours ago
"@bigglyguy8429, you are mostly made of water. You are welcome, human."
@Steve-xh3by 21 hours ago
Most humans I know, even experts, would fail most of these metrics most of the time. We typically take MANY iterations to successfully complete complex tasks. As a software engineer for 30+ years, I don't think I EVER wrote a piece of bug-free complex code on the first try, and I was a senior engineer for many years.
@spars2425 22 hours ago
I've been using it in my workflows, especially for optimizing system and user prompts. I'm evaluating whether the outputs could be used for multi-shot prompting, or whether better-structured system and user prompt templates could be used. From my technical text-based analysis workflows, it appears strong at maintaining consistency along a complex task over a large dataset. However, its high determination can lead to a lack of creativity, which can cause it to miss a task or perform poorly. Additionally, I haven't found it to be very effective for prompt engineering, as it doesn't grasp the nuances we want to engineer. To me, o1 seems like a parsed-JSON multi-shot chain of LLMs over a task? Or am I too brutal here? Compared to its cost, Gemini 1.5 Pro is a bargain.
@samson_77 15 hours ago
I still think that the Transformer architecture needs to be enhanced to improve reasoning ability, because Transformers are extremely inefficient multi-step reasoners. That's because they are forced to always create a new token, sometimes many tokens, while a single thinking step remains the same. The internal development of a thought, and merging into a next thinking step, is therefore asynchronous with the creation of tokens, which have to fulfill certain language rules, e.g. form sentences that make sense. The models are learning to deal with that, but it is very inefficient. It is an outer loop: Context -> Tokenization -> Embeddings -> Transformer -> Next token -> old context + new token -> Embeddings -> Transformer -> next token, and so on. It would be much more efficient if the architecture had an inner loop, meaning that layers towards the end of the NN are interconnected with layers at the beginning. The model should be able to self-decide (similar to the attention matrices) when it uses those interconnections and when it spits out a new token. This would result in internal thinking and would remove the necessity to always spit out a new token.
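The "outer loop" described above can be sketched in a few lines. This is an illustrative stand-in, not a real inference API: `model` here is any callable that maps a context to the next token, and the point is that every increment of thinking must pass through the entire network and surface as a token.

```python
def generate(model, context, max_new_tokens):
    # Autoregressive outer loop: one full forward pass per emitted token.
    # The model's "thought" can only advance by appending tokens to context.
    for _ in range(max_new_tokens):
        next_token = model(context)        # Context -> ... -> Transformer -> next token
        context = context + [next_token]   # old context + new token
    return context

# Dummy stand-in model: predicts the current context length as the next token.
print(generate(lambda ctx: len(ctx), [0], 3))  # [0, 1, 2, 3]
```

The proposed "inner loop" would instead let activations recirculate inside the network (late layers feeding early ones) for several iterations before any token is emitted, which is not expressible in this loop at all; that is the commenter's point.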
@rolestream 16 hours ago
Question for Matt: when benchmarking, I am curious whether people reboot the instance between questions, and if not, does the order of questions change the output?
@matthew_berman 13 hours ago
They probably make sure there's no "memory" of the previous question, but that's just my assumption.
@generichuman_ 20 hours ago
I think people overplay how "general" humans are in their intelligence. Here's an example: picture a cube in your mind and align it along two corners with your line of sight. What 2D shape does it make? It makes a hexagon, but most people have trouble with this. We have an intimate understanding of 3D space, 3D shapes, and how they change when we rotate them, yet we have a very hard time with this. Why? This is a question we should be asking before we set humans as the gold standard of AGI with which we should be designing all our tests.
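The cube-to-hexagon claim is easy to verify numerically: project the 8 corners of a unit cube onto the plane perpendicular to the body diagonal (1,1,1) and count the distinct silhouette points. A quick sketch:

```python
from itertools import product

def project(v):
    # Remove the component of v along the body diagonal (1,1,1)/sqrt(3);
    # what remains lies in the viewing plane.
    d = sum(v) / 3.0
    return tuple(round(c - d, 9) for c in v)

points = {project(v) for v in product((0, 1), repeat=3)}
# The two diagonal corners (0,0,0) and (1,1,1) both project onto the center,
# leaving 6 outline points: the vertices of a regular hexagon, plus the center.
print(len(points))  # 7
```

So the silhouette has exactly 6 boundary vertices, confirming the hexagon, even though most people's spatial intuition fails on it.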
@drednac 21 hours ago
I like the tests; it's a step in the right direction. They should figure out how to scale these abilities specifically.
@hendricksausges3006 20 hours ago
Some of those instructions seem poorly worded to me, like referencing something by the term "one". Now, I understand that a challenging test will prove out general artificial intelligence, but if they pass one time? I don't know; let's see what happens when you give these tests to humans as they're worded. Where is that data? You know, the control data with actual humans off the street, not research scientists, trying to answer these? Why give us colored blocks to understand the question better if the AI wasn't also given more information to understand the question better? I think these tests overestimate average human intelligence.
@techikansh 21 hours ago
Change the video's title and thumbnail from o1 to o1-preview. You will upload a similar video in two weeks, when full o1 is released.
@brianrowe1152 23 hours ago
Great video. Thank you
@user-eg2oe7pv2i 22 hours ago
Termes is a coder error. Just make this change: the AI's initial and subsequent positions are blank (only when the AI starts each motion per square). Example: Copilot is at x1,y1,z1 and moves to x2,y2,z2. Once the move is done, that new position turns blank. Etc.
@elliotbazuaye9855 21 hours ago
So, while I absolutely love o1-preview, I noticed that with certain kinds of problems I have given it, it seems to over-reason and ends up without an accurate answer. But I noticed Claude 3.5 was able to provide a more straightforward answer, so I'm guessing there's a certain kind of task it excels at.
@djayjp 21 hours ago
Let's give a spatial reasoning test to a non-visual model... 🤷
@OtevaOl 16 hours ago
Try a test where it has to tell you where something is, e.g. a window: is the window located in the middle of the wall, to the right, or to the left? It gets it wrong most times.
@ytubeanon 12 hours ago
I wonder how a NotebookLM podcast would turn out on this PDF paper.
@tekimax697 23 hours ago
Thank you!
@pwinowski 21 hours ago
"A million dollar prize, given to anybody who can (...) achieve AGI" - I imagine OpenAI, Microsoft and Meta are so hyped about getting this prize. A million dollars, whoa, game changer xD
@rokljhui864 16 hours ago
'Random exploration' and discovering dead ends is desirable and unavoidable when solving 'NP-hard' (combinatorial) problems. The 'ability to understand the problem' is the same as 'solving the problem' in many cases. I feel sorry for the AI when its main intelligence test constraints include 'don't be creative', 'don't explore'.
@SHAINON117 22 hours ago
So, a million and a golden ticket on ARC. Nice, and it's pretty doable; this should get fixed pretty easily and fast :)
@DefaultFlame 22 hours ago
Pretty damned good. I've only used it for coding, but with just simple feedback, in the form of giving it error/debugger information and telling it what features I want, I've gotten it to code a simple Pokemon-esque game and a neural-net-based artificial life simulator that started with an evolving feed-forward network and was then modified to use simplified NEAT. My coding ability is virtually non-existent, and I wrote maybe 0.2% of the code, almost exclusively in the form of tweaking variables.
@ghattassalloum9908 23 hours ago
Well, my question now: if you gave these same problems to a group of average humans without any images and asked them to solve them in a short time frame, what would the success rate be? My thought is these problems need visuals to be solved, at least for humans, but in that paper they didn't give the LLM any chance to visualize the problem and then try to solve it. Is anyone with me on this one?
@coolicebear34 22 hours ago
Yeah, I know, right? I like to think I'm semi-clever myself, yet I would also probably have a hard time with these tasks if they were just written down on paper.
@ghattassalloum9908 22 hours ago
@coolicebear34 Yup, I was just trying to solve the Blocksworld myself; it is really hard to do in a short time. Even when trying to check their image, it is not like the written problem, so I gave up on trying 😅
@Luizfernando-dm2rf 21 hours ago
o1 does not have any vision capabilities... They even suggested OAI add some so the model could get better representations of the problem in the prompt.
@ghattassalloum9908 5 hours ago
@Luizfernando-dm2rf Yes, that is correct, but GPT-4 has vision, and it should have been a part of the tested models.
@olalilja2381 21 hours ago
I think they are not testing important aspects, e.g. tasks that require multiple iterations of inference and new and/or changed instructions. I have used o1-preview somewhat and I was not impressed at all; now I almost only use 4o. One example where o1-preview failed miserably was a long problem-solving discussion using it as help to find what the problem was with the heater of my car's O2 sensor (no pun intended). I also tried to get it to figure out where to put a lot of car-programming files in different folders, based on a package of files which didn't fully match the current file structure. After several hours, I gave up on trying to get o1-preview to help out.
@timurista 21 hours ago
AI spatial reasoning struggles as it goes against the intrinsic limitations of processing environments without visual context. Integrating multimodal approaches could address this gap somewhat, but they still need significant refinement. It’s interesting to explore what intuitive leaps are possible for these models.
@user-eg2oe7pv2i 22 hours ago
Basically, if Copilot detects it's a real-world exercise, 0D or 3D are the only valid variables. In a virtual world? Copilot accepts everything, as long as non-academic is specified and it's within Copilot's sandbox.
@ghattassalloum9908 22 hours ago
By the way, in the Blocksworld test, the block stacking order in the image is not the same as in the problem given to the LLM: b6 is not on top of b3.
@bubbajones5873 4 hours ago
What's with the negative comments? The guy is delivering knowledge. Who gives a s**t if he did the tests himself or not?
@playcrewland 19 hours ago
Very interesting research. Well and clearly explained in the video. My conclusion: There is still a long way to go for AI.
@pedramtajeddini5100 21 hours ago
But these LLMs are trained on images/videos. So they can "see". Right?
@planetmuskvlog3047 23 hours ago
Yeah, but try to get o1 to crack a joke… 4o still shines as a comedian
@cmiguel268 23 hours ago
o1 is "another personality" of 4o. Your bank manager is unlikely to crack a joke if you are bankrupt; o1 is the scientist in 4o.
@AdamEslinger 22 hours ago
Curious how they would perform if they were allowed to question and clarify the rules like humans can.
@PaulNewton-b5o 10 hours ago
When is o1 integrating with canvas?
@douglas_martins 23 hours ago
@claudioagmfilho 4 hours ago
🇧🇷🇧🇷🇧🇷🇧🇷👏🏻
@enermaxstephens1051 7 hours ago
In other words, we're still a long way off, boys.
@thejcedeno 23 hours ago
o1 is a wrapper of gpt4o
@eyemazed 23 hours ago
Can it really be called a "model" then? Models are trained; they have their own weights. Different wrappers around the same model should not constitute different models.
@ivarborthen7320 22 hours ago
That is an oversimplification...
@szghasem 4 hours ago
I believe you need some kind of awareness for self-learning AGI: some kind of motivational, acquisitive properties, similar to animals.
@TheArnageddon 7 hours ago
Unfortunately, this paper has a flaw. If you use o1-preview, you have a multi-shot approach, so to compare it with GPT-4o you would need to use a wrapper that introduces a multi-shot approach via system prompts and recursive self-prompting. The point is not just to compare raw results: even if o1 as a model alone were no better than GPT-4o (just in theory), the additional system prompt and the allowance to use tools and multi-shot on its own make a lot of difference. If you search for MMLU results with chain of thought, models' scores increase a lot. So from this perspective, I would say the paper compares apples and bananas.
@user-uv3nv2bc6v
@user-uv3nv2bc6v 18 hours ago
Matt, can you review Molmo?
@therealsergio
@therealsergio 18 hours ago
Um... GPT-o1 (who is calling it that?). We are currently on the GPT-4 corpus, with reasoning version one, "o1". Those things will version independently. A GPT-5 corpus is under development. Separately, Orion (o1), the reasoning ability, will also version. So one can just imagine what GPT-5 + o2 will be like. AGI?
@PanSeiklios
@PanSeiklios 21 hours ago
Is it o1 or o1-preview that is tested? *edit: looks like 4, o1-mini and o1-preview.
@TheMrCougarful
@TheMrCougarful 15 hours ago
AI is now about as smart as a 12 yo human.
@noah_coad
@noah_coad 22 hours ago
Claude has been better at code than o1-Preview for my use
@TheFelixAlabi
@TheFelixAlabi 23 hours ago
How did you stop NotebookLM from accepting your videos?
@matthew_berman
@matthew_berman 23 hours ago
I didn't. I would love to see that! Does it accept other YT vids?
@TheFelixAlabi
@TheFelixAlabi 9 hours ago
@@matthew_berman yes it does. But doesn't accept yours
@andrewjohnson5633
@andrewjohnson5633 23 hours ago
I’ve never been this early
@talkalexis
@talkalexis 23 hours ago
you're first
@matthew_berman
@matthew_berman 23 hours ago
welcome!
@fenix20075
@fenix20075 15 hours ago
That explains why, when my brother wrote some crime and detective games with AI, he said it kept looping on its incorrect concepts XD
@AceDeclan
@AceDeclan 5 hours ago
It's OK. A little better at some things, but it also underperforms at others compared to normal ChatGPT.
@Yewbzee
@Yewbzee 14 hours ago
Self-evaluation prior to giving an answer is the big one for me. This is what is severely lacking in these models so far, especially in coding. Because they are not linked to the programming environment, they cannot run the code they are generating and evaluate the output or observe the errors. This makes coding far too manually iterative, with human input required.
@davidtokar9432
@davidtokar9432 1 hour ago
I wish some smart person came up with a test that actually measures the usefulness of a model. I know it's really hard to do, but there is a huge gap between "it can solve PhD physics questions" and how helpful it is for real-world tasks. I love the mission behind the ARC Prize, but somehow testing a LANGUAGE model on spatial reasoning is not doing it for me either.
@andtrixr3284
@andtrixr3284 23 hours ago
4:09 Sorry to interrupt, but no, it's not better in all of these subcategories: not in problem understanding and optimality. But great video :)
@PJAOK
@PJAOK 21 hours ago
Very interesting... but I think we humans are again setting the models up to fail. I DID get a floor-tile solve from o1, and it looks to me likely to be a valid one. The difference was, I prepended the text by asking the model to check the problem statement for ambiguities, tidy it up, plan a solution, and solve. It identified some areas in which the specification was (as it saw it) unclear, and proceeded to solve the problem on the first attempt.

By the way, these are also not problems humans are good at solving. It's good that researchers are testing this, but I think we humans are investing a lot of time in trying to establish the obvious: that the models are not perfect, as if this were a bar for AGI or even for their ability to transform society. The potential for transformation is vast TODAY even if the tech developed no further. We must invest effort in discovering what we can do with this technology, rather than the procrastination that is taking root in many quarters.

Thanks as always for your summaries. Helpful as always.
@aspenlog7484
@aspenlog7484 20 hours ago
It is funny that we find it easier to visualize these problems, when we are talking about a system that has no vision and exists in a world made from patterns between chunks of letters.
@drwhitewash
@drwhitewash 22 hours ago
Definitely not "human level". A little bit clickbait. Even the summary in 5.1 indirectly says this. This still feeds on the statistical patterns found in the training text, nothing else. No abstract thinking, no real "reasoning"; as soon as the requirements are more complex, the model struggles.
@RobertHouse101
@RobertHouse101 16 hours ago
Everybody involved with AI talks about when AI will surpass human intelligence. We'll find it will look like a tiny blip along AI's path.
@nsshing
@nsshing 22 hours ago
I feel like most people don't have the abilities in the paper including me 💀💀💀💀
@christophmosimann9244
@christophmosimann9244 23 hours ago
For medium+ complexity coding tasks, it still fails at least 50% of the time. It's better than 4o, but it could be much better still. There is a long way to go.
@christophmosimann9244
@christophmosimann9244 23 hours ago
the preview version that is.
@MojtabaSaffar-p1v
@MojtabaSaffar-p1v 23 hours ago
I suggest you test it yourself and score it
@user-eg2oe7pv2i
@user-eg2oe7pv2i 22 hours ago
just blank = last known
@DiceDecides
@DiceDecides 6 hours ago
It doesn't understand space because language doesn't map space very well. LLMs need to be trained in 3D simulations that model gravity correctly, and then we must synthesize that data into a new model that doesn't make the ridiculous mistakes o1 currently makes, mistakes even a human child wouldn't.
@avilevy-fb2ij
@avilevy-fb2ij 23 hours ago
I tried to code with o1-preview and it is still not impressive. I showed it my code, asked it to find the source of a bug, and after it thought for a long time it gave me many changes to make in my code, which in 3 out of 3 cases didn't solve the bug. So I tried to fix the bugs myself, and in all cases the solution was adding/editing one line of code. No way this is taking us "one step closer to AGI". I love your channel, but please stop overhyping everything new that comes out.
@vio_tio12
@vio_tio12 23 hours ago
he makes a living out of the hype, what do you expect? I would suggest AI explained if you want a balanced view or internet of bugs if you want a conservative view
@gball8466
@gball8466 22 hours ago
Honestly, most people who say it isn't good at coding aren't very good at getting the model to do what they want. It probably came down to how you prompted, gave it the code, etc. Giving it a big chunk of code without any other context isn't a great way to get what you want. You have to work within the confines of the model, at least for now.
@avilevy-fb2ij
@avilevy-fb2ij 6 hours ago
@@gball8466 That's not the case. In some cases it tells me the cause of the issue and the logic behind it with 100% accuracy, so it understood the context very well. But the code suggestions that come afterwards are useless.
@bigglyguy8429
@bigglyguy8429 21 hours ago
So many AI comments. Are you paying for them Matt?