Can LLMs Reason? Why LLMs Struggle to Think Critically

5,436 views

Prompt Engineering

1 day ago

Comments: 38
@macemoneta • 1 month ago
My favorite question to monitor the reasoning ability of AIs is: "What's the smallest integer whose square is between 15 and 30?" Most will reply with '4', when the correct answer is '-5'. Asking follow-up questions confirms that:
- the AI understands that 'integer' includes negative numbers,
- squaring a negative value results in a positive value,
- negative values are smaller than positive values, and
- negative values are smaller the further they get from zero.
It just can't put the pieces together.
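For reference, the puzzle itself is easy to check by brute force; a quick, model-free Python sketch confirms that -5 is the intended answer:

```python
# Brute-force check: integers whose square is strictly between 15 and 30.
# A window of -10..10 is more than wide enough, since 6**2 = 36 already exceeds 30.
candidates = [n for n in range(-10, 11) if 15 < n * n < 30]
print(candidates)       # [-5, -4, 4, 5]
print(min(candidates))  # -5
```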
@elawchess • 1 month ago
I did the tests and I don't think the problem is as bad as you say. Once you put a warning in the question saying that by "integer" you mean to include negative numbers as well, they typically get it. That leads me to believe the training data includes cases where people colloquially take "integer" to mean only the positive values, and the model can't tell which of these you mean, because the question is ambiguous.
@LucaCrisciOfficial • 1 month ago
I think most people would also answer like this.
@anuroop345 • 1 month ago
"Find the smallest integer (including negative integers) whose square is between 15 and 30. Please show all calculations step by step" even gpt-4o-mini can solve this with just little change
@macemoneta • 1 month ago
@anuroop345 That's the point. If you ask the AI to define 'integers', it knows the definition is 'negative whole numbers, zero, and positive whole numbers'. It just can't use reasoning to apply the knowledge it has.
@elawchess • 1 month ago
@anuroop345 Don't mind them. They want LLMs to become mind readers. 😃
@ernestuz • 1 month ago
It looks like overfitting to me. Thanks for your videos. EDIT: For anyone interested, this behaviour is due to the way transformer models cook up an answer. Basically, they break the question into a sequence of tokens (a token may be a word or a part of one) and then statistically predict the most likely token to follow. Think of them as glorified Markov chains in which the next state depends on many previous tokens instead of just the last one. They don't reason; they just give you the most likely sequence of tokens as the answer.
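To make the "glorified Markov chain" intuition concrete, here is a tiny illustrative sketch (a deliberate caricature, not how a real transformer works) that predicts the next token purely by counting continuations of a short context:

```python
from collections import Counter, defaultdict

# Count which token follows each 2-token context in a tiny corpus,
# then always emit the most frequent continuation. This is the
# Markov-chain caricature of next-token prediction, not a real transformer.
corpus = "the cat sat on the mat the cat ate the fish".split()

counts = defaultdict(Counter)
for a, b, nxt in zip(corpus, corpus[1:], corpus[2:]):
    counts[(a, b)][nxt] += 1

def predict_next(a, b):
    """Return the statistically most likely token to follow the context (a, b)."""
    return counts[(a, b)].most_common(1)[0][0]

print(predict_next("the", "cat"))  # 'sat' (ties are broken by first occurrence)
```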
@TheReferrer72 • 1 month ago
I have seen kids do this on reading comprehension tests where they ignore what they have read and use prior knowledge.
@PrinceCyborg • 1 month ago
Yeah, it's overfitting. I wrote a research paper a few months back about this and how to address it.
@chrislowe6926 • 1 month ago
The concerning thing about this is that the models need to be prompted to review their answer before they correct it, and they are trained to then find a different answer. Even asking the model "Are you sure?" prompts it to review the answer and find a different one. If a human doesn't notice that the answer is wrong, the wrong answer will remain; the models do not keep thinking about the problem and come back five minutes later with "I'm sorry, I got that wrong, let me try again" the way a human could. While humans can see the mistakes clearly in these simple examples, in real-life use with much more complex and opaque data there is significantly less chance of a human picking up on errors. I don't see these models reaching a point where we can rely on them to make complex decisions, because there is no accountability like you have with a human making the decision.
@MeinDeutschkurs • 1 month ago
Great! Similar to humans: the difference between what you know and what you think/believe you know is huge. I don't think this is an attention problem; I think it's a problem of pattern similarity (compare RAG: similarity search), and it gets worse with quantization. Maybe we can force the model to "think different" by forcing it to output JSON with predefined key-value pairs. Btw, the tested model told you which pattern it recognized.
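The structured-output idea is straightforward to prototype. Here is a rough sketch assuming the OpenAI Python SDK and a JSON-mode-capable model (the model name and the key names are illustrative choices, not anything from the video), reusing the integer puzzle from this thread:

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()

question = "What's the smallest integer whose square is between 15 and 30?"

# Illustrative key names: the idea is to force the model to restate the problem
# and list deviations from the familiar version *before* committing to an answer.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    temperature=0,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content":
            "Answer in JSON with the keys: matched_known_problem, "
            "differences_from_that_problem, step_by_step, final_answer."},
        {"role": "user", "content": question},
    ],
)
print(resp.choices[0].message.content)
```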
@konstantinlozev2272 • 1 month ago
LLMs do lack critical thinking. It is a combination of the way they are trained and how inference is done. An LLM implicitly "trusts" its training (and fine-tuning) data, and it produces the most likely reply based on the representation of the text in the prompt. The representation of a slightly altered riddle is very, very close to that of the unaltered riddle, so architecturally it is very easy for it to make that mistake. I have tested it on legal reasoning and it flops on moderately complex stuff, simply because it lacks the context that would allow it to deduce the correct reply.
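The "altered riddle lands almost on top of the original in representation space" point is easy to sanity-check with off-the-shelf sentence embeddings. A rough sketch using the sentence-transformers library (the model name is just a common default; the exact number will vary by model):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a common small embedding model

original = "What's the smallest positive integer whose square is between 15 and 30?"
altered  = "What's the smallest integer whose square is between 15 and 30?"

emb = model.encode([original, altered], convert_to_tensor=True)
# A cosine similarity close to 1.0 means the two prompts are nearly
# indistinguishable to the embedding model, even though the answers differ (4 vs. -5).
print(util.cos_sim(emb[0], emb[1]).item())
```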
@Kaoru8168 • 1 month ago
o1-preview is a model created specifically for reasoning tasks, even though it fails at other things that general models typically do. In my opinion it's not a surprise that o1-preview does well here, since it was specifically trained for these tasks.
@elawchess • 1 month ago
But it's amazing how human-like it is. I feel humans would do that a lot too and would often miss the slight adjustment that has been made to the problem. I think it's more of a problem if, when the change is pointed out, it still can't do it. I don't think this means it's not reasoning, any more than a human who initially confuses these with the famous problems could be said to not be reasoning. Maybe they can do a fine-tune that says: when you think you've matched the user prompt to a famous problem, cross-check that there are no significant differences.
@testales • 1 month ago
Claude did pretty well, but all my local favorite 70B+ LLMs failed miserably. :-( Some of these are really funny modifications. I'll definitely include them in my evaluation tests.
@marcus-b4x3h • 1 month ago
Maybe I missed something, but I did not understand why the temperature is set to 0 (7:28).
@s.patrickmarino7289 • 1 month ago
So the results can be replicated.
@marcus-b4x3h • 1 month ago
@s.patrickmarino7289 thanks for the clarification 🙂
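For anyone else wondering: temperature is a single request parameter; at 0 the model essentially always picks its most likely next token, so repeated runs give (nearly) identical outputs and the comparison across models stays fair. A minimal sketch, again assuming the OpenAI Python SDK and a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # temperature=0 -> (near-)greedy decoding, so reruns are essentially reproducible
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("What's the smallest integer whose square is between 15 and 30?"))
```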
@s.patrickmarino7289 • 1 month ago
The real question is, if you ask the first question and it figures out the trick, will it have a better result with the second and third questions? Can it generalize and say, "The first question was a trap. Perhaps I should take a closer look at the second question."
@Kaoru8168 • 1 month ago
o1 is a specialist model, but Gemini Experimental 1114 does a lot of things well where o1 struggles. It's a very complete model in every aspect.
@kumargaurav2170 • 1 month ago
Autoregressive spitting out of the next probable word will never bring in that pause-and-think before outputting the next word. I mean, structurally there's no arrangement for transformer-based models to think and then output results. Things have to be tweaked at the structural level of the model to account for reasoning.
@unclecode • 1 month ago
My friend, you seem to have the flu, get better soon. I am curious: if we fine-tune these models on this kind of question, where the first answer is wrong and the second reminds the model to think deeply, and prepare maybe just 100 examples, does that give us a model whose weights are adjusted a little in that direction, so it's more sensitive to the problem statements and their nuances? I think it would. You could even try putting them into a prompt for in-context learning, and say in the prompt that we're going to give you similar questions, so be careful and don't fall into a trap. I think if we do that, the model will manage it. It's a good experiment.
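Both ideas are cheap to try. Below is a rough sketch of the in-context warning plus what a single fine-tuning example could look like in the usual chat-messages JSONL layout (the wording and the dialogue are made up for illustration, reusing the integer puzzle from this thread):

```python
import json

# (1) In-context variant: warn the model up front that the questions are traps.
system_msg = (
    "You will be given questions that resemble well-known problems but may be "
    "subtly changed. Restate the question, note anything unusual, then answer."
)

# (2) One hand-written fine-tuning example: wrong first answer, a nudge to think
#     harder, then the corrected answer -- the pattern described above.
finetune_example = {
    "messages": [
        {"role": "system", "content": system_msg},
        {"role": "user", "content":
            "What's the smallest integer whose square is between 15 and 30?"},
        {"role": "assistant", "content": "4."},
        {"role": "user", "content":
            "Think more carefully: integers include negative numbers."},
        {"role": "assistant", "content":
            "The integers whose squares lie strictly between 15 and 30 are "
            "-5, -4, 4 and 5, so the smallest is -5."},
    ]
}

# Each line of the training file would be one such JSON object.
print(json.dumps(finetune_example))
```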
@LucaCrisciOfficial • 1 month ago
I think o1-preview roughly matches average humans at solving these puzzles (of course they must know the original version, but also must not have a logical-scientific orientation). It would be interesting to test these questions on an assorted bunch of people with average IQ (100) to compare.
@JynxSp0ck • 1 month ago
So basically the "bias" weight in whichever part of the perceptron is asking "is this [the known riddle]?" is too high and gives a positive result even when crucial pieces are missing. It seems to be able to reason just fine when its thought process isn't overwritten by cultural noise, kind of like most people really.
@nekoill • 1 month ago
What do you mean "struggle to think critically"? They don't reason, like at all. The only reasoning LLMs do is "what is the probability of the next word being X?"
@elawchess • 1 month ago
That's very reductionist. It's like saying your brain JUST fires neurons.
@nekoill • 1 month ago
@elawchess not really. Just because the way neural networks are organized is somewhat inspired by the human brain, it doesn't mean that they behave like one. Don't forget that neural networks are also a very reductionist model of a brain. What even gives you the idea that LLMs possess any sort of reasoning capabilities in the first place?
@elawchess • 1 month ago
@nekoill I didn't make any claim that artificial NNs behave like the brain. What I meant by reductionist there is that because you know it operates by predicting the next word, you try to say that essentially there's nothing more to what it's doing. It's in the same sense that an advanced alien could visit us, be so wise that they even know how to make a copy of our brains because they know how to make them fire neurons like ours do, and similarly conclude that we are not reasoning since our brain "JUST" fires neurons. Just because LLMs work by predicting the next word doesn't mean they aren't doing it in a sufficiently complex way that "reasoning" has somehow emerged.
@elawchess • 1 month ago
@@nekoill "What even gives you the idea that LLMs possess any sort of reasoning capabilities in the first place?" you are the one that was trying to argue that they can't reason, by saying "they just predict the next word". That's the only point I'm addressing.
@s.patrickmarino7289 • 1 month ago
One group of researchers created a very simple model that was fed all the possible moves of a game, "A1 to C3" kind of stuff. The model did not know that the letter A is next to the letter B or that the number 5 is next to the number 6. When they looked at how the model was laid out after training, it had a representation of the board in its layers: A1 next to A2, and B4 next to B5. That is a form of reasoning in a model so simple that it could not use or understand words and had no concept of math or numbers. I call that reasoning.
@jackgaleras • 1 month ago
I came here looking for the PhD-level AI that was announced a few months ago, and Google's algorithm sent me this way.