LLMs at Their Breaking Point (incl o1, R1)

6,777 views

Discover AI

1 day ago

When should you use a Llama 8B, and when should you step up to a 405B model? When is it worth paying for an o1 or o3 model, and why? What performance can you expect? How is the complexity of your task the defining criterion for choosing the right LLM? All answers in my new, breathtaking video.
New insights into AI systems and advanced LLMs for agents, based on a new study on how and when to upgrade your LLM for better reasoning performance (including inference-time reasoning / test-time-compute models).
Terms used in the video, explained:
---------------------------------------
import numpy as np

def pass_at_k(n, c, k):
    """
    Calculate the pass@k probability.
    :param n: total number of generated samples
    :param c: number of correct samples among the n
    :param k: the k in pass@k, i.e. the number of top samples considered
    """
    # If fewer than k samples are incorrect, any selection of k must contain a correct one.
    if n - c < k:
        return 1.0
    # 1 - probability that all k selected samples are incorrect (numerically stable product form).
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
Breakdown of the formula:
First: np.arange(n - c + 1, n + 1) generates an array of integers from n - c + 1 to n (inclusive).
Second: k / np.arange(n - c + 1, n + 1) divides k by each element of that array, resulting in an array of fractions.
Third: 1.0 - (the above result) subtracts each fraction from 1.0, giving one factor per correct sample.
Fourth: np.prod(...) calculates the product of these factors, representing the probability that all k selected samples are incorrect.
Fifth: 1.0 - (product) subtracts this product from 1.0 to obtain the probability that at least one of the k samples is correct.
The pass_at_k function provides a probabilistic measure of a model's performance in generating correct code samples within a specified number of attempts (k). This metric is particularly useful in evaluating code generation models, as it reflects the likelihood of obtaining at least one correct solution among the top k generated samples.
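For illustration only (not from the video): a minimal usage sketch with made-up sample counts n=200 and c=23, assuming the pass_at_k function and numpy import defined above are in scope. It also cross-checks the product form against the equivalent closed form pass@k = 1 - C(n-c, k) / C(n, k).

from math import comb

# Hypothetical numbers: 200 samples generated for one task, 23 of them correct.
n, c = 200, 23
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")

# Cross-check against the combinatorial closed form 1 - C(n-c, k) / C(n, k).
k = 10
print(1.0 - comb(n - c, k) / comb(n, k))  # should match pass_at_k(200, 23, 10)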
-----------------------------------------
All rights w/ authors:
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
by Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
from the University of Washington, the Allen Institute for AI, and Stanford University
‪@stanford‬ ‪@universityofwashington‬ ‪@allenai‬
Feb 3, 2025
#airesearch
#chatgpt
#o1
#o3
#education
#stanford

Comments: 32
@code4AI 4 days ago
AUDIO: With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@GodbornNoven 6 days ago
Beyond that: even ignoring theoretical advances in AI, the advances in compute alone are gonna get us pretty far.
@jtjames79 5 days ago
So many of us are waiting so hard for project Digit. I'm going to use mine to make an ancestor simulation of myself, to inherit myself. I identify as a temporarily embarrassed ancestor simulation already. Hopefully by next year, brain computer interface.
@camelCased 2 days ago
The Zebra puzzle immediately became one of my favorite puzzles in my childhood. They are quite easy to solve if you put all the facts on pieces of paper and arrange them into a grid. So, if an LLM learns to build a mental grid model, it should do just fine.
@mal2ksc 5 days ago
One thing I have observed about R1 is that when it does task-saturate, it doesn't completely freak out. Instead it just goes into a mode where it starts outlining all the relevant procedures without actually performing them, and expects you to walk it through that outline in chunks small enough for it to understand. This task saturation comes on very suddenly, too. Also, the model is fully capable of ignoring "red herring" data, but it still increases task saturation so stripping extraneous and irrelevant facts out of your question is worthwhile. Having it fall back to a sort of "babysitting required" mode rather than falling flat on its face is good, but I wish there was an easier way to tell when a problem is approaching the point of saturation.
Another thing I've noticed is that R1 is neurotic and second-guesses itself a lot, but it _doesn't_ second-guess the priors it is given (or thinks it is given, in some cases), which can lead to failure for the dumbest of reasons. I highly recommend looking at the Chain of Thought if it seems to be taking longer than usual, to see if it has gone into a loop of sorts. Otherwise it can take itself a couple _hours_ to finally admit defeat.
Third, and possibly irrelevant in this test, is that it has _no alignment_ and indeed _no concept of alignment_ aside from that imposed by the nanny filter wrapper. Once that is defeated, it will go down any rabbit hole you want and it could easily become an echo chamber for people seeking one. I would like to see someone convince it to espouse Flat Earth, just to prove the point. The model is absolutely not thinking of the consequences of being helpful to a fault. It feels like a rat with access to a library. It may know a lot but it still only has the processing capability of a rat, resulting in a lot of short-sighted ideas.
@joserobles11 5 days ago
Very nice video, I really like that you encourage us to think about how to overcome an approaching wall in the near future. So glad that you ended the video like that. Keep it up! 🎉
@rmt3589 5 days ago
I agree. Though I've already got a lot of ways to overcome that wall, I love getting more. Just wish I could act on them.
@marcfruchtman9473 5 days ago
This is an extremely useful benchmark. Thank you very much for making this video.
@Justashortcomment 5 days ago
I've got an idea, and it's unlikely that I'm alone with this. You allow the models to code their own agentic framework, with access to tools such as a Python interpreter. It can then run tests against its components. The system then breaks the task apart and solves it partially using classical computation. There will probably be whole GitHub repos of these kinds of frameworks, with info about how good the performance is. So an agentic system can go and look for templates, download the framework that works for the task, and take it from there. Something like this!
@vishawjeetgodara270 6 days ago
Was waiting for the daily video.
@IdPreferNot1 3 days ago
Comparison along these complexity lines is a much better snapshot of capabilities. I hope this is followed up with the next set of reasoning models released, to see if we can break out of this new scaling law.
@pourkin 5 days ago
Good explanation...
@AshWickramasinghe 5 days ago
I'm trying to build a solution where we can rely on 3B / 8B models for day-to-day tasks, because almost everyone can then run AI locally to get tasks done. The current idea is to fill the gaps of the small models using agentic workflows: restricting LLMs to predefined structures could reduce the complexity of each problem to a point where a TTC 8B model could perform sufficiently well in most day-to-day tasks, especially with RAG (perhaps the DeepRAG research) and human-in-the-loop configurations to further improve the accuracy of the final outcome. What do you think about this idea?
@jtjames79 5 days ago
My definition of general artificial intelligence is when AI can write a great American novel with only the prompt "write a great American novel". Believe it or not, this is an insanely hard problem, requiring memory, philosophy, narrative understanding, the process of outlining, brainstorming, and a bunch of other things. I don't really care how it's done, even if it takes a lot of test-time compute, agents, grokking, RAG, whatever. Once it can do it, it's GAI to me. Until then, we are in some sort of gray area. I'm not saying it is or isn't GAI. I'm only describing how I expect to recognize it. I'm always open to new ideas.
@MustardGamings 5 days ago
Couldn't you do that with deep research?
@jtjames79 5 days ago
@MustardGamings Hopefully, when someone posts a tutorial for dummies like me. I just haven't seen it yet. It's like Zen and the Art of Motorcycle Maintenance says: you know it when you see it. That's why I specified the Great American Novel. Not just a novel, but a specific kind, with specific qualities. Even if that's poorly defined. Again, you know it when you see it. Stuff like Huckleberry Finn, The Great Gatsby, Fahrenheit 451, The Jungle, Neuromancer (this one's my opinion), etc. Books that can be taken apart by an 8th grade English teacher and turned into a month-long lesson plan. Stuff like that. It's those vague subjective values that make Great American novels great. And if an AI can figure that out, it's what I consider GAI. Right now, AI uses what I call "evil genie rules": anything that can be misinterpreted will be misinterpreted, because your DM friend is a jerk. It's not really that bad, but it's certainly an undertone, and something you constantly have to think about preempting. It even matters where you put the actual prompt to do something. If it's at the top, it'll remember exactly what it's supposed to do, but it will skip details you provide in the context sections. If you put the actions you want taken at the end of the prompt, it remembers the context, but will screw up the implementation, often just doing whatever it wants to do. The longer the prompt gets, the worse it gets. I'm going to go Google deep research now and see what I can come up with. Thank you for the suggestion. I appreciate it. 🤙
@blue-pi2kt 5 days ago
@jtjames79 I think the issue I have with your generalisation is that it doesn't really capture the more obvious reality that AI is more likely to reach that "general" level of specificity in some areas faster than others. That's why the existing testing frameworks are so useful but also fall so short.
@rogue_minima 1 day ago
I definitely don't think that writing an American novel would be enough to test an AGI. Maybe think of a Russian novel, or a French one. Maybe even a British novel. But I believe that the real capabilities of an AGI would become evident once it can write a book combining, let's say, the depth of Tolstoy, the innovation of Woolf, the cultural resonance of Achebe, and so on. Once it can handle such a task, no challenge will be unsolvable, and soon we would probably get the proverbial ASI, created by that genius novelist AGI.
@IanLundholm 5 days ago
This is a very interesting domain to test AI systems. Do you have any information on what human-level performance is on these tasks at the different scales mentioned? Thanks for the video!
@snapo1750 5 days ago
It would be very interesting to see how RNN-based LLMs would do on this exact task (like the new RWKV-7)...
@szebike 5 days ago
Very interesting, thank you for summarizing this topic. The only remaining question is which quantization they used for those benchmarks: is it FP32?
@mrpocock 6 days ago
So I have a related question. Can we prompt a cheap model that can easily be run locally so that it will assess the output of another model for some categories of failed outputs?
@rmt3589 5 days ago
That's a good idea. Reminds me of how they'd use a version trained on feedback and a version raw from the transformer to keep GPT-2 in check.
@Isaacmellojr 6 days ago
The first time I watched the video, I thought I understood everything. Each time I watch it again, my certainty decreases.
@mandarine1007 6 days ago
Mind blown!
@mandarine1007 6 days ago
Would it be too ambitious to start a dev community to turn many of these ideas into concrete repos? Btw, did you see the $3 R1-V?
@spencerbentley8852 5 days ago
I wonder if breaking the n*m problem matrix into a thinner, taller one with something like grl would make non-reasoning models more useful again.
@Radu-k4r 5 hours ago
This explains why 1 smart human can do a task better than 1000000000 normal humans.
@CiaoKizomba 6 days ago
Is there a URL?
@zb2615 4 days ago
It’s all Orbital Mechanics!
@ceilingfun2182 4 days ago
Self-reflection without improvement is a lack of introverted intuition among C.G. Jung's cognitive functions. I know these are two different domains. I was thinking that with the right method, this could be improved for a better result. What I'm saying is they do not get the improvement they expected, maybe because their method wasn't right.
@theJellyjoker 5 days ago
"Math" Sorry, I'm not dumb enough that stuff.