This experiment was a much bigger challenge than the previous one, but it shows an interesting result. What do you think? Want to see other experiments? Let me know in the comments below.
@HassanAllaham · 1 month ago
Very nice and important trial, and it deserves more effort. As a small LLM, I am not good at counting letters🤯, since I do not read letters or words but TOKENS🥸.. but I am good at writing functions and at calling functions (like the function to search the web)👻. So it seems it is better to ask me to solve any task step by step, breaking a complex task into sub-tasks, and then to first try solving those sub-tasks by writing and EXECUTING code whenever the task can be solved by coding.. will you please be nice and give me the needed tools to write and execute functions (a REPL), and if you are afraid I might do something bad, then put your Y/N before the execution step. Do not use Docker for execution😤; it will make your job hard and it is not practical on mobile🥵. Also remember to give me some memory, since I have a very small context window🤔. I believe SQLite is a good one😇. Thanks for putting in time to deal with me and make me better. If you have any question, I am not available and I cannot help you without the needed tools. Have a nice day. 🤖
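A minimal sketch of the tool this comment asks for: the model proposes Python code, and a human Y/N gate sits before execution (no Docker). The `run_with_confirmation` helper is hypothetical, and `exec` here is illustrative, not a hardened sandbox:

```python
def run_with_confirmation(code: str) -> None:
    """Show model-proposed code and only run it after a human types 'y'."""
    print("Model wants to execute:\n" + code)
    if input("Run it? [y/N] ").strip().lower() == "y":
        exec(code)  # runs in the current process; be careful what you approve
    else:
        print("Execution skipped.")

# Example: the model writes letter-counting code instead of guessing.
run_with_confirmation('print(sum(ch == "r" for ch in "strawberry"))')
```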
@stresstherapist · 1 month ago
Great video. I've been working on something similar for writing articles: getting Llama to analyse its own writing and then improve on it. The advantage o1 has is that the process is baked into the model itself, so it is easy to keep up with the context. Llama 3.1 sometimes changes the context somewhere in the analysis-and-review process, and naturally this breaks the whole process down. I have found that tweaking the model parameters for each phase of the analysis and review can help make the process a little more reliable.
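A rough sketch of the draft, critique, revise loop described above, with different sampling parameters per phase. `llm` is a stand-in for whatever Llama 3.1 call you use, and the temperature values are assumptions:

```python
def refine_article(topic: str, llm) -> str:
    # Draft creatively, critique conservatively, then revise in between.
    draft = llm(f"Write an article about {topic}.", temperature=0.9)
    critique = llm(f"Analyse this article's weaknesses:\n\n{draft}",
                   temperature=0.2)  # low temperature keeps the review focused
    return llm(f"Rewrite the article, fixing these issues:\n{critique}\n\n"
               f"Article:\n{draft}", temperature=0.7)
```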
@Chibi_AI · 1 month ago
Yeah, I think I might add an attention layer to this and see how it affects the results.
@justtiredthings · 1 month ago
Really interesting experiment. I'd be excited to see this performed on a much more powerful model like DeepSeek v2.5
@Chibi_AI · 1 month ago
DeepSeek answered the strawberry and cookie test correctly. Here's where I show the cookie test: kzbin.info/www/bejne/faDUY5-KZs9rrbc
@justtiredthings · 1 month ago
@@Chibi_AI This link seems to be for GPT-4o Mini?
@Chibi_AI · 1 month ago
Yes, it is 4o-Mini and Llama 3.1 70B. I didn't publish a separate video. I linked it to show what the cookie test was. :)
@justtiredthings · 1 month ago
@@Chibi_AI oh, cool. I'm just a bit confused since you referenced DeepSeek
@Chibi_AI · 1 month ago
Yeah, sorry about that.
@tvwithtiffani · 1 month ago
So if I can get a local model to respond correctly to the strawberry question 100% of the time, with zero trickery, I've won the AI race for the week?
@Chibi_AI · 1 month ago
Haha, yeah, I suppose you would. 😁
@novantha1 · 1 month ago
So, this is obviously an evolving topic, and we have at best speculation, but I do think that o1 is more than just a "GPT-style model doing chain of thought". From what we can tell, they have safety layers at the input and output, and my suspicion is that those aren't there because they "just wanted to be sure", but because the model doing the generation is fundamentally a very different type of model from what we've seen previously... or perhaps I should say recently.

My suspicion, and I'm perfectly willing to be wrong, is that the portion of o1 generating the reasoning is not trained with RLHF in the way we're used to. It may be a custom solution, like a new paradigm, or it might have just been given a very light SFT pass to output roughly the correct types of responses, with OpenAI willing to search through hundreds of responses in a tree search. Either way, I think it's fundamentally different from how our current models are trained.

Why does this matter? It's pretty well known at this point that RLHF collapses the possible outputs of a model to a much smaller subset, which up until now has been a good tradeoff because it's been more useful. But we may have been going about it wrong. I posit that we've essentially been "thought police" to AI models, forcing a specific type of thinking on them, whereas letting what we would currently call a "base model" or "completion engine" predict its full spectrum of ideas (of varying practical value) can produce remarkably diverse and divergent solutions to problems.

Take the results in "Let's Verify Step by Step", where a model (I think in the 7-8B class) matched a much larger model (I believe 70B at the time) with a similar strategy (though keep in mind that the ~7B model may have been limited by RLHF). Given that, I wouldn't be surprised if o1 is much smaller than we're thinking, but also being run for much longer than we're thinking; that's just a guess. It would be very cool if it were a small enough model to deploy on edge devices, though!
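A toy sketch of the search-plus-verifier idea this comment speculates about: sample many candidate reasoning chains and let a separate scorer pick the best. `generate` and `verifier_score` are hypothetical stand-ins, not anything o1 is confirmed to do:

```python
def best_of_n(prompt: str, generate, verifier_score, n: int = 100) -> str:
    """Sample n candidate answers and return the one the verifier rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)
```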
@Chibi_AI · 1 month ago
Thanks for sharing. I tend to agree. Or maybe I just hope there's more going on under the hood than very good reasoning prompts bolted onto a fine-tuned GPT-4 model (them having the exact same training cut-off date is a bit suspicious). 😊 Still, I think it's interesting how much we can squeeze out of smaller/less capable models. 🙌
@mircorichter1375 · 1 month ago
The model underlying the o1 agent was specifically trained to be good at exploring different paths. They call this feature "diversity". That is the Strawberry fine-tuning, so it is particularly well suited to being plugged into autonomous-agent-style systems. Other LLMs, fine-tuned with reinforcement learning, cannot compete in this setting. So someone has to leak Strawberry so we can use it to fine-tune open-weights models.
@Chibi_AI · 1 month ago
Thanks for the feedback. I look forward to a time when the model is faster and more affordable to use in agent-like systems. Or we find ways to make open models perform better along the way. Love the journey. 🙌
@DawidRyba · 1 month ago
You can simulate reasoning on some of these questions by asking the LLM to write code that solves the problem. I used llama3.1:8b, and it wrote Python code that counts the occurrences of "r" in a given text (yes, the final answer was correct). For the moment the code needs to be executed manually, but I hope to automate that in the future. I also tested it on questions like:
1) Which number is bigger, 9.11 or 9.9?
2) Give me 10 sentences that end with the word "apple".
3) How many times is the letter "r" present in this sentence? The sentence: "The strawberry grows underneath the nettle And wholesome berries thrive and ripen best Neighbours by the fruit of baser quality"
It answered all of them correctly.
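For reference, a minimal sketch of the kind of counting code the model might produce for question 3; the exact code llama3.1:8b wrote isn't shown in the comment, and this version counts case-insensitively:

```python
sentence = (
    "The strawberry grows underneath the nettle "
    "And wholesome berries thrive and ripen best "
    "Neighbours by the fruit of baser quality"
)
count = sentence.lower().count("r")  # string counting sidesteps tokenization
print(f'The letter "r" appears {count} times.')
```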
@Chibi_AI · 1 month ago
Nice work! I'm starting to run the SimpleBench questions too. They're pretty good for stress-testing models. This question:

On a table, there is a blue cookie, yellow cookie, and orange cookie. Those are also the colors of the hats of three bored girls in the room. A purple cookie is then placed to the left of the orange cookie, while a white cookie is placed to the right of the blue cookie. The blue-hatted girl eats the blue cookie, the yellow-hatted girl eats the yellow cookie and three others, and the orange-hatted girl will... Pick the letter of the correct answer from these choices:
A) eat the orange cookie
B) eat the orange, white and purple cookies
C) be unable to eat a cookie
D) eat just one or two cookies
@DawidRyba · 1 month ago
@@Chibi_AI I ran it on my LLM and the answer it gave was "B". After that, I changed your original question from "yellow-hatted girl eats the yellow cookie and three others, and the orange-hatted girl will..." to "yellow-hatted girl eats the yellow cookie and orange, and the orange-hatted girl will...", and then it answered "C". But at the same time it wanted to run one more piece of code, which completely breaks its thinking process.
@Chibi_AI · 1 month ago
Awesome try. I'm building an action that adds a couple of steps to the reasoning process. So far it's answering this fairly reliably with Llama 3.1 70B. 8B still struggles… sometimes getting it correct but often failing too. GPT-4o Mini is working pretty well too.
@DawidRyba · 1 month ago
@@Chibi_AI I hope the next version of Llama 8B will have the same performance as Llama 3.1 70B 🤓 Then we will have performance and basic reasoning locally.
@Chibi_AI · 1 month ago
Yeah, that would be cool. I'm sure things will scale that way over time. Though that also means the newer 70Bs will be that much better too. ☺️
@DozolEntertainment · 1 month ago
Would asking the LLM to explain why it made a mistake help it come up with more correct answers? Also, if the prompt is too big, would asking o1 to create a prompt as small as possible without any loss of meaning help optimize input tokens?
@Chibi_AI · 1 month ago
I think using more capable models like Claude 3.5 Sonnet, or maybe even OpenAI o1, to help tweak and tune prompts to perform better is a smart strategy. I do plan to improve these tests over time. Thanks for the feedback! 🙌
@TraderRoWi · 1 month ago
I want to see it with 405B and a more complete reasoning test set. This test is only difficult because of how tokens work in AI.
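On the tokens point, a quick way to see the subword chunks a model actually receives; this assumes the `tiktoken` library is installed, and other models use different tokenizers:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])  # subword chunks, not letters
```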
@Chibi_AI · 1 month ago
I agree. Fun experiment though. Overcoming tokenizer limitations is important too. I did a more difficult test here, with GPT-4o Mini and Llama 3.1 70B. kzbin.info/www/bejne/faDUY5-KZs9rrbc Thanks for the feedback!
@malditonuke · 1 month ago
When these models pick the next token, they base it on probability, but to make the models more creative they do not always pick the most probable token. In the case of counting how many R's are in "strawberry", you need high precision: "3" (probably) has a higher next-token probability than "2", but "2" was chosen instead. If you could change the design so that the precision is adjusted based on the context, then you might be able to fine-tune the model to get it right every time. *Edit:* Please correct me if I'm wrong about "3" being the most probable token for that context; I don't have access to the model to check it myself.
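A toy illustration of that sampling point, with made-up numbers rather than real model probabilities. Greedy decoding always takes the top token, while temperature sampling occasionally picks a lower-probability one:

```python
import random

next_token_probs = {"3": 0.55, "2": 0.35, "4": 0.10}  # hypothetical values

def greedy(probs: dict) -> str:
    # High precision: always take the most probable token.
    return max(probs, key=probs.get)

def sample(probs: dict, temperature: float = 1.0) -> str:
    # Creative: rescale probabilities by 1/temperature, then draw randomly.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights)[0]

print(greedy(next_token_probs))   # always "3"
print(sample(next_token_probs))   # usually "3", but sometimes "2" or "4"
```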
@Chibi_AI · 1 month ago
Thanks for the feedback. You may be right, but the probability is based on the context given to the model too. I guess you could say we might sometimes prompt ourselves out of the right answer. 😆
@malditonuke · 1 month ago
Especially when you tell it to assume that its original answer is wrong 😂
@Chibi_AI · 1 month ago
@@malditonuke Haha! Yes. I was attempting to urge the AI not to just take the original at face value... You live and you learn. :)
@sephirothcloud3953 · 1 month ago
Can you do a code-programming test video with CodeSeeker2 or 2.5? The Aider LLM leaderboards place it just behind Sonnet 3.5, but it is 100 times cheaper. If you can boost it by giving it reasoning, that would be a cheap programming beast and more accessible than o1.
@Chibi_AI · 1 month ago
I could look into it. Noted.
@sephirothcloud3953 · 1 month ago
@@Chibi_AI Nice, thank you
@justtiredthings · 1 month ago
Would also be interested to see it on Sonnet itself. It might be able to blow o1 out of the water with the right prompt chaining...
@Chibi_AI · 1 month ago
I ran this with Claude 3.5 Sonnet... and it was flawless 10/10 times.
@StuartJ · 1 month ago
There are a few projects trying this, like Llamaberry using Groq.
@Chibi_AI · 1 month ago
Yep, cool that we're all trying things in our own way. Moves the needle across the board. 🙌
@ronilevarez901 · 1 month ago
@@Chibi_AI I've been developing my own versions of this since ChatGPT came out, and thinking about it for the last 30 years. If I ever finish it and it works, it will be amazing. But don't count on it, lol. We'll have GPT-10 before we have AGI.
@justtiredthings · 1 month ago
Could you please share the prompts that you used?
@randomlettersqzkebkw · 1 month ago
What UI is this?
@Chibi_AI · 1 month ago
It’s Chibi AI: chibi.ai
@ooiirraa · 1 month ago
There are rumors and leaks saying GPT-4o Mini is actually an 8B model. You keep saying that 8B is too small to expect anything meaningful from it, so it's definitely possible to make these models much more efficient. GPT-4o Mini is a very successful distillation.
@Chibi_AI · 1 month ago
Nice! If it’s true 4o mini is actually an 8B model, that’s awesome. From my experience stress testing models, it is more capable than the average 8B model. I love the 8B models though. For small tasks with streamlined results they’re fantastic. Super fast and VERY cheap. In Chibi, with the ability to break things down into steps, I tend to use smaller models more often than I used to. Thanks for the feedback! 🫶
@ooiirraa · 1 month ago
@@Chibi_AI Can you give me some ideas on how to use 8B local models? I found only one: creating prompts and describing pictures for ComfyUI image generation. Whatever other task I tried, the results were too unpredictable and unreliable, so I always end up using paid closed-source models for everything.
@Chibi_AI · 1 month ago
I would assume local 8B models are mostly as capable as online-hosted 8B models. However, if the local model uses quantization, it might be less accurate, as the compression does cause some loss in fidelity. With the correct prompting (with few-shot examples, perhaps), they should be capable of fairly consistent results.
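A minimal sketch of few-shot prompting for a small local model: worked examples first, then the new input, so the 8B model copies the format instead of improvising. The task and examples here are made up:

```python
few_shot_prompt = """Classify the sentiment as Positive or Negative.

Review: The battery lasts all day and it charges fast.
Sentiment: Positive

Review: The screen cracked within a week.
Sentiment: Negative

Review: Setup was painless and the speakers sound great.
Sentiment:"""

# Send `few_shot_prompt` to the local model; the examples anchor the output
# format, which is what makes small models give consistent results.
```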
@mrd6869 · 1 month ago
Yes, I mentioned this model in my other comment. You can boost its performance by engineering a good (not crappy, a genuinely good) chain-of-thought prompt layer.
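For illustration, one hypothetical shape such a chain-of-thought prompt layer could take; the exact wording is an assumption, not the commenter's prompt:

```python
COT_LAYER = (
    "Think through the problem step by step.\n"
    "1. Restate the question in your own words.\n"
    "2. List the relevant facts.\n"
    "3. Reason through them one at a time.\n"
    "4. Only then state the final answer.\n\n"
)

def with_cot(question: str) -> str:
    # Prepend the reasoning instructions to every user question.
    return COT_LAYER + "Question: " + question

print(with_cot("How many r's are in 'strawberry'?"))
```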