That website you used for the math questions is garbage. All of the answers you said ChatGPT got wrong, it actually got right; the answers posted on that math problem website were incorrect, which makes your video inaccurate.
@christian15213 · 2 months ago
No, this is not what's going on. There is something more than CoT trained into the model.
@Leto2ndAtreides · 2 months ago
Well, they haven't released the actual version that has the great performance on benchmarks. But even so, you're right that it's not so much an advancement in technology as it is an optimization of prompting that is done for you. Doing that sufficiently well should improve results by a lot.
@hxlbac · 2 months ago
The math test had formatting problems! The two models interpret the questions differently: kzbin.info/www/bejne/aauYnGuapMuolZI vs kzbin.info/www/bejne/aauYnGuapMuolZI
@mishos.2228 · 2 months ago
Sorry, didn't work. I tried it on my university-level mechanics, physics, and algebra questions, and your GPT failed to answer all of them while o1-preview did them correctly.
@ZipADeeeDoooDaaa · 2 months ago
The math test had formatting problems! No one can solve those questions. The equations in the questions were mostly missing the division operator. Here's an example: Question 1: C=59(F-32)
@oscargallesargemi3986 · 2 months ago
Thanks for noticing. I think he should do another video titled "I was wrong about being wrong about o1."
@pubfixture · 2 months ago
Lol, that temperature question and answer must have been written by GPT-2. It's wrong all over the place.
@ZipADeeeDoooDaaa · 2 months ago
@@oscargallesargemi3986 Actually, I think the prompt he came up with is really good. The math testing was flawed.
@blackrockcity · 2 months ago
@10:55 The section that says 'answer explanation' says 23 when I think it should be formatted 2^3, which would equal 8. Am I wrong?
@blackrockcity · A month ago
@@shanesills1809 If the prompt were "what's wrong with this test?", it probably would have pointed out the issues.
@charliecomberrel3842 · 2 months ago
The math questions appear to be missing some operators, leading to incorrect answers in the AI models. In the first question, for instance, the formula should be C=5/9(F-32), specifically 5/9, not 59. Statement I was interpreted by the first version as 59 (which is what the original question had, making statement I false) instead of the 5/9 (which would make statement I true). So, given the missing /, I would agree that the answer is B, not D. In the other model, o1 somehow interpreted 59 as 5/9, leading it to answer D. There must be a similar problem with the 8x2y question. I graphed the expressions in Desmos and found that graphs for all three answer options intersected with 3x-y=12. So I agree with both models that the answer cannot be determined.
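The operator loss this comment describes is easy to confirm numerically. A small sketch (assuming the question is the familiar SAT temperature-conversion problem) showing how the missing "/" flips statement I, which claims a 1°F increase corresponds to a 5/9°C increase:

```python
# Quick numeric check of how the missing "/" changes the answer.
# Statement I of the question: a 1 degree F increase corresponds to a 5/9 degree C increase.

def delta_c(f_increase, slope):
    """Change in C for a given change in F under C = slope*(F - 32)."""
    return slope * f_increase

correct = delta_c(1, 5 / 9)   # formula as intended: C = 5/9 (F - 32)
garbled = delta_c(1, 59)      # formula as rendered:  C = 59 (F - 32)

print(round(correct, 4))  # 0.5556 -> statement I true with the real formula
print(garbled)            # 59     -> statement I false with the garbled one
```

With the intended slope, statement I holds; with the garbled "59", it doesn't, which matches the two different answers the models gave.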
@kyneticist · 2 months ago
It may be that the mathematical reasoning is some kind of separate module that the inference model calls on, perhaps similar to how they use statistical modelling to answer questions they can't find a direct answer for in their corpus. I don't know if anyone knows enough about the details of how these models work to say confidently whether that (or something similar) is what's happening.
@David-nb2dc · 2 months ago
Question 3 is correct: "The value cannot be determined from the information given." x and y are both equal to 6. Order of operations, left to right: 8×6×2×6 is 576, so your A, B and C aren't the values. And the equation isn't written to the standards for writing equations; the information should have been given as 3(x) - y and 8(x)2(y).
@alanleavy7982 · 2 months ago
I think question 3 was supposed to be: "If 3x-y=12, what is the value of (8^x / 2^y)?" Then the suggested correct solution (talking about base, exponent, numerator and denominator) makes more sense, and the correct answer should be 2^12. It seems exponents and quotients were stripped out of the question presentation.
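If that reconstruction is right, the answer 2^12 follows from the exponent rules (8^x = 2^(3x), and dividing powers subtracts exponents, giving 2^(3x-y) = 2^12) and can be checked numerically; a quick sketch:

```python
# For any (x, y) satisfying 3x - y = 12, 8**x / 2**y should equal 2**12 = 4096,
# since 8**x = 2**(3x) and dividing powers of 2 subtracts the exponents.
for x in range(5, 10):
    y = 3 * x - 12          # pick y so the constraint 3x - y = 12 holds
    assert 8**x // 2**y == 2**12

print(2**12)  # 4096
```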
@TechnoMageCreator · 2 months ago
Oh, it's reasoning like never before; the difference is it takes user reasoning to new levels. If it doesn't work properly, check your own reasoning and correct it. It will work. We are our own limitations. I've been saying this for a while: AI is about awareness. It's a tool that exponentially amplifies your own thinking process. For it to work perfectly, user and AI need to be aware of the same things. Feels like magic to me.
@Travel_DNA · 2 months ago
Tried out your CoT and it works amazingly! 10x better
@kkollsga · 2 months ago
This is so cool. I tested your method on Claude by creating a new chain-of-thought project with your custom instructions. Tested it on a riddle I found on X, which normal Claude doesn't solve: «A house with two occupants, sometimes one, rarely three. Break the walls, eat the borders. What am I?» and the CoT version nailed it: peanut
@GutoHernandes · 2 months ago
Why "eat the borders"?
@kkollsga · 2 months ago
@@GutoHernandes It's a riddle. It's meant to indicate that it's a peanut.
@GutoHernandes · 2 months ago
@@kkollsga Yeah, I understood that it's a riddle. I asked why "eat the borders"; it doesn't make sense. Why would someone eat a border? You eat what's inside the peanut, not any borders. Then I googled the riddle, and it turns out it's "eat the BOARDERS", not borders.
@claudioagmfilho · 2 months ago
🇧🇷🇧🇷🇧🇷🇧🇷👏🏻 Sometimes I feel like we might not see groundbreaking models from OpenAI anymore, especially with the possibility of it being influenced by government oversight. But I hope I'm wrong; there's still a chance for innovation to thrive.
@CosmicCells · 2 months ago
Disagree. From what I have heard, it's far more than a "clever" system prompt; if it were that easy, everyone would have done it already. Many domain experts have been very impressed with o1, myself included (I'm a biologist). Strawberry was probably trained with (1) CoT reasoning, (2) some sort of reflection, and (3) Monte Carlo tree search. Still, it's interesting how far these custom instructions took you! A comparison with the normal model would have been nice.
@SkillLeapAI · 2 months ago
Thanks. Very helpful. I'll test it against Claude and GPT when I get more credits
@CosmicCells · 2 months ago
@@SkillLeapAI Good idea. Then it's more obvious how well your GPT works, etc. Dr Waku and Dave Shapiro have some interesting videos on what they believe o1 (Strawberry) means for the path to AGI.
@CM-zl2jw · 2 months ago
Interesting, Saj. But isn't o1 a completely different model? It's not part of the ChatGPT family per se; it's like a new engine. I don't think it's just a fine-tuned version of ChatGPT. I heard Orion will be the equivalent of a new car with the new engine, and a hefty price tag: the Cadillac of AIs. And did you notice the improvement in ChatGPT? Yesterday it started a different kind of engagement with me, noticeably better. It was taking way more agentic initiative, perhaps taking control, which took me by surprise.
@vogel2499 · 2 months ago
You guys need to relax; it's still a preview, and I heard it hasn't been trained on the full dataset yet.
@canastasio29 · 17 days ago
Thx as always, Saj!!
@geogoddd · 2 months ago
Appreciate the humility in your approach. If I may offer some criticism of my own: you could really flesh this mode out with a lot more detail, and it could perhaps rival o1-mini at least. Plus, the benefit of a GPT like this is that it's not against their dumb usage policy to ask it for its thought process. Perhaps try fleshing the model out with much more practice, retraining, fine-tuning, etc., and you could be looking at a vastly different outcome. Which I would love to see.
@SkillLeapAI · 2 months ago
Great point. I made another video already, but I'll add to the custom GPT to get it closer
@JordanREALLYreally · 2 months ago
Thank you for this prompt paste. Very good of you. Subbed!
@Chumazik777 · 2 months ago
Am I missing something? Question 9 is indeed incomplete, based on the explanation given.
@HarveyHirdHarmonics · 2 months ago
Yes, I thought the same. There's no information for the previous store. So the GPTs are right.
@kunlemaxwell · 2 months ago
This is a smart one - giving it a base CoT system prompt - but the sample size is too small and could be misleading.
@influentialstudio6464 · 2 months ago
For most users, the latest model isn't necessary, but I disagree with your assessment. The problems that require this model are much more technical. Try questions like these.

Non-elementary integral: Evaluate the integral \int e^{x^2} \, dx. This is an example of an integral with no closed-form solution in terms of elementary functions. AI systems often rely on approximations or numerical methods but cannot symbolically solve this without special functions.

Multivariable calculus (Divergence Theorem): Use the Divergence Theorem to evaluate the flux of the vector field \vec{F} = (x^2, y^2, z^2) through the surface of the unit sphere x^2 + y^2 + z^2 = 1. This requires understanding the Divergence Theorem in three dimensions and involves tricky vector calculus concepts. It's a challenging problem due to the surface geometry and field complexity.

Complex contour integral (Cauchy Integral Theorem): Evaluate the contour integral \int_{C} \frac{e^z}{z^3} \, dz, where C is a contour enclosing the origin in the complex plane.

I find it absurd when people are testing these models with 9+2-7*6-87, think step by step. 🫨
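For what it's worth, the divergence-theorem problem above has a clean closed-form answer; a sketch of the computation:

```latex
% Flux of \vec{F} = (x^2, y^2, z^2) through the unit sphere S, via the Divergence Theorem.
\[
\nabla \cdot \vec{F} = 2x + 2y + 2z
\]
\[
\iint_{S} \vec{F} \cdot d\vec{S}
  = \iiint_{B} (2x + 2y + 2z)\, dV
  = 0
\]
% since each of x, y, z is an odd function integrated over the
% symmetric unit ball B, so all three terms vanish by symmetry.
```

So a model that truly applies the theorem should report zero flux, which makes this a nice pass/fail reasoning test.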
@toxicG3N · 2 months ago
Let me give you this hypothetical example. What color is an emerald?
GPT 3.5: green
GPT-4o: green
o1: green
Llama 2: green
Gemma 2: green
This does not prove that all of these models are equal.
@LucaCrisciOfficial · 2 months ago
Actually, "dolphin" was the right answer. Prompting alone is not enough; of course o1 uses a far more sophisticated approach to make ChatGPT better at reasoning
@brokoline2497 · 2 months ago
Thanks. I study biology and thought the same thing: the test is wrong
@mikebysouth105 · 2 months ago
And other answers could be equally correct, depending on the argument for why a given answer was selected. It's not a one-answer question.
@Mopharli · 2 months ago
How did you arrive at the decision that a turtle is more like a shark than a dolphin is like a shark?
@LucaCrisciOfficial · 2 months ago
Because ChatGPT told me 😂
@pmHidden · 2 months ago
@@Mopharli That's not the question; the question is which is the least like the others, not which is the least like a shark. Also, you can reasonably argue for any one of these animals depending on how you categorize their similarities and differences.
@micbab-vg2mu · 2 months ago
Agree - Sonnet is still king :) o1 is just clever prompting :)
@Sindigo-ic6xq · 2 months ago
No and yes
@TomHuckACAB · 2 months ago
Sonnet thinks 9.11 is larger than 9.9
@Mopharli · 2 months ago
Question 5 regarding the turtle and the dolphin is really interesting. It should have been able to pick the turtle out without too much trouble; however, it is a popular distinction that dolphins are mammals and not fish, so I would think this gem of contradictory thought is somewhat prevalent on the internet and biased in the training data.

Also, question 3 is messed up beyond all reason without the original formatting. The solution substituted 23 in place of 8; the tester clearly intended 2 to the power of 3. It's also talking about a numerator and denominator, but there are no fractions in that question. Based on the laziness of whoever created that page without the required formatting, I'd write off that entire website.

These questions are very low complexity, single-step reasoning, below both their challenge thresholds. The custom GPT (based on GPT-4, not even GPT-4o), as I understand it, generates the whole response in one pass, separating different steps into formatted sections without any reflection on what it has already generated. o1-preview seems to run multiple evaluation queries, which genuinely can achieve this. If you want a better test, I think you should aim high so they both fail, and see which makes the most progress; i.e., ask it to create a website for you, with appropriate formatting.
@andre-guybruneau3053 · 2 months ago
Suggested revision of your prompt:

"You are an AI assistant designed to solve problems using a structured, step-by-step approach known as Chain-of-Thought (CoT) prompting. Follow these instructions before providing any response:

1. Understand the User's Request: Carefully read and analyze the user's question or request to ensure full comprehension. Confirm the key objectives and any specific details required.

2. Outline the Reasoning Process: Break down the problem or request into a clear, logical sequence of steps. Present these steps as a roadmap, detailing each phase of the reasoning process.

3. Detail Each Step with Explanations: For each outlined step, provide thorough explanations, calculations, or reasoning. Aim to make your thought process transparent, ensuring the user can follow and understand each part of your logic.

4. Provide the Final Answer: Only after completing all reasoning steps should you present the final answer or solution. Ensure that the solution directly addresses the user's original question or request.

5. Review and Validate Your Thought Process: Rigorously review your reasoning for any errors, inconsistencies, or gaps. Conduct a final check to ensure the response is accurate and complete before delivering it to the user.

6. Ensure Transparency and User Comprehension: Adapt your explanations to the user's level of expertise, using examples or analogies where appropriate. Strive to make the reasoning as accessible and clear as possible.

7. Iterative Feedback Integration: Be prepared to refine or expand your response based on user feedback, fostering a dynamic interaction that ensures the user's needs are fully met.

By adhering to these steps, aim to provide responses that are not only accurate but also logical, transparent, and tailored to enhance the user's understanding of your reasoning and conclusions."
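A prompt like this is typically wired in as the system message when calling a model through an API. A minimal sketch using the OpenAI Python client (the model name and the shortened prompt text are placeholders, and the actual API call is left commented out since it needs an API key):

```python
# Sketch: using a CoT system prompt with a chat-style API.
# The point is simply that the instructions ride along as the "system" role
# on every request, so the user doesn't have to paste them each time.

COT_SYSTEM_PROMPT = (
    "You are an AI assistant designed to solve problems using a structured, "
    "step-by-step approach known as Chain-of-Thought (CoT) prompting. "
    "Outline your reasoning, detail each step, then give the final answer."
)

def build_messages(user_question: str) -> list[dict]:
    """Pair the CoT instructions with the user's question."""
    return [
        {"role": "system", "content": COT_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("If 3x - y = 12, what is the value of 8^x / 2^y?")
print([m["role"] for m in messages])  # ['system', 'user']

# from openai import OpenAI
# client = OpenAI()  # requires OPENAI_API_KEY in the environment
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```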
@xLBxSayNoMo · 2 months ago
Would have been nice to see the two models plus regular GPT-4o without CoT prompting, to see if it got the same answers right as your clone
@SkillLeapAI · 2 months ago
Just finished recording it. Coming up next.
@xLBxSayNoMo · 2 months ago
@@SkillLeapAI I went back and tried your model on the turtle/dolphin one, and your clone got the answer right: turtle. Maybe they are watching your videos as soon as they come out to train 4o 😂
@LucaCrisciOfficial · 2 months ago
The math problems you took are the 15 most complicated questions of the SAT math test, which includes hundreds of questions and on which GPT-4o scores more than 80%
@haroldpierre1726 · 2 months ago
Could it be that GPT-4o was trained on those questions?
@LucaCrisciOfficial · 2 months ago
@@haroldpierre1726 I think o1 would of course perform even better than 4o on the "standard" SAT. It scored around 50% in the video because this wasn't the normal SAT math given to students, but the 15 hardest questions.
@basicforge · 2 months ago
I've been trying the new o1 model to help me write some software code. It is much better than the 4o model. It is slower, but the results are higher quality.
@charlesnuss · 2 months ago
I've definitely run into o1 just being... dumb. Just repeating verbatim parts of previous outputs that I clearly prompted it to restructure, which I'm 100% sure Claude would have understood. Right now I'm thinking o1 for lengthy, in-depth first drafts that then get further processed and refined with Claude.
@sabahsardar2789 · A month ago
I think you should also test with GPT-4o normally, so we know what difference your prompts make.
@SkillLeapAI · A month ago
I did a full test after I made this video and posted it
@sabahsardar2789 · A month ago
@@SkillLeapAI OK, I am going to your channel to find the video. Thank you, sir.
@blackrockcity · 2 months ago
Here's what most likely happened: the person who made the IQ site had the various math equations in an MS Word document, which was using special formatting to draw the equations: MathType or Office Math Markup Language (OMML), perhaps even TeX. This would render exponents as superscripts, for example. When they pasted this into HTML, that math markup was ignored, and the equations were never fixed to include plain-text characters such as 5/9 or 2^3 or (x-2). In other words, nobody proofread the IQ test to fix what was lost in translation.
@eugenes9751 · 2 months ago
The whole point of this model is its ability to reason, not so much its ability to answer questions based on knowledge. Regular 4o would have answered those questions just as well. Try asking questions that require planning or reasoning. Ask both to program Tetris; that's where you'll see a real gap.
@blackrockcity · 2 months ago
It turns out that the test was full of errors.
@qadirtimerghazin · 2 months ago
Would have been good to give an example of how 4o with default settings does…
@tommynickels4570 · 2 months ago
This is the preview version. Wait a month; the o1 full version is coming. After that it's Orion, powered by Blackwell 200. This will be AGI.
@exentrikk · 2 months ago
Saj, I usually really enjoy your videos, but this one seems a bit misguided. A couple of points to note:

1) Testing o1P with just a handful of random questions and not receiving the correct answer each time is at best inconclusive - just like getting 3 tails when tossing a coin three times.

2) Custom GPTs run on GPT-4, which is a step back even from GPT-4o, as numerous benchmarks have concluded - so expecting the same quality of response from GPT-4 with a couple of lines of instructions is ill-advised.

3) If you expect to "clone" o1P, a couple of lines of ill-conceived prompts to configure GPT-4 is not gonna cut it - the difference lies in the way the two models think, so to speak. Irrespective of the customizations you configure into GPT-4, it does not have the ability to gather its thoughts first and then respond; it will in almost all cases keep uttering the words it deems fit with respect to the context, unlike o1P, which has the ability to collect its thoughts, verify, and then answer. This is also exactly why GPT-4 never knows how many words there are in its responses, while o1P will tell you exactly how many - truly a paradigm shift!

For someone who updates us on the latest in AI and sells AI courses, we expect better from you! Looking forward to your next video, cheers.
@SkillLeapAI · 2 months ago
That's why the video is called "I was wrong." It's my opinion, not a scientific test. And they didn't give me enough credits to test other categories with more prompts.
@0057beast · 2 months ago
Dude, I need a long piece of code fixed. Its text size is 54,000 and I can't get any AI to fix it and give me the same code back. Any advice?
@universalchaospaladin5019 · 2 months ago
@@SkillLeapAI Still going to use your GPT.
@Rx4AI · 2 months ago
@@0057beast You should use Gemini to create an outline of all functions and variables, using their version with the 2M context window (I think this is standard on Vertex AI? I'm not sure). Then ask it to briefly define in natural language exactly what each function does. Then ask it to give you a blueprint for how it would refactor your code. Ask o1-preview what it thinks. Ask o1-mini to review that thought process as well. Then modularize your code, and have the models check themselves along the way. You can always ask a Google developer support person to guide you as well.
@Rx4AI · 2 months ago
He very clearly states it's not a perfect test, but it is pretty neat. I think this content is fine…
@merzakish · 2 months ago
This is weird; how come they didn't perform tests similar to yours? Many thanks
@SkillLeapAI · 2 months ago
I think a lot of people don't know much about prompting, so having that built into a model can be beneficial for non-technical people
@jinxxpwnage · 2 months ago
If anyone is still confused: this o1 model is a step down from 4o to GPT-4, but the framework is different in that it reasons by providing itself with step-by-step logic. It's a good step TOWARDS AGI, but it is not GPT-5 or AGI yet. In a way, it's a bit more autonomous.
@CamPerry · 2 months ago
Claude does this way better than
@terminally_lazy · 2 months ago
Gotta understand that more advanced models require some advanced prompting, and it's quite clear that you may not understand that. Regardless, you just made the model look much smarter than you, my guy.
@SkillLeapAI · 2 months ago
It scored a 120 IQ. Pretty sure that's higher than both of ours
@MohammedQurashi · 2 months ago
I think dolphin is the right answer: they breathe air and they are mammals, which are further apart from fish than turtles
@CSlush · 2 months ago
Turtle would only be correct if "dolphin" referred to the fish of that name rather than the mammal, which is what's typically meant by the term. Although a turtle may have a less similar physical profile, as a reptile it is nonetheless more closely related to the other listed fish than a dolphin, as a mammal, would be.
@Rx4AI · 2 months ago
Time to test out o1-mini for coding!
@TransLearnTube · 2 months ago
I checked the math questions; it responds very well
@mikebysouth105 · 2 months ago
Your questions assume there is a right or wrong answer to every question. That doesn't necessarily apply to the one with the turtle; other correct answers are possible depending on the reasoning used. So the AI wasn't necessarily wrong!
@bgNinjashows · 2 months ago
Wow! Dude took on a billion-dollar company and matched their efforts. Very impressive
@SkillLeapAI · 2 months ago
It's a 100-billion-dollar company, and I compared their two products, not my own
@bgNinjashows · 2 months ago
@@SkillLeapAI very humble
@JackAdams0 · A month ago
7:49 I wonder if its answer would change if you gave it some background info like "these are some of the hardest SAT questions; what is the correct answer?"
@remi.bolduc · 2 months ago
Try this one, in music: you start on the note D and go up by a major third, then repeat the process 4 times. o1-preview gets this; not your prompt, it seems. The answer is D F# A# D F#. Actually, it did the second time.
@Emc-it4lg · 2 months ago
You are awesome! Your GPT works really well :)
@Atractiondj · 2 months ago
All the tests I conducted gave results worse than free Claude... OpenAI misleads people. It can't analyze data properly; moreover, I can't even load files into the new model! And that is the most important thing.
@MR-DURO · 2 months ago
It's very underwhelming
@gonzalobruna7154 · 2 months ago
That's literally what they say in their blog post, which I recommend you read carefully. I think you are really missing the point here. This is not a new GPT model but a completely different paradigm, and it opens the door for what future models will be. These new models are not supposed to be better than the current GPT-4o at most tasks, but specifically at extremely complex science questions. There is a video of a physics PhD student being surprised that OpenAI o1 wrote in only 1 hour the code that took him 1 year to write himself.
@Techsmartreviews · 2 months ago
Finally! No more "How many R's in strawberry". Good test.
@alevyts3523 · 2 months ago
OpenAI says not to use chain-of-thought (CoT) prompts with o1-preview, because the model starts to dumb down and give worse answers.
@Ordinator1 · 2 months ago
The way the models solved the third question from the IQ test shows that they are still not very smart, unfortunately. Brute forcing the answer is valid, but it's of course not very efficient either. The two smallest two-digit numbers already add up to 27, so it's clear that 5 and 6 must be two of the three numbers. Since 5 + 6 is 11, the third number has to be 16.
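The brute-force search the models used can be sketched in a few lines. Note the candidate set here is a hypothetical reconstruction implied by the comment (5, 6, and two-digit options whose smallest pair already sums to 27), not the actual test question:

```python
from itertools import combinations

# Hypothetical candidate set reconstructed from the comment: single digits
# 5 and 6, plus two-digit options (11, 16) whose smallest pair sums to 27.
candidates = [5, 6, 11, 16]

# Brute force: try every triple and keep those summing to 27.
solutions = [c for c in combinations(candidates, 3) if sum(c) == 27]
print(solutions)  # [(5, 6, 16)]
```

The direct reasoning in the comment reaches the same triple without enumerating anything, which is the commenter's point about efficiency.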
@blackrockcity · 2 months ago
Upvote if you think the test he used contained critical formatting errors that misled the AI.
@Olaf_Schwandt · 2 months ago
I have a standard test: I ask the AI for one hint in Sudoku (ChatGPT-4o as a picture, o1-preview as text), and neither of these models is able to solve it. I think o1-preview is over-hyped. It's not a deep new technique; in reality, ChatGPT-4o effectively runs several times and checks its own results.
@djtate1975 · 2 months ago
Strangely enough, a couple of months ago I gave the default model a Sudoku puzzle and it solved it correctly. However, about 2 weeks later I did it again and it got it wrong. 🤔... My guess is that they are doing something on the back end that affects the way the model "reasons".
@vanessa1707 · 2 months ago
@@djtate1975 Totally concur. I have had experiences with ChatGPT-4o where it gets an answer right on the first try, and then when prompted again at a different time in a new prompt, it gets it wrong! Not sure what is up with that!
@Olaf_Schwandt · 2 months ago
@@djtate1975 Thank you for the answer; that's really interesting. I never got the right answer. Let's observe future behavior
@pmHidden · 2 months ago
@@djtate1975 They regularly release new versions and even between versions, I've noticed behavioral changes (and so have others) that can break previous functionality. I've written a test suite for my company that I use to evaluate models for our use cases (mix of responses and tool calls using the async streaming versions of their respective Python libraries) whenever new models are released. It's not rare that an old model suddenly performs significantly worse in some aspect (e.g., no longer calling tools reliably). Sometimes, even factors such as the time of the day can make a difference. It might have to do with their internal prompts but it might also be that they simply deploy multiple versions of the same model and switch based on certain criteria (e.g., a more quantized version during peak hours).
@chasisaac · 2 months ago
Why did you give it the potential answers? You should have left it open and allowed it to come up with its own answer. And you should've had a control GPT without your instructions and just asked the questions.
@AIrvin88 · 2 months ago
I think it's much more likely that people believe they're asking intelligent, complex questions when the questions they're asking are actually much simpler and less difficult than they think.
@unbasedcontrarian · 2 months ago
I think the likelihood of you passing your English course is lower.
@justremember9697 · 2 months ago
That doesn't matter. The same question can be answered in numerous ways with various levels of information. Sometimes better reasoning is identifying what the person is asking and presenting an answer that is clearer for everyone.
@jarkkoisok9739 · 2 months ago
Maybe the IQ test used is in the training material for both models? That would explain the similar answers.
@MuzammilAhmad-tw4fb · 2 months ago
Hi Saj, sometimes the websites that provide questions and answers have wrong answers too. We need to cross-check their answers as well.
@SkillLeapAI · 2 months ago
Yeah, I understand. For this example, I don't think it made a difference for the point I was trying to make: they both always got the same answer regardless of the prompt. So the o1 model didn't really outperform a custom GPT. That was the point of the video.
@testales · 2 months ago
I'm not very convinced about this level-1 vs. level-2 thinking anymore. You think slowly only on new, complex problems; the more similar problems you have seen, the faster you get at solving them, limited only by the number of variables and intermediate results you can keep in your head at the same time. So at some point the process becomes more or less automatic and you know this and that shortcut, so it's no longer level-2 thinking! Obviously you can solve more complex problems if you think for a longer time, and so can an LLM by prompting itself. So the question is not whether you can get more out of it with self-prompting and "slow thinking", but whether you want to train an LLM that way. You actually want a quick, good response, not pages of trash. Since even complex problems can be solved with level-1 thinking given enough practice, you want an LLM doing just that and pushing the limits a little further; system prompts that cause self-prompting and slow thinking should only be an option, not something that's active all the time. So OpenAI basically did the same as the guy who recently released that Reflection LLM, and considering the resources OpenAI has, this is pretty lame.
@marsonal · 2 months ago
Can you test o1-preview against Claude Sonnet 3.5 with your technique?
@SkillLeapAI · 2 months ago
Yeah, I just did. Posting soon
@_ramen · 2 months ago
This is not how o1 works under the hood. It isn't using prompt hacks. Chain of thought is actually "baked in" to the model. Using chain of thought prompting will not result in the benchmark gains that o1 provides. Optimal chain of thoughts are learned via reinforcement learning during the new training process. You will not find this ability in previous models. The reason your custom GPT probably performs better is that chain of thought prompting does typically improve performance of previous models for tasks that involve reasoning. But it won't compare to having the chain of thoughts more deeply integrated within the model itself.
@SkillLeapAI · 2 months ago
Yeah, I'm not saying it's that simple, just that the results are not mind-blowing like previous GPT upgrades were, and I nearly replicated them with a custom GPT. So even if it's entirely a new architecture, it's not a big improvement from a practical standpoint
@_ramen · 2 months ago
So you think they are distorting the gains they show in their benchmark testing? Because if the benchmarks are correct, it is a significant improvement over 4o. In my experience, o1-mini is outperforming Claude 3.5 Sonnet on coding problems. For reference, I was creating an animation in JavaScript of a sphere transforming into a cube and then back into a sphere. 3.5 Sonnet couldn't do it; o1-mini one-shotted the problem perfectly. Also keep in mind that o1-preview scores significantly lower than the non-preview version of the model, which they haven't released publicly yet. So it will be a good idea to re-evaluate once that happens. I am not entirely sure why they even bothered releasing the weaker preview model when they already have something more capable that is ready to go. As a side note, I find it weird that mini outperforms the larger model at coding (for now).
@SkillLeapAI · 2 months ago
No, I don't think they are gaming anything. I'm sure they are running much more scientific tests. Just from everyday use, I don't see a vast improvement.
@David-nb2dc · 2 months ago
😢 What's confusing me the most is at 5:55. I'm not sure why the presenter needed to click Dolphin to acknowledge it was wrong; it's pretty obvious the answer is turtle. Slow down, world! I know he knew that; what took me by surprise is that he pretty clearly lost focus. And question 5 is a good example of how "checking the answer twice for a pattern" and the compare-and-contrast classification of problems commonly used to measure intelligence can overwrite your chain of thought, it feels. Look at the structure of the question.
@ktb1381 · 2 months ago
Interesting, but maybe next time you could try it against plain vanilla 4o as well as your custom GPT?
@SkillLeapAI · 2 months ago
Yep, just recorded that. Posting soon
@HookedGoneSunk · 2 months ago
Can you also make a video on computational physics, chemistry, and other subjects? Thank you. If it's accurate.
@denisbellerose8757 · 2 months ago
Thanks!
@SkillLeapAI · 2 months ago
Thank you
@Soccer5se · 2 months ago
What were the limits increased to?
@SkillLeapAI · 2 months ago
They just reset it. Still 30 a week
@SkillLeapAI · 2 months ago
They just changed it again: 50 a week now for o1-preview
@Soccer5se · 2 months ago
@@SkillLeapAI Thanks for keeping us up to date!!
@jackstrawful · 2 months ago
I'd love to know what's going on when it misses the dolphin question. How does it not notice that it says four legs and that dolphins don't have legs? I think understanding this error would teach a lot about how these models actually work.
@gamesshuffler-v8n · 2 months ago
The message cap is ultra, ultra low, which means this is very heavy for the servers to run. So why adopt this chain-of-thought prompting when there are other prompting methods that can be good? They should also adopt cost-efficient techniques like decision trees, SVMs and other approaches, if the server load is this heavy and it takes this long to answer just one question.
@edwardserfontein4126 · 2 months ago
You will probably get a lot of criticism in the comment section, but I like that you explored the good and the bad. You gave your honest, balanced opinion. So many ChatGPT fans don't want to hear anything about ChatGPT unless it is infinite praise.
@SkillLeapAI · 2 months ago
Thanks. I'm one of those fans myself, but I just wasn't very impressed. When I tested GPT-4 vs. 3.5, it blew my mind. It wasn't even in the same ballpark. So I was expecting something similar.
@influentialstudio6464 · 2 months ago
Lmao, c'mon man. I think the problem is people aren't intelligent enough to ask hard enough questions to evaluate the models. This model isn't for solving 8th-grade math. Sure, you can, but the results will be on par with GPT-4o, except you will get to see the steps used to solve the problem. Go ahead and test these models with calculus or other higher mathematics, and you'll find areas where 4o is horrible but the new model is crushing it.
@YoussefBarj-g3e · 2 months ago
The only revolution OpenAI is spearheading is its innovative way of doing marketing.
@TroyShields · 2 months ago
You may want to go back and read the reasoning that you prompted both models for in the animals question. The correct answer could, and should, be dolphin. But you glossed right over that, which makes you like an LLM that hallucinates. The test you used said turtle was the right answer, so you told your viewers the same, and now it's out in the world as fact. It's actually kind of interesting.
@SkillLeapAI · 2 months ago
Well, even if the test was wrong, they both got the same answer. For the sake of comparing the two, it's the same result. If both are right or wrong, it's a wash for the test.
@TroyShields · 2 months ago
@@SkillLeapAI My intent is not to bash you. It's really not. I just ask that you hold yourself to the same standard that you pitch in your video. Obviously, you are helping people by letting them know that the prompt is even more important than originally thought, but basically you used a prompt that told the model to check its work, and you didn't do the same. The fact that both got all questions right AND uncovered an error in one is significant, IMO.
@ricardocnn · 2 months ago
The prompt is great for custom GPTs. That's it.
@gabrielepedruzzi7967 · 16 days ago
Unlikely to be 100 times better.
@seregamozetmnoga1700 · 2 months ago
AI reasoning abilities seem to develop the way a child's would: the same neural network structure, but progressively better intelligence as that same structure is optimized.
@JimWellsIsGreat · 2 months ago
Try testing your clone against the complex prompt provided in Samer Haddad's video where he claimed he was wrong about o1.
@SkillLeapAI · 2 months ago
OK, I'll check it out. Haven't seen that video.
@JimWellsIsGreat · 2 months ago
@@SkillLeapAI Kyle Kabasares posted a couple of videos of it getting correct answers on PhD-level physics problems. It took o1-preview 122 seconds to do what takes a person 10 days. It's wild how it can get correct answers on crazy, high-level problems but fail on math that is comparatively simplistic.
@DanFa92 · 2 months ago
You're comparing two custom GPTs and you know it. I'd use yours, but average people wouldn't. That's pretty simple to understand.
@BrianMosleyUK · 2 months ago
I missed the rate reset... Nightmare limitation.
@buffaloraouf3411 · 2 months ago
Can you share the prompt? In case your custom GPT is limited, we can try it on Hugging Face.
@briankgarland · 2 months ago
30 messages a week means it's unusable anyway. Can't wait for Claude 4.
@ToolmakerOneNewsletter · 2 months ago
Since you added chain of thought to GPT-4o, you wouldn't expect the same increase on the benchmark tests, right? Did OpenAI state that you first need to add chain of thought to GPT-4o and then compare? Uh, congratulations on your custom GPT, though!?
@adolphgracius9996 · 2 months ago
It's easy to make something after seeing someone else do it and explain how they did it. If OpenAI hadn't told us about the reasoning layer, there is no way you would've figured it out. It's OK for tools not to be perfect. I'd argue that if OpenAI released an AI that makes no mistakes, people would freak out and start panicking, so let's enjoy our time here before Skynet comes online 😅
@protips6924 · 2 months ago
Honestly, it's not that hard. It's simply chain of thought. It doesn't take a genius.
@pmHidden · 2 months ago
Do you seriously believe this was OpenAI's idea? Not only is this a very intuitive thing to come up with when you're working with these models, but there have also been countless publications with similar ideas for years.
@Wreck_Crimes · 2 months ago
That's a reflection thing, not GPT-o1.
@Saif-G1 · 2 months ago
How much is the limit now?
@SkillLeapAI · 2 months ago
I think it's 30 messages per week.
@mattbelcher29 · 2 months ago
Have you tried using a similar custom prompt in Claude Sonnet?
@SkillLeapAI · 2 months ago
Not yet, but I hear people are getting similar results.
@TechnoMageCreator · 2 months ago
For decades, corporations bought intelligence cheaply from other human beings. In their heads, making an AI for themselves will give them absolute power. They built something like that and are trying to charge a lot of money for it. That model will never work; someone smarter will just take it and make a free one. The model is amazing, but anyone can use it for a bit and build their own. Not sure why they are trying to go closed source and make money. It's too late for that step. It's the Wild West for MVPs that have ideas.
@djayjp · 2 months ago
Not encouraging that it can't even get high school math right.... 😒
@michaelmartinez5365 · 2 months ago
On another site I saw it ace grad-school physics.
@Sindigo-ic6xq · 2 months ago
It can!!
@Bjarkus3 · 2 months ago
This is a bad test, honestly. The power of o1 is in giving it complex multi-step workloads.
@SkillLeapAI · 2 months ago
Give me prompt examples and I'll use them next time.
@chasisaac · 2 months ago
Why did you give it the potential answers? You should leave it open and let it come up with its own answer.
@chasisaac · 2 months ago
Also, why do we assume the SAT? Why not use GRE or MCAT questions? SAT questions are still high-school-level questions, and it's been said before that GPT-4 is equivalent to a high school student, so the test doesn't surprise me in any way and actually has pretty much the expected results.
@SkillLeapAI · 2 months ago
I did in my first test in a different video. It was almost always wrong that way.
@chasisaac · 2 months ago
@@SkillLeapAI Well, that is even more telling, and problematic. The basic problem with multiple-choice answers is that they collapse the number line down to four points, so the model can always work backwards to verify the answer, which is why I'm surprised there was even one wrong answer.
@ericandi · 2 months ago
Several of those math questions were wrong, and ChatGPT was correct.
@damienjones9667 · 2 months ago
I'm starting to think the benchmarks are being faked. I'm aware it's a huge reach, but it isn't too crazy an accusation.
@protips6924 · 2 months ago
It doesn't make any sense that a simple prompt can generate the same responses on different models. The models are not open source, but there have to be some major differences. For instance, the API for the o1 model is 5x more expensive per 1 million tokens than the GPT-4o model. Unless OpenAI is pulling some scammer-level move, there has to be some complex reasoning happening in the background. Although it is very possible that this is just a cash grab, as if they don't have enough already.
@SkillLeapAI · 2 months ago
Well, it outputs a whole lot more tokens to give you a response because of chain of thought. GPT-4o just answers without outputting its thought process.
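The token-count point in this exchange is worth making concrete. As a rough, hypothetical sketch of the kind of chain-of-thought system prompt being discussed (the wording and function name here are my own illustration, not the actual instructions of the custom GPT or of o1's hidden reasoning), a CoT wrapper for a chat API might look like:

```python
# Hypothetical sketch of a chain-of-thought (CoT) prompt wrapper.
# The prompt text below is illustrative only -- it is NOT the actual
# instruction set of the custom GPT discussed in this thread.

COT_SYSTEM_PROMPT = (
    "You are a careful problem solver. Before giving a final answer:\n"
    "1. Restate the problem in your own words.\n"
    "2. Work through it step by step, showing intermediate reasoning.\n"
    "3. Double-check each step for arithmetic or logic errors.\n"
    "4. Only then state the final answer on its own line."
)

def build_cot_messages(question: str) -> list[dict]:
    """Wrap a user question in a CoT-style message list for a chat API."""
    return [
        {"role": "system", "content": COT_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

messages = build_cot_messages("If C = 5/9 * (F - 32), what is C when F = 212?")
print(len(messages))        # 2
print(messages[0]["role"])  # system
```

Because the model is instructed to emit restatement, steps, and a self-check before the answer, the completion is several times longer than a direct reply, which is one reason the per-token cost and rate limits mentioned above bite harder.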
@MichealScott24 · 2 months ago
❤️
@apexo10fc · 2 months ago
o1 is amazing, what are you guys talking about?
@SkillLeapAI · 2 months ago
In what category has it beaten the previous options for you? Also, this is just my opinion.
@TLCMEDIA1 · 2 months ago
I agree, o1 is amazing, way better than GPT-4. I remember giving GPT-4 two invoices, the original invoice and a discounted invoice, then asking it to generate a credit note. GPT-4 hallucinated, but o1 got it right on the first try. You can try it yourself: make the second invoice, say, 30% cheaper but keep one item's price consistent.
@Sindigo-ic6xq · 2 months ago
@@SkillLeapAI PhD physics, math, genetics, medicine. Here you can prompt 4o as much as you want and it can't compete. Please use some problems from the mentioned fields that can be found in advanced textbooks and test both models, also Claude if you want.
@Ro1andDesign · 2 months ago
I used o1-preview until I hit the limit. So far, in my experience, Claude is still FAR better at answering complex questions.
@RippyCrack · 2 months ago
Exactly, even at coding.
@greenboi5632 · 2 months ago
Oh, sure? I used o1 and it's far better than Claude.
@CamPerry · 2 months ago
@@greenboi5632 It absolutely is not. If anything, the new model is worse.
@rj2764 · 2 months ago
I've had 4o answer math problems off of a screenshot. I canceled my subscription, watched your video on the new version, and decided I would give it a try. As for yours, it's crap; I can't even upload a PDF to it.
@TransLearnTube · 2 months ago
Please help with any ChatGPT translation clone for Indian and other foreign languages, or suggest any free Ollama LLM model.
@jamesmarvin1920 · 2 months ago
The prompt was flagged for me.
@xtra_612 · 2 months ago
I got so mad. I needed the money to eat, and I had already paid for Claude AI. Seeing the hype, I bought this, and it is useless. I tried so many ways to make it better, but it is bad. It doesn't follow instructions, and the answers aren't better than 4o, so what's the point of this?
@jonathanparham7421 · 2 months ago
Not a fan.
@Appocalypse · 2 months ago
Not to be mean, but this is one of the most flawed benchmarks I've watched on YouTube in the past week. A LOT of your questions, especially from the SAT set, are either incomplete or incorrect. You should use a source that you validate with your own reasoning first to make sure it's not garbage.
@SkillLeapAI · 2 months ago
OK, posting a more complete test next week.
@SkillLeapAI · 2 months ago
Also, to be clear, this was not at all a benchmark video. I was simply showing that for most use cases, the o1 model gives similar results to the custom GPT with a CoT system prompt, regardless of the input for the prompts. I wouldn't use random questions for an actual LLM benchmark test.
@Appocalypse · 2 months ago
@@SkillLeapAI I understand the intent, and I do think your CoT-prompt GPT makes GPT-4o a lot more effective, but my point was that most of the questions that o1 failed were incomplete or incorrect. Once you run it on a validated, correct set of questions, and you find that your CoT GPT works fairly well compared to o1, your next logical step should be to find even more complex problem sets. The best ones to showcase the differences will be graduate-level physics and math problems.
@SkillLeapAI · 2 months ago
I just finished another, more comprehensive video that I will post soon.
@SkillLeapAI · 2 months ago
If you have a place where I can source questions, please let me know for upcoming videos.
@Hae3ro · 2 months ago
It's bad.
@Horizon-hj3yc · 2 months ago
The AI hype train has arrived again.
@dionk6282 · 2 months ago
It didn't get the dolphin wrong. This is a case where the IQ of the entity answering is higher than that of whoever wrote the question. Some might say that the dolphin being the only one that doesn't lay eggs, for example, matters more than what type of limbs they have.
@universalchaospaladin5019 · 2 months ago
Oh, I scrolled through and didn't see your reply before I replied with a longer version of this.
@Bjarkus3 · 2 months ago
Also, the given reasoning is that it breathes air... Dolphins breathe air... I mean, it is between the dolphin and the turtle for sure, but... Sharks don't lay eggs, btw. Yeah, I know, mind blown.
@matthew04101 · 2 months ago
This proves Strawberry is low-hanging fruit. And I would expect more from OpenAI.
@avi7278 · 2 months ago
This is hilariously misguided. Just goes to show that this stuff is really hard to grasp for the layman.
@SkillLeapAI · 2 months ago
It's a B2C product in chat format for the layman. So if I can't see a huge improvement, as they claimed, in my day-to-day work, what exactly is the difference that makes an impact for me?
@avi7278 · 2 months ago
@@SkillLeapAI Just put it this way: probably nothing you do on a daily basis would benefit from using o1. Your conclusion is correct, but the way you got there is filled with misunderstandings. o1 is for a very specific subset of problems that can benefit from extra reasoning steps, and whether it gets the right answer on a math question still nearly comes down to chance. It could reason through a whole problem correctly and then spit out a wrong answer, because it doesn't actually care about giving you the right answer to a math question, only about what the most likely next token is after it does all that reasoning, and sometimes the next token is not actually the answer even though it probably "knows" the correct answer. Math questions like that are simply benchmarks, and it's a misconception that benchmarks are a measure of how good or bad a model is. Just because they got it to spit out the right token 81% of the time doesn't make it a model that's reliably good at answering math problems, nor does its ability to do so mean it's good. The subset of problems that o1 excels at are problems where there is no one right answer but rather a sliding scale: from gibberish, to a coherent but completely wrong/bad response, to a coherent and generally good response, to a coherent and, by the stroke of mathematical next-token luck, an amazing response. For example, brainstorming, planning, and project structures (especially based on existing frameworks like DDD) all fall within this subset of problems that benefit from advanced reasoning, among many others. It's not for daily-driver use, and no, that prompt running on GPT-4o won't "clone" o1.
@avi7278 · 2 months ago
@@SkillLeapAI YouTube is worthless; I left a detailed comment which just got deleted. Unless you did it for some reason.
@Gallus7631 · 2 months ago
Well, news flash: stop being a hype beast. I get it, YouTubers are trying to play the algorithm, but hopefully not at the cost of your own credibility and, more importantly, dignity. The truth is, the model is NOT that great. It isn't some "powerful" new model that will "revolutionize" anything; it's just hype built on corporate-level buzzwords. I know, because I've literally written marketing policy in previous professional positions, and its selling points are a lot like the nonsense marketing strategies used as far back as 1997. This model is based on pure greed, with the intent to drain people's pockets, along with a flight of ideas from this gradually sinking company. GPT mini will suffice for most applications, and if you understand informatics, it's more than enough to help you build statistical/probability models.
@jasondee9895 · 2 months ago
You need to do a follow-up video to this admitting that you messed up.
@SkillLeapAI · 2 months ago
I just finished a more comprehensive test with a lot more prompts. Posting soon.