That website you used for the math questions is garbage. All of the answers you said ChatGPT got wrong, it actually got right; the answers posted on that math problem website were incorrect, which makes your video inaccurate.
@christian15213 · 2 months ago
No, this is not what's going on. There is something more than CoT trained into the model.
@Leto2ndAtreides · 2 months ago
Well, they haven't released the actual version that has the great performance on benchmarks. But even so, you're right that it's not so much an advancement in technology as it is an optimization of prompting that is done for you. Doing that sufficiently well should improve results by a lot.
@hxlbac · 2 months ago
The math test had formatting problems! The two models interpret the questions differently: kzbin.info/www/bejne/aauYnGuapMuolZI vs kzbin.info/www/bejne/aauYnGuapMuolZI
@mishos.2228 · 2 months ago
Sorry, didn't work. I tried it on my university-level mechanics, physics, and algebra questions, and your GPT failed to answer all of them while o1-preview did them correctly.
@ZipADeeeDoooDaaa · 2 months ago
The math test had formatting problems! No one can solve those questions. The equations in the questions were mostly missing the division operator. Here's an example: Question 1: C=59(F-32)
@oscargallesargemi3986 · 2 months ago
Thanks for noticing. I think he should do another video titled "I was wrong about being wrong about o1."
@pubfixture · 2 months ago
Lol, that temperature question and answer must have been written by GPT-2. It's wrong all over the place.
@ZipADeeeDoooDaaa · 2 months ago
@@oscargallesargemi3986 Actually, I think the prompt he came up with is really good. The math testing was flawed.
@blackrockcity · 2 months ago
@10:55 The section that says 'answer explanation' says 23 when I think it should be formatted 2^3, which would equal 8. Am I wrong?
@blackrockcity · A month ago
@@shanesills1809 If the prompt were "what's wrong with this test?", it probably would have pointed out the issues.
@charliecomberrel3842 · 2 months ago
The math questions appear to be missing some operators, leading to incorrect answers in the AI models. In the first question, for instance, the formula should be C=5/9(F-32), specifically 5/9, not 59. Statement I was interpreted by the first version as 59 (which is what the original question had, making statement I false) instead of the 5/9 (which would make statement I true). So, given the missing /, I would agree that the answer is B, not D. In the other model, o1 somehow interpreted 59 as 5/9, leading it to answer D. There must be a similar problem with the 8x2y question. I graphed the expressions in Desmos and found that graphs for all three answer options intersected with 3x-y=12. So I agree with both models that the answer cannot be determined.
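The operator loss this comment describes is easy to confirm numerically. A small sketch (assuming the question is the familiar SAT temperature-conversion problem) showing how the missing "/" flips statement I, which claims a 1°F increase corresponds to a 5/9°C increase:

```python
# Quick numeric check of how the missing "/" changes the answer.
# Statement I of the question: a 1 degree F increase corresponds to a 5/9 degree C increase.

def delta_c(f_increase, slope):
    """Change in C for a given change in F under C = slope*(F - 32)."""
    return slope * f_increase

correct = delta_c(1, 5 / 9)   # formula as intended: C = 5/9 (F - 32)
garbled = delta_c(1, 59)      # formula as rendered:  C = 59 (F - 32)

print(round(correct, 4))  # 0.5556 -> statement I true with the real formula
print(garbled)            # 59     -> statement I false with the garbled one
```

With the intended slope, statement I holds; with the garbled "59", it doesn't, which matches the two different answers the models gave.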
@kyneticist · 2 months ago
It may be that the mathematical reasoning is some kind of separate module that the inference model calls on, perhaps similar to how they use statistical modelling to answer questions they can't find a direct answer for in their corpus. I don't know if anyone knows enough about the details of how these models work to say confidently whether that (or something similar) is what's happening.
@David-nb2dc · 2 months ago
Question 3 is correct: "The value cannot be determined from the information given." x and y are both equal to 6. Order of operations, left to right: 8×6×2×6 is 576, so your A, B and C aren't the values. And the equation isn't written to the standards for writing equations; the information should have been given as 3(x) - y and 8(x)2(y).
@alanleavy7982 · 2 months ago
I think question 3 was supposed to be: "If 3x-y=12, what is the value of (8^x / 2^y)?" Then the suggested correct solution (talking about base, exponent, numerator and denominator) makes more sense, and the correct answer should be 2^12. It seems exponents and quotients were stripped out of the question presentation.
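If that reconstruction is right, the answer 2^12 follows from the exponent rules (8^x = 2^(3x), and dividing powers subtracts exponents, giving 2^(3x-y) = 2^12) and can be checked numerically; a quick sketch:

```python
# For any (x, y) satisfying 3x - y = 12, 8**x / 2**y should equal 2**12 = 4096,
# since 8**x = 2**(3x) and dividing powers of 2 subtracts the exponents.
for x in range(5, 10):
    y = 3 * x - 12          # pick y so the constraint 3x - y = 12 holds
    assert 8**x // 2**y == 2**12

print(2**12)  # 4096
```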
@TechnoMageCreator · 2 months ago
Oh, it's reasoning like never before; the difference is it takes user reasoning to new levels. If it doesn't work properly, check your own reasoning and correct it. It will work. We are our own limitations. I've been saying this for a while: AI is about awareness. It's a tool that exponentially amplifies your own thinking process. For it to work perfectly, user and AI need to be aware of the same things. Feels like magic to me.
@Travel_DNA · 2 months ago
Tried out your CoT and it works amazingly! 10x better
@kkollsga · 2 months ago
This is so cool. I tested your method on Claude by creating a new chain-of-thought project with your custom instructions. Tested it on a riddle I found on X, which normal Claude doesn't solve: «A house with two occupants, sometimes one, rarely three. Break the walls, eat the borders. What am I?» and the CoT version nailed it: peanut
@GutoHernandes · 2 months ago
Why "eat the borders"?
@kkollsga · 2 months ago
@@GutoHernandes It's a riddle. It's meant to indicate that it's a peanut.
@GutoHernandes · 2 months ago
@@kkollsga Yeah, I understood that it's a riddle. I asked why "eat the borders"; it doesn't make sense. Why would someone eat a border? You eat what's inside the peanut, not any borders. Then I googled the riddle, and it turns out it's "eat the BOARDERS", not borders.
@claudioagmfilho · 2 months ago
🇧🇷🇧🇷🇧🇷🇧🇷👏🏻 Sometimes I feel like we might not see groundbreaking models from OpenAI anymore, especially with the possibility of it being influenced by government oversight. But I hope I'm wrong; there's still a chance for innovation to thrive.
@CosmicCells · 2 months ago
Disagree. From what I have heard, it's far more than a "clever" system prompt; if it were that easy, everyone would have done it already. Many domain experts have been very impressed with o1, myself included (I'm a biologist). Strawberry was probably trained with (1) CoT reasoning, (2) some sort of reflection, and (3) Monte Carlo tree search. Still, it's interesting how far these custom instructions took you! A comparison with the normal model would have been nice.
@SkillLeapAI · 2 months ago
Thanks. Very helpful. I'll test it against Claude and GPT when I get more credits
@CosmicCells · 2 months ago
@@SkillLeapAI Good idea. Then it's more obvious how well your GPT works, etc. Dr Waku and Dave Shapiro have some interesting videos on what they believe o1 (Strawberry) means for the path to AGI.
@CM-zl2jw · 2 months ago
Interesting, Saj. But isn't o1 a completely different model? It's not part of the ChatGPT family per se; it's like a new engine. I don't think it's just a fine-tuned version of ChatGPT. I heard Orion will be the equivalent of a new car with the new engine, and a hefty price tag: the Cadillac of AIs. And did you notice the improvement in ChatGPT? Yesterday it started a different kind of engagement with me, noticeably better. It was taking way more agentic initiative, perhaps taking control, which took me by surprise.
@vogel2499 · 2 months ago
You guys need to relax; it's still a preview, and I heard it hasn't been trained on the full dataset yet.
@canastasio29 · 17 days ago
Thx as always, Saj!!
@geogoddd · 2 months ago
Appreciate the humility in your approach. If I may offer some criticism of my own: you could really flesh this mode out with a lot more detail, and it could perhaps rival o1-mini at least. Plus, the benefit of a GPT like this is that it's not against their dumb usage policy to ask it for its thought process. Perhaps try fleshing the model out with much more practice, retraining, fine-tuning, etc., and you could be looking at a vastly different outcome. Which I would love to see.
@SkillLeapAI · 2 months ago
Great point. I made another video already, but I'll add to the custom GPT to get it closer
@JordanREALLYreally · 2 months ago
Thank you for this prompt paste. Very good of you. Subbed!
@Chumazik777 · 2 months ago
Am I missing something? Question 9 is indeed incomplete, based on the explanation given.
@HarveyHirdHarmonics · 2 months ago
Yes, I thought the same. There's no information for the previous store. So the GPTs are right.
@kunlemaxwell · 2 months ago
This is a smart one - giving it a base CoT system prompt - but the sample size is too small and could be misleading.
@influentialstudio6464 · 2 months ago
For most users, the latest model isn't necessary, but I disagree with your assessment. The problems that require this model are much more technical. Try questions like these.

Non-elementary integral: Evaluate the integral \int e^{x^2} \, dx. This is an example of an integral with no closed-form solution in terms of elementary functions. AI systems often rely on approximations or numerical methods but cannot symbolically solve this without special functions.

Multivariable calculus (Divergence Theorem): Use the Divergence Theorem to evaluate the flux of the vector field \vec{F} = (x^2, y^2, z^2) through the surface of the unit sphere x^2 + y^2 + z^2 = 1. This requires understanding the Divergence Theorem in three dimensions and involves tricky vector calculus concepts. It's a challenging problem due to the surface geometry and field complexity.

Complex contour integral (Cauchy Integral Theorem): Evaluate the contour integral \int_{C} \frac{e^z}{z^3} \, dz, where C is a contour enclosing the origin in the complex plane.

I find it absurd when people are testing these models with 9+2-7*6-87, think step by step. 🫨
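For what it's worth, the divergence-theorem problem above has a clean closed-form answer; a sketch of the computation:

```latex
% Flux of \vec{F} = (x^2, y^2, z^2) through the unit sphere S, via the Divergence Theorem.
\[
\nabla \cdot \vec{F} = 2x + 2y + 2z
\]
\[
\iint_{S} \vec{F} \cdot d\vec{S}
  = \iiint_{B} (2x + 2y + 2z)\, dV
  = 0
\]
% since each of x, y, z is an odd function integrated over the
% symmetric unit ball B, so all three terms vanish by symmetry.
```

So a model that truly applies the theorem should report zero flux, which makes this a nice pass/fail reasoning test.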
@toxicG3N · 2 months ago
Let me give you this hypothetical example. What color is an emerald?
GPT 3.5: green
GPT-4o: green
o1: green
Llama 2: green
Gemma 2: green
This does not prove that all of these models are equal.
@LucaCrisciOfficial · 2 months ago
Actually, "dolphin" was the right answer. Prompting alone is not enough; of course o1 uses a far more sophisticated approach to make ChatGPT better at reasoning
@brokoline2497 · 2 months ago
Thanks. I study biology and thought the same thing: the test is wrong
@mikebysouth105 · 2 months ago
And other answers could be equally correct, depending on the argument for why a given answer was selected. It's not a one-answer question.
@Mopharli · 2 months ago
How did you arrive at the decision that a turtle is more like a shark than a dolphin is like a shark?
@LucaCrisciOfficial · 2 months ago
Because ChatGPT told me 😂
@pmHidden · 2 months ago
@@Mopharli That's not the question; the question is which is the least like the others, not which is the least like a shark. Also, you can reasonably argue for any one of these animals depending on how you categorize their similarities and differences.
@micbab-vg2mu · 2 months ago
Agree - Sonnet is still king :) o1 is just clever prompting :)
@Sindigo-ic6xq · 2 months ago
No and yes
@TomHuckACAB · 2 months ago
Sonnet thinks 9.11 is larger than 9.9
@Mopharli · 2 months ago
Question 5 regarding the turtle and the dolphin is really interesting. It should have been able to pick the turtle out without too much trouble; however, it is a popular distinction that dolphins are mammals and not fish, so I would think this gem of contradictory thought is somewhat prevalent on the internet and biased in the training data.

Also, question 3 is messed up beyond all reason without the original formatting. The solution substituted 23 in place of 8; the tester clearly intended 2 to the power of 3. It's also talking about a numerator and denominator, but there are no fractions in that question. Based on the laziness of whoever created that page without the required formatting, I'd write off that entire website.

These questions are very low complexity, single-step reasoning, below both their challenge thresholds. The custom GPT (based on GPT-4, not even GPT-4o), as I understand it, generates the whole response in one pass, separating different steps into formatted sections without any reflection on what it has already generated. o1-preview seems to run multiple evaluation queries, which genuinely can achieve this. If you want a better test, I think you should aim high so they both fail, and see which makes the most progress; i.e., ask it to create a website for you, with appropriate formatting.
@andre-guybruneau3053 · 2 months ago
Suggested revision of your prompt:

"You are an AI assistant designed to solve problems using a structured, step-by-step approach known as Chain-of-Thought (CoT) prompting. Follow these instructions before providing any response:

1. Understand the User's Request: Carefully read and analyze the user's question or request to ensure full comprehension. Confirm the key objectives and any specific details required.

2. Outline the Reasoning Process: Break down the problem or request into a clear, logical sequence of steps. Present these steps as a roadmap, detailing each phase of the reasoning process.

3. Detail Each Step with Explanations: For each outlined step, provide thorough explanations, calculations, or reasoning. Aim to make your thought process transparent, ensuring the user can follow and understand each part of your logic.

4. Provide the Final Answer: Only after completing all reasoning steps should you present the final answer or solution. Ensure that the solution directly addresses the user's original question or request.

5. Review and Validate Your Thought Process: Rigorously review your reasoning for any errors, inconsistencies, or gaps. Conduct a final check to ensure the response is accurate and complete before delivering it to the user.

6. Ensure Transparency and User Comprehension: Adapt your explanations to the user's level of expertise, using examples or analogies where appropriate. Strive to make the reasoning as accessible and clear as possible.

7. Iterative Feedback Integration: Be prepared to refine or expand your response based on user feedback, fostering a dynamic interaction that ensures the user's needs are fully met.

By adhering to these steps, aim to provide responses that are not only accurate but also logical, transparent, and tailored to enhance the user's understanding of your reasoning and conclusions."
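A prompt like this is typically wired in as the system message when calling a model through an API. A minimal sketch using the OpenAI Python client (the model name and the shortened prompt text are placeholders, and the actual API call is left commented out since it needs an API key):

```python
# Sketch: using a CoT system prompt with a chat-style API.
# The point is simply that the instructions ride along as the "system" role
# on every request, so the user doesn't have to paste them each time.

COT_SYSTEM_PROMPT = (
    "You are an AI assistant designed to solve problems using a structured, "
    "step-by-step approach known as Chain-of-Thought (CoT) prompting. "
    "Outline your reasoning, detail each step, then give the final answer."
)

def build_messages(user_question: str) -> list[dict]:
    """Pair the CoT instructions with the user's question."""
    return [
        {"role": "system", "content": COT_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

messages = build_messages("If 3x - y = 12, what is the value of 8^x / 2^y?")
print([m["role"] for m in messages])  # ['system', 'user']

# from openai import OpenAI
# client = OpenAI()  # requires OPENAI_API_KEY in the environment
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```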
@xLBxSayNoMo · 2 months ago
Would have been nice to see the two models plus regular GPT-4o without CoT prompting, to see if it got the same answers right as your clone
@SkillLeapAI · 2 months ago
Just finished recording it. Coming up next.
@xLBxSayNoMo · 2 months ago
@@SkillLeapAI I went back and tried your model on the turtle/dolphin one, and your clone got the answer right: turtle. Maybe they are watching your videos as soon as they come out to train 4o 😂
@LucaCrisciOfficial · 2 months ago
The math problems you took are the 15 most complicated questions of the SAT math test, which includes hundreds of questions and on which GPT-4o scores more than 80%
@haroldpierre1726 · 2 months ago
Could it be that GPT-4o was trained on those questions?
@LucaCrisciOfficial · 2 months ago
@@haroldpierre1726 I think o1 would of course perform even better than 4o on the "standard" SAT. It scored around 50% in the video because this wasn't the normal SAT math given to students, but the 15 hardest questions.
@basicforge · 2 months ago
I've been trying the new o1 model to help me write some software code. It is much better than the 4o model. It is slower, but the results are higher quality.
@charlesnuss · 2 months ago
I've definitely run into o1 just being... dumb. Just repeating verbatim parts of previous outputs that I clearly prompted it to restructure, which I'm 100% sure Claude would have understood. Right now I'm thinking o1 for lengthy, in-depth first drafts that then get further processed and refined with Claude.
@sabahsardar2789 · A month ago
I think you should also test with GPT-4o normally, so we know what difference your prompts make.
@SkillLeapAI · A month ago
I did a full test after I made this video and posted it
@sabahsardar2789 · A month ago
@@SkillLeapAI OK, I am going to your channel to find the video. Thank you, sir.
@blackrockcity · 2 months ago
Here's what most likely happened: the person who made the IQ site had the various math equations in an MS Word document, which was using special formatting to draw the equations: MathType or Office Math Markup Language (OMML), perhaps even TeX. This would render exponents as superscripts, for example. When they pasted this into HTML, that math markup was ignored, and the equations were never fixed to include plain-text characters such as 5/9 or 2^3 or (x-2). In other words, nobody proofread the IQ test to fix what was lost in translation.
@eugenes9751 · 2 months ago
The whole point of this model is its ability to reason, not so much its ability to answer questions based on knowledge. Regular 4o would have answered those questions just as well. Try asking questions that require planning or reasoning. Ask both to program Tetris; that's where you'll see a real gap.
@blackrockcity · 2 months ago
It turns out that the test was full of errors.
@qadirtimerghazin · 2 months ago
Would have been good to give an example of how 4o with default settings does…
@tommynickels4570 · 2 months ago
This is the preview version. Wait a month; the o1 full version is coming. After that it's Orion, powered by Blackwell 200. This will be AGI.
@exentrikk · 2 months ago
Saj, I usually really enjoy your videos, but this one seems a bit misguided. A couple of points to note:

1) Testing o1P with just a handful of random questions and not receiving the correct answer each time is at best inconclusive - just like getting 3 tails when tossing a coin three times.

2) Custom GPTs run on GPT-4, which is a step back even from GPT-4o, as numerous benchmarks have concluded - so expecting the same quality of response from GPT-4 with a couple of lines of instructions is ill-advised.

3) If you expect to "clone" o1P, a couple of lines of ill-conceived prompts to configure GPT-4 is not gonna cut it - the difference lies in the way the two models think, so to speak. Irrespective of the customizations you configure into GPT-4, it does not have the ability to gather its thoughts first and then respond; it will in almost all cases keep uttering the words it deems fit with respect to the context, unlike o1P, which has the ability to collect its thoughts, verify, and then answer. This is also exactly why GPT-4 never knows how many words there are in its responses, while o1P will tell you exactly how many - truly a paradigm shift!

For someone who updates us on the latest in AI and sells AI courses, we expect better from you! Looking forward to your next video, cheers.
@SkillLeapAI · 2 months ago
That's why the video is called "I was wrong." It's my opinion, not a scientific test. And they didn't give me enough credits to test other categories with more prompts.
@0057beast · 2 months ago
Dude, I need a long piece of code fixed. Its text size is 54,000 and I can't get any AI to fix it and give me the same code back. Any advice?
@universalchaospaladin5019 · 2 months ago
@@SkillLeapAI Still going to use your GPT.
@Rx4AI · 2 months ago
@@0057beast You should use Gemini to create an outline of all functions and variables, using their version with the 2M context window (I think this is standard on Vertex AI? I'm not sure). Then ask it to briefly define in natural language exactly what each function does. Then ask it to give you a blueprint for how it would refactor your code. Ask o1-preview what it thinks. Ask o1-mini to review that thought process as well. Then modularize your code, and have the models check themselves along the way. You can always ask a Google developer support person to guide you as well.
@Rx4AI · 2 months ago
He very clearly states it's not a perfect test, but it is pretty neat. I think this content is fine…
@merzakish · 2 months ago
This is weird; how come they didn't perform tests similar to yours? Many thanks
@SkillLeapAI · 2 months ago
I think a lot of people don't know much about prompting, so having that built into a model can be beneficial for non-technical people
@jinxxpwnage · 2 months ago
If anyone is still confused: this o1 model is a step down from 4o to GPT-4, but the framework is different in that it reasons by providing itself with step-by-step logic. It's a good step TOWARDS AGI, but it is not GPT-5 or AGI yet. In a way, it's a bit more autonomous.
@CamPerry · 2 months ago
Claude does this way better than
@terminally_lazy · 2 months ago
Gotta understand that more advanced models require some advanced prompting, and it's quite clear that you may not understand that. Regardless, you just made the model look much smarter than you, my guy.
@SkillLeapAI · 2 months ago
It scored a 120 IQ. Pretty sure that's higher than both of ours
@MohammedQurashi · 2 months ago
I think dolphin is the right answer: they breathe air and they are mammals, which are further apart from fish than turtles
@CSlush · 2 months ago
Turtle would only be correct if "dolphin" referred to the fish of that name rather than the mammal, which is what's typically meant by the term. Although a turtle may have a less similar physical profile, as a reptile it is nonetheless more closely related to the other listed fish than a dolphin, as a mammal, would be.
@Rx4AI · 2 months ago
Time to test out o1-mini for coding!
@TransLearnTube · 2 months ago
I checked the math questions; it responds very well
@mikebysouth105 · 2 months ago
Your questions assume there is a right or wrong answer to every question. That doesn't necessarily apply to the one with the turtle; other correct answers are possible depending on the reasoning used. So the AI wasn't necessarily wrong!
@bgNinjashows · 2 months ago
Wow! Dude took on a billion-dollar company and matched their efforts. Very impressive
@SkillLeapAI · 2 months ago
It's a 100-billion-dollar company, and I compared their two products, not my own
@bgNinjashows · 2 months ago
@@SkillLeapAI very humble
@JackAdams0 · A month ago
7:49 I wonder if its answer would change if you gave it some background info like "these are some of the hardest SAT questions; what is the correct answer?"
@remi.bolduc · 2 months ago
Try this one, in music: you start on the note D and go up by a major third, then repeat the process 4 times. o1-preview gets this; not your prompt, it seems. The answer is D F# A# D F#. Actually, it did the second time.
@Emc-it4lg · 2 months ago
You are awesome! Your GPT works really well :)
@Atractiondj · 2 months ago
All the tests I conducted gave results worse than free Claude... OpenAI misleads people. It can't analyze data properly; moreover, I can't even load files into the new model! And that is the most important thing.
@MR-DURO · 2 months ago
It's very underwhelming
@gonzalobruna7154 · 2 months ago
That's literally what they say in their blog post, which I recommend you read carefully. I think you are really missing the point here. This is not a new GPT model but a completely different paradigm, and it opens the door for what future models will be. These new models are not supposed to be better than the current GPT-4o at most tasks, but specifically at extremely complex science questions. There is a video of a physics PhD student being surprised that OpenAI o1 wrote in only 1 hour the code that took him 1 year to write himself.
@Techsmartreviews · 2 months ago
Finally! No more "How many R's in strawberry". Good test.
@alevyts3523 · 2 months ago
OpenAI says not to use chain-of-thought (CoT) prompts with o1-preview, because the model starts to dumb down and give worse answers.
@Ordinator1 · 2 months ago
The way the models solved the third question from the IQ test shows that they are still not very smart, unfortunately. Brute forcing the answer is valid, but it's of course not very efficient either. The two smallest two-digit numbers already add up to 27, so it's clear that 5 and 6 must be two of the three numbers. Since 5 + 6 is 11, the third number has to be 16.
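The brute-force search the models used can be sketched in a few lines. Note the candidate set here is a hypothetical reconstruction implied by the comment (5, 6, and two-digit options whose smallest pair already sums to 27), not the actual test question:

```python
from itertools import combinations

# Hypothetical candidate set reconstructed from the comment: single digits
# 5 and 6, plus two-digit options (11, 16) whose smallest pair sums to 27.
candidates = [5, 6, 11, 16]

# Brute force: try every triple and keep those summing to 27.
solutions = [c for c in combinations(candidates, 3) if sum(c) == 27]
print(solutions)  # [(5, 6, 16)]
```

The direct reasoning in the comment reaches the same triple without enumerating anything, which is the commenter's point about efficiency.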
@blackrockcity · 2 months ago
Upvote if you think the test he used contained critical formatting errors that misled the AI.
@Olaf_Schwandt · 2 months ago
I have a standard test: I ask the AI for one hint in Sudoku (ChatGPT-4o as a picture, o1-preview as text), and neither of these models is able to solve it. I think o1-preview is over-hyped. It's not a deep new technique; in reality, ChatGPT-4o effectively runs several times and checks its own results.
@djtate1975 · 2 months ago
Strangely enough, a couple of months ago I gave the default model a Sudoku puzzle and it solved it correctly. However, about 2 weeks later I did it again and it got it wrong. 🤔... My guess is that they are doing something on the back end that affects the way the model "reasons".
@vanessa1707 · 2 months ago
@@djtate1975 Totally concur. I have had experiences with ChatGPT-4o where it gets an answer right on the first try, and then when prompted again at a different time in a new prompt, it gets it wrong! Not sure what is up with that!
@Olaf_Schwandt · 2 months ago
@@djtate1975 Thank you for the answer; that's really interesting. I never got the right answer. Let's observe future behavior
@pmHidden · 2 months ago
@@djtate1975 They regularly release new versions and even between versions, I've noticed behavioral changes (and so have others) that can break previous functionality. I've written a test suite for my company that I use to evaluate models for our use cases (mix of responses and tool calls using the async streaming versions of their respective Python libraries) whenever new models are released. It's not rare that an old model suddenly performs significantly worse in some aspect (e.g., no longer calling tools reliably). Sometimes, even factors such as the time of the day can make a difference. It might have to do with their internal prompts but it might also be that they simply deploy multiple versions of the same model and switch based on certain criteria (e.g., a more quantized version during peak hours).
@chasisaac · 2 months ago
Why did you give it the potential answers? You should have left it open and allowed it to come up with its own answer. And you should've had a control GPT without your instructions and just asked the questions.
@AIrvin88 · 2 months ago
I think it's much more likely that people believe they're asking intelligent, complex questions when the questions they're asking are actually much simpler and less difficult than they think.
@unbasedcontrarian · 2 months ago
I think the likelihood of you passing your English course is lower.
@justremember9697 · 2 months ago
That doesn't matter. The same question can be answered in numerous ways with various levels of information. Sometimes better reasoning is identifying what the person is asking and presenting an answer that is clearer for everyone.
@jarkkoisok9739 · 2 months ago
Maybe the IQ test used is in the training material for both models? That would explain the similar answers.
@MuzammilAhmad-tw4fb · 2 months ago
Hi Saj, sometimes the websites that provide questions and answers have wrong answers too. We need to cross-check their answers as well.
@SkillLeapAI · 2 months ago
Yeah, I understand. For this example, I don't think it made a difference for the point I was trying to make: they both always got the same answer regardless of the prompt. So the o1 model didn't really outperform a custom GPT. That was the point of the video.
@testales · 2 months ago
I'm not very convinced about this level-1 vs. level-2 thinking anymore. You think slowly only on new, complex problems; the more similar problems you have seen, the faster you get at solving them, limited only by the number of variables and intermediate results you can keep in your head at the same time. So at some point the process becomes more or less automatic and you know this and that shortcut, so it's no longer level-2 thinking! Obviously you can solve more complex problems if you think for a longer time, and so can an LLM by prompting itself. So the question is not whether you can get more out of it with self-prompting and "slow thinking", but whether you want to train an LLM that way. You actually want a quick, good response, not pages of trash. Since even complex problems can be solved with level-1 thinking given enough practice, you want an LLM doing just that and pushing the limits a little further; system prompts that cause self-prompting and slow thinking should only be an option, not something that's active all the time. So OpenAI basically did the same as the guy who recently released that Reflection LLM, and considering the resources OpenAI has, this is pretty lame.
@marsonal · 2 months ago
Can you test o1-preview against Claude Sonnet 3.5 with your technique?
@SkillLeapAI · 2 months ago
Yeah, I just did. Posting soon
@_ramen · 2 months ago
This is not how o1 works under the hood. It isn't using prompt hacks. Chain of thought is actually "baked in" to the model. Using chain of thought prompting will not result in the benchmark gains that o1 provides. Optimal chain of thoughts are learned via reinforcement learning during the new training process. You will not find this ability in previous models. The reason your custom GPT probably performs better is that chain of thought prompting does typically improve performance of previous models for tasks that involve reasoning. But it won't compare to having the chain of thoughts more deeply integrated within the model itself.
@SkillLeapAI · 2 months ago
Yeah, I'm not saying it's that simple, just that the results are not mind-blowing like previous GPT upgrades were, and I nearly replicated them with a custom GPT. So even if it's entirely a new architecture, it's not a big improvement from a practical standpoint
@_ramen · 2 months ago
So you think they are distorting the gains they show in their benchmark testing? Because if the benchmarks are correct, it is a significant improvement over 4o. In my experience, o1-mini is outperforming Claude 3.5 Sonnet on coding problems. For reference, I was creating an animation in JavaScript of a sphere transforming into a cube and then back into a sphere. 3.5 Sonnet couldn't do it; o1-mini one-shotted the problem perfectly. Also keep in mind that o1-preview scores significantly lower than the non-preview version of the model, which they haven't released publicly yet. So it will be a good idea to re-evaluate once that happens. I am not entirely sure why they even bothered releasing the weaker preview model when they already have something more capable that is ready to go. As a side note, I find it weird that mini outperforms the larger model at coding (for now).
@SkillLeapAI · 2 months ago
No, I don't think they are gaming anything. I'm sure they are running much more scientific tests. Just from everyday use, I don't see a vast improvement.
@David-nb2dc · 2 months ago
😢 What's confusing me the most is at 5:55. I'm not sure why the presenter needed to click Dolphin to acknowledge it was wrong; it's pretty obvious the answer is turtle. Slow down, world! I know he knew that; what took me by surprise is that he pretty clearly lost focus. And question 5 is a good example of how "checking the answer twice for a pattern" and the compare-and-contrast classification of problems commonly used to measure intelligence can overwrite your chain of thought, it feels. Look at the structure of the question.
@ktb1381 · 2 months ago
Interesting, but maybe next time you could try it against plain vanilla 4o as well as your custom GPT?
@SkillLeapAI · 2 months ago
Yep, just recorded that. Posting soon
@HookedGoneSunk · 2 months ago
Can you also make a video on computational physics, chemistry, and other subjects? Thank you. If it's accurate.
@denisbellerose8757 · 2 months ago
Thanks!
@SkillLeapAI · 2 months ago
Thank you
@Soccer5se · 2 months ago
What were the limits increased to?
@SkillLeapAI · 2 months ago
They just reset it. Still 30 a week
@SkillLeapAI · 2 months ago
They just changed it again: 50 a week now for o1-preview
@Soccer5se · 2 months ago
@@SkillLeapAI Thanks for keeping us up to date!!
@jackstrawful · 2 months ago
I'd love to know what's going on when it misses the dolphin question. How does it not notice that it says four legs and that dolphins don't have legs? I think understanding this error would teach a lot about how these models actually work.
@gamesshuffler-v8n · 2 months ago
The message cap is ultra, ultra low, which means this is very heavy for the servers to run. So why adopt this chain-of-thought prompting when there are other prompting methods that can be good? They should also adopt cost-efficient techniques like decision trees, SVMs and other approaches, if the server load is this heavy and it takes this long to answer just one question.
@edwardserfontein4126 · 2 months ago
You will probably get a lot of criticism in the comment section, but I like that you explored the good and the bad. You gave your honest, balanced opinion. So many ChatGPT fans don't want to hear anything about ChatGPT unless it is infinite praise.
@SkillLeapAI · 2 months ago
Thanks. I'm one of those fans myself, but I just wasn't very impressed. When I tested GPT-4 vs. 3.5, it blew my mind. It wasn't even in the same ballpark. So I was expecting something similar.
@influentialstudio6464 · 2 months ago
Lmao, c'mon man. I think the problem is people aren't intelligent enough to ask hard enough questions to evaluate the models. This model isn't for solving 8th-grade math. Sure, you can, but the results will be on par with GPT-4o, except you will get to see the steps used to solve the problem. Go ahead and test these models with calculus or other higher mathematics, and you'll find areas where 4o is horrible but the new model is crushing it.
@YoussefBarj-g3e · 2 months ago
The only revolution OpenAI is spearheading is its innovative way of doing marketing.
@TroyShields · 2 months ago
You may want to go back and read the reasoning that you prompted both models for in the animals question. The correct answer could, and should, be dolphin. But you glossed right over that, which makes you like an LLM that hallucinates. The test you used said turtle was the right answer, so you told your viewers the same, and now it's out in the world as fact. It's actually kind of interesting.
@SkillLeapAI · 2 months ago
Well, even if the test was wrong, they both got the same answer. For the sake of comparing the two, it's the same result. If both are right or wrong, it's a wash for the test.
@TroyShields · 2 months ago
@@SkillLeapAI My intent is not to bash you. It's really not. I just ask that you hold yourself to the same standard that you pitch in your video. Obviously, you are helping people by letting them know that the prompt is even more important than originally thought, but basically you used a prompt that told the model to check its work, and you didn't do the same. The fact that both got all questions right AND uncovered an error in one is significant, IMO.
@ricardocnn · 2 months ago
The prompt is great for custom GPTs. That's it.
@gabrielepedruzzi7967 · 16 days ago
Unlikely to be 100 times better.
@seregamozetmnoga1700 · 2 months ago
AI reasoning abilities seem to develop the way a child's would: the same neural network structure, but progressively better intelligence as that same structure is optimized.
@JimWellsIsGreat · 2 months ago
Try testing your clone against the complex prompt provided in Samer Haddad's video where he claimed he was wrong about o1.
@SkillLeapAI · 2 months ago
OK, I'll check it out. Haven't seen that video.
@JimWellsIsGreat · 2 months ago
@@SkillLeapAI Kyle Kabasares posted a couple of videos of it getting correct answers on PhD-level physics problems. It took o1-preview 122 seconds to do what takes a person 10 days. It's wild how it can get correct answers on crazy, high-level problems but fail on math that is comparatively simplistic.
@DanFa92 · 2 months ago
You're comparing two custom GPTs and you know it. I'd use yours, but average people wouldn't. That's pretty simple to understand.
@BrianMosleyUK · 2 months ago
I missed the rate reset... Nightmare limitation.
@buffaloraouf3411 · 2 months ago
Can you share the prompt? In case your custom GPT is limited, we can try it on Hugging Face.
@briankgarland · 2 months ago
30 messages a week means it's unusable anyway. Can't wait for Claude 4.
@ToolmakerOneNewsletter · 2 months ago
Since you added chain of thought to GPT-4o, you wouldn't expect the same increase on the benchmark tests, right? Did OpenAI state that you first need to add chain of thought to GPT-4o and then compare? Uh, congratulations on your custom GPT, though!?
@adolphgracius9996 · 2 months ago
It's easy to make something after seeing someone else do it and explain how they did it. If OpenAI hadn't told us about the reasoning layer, there is no way you would've figured it out. It's OK for tools not to be perfect. I'd argue that if OpenAI released an AI that makes no mistakes, people would freak out and start panicking, so let's enjoy our time here before Skynet comes online 😅
@protips6924 · 2 months ago
Honestly, it's not that hard. It's simply chain of thought. It doesn't take a genius.
@pmHidden · 2 months ago
Do you seriously believe this was OpenAI's idea? Not only is this a very intuitive thing to come up with when you're working with these models, but there have also been countless publications with similar ideas for years.
@Wreck_Crimes · 2 months ago
That's a reflection thing, not GPT-o1.
@Saif-G1 · 2 months ago
How much is the limit now?
@SkillLeapAI · 2 months ago
I think it's 30 messages per week.
@mattbelcher29 · 2 months ago
Have you tried using a similar custom prompt in Claude Sonnet?
@SkillLeapAI · 2 months ago
Not yet, but I hear people are getting similar results.
@TechnoMageCreator · 2 months ago
For decades, corporations bought intelligence cheaply from other human beings. In their heads, making an AI for themselves will give them absolute power. They built something like that and are trying to charge a lot of money for it. That model will never work; someone smarter will just take it and make a free one. The model is amazing, but anyone can use it for a bit and build their own. Not sure why they are trying to go closed source and make money. It's too late for that step. It's the Wild West for MVPs that have ideas.
@djayjp · 2 months ago
Not encouraging that it can't even get high school math right.... 😒
@michaelmartinez5365 · 2 months ago
On another site I saw it ace grad-school physics.
@Sindigo-ic6xq · 2 months ago
It can!!
@Bjarkus3 · 2 months ago
This is a bad test, honestly. The power of o1 is in giving it complex multi-step workloads.
@SkillLeapAI · 2 months ago
Give me prompt examples and I'll use them next time.
@chasisaac · 2 months ago
Why did you give it the potential answers? You should leave it open and let it come up with its own answer.
@chasisaac · 2 months ago
Also, why do we assume the SAT? Why not use GRE or MCAT questions? SAT questions are still high-school-level questions, and it's been said before that GPT-4 is equivalent to a high school student, so the test doesn't surprise me in any way and actually has pretty much the expected results.
@SkillLeapAI · 2 months ago
I did in my first test in a different video. It was almost always wrong that way.
@chasisaac · 2 months ago
@@SkillLeapAI Well, that is even more telling, and problematic. The basic problem with multiple-choice answers is that they collapse the number line down to four points, so the model can always work backwards to verify the answer, which is why I'm surprised there was even one wrong answer.
@ericandi · 2 months ago
Several of those math questions were wrong, and ChatGPT was correct.
@damienjones9667 · 2 months ago
I'm starting to think the benchmarks are being faked. I'm aware it's a huge reach, but it isn't too crazy an accusation.
@protips6924 · 2 months ago
It doesn't make any sense that a simple prompt can generate the same responses on different models. The models are not open source, but there have to be some major differences. For instance, the API for the o1 model is 5x more expensive per 1 million tokens than the GPT-4o model. Unless OpenAI is pulling some scammer-level move, there has to be some complex reasoning happening in the background. Although it is very possible that this is just a cash grab, as if they don't have enough already.
@SkillLeapAI · 2 months ago
Well, it outputs a whole lot more tokens to give you a response because of chain of thought. GPT-4o just answers without outputting its thought process.
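The token-count point in this exchange is worth making concrete. As a rough, hypothetical sketch of the kind of chain-of-thought system prompt being discussed (the wording and function name here are my own illustration, not the actual instructions of the custom GPT or of o1's hidden reasoning), a CoT wrapper for a chat API might look like:

```python
# Hypothetical sketch of a chain-of-thought (CoT) prompt wrapper.
# The prompt text below is illustrative only -- it is NOT the actual
# instruction set of the custom GPT discussed in this thread.

COT_SYSTEM_PROMPT = (
    "You are a careful problem solver. Before giving a final answer:\n"
    "1. Restate the problem in your own words.\n"
    "2. Work through it step by step, showing intermediate reasoning.\n"
    "3. Double-check each step for arithmetic or logic errors.\n"
    "4. Only then state the final answer on its own line."
)

def build_cot_messages(question: str) -> list[dict]:
    """Wrap a user question in a CoT-style message list for a chat API."""
    return [
        {"role": "system", "content": COT_SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

messages = build_cot_messages("If C = 5/9 * (F - 32), what is C when F = 212?")
print(len(messages))        # 2
print(messages[0]["role"])  # system
```

Because the model is instructed to emit restatement, steps, and a self-check before the answer, the completion is several times longer than a direct reply, which is one reason the per-token cost and rate limits mentioned above bite harder.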
@MichealScott24 · 2 months ago
❤️
@apexo10fc · 2 months ago
o1 is amazing, what are you guys talking about?
@SkillLeapAI · 2 months ago
In what category has it beaten the previous options for you? Also, this is just my opinion.
@TLCMEDIA1 · 2 months ago
I agree, o1 is amazing, way better than GPT-4. I remember giving GPT-4 two invoices, the original invoice and a discounted invoice, then asking it to generate a credit note. GPT-4 hallucinated, but o1 got it right on the first try. You can try it yourself: make the second invoice, say, 30% cheaper but keep one item's price consistent.
@Sindigo-ic6xq · 2 months ago
@@SkillLeapAI PhD physics, math, genetics, medicine. Here you can prompt 4o as much as you want and it can't compete. Please use some problems from the mentioned fields that can be found in advanced textbooks and test both models, also Claude if you want.
@Ro1andDesign · 2 months ago
I used o1-preview until I hit the limit. So far, in my experience, Claude is still FAR better at answering complex questions.
@RippyCrack · 2 months ago
Exactly, even at coding.
@greenboi5632 · 2 months ago
Oh, sure? I used o1 and it's far better than Claude.
@CamPerry · 2 months ago
@@greenboi5632 It absolutely is not. If anything, the new model is worse.
@rj2764 · 2 months ago
I've had 4o answer math problems off of a screenshot. I canceled my subscription, watched your video on the new version, and decided I would give it a try. As for yours, it's crap; I can't even upload a PDF to it.
@TransLearnTube · 2 months ago
Please help with any ChatGPT translation clone for Indian and other foreign languages, or suggest any free Ollama LLM model.
@jamesmarvin1920 · 2 months ago
The prompt was flagged for me.
@xtra_612 · 2 months ago
I got so mad. I needed the money to eat, and I had already paid for Claude AI. Seeing the hype, I bought this, and it is useless. I tried so many ways to make it better, but it is bad. It doesn't follow instructions, and the answers aren't better than 4o, so what's the point of this?
@jonathanparham7421 · 2 months ago
Not a fan.
@Appocalypse · 2 months ago
Not to be mean, but this is one of the most flawed benchmarks I've watched on YouTube in the past week. A LOT of your questions, especially from the SAT set, are either incomplete or incorrect. You should use a source that you validate with your own reasoning first to make sure it's not garbage.
@SkillLeapAI · 2 months ago
OK, posting a more complete test next week.
@SkillLeapAI · 2 months ago
Also, to be clear, this was not at all a benchmark video. I was simply showing that for most use cases, the o1 model gives similar results to the custom GPT with a CoT system prompt, regardless of the input for the prompts. I wouldn't use random questions for an actual LLM benchmark test.
@Appocalypse · 2 months ago
@@SkillLeapAI I understand the intent, and I do think your CoT-prompt GPT makes GPT-4o a lot more effective, but my point was that most of the questions that o1 failed were incomplete or incorrect. Once you run it on a validated, correct set of questions, and you find that your CoT GPT works fairly well compared to o1, your next logical step should be to find even more complex problem sets. The best ones to showcase the differences will be graduate-level physics and math problems.
@SkillLeapAI · 2 months ago
I just finished another, more comprehensive video that I will post soon.
@SkillLeapAI · 2 months ago
If you have a place where I can source questions, please let me know for upcoming videos.
@Hae3ro · 2 months ago
It's bad.
@Horizon-hj3yc · 2 months ago
The AI hype train has arrived again.
@dionk6282 · 2 months ago
It didn't get the dolphin wrong. This is a case where the IQ of the entity answering is higher than that of whoever wrote the question. Some might say that the dolphin being the only one that doesn't lay eggs, for example, matters more than what type of limbs they have.
@universalchaospaladin5019 · 2 months ago
Oh, I scrolled through and didn't see your reply before I replied with a longer version of this.
@Bjarkus3 · 2 months ago
Also, the given reasoning is that it breathes air... Dolphins breathe air... I mean, it is between the dolphin and the turtle for sure, but... Sharks don't lay eggs, btw. Yeah, I know, mind blown.
@matthew04101 · 2 months ago
This proves Strawberry is low-hanging fruit. And I would expect more from OpenAI.
@avi7278 · 2 months ago
This is hilariously misguided. Just goes to show that this stuff is really hard to grasp for the layman.
@SkillLeapAI · 2 months ago
It's a B2C product in chat format for the layman. So if I can't see a huge improvement, as they claimed, in my day-to-day work, what exactly is the difference that makes an impact for me?
@avi7278 · 2 months ago
@@SkillLeapAI Just put it this way: probably nothing you do on a daily basis would benefit from using o1. Your conclusion is correct, but the way you got there is filled with misunderstandings. o1 is for a very specific subset of problems that can benefit from extra reasoning steps, and whether it gets the right answer on a math question still nearly comes down to chance. It could reason through a whole problem correctly and then spit out a wrong answer, because it doesn't actually care about giving you the right answer to a math question, only about what the most likely next token is after it does all that reasoning, and sometimes the next token is not actually the answer even though it probably "knows" the correct answer. Math questions like that are simply benchmarks, and it's a misconception that benchmarks are a measure of how good or bad a model is. Just because they got it to spit out the right token 81% of the time doesn't make it a model that's reliably good at answering math problems, nor does its ability to do so mean it's good. The subset of problems that o1 excels at are problems where there is no one right answer but rather a sliding scale: from gibberish, to a coherent but completely wrong/bad response, to a coherent and generally good response, to a coherent and, by the stroke of mathematical next-token luck, an amazing response. For example, brainstorming, planning, and project structures (especially based on existing frameworks like DDD) all fall within this subset of problems that benefit from advanced reasoning, among many others. It's not for daily-driver use, and no, that prompt running on GPT-4o won't "clone" o1.
@avi7278 · 2 months ago
@@SkillLeapAI YouTube is worthless; I left a detailed comment which just got deleted. Unless you did it for some reason.
@Gallus7631 · 2 months ago
Well, news flash: stop being a hype beast. I get it, YouTubers are trying to play the algorithm, but hopefully not at the cost of your own credibility and, more importantly, dignity. The truth is, the model is NOT that great. It isn't some "powerful" new model that will "revolutionize" anything; it's just hype built on corporate-level buzzwords. I know, because I've literally written marketing policy in previous professional positions, and its selling points are a lot like the nonsense marketing strategies used as far back as 1997. This model is based on pure greed, with the intent to drain people's pockets, along with a flight of ideas from this gradually sinking company. GPT mini will suffice for most applications, and if you understand informatics, it's more than enough to help you build statistical/probability models.
@jasondee9895 · 2 months ago
You need to do a follow-up video to this admitting that you messed up.
@SkillLeapAI · 2 months ago
I just finished a more comprehensive test with a lot more prompts. Posting soon.