Friendly advice: Use Split View to show both windows at the same time without overlap.
@KyleKabasares_PhD · 29 days ago
Thanks 🙏
@Cengage_Chemistry · 27 days ago
Could you please test o1 pro mode on the IMO 2024 problems too?
@RPi-ne5rp · 29 days ago
We can gain so much more from these tests if the person evaluating the performance is able to determine whether the steps and the answers are actually accurate.
@ziadnahdi4343 · 29 days ago
Thank you so much, although a live session would have been appreciated. We look forward to seeing the Olympiad questions tested. Thanks again; always a pleasure, Doc.
@Micro-bit · 29 days ago
Really great job! I try to describe your research to my colleagues at work, but they don't understand what 'it' means for their jobs in the coming years.
@tresormvumbi · 29 days ago
That's interesting; thanks for this, Kyle. I'm curious how Claude 3.5, Gemini 2 Flash, and even full o1 would do comparatively.
@iseetreesofgreen3367 · 29 days ago
@tresormvumbi I only tested B2, but both Claude and full o1 got it right; Flash got it wrong.
@iseetreesofgreen3367 · 29 days ago
For what it's worth, I tried B2 with o1 and it got it right. Here's its answer: "No infinite sequence of distinct (non-congruent) partners can arise from repeatedly reflecting a single vertex across the perpendicular bisector of AC. You get at most two distinct shapes that keep flipping back and forth: ABCD and ABCE (the reflection partner). Thus, 'no two elements are congruent' cannot hold for an infinite chain; the process necessarily repeats after two steps."
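To make the quoted argument's key step explicit, here is a sketch of the involution reasoning in LaTeX (my formalization of the model's answer, not part of its verbatim output; σ is assumed to be the reflection it describes):

```latex
% Sketch: reflection across the perpendicular bisector of AC is an
% involution, so iterating it cannot produce new shapes.
Let $\sigma$ be the reflection across the perpendicular bisector of $AC$,
and let $E = \sigma(D)$. Since $\sigma$ is an involution,
\[
  \sigma^{2} = \mathrm{id}
  \quad\Longrightarrow\quad
  \{\,D,\ \sigma(D),\ \sigma^{2}(D),\ \dots\,\} = \{\,D,\ E\,\},
\]
so the chain $ABCD \to ABCE \to ABCD \to \cdots$ contains at most two
non-congruent quadrilaterals, and an infinite pairwise non-congruent
sequence is impossible.
```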
@iseetreesofgreen3367 · 29 days ago
Claude also got it right.
@SimonNgai-d3u · 28 days ago
o1 pro scored about 10-13 points higher than o1-preview on the Mensa IQ test (120 → 133, and 100 → 110 on the private test set). Not sure what that means, though; I'm looking forward to seeing results on ARC-AGI and SimpleBench.
@awsmith1007 · 29 days ago
I think rigorous grading would give it more like a score in the 20-30 range.
@KyleKabasares_PhD · 29 days ago
Grading physics problem sets and exams as a graduate TA definitely did not prepare me for this, lol. Still, 20-30 is better than what hundreds of talented math undergrads can do. Pretty wild.
@awsmith1007 · 29 days ago
Yeah, it did well. Don't worry, grading Putnam problems can be difficult. Could I ask you the favour of doing the IMO problems? I know o1 pro roughly gets problem 1, but I haven't seen anyone try the other 5. A bit too expensive for me…
@VisualVibes13 · 28 days ago
Would you say that o1-preview was better than o1? If so, why would that be?
@jadpole · 27 days ago
My guess: o1 has support for image understanding and tools, which I assume got "diluted out" of o1-preview during the RL phase. So o1-preview does better on some benchmarks for text-only tasks (which probably generalizes), whereas o1's reasoning ability is more general. o1 also supports system (now called developer) messages, with "instruction hierarchy" tuning. In practice, I would guess that o1 is preferable for almost all practical use cases.
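For anyone curious what a developer message looks like in practice, here is a minimal sketch using the OpenAI Python SDK (assuming the 1.x client and an o1-class model; the model name and prompts are placeholders for illustration):

```python
# Minimal sketch, assuming the openai Python SDK (1.x) and an o1-class
# model that accepts "developer" messages. Model name and prompts are
# placeholders, not confirmed details from this thread.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    messages=[
        # For o1-class models, "developer" replaces the old "system" role
        # and ranks above user messages in the instruction hierarchy.
        {"role": "developer", "content": "Answer with a rigorous, step-by-step proof."},
        {"role": "user", "content": "Show that x^2 + x + 1 > 0 for every real x."},
    ],
)
print(response.choices[0].message.content)
```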
@Atheist-Libertarian · 29 days ago
Try the ARC-AGI benchmark.
@RickySupriyadi · 29 days ago
Last I heard, one team progressed to 55% with reasoning added; I forget which team. We're getting closer to open-ended AI with more than just reasoning capabilities. I'm so excited.
@nashh600 · 29 days ago
@RickySupriyadi Absolutely not. If you read the paper, they basically just found loopholes that make the problem easier; it's not AGI.
@RickySupriyadi · 28 days ago
@nashh600 Hm, which paper? I think we're not talking about the same ARC-AGI.
@nashh600 · 28 days ago
@RickySupriyadi No, we are. If you look at the papers behind the top results, they all use "data augmentation techniques", which basically means using rotations and similar transforms to further fine-tune the model on the specific task, so it isn't AGI.
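For concreteness, here is a minimal sketch of the kind of rotation/reflection augmentation being described, assuming ARC-style grids stored as small 2D integer arrays (function and variable names are illustrative, not taken from any specific paper):

```python
# Minimal sketch of dihedral data augmentation for ARC-style grids,
# assuming each grid is a small 2D array of color indices.
import numpy as np

def augment_grid(grid: np.ndarray) -> list[np.ndarray]:
    """Return the 8 dihedral variants of a grid: 4 rotations, each
    with and without a horizontal flip."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(grid, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored counterpart
    return variants

example = np.array([[0, 1],
                    [2, 0]])
print(len(augment_grid(example)))  # -> 8 fine-tuning variants from one grid
```

Each variant keeps the task's underlying rule intact while multiplying the fine-tuning data, which is why it helps on the specific benchmark without demonstrating general reasoning.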
@RickySupriyadi · 28 days ago
@nashh600 Hmm, could you please be more specific? Where do I find the paper you're talking about? arXiv?
@JansthcirlU · 29 days ago
It's quite interesting that it gets these super hard problems right but fails to answer even half of the SimpleBench questions correctly.
@TheRealUsername · 29 days ago
Because these models cannot generalize like humans or reason from first principles.
@albertatsma4142 · 29 days ago
@TheRealUsername SimpleBench is a good test of whether a model has a good world model, and humans of course have a good world model. It's a good test, but less so for reasoning.
@CheranganiHills · 29 days ago
Even geniuses are said to lack common sense. 😂😂
@mahalisyarifuddin · 28 days ago
"AI Explained" mentioned! 🗣️
@bartoszbieganowski8613 · 13 days ago
It doesn't. Read the solutions and you will see that they are not correct; only the final answers agree.
@raffaelecarbuglio · 29 days ago
I think the correct way to do this is to have the o1 pro solutions analysed by an expert who knows the Putnam grading system.
@marfmarfalot5193 · 28 days ago
Bro has a PhD
@bartoszbieganowski8613 · 16 days ago
@marfmarfalot5193 And what of it? Check the B6 solution that he said was correct: it would earn no more than 2 points out of 10. The AI uses "approximations" without any rigorous proof. It is only heuristics, no mathematics at all.
@RickySupriyadi · 29 days ago
Agents are basically like mind chatter organized in a logical way, and some come with tools.