Friendly advice: Use Split View to show both windows at the same time without overlap.
@KyleKabasares_PhD · 29 days ago
Thanks 🙏
@Cengage_Chemistry · 27 days ago
Could you please test o1 pro mode on the IMO 2024 problems too?
@RPi-ne5rp · 29 days ago
We can gain so much more from these tests if the person evaluating the performance is able to determine whether the steps and the answers are actually accurate.
@ziadnahdi4343 · 29 days ago
Thank you so much, although a live session would have been appreciated. We look forward to seeing the Olympiad questions tested. Thanks again; always a pleasure, Doc.
@Micro-bit · 29 days ago
Really great job! I try to describe your research to my colleagues at work, but they don't understand what 'it' means for their jobs in the coming years.
@tresormvumbi · 29 days ago
That's interesting; thanks for this, Kyle. I'm curious how Claude 3.5, Gemini 2 Flash, and even full o1 would do comparatively.
@iseetreesofgreen3367 · 29 days ago
@tresormvumbi I only tested B2, but both Claude and full o1 got it right; Flash got it wrong.
@iseetreesofgreen3367 · 29 days ago
For what it's worth, I tried B2 with o1 and it got it right. Here's its answer: "No infinite sequence of distinct (non-congruent) partners can arise from repeatedly reflecting a single vertex across the perpendicular bisector of AC. You get at most two distinct shapes that keep flipping back and forth: ABCD and ABCE (the reflection partner). Thus, 'no two elements are congruent' cannot hold for an infinite chain; the process necessarily repeats after two steps."
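To make the quoted argument's key step explicit, here is a sketch of the involution reasoning in LaTeX (my formalization of the model's answer, not part of its verbatim output; σ is assumed to be the reflection it describes):

```latex
% Sketch: reflection across the perpendicular bisector of AC is an
% involution, so iterating it cannot produce new shapes.
Let $\sigma$ be the reflection across the perpendicular bisector of $AC$,
and let $E = \sigma(D)$. Since $\sigma$ is an involution,
\[
  \sigma^{2} = \mathrm{id}
  \quad\Longrightarrow\quad
  \{\,D,\ \sigma(D),\ \sigma^{2}(D),\ \dots\,\} = \{\,D,\ E\,\},
\]
so the chain $ABCD \to ABCE \to ABCD \to \cdots$ contains at most two
non-congruent quadrilaterals, and an infinite pairwise non-congruent
sequence is impossible.
```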
@iseetreesofgreen3367 · 29 days ago
Claude also got it right.
@SimonNgai-d3u · 28 days ago
o1 pro scored about 10-13 points higher than o1-preview on the Mensa IQ test (120 → 133, and 100 → 110 on the private test set). Not sure what that means, though; I'm looking forward to seeing results on ARC-AGI and SimpleBench.
@awsmith1007 · 29 days ago
I think rigorous grading would give it more like a score in the 20-30 range.
@KyleKabasares_PhD · 29 days ago
Grading physics problem sets and exams as a graduate TA definitely did not prepare me for this, lol. Still, 20-30 is better than what hundreds of talented math undergrads can do. Pretty wild.
@awsmith1007 · 29 days ago
Yeah, it did well. Don't worry, grading Putnam problems can be difficult. Could I ask you the favour of doing the IMO problems? I know o1 pro roughly gets problem 1, but I haven't seen anyone try the other 5. A bit too expensive for me…
@VisualVibes13 · 28 days ago
Would you say that o1-preview was better than o1? If so, why would that be?
@jadpole · 27 days ago
My guess: o1 has support for image understanding and tools, which I assume got "diluted out" of o1-preview during the RL phase. So o1-preview does better on some benchmarks for text-only tasks (which probably generalizes), whereas o1's reasoning ability is more general. o1 also supports system (now called developer) messages, with "instruction hierarchy" tuning. In practice, I would guess that o1 is preferable for almost all practical use cases.
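For anyone curious what a developer message looks like in practice, here is a minimal sketch using the OpenAI Python SDK (assuming the 1.x client and an o1-class model; the model name and prompts are placeholders for illustration):

```python
# Minimal sketch, assuming the openai Python SDK (1.x) and an o1-class
# model that accepts "developer" messages. Model name and prompts are
# placeholders, not confirmed details from this thread.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1",
    messages=[
        # For o1-class models, "developer" replaces the old "system" role
        # and ranks above user messages in the instruction hierarchy.
        {"role": "developer", "content": "Answer with a rigorous, step-by-step proof."},
        {"role": "user", "content": "Show that x^2 + x + 1 > 0 for every real x."},
    ],
)
print(response.choices[0].message.content)
```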
@Atheist-Libertarian · 29 days ago
Try the ARC-AGI benchmark.
@RickySupriyadi · 29 days ago
Last I heard, one team progressed to 55% with reasoning added; I forget which team. We're getting closer to open-ended AI with more than just reasoning capabilities. I'm so excited.
@nashh600 · 29 days ago
@RickySupriyadi Absolutely not. If you read the paper, they basically just found loopholes that make the problem easier; it's not AGI.
@RickySupriyadi · 28 days ago
@nashh600 Hm, which paper? I think we're not talking about the same ARC-AGI.
@nashh600 · 28 days ago
@RickySupriyadi No, we are. If you look at the papers behind the top results, they all use "data augmentation techniques", which basically means using rotations and similar transforms to further fine-tune the model on the specific task, so it isn't AGI.
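For concreteness, here is a minimal sketch of the kind of rotation/reflection augmentation being described, assuming ARC-style grids stored as small 2D integer arrays (function and variable names are illustrative, not taken from any specific paper):

```python
# Minimal sketch of dihedral data augmentation for ARC-style grids,
# assuming each grid is a small 2D array of color indices.
import numpy as np

def augment_grid(grid: np.ndarray) -> list[np.ndarray]:
    """Return the 8 dihedral variants of a grid: 4 rotations, each
    with and without a horizontal flip."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(grid, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored counterpart
    return variants

example = np.array([[0, 1],
                    [2, 0]])
print(len(augment_grid(example)))  # -> 8 fine-tuning variants from one grid
```

Each variant keeps the task's underlying rule intact while multiplying the fine-tuning data, which is why it helps on the specific benchmark without demonstrating general reasoning.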
@RickySupriyadi · 28 days ago
@nashh600 Hmm, could you please be more specific? Where do I find the paper you're talking about? arXiv?
@JansthcirlU · 29 days ago
It's quite interesting that it gets these super hard problems right but fails to answer even half of the SimpleBench questions correctly.
@TheRealUsername · 29 days ago
Because these models cannot generalize like humans or reason from first principles.
@albertatsma4142 · 29 days ago
@TheRealUsername SimpleBench is a good test of whether a model has a good world model, and humans of course have a good world model. It's a good test, but less so for reasoning.
@CheranganiHills · 29 days ago
Even geniuses are said to lack common sense. 😂😂
@mahalisyarifuddin · 28 days ago
"AI Explained" mentioned! 🗣️
@bartoszbieganowski8613 · 13 days ago
It doesn't. Read the solutions and you will see that they are not correct; only the final answers agree.
@raffaelecarbuglio · 29 days ago
I think the correct way to do this is to have the o1 pro solutions analysed by an expert who knows the Putnam grading system.
@marfmarfalot5193 · 28 days ago
Bro has a PhD
@bartoszbieganowski8613 · 16 days ago
@marfmarfalot5193 And what of it? Check the B6 solution that he said was correct: it would earn no more than 2 points out of 10. The AI uses "approximations" without any rigorous proof. It is only heuristics, no mathematics at all.
@RickySupriyadi · 29 days ago
Agents are basically like mind chatter organized in a logical way, and some come with tools.