RE-Bench: measuring AI agents at AI R&D vs human experts

  5,598 views

Samuel Albanie

1 day ago

Comments: 10
@Think666_ 3 days ago
O3 might impact this paper
@SamuelAlbanie1 2 days ago
Agreed. I'm curious to see the results of the updated eval.
@person-jw7vb 1 day ago
o1 pro alone will probably heavily impact this paper
@sashetasev505 1 day ago
Awesome treatment of a great paper! It feels like someone is doing a calm ASMR deep dive on the paper that begets Skynet, but that’s beside the point 😂😂 Liked, subbed and commented - may the algo-gods kindly extend your reach ❤
@poiitidis 2 days ago
nice
@CrispinCourtenay 1 day ago
Interesting, but running inference and training on such a small number of last-generation GPUs makes this a thought experiment and an intellectual stretch, rather than AI vs. human.
@kaio0777 1 day ago
I agree. Plus, this is not on the latest cutting-edge AIs either, and it is just testing, not being put through the grind of optimizing in a company setting. Honestly, the real test is to make a shell company run completely by AIs at the top, with humans doing only the blue-collar work. If they can run it as well as humans or better, with less oversight, then work as we know it is cooked in the future.
@SamuelAlbanie1 18 hours ago
@kaio0777 Thanks for the comment. I agree that having a company completely run by AIs would be more representative of full automation. However, it's worth noting that a key goal of this kind of work is to make measurements before full automation occurs (often with the goal of informing safety mitigations that would be best to set up in advance).
@SamuelAlbanie1 17 hours ago
@CrispinCourtenay Thanks for the comment. I agree the comparison is imperfect, but I think it does a reasonable job of capturing R&D tasks over the time period of one working day. Also, it's worth noting that many AI R&D experiments are conducted at small scales on older generations of hardware to reduce cost (so in that respect, it is not necessarily unrealistic).
@kaio0777 14 hours ago
@SamuelAlbanie1, yeah, measurements are good, that's why I watched the video, but what they are testing might not work out in practice imo.