RE-Bench: measuring AI agents at AI R&D vs human experts

7,321 views

Samuel Albanie

Days ago

Comments: 10

@Think666_ 1 month ago

O3 might impact this paper

@SamuelAlbanie1 1 month ago

Agreed. I'm curious to see the results of the updated eval.

@person-jw7vb 1 month ago

o1 pro alone will probably heavily impact this paper

@sashetasev505 1 month ago

Awesome treatment of a great paper! It feels like someone is doing a calm ASMR deep dive on the paper that begets Skynet, but that’s beside the point 😂😂 Liked, subbed and commented - may the algo-gods kindly extend your reach ❤

@poiitidis 1 month ago

nice

@CrispinCourtenay 1 month ago

Interesting, but inference and training on such a small number of last-generation GPUs makes this a thought experiment and an intellectual stretch, rather than AI vs. human.

@kaio0777 1 month ago

I agree. Plus, this is not on the latest cutting-edge AIs either, and it is just testing, not put to the grind of optimizing in a company setting. Honestly, the real test is to make a shell company run completely by AIs at the top, with only humans doing the blue-collar work. If they can run it as well as humans or better, with less oversight, then work as we know it is cooked.

@SamuelAlbanie1 1 month ago

@kaio0777 Thanks for the comment. I agree that having a company completely run by AIs would be more representative of full automation. However, it's worth noting that a key goal of this kind of work is to make measurements before full automation occurs (often with the goal of informing safety mitigations that would be best to set up in advance).

@SamuelAlbanie1 1 month ago

@CrispinCourtenay Thanks for the comment. I agree the comparison is imperfect, but I think it does a reasonable job of capturing R&D tasks over the time period of one working day. Also, it's worth noting that many AI R&D experiments are conducted at small scales on older generations of hardware to reduce cost (so in that respect, it is not necessarily unrealistic).

@kaio0777 1 month ago

@SamuelAlbanie1 Yeah, measurements are good (that's why I watched the video), but what they are testing might not work out in practice imo.