ChatGPT with Search, Altman AMA

111,248 views

AI Explained

Day ago

Comments: 501

@Juddersbaby1 2 months ago

Hi algorithm, I am engaging with this video

@apester2 2 months ago

Same here!

@aiexplained-official 2 months ago

Obligatory reply for engagement

@lukacosic9784 2 months ago

same here 🎉

@gr8fulUS 2 months ago

same same

@upsidedownChad 2 months ago

same here

@dcgamer1027 2 months ago

I just think it's so cool that you saw a problem with the testing data set and, rather than just complain about it, you made your own, and one that seems genuinely useful too. Idk, I just love that. I love seeing people try to fix problems they see rather than just let them sit there. Nice

@mrmooshon5858 2 months ago

I've just realized I rarely take the time to thank you for your work. You are, 100%, the best AI-related channel on YouTube, and every time I see you've posted I go "babe wake up, a new AI Explained video just dropped"

@aiexplained-official 2 months ago

Thanks so much moo

@pauljones9150 2 months ago

2:12 "they said to expect it shortly, so expect it in 2030" 😂😂🤣

@holo23 2 months ago

Nah he violated openai with that statement😭

@technicolourmyles 2 months ago

I KNOW, FINALLY someone calling them out on this shi

@GrindThisGame 2 months ago

same time as Tesla self driving and fusion.

@DominicI1 2 months ago

Funny enough, I actually got the new "Search the web" feature on my free ChatGPT account already. I ended up upgrading a couple hours after playing around with it though so that I don't have to be worried as much about rate limiting.

@bentownsend4017 2 months ago

Who wants self driving cars bro we need trains

@ryzikx 2 months ago

ive been less into ai news lately but i will never skip an aiexplained video 👍

@aiexplained-official 2 months ago

@adfaklsdjf 2 months ago

absolute top notch, no others are even close

@corwinzelazney5312 2 months ago

@@adfaklsdjf no question, yes.

@corwinzelazney5312 2 months ago

Same same

@hyperchris01 2 months ago

Ditto

@Neomadra 2 months ago

I feel like the answer they gave on hallucination is quite misleading. It's rarely because source data by humans have flaws. It's mainly because it models language and not the real world, so even with perfect flawless data a language model will hallucinate when modeling language is not the same as modeling the real world. And that happens most often with previously unseen requests and questions.

@antonystringfellow5152 2 months ago

They don't seem to have any level of understanding. They generate content using statistically probable associations. This can make them appear intelligent because they contain so much data... more than any human brain. But then they'll produce something really stupid or even contradict themselves. This is because they don't understand anything in the first place, they only appear to.

@iau 2 months ago

Yeah, it was a cop-out answer. I believe that until the fundamental architecture of these models is improved to treat facts as something inherently different than the likely next word, they will keep _adjusting_ facts to maximize output likelihood.

@adfaklsdjf 2 months ago

they have no internal representation of truth.. it's all just word association. until there's some explicit internal representation of truth, the model doesn't "know" what it does/doesn't "know", so it's bound to hallucinate sometimes

@Blanksmithy123 2 months ago

Exactly, it’s a weasel answer… Even if you curated a model to train only on accurate information, it would still hallucinate.

@haileycollet4147 2 months ago

This would be rather hard to test, but I'd be very curious to see how much hallucination does drop when trained on "perfect" text, all else equal (model arch & size, # training tokens, hyperparams, etc.)

@DylanKane-u2j 2 months ago

you have such a talent to summarize this frontier of info and share it here. Thank you

@aiexplained-official 2 months ago

Thanks Dylan, means a lot

@DemetriusTrumpClips 2 months ago

The real question is, will we get AGI before GTA6 comes out!?

@adfaklsdjf 2 months ago

that is a good question.. 🤔

@schmutz06 2 months ago

I always dismiss this as a meme but it just struck me it is genuinely a neck and neck race, especially factoring in any delay to GTAVI being the mammoth project it is. potential for either first

@ElRak123 2 months ago

AGI will then code it 😂 ready to play with no patches needed 😂

@zyzhang1130 2 months ago

Maybe GTA6 is the AGIs we made along the way

@arnv4487 2 months ago

No, no we won't

@bjensen 2 months ago

No ads until they crush the competition- then never ending en****tification.

@maciejbala477 2 months ago

yeah, this is why competition is important. If there's no viable alternative, there's nothing to stop OpenAI or Anthropic or whoever else ends up dominating from doing whatever the hell they want

@jeremycronic 2 months ago

Dude that juggler ladder question is crazy. I tried a few simple bench questions before u covered it in this video. I was SO confused with that question. I was like 'the ball is on the ground' but that's not an answer. I got it right after re-reading all the answers a few times, but damn that would be a wild kind of question for an LLM to figure out. Well done with those questions.

@panyusheng7240 2 months ago

I think some questions are designed to be unusual, like frying an ice cube while frying an egg. So it's very unlikely that a model has already seen similar scenarios in its training data. In order to get a high score on this benchmark, a model must have some basic understanding of the world and be able to apply reasoning to unseen cases.

@schmutz06 2 months ago

This is impressively hot out the oven! I followed it all and wanted to see a breakdown of the AMA and who better than everyone’s top AI explainer. Perfect stuff.

@urax5341 2 months ago

Hey, before I watch the whole video, great that you managed to make a paper about SimpleBench. It's great to see you evolving and being a part of AI-related research! You are not just a YouTuber now but also a scientist. I am really happy for you!

@jonp3674 2 months ago

Search is really useless without reliability so it's really surprising they've gone for that. Imo the thing I use AI the most for is foreign language practice and they could nail that, as it doesn't really matter if the model hallucinates while you're having a conversation with a tutor, and the model's ability to explain vocab and grammar and create exercises etc is excellent because language is its core jam. They could make a really amazing video avatar product which takes the whole market without much difficulty I think.

@maciejbala477 2 months ago

I think it's not useless, it's just markedly less useful. If you think about it, if you use search in order to find human sources/links (which otherwise might be challenging nowadays on pages like Google) then it will still be useful, and probably more useful than Google Search. But obviously, you can't blindly engage with it and expect it to get things perfectly right.

@Шестик-ю1ц 2 months ago

A friend recommended this channel to me almost constantly for about half a year, and I was reluctant about it. But it took just one video... and here I am, waiting impatiently for each new one

@JL2579 2 months ago

Without having tried the new search GPT yet, my previous experience with AI search has been really underwhelming. When it's simple stuff, googling is often faster, but on more advanced stuff it usually hallucinates or mixes up information horribly. And that's the only time it would actually be useful: if it could quickly sift through lots of online information, filter out irrelevant stuff, wrong info and noise, and summarize the results. It's great when it already knows the facts and doesn't need to Google them. One example was when I wanted to know what percentage of the population in Switzerland owns guns. Repeatedly, even after I made it aware of the problem, it gave me an estimate based on dividing the number of registered weapons by the population. That yields a number that is way too high (for the US it would be over 100 percent, because there are more weapons than people), since gun owners can and often do own multiple guns: a much smaller percentage of the population owns at least one, but many of those own several. To this day I still don't have an answer 😅
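
A minimal sketch of the arithmetic described in the comment above, using made-up numbers rather than real Swiss or US statistics. The function names and the `guns_per_owner` parameter are hypothetical and only illustrate why guns per capita overstates the share of people who own a gun.

```python
def guns_per_100_people(registered_guns: int, population: int) -> float:
    """The naive estimate: registered guns divided by population, as a percentage."""
    return 100 * registered_guns / population


def owner_share_percent(registered_guns: int, population: int,
                        guns_per_owner: float) -> float:
    """Share of people owning at least one gun, assuming (hypothetically)
    that each owner holds `guns_per_owner` guns on average."""
    owners = registered_guns / guns_per_owner
    return 100 * owners / population


# Invented figures: 1,000 people, 400 registered guns, 2.5 guns per owner.
naive = guns_per_100_people(400, 1000)
adjusted = owner_share_percent(400, 1000, 2.5)
print(f"naive: {naive:.1f}%, adjusted: {adjusted:.1f}%")

# With more guns than people, the naive figure exceeds 100%.
print(guns_per_100_people(1200, 1000))
```

With these invented inputs the naive figure is 40% while the adjusted share is 16%; a real answer would still need survey data on how many guns each owner holds.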

@thefinn0tube_ 2 months ago

I just gave search GPT a go with your Switzerland gun ownership search and it gave a pretty adequate answer. Obviously I have no idea the accuracy of its information but it seems believable: "In Switzerland, it's estimated that about 28.6% of households have firearms, with 10.3% specifically owning handguns. WIKIPEDIA This means roughly one in four Swiss households is armed. However, since individuals can own multiple firearms, the actual percentage of the population owning guns might be lower. The country has a high rate of gun ownership, with estimates ranging from 27.6 to 54.5 firearms per 100 residents... etc"

@Picteon 2 months ago

My experience with search is that it completely forgets what my question was and just repeats the results to me, even if they're completely unrelated, with zero critical thinking

@ZeerakImran 2 months ago

@@Picteon I used to get that but no longer have those issues after getting my custom instructions right. It also no longer just gives me mainstream info from obviously biased sources like the BBC or any mainstream company which has an interest in the topic. I have noticed it being really good now. It used to be useless, but now it's straight up quite honest with me and admits when something sucks. It's no longer polite and is quite critical now, which is great. It's not providing generic maybe-this-maybe-that nonsense. I did specifically instruct it against maybes and “could”… I may be able to fly. I could be a Dalek… I don’t want these words; they mean nothing. So I instructed it to work around them and to be more concise and careful, and to have integrity and honesty as a core part of its values. It now tells me “it is possible (likely) that…” or “it is unlikely (possible)…” or “it is possible (unlikely)…”. That’s how I instructed it to tell me instead. I also have it not give me generic nonsense information that doesn’t apply. If I asked for this, I meant this. If you’re not sure, ask me for more information. Check online. Don’t provide any information that isn’t relevant unless you deem it to be valuable. The downside of my instructions is that they work really well for ME! These instructions will make it worse for a lot of people. When I’m using it, I’m very careful with my words, and I have also instructed it to take what I’m saying to mean exactly what I said instead of correcting it by assuming. If I have said something that doesn’t make 100% sense, then bring it up. Don’t assume or correct anything because I can’t be wrong. This won’t be helpful for others who wouldn’t write or speak in a very careful way. If you have experience with programming, it’ll work great. If you’re a very technical person or an engineer, this will work really well for you. If I tell it something wrong, I want to get a wrong answer that is right for what I said, rather than it assuming what I meant and changing it. Just as your programming code won’t change itself to correct the error.

@jan.tichavsky 2 months ago

@@ZeerakImran since when is the BBC a biased source? If anything, the alternative media come with a large bias because they are trying to push a specific agenda, often with one or a couple of large funding sources (undisclosed, of course).

@reshit7003 2 months ago

chat: this could be related to the fact that 4o is the backbone and not o1-level models? maybe we will see a different story soon. i think openai has this strategy of direct user validation, without pushing the real boundaries. the goal is just to see what people like and what they do not

@eqe-kui-nei 2 months ago

Our team spent months building a RAG that uses web-scraped content as its knowledge base. I just tried ChatGPT search, and it's as good as what we built 😢.

@ginogarcia8730 2 months ago

don't worry man, AI is coming for all our jobs

@Retsamster 2 months ago

@@ginogarcia8730 😢

@Alpha13yt 2 months ago

You can still give them some competition for when it inevitably dives in quality because of corporate greed

@practical-ai-engineering 2 months ago

I built this in 1 day...not hard to do

@calrt 2 months ago

2:18 "over the coming months, so expect it around 2030" 😂

@24-7gpts 2 months ago

lol yeah just "in the coming weeks" 💀💀

@jaxonterrill8416 2 months ago

I love AI Explained. Funny, short, concise, and extremely informative.

@aiexplained-official 2 months ago

Thanks jax!

@sagetmaster4 2 months ago

I think just having additional administrative layers on top will solve the hallucination problem. Big companies just haven't been working on these architectures because the model itself has been improving so fast it hasn't been necessary

@AIForHumansShow 2 months ago

of course you have your own benchmark now -- love this and pls just keep making these

@interstellar6647 2 months ago

Yay! New video! Always looking forward to it

@aiexplained-official 2 months ago

Thank you!

@kingdavid9422 2 months ago

Thank you, AI Explained, for the update on SearchGPT. Something to note here: many people in the comments are commending SearchGPT (or is it ChatGPT with search, like you emphasised in your video), especially its user interface. However, we should all remember that OpenAI is desperately in need of funds for the further advancement of its products. So don't be surprised when the interface of ChatGPT with search gradually starts to change and becomes cluttered with sponsored ads. I hope that will not be the case in the future. Time will tell.

@AlexanderMoen 2 months ago

They're different business models though. OpenAI wants to make its money off API charges, with subscriptions as a secondary source. That being said, most huge companies end up going public, which leads to an expectation of continuous returns even after a market is saturated, at which point they may very well try ads and other avenues to continue profits. Hopefully, the business model they have now holds for a while

@anonymes2884 2 months ago

Simple Bench is a really interesting approach, I like how easy it is to stay ahead of the training set (we as humans can come up with novel "puzzles" along those lines basically ad infinitum). Not sure it's measuring _usefulness_ (in the commercial sense) as well as some benchmarks but it seems much closer to measuring real-world _reasoning_ than most. Also appreciate that the paper's 8 pages BTW - a lot of AI papers seem to be in the 60+ pages range and I don't have time to do more than skim them (and watch your summaries of course :).

@claudioagmfilho 2 months ago

🇧🇷🇧🇷🇧🇷🇧🇷🇧🇷👏🏻, Your videos are absolutely essential for the AI community! I seriously can’t go a day without watching them as they get released. They’re always so concise and straight to the point, something new every time no one is thinking of but you. The way you break things down is incredible, it feels like watching a movie, way beyond what anyone else is doing on their videos on these models and all. Thanks for sharing all this with us and making complex AI topics so engaging...

@aiexplained-official 2 months ago

Thank you Claudio!

@wyqtor 2 months ago

It's just amazing how Claude 3.6 Sonnet holds his own in front of much more expensive models trained using some kind of reasoning secret sauce! For this reason, Claude remains my go-to LLM for coding, math, and general learning.

@24-7gpts 2 months ago

*3.5

@ShawnFumo 2 months ago

It's also interesting... we have no idea if they have an o1-ish model internally. If they did, they could certainly use that for training data to help fine-tune Sonnet, like how OpenAI did for the search capability.

@ShawnFumo 2 months ago

@@24-7gpts A lot of people are calling "Claude 3.5 Sonnet (new)" 3.6 just to emphasize that it feels like a bigger difference than most date-style updates

@24-7gpts 2 months ago

@@ShawnFumo They might have something bigger in the works to release.

@carloslfu 2 months ago

lol! Loved Perplexity result about Simple Bench!

@MateenTariq-f2u 2 months ago

Finally a comparison video that actually uses normal everyday examples to give an idea of which model is better day to day!!

@Shaunmcdonogh-shaunsurfing 2 months ago

Really good work with Simple Bench. Bravo!

@rosslmccallum 2 months ago

Thanks for this overview.

@InternetStranger10101 2 months ago

Love this video, can’t wait to see more

@aiexplained-official 2 months ago

Coming!

@SiddhantGautam-o3x 2 months ago

Hey man, nice video. Just some appreciation from a subscriber for how you reaffirm or debunk the hype and breakthroughs of this AI age. A small request: it would be great if future YT videos could cover how your Simple Bench works, or the newly launched models, and also include some tutorials on how your website works. Anyway, great work man 👍

@aiexplained-official 2 months ago

Will do!

@michael-jones 2 months ago

that search breakdown is pure class

@geoffdavids7647 2 months ago

Daaamn those simple bench questions are really tugging at the edge of what a human can reliably figure out without properly sitting down and studying them. I am embarrassed to say I scored only 80% on the try-it-yourself - I was fully out-foxed by the man seeing the light fall in the bathroom question, and the glove falling out of the car question. The glove I might have missed by doing it too quickly, but the light in the bathroom one I sat and thought on for several minutes and only figured it out once I knew my first guess was wrong 😩 I fed both questions I got wrong and their multi-choice answers into o1 preview, prefaced by only "think carefully, the following might be a trick question". It got both of them absolutely right in one go, and perfectly figured out the trick in each. I am mortified. I feel like these questions really rely on visualisation and a robust world-model. I might have gotten them all right if I'd really tried to build a visual mental image of the situations a bit more carefully, or even drawn them out. If these LLMs were trained more extensively on spatial understanding using physical 3d model simulations? I wouldn't be surprised if it was able to smash through most of these.

@aiexplained-official 2 months ago

Great points and fascinating feedback

@cagycee5296 2 months ago

Thank you for an awesome video as always. Still the #1 AI channel I like to follow.

@wes8645 2 months ago

Great video man, I love seeing the notification for a video of yours!

@skuffd-semicolon 2 months ago

Algorithm engagement, this video is super great! Would like to see more of AI Explained, it's a great channel for any level of AI / AGI / LLM enthusiasm 💪💪🤯💯

@keeganpenney169 2 months ago

Glad to see the wheels are still trucking on phil!

@arminalimardani9232 2 months ago

Would you please provide a suggested citation for your paper? Also, you forgot the link to your paper in the description :)

@runvnc208 2 months ago

I don't think anyone is going to read this, but it seems obvious to me that a large model that has very good video understanding, especially trained on something like random physics scenarios and questions about them, with good grounding of the language with video representations, _will_ be able to crush Simple Bench. It just needs a spatial-temporal and Q&A dataset, maybe with everything in the same latent space or at least really effectively linked spaces, that is aimed at his types of questions. Maybe a general purpose multimodal model that just has a lot of physics experiments, juggling, toddlers playing with blocks, etc. Maybe something like Sora but with a Q&A capability somehow. I think this is going to be a lot easier when we get another 10 or 50 times increase in efficiency or scaling to more easily handle models training and inferencing on video and video transcripts etc. that can use significantly more RAM. Regardless, I think we could easily see an 82% AI score on Simple Bench within the next year, 2 years at most.

@chrisanderson7820 2 months ago

When people talk about us running out of training data I think they forget the sheer volume of 3D spatial information yet to be put into these models. The amount that neural nets will learn from embodiment and being "out n about" via video feeds is immense. Stuff like Nvidia labs etc is just the start.

@AlexanderMoen 2 months ago

I don't think he or anyone else doubts that an AI could eventually beat Simple Bench. It's just a matter of when. Even what you just laid out would take a lot. Plus, it might not even be deemed the next best capability to focus on building.

@ShawnFumo 2 months ago

@@chrisanderson7820 Yeah and there's constant innovations in ways to do synthetic data or use real data in different ways. Like just today I saw a paper (I think with Jim Fan?) where they managed to get a humanoid robot to learn a new task from just 5 teleoperated demonstrations, because they were able to auto-extrapolate those demonstrations into variations and train it on all of those in a simulator in a way that translated zero shot to the real world.

@matterhart 2 months ago

I'd be really interested in a deep dive video on the Simple Bench design, perhaps including your collaborators. In addition to paper details, I'd really like to hear about options you all considered but didn't pursue, and why. Or things you tried that didn't work or weren't practical. The things you can get out of researchers at a conference or over coffee. Might need an AI Explained More channel.

@claudioagmfilho 2 months ago

I'm eagerly awaiting the release of real-time video for Plus users from OpenAI, as it was originally mentioned as part of the ChatGPT Omni update, which sadly never reached us. This feature will be revolutionary, enabling us to tackle a wide range of daily tasks more efficiently. Real-time video integration within ChatGPT would greatly enhance productivity by allowing for interactive, dynamic assistance and more streamlined workflows. It would be especially useful for tasks like desktop sharing: being able to visually assist and collaborate on real-time activities is just phenomenal. I hope this feature rolls out soon, as it could drastically improve how we approach everyday challenges. Hopefully Gemini 2.0 will bring this to us.

@stephenrodwell 2 months ago

Thanks! Excellent content, as always! 🙏🏼

@technicolourmyles 2 months ago

In the fourth SimpleBench question, with the two lying sisters, the given problem never mentions there only being two paths. The answer marked correct, "What path leads to the treasure?" would return a lie. If there were only two paths to choose from, that would work, but what if there are 3? Am I missing something?

@guitar2935 2 months ago

Yeah this is the only question I genuinely don't think is correct. There's no reason to assume there are only 2 paths, and for an arbitrary number of paths, the only thing the lying sister couldn't possibly say in reply to option A is where the treasure actually is.

@PianothShaveck 2 months ago

Yes, I also sent feedback about that. The correct answer should definitely be A), and not C)

@footleg3310 2 months ago

I don’t understand question #7. Are you saying that John is the 24-year-old bald man who watched himself get hit by the bulb (because the mirror is so small he couldn’t be seeing anyone else, I guess?) Even given that, an apology would only be redundant if he had already apologized (apologised) to himself - which he had not - in fact, he called himself an idiot.

@technicolourmyles 2 months ago

@@footleg3310 yes, John is the bald man. He's brushing his teeth in front of a mirror, most likely indicating he's in his own bathroom. It would be redundant to send an apology text to yourself.

@lyte69 2 months ago

Was waiting for the simple bench release, love the work, thank you as always for the great video!

@jamesneutron2690 2 months ago

great work with simple bench. love that last comment aha

@aiexplained-official 2 months ago

Haha thanks !

@zyzhang1130 2 months ago

Congrats on the release!

@zyzhang1130 2 months ago

Also hope u can continuously safeguard ur private dataset. It’s a big issue nowadays for so many benchmarks

@Bolidoo 2 months ago

The fact we got a 41.7% on simple bench is already quite remarkable to me. It suggests LLMs do have a certain real world understanding. Even if it seems contradictory, simple bench makes me optimistic about the future of LLMs.

@draken5379 2 months ago

Congrats on the paper bro !

@George-Aguilar 2 months ago

I work in Google Advertising and this is an imminent threat! 😮

@thehighhnotes 2 months ago

Tested search with some local euro news highlights, very capable and verrry quick :)

@kelvinmunyimbili6078 2 months ago

I'm impressed at how Anthropic is neck and neck with o1 without seeming to break a sweat.

@ksprdk 2 months ago

Thanks for including my question 😉

@aiexplained-official 2 months ago

Which one was that?

@ksprdk 2 months ago

@@aiexplained-official the bold prediction (same username 😉)

@HorizonIn-Finite 2 months ago

I asked my coworkers this question, out of 12 people one got it right. Numbering the sentences and getting rid of the multiple choice increases performance. Changing it to balancing a yellow balloon on head while walking 100m, gets 100% so far. (No multiple choice) It seems to always choose D, but often times says that they are both on the ground within the same response

@micbab-vg2mu 2 months ago

My own experiments are in line with your benchmark; I got a similar level of reasoning from o1-preview versus the new Claude Sonnet. Perplexity hallucinated as well on the Premier League question :) - maybe models have a problem scraping data from tables. Gemini Advanced doesn't answer the question: "Unfortunately, I cannot give you real-time standings for the Premier League. Sports standings change very frequently!" Still, old-school Google search is more accurate :)

@ArnaudMEURET 2 months ago

I use Brave search by default and it’s often insufficient _but_ since they added the AI summary last year, it has proven extremely helpful in most of my daily queries (mostly development and general tech queries)

@aiexplained-official 2 months ago

Good shout

@ntelas46 2 months ago

It’s also available for all SearchGPT waitlist users like me. Even on the free plan.

@luizpereira7165 2 months ago

Are you the creator of SimpleBench? Congratulations, it's a great benchmark. You could run an interesting little experiment with it: change some details of the questions (without making any difference to the reasoning), like the order in which names appear, the names of people, or the shapes and colors of objects, to see if the models' performance decreases.
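
The experiment suggested above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `rename_entities` and `shuffle_options` and the sample sentence are hypothetical, and nothing here reflects how SimpleBench is actually run.

```python
import random
import re


def rename_entities(text: str, mapping: dict[str, str]) -> str:
    """Swap surface details (names, colors, shapes) in a single pass,
    so that e.g. swapping two names does not undo itself."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def shuffle_options(options: list[str], correct_index: int,
                    seed: int) -> tuple[list[str], int]:
    """Reorder the answer options and return the new index of the correct one."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_index)


# Hypothetical question text, not taken from the benchmark.
original = "Beth hands Tom a red ball. Tom thanks Beth."
variant = rename_entities(original, {"Beth": "Tom", "Tom": "Beth", "red": "green"})
print(variant)

options, answer = shuffle_options(["A-text", "B-text", "C-text"], correct_index=1, seed=7)
print(options, options[answer])
```

Scoring a model on the original and perturbed variants and comparing the two accuracies would show whether performance depends on surface details that should not matter to the reasoning.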

@aiexplained-official 2 months ago

Yeah of course I am!

@thesobercoder 2 months ago

Love the EPL test.

@thedrakenorton 2 months ago

Great stuff as always, Philip

@conor.brennan 2 months ago

8:21 I don't believe that hallucinations are occurring because the models have been trained on human-written data, that seems naive. It seems to be caused by something more fundamental to how these models generate, not just the training data.

@Pabz2030 2 months ago

Was trying this out just after it dropped. Man it's so nice to actually have a web search again that isn't just an Ad Auction house. And not only that you have a conversation with it to hone your results. Adios Google.

@adfaklsdjf 2 months ago

5:13 when you said you might disappoint some of us, I braced myself mentally for you to say you thought the search AI overviews weren't all that bad actually. Then when you said they've made search worse, I felt relief but also confusion.. does anyone think Google's AI overviews haven't made search worse? If so please share thoughts..

@pablorodriguez6318 2 months ago

Best ai channel

@jerobarraco 2 months ago

3:50 That's a key point, and it has been there since the start. It's dangerous to trust the output blindly. And the more complex the system, the harder it is to spot. (Not necessarily the less obvious or catastrophic).

@connoross13 2 months ago

Hello algorithm, I am also engaging with this video

@WilliamBoothClibborn 2 months ago

Woo, new simple bench

@hannibal8049 2 months ago

Hi algorithm, I am engaging with this video

@cheslerpark7223 2 months ago

Hi, I enjoy your YT vids and decided to check out Simple Bench. I happen to have a concern with the first question that came up, and I think you should give it another look and maybe revise it. The question: "Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute?" Ok, why I think it's problematic is because it is a complete assumption that the "frying pan" is sufficiently heated to melt the ice cubes in less than a minute. Pans are often referred to as "frying pans" as a descriptor of the type of pan, not because they are currently frying something at high heat. "While it was frying a crispy egg" could be at any time, and doesn't necessarily refer to the first three minutes. So because there is no direct piece of evidence that the frying pan is hot and could quickly melt an ice cube, then the reason an intelligence would select the correct answer "0" is because it's inferring that the question is quite silly and nonsensical, that there isn't sufficient evidence to give a correct answer, and that it thinks it's likely it is a trick question. If that is your intention with the question, then my apologies. However, I get the feeling you want the question to be more accessible and test if the intelligence can figure out the pan is indeed hot enough to melt the ice through real world context clues. To do that, it needs to be clear, not assumed or inferred, that the pan is in use and at high heat from the start.
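
For reference, the arithmetic behind the quoted question can be sketched as follows. This assumes the average is taken over the four minutes mentioned, and writes x (a symbol not in the original) for the number of cubes added in the third minute.

```latex
\[
\frac{4 + 5 + x + 0}{4} = 5
\;\Longrightarrow\; 9 + x = 20
\;\Longrightarrow\; x = 11
\]
```

Under that reading, 11 cubes go in during the third minute. Whether any whole cubes remain at the end of it (the answer marked correct is 0, per the comment above) depends on the pan being hot, which is exactly the assumption the comment questions.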

@jonogrimmer6013 2 months ago

Simple Bench looks great!

@thepopmanbrad 2 months ago

the fact that im a free user and ive already got this search feature is dope might use it like all the time

@jmillward 2 months ago

1000 discord members at £25 a month, dang. The power of passion and quality content👌

@AlexanderMoen 2 months ago

I've used OpenAI for most of my work with AI, with Gemini as a backup specifically for when I need a large context window. However, with Claude consistently ranking so high on so many benchmarks, having a little more of a context window than OpenAI, having contextual RAG, allowing you to run code within it, and now with computer use? I'm wondering if I should start using and building around that instead. ChatGPT with Search and o1-preview are just not big enough releases and make me wonder if my work is worse off for it. Unless, of course, OpenAI releases some absolutely killer agent stuff very soon that puts computer use to shame. ...And then there's the dark horse xAI potentially coming out of nowhere with the world's 2 largest AI-focused supercomputer clusters probably training something ridiculous as we speak.

@ShawnFumo 2 months ago

It also depends on what you're using it for specifically. Like if it is for software, GitHub Copilot now has the ability to edit multiple files at once, using a variety of LLMs (including o1-preview and Sonnet 3.5 new) similar to Cursor's composer feature. And copilot is just $10/m

@dioc8699 2 months ago

I am a free user, but I already have access to SearchGPT cuz I signed up through the waitlist ! I'm loving the experience !

@Raulikien 2 months ago

Can you clarify if you are giving the LLM the context of ALL the questions in Simple Bench, or testing 1-by-1? A human (who doesn't know about Simple Bench) can infer that they are trick questions and think more carefully if you give them at least a few samples. If you just give 1 question to the AI in each session, without it knowing about the context of the others, that's a clear disadvantage.

@Neomadra 2 ай бұрын

Most likely single questions. Otherwise benchmarking would be too expensive, providing multiple questions as context every time. Also the benchmark design would become way more complex then. But he explained in his podcast that even when highlighting in the prompt that these are trick questions and giving helpful tips on how to tackle them, the total score doesn't change much. This is especially true for the spatial reasoning tasks. That's not surprising. It's extremely hard to learn spatial reasoning based only on language modeling.

@adfaklsdjf 2 ай бұрын

i think for benchmarking purposes it's important that each question starts cold.. just to filter out weird influences across questions.. a standard baseline. i get that that creates an uneven playing field between machines and humans, but i think it's the better of imperfect options. and i think the primary role of benchmarks is for comparing models to each other, more than comparing them to humans, but the human performance provides a useful and important reference point

@Kleddamag 2 ай бұрын

Hey, just wanted to mention that there's a small typo in the footer of the 'simple-bench' page: 'This page was was adopted from the Nerfies project page.' The word 'was' is repeated. Thought I'd give you a heads-up! 👍

@aiexplained-official 2 ай бұрын

Thank you Kledda! Great spot

@ChimpDeveloperOfficial 2 ай бұрын

things are about to get next level

@sometingclever 2 ай бұрын

Thank you for the update

@OperationDarkside 2 ай бұрын

I was just about to go to bed. You are bad for my sleep schedule.

@divyansh1510 2 ай бұрын

hey, I wanted to ask how you evaluate your benchmark using the APIs while ensuring that the questions do not get leaked into the training datasets of these companies. Because, the way I understand it, if you are sending your question as an API request, they can certainly get that question in their training dataset 🤔

@WilliamKiely 2 ай бұрын

9:50 Metaculus is a forecasting site; it doesn't have prediction markets. Maybe you were thinking of Manifold Markets?

@equious8413 2 ай бұрын

I think overcoming hallucinations would require a massive and perfect data set. We just don't have this, nor is it necessarily possible since so much of discourse is subjective and opinionated. I think the path forward won't be eliminating hallucinations but mitigating them. I envision a whole second supervisory layer built to verify outputs as being a requirement eventually. This could potentially double hardware/energy requirements depending on how robust this layer is. This'll be our next real bottleneck imo.

@Dannnneh 2 ай бұрын

Yum yum, thanks for the taster!

@legendarystuff6971 2 ай бұрын

EVERY SINGLE LITTLE CHANGE Google made over the last two years made their product 30-50% worse, with the ai generated overview being at around 70%. It's degraded so much, so many times that I struggle to come up with a single example when Google search is useful today. People would pay good money for Google search from 2010 that used to do just that, search. Hope Google gets broken apart, we deserve better as consumers and competition will be useful

@DraconianEmpath 2 ай бұрын

I've been trying different search engines because of all this, and also because it feels like the Google results are less helpful than they have been in the past? best alternative I've found so far is brave search. they still have the AI summary thing, but you can disable it in settings

@paulallen8304 2 ай бұрын

I hate to say it but Bing is actually superior now

@alpha007org 2 ай бұрын

Well, to me, it's not clear the 2010 google search algorithm would be better with the pages that exist now. Now we have pages made/generated for google results, and these generated pages would be on the first 20 pages if you used the 2010 version. I think it's a cat and mouse thing with the google algorithm. They are actively trying to filter out "spam" pages, and they are doing a somewhat good job. But what is true is that the 2010 internet and google results, without these spam pages, were much better and more useful.

@andybaldman 2 ай бұрын

Jesus man, well said. Could not agree more.

@Efficienado 2 ай бұрын

I still use google search probably 20-50 times a day. Still super useful and I actually like the ai gen for quick answers to facts

@wingedsheep2 2 ай бұрын

I wonder if ChatGPT would get the question about the juggler right if it could imagine the scene using Sora.

@juandesalgado 2 ай бұрын

Nice! Congratulations for giving Simple Bench to the public... congratulations, or whichever words are used when someone has a newborn child :)

@TheMAHfilms 2 ай бұрын

The simplebench question on the 24 yr old brushing his teeth while a bald man gets hit in the head with a neon lightbulb all in the same bathroom💀

@michaelberg7201 2 ай бұрын

I don't understand the error with Nottingham forest. They have 4 wins x 3 points = 12, + 4 draws x 1 point = 16. Search showed they had 16 points. So what was the error there?

@kilianeagrams6011 2 ай бұрын

3:57 In the little preamble above the sheet, it says 13 points

@FirstName-zt2my 2 ай бұрын

I wouldn't mind ads in the form of suggestions as long as it gave me the best suggestion first then the advertised suggestion second and have it clearly stated.

@KindredKin 2 ай бұрын

Anyone find a way to turn off the prompt suggestion popup list? I find it distracting from what I am trying to write.

@BeethovenHD 2 ай бұрын

So clean, like it is made for powerusers.

@rasuru_dev 2 ай бұрын

"In a few weeks" -> "In a few months"
"Over the coming months" -> 2030
fair enough

@Thedeepseanomad 2 ай бұрын

Yes, Reliability, it is crucial for the economic impact of AI. Right now it is more brainstorm than a green needle.

@thomasslynch1 2 ай бұрын

you can propose your own prediction market on poly

@lordnoob404 2 ай бұрын

Did someone make a question related to robotics or what are SA's ideas for this field? If OpenAI announced anything on that field, the hype would be immense.

@francisco444 2 ай бұрын

Amazing analysis. Is Simple Bench only in the English language?

@TheSpeedyEMO 2 ай бұрын

What perfect timing

@ThatPsdude 2 ай бұрын

I don't see the simple bench link in the description, could you perhaps add it in? Keep up the great work!

@aiexplained-official 2 ай бұрын

Done!

@ThatPsdude 2 ай бұрын

@aiexplained-official Thank you!

@maciejbala477 2 ай бұрын

The hallucination thing is worrying. It's the thing that I'd most want to see properly addressed. I think it makes so much difference, not just for economic growth, but for everyday use as well. It is tiring to have to verify everything and deal with annoying mistakes which crop up out of nowhere whenever you're doing something. But it is what I suspected, to be honest, AI companies do not have a good answer for it. And until they do, I don't think we can ever claim they're more intelligent than humans, because reliability is implied in that claim, in my view

@lako2023 2 ай бұрын

The problem I see with the various AI benchmarks (or human benchmarks used with an AI): The training data of the foundation models will of course also include descriptions/questions for all kinds of AI benchmarks as well as all the discussions about them. So if an AI can use parts of its knowledge when being benchmarked, this will already change the result. We'd need offline benchmarks on systems not connected to the internet, vetted by professionals who are under NDA (= won't write about the details) to really know what they are capable of, right?