I just think it's so cool that you saw a problem with the testing data set and, rather than just complain about it, you made your own, and one that seems genuinely useful too. Idk I just love that, I love seeing people try to fix problems they see rather than just let them sit there. Nice
@mrmooshon5858 2 months ago
I've just realized I rarely take the time to thank you for your work. You are, 100%, the best AI-related channel on YouTube, and every time I see you've posted I go "babe wake up, a new AI Explained video just dropped"
@aiexplained-official 2 months ago
Thanks so much moo
@pauljones9150 2 months ago
2:12 "they said to expect it shortly, so expect it in 2030" 😂😂🤣
@holo23 2 months ago
Nah he violated openai with that statement😭
@technicolourmyles 2 months ago
I KNOW, FINALLY someone calling them out on this shi
@GrindThisGame 2 months ago
same time as Tesla self driving and fusion.
@DominicI1 2 months ago
Funny enough, I actually got the new "Search the web" feature on my free ChatGPT account already. I ended up upgrading a couple hours after playing around with it though so that I don't have to be worried as much about rate limiting.
@bentownsend4017 2 months ago
Who wants self driving cars bro we need trains
@ryzikx 2 months ago
ive been less into ai news lately but i will never skip an aiexplained video 👍
@aiexplained-official 2 months ago
:)
@adfaklsdjf 2 months ago
absolute top notch, no others are even close
@corwinzelazney5312 2 months ago
@adfaklsdjf no question, yes.
@corwinzelazney5312 2 months ago
Same same
@hyperchris01 2 months ago
Ditto
@Neomadra 2 months ago
I feel like the answer they gave on hallucination is quite misleading. It's rarely because source data by humans have flaws. It's mainly because it models language and not the real world, so even with perfect flawless data a language model will hallucinate when modeling language is not the same as modeling the real world. And that happens most often with previously unseen requests and questions.
@antonystringfellow5152 2 months ago
They don't seem to have any level of understanding. They generate content using statistically probable associations. This can make them appear intelligent because they contain so much data... more than any human brain. But then they'll produce something really stupid or even contradict themselves. This is because they don't understand anything in the first place, they only appear to.
@iau 2 months ago
Yeah, it was a cop-out answer. I believe that until the fundamental architecture of these models is improved to treat facts as something inherently different than the likely next word, it will keep _adjusting_ facts to maximize output likelihood.
@adfaklsdjf 2 months ago
they have no internal representation of truth.. it's all just word association. until there's some explicit internal representation of truth, the model doesn't "know" what it does/doesn't "know", so it's bound to hallucinate sometimes
@Blanksmithy123 2 months ago
Exactly, it’s a weasel answer… Even if you curated a model to train only on accurate information, it would still hallucinate.
@haileycollet4147 2 months ago
This would be rather hard to test, but I'd be very curious to see how much hallucination does drop when trained on "perfect" text, all else equal (model arch & size, # training tokens, hyperparams, etc.)
@DylanKane-u2j 2 months ago
you have such a talent to summarize this frontier of info and share it here. Thank you
@aiexplained-official 2 months ago
Thanks Dylan, means a lot
@DemetriusTrumpClips 2 months ago
The real question is, will we get AGI before GTA6 comes out!?
@adfaklsdjf 2 months ago
that is a good question.. 🤔
@schmutz06 2 months ago
I always dismiss this as a meme but it just struck me it is genuinely a neck and neck race, especially factoring in any delay to GTAVI being the mammoth project it is. potential for either first
@ElRak123 2 months ago
Agi then will code it, 😂 ready for play without patches needed 😂
@zyzhang1130 2 months ago
Maybe GTA6 is the AGIs we made along the way
@arnv4487 2 months ago
No, no we won't
@bjensen 2 months ago
No ads until they crush the competition- then never ending en****tification.
@maciejbala477 2 months ago
yeah, this is why competition is important. There's no reason for OpenAI or Anthropic or whoever else dominates, if they do, not to do whatever the hell they want if there's no viable alternative
@jeremycronic 2 months ago
Dude that juggler ladder question is crazy. I tried a few simple bench questions before u covered it in this video. I was SO confused with that question. I was like 'the ball is on the ground' but that's not an answer. I got it right after re-reading all the answers a few times, but damn that would be a wild kind of question for llm to figure out. Well done with those questions.
@panyusheng7240 2 months ago
I think some questions are designed to be unusual, like frying an ice cube while frying an egg. So it's very unlikely that a model has already seen similar scenarios in its training data. In order to get a high score in this benchmark, a model must have some basic understanding of the world and be able to apply reasoning to unseen cases.
@schmutz06 2 months ago
This is impressively hot out the oven! I followed it all and wanted to see a breakdown of the AMA and who better than everyone’s top AI explainer. Perfect stuff.
@urax5341 2 months ago
Hey, before I watch the whole video, great that you managed to make a paper about SimpleBench. It's great to see you evolving and being a part of AI-related research! You are not just a youtuber now but also a scientist. I am really happy for you !
@jonp3674 2 months ago
Search is really useless without reliability so it's really surprising they've gone for that. Imo the thing I use AI the most for is foreign language practice and they could nail that as it doesn't really matter if the model hallucinates while you're having a conversation with a tutor and the model's ability to explain vocab and grammar and create exercises etc is excellent because language is its core jam. They could make a really amazing video avatar product which takes the whole market without much difficulty I think.
@maciejbala477 2 months ago
I think it's not useless, it's just markedly less useful. If you think about it, if you use search in order to find human sources/links (which otherwise might be challenging nowadays on pages like Google) then it will still be useful, and probably more useful than Google Search. But obviously, you can't blindly engage with it and expect it to get things perfectly right.
@Шестик-ю1ц 2 months ago
Got recommendations of this channel almost constantly by a friend for about half a year, and was reluctant about it. But it took just one video... and here I am, waiting impatiently for each new one
@JL2579 2 months ago
Without having tried the new search yet, my previous experience with AI search has been really underwhelming. When it's simple stuff then googling is often faster, but on more advanced stuff it usually hallucinates or mixes up information horribly. And that's the only time where it would be actually useful, when it was able to quickly sift through lots of online information, filter out irrelevant stuff, wrong info and noise, and summarize the results. It's great when it already knows facts it doesn't need to Google. One example was when I wanted to know what percentage of the population in Switzerland owns guns and it repeatedly, even after making it aware of this, gave me an estimate based on dividing the number of registered weapons by the population, which yields a way too high number (and for example would yield over 100 percent for the US because there are more weapons than people) because gun owners can and often do own multiple guns, so a much smaller percentage of the population will own at least one, but of those many will have multiple. To this day I still don't have an answer 😅
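The gap between "guns per resident" and "share of residents who own a gun" described above can be shown with a small Python sketch. The per-resident counts are toy numbers invented for illustration, not real statistics for any country, and the function names are hypothetical.

```python
# Toy data (made up for illustration): guns owned by each of 10 residents.
guns_by_resident = [4, 3, 0, 0, 0, 0, 0, 0, 0, 0]

def guns_per_100_residents(counts):
    """Naive figure: total guns divided by population, times 100."""
    return 100 * sum(counts) / len(counts)

def percent_owning_at_least_one(counts):
    """Share of residents who own one or more guns."""
    owners = sum(1 for n in counts if n > 0)
    return 100 * owners / len(counts)

naive = guns_per_100_residents(guns_by_resident)
actual = percent_owning_at_least_one(guns_by_resident)
print(naive, actual)  # 70.0 20.0
```

Because a few owners hold several guns each, the naive figure overstates ownership, and it can exceed 100 whenever there are more guns than people.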
@thefinn0tube_ 2 months ago
I just gave search GPT a go with your Switzerland gun ownership search and it gave a pretty adequate answer. Obviously I have no idea the accuracy of its information but it seems believable: "In Switzerland, it's estimated that about 28.6% of households have firearms, with 10.3% specifically owning handguns. [Wikipedia] This means roughly one in four Swiss households is armed. However, since individuals can own multiple firearms, the actual percentage of the population owning guns might be lower. The country has a high rate of gun ownership, with estimates ranging from 27.6 to 54.5 firearms per 100 residents... etc"
@Picteon 2 months ago
My experience with search is that it completely forgets what my question was and just repeats the results to me even if completely unrelated with zero critical thinking
@ZeerakImran 2 months ago
@Picteon i used to get that but no longer have those issues after getting my custom instructions right. It also now no longer just gives me mainstream info from obviously biased sources like the bbc or any mainstream company which has an interest in the topic. i have noticed it being really good now. It used to be useless but now just straight up is quite honest with me and admits when something sucks. It's no longer polite and quite critical now which is great. It's not providing generic maybe this maybe that nonsense. I did specifically instruct it against maybe’s and “could”… i may be able to fly. I could be a dalek… i don’t want these words they mean nothing. So i instructed it to work around it and to be more concise, careful and have integrity and honesty as a core part of its values. It now tells me “it is possible (likely) that…” or “it is unlikely (possible)…” or “it is possible (unlikely)…”. That’s how i instructed it to tell me instead. I also have it not give me generic nonsense information that doesn’t apply. If i asked for this, I meant this. If you’re not sure, ask me for more information. Check online. Don’t provide any information that isn’t relevant unless you deem it to be valuable. The downside of my instructions is that they work really well for ME! These instructions will make it worse for a lot of people. Since when I’m using it, I’m very careful with my words and have also instructed it to take what I’m saying to mean exactly what I said instead of correcting it by assuming. If i have said something that doesn’t make 100% sense, then bring it up. Don’t assume or correct anything because i can’t be wrong. This won’t be helpful for others who wouldn’t write or speak in a very careful way. If you have experience with programming, it’ll work great. If you’re a very technical person or an engineer, this will work really well with you. If i tell it something wrong, i want to get a wrong answer that is right for what i said.
Rather than it assuming what i meant and changing it. Just as your programming code won’t change itself to correct the error.
@jan.tichavsky 2 months ago
@ZeerakImran since when is the BBC a biased source? if anything the alternative media come with a large bias because they are trying to push a specific agenda, often with one or a couple of large funding sources (undisclosed of course).
@reshit7003 2 months ago
chat: this could be related to the fact the 4o is the backbone and not o1 level models? maybe we will see a different story sooner. i think openai has this strategy of direct user validation, without pushing the real boundaries. the goal is just to see what people like and what they do not
@eqe-kui-nei 2 months ago
Our team spent months building a RAG that's using web scraped content as knowledge base. I just tried chatgpt search, it's as good as what we built😢.
@ginogarcia8730 2 months ago
don't worry man, AI is coming for all our jobs
@Retsamster 2 months ago
@ginogarcia8730 😢
@Alpha13yt 2 months ago
You can still give them some competition for when it inevitably dives in quality because of corporate greed
@practical-ai-engineering 2 months ago
I built this in 1 day...not hard to do
@calrt 2 months ago
2:18 "over the coming months, so expect it around 2030" 😂
@24-7gpts 2 months ago
lol yeah just "in the coming weeks" 💀💀
@jaxonterrill8416 2 months ago
I love AI Explained. Funny, short, concise, and extremely informative.
@aiexplained-official 2 months ago
Thanks jax!
@sagetmaster4 2 months ago
I think just having additional administrative layers on top will solve the hallucination problem. Big companies just haven't been working on these architectures because the model itself has been improving so fast it hasn't been necessary
@AIForHumansShow 2 months ago
of course you have your own benchmark now -- love this and pls just keep making these
@interstellar6647 2 months ago
Yay! New video! Always looking forward to it
@aiexplained-official 2 months ago
Thank you!
@kingdavid9422 2 months ago
Thank you, AI Explained on the update of SearchGPT. Something to note here: Many people in the comments are commending SearchGPT (or is it ChatGPT with search, like you emphasised in your video), especially its user interface. However, we should all remember that OpenAI is desperately in need of funds for further future advancement of their technological products. So, don't be surprised when the interface of ChatGPT with search gradually starts to change and begins to clutter with sponsored ads. I hope that will not be the case in the future. Time will tell.
@AlexanderMoen 2 months ago
they're different business models though. OpenAI wants to make its money off of API charges and subscriptions as a secondary form. That being said, most huge companies end up going public which leads to an expectation of continuous returns even after a market is saturated, at which point they may very well try ads and other avenues to continue profits. Hopefully, the business model they have now holds for a while
@anonymes2884 2 months ago
Simple Bench is a really interesting approach, I like how easy it is to stay ahead of the training set (we as humans can come up with novel "puzzles" along those lines basically ad infinitum). Not sure it's measuring _usefulness_ (in the commercial sense) as well as some benchmarks but it seems much closer to measuring real-world _reasoning_ than most. Also appreciate that the paper's 8 pages BTW - a lot of AI papers seem to be in the 60+ pages range and I don't have time to do more than skim them (and watch your summaries of course :).
@claudioagmfilho 2 months ago
🇧🇷🇧🇷🇧🇷🇧🇷🇧🇷👏🏻, Your videos are absolutely essential for the AI community! I seriously can’t go a day without watching them as they get released. They’re always so concise and straight to the point, something new every time no one is thinking of but you. The way you break things down is incredible, it feels like watching a movie, way beyond what anyone else is doing on their videos on these models and all. Thanks for sharing all this with us and making complex AI topics so engaging...
@aiexplained-official 2 months ago
Thank you Claudio!
@wyqtor 2 months ago
It's just amazing how Claude 3.6 Sonnet holds his own in front of much more expensive models trained using some kind of reasoning secret sauce! For this reason, Claude remains my go-to LLM for coding, math, and general learning.
@24-7gpts 2 months ago
*3.5
@ShawnFumo 2 months ago
It's also interesting... we have no idea if they have an o1-ish model internally. If they did, they could certainly use that for training data to help fine-tune Sonnet, like how OpenAI did for the search capability.
@ShawnFumo 2 months ago
@24-7gpts A lot of people are calling "Claude 3.5 Sonnet (new)" 3.6 just to emphasize that it feels like a bigger difference than most date-style updates
@24-7gpts 2 months ago
@ShawnFumo They might have something bigger in the works to release.
@carloslfu 2 months ago
lol! Loved Perplexity result about Simple Bench!
@MateenTariq-f2u 2 months ago
Finally a comparison video that actually uses normal everyday examples to give an idea of which model is better for everyday use!!
@Shaunmcdonogh-shaunsurfing 2 months ago
Really good work with Simple Bench. Bravo!
@rosslmccallum 2 months ago
Thanks for this overview.
@InternetStranger10101 2 months ago
Love this video, can’t wait to see more
@aiexplained-official 2 months ago
Coming!
@SiddhantGautam-o3x 2 months ago
Hey man, nice video. Just an appreciation from your subscriber for how you reaffirm or debunk the hype or breakthroughs made in this AI age. Just a small request: in future videos it would be great if you could include how your Simple Bench worked or the models launched in the YT videos, also some tutorials on how your website works. Anyways great work man👍
@aiexplained-official 2 months ago
Will do!
@michael-jones 2 months ago
that search breakdown is pure class
@geoffdavids7647 2 months ago
Daaamn those simple bench questions are really tugging at the edge of what a human can reliably figure out without properly sitting down and studying them. I am embarrassed to say I scored only 80% on the try-it-yourself - I was fully out-foxed by the man seeing the light fall in the bathroom question, and the glove falling out of the car question. The glove I might have missed by doing it too quickly, but the light in the bathroom one I sat and thought on for several minutes and only figured it out once I knew my first guess was wrong 😩 I fed both questions I got wrong and their multi-choice answers into o1 preview, prefaced by only "think carefully, the following might be a trick question". It got both of them absolutely right in one go, and perfectly figured out the trick in each. I am mortified. I feel like these questions really rely on visualisation and a robust world-model. I might have gotten them all right if I'd really tried to build a visual mental image of the situations a bit more carefully, or even drawn them out. If these LLMs were trained more extensively on spatial understanding using physical 3d model simulations? I wouldn't be surprised if it was able to smash through most of these.
@aiexplained-official 2 months ago
Great points and fascinating feedback
@cagycee5296 2 months ago
Thank you for an awesome video as always. Still the #1 AI channel I like to follow.
@wes8645 2 months ago
Great video man, I love seeing the notification for a video of yours!
@skuffd-semicolon 2 months ago
Algorithm engagement, this video is super great! Would like to see more of AI Explained, it's a great channel for any level of AI / AGI / LLM enthusiasm 💪💪🤯💯
@keeganpenney169 2 months ago
Glad to see the wheels are still trucking on phil!
@arminalimardani9232 2 months ago
Would you please provide a suggested citation for your paper? Also you forgot the link to your paper in the description :)
@runvnc208 2 months ago
I don't think anyone is going to read this, but it seems obvious to me that a large model that has very good video understanding, especially trained on something like random physics scenarios and questions about them, with good grounding of the language with video representations, _will_ be able to crush Simple Bench. It just needs a spatial-temporal and Q&A dataset, maybe with everything in the same latent space or at least really effectively linked spaces, that is aimed at his types of questions. Maybe a general purpose multimodal model that just has a lot of physics experiments, juggling, toddlers playing with blocks, etc. Maybe something like Sora but with a Q&A capability somehow. I think this is going to be a lot easier when we get another 10 or 50 times increase in efficiency or scaling to more easily handle models training and inferencing on video and video transcripts etc. that can use significantly more RAM. Regardless, I think we could easily see an 82% AI score on Simple Bench within the next year, 2 years at most.
@chrisanderson7820 2 months ago
When people talk about us running out of training data I think they forget the sheer volume of 3D spatial information yet to be put into these models. The amount that neural nets will learn from embodiment and being "out n about" via video feeds is immense. Stuff like Nvidia labs etc is just the start.
@AlexanderMoen 2 months ago
I don't think he or anyone else doubts that an AI could eventually beat Simple Bench. It's just a matter of when. Even what you just laid out would take a lot. Plus, it might not even be deemed the next best capability to focus on building.
@ShawnFumo 2 months ago
@chrisanderson7820 Yeah and there's constant innovations in ways to do synthetic data or use real data in different ways. Like just today I saw a paper (I think with Jim Fan?) where they managed to get a humanoid robot to learn a new task from just 5 teleoperated demonstrations, because they were able to auto-extrapolate those demonstrations into variations and train it on all of those in a simulator in a way that translated zero shot to the real world.
@matterhart 2 months ago
I'd be really interested in a deep dive video on the simple bench design, perhaps including your collaborators. In addition to paper details, I really like to hear about options you all considered, but didn't pursue and why. Or tried but it didn't work or wasn't practical. The things you can get out of researchers at a conference or over coffee. Might need an AI Explained More channel.
@claudioagmfilho 2 months ago
I'm eagerly awaiting the release of real-time video for Plus users from OpenAI, as it was originally mentioned as part of the ChatGPT Omni update, which sadly never reached us. This feature will be revolutionary, enabling us to tackle a wide range of daily tasks more efficiently. Real-time video integration within ChatGPT would greatly enhance productivity by allowing for interactive, dynamic assistance and more streamlined workflows. It would be especially useful for tasks like desktop sharing-being able to visually assist and collaborate on real-time activities is just phenomenal. I hope this feature rolls out soon, as it could drastically improve how we approach everyday challenges. Hopefully Gemini 2.0 will bring this to us.
@stephenrodwell 2 months ago
Thanks! Excellent content, as always! 🙏🏼
@technicolourmyles 2 months ago
In the fourth SimpleBench question, with the two lying sisters, the given problem never mentions there only being two paths. The answer marked correct, "What path leads to the treasure?" would return a lie. If there were only two paths to choose from, that would work, but what if there are 3? Am I missing something?
@guitar2935 2 months ago
Yeah this is the only question I genuinely don't think is correct. There's no reason to assume there are only 2 paths and for an arbitrary number of paths the only answer the lying sister could give for option A that couldn't possibly be something she'd say is where the treasure is.
@PianothShaveck 2 months ago
Yes, I also sent feedback about that. The correct answer should definitely be A), and not C)
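The two-versus-three-paths objection raised in this thread can be illustrated with a short Python sketch. It assumes, purely for illustration, that the sister being asked always names a path that does not lead to the treasure; the path names and the `possible_lies` helper are made up and are not taken from Simple Bench.

```python
def possible_lies(paths, treasure_path):
    """Paths a consistent liar could name when asked which path leads to the treasure."""
    return [p for p in paths if p != treasure_path]

two_paths = possible_lies(["left", "right"], "left")
three_paths = possible_lies(["left", "middle", "right"], "left")

# With two paths there is a single possible lie, so the treasure is on the other path.
print(two_paths)    # ['right']
# With three paths the liar has two options, so her answer only rules out one path.
print(three_paths)  # ['middle', 'right']
```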
@footleg3310 2 months ago
I don’t understand question #7. Are you saying that John is the 24-year-old bald man who watched himself get hit by the bulb (because the mirror is so small he couldn’t be seeing anyone else, I guess?) Even given that, an apology would only be redundant if he had already apologized (apologised) to himself - which he had not - in fact, he called himself an idiot.
@technicolourmyles 2 months ago
@footleg3310 yes, John is the bald man. He's brushing his teeth in front of a mirror, most likely indicating he's in his own bathroom. It would be redundant to send an apology text to yourself.
@lyte69 2 months ago
Was waiting for the simple bench release, love the work, thank you as always for the great video!
@jamesneutron2690 2 months ago
great work with simple bench. love that last comment aha
@aiexplained-official 2 months ago
Haha thanks !
@zyzhang1130 2 months ago
Congrats on the release!
@zyzhang1130 2 months ago
Also hope u can continuously safeguard ur private dataset. It’s a big issue nowadays for so many benchmarks
@Bolidoo 2 months ago
The fact we got a 41.7% on simple bench is already quite remarkable to me. It suggests LLMs do have a certain real world understanding. Even if it seems contradictory, simple bench makes me optimistic about the future of LLMs.
@draken5379 2 months ago
Congrats on the paper bro !
@George-Aguilar 2 months ago
I work in Google Advertising and this is an imminent threat! 😮
@thehighhnotes 2 months ago
Tested search with some local euro news highlights, very capable and verrry quick :)
@kelvinmunyimbili6078 2 months ago
I'm impressed at how anthropic is neck and neck with o1 without seeming to break a sweat.
I asked my coworkers this question, out of 12 people one got it right. Numbering the sentences and getting rid of the multiple choice increases performance. Changing it to balancing a yellow balloon on head while walking 100m, gets 100% so far. (No multiple choice) It seems to always choose D, but often times says that they are both on the ground within the same response
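The two prompt changes described above (numbering the sentences and dropping the multiple-choice options) could be applied with a sketch like the following. It assumes each option sits on its own line prefixed with a letter such as `A)`; the helper names and the sample question are made up for illustration and are not the actual Simple Bench text.

```python
import re

def strip_choices(question: str) -> str:
    """Drop lines that look like multiple-choice options, e.g. 'A) ...' or 'B. ...'."""
    kept = [ln for ln in question.splitlines()
            if not re.match(r"^\s*[A-F][).]\s", ln)]
    return "\n".join(kept).strip()

def number_sentences(text: str) -> str:
    """Split on sentence-ending punctuation and prefix each sentence with its index."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    return "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))

# Made-up sample question, not from the benchmark.
sample = (
    "A juggler balances a ball on her head. She climbs a ladder. Where is the ball?\n"
    "A) On the ground\n"
    "B) On her head"
)
print(number_sentences(strip_choices(sample)))
```

The regular-expression sentence splitter is deliberately crude (it would mis-split abbreviations such as "e.g."), which is fine for a quick experiment but not for anything rigorous.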
@micbab-vg2mu 2 months ago
My own experiments are in line with your benchmark; I got a similar level of reasoning from o1-preview versus the new Claude Sonnet. Perplexity hallucinated as well about the question related to Premier League:) - maybe models have a problem scraping data from tables. Gemini Advanced doesn't answer the question: "Unfortunately, I cannot give you real-time standings for the Premier League. Sports standings change very frequently!" Still old school Google search is more accurate:)
@ArnaudMEURET 2 months ago
I use Brave search by default and it’s often insufficient _but_ since they added the AI summary last year, it has proven extremely helpful in most of my daily queries (mostly development and general tech queries)
@aiexplained-official 2 months ago
Good shout
@ntelas46 2 months ago
It’s also available for all searchgpt waitlist users like me. Even on the free plan.
@luizpereira7165 2 months ago
Are you the creator of Simple Bench? Congratulations, it's a great benchmark. You could run an interesting little experiment with it. Change some details of the questions (without making any difference to the reasoning), like the order in which names appear in the questions, the names of people, or the shapes and colors of objects, to see if the models' performance decreases.
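One way to try the perturbation experiment suggested above is to rewrite surface details of a question while leaving its logic alone. The Python sketch below swaps names in a single pass; the `rename_entities` helper and the sample sentence are hypothetical, and a real experiment would also need to compare model accuracy on the original and perturbed questions.

```python
import re

def rename_entities(question: str, mapping: dict[str, str]) -> str:
    """Replace whole-word names in one pass, so swaps like A<->B work correctly."""
    if not mapping:
        return question
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], question)

# Made-up sample sentence, not from the benchmark.
original = "Beth hands Anna a red ball. Anna thanks Beth."
perturbed = rename_entities(original, {"Beth": "Anna", "Anna": "Beth", "red": "green"})
print(perturbed)  # Anna hands Beth a green ball. Beth thanks Anna.
```

Doing the substitution in one pass matters: replacing "Beth" with "Anna" first and then "Anna" with "Beth" would collapse both names into one.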
@aiexplained-official 2 months ago
Yeah of course I am!
@thesobercoder 2 months ago
Love the EPL test.
@thedrakenorton 2 months ago
Great stuff as always, Philip
@conor.brennan 2 months ago
8:21 I don't believe that hallucinations are occurring because the models have been trained on human-written data, that seems naive. It seems to be caused by something more fundamental to how these models generate, not just the training data.
@Pabz2030 2 months ago
Was trying this out just after it dropped. Man it's so nice to actually have a web search again that isn't just an Ad Auction house. And not only that you have a conversation with it to hone your results. Adios Google.
@adfaklsdjf 2 months ago
5:13 when you said you might disappoint some of us, I braced myself mentally for you to say you thought the search AI overviews weren't all that bad actually. Then when you said they've made search worse, I felt relief but also confusion.. does anyone think Google's AI overviews haven't made search worse? If so please share thoughts..
@pablorodriguez6318 2 months ago
Best ai channel
@jerobarraco 2 months ago
3:50 that's a key point. That has been there since the start. It's dangerous to trust the output blindly. And the more complex the system, the harder to spot it. (Not necessarily the less obvious or catastrophic).
@connoross13 2 months ago
Hello algorithm, I am also engaging with this video
@WilliamBoothClibborn 2 months ago
Woo, new simple bench
@hannibal8049 2 months ago
Hi algorithm, I am engaging with this video
@cheslerpark7223 2 months ago
Hi, I enjoy your YT vids and decided to check out Simple Bench. I happen to have a concern with the first question that came up, and I think you should give it another look and maybe revise it. The question: "Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute?" Ok, why I think it's problematic is because it is a complete assumption that the "frying pan" is sufficiently heated to melt the ice cubes in less than a minute. Pans are often referred to as "frying pans" as a descriptor of the type of pan, not because they are currently frying something at high heat. "While it was frying a crispy egg" could be at any time, and doesn't necessarily refer to the first three minutes. So because there is no direct piece of evidence that the frying pan is hot and could quickly melt an ice cube, then the reason an intelligence would select the correct answer "0" is because it's inferring that the question is quite silly and nonsensical, that there isn't sufficient evidence to give a correct answer, and that it thinks it's likely it is a trick question. If that is your intention with the question, then my apologies. However, I get the feeling you want the question to be more accessible and test if the intelligence can figure out the pan is indeed hot enough to melt the ice through real world context clues. To do that, it needs to be clear, not assumed or inferred, that the pan is in use and at high heat from the start.
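For reference, the arithmetic the quoted question invites can be worked through in a few lines of Python. This is only a sketch of the naive reading, assuming the average of five covers all four minutes mentioned in the question; the variable names are illustrative.

```python
minutes = 4
average_per_minute = 5
placed = {1: 4, 2: 5, 4: 0}  # cubes placed in minutes 1, 2 and 4, per the quoted question

total_placed = average_per_minute * minutes          # 5 * 4 = 20
third_minute = total_placed - sum(placed.values())   # 20 - 9 = 11
naive_count_after_three = placed[1] + placed[2] + third_minute  # 20, if nothing melted

print(third_minute, naive_count_after_three)  # 11 20
```

The comment's point is that the answer marked correct, 0, only follows if the pan is hot enough to melt every cube, which is the step the arithmetic alone cannot supply.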
@jonogrimmer6013 2 months ago
Simple Bench looks great!
@thepopmanbrad 2 months ago
the fact that im a free user and ive already got this search feature is dope might use it like all the time
@jmillward 2 months ago
1000 discord members at £25 a month, dang. The power of passion and quality content👌
@AlexanderMoen 2 months ago
I've used OpenAI for most of my work with AI, with Gemini as a backup specifically for when I need a large context window. However, with Claude consistently ranking so high on so many benchmarks, having a little more of a context window than OpenAI, having contextual RAG, allowing you to run code within it, and now with computer use? I'm wondering if I should start using and building around that instead. ChatGPT with Search and o1-preview are just not big enough releases and make me wonder if my work is worse off for it. Unless, of course, OpenAI releases some absolutely killer agent stuff very soon that puts computer use to shame. ...And then there's the darkhorse xAI potentially coming out of nowhere with the world's 2 largest AI focused supercomputer clusters probably training something ridiculous as we speak.
@ShawnFumo 2 months ago
It also depends on what you're using it for specifically. Like if it is for software, GitHub Copilot now has the ability to edit multiple files at once, using a variety of LLMs (including o1-preview and Sonnet 3.5 new) similar to Cursor's composer feature. And copilot is just $10/m
@dioc8699 2 months ago
I am a free user, but I already have access to SearchGPT cuz I signed up through the waitlist ! I'm loving the experience !
@Raulikien 2 months ago
Can you clarify if you are giving the LLM the context of ALL the questions in Simple Bench, or testing 1-by-1? A human (that doesn't know about Simple Bench) can infer that they are trick questions and think more carefully if you give them at least a few samples. If you just give 1 question to the AI in each session, without it knowing about the context of the others, that's a clear disadvantage.
@Neomadra 2 months ago
Most likely single questions. Otherwise benchmarking would be too expensive, providing multiple questions as context every time. Also the benchmark design would become way more complex then. But he explained in his podcast that even when highlighting in the prompt that these are trick questions and giving helpful tips how to tackle these questions, the total score doesn't change much. This is especially true for the spatial reasoning tasks. That's not surprising. It's extremely hard to learn spatial reasoning only based on language modeling.
@adfaklsdjf 2 months ago
i think for benchmarking purposes it's important that each question starts cold.. just to filter out weird influences across questions.. a standard baseline. i get that that creates an uneven playing field between machines and humans, but i think it's the better of imperfect options. and i think the primary role of benchmarks is for comparing models to each other, more than comparing them to humans, but the human performance provides a useful and important reference point
@Kleddamag 2 months ago
Hey, just wanted to mention that there's a small typo in the footer of the 'simple-bench' page: 'This page was was adopted from the Nerfies project page.' The word 'was' is repeated. Thought I'd give you a heads-up! 👍
@aiexplained-official 2 months ago
Thank you Kledda! Great spot
@ChimpDeveloperOfficial 2 months ago
things are about to get next level
@sometingclever 2 months ago
Thank you for the update
@OperationDarkside 2 months ago
I was just about to go to bed. You are bad for my sleep schedule.
@divyansh1510 2 months ago
Hey, I wanted to ask how you evaluate your benchmark using the APIs while ensuring that the questions do not get leaked into the training datasets of these companies. The way I understand it, if you send your question as an API request, they can certainly get that question into their training dataset 🤔
@WilliamKiely 2 months ago
9:50 Metaculus is a forecasting site; it doesn't have prediction markets. Maybe you were thinking of Manifold Markets?
@equious8413 2 months ago
I think overcoming hallucinations would require a massive and perfect data set. We just don't have this, nor is it necessarily possible, since so much of discourse is subjective and opinionated. I think the path forward won't be eliminating hallucinations but mitigating them. I envision a whole second supervisory layer built to verify outputs as being a requirement eventually. This could potentially double hardware/energy requirements, depending on how robust this layer is. This'll be our next real bottleneck imo.
@Dannnneh 2 months ago
Yum yum, thanks for the taster!
@legendarystuff6971 2 months ago
EVERY SINGLE LITTLE CHANGE Google made over the last two years made their product 30-50% worse, with the ai generated overview being at around 70%. It's degraded so much, so many times that I struggle to come up with a single example when Google search is useful today. People would pay good money for Google search from 2010 that used to do just that, search. Hope Google gets broken apart, we deserve better as consumers and competition will be useful
@DraconianEmpath 2 months ago
I've been trying different search engines because of all this, and also because it feels like the Google results are less helpful than they have been in the past? best alternative I've found so far is brave search. they still have the AI summary thing, but you can disable it in settings
@paulallen8304 2 months ago
I hate to say it but Bing is actually superior now
@alpha007org 2 months ago
Well, to me, it's not clear the 2010 Google search algorithm would be better with the pages that exist now. Now we have pages made/generated for Google results, and these generated pages would fill the first 20 pages if you used the 2010 version. I think it's a cat-and-mouse thing with the Google algorithm. They are actively trying to filter out "spam" pages, and they are doing a somewhat good job. But what is true is that the 2010 internet and Google results, without these spam pages, were much better and more useful.
@andybaldman 2 months ago
Jesus man, well said. Could not agree more.
@Efficienado 2 months ago
I still use google search probably 20-50 times a day. Still super useful and I actually like the ai gen for quick answers to facts
@wingedsheep2 2 months ago
I wonder if ChatGPT would get the question about the juggler right if it could imagine the scene using Sora.
@juandesalgado 2 months ago
Nice! Congratulations for giving Simple Bench to the public... congratulations, or whichever words are used when someone has a newborn child :)
@TheMAHfilms 2 months ago
The simplebench question on the 24 yr old brushing his teeth while a bald man gets hit in the head with a neon lightbulb all in the same bathroom💀
@michaelberg7201 2 months ago
I don't understand the error with Nottingham forest. They have 4 wins x 3 points = 12, + 4 draws x 1 point = 16. Search showed they had 16 points. So what was the error there?
@kilianeagrams6011 2 months ago
3:57 In the little preamble above the sheet, it says 13 points
@FirstName-zt2my 2 months ago
I wouldn't mind ads in the form of suggestions as long as it gave me the best suggestion first then the advertised suggestion second and have it clearly stated.
@KindredKin 2 months ago
Anyone find a way to turn off the prompt suggestion popup list? I find it distracting from what I am trying to write.
@BeethovenHD 2 months ago
So clean, like it is made for powerusers.
@rasuru_dev 2 months ago
"In a few weeks" -> "In a few months"; "Over the coming months" -> 2030. Fair enough
@Thedeepseanomad 2 months ago
Yes, Reliability, it is crucial for the economic impact of AI. Right now it is more brainstorm than a green needle.
@thomasslynch1 2 months ago
you can propose your own prediction market on poly
@lordnoob404 2 months ago
Did someone make a question related to robotics or what are SA's ideas for this field? If OpenAI announced anything on that field, the hype would be immense.
@francisco444 2 months ago
Amazing analysis. Is Simple Bench only in the English language?
@TheSpeedyEMO 2 months ago
What perfect timing
@ThatPsdude 2 months ago
I don't see the simple bench link in the description, could you perhaps add it in? Keep up the great work!
@aiexplained-official 2 months ago
Done!
@ThatPsdude 2 months ago
@aiexplained-official Thank you!
@maciejbala477 2 months ago
The hallucination thing is worrying. It's the thing that I'd most want to see properly addressed. I think it makes so much difference, not just for economic growth, but for everyday use as well. It is tiring to have to verify everything and deal with annoying mistakes which crop up out of nowhere whenever you're doing something. But it is what I suspected, to be honest, AI companies do not have a good answer for it. And until they do, I don't think we can ever claim they're more intelligent than humans, because reliability is implied in that claim, in my view
@lako2023 2 months ago
The problem I see with the various AI benchmarks (or human benchmarks used with an AI): The training data of the foundation models will of course also include descriptions/questions for all kinds of AI benchmarks as well as all the discussions about them. So if an AI can use parts of its knowledge when being benchmarked, this will already change the result. We'd need offline benchmarks on systems not connected to the internet, vetted by professionals who are under NDA (= won't write about the details) to really know what they are capable of, right?
@danielhenderson7050 2 months ago
Search is enabled in Switzerland for free users at least