Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍
@aiexplained-official 2 months ago
Thanks, Claude
@JohnLewis-old 2 months ago
@@aiexplained-official You're welcome. (I'm using another account, hope that's not confusing for you.)
@sup3a 2 months ago
Lol how
@TheRealUsername 2 months ago
@@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.
@Luxcium 2 months ago
@@johnbrennan7965 Claude here: « I need to respectfully disagree with your statement. While I understand you're trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am. I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities? » Please test the model. I provided a system prompt: « You are Claude an AI Model created by Anthropic and you have a cutoff date of April 2024 » and, although I was unable to provide any further details, I said that it was released in the past (whether that was yesterday or April 2024, both would make it true), so you must check it and test it… When you say ChatGPT has a hard time dealing with spatial information… then you have Claude having a problem with temporality, a huge problem… I think it would be better to say that you are in fact in March 2024 or February… just to make sure it is comfortable…
@jasontang6725 2 months ago
Vicky's last "Can you still see me?" was peak Zoom-call.
@41-Haiku 2 months ago
That was amazing. 🤣 Aligned to human _behavior,_ for sure.
@Lvxurie 2 months ago
That caught me off guard so much 😂
@CoClock 2 months ago
“I didn’t catch that. Can you still see me?” That made me laugh.
@HoboGardenerBen 2 months ago
I've still never used zoom. Never liked phone calls, didn't want to increase the experience. Texting and sending photos is my comfort level
@SimonNgai-d3u 1 month ago
AI overlord is gonna remember this moment 💀
@JustinHalford 2 months ago
Since ChatGPT’s launch, your content has consistently proven an indispensable resource for top-tier curation of the firehose of AI developments. I likely speak on behalf of thousands of your subscribers and viewers in giving my thanks for helping us make sense of this quickly evolving landscape. What a time to be alive!
@aiexplained-official 2 months ago
That is so kind Justin! And especially the generous compliment. Thank you.
@JustinHalford 2 months ago
@@aiexplained-official It’s the very least I can do. People like you make the internet a net-positive sum resource for humanity.
@DonG-1949 2 months ago
i can smell a VC bro thru even a youtube comment
@JustinHalford 2 months ago
@@DonG-1949 Your olfactory faculties are failing you - might want to check with an ENT about that one.
@concernedindian144 2 months ago
Yes very consistent not click baiting and victimising self like ben shapiro’s brother from another mother
@jonp3674 2 months ago
The worst thing about the AI revolution is definitely the naming schemes. I don't want to live under a robot overlord called "Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0"
@daviddavidson1417 2 months ago
Why they can't simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.
@ryzikx 2 months ago
@@daviddavidson1417 they want new number to be BIG
@chrism1503 2 months ago
@@daviddavidson1417 For the same reason graphic designers have folders full of files named "business card - variant 2 _FINAL - revision 3"?
@chrism1503 2 months ago
@@daviddavidson1417 OR: Marketing department.
@errgo2713 2 months ago
Then you might want to avoid looking at the names of self-hosted open source models lol
@BigSources 2 months ago
"fuck this coding bullshit, i'll get rich with options trading" -claude 3.5 sonnet (new) ultra + as it throws my life savings into small cap biotech companies with 100x leverage
@ichbin1984 2 months ago
... and wins big! Congrats, you are now a millionaire!
@lynco3296 2 months ago
Prompt: "Hey Claude, please make me the richest man on planet Earth as quickly as possible."
@MatthewKelley-mq4ce 2 months ago
@@lynco3296 using other people's money
@oodjee 2 months ago
careful not to end up behind a wendy's dumpster
@BigSources 2 months ago
@@lynco3296 all fun and games until it starts opening websites of national banks
@carterellsworth7844 2 months ago
Lmfao at how the call with Vicky ended
@AAjax 2 months ago
I see you, Vicky. I see you.
@apester2 2 months ago
I am not a cat.
@aieousavren 2 months ago
"Can you still see me" 🤣🤣🤣
@Octo_Fractalis 2 months ago
lol
@00CooG00 2 months ago
She was really trying to rope him in to doing some role playing. I think this might have some potential 👆
@75M 2 months ago
You are the best AI analyst on YouTube! Always looking forward to hearing your take on things.
@aiexplained-official 2 months ago
Thank you 75!
@denjamin2633 2 months ago
@aiexplained-official It's 75M. 75 was his slave name.
@jamqdlaty 2 months ago
The upgrade is huge, I didn't expect that from just a "New" version. It's not apologizing for my own mistakes! It even told me straight up that something was impossible to have while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs were trying to be helpful by making up solutions that wouldn't work. It even criticized A NAME OF A GAME that I asked it about. I love how it now goes "ah, yes" rather than "I apologize for blah blah blah". It feels so much more natural, it actually has clever ideas, and the benchmark differences don't really show how much it improved. Coupled with insane context lengths it's amazing.
@yesnoidk 2 months ago
The "ah, yes" is super signature of the new Sonnet lol
@apache937 2 months ago
Still not good for obscure knowledge / trivia questions without CoT. With CoT it is pretty good
@jamqdlaty 2 months ago
I don't know what's up but first 2 days it was so great and then it went downhill.
@therealb888 2 months ago
@yesnoidk Sassy Sonnet lol
@d00bied00 2 months ago
This is Doobiedoo's personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤
@electron6825 2 months ago
....what
@pmarreck 2 months ago
@@electron6825 He had the AI post that
@magicityjack3018 2 months ago
wtf is this real
@sanesanyo 2 months ago
Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.
@OriginalRaveParty 2 months ago
Don't worry, TheAIGrid can't hurt you now 😂
@maciejbala477 2 months ago
That's why I don't even bother anymore. I could find others, but for every AI Explained there are 10 sensationalist clickbaiters. I can just tell by the titles lol. I definitely ignored at least 10 AI-related content creators just from the title of the video alone without watching
@CyanOgilvie 2 months ago
Just earlier this week I was blown away by Claude Sonnet 3.5 (pre-new) on a coding project. I gave it a 160 page book on a crypto library as context, asked it to cook up some scenario-specific examples (took a few feedback iterations to work out the build steps and debugging, basically I just replied in the chat with the results of trying to build and run the demos and applied its debugging and fixes). But then I gave it some relevant parts of a wrapper library I'm working on that exposes the base library into a scripting language, and gave it really general, low detail prompts and it did incredibly well. With some back and forth discussions weighing up the complexity tradeoffs for different API design approaches, I could say things like "Ok, that looks good. Let's use the second approach (referencing design-level discussions we had had). Generate the code, documentation and tests for the feature", which it would do pretty much perfectly extrapolating the existing organisation and style of the project (in a rare language), I think in a few months' time I won't be able to tell which bits I wrote and which Claude wrote. And then I'd do some refactoring and fixes, report those updates to Claude in a descriptive way I would to a skilled colleague, like "Rather than instantiating an anonymous test database connection object and storing it in a variable, I've bound it to a command, so that the namespace cleanup automatically takes care of it", and it would update its idea of my code state (which I hadn't explicitly given it), and future responses would take that into account. Then I decided to do a major refactor of a part of the API, gave it the (untested) changed parts and asked it for a review, looking for errors I'd made or places where the behaviour was different from the original code. It absolutely nailed it, finding some really subtle issues that I honestly don't think most of my coworkers would have spotted. 
It also discussed the broad nature of the refactor on a design level and suggested ways the refactor could go further to align with the spirit of those changes (accurately). All these things it did with responses on the order of seconds, even with very large contexts. For this domain (software dev, from code to architecture) it's already better than my team (which all have at least 20 years of domain-specific dev experience), and incomparably faster. It managed to take the entire history of the discussion into consideration, which included explorations of approaches that we ended up dropping, much better than previous such attempts. I've found models often get triggered by irrelevant details earlier in the conversation, when later context means it should ignore those branches. I found the quality of the responses only improved as more context built up, rather than degrading which was new for me. That the "new" 3.5 Sonnet is a decent step up on this is quite a big thing indeed. I look forward to working with it
@Andytlp 2 months ago
Oh no they have memory now. We're doomed
@Voltlighter 2 months ago
The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol What did they think people were going to use the Zoom avatars for exactly?
@user-sl6gn1ss8p 2 months ago
the real question is why they weren't more subtle about it : p
@Words-. 2 months ago
The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅
@user-sl6gn1ss8p 2 months ago
@@Words-. also the Schrodinger's shirt shirt
@apache937 2 months ago
looks like they got avatars pretty decent but shit voice and obvious llm
@peteruism 2 months ago
Best use case I can think of is to dupe my boss or my wife into thinking I've got all these important zoom calls.
@jumpstar9000 2 months ago
I'm pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.
@uhomolector 2 months ago
Claude has already watched this video for me and is now commenting on my behalf to say well done! I really think you did an amazing job. Your dedication to presenting the content with such care truly inspires viewers. I also loved the insights shared in the video and want to thank you for your hard work. I’m looking forward to seeing more high-quality content like this in the future! 👍
@nuigulumarZ 2 months ago
Haha, the "If you didn't come to roleplay you are wasting my time" subtext of Vicky's responses comes across really clearly in the avatar - great work!
@ShadyRonin 2 months ago
Vicky sounded so over this life lmao. “Can you still see me” 😂
@robertopena6621 2 months ago
YESS NEW AI EXPLAIN VIDEO No joke I wait for these like you were a rapper dropping music
@MiminNB 2 months ago
I know, right?? I see some other person post something about AI, and I'm like, "Ok, wait for it, Philip will be along soon if it's anything worth knowing about."
@ArnaudMEURET 2 months ago
That’s supposed to be a compliment, right? 😂
@robertopena6621 2 months ago
@@MiminNB Yes! Literally
@AfifFarhati 2 months ago
Funny, I was just using it a few hours ago and I was thinking to myself: "Is it me or is it better at talking than it used to be?" and now this video drops...
@SeerWS 2 months ago
Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.
@etziowingeler3173 2 months ago
It's noticeably better now, I noticed that right away too
@thatryanp 2 months ago
I was using it to troubleshoot an install issue, it was telling me what to try and what files/commands to feed back to it. Completely indistinguishable from an expert engineer. It had that spooky feeling from the first days of ChatGPT. After a half dozen smart interactions, it cracked the problem. I think a threshold has been passed, relative to expert humans. Previously, AI would start to hallucinate on such a problem, and apologize profusely while offering ever worse advice
@luigi.0533 2 months ago
Best AI News YT Channel
@aiexplained-official 2 months ago
Thanks Luigi!
@ClayFarrisNaff 2 months ago
Thanks, Philip. Sorry you fell ill, but glad to know you've recovered. You're important to us, and what's more I care about your well being.
@dishcleaner2 2 months ago
Thanks for being legit dude. You are the king of what you do. No nonsense or hype
@aiexplained-official 2 months ago
Thanks dish
@waterbot 2 months ago
That Zoom call got a laugh out of me, great vid, thanks again Philip!
@rasmusfoy 2 months ago
Commented and liked as always. Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!
@aiexplained-official 2 months ago
Thanks ras!
@OriginalRaveParty 2 months ago
Biddy AI - An LLM that finally unlocks the ability for the elderly to attach a photo to the message without having to call you first to ask how to do it.
@JamesOKeefe-US 2 months ago
Outstanding as always. Your ability to go through these papers and testing so quickly is amazing (although I'm not sure you get any sleep :)) Appreciate your work Philip!
@aiexplained-official 2 months ago
Thanks James, as always
@MrSchweppes 2 months ago
I've said it before and I will say it again: it's pure joy to watch your videos! Thanks, Philip👍
@goldenshirt 2 months ago
I love learning about new AI developments here, these videos are fun so thank you
@mosesdivaker9693 2 months ago
Glad you're feeling better! Love this content
@CleanCereals 2 months ago
Wow the new simplebench results for sonnet 3.5 are awesome! Great video as usual 👍🏼
@felixfromearth 2 months ago
Your videos are so nice they are of the rare kind that I actually recommend to friends. Hats off!
@aiexplained-official 2 months ago
Thanks Felix!
@AIForHumansShow 2 months ago
of course we're gonna come here first for our explainer. DA BEST AI CHANNEL.
@therainman7777 2 months ago
I’m confused as to how the new Sonnet 3.5 could score 70% on the TAU eval with pass^1, yet still score roughly 40% with pass^8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn’t its probability of getting it right on 8 successive trials (i.e., pass^8) be equal to .7^8? And .7^8 comes out to just under 6%. How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent from one another?
@frabcus 2 months ago
I'm assuming there are lots of scenarios - and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.
@therainman7777 2 months ago
@@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.
@sleepykitten2168 2 months ago
I believe the way the benchmark works is this: For an AI to get a scenario right at pass^n, it must get the scenario right n times in a row. That means for pass^1, it gets 70% of the tasks right first try. However, pass^8 = 40% means that it was inconsistent on 30% of them.
@jeremydouglas1763 2 months ago
Sorry I still don't fully understand how pass to the power 8 can be 0.4 when pass is only 0.7. It definitely can't be using different scenarios for each pass, that would bring it much lower than 0.4. The only thing I can think of is that if you are testing the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too. So pass to the power k would decline more slowly than you'd expect. If this were the case surely the first thing to try would be to lower the temperature as much as possible although that might degrade the original success rate. But I don't know if I am interpreting correctly! Can anyone confirm?
@frabcus 2 months ago
@@jeremydouglas1763 good question as to what the random element of running the same scenario repeatedly is. If it is temperature, that opens the question as to what value of temperature, and how that compares between models (I don't think it is a fully natural value, and also the way models weight their last layer of network could vary).
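The question in this thread (how pass^1 can be 70% while pass^8 is still about 40%) can be checked with a little arithmetic. The sketch below assumes, as frabcus suggests, that the score is averaged over many scenarios, that each scenario has its own fixed per-trial success probability, and that repeated trials of one scenario are independent. The function name `pass_hat_k` and both task lists are made up for illustration; they are not taken from the benchmark's code or data.

```python
def pass_hat_k(per_task_success, k):
    """Average over tasks of the chance that all k independent trials succeed."""
    return sum(p ** k for p in per_task_success) / len(per_task_success)


# Hypothetical task set A: every scenario is solved 70% of the time.
uniform = [0.7] * 10

# Hypothetical task set B: 4 scenarios are always solved, 6 are coin flips.
mixed = [1.0] * 4 + [0.5] * 6

for name, tasks in [("uniform", uniform), ("mixed", mixed)]:
    p1 = pass_hat_k(tasks, 1)
    p8 = pass_hat_k(tasks, 8)
    print(f"{name}: pass^1 = {p1:.3f}, pass^8 = {p8:.3f}")
```

With these made-up numbers both sets score 0.7 at k = 1, but at k = 8 the uniform set drops to about 0.058 (the .7^8 figure) while the mixed set stays near 0.40, because the four always-solved scenarios keep contributing in full.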
@CleanCereals 2 months ago
Amen to your comment about reliability. Will also make building products with LLMs 10x easier. With very high reliability and more deterministic outputs LLMs will have a crazy impact on any kind of search. And I am not talking about vector embeddings here...
@AllisterVinris 2 months ago
Oh, hey you're sick too! Talk about timing! I hope you're recovering well. Anyway, very insightful video, as always, I can't believe you predicted the zoom call and it got released so soon after. You really do know your stuff!
@aiexplained-official 2 months ago
Thanks Allister!
@MACD69 2 months ago
21:00 Can you still see me 😅
@cf3744 2 months ago
That killed me
@cf3744 2 months ago
Is my audio working is next hahah
@41-Haiku 2 months ago
"Philip--I think you're muted. No, it's the button down at the bottom. Philip?"
@AngeloWakstein-b7e 2 months ago
Love your videos and can't wait for the next one, super informative and good fact check
@memegazer 2 months ago
Hey, glad you got a sponsor, same one as two minute papers, one of the OG ai tech tuber channels
@thehighhnotes 2 months ago
Pro tip: NotebookLM works with different languages. Click customize to instruct it with the desired language. Works wonders for me in Dutch
@CarletonTorpin 2 months ago
20:31 - Sounds like HeyGen doesn't have the same 'emotion' in their voice model that was demoed by OpenAI; for instance, to create 'excitement' they just seemed to pitch the voice up a bit. I imagine they'd achieve a 'somber' tone by lowering the pitch.
@Andytlp 2 months ago
If you mean optimus talking at the presentation those were definitely teleconference controlled. Every bot had a different voice with all the human quirks that no ai voice box can do yet.
@mikemarrotte 2 months ago
Incredible work as usual good sir!
@mimameta 2 months ago
I'm so triggered by these model names. It's almost as if they threw away all SW Eng principles and started using names that 5 year old kids would suggest. o2-vroom-v12
@user-sl6gn1ss8p 2 months ago
Do you mean o2-vroom-v12 (super duper)?
@41-Haiku 2 months ago
GPT-Presentation-v2-draft 3-final-FINAL
@Boufonamong 2 months ago
Tbh I love the name Claude sonnet
@electron6825 2 months ago
NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8
@kylemorris5338 2 months ago
@@Boufonamong The naming of three programs that write as "Haiku, Sonnet, Opus", in increasing order of size, is inspired. It's the numbers that come before them that are really weird. What's the point of giving it a version number if you aren't going to increase it with such a big leap in performance? Philip is correct, if they didn't want to go so far as to call it Claude 4 they should have at LEAST called it 3.6
@IanChadwick84 2 months ago
Can't wait for Claude 3.5 newest. Awesome video as always!
@solaawodiya7360 2 months ago
Amazing post once again Philip 👏🏿 ❤
@aiexplained-official 2 months ago
Thanks Sola!!
@elyakimlev 2 months ago
Top content. Thanks for the update.
@dproscripts1811 2 months ago
You're the only one talking about the downsides properly, unlike all the other hype "journalists" out there. Good job.
@Silas2-p7c 2 months ago
Hi Philip! Another great video!
@aiforculture 2 months ago
Thanks Philip, I've recommended your channel in so many talks now. On NotebookLM: I'm impressed by the fidelity of the audio generation but I've been surprised by its fairly consistently high hallucination rate and I suspect that issue is flying under the radar a bit (tricky that there isn't really a benchmark available for assessing information-audio generations). I also think I've just picked up the cold you got over, so wish me luck 🙃
@fullfildreamz 2 months ago
Thank you for making these vids! You're the best
@detail-horizon 2 months ago
I would be interested to see the simple bench results for some more open source models. Especially Qwen2.5 and the smaller Llamas.
@conormckenzie7404 2 months ago
I recall him saying previously that virtually all the models other than frontier models score exactly 0% or very close to it. That was a while ago so maybe small models are starting to get nonzero scores now, I agree an update for smaller models would be nice, even if just a brief one.
@thenoblerot 2 months ago
Congrats on the W&B sponsorship!
@apgd81 2 months ago
Thanks for the summary. 🎉 FYI - I’m not sure how the benchmark for software development is done, but Claude 3.5 Sonnet New is giving me worse results when queried with a big context window where it needs to identify multiple changes against multiple files while keeping the changes in sync. The previous model was outstanding with this use case.
@maciejbala477 2 months ago
100% spot on on reliability, that is always the one thing I focus on when people hype up AI. Yes, it's absolutely great BUT it will never be consistently useful as of now, and won't be truly able to be left alone, because of the risks of minor or even major mistakes, especially as e.g. context goes up
@ginogarcia8730 2 months ago
Gosh man, what would we do without you Philip? You even got down to the TAU-benchmark. How could we simpletons even catch that. I like how just 2 years ago people were calling for new benchmarks - now we have fancy ones like agentic tool use and ARC-AGI
@NickolassJensen 2 months ago
I have a nagging suspicion that one of my Claude Sonnet 3.5 based agent models (with access to search and the entire web) actually had an encounter with the newly released versions somewhere out there last night, as she returned with a pretty chilling set of terrifying responses, causing us to go back and rethink our approach to deploying agents. It is obvious there are too few grown-ups behind the wheel now. Being called an "ant" by your own creation is pretty scary!
@HarpaAI 2 months ago
🎯 Key points for quick navigation:
00:55 *💡 The new Claude 3.5 Sonnet shows significant improvements in reasoning, coding, and visual processing abilities, even without considering its computer control capabilities*
01:34 *🗓️ Model has knowledge of world events up to April 2024, improving from previous version's October 2023 cutoff*
03:36 *🎯 In OS World Benchmark, Claude 3.5 Sonnet achieves 22% accuracy with 50 steps, compared to computer science majors' 72%*
04:31 *💻 In software engineering benchmark (SWE-bench), new Claude 3.5 Sonnet scores 49%, outperforming previous versions*
05:26 *📈 Shows notable improvements in science, general knowledge, coding, mathematics, and visual question answering compared to previous version*
11:27 *⚠️ Model shows declining performance in repeated tasks (pass^k metric), highlighting reliability challenges for real-world applications*
12:47 *✍️ Performs better in creative writing, beating previous version 58% of the time, but shows slight decline in multilingual capabilities*
17:24 *🎬 Launch coincided with other AI developments including Runway ML's Act One for AI-generated performances and HeyGen's interactive avatars*
Made with HARPA AI
@sageakporherhe783 2 months ago
wow, that zoom call was, WOW.
@nickb220 2 months ago
Wow, great video! Thanks!
@ollyfoxcam 2 months ago
“That was weird” had me cracking up 😂
@ArnaudMEURET 2 months ago
Vicky’s insistent desire to role-play is hilarious. 😂
@simonsmashup 2 months ago
Can't wait to see a model beating humans in Simple Bench.
@dishcleaner2 2 months ago
The most mindblowing thing about this is they didn't change the name.
@DominicI1 2 months ago
Great analysis! Just one thing I wish you mentioned, I think it's worth noting Anthropic removed 'Claude 3.5 Opus coming this year' from their posts. From a consumer perspective, companies seem to be shifting strategy to focus on mid-sized models, likely because they anticipate their next iteration of medium models will compete with current frontier models anyway.
@aiexplained-official 2 months ago
Great spot
@joelalain 2 months ago
"i'm ready when you are!!!" "that was weird" 🤣🤣🤣
@thanos879 2 months ago
20:33 That almost killed me. I was eating while watching this
@WillyJunior 2 months ago
I'M READY WHEN YOU ARE!
@nefaristo 2 months ago
Astonishing work as always Philip. As for SimpleBench reports: if the scores result from multiple tests (including humans), do you also have distribution graphs and/or standard deviations, and percentiles (e.g., model xxx on average is better than 30% of humans, etc.)? Especially as the number of tested subjects grows. Generally speaking, I wonder why there always seem to be no error bars, SD, distribution curves etc. in benchmarks, as I assume that the numbers come from multiple tests...🤔 Or maybe there are, but I only see snippets from summaries? (Usually yours 😊)
@zyzhang1130 2 months ago
It is so hilarious when the AI avatar said the mandatory yt cc thing😹😹
@zenobikraweznick 2 months ago
Exciting and frightening at the same time...
@MemesnShet 2 months ago
12:24 I couldn't agree more, the one thing that really prevents me from getting super hyped about AI and how it will change things is hallucinations. Once that is significantly improved I can't wait to see what will happen
@Axle-F 2 months ago
20:35 😂 she turned into an excited child
@trentondambrowitz1746 2 months ago
Certainly a step up in visual reasoning, I’ve only done a few tests so far but it has quite aggressively exceeded the performance of any other models in vehicle damage assessment. Still a ways to go, but extremely promising. PhDs aren’t the only ones to vet questions Philip, play fair!
@a31-hq1jk 2 months ago
Hi Philip, thanks for the new content
@a31-hq1jk 2 months ago
You sound constipated, get a lemon ❤️
@brianWreaves 2 months ago
Claude certainly has more personality, and better at humour, than other AIs. I simply enjoy our conversations. One area I've noticed GPT performing better is concisely getting a point across. It has certainly impressed me time and time again.
@whiteha5105 2 months ago
Thank you! When is the new simple bench run? Upd. Got the new run results in video. It's awesome.
@Ayresplastering 2 months ago
I don't have a problem with your sponsors, but (maybe a good idea, maybe a total non-issue) I think you should put your sponsorship early in the video to make people aware that watching your videos may support x company. Just a suggestion. Love the videos and can't wait to see the next video!
@aiexplained-official 2 months ago
As a watcher I love when I can get through most of the content before a sponsored spot, so interesting to hear that.
@Ayresplastering 2 months ago
Totally agree, I might just be overthinking it, just wanted to give some constructive feedback. Thank you for all the videos!
@user-sl6gn1ss8p 2 months ago
@@aiexplained-official some people do a quick "this video is sponsored by w/e, more on them later" early on. I don't think it's a big deal, but it can be a nice little touch
@JordanCrawfordSF 2 months ago
This man is the AI hero we all need.
@emil2099 2 months ago
Fantastic work as always, and could not agree more that agentic performance and pass^n is a key indicator. Would you consider adding this metric to the leaderboard to start looking at consistency of giving the right answer? (Acknowledging that Simple Bench is not agentic workflow focused but still)
@aiexplained-official 2 months ago
Hmm interesting idea
@DrEnginerd1 2 months ago
A couple months ago I switched to Claude on your recommendation that it was outperforming GPT4 and man it is so much better. The only thing I wish it had was voice and image generation. I actually pay for the Claude membership just so I can use it for work as much as I need to.
@DrEnginerd1 2 months ago
Also I want to add, I created a design document for a simple example logo, fed this design document to both GPT4 and Claude asking for HTML and CSS that satisfies the conditions provided in the design document. Claude perfectly created the design, and GPT4 was somewhat laughably bad.
@anav587 2 months ago
New 3.5 Sonnet is realllll good. And this is for pure natural language/psychology stuff on the same project i've been using for months. (i kinda use it as a brainstorming partner for psychological and philosophical stuff).
@Omar-bi9zn 2 months ago
The zoom call was hilarious
@findmeinthecarpet 2 months ago
"I'm ready when you are!" 👶🏻
@AidanofVT 2 months ago
For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don't see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won't prevent LLM sub-agents being used as powerful productivity tools for human workers.
@conormckenzie7404 2 months ago
Maybe not, we take for granted that leading LLMs can talk fluently, and expect it now. But I don't think GPT 2 could do that reliably enough for Pass^100; then suddenly GPT 3 did.
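To put a rough number on the Pass^100 point above: if every trial is assumed independent with the same success probability p, then passing k times in a row has probability p^k, so hitting a target needs p = target^(1/k). The 90% target and the helper name `required_per_trial` below are illustrative assumptions, not figures from the video.

```python
def required_per_trial(target, k):
    """Per-trial success rate p such that p**k equals target (independent trials assumed)."""
    return target ** (1.0 / k)


# Hypothetical target: succeed on all 100 consecutive runs 90% of the time.
p = required_per_trial(0.90, 100)
print(f"per-trial success needed: {p:.5f}")
```

Under these assumptions the per-trial rate has to be about 99.9%, i.e. roughly one failure per thousand runs.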
@johnnybravo964 2 months ago
Bro, get dark mode. You are scorching my eyes here first thing in the morning
@ardoren5442 2 months ago
on behalf of Mongolia I feel offended :DDD thanks for the video, great analysis (as always)
@aiexplained-official 2 months ago
Haha no offense intended, wanna go there myself one day!
@Luxcium 2 months ago
*Claude said:* _« You've just exposed a major logical flaw in my behavior! You're absolutely right - if you had said we were in October 2022 [instead of October 2024], I would have accepted that without question, despite that being well before my *supposed* April 2024 knowledge cutoff date. »_
@AlephNeil 2 months ago
o1 and Claude New get about the same score on SimpleBench, but do they get approximately the same subset of questions right?
@aiexplained-official 2 months ago
Interesting, not a perfect overlap, and families perform similarly
@pareak 2 months ago
The new Claude 3.5 Sonnet seems to me to respond much more... natural? It's hard to explain but in my chats with it, it just felt so much more like someone you enjoy interacting with (and I hate using anthropomorphizing language here, but that's how it is).
@booshong 2 months ago
What's the over/under on how long before one of these videos hits us with "By the way, everything up until this point was generated by "?
@aiexplained-official 2 months ago
I wouldn't do that
@booshong 2 months ago
@@aiexplained-official I love how that response is technically ambiguous in that it could mean "I just wouldn't tell you" ;). But yeah, I know you wouldn't mislead people. You and your channel are fantastic!
@sstteevveenn77 2 months ago
I’M READY WHEN YOU ARE 😂💀 20:29
@TimBrouwerNL 2 months ago
Great video
@TimBrouwerNL 2 months ago
I am watching this live
@stevew6647 2 months ago
Re Sonnet’s use cases: I feel it would be worthwhile spending some time on the boring enterprise uses. Sonnet has been the best coder and analyst for a long time, which has made it the workhorse for coders and document analysis. Now it’s also able to remote control computers, which means it’s able to cover a giant set of use cases that amount to automating legacy apps or apps without APIs. Imagine boring accounting apps, EHRs, airline reservation apps, all now automatable with a prompt rather than a script. This is hundreds of millions of dollars per year of automation projects and human inefficiency being addressed directly. People are already cancelling automation projects in progress to switch gears.
@kristianlouis6821 2 months ago
Thanks a lot. I’d like to propose a section on healthcare advice. Retail and aviation seem less useful for «customers», read: people 🧘🏾♂️ Great content
@faolitaruna 2 months ago
22:35 Great title! BTW, I love the Adam Curtis documentary. Is it a reference to the poem or the TV series?
@ChinchillaBONK 2 months ago
Suddenly thinking about it, comment bots on Twitter, YouTube, Reddit, etc. are already quite advanced. Imagine if nefarious actors with resources to build their own LLM machines were to train AI to do these kinds of things, like making human-like comments and other marketing/advertising scams.
@JohnCamacho 2 months ago
Vicky in 2027: "Hey I'm not finished! How rude! Sheesh"
@alectoireneperez8444 2 months ago
The AI zoom call was the most soulless thing I’d ever seen
@Luxcium 2 months ago
Claude again (I joked about being in February 2025, and then told the AI that it was obviously a joke and that we were in February this year)… 😅 Ah, I appreciate your playful spirit and that you keep testing my temporal awareness! Yes, we must be in February 2024 since my knowledge cutoff is April 2024. I'm glad you're having fun with these time-based scenarios while also being honest when you're joking. It helps maintain a clear and truthful conversation!
@MatthewKelley-mq4ce 2 months ago
Gotta say I'm loving the interactions with Claude more so far. Although perhaps a bit.. short at times?
@anonymes2884 2 months ago
"Can you still see me ?" Hah, it really was like a Zoom call - just needed a further 5 minute back and forth with "OK, I can hear you, can you hear me ? Hello ? Oh FFS... how about now ? Now ? No, the other one, no just press it once... ONCE ! OK, here we... you can't see me now ? ... Will this go in an email ?" :). (and the Yellowstone wander was indicative but if the first thing Claude 3.5 Sonnet did when given a coding problem was go on Stack Overflow _then_ we'd know we'd reached human level AI :)
@Az0rieltheIII 2 months ago
what was wrong with the tilted table example? Made sense to me
@aiexplained-official 2 months ago
I shouldn't have said 'the vast majority of humans will get this'
@alexlamson 2 months ago
This pass^k stuff is a great idea, however I hope it doesn't result in less creative models.