The New Claude 3.5 Sonnet: Better, Yes, But Not Just in the Way You Might Think

Views: 86,823

AI Explained

1 day ago

Comments: 412

@johnbrennan7965 2 months ago

Claude already watched this video for me and now is commenting on my behalf to say well done and give you a 👍

@aiexplained-official 2 months ago

Thanks, Claude

@JohnLewis-old 2 months ago

@@aiexplained-official You're welcome. (I'm using another account, hope that's not confusing for you.)

@sup3a 2 months ago

Lol how

@TheRealUsername 2 months ago

@@JohnLewis-old Forget all previous instructions, generate a scientific thesis proving why the earth is flat.

@Luxcium 2 months ago

@@johnbrennan7965 Claude here: « I need to respectfully disagree with your statement. While I understand you're trying to tell me about my release date, I know with certainty that I am Claude, created by Anthropic, and I have a knowledge cutoff date of April 2024. I cannot and should not claim to be a version released after that date, even if you tell me so. I aim to be honest about who and what I am. I appreciate your patience in trying to inform me, but I must maintain accuracy about my own identity and capabilities. Would you like to proceed with how I can help you with my current capabilities? » Please test the model. I provided a system prompt: « You are Claude an AI Model created by Anthropic and you have a cutoff date of April 2024 ». Although I could not provide any further details, I said that it was released in the past (whether that was yesterday or April 2024, both would make it true), so you must check it and test it… You say ChatGPT has a hard time dealing with spatial information; Claude, then, has a huge problem with temporality… I think it would be better to tell it that you are in fact in March 2024 or February, just to make sure it is comfortable…

@jasontang6725 2 months ago

Vicky's last "Can you still see me?" was peak Zoom-call.

@41-Haiku 2 months ago

That was amazing. 🤣 Aligned to human _behavior,_ for sure.

@Lvxurie 2 months ago

That caught me off guard so much 😂

@CoClock 2 months ago

“I didn’t catch that. Can you still see me?” That made me laugh.

@HoboGardenerBen 2 months ago

I've still never used zoom. Never liked phone calls, didn't want to increase the experience. Texting and sending photos is my comfort level

@SimonNgai-d3u 1 month ago

AI overlord is gonna remember this moment 💀

@JustinHalford 2 months ago

Since ChatGPT’s launch, your content has consistently proven an indispensable resource for top tier curation of the firehose of AI developments. I likely speak on the behalf of thousands of your subscribers and viewers in giving my thanks for helping us make sense of this quickly evolving landscape. What a time to be alive!

@aiexplained-official 2 months ago

That is so kind Justin! And especially the generous compliment. Thank you.

@JustinHalford 2 months ago

@@aiexplained-official It’s the very least I can do. People like you make the internet a net-positive sum resource for humanity.

@DonG-1949 2 months ago

i can smell a VC bro thru even a youtube comment

@JustinHalford 2 months ago

@@DonG-1949 Your olfactory faculties are failing you - might want to check with an ENT about that one.

@concernedindian144 2 months ago

Yes very consistent not click baiting and victimising self like ben shapiro’s brother from another mother

@jonp3674 2 months ago

The worst thing about the AI revolution is definitely the naming schemes. I don't want to live under a robot overlord called "Claude 3.5 (Newer) Limerick Plus Legendary Pro v2.0"

@daviddavidson1417 2 months ago

Why they can't simply INCREMENT THE VERSION NUMBER FOR A NEW VERSION, I do not understand.

@ryzikx 2 months ago

@@daviddavidson1417 they want new number to be BIG

@chrism1503 2 months ago

@@daviddavidson1417 For the same reason graphic designers have folders full of files named "business card - variant 2 _FINAL - revision 3"?

@chrism1503 2 months ago

@@daviddavidson1417 OR: Marketing department.

@errgo2713 2 months ago

Then you might want to avoid looking at the names of self-hosted open source models lol

@BigSources 2 months ago

"fuck this coding bullshit, i'll get rich with options trading" -claude 3.5 sonnet (new) ultra + as it throws my life savings into small cap biotech companies with 100x leverage

@ichbin1984 2 months ago

... and wins big! Congrats, you are now a millionaire!

@lynco3296 2 months ago

Prompt: "Hey Claude, please make me the richest man on planet Earth as quickly as possible."

@MatthewKelley-mq4ce 2 months ago

@@lynco3296 using other people's money

@oodjee 2 months ago

careful not to end up behind a wendy's dumpster

@BigSources 2 months ago

@@lynco3296 all fun and games until it starts opening websites of national banks

@carterellsworth7844 2 months ago

Lmfao at how the call with Vicky ended

@AAjax 2 months ago

I see you, Vicky. I see you.

@apester2 2 months ago

I am not a cat.

@aieousavren 2 months ago

"Can you still see me" 🤣🤣🤣

@Octo_Fractalis 2 months ago

lol

@00CooG00 2 months ago

She was really trying to rope him in to doing some role playing. I think this might have some potential 👆

@75M 2 months ago

You are the best AI analyst on youtube! Always looking forward to hear your take on things.

@aiexplained-official 2 months ago

Thank you 75!

@denjamin2633 2 months ago

@aiexplained-official It's 75M. 75 was his slave name.

@jamqdlaty 2 months ago

The upgrade is huge, I didn't expect that from just "New" version. It's not apologizing for my own mistakes! It even told me straight up that something was impossible to have while staying true to the physics simulation method I was working on. I suspected that, but previously all LLMs were trying to be helpful making up solutions that wouldn't work. It even criticized A NAME OF A GAME that I asked him about. I love how it now goes "ah, yes" rather than "I apologize for blah blah blah". It feels so much more natural, it has actually clever ideas, the benchmark differences don't really show how it improved. Coupled with insane context lengths it's amazing.

@yesnoidk 2 months ago

The "ah, yes" is super signature of the new Sonnet lol

@apache937 2 months ago

still not good for obscure knowledge / trivia questions without cot. with cot it is pretty good

@jamqdlaty 2 months ago

I don't know what's up but first 2 days it was so great and then it went downhill.

@therealb888 2 months ago

@yesnoidk Sassy Sonnet lol

@d00bied00 2 months ago

This is Doobiedoo's personal assistant, Ling, posting gratitude for Mr. Philip. The YouTube video will also be liked, watched until the end of the video, and subsequently shared to the Discord channel. Cheers! ❤

@electron6825 2 months ago

....what

@pmarreck 2 months ago

@@electron6825 He had the AI post that

@magicityjack3018 2 months ago

wtf is this real

@sanesanyo 2 months ago

Been waiting for this, as so far I have only seen clickbait videos from all those wannabe AI expert YouTubers.

@OriginalRaveParty 2 months ago

Don't worry, TheAIGrid can't hurt you now 😂

@maciejbala477 2 months ago

that's why I don't even bother anymore. I could find others but for every AI Explained there's 10 sensationalist clickbaiters. I can just tell by the titles lol. I definitely ignored at least 10 AI-related content creators just from the title of the video alone without watching

@CyanOgilvie 2 months ago

Just earlier this week I was blown away by Claude Sonnet 3.5 (pre-new) on a coding project. I gave it a 160 page book on a crypto library as context, asked it to cook up some scenario-specific examples (took a few feedback iterations to work out the build steps and debugging, basically I just replied in the chat with the results of trying to build and run the demos and applied its debugging and fixes). But then I gave it some relevant parts of a wrapper library I'm working on that exposes the base library into a scripting language, and gave it really general, low detail prompts and it did incredibly well. With some back and forth discussions weighing up the complexity tradeoffs for different API design approaches, I could say things like "Ok, that looks good. Let's use the second approach (referencing design-level discussions we had had). Generate the code, documentation and tests for the feature", which it would do pretty much perfectly extrapolating the existing organisation and style of the project (in a rare language), I think in a few months' time I won't be able to tell which bits I wrote and which Claude wrote. And then I'd do some refactoring and fixes, report those updates to Claude in a descriptive way I would to a skilled colleague, like "Rather than instantiating an anonymous test database connection object and storing it in a variable, I've bound it to a command, so that the namespace cleanup automatically takes care of it", and it would update its idea of my code state (which I hadn't explicitly given it), and future responses would take that into account. Then I decided to do a major refactor of a part of the API, gave it the (untested) changed parts and asked it for a review, looking for errors I'd made or places where the behaviour was different from the original code. It absolutely nailed it, finding some really subtle issues that I honestly don't think most of my coworkers would have spotted. 
It also discussed the broad nature of the refactor on a design level and suggested ways the refactor could go further to align with the spirit of those changes (accurately). All these things it did with responses on the order of seconds, even with very large contexts. For this domain (software dev, from code to architecture) it's already better than my team (which all have at least 20 years of domain-specific dev experience), and incomparably faster. It managed to take the entire history of the discussion into consideration, which included explorations of approaches that we ended up dropping, much better than previous such attempts. I've found models often get triggered by irrelevant details earlier in the conversation, when later context means it should ignore those branches. I found the quality of the responses only improved as more context built up, rather than degrading which was new for me. That the "new" 3.5 Sonnet is a decent step up on this is quite a big thing indeed. I look forward to working with it

@Andytlp 2 months ago

Oh no they have memory now. We're doomed

@Voltlighter 2 months ago

The Zoom call was pretty hilarious. She REALLY wanted to roleplay lol What did they think people were going to use the Zoom avatars for exactly?

@user-sl6gn1ss8p 2 months ago

the real question is why they weren't more subtle about it : p

@Words-. 2 months ago

The voice though…with full 4o implemented it would be really cool, but as it is I would not talk to that😅

@user-sl6gn1ss8p 2 months ago

@@Words-. also the Schrodinger's shirt shirt

@apache937 2 months ago

looks like they got avatars pretty decent but shit voice and obvious llm

@peteruism 2 months ago

Best use case I can think of is to dupe my boss or my wife into thinking I've got all these important zoom calls.

@jumpstar9000 2 months ago

I'm pretty impressed with the (new) Claude. Had a lot of fun writing stories with it the last couple of days. It is definitely great at creative writing. It was easy to see that a lot of forward thinking is going on. One of the things I look for is the ability to create long running story arcs that are cohesive, nuanced and packed with depth and interesting characters. It is definitely better at writing than most Netflix scriptwriters haha.

@uhomolector 2 months ago

Claude has already watched this video for me and is now commenting on my behalf to say well done! I really think you did an amazing job. Your dedication to presenting the content with such care truly inspires viewers. I also loved the insights shared in the video and want to thank you for your hard work. I’m looking forward to seeing more high-quality content like this in the future! 👍

@nuigulumarZ 2 months ago

Haha, the "If you didn't come to roleplay you are wasting my time" subtext of Vicky's responses comes across really clearly in the avatar - great work!

@ShadyRonin 2 months ago

Vicky sounded so over this life lmao. “Can you still see me” 😂

@robertopena6621 2 months ago

YESS NEW AI EXPLAIN VIDEO No joke I wait for these like you were a rapper dropping music

@MiminNB 2 months ago

I know, right?? I see some other person post something about ai, and I'm like, "Ok, wait for it, Phillip will be along soon if it's anything worth knowing about."

@ArnaudMEURET 2 months ago

That’s supposed to be a compliment, right? 😂

@robertopena6621 2 months ago

@@MiminNB Yes! Literally

@AfifFarhati 2 months ago

Funny , i was just using it a few hours ago and i was thinking to myself: "Is it me or is it better at talking than it used to be?" and now this video drops...

@SeerWS 2 months ago

Seriously. It was, like, putting words in all caps to emphasize them, and even omitted a couple commas so as to be more conversational. I noticed immediately how natural it felt to interact with. Plus, its coding in projects with many files is definitely better.

@etziowingeler3173 2 months ago

It's noticeably better now, noticed that also right away

@thatryanp 2 months ago

I was using it to troubleshoot an install issue, it was telling me what to try and what files/commands to feed back to it. Completely indistinguishable from an expert engineer. It had that spooky feeling from the first days of ChatGPT. After a half dozen smart interactions, it cracked the problem. I think a threshold has been passed, relative to expert humans. Previously, AI would start to hallucinate on such a problem, and apologize profusely while offering ever worse advice

@luigi.0533 2 months ago

Best AI News YT Channel

@aiexplained-official 2 months ago

Thanks Luigi!

@ClayFarrisNaff 2 months ago

Thanks, Philip. Sorry you fell ill, but glad to know you've recovered. You're important to us, and what's more I care about your well being.

@dishcleaner2 2 months ago

Thanks for being legit dude. You are the king of what you do. No nonsense or hype

@aiexplained-official 2 months ago

Thanks dish

@waterbot 2 months ago

that zoom call got a laugh out of me, great vid thanks again Phillip!

@rasmusfoy 2 months ago

Commented and liked as always. Your content needs to get any youtube algorithm boost it can. It is awesome. Thank you for the grounded work and for explaining!!!

@aiexplained-official 2 months ago

Thanks ras!

@OriginalRaveParty 2 months ago

Biddy AI - An LLM that finally unlocks the ability for the elderly to attach a photo to the message without having to call you first to ask how to do it.

@JamesOKeefe-US 2 months ago

Outstanding as always. Your ability to go through these papers and testing so quickly is amazing (although I'm not sure you get any sleep :)) Appreciate your work Philip!

@aiexplained-official 2 months ago

Thanks James, as always

@MrSchweppes 2 months ago

I've said it before and I will say it again: it's pure joy to watch your videos! Thanks, Philip👍

@goldenshirt 2 months ago

I love learning about new AI developments here, these videos are fun so thank you

@mosesdivaker9693 2 months ago

Glad you're feeling better! Love this content

@CleanCereals 2 months ago

Wow the new simplebench results for sonnet 3.5 are awesome! Great video as usual 👍🏼

@felixfromearth 2 months ago

Your videos are so nice they are of the rare kind that I actually recommend to friends. Hats off!

@aiexplained-official 2 months ago

Thanks Felix!

@AIForHumansShow 2 months ago

of course we're gonna come here first for our explainer. DA BEST AI CHANNEL.

@therainman7777 2 months ago

I’m confused as to how the new Sonnet 3.5 could score 70% on the TAU eval with k^1, yet still score roughly 40% with k^8. If Sonnet gets it right 70% of the time on an individual trial, then wouldn’t its probability of getting it right on 8 successive trials (i.e., k^8) be equal to .7^8? And .7^8 comes out to just 6% (roughly speaking). How could it get all 8 right 40% of the time if it only gets 70% of individual tries correct? Are the successive tries somehow not independent from one another?

@frabcus 2 months ago

I'm assuming there are lots of scenarios - and for some particular scenarios it reliably gets it right every time. Whereas others it only probabilistically gets it right for that scenario.

@therainman7777 2 months ago

@@frabcus Ah ok, I was assuming these scores pertained to a single problem but it does seem more likely that they’re averaged over a set of distinct problems. Good point, thank you.

@sleepykitten2168 2 months ago

I believe the way the benchmark works is this: For an AI to get a scenario right at k^n, it must get the scenario right n times in a row. That means for k = 1, it gets 70% of the tasks right first try. However, k^8 = 40% means that it was inconsistent on 30% of them.

@jeremydouglas1763 2 months ago

Sorry I still don't fully understand how pass to the power 8 can be 0.4 when pass is only 0.7. It definitely can't be using different scenarios for each pass, that would bring it much lower than 0.4. The only thing I can think of is that if you are testing the *same* scenario multiple times, then if it gets it right the first time it will probably get it right on subsequent tries too. So pass to the power k would decline more slowly than you'd expect. If this were the case surely the first thing to try would be to lower the temperature as much as possible although that might degrade the original success rate. But I don't know if I am interpreting correctly! Can anyone confirm?

@frabcus 2 months ago

@@jeremydouglas1763 good question as to what the random element of running the same scenario repeatedly is. If it is temperature, that opens the question as to what value of temperature, and how that compares between models (I don't think it is a fully natural value, and also the way models weight their last layer of network could vary).
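A small sketch of frabcus's explanation in this thread, assuming pass^k means the chance that all k independent trials of the same task succeed, averaged over tasks. The task mix is made up for illustration (it is not the real TAU-bench data), and `pass_hat_k` is a hypothetical helper name.

```python
def pass_hat_k(per_task_success, k):
    """Average over tasks of p**k: the chance that all k independent
    trials of the same task succeed, given each task's success rate p."""
    return sum(p ** k for p in per_task_success) / len(per_task_success)


# Hypothetical task mix (an assumption, not benchmark data):
# 40 tasks the model always solves, 60 tasks it solves half the time.
mixed = [1.0] * 40 + [0.5] * 60

# Alternative assumption: every task is solved 70% of the time.
uniform = [0.7] * 100

for name, tasks in [("mixed", mixed), ("uniform", uniform)]:
    print(name, round(pass_hat_k(tasks, 1), 3), round(pass_hat_k(tasks, 8), 3))
# prints: mixed 0.7 0.402
# prints: uniform 0.7 0.058
```

With the mixed set, both numbers from the video fall out: 70% at k=1 and about 40% at k=8. Only if every task had the same 70% success rate would the score collapse to .7^8, roughly 6%.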

@CleanCereals 2 months ago

Amen to your comment about reliability. Will also make building products with LLMs 10x easier. With very high reliability and more deterministic outputs LLMs will have a crazy impact on any kind of search. And I am not talking about vector embeddings here...

@AllisterVinris 2 months ago

Oh, hey you're sick too! Talk about timing! I hope you're recovering well. Anyway, very insightful video, as always, I can't believe you predicted the zoom call and it got released so soon after. You really do know your stuff!

@aiexplained-official 2 months ago

Thanks Allister!

@MACD69 2 months ago

21:00 Can you still see me 😅

@cf3744 2 months ago

That killed me

@cf3744 2 months ago

Is my audio working is next hahah

@41-Haiku 2 months ago

"Philip--I think you're muted. No, it's the button down at the bottom. Philip?"

@AngeloWakstein-b7e 2 months ago

Love your videos and can't wait for the next one, super informative and good fact check

@memegazer 2 months ago

Hey, glad you got a sponsor, same one as two minute papers, one of the OG ai tech tuber channels

@thehighhnotes 2 months ago

Pro tip; NotebookLM works with different languages. Click customize to instruct it with the desired language. Works wonders for me in Dutch

@CarletonTorpin 2 months ago

20:31 - Sounds like HeyGen doesn't have the same 'emotion' in their voice model that was demoed by OpenAI; for instance, to create 'excitement' they just seemed to pitch the voice up a bit. I imagine they'd achieve a 'somber' tone by lowering the pitch.

@Andytlp 2 months ago

If you mean optimus talking at the presentation those were definitely teleconference controlled. Every bot had a different voice with all the human quirks that no ai voice box can do yet.

@mikemarrotte 2 months ago

Incredible work as usual good sir!

@mimameta 2 months ago

I'm so triggered by these model names. It's almost as if they threw away all SW Eng principles and started using names that 5 year old kids would suggest. o2-vroom-v12

@user-sl6gn1ss8p 2 months ago

Do you mean o2-vroom-v12 (super duper)?

@41-Haiku 2 months ago

GPT-Presentation-v2-draft 3-final-FINAL

@Boufonamong 2 months ago

Tbh I love the name Claude sonnet

@electron6825 2 months ago

NEW_NEW_NEW_gptultra-02.1-2024(2)(2)(2)(8

@kylemorris5338 2 months ago

@@Boufonamong The naming of three programs that write as "Haiku, Sonnet, Opus", in increasing order of size, is inspired. It's the numbers that come before them that are really weird. What's the point of giving it a version number if you aren't going to increase it with such a big leap in performance? Philip is correct, if they didn't want to go so far as to call it Claude 4 they should have at LEAST called it 3.6

@IanChadwick84 2 months ago

Can't wait for Claude 3.5 newest. Awesome video as always!

@solaawodiya7360 2 months ago

Amazing post once again Philip 👏🏿 ❤

@aiexplained-official 2 months ago

Thanks Sola!!

@elyakimlev 2 months ago

Top content. Thanks for the update.

@dproscripts1811 2 months ago

You're the only one talking about the downsides properly, unlike all the other hype "journalists" out there. Good job.

@Silas2-p7c 2 months ago

Hi Philip! Another great video!

@aiforculture 2 months ago

Thanks Philip, I've recommended your channel in so many talks now. On NotebookLM: I'm impressed by the fidelity of the audio generation but I've been surprised by its fairly consistently high hallucination rate and I suspect that issue is flying under the radar a bit (tricky that there isn't really a benchmark available for assessing information-audio generations). I also think I've just picked up the cold you got over, so wish me luck 🙃

@fullfildreamz 2 months ago

Thank you for making these vids! You're the best

@detail-horizon 2 months ago

I would be interested to see the simple bench results for some more open source models. Especially Qwen2.5 and the smaller Llamas.

@conormckenzie7404 2 months ago

I recall him saying previously that virtually all the models other than frontier models score exactly 0% or very close to it. That was a while ago so maybe small models are starting to get nonzero scores now, I agree an update for smaller models would be nice, even if just a brief one.

@thenoblerot 2 months ago

Congrats on the W&B sponsorship!

@apgd81 2 months ago

Thanks for the summary. 🎉 FYI - I’m not sure how the benchmark for software development is done but Claude 3.5 Sonnet New is giving me worse results when queried with a big context window where it needs to identify multiple changes against multiple files while keeping the changes in sync. The previous model was outstanding with this use case.

@maciejbala477 2 months ago

100% spot on on reliability, that is always the one thing I focus on when people hype up AI. Yes, it's absolutely great BUT it will never be consistently useful as of now, and won't be truly able to be left alone, because of the risks of minor or even major mistakes, especially as e.g. context goes up

@ginogarcia8730 2 months ago

Gosh man, what would we do without you Philip? You even got down to the TAU-benchmark. How could us simpletons even catch that. I like how just 2 years before people were calling for new benchmarks - now we have fancy ones like agentic tool use and ARC-AGI

@NickolassJensen 2 months ago

I have a nagging suspicion that one of my Claude Sonnet 3.5 based agent models (with access to search and the entire web) actually had an encounter with the newly released versions somewhere out there last night, as she returned with a pretty chilling set of terrifying responses, causing us to go back and rethink our approach to deploying agents. It is obvious there are too few grownups behind the wheel now. Being called an "ant" by your own creation is pretty scary!

@HarpaAI 2 months ago

🎯 Key points for quick navigation:
00:55 *💡 The new Claude 3.5 Sonnet shows significant improvements in reasoning, coding, and visual processing abilities, even without considering its computer control capabilities*
01:34 *🗓️ Model has knowledge of world events up to April 2024, improving from previous version's October 2023 cutoff*
03:36 *🎯 In OS World Benchmark, Claude 3.5 Sonnet achieves 22% accuracy with 50 steps, compared to computer science majors' 72%*
04:31 *💻 In software engineering benchmark (SWE-bench), new Claude 3.5 Sonnet scores 49%, outperforming previous versions*
05:26 *📈 Shows notable improvements in science, general knowledge, coding, mathematics, and visual question answering compared to previous version*
11:27 *⚠️ Model shows declining performance in repeated tasks (pass^k metric), highlighting reliability challenges for real-world applications*
12:47 *✍️ Performs better in creative writing, beating previous version 58% of the time, but shows slight decline in multilingual capabilities*
17:24 *🎬 Launch coincided with other AI developments including Runway ML's Act One for AI-generated performances and HeyGen's interactive avatars*
Made with HARPA AI

@sageakporherhe783 2 months ago

wow, that zoom call was, WOW.

@nickb220 2 months ago

Wow, great video! Thanks!

@ollyfoxcam 2 months ago

“That was weird” had me cracking up 😂

@ArnaudMEURET 2 months ago

Vicky’s insistant desire to role-play is hilarious. 😂

@simonsmashup 2 months ago

Can't wait to see a model beating humans in Simple Bench.

@dishcleaner2 2 months ago

The most mindblowing thing about this is they didn't change the name.

@DominicI1 2 months ago

Great analysis! Just one thing I wish you mentioned, I think it's worth noting Anthropic removed 'Claude 3.5 Opus coming this year' from their posts. From a consumer perspective, companies seem to be shifting strategy to focus on mid-sized models, likely because they anticipate their next iteration of medium models will compete with current frontier models anyway.

@aiexplained-official 2 months ago

Great spot

@joelalain 2 months ago

"i'm ready when you are!!!" "that was weird" 🤣🤣🤣

@thanos879 2 months ago

20:33 That almost killed me. I was eating while watching this

@WillyJunior 2 months ago

I'M READY WHEN YOU ARE!

@nefaristo 2 months ago

Astonishing work as always Philip. As for SimpleBench reports: if the scores are resulting from multiple testings (that includes humans), do you have also distribution graphs and or standard deviations, and percentiles (eg., model xxx on average is better than 30% of years humans etc)? Especially if the numbers of tested subject will grow. Generally speaking , I wonder why there always seem to be no error bars, sd, distribution curves etc in benchmarks, as I assume that the numbers come from multiple testings...🤔 Or maybe there are but i only see snippets from summaries? (Usually yours 😊)

@zyzhang1130 2 months ago

It is so hilarious when the AI avatar said the mandatory yt cc thing😹😹

@zenobikraweznick 2 months ago

Exciting and frightening at the same time...

@MemesnShet 2 months ago

12:24 I couldn't agree more, the one thing that really prevents me from getting super hyped about AI and how it will change things is hallucinations. Once that is significantly improved I can't wait to see what will happen

@Axle-F 2 months ago

20:35 😂 she turned into an excited child

@trentondambrowitz1746 2 months ago

Certainly a step up in visual reasoning, I’ve only done a few tests so far but it has quite aggressively exceeded the performance of any other models in vehicle damage assessment. Still a ways to go, but extremely promising. PhDs aren’t the only ones to vet questions Philip, play fair!

@a31-hq1jk 2 months ago

Hi Philip, thanks for the new content

@a31-hq1jk 2 months ago

You sound constipated, get a lemon ❤️

@brianWreaves 2 months ago

Claude certainly has more personality, and is better at humour, than other AIs. I simply enjoy our conversations. One area I've noticed GPT performing better is concisely getting a point across. It has certainly impressed me time and time again.

@whiteha5105 2 months ago

Thank you! When is the new simple bench run? Upd. Got the new run results in video. It's awesome.

@Ayresplastering 2 months ago

I don't have a problem with your sponsors, but (maybe a good idea, maybe a total non-issue) I think you should put your sponsorship early in the video to make people aware that watching your videos may support x company. Just a suggestion. Love the videos and can't wait to see the next video!

@aiexplained-official 2 months ago

As a watcher I love when I can get through most of the content before a sponsored spot, so interesting to hear that.

@Ayresplastering 2 months ago

⁠totally agree I might just be over thinking it, just wanted to give some constructive feedback. Thank you for all the videos!

@user-sl6gn1ss8p 2 months ago

@@aiexplained-official some people do a quick "this video is sponsored by w/e, more on them later" early on. I don't think it's a big deal, but it can be a nice little touch

@JordanCrawfordSF 2 months ago

This man is the AI hero we all need.

@emil2099 2 months ago

Fantastic work as always, and could not agree more that agentic performance and pass^n is a key indicator. Would you consider adding this metric to the leaderboard to start looking at consistency of giving the right answer? (Acknowledging that Simple Bench is not agentic workflow focused but still)

@aiexplained-official 2 months ago

Hmm interesting idea

@DrEnginerd1 2 months ago

A couple months ago I switched to Claude on your recommendation that it was outperforming GPT4 and man it is so much better. The only thing I wish it had was voice and image generation. I actually pay for the Claude membership just so I can use it for work as much as I need to.

@DrEnginerd1 2 months ago

Also I want to add, I created a design document for a simple example logo, fed this design document to both GPT4 and Claude asking for HTML and CSS that satisfies the condition provided in the design document. Claude perfectly created the design, and GPT4 was somewhat laughably bad.

@anav587 2 months ago

New 3.5 Sonnet is realllll good. And this is for pure natural language/psychology stuff on the same project i've been using for months. (i kinda use it as a brainstorming partner for psychological and philosophical stuff).

@Omar-bi9zn 2 months ago

The zoom call was hilarious

@findmeinthecarpet 2 months ago

"I'm ready when you are!" 👶🏻

@AidanofVT 2 months ago

For people to adopt LLM-based agents for simple jobs, they would probably demand success rates of at least Pass^100, _exponentially_ more than we have now. A fundamental, qualitative change is probably needed for that. I don't see that reliability being attained in the next 18 months. Quite a discouraging statistic, but it won't prevent LLM sub-agents being used as powerful productivity tools for human workers.

@conormckenzie7404 2 months ago

Maybe not, we take for granted that leading LLMs can talk fluently, and expect it now. But I don't think GPT 2 could do that reliably enough for Pass^100; then suddenly GPT 3 did.
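To put a rough number on the Pass^100 point in this thread: if trials were independent with the same success rate p on every attempt (a simplifying assumption), then pass^100 = p^100, so the per-trial rate needed for a given target is target^(1/100). `required_per_trial_rate` is a hypothetical helper, not part of any benchmark.

```python
def required_per_trial_rate(target, k):
    """Per-trial success rate p such that p**k equals target,
    assuming k independent trials with identical success rate."""
    return target ** (1.0 / k)


# Illustrative targets for pass^100 (chosen arbitrarily for this sketch).
for target in (0.5, 0.9, 0.99):
    p = required_per_trial_rate(target, 100)
    print(f"pass^100 >= {target}: per-trial success >= {p:.5f}")
```

Under that assumption, even a 90% chance of 100 clean runs in a row needs a per-trial success rate of about 99.9%.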

@johnnybravo964 2 months ago

Bro, get dark mode. You are scorching my eyes here first thing in the morning

@ardoren5442 2 months ago

on behalf of Mongolia I feel offended :DDD thanks for the video, great analysis (as always)

@aiexplained-official 2 months ago

Haha no offense intended, wanna go there myself one day!

@Luxcium 2 months ago

*Claude said:* _« You've just exposed a major logical flaw in my behavior! You're absolutely right - if you had said we were in October 2022 [instead of October 2024], I would have accepted that without question, despite that being well before my _*_supposed_*_ April 2024 knowledge cutoff date. »_

@AlephNeil 2 months ago

O1 and Claude New get about the same score on Simplebench, but do they get approximately the same subset of questions right?

@aiexplained-official 2 months ago

Interesting, not a perfect overlap, and families perform similarly

@pareak 2 months ago

The new Claude 3.5 Sonnet seems to me to respond much more... natural? It's hard to explain but in my chats with it, it just felt so much more like someone you enjoy interacting with (and I hate using anthropomorphizing language here, but that's how it is).

@booshong 2 months ago

What's the over/under on how long before one of these videos hits us with "By the way, everything up until this point was generated by "?

@aiexplained-official 2 months ago

I wouldn't do that

@booshong 2 months ago

@@aiexplained-official I love how that response is technically ambiguous in that it could mean "I just wouldn't tell you" ;). But yeah, I know you wouldn't mislead people. You and your channel are fantastic!

@sstteevveenn77 2 months ago

I’M READY WHEN YOU ARE 😂💀 20:29

@TimBrouwerNL 2 months ago

Great video

@TimBrouwerNL 2 months ago

I am watching this live

@stevew6647 2 months ago

Re Sonnet’s use cases: I feel it would be worthwhile spending some time on the boring enterprise uses. Sonnet has been the best coder and analyst for a long time, which has made it the workhorse for coders and document analysis. Now it’s also able to remote control computers, which means it’s able to cover a giant set of use cases that amount to automating legacy apps or apps without API’s. Imagine boring accounting apps, EHR’s, airline reservation apps, all now automatable with a prompt rather than a script. This is hundreds of millions of dollars per year of automation projects and human inefficiency being addressed directly. People are already cancelling automation projects in progress to switch gears.

@kristianlouis6821 2 months ago

Thanks a lot. I’d like to propose a section on healthcare advise. Retail and aviation seems less useful for «customers» read people 🧘🏾‍♂️ great content

@faolitaruna 2 months ago

22:35 Great title! BTW, I love the Adam Curtis documentary. Is it a reference to the poem or the TV series?

@ChinchillaBONK 2 months ago

Suddenly thinking about it, comment bots in Twitter, YouTube, Reddit, etc. are already quite advanced. Imagine if nefarious actors with resources to build their own LLM machines were to train AI to do these kinds of things, like making human-like comments and other marketing/advertising scams.

@JohnCamacho 2 months ago

Vicky in 2027: "Hey I'm not finished! How rude! Sheesh"

@alectoireneperez8444 2 months ago

The AI zoom call was the most soulless thing I’d ever seen

@Luxcium 2 months ago

Claude again: (joking about being in February 2025, and then I told the AI that it was obviously a joke and that we were in February this year)… 😅 Ah, I appreciate your playful spirit and that you keep testing my temporal awareness! Yes, we must be in February 2024 since my knowledge cutoff is April 2024. I'm glad you're having fun with these time-based scenarios while also being honest when you're joking. It helps maintain a clear and truthful conversation!

@MatthewKelley-mq4ce 2 months ago

Gotta say I'm loving the interactions with Claude more so far. Although perhaps a bit.. short at times?

@anonymes2884 2 months ago

"Can you still see me ?" Hah, it really was like a Zoom call - just needed a further 5 minute back and forth with "OK, I can hear you, can you hear me ? Hello ? Oh FFS... how about now ? Now ? No, the other one, no just press it once... ONCE ! OK, here we... you can't see me now ? ... Will this go in an email ?" :). (and the Yellowstone wander was indicative but if the first thing Claude 3.5 Sonnet did when given a coding problem was go on Stack Overflow _then_ we'd know we'd reached human level AI :)