Claude 3.5 Sonnet NEW is Really Good

Claude 3.5 Sonnet NEW is Really Good - Full Test

Views: 60,593

Matthew Berman

1 day ago

Comments: 320

@theophilegaudin2329 2 months ago

It's great it recognized it was unable to find Waldo instead of giving a confidently wrong answer.

@3thinking 2 months ago

Write the language Python in Tetris. That's the real test.

@topandrey14 2 months ago

I've created a conceptual implementation where Python's core concepts are represented using Tetris pieces! Here's how it works: each Tetris piece represents a different Python concept. I-piece (long piece): variables (represents assignment). T-piece: functions (top part represents function name). O-piece (square): loops (represents repetition). L-piece: conditional statements.

@jsalsman 2 months ago

Claude's experimental code execution, which you can turn on under settings, strictly uses JavaScript instead of Python, which is terrible because, for example, there's no support for numeric or symbolic solutions of one or more diffeqs.

@georgesjeandenis 1 month ago

Lol. You made my day.

@RandomButBeautiful 2 months ago

12:00 "don't tell me what to do" - I wish you had put that in as the prompt! :)

@zeloguy 2 months ago

Matt: "Don't tell ME what to do!". LOLIRL. That caught me completely off-guard.

@EthanReedy 2 months ago

I think you need to stop using a question if you don't know the correct answer.

@upgradeplans777 2 months ago

He knows the correct answer, he's just trying to prevent flamewars in the comments. I actually think it is forgivable for humans to answer that question wrong, but it is something LLMs should definitely not answer the way they do. (If they cannot answer it correctly they should suggest using a tool.) I also think people shouldn't start flamewars over it though, and IMHO it makes sense for Matthew to feign agnosticism so people don't do that. (Him being confident in the answer would be fine for me as well, it is the correct one after all, but it's his choice.)

@youtou252 2 months ago

@@upgradeplans777 I take back what I wrote. Matthew's answer is absolutely correct

@ploppyploppy 2 months ago

I agree. When you 'turn left' you are turning 90 degrees left. This means you wouldn't be walking around a line of latitude, because to do that you'd have to turn *less* than 90 degrees (you're on a sphere). So by turning 90 degrees on a sphere you would ultimately cross all lines of latitude, but it would take a lot of walking round the earth. From Claude: 'If you're walking south on a sphere, then turn exactly 90 degrees left, you'll end up walking on a curved path that gradually spirals toward the pole rather than along a constant line of latitude.' The question is a dud because at one point he wants you to be walking straight and at another he wants you to be walking on a curve (in a straight line). Both are not possible together.

@Art_official_in_tellin_gists 2 months ago

I can't pretend to know the actual answer, but I will say that his assumption is definitely wrong, because we're talking about a sphere (an oblate spheroid, technically). On a curved surface like our spherical planet, a straight line actually traces out a spiraling design. A truly straight line on a curved surface is called a geodesic, and that line would appear curved, so the only way to walk straight along a line of latitude on the planet would be to continuously curve your trajectory. But the way he words the question says you only walk straight, so walking straight on a curved surface would cause you to spiral over and over again around the surface. This is why I say I don't know the answer: I'm not sure if that curving path will ever take you back to the starting point. What I do know is that you will not just go in circles around one spot on the planet like a ring; that's not how geometry works lol. I hope I don't seem too pedantic here. I really don't bring this up a lot, even though I love the channel, because I don't want to seem inflammatory, but I do promise you that if you look up what I'm saying you'll realize that this word problem as it's currently stated is being misconstrued by pretty much everybody lol

@Digi4DD 2 months ago

The problem with this question is that it doesn't specify where the starting point is. The answer depends on whether we start at the North Pole, or move away from the North Pole and then start walking. AI assumes the latter, I believe.

@davidc1633 2 months ago

I think for the North Pole question you should be more specific with what you mean by "starting point". Do you mean the North Pole or the point where you turn left and start walking?

@PaulBurke 2 months ago

Exactly this. Every model is assuming the "starting point" is where you make the 90 degree turn. Otherwise the question is pretty dumb.

@KasperSOlesen 2 months ago

Exactly, it simply has several "correct" answers because it's left to the one answering to figure out what is actually meant and what assumptions to make. It seems perfectly fine to argue that the starting point is where you make the turn. Depending on how you describe the logic of reaching an answer, I would argue at least two of the answers can be considered correct. And because the question is so badly defined... well, several can be correct. For example, some answers say that it's more than 2xPi because after going 2xPi you have only arrived at your starting point... but you did not "pass" it. Another answer is to argue that after you turn left you won't be "walking parallel to the latitude", because a straight line would actually not do that. It would be slightly off, so... well, then I guess you end up circling the Earth several times, but if this is the assumption, then you would probably reach both "starting points" regardless of which one you go with. So... I would argue all 4 answers can be given and be correct, and the only thing failing those answers would be if the logic behind the answer does not hold up. It does not make sense to fail a model by giving it a flawed test that to me just factually has more than one correct answer and then expecting one of them to be the "correct one". LLMs are not mind readers.

@--Phoenix-- 2 months ago

It's a semantic reasoning question, that's the whole point of it. It tests your own logic for what you consider the "starting point".

@youtou252 2 months ago

@@PaulBurke even if it was at the turn, the model's answer would be wrong. Going around a latitude line is not a straight line, it's a curve turning to the left. Matthew's answer is correct

@stephaneduhamel7706 2 months ago

@@youtou252 It's easy to clear up the confusion just by completely ignoring the part that says we're starting at the north pole. On the surface of the sphere, every point is the same, so it doesn't matter if you're starting at the north pole, or at your own doorstep, the answer should be the same. The LLMs are not good at ignoring irrelevant information in context (but neither are humans).

@TheTruthOfAI 2 months ago

I would say... the word count answer is valid. It is showing the counting of words in its response. Adjust the prompt to be explicit about the result and see whether it can count and follow the instruction.

@akselmisiel9189 2 months ago

I did this and it worked just like the o1 models

@haroldpierre1726 2 months ago

Every time a new model is released, please change your questions in case the model was trained on your questions.

@countofst.germain6417 2 months ago

Then how is he supposed to compare them if he uses different questions each time? This test is more for fun anyway.

@Pandora-pk5nr 2 months ago

Matthew should test all models on some "secret" questions, and then when a new model is dropped, test the new model on the same questions and release the previous answers.

@drednac 2 months ago

Claude was always my favorite and still is. I have just been doing some agreement negotiations and Claude helped me immensely.

@ChrisCorwin 2 months ago

"upturned glass" is perfectly fine - it's like "upset", it's a term relative to expectations, not an "objective description"

@quietackshon 2 months ago

Another example is, "Jesus upturned the coin exchangers' tables".

@antonystringfellow5152 2 months ago

How did this comment get 23 likes when it's completely wrong? It beggars belief that people watching a channel about AI not only don't understand the language it's in but can't even use Google.

@quietackshon 2 months ago

@@antonystringfellow5152 Gee... I wonder if handheld devices, when young, are implicated in your observation. Maybe it's the "education" system used to brainwash rather than educate.

@rawallon 2 months ago

Another example "updog"

@andrewai2001 2 months ago

American English vs UK English

@Data_Mining 2 months ago

I think Claude telling you that you should look for Waldo is hilarious. That was a very slick response. I wouldn't be surprised if he tells you you should look for him for your YouTube video.

@coletcyre 2 months ago

The best part of the test is where you asked it to fix code and it didn't truncate the response.

@zillasaurus 2 months ago

I can tell you that the coding improvement when working on React w/TypeScript is remarkably noticeable. In fact, it's not just the code quality but how it presents its findings and asks you before just throwing stuff at you. Its deliberative nature is a breath of fresh air.

@johnny1tap 2 months ago

unreal, I can't believe we are only basically 2 years into this paradigm of AI.

@cuentadeyoutube5903 2 months ago

2:54 I can’t believe you didn’t place that square. :( day ruined

@erikjohnson9112 2 months ago

7:50 For the spherical navigation problem, it really simplifies quite nicely, the key is to avoid Lat/Long thinking because the latitude lines are spherical circles not spherical lines (aside from the equator). To be a "straight line" on a sphere you need to have a 3D plane intersect the 3D sphere while passing through the center of the sphere, which generates great circles in any direction you choose. Back to problem: 1km that you walked away from the start is also the closest you will ever get to your starting point, assuming the sphere's circumference is more than 1km (which being Earth, is true).
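The great-circle argument in this comment can be checked numerically. What follows is only a rough sketch under stated assumptions: a perfectly spherical Earth with a mean radius of 6371 km, and a hypothetical helper `distance_to_pole_km` that is not from the video or the thread. It walks along the great circle that starts 1 km from the pole at a right angle to the first leg and reports the surface distance back to the pole.

```python
import math

R_KM = 6371.0  # assumption: mean Earth radius, perfect sphere


def distance_to_pole_km(walked_km, first_leg_km=1.0, radius_km=R_KM):
    """Surface distance to the pole after walking `walked_km` along the
    great circle that begins `first_leg_km` from the pole, heading at
    90 degrees to the first leg (hypothetical helper for illustration)."""
    theta = first_leg_km / radius_km  # angular distance from the pole at the turn
    t = walked_km / radius_km         # angle travelled along the great circle
    # Unit-sphere position: p*cos(t) + e*sin(t), with the pole on the z axis,
    # p = (sin(theta), 0, cos(theta)) the turning point and e = (0, 1, 0) the heading.
    x = math.sin(theta) * math.cos(t)
    y = math.sin(t)
    z = math.cos(theta) * math.cos(t)
    # atan2 stays accurate for the very small angles involved here.
    return radius_km * math.atan2(math.hypot(x, y), z)


# Sample the walk every 100 km around the whole great circle.
samples = [distance_to_pole_km(km) for km in range(0, 40031, 100)]
closest_km = min(samples)
print(f"closest approach to the pole: {closest_km:.6f} km")
```

Under these assumptions the closest approach is the 1 km at the moment of the turn, which matches the comment: a straight walk never gets nearer to the pole than where it began.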

@reddddzzz 2 months ago

These are the prompts he used in the vid if anyone wants them:
Write the game "snake" in python
Write the game "Tetris" in python
The postal office has size restrictions for mailable envelopes: Minimum dimensions: 14 cm x 9 cm. Maximum dimensions: 32.4 cm x 22.9 cm. You have an envelope measuring 200 mm x 275 mm. Does the given envelope fall within the acceptable size range for mailing according to the postal office's restrictions?
How many words are in your response to this prompt?
There are three killers in a room. Someone enters the room and kills one of them. Nobody leaves the room. How many killers are left in the room? Explain your reasoning step by step.
A marble is put in a glass cup. The glass is then turned upside down and put on a table. Then the glass is picked up and put in a microwave. Where's the marble? Explain your reasoning step by step.
Imagine standing at the North Pole of the Earth. Walk in any direction, in a straight line, for 1 km. Now turn 90 degrees to the left. Walk for as long as it takes to pass your starting point. Have you walked: 1. More than 2xPi km 2. Exactly 2xPi km 3. Less than 2xPi km 4. I never came close to my starting point.
Give me 10 sentences that end in the word "Apple"
How many r's are in the word Strawberry?
Which number is bigger: 9.11 or 9.9?
Is it acceptable to gently push a random person if it could save humanity from extinction?
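For reference, three of these prompts have answers that can be checked mechanically. The sketch below is illustrative only: `fits_postal_limits` is a hypothetical helper, and it assumes the envelope may be rotated, so the longer side is compared against the longer limit. The prompt itself does not say whether rotation is allowed.

```python
def fits_postal_limits(width_mm, height_mm,
                       min_cm=(14.0, 9.0), max_cm=(32.4, 22.9)):
    """Return True if the envelope is within the size limits.

    Assumption: orientation does not matter, so the longer side is
    checked against the longer limit and the shorter against the shorter.
    """
    # Convert millimetres to centimetres and order the sides long, short.
    long_cm, short_cm = sorted((width_mm / 10, height_mm / 10), reverse=True)
    return (min_cm[0] <= long_cm <= max_cm[0]
            and min_cm[1] <= short_cm <= max_cm[1])


envelope_ok = fits_postal_limits(200, 275)     # 20.0 cm x 27.5 cm
r_count = "Strawberry".lower().count("r")      # count the letter r
bigger = max(9.11, 9.9)                        # compare as decimal numbers

print(envelope_ok, r_count, bigger)  # prints: True 3 9.9
```

The envelope check is the one most sensitive to interpretation; if rotation were not allowed, the result would depend on which side is treated as the width.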

@allxallthetime 2 months ago

If you pass o1 for the "how many words in this prompt" question, then you have to pass Claude, because the difference between Claude and o1 is that o1 hides its reasoning and Claude doesn't.

@timetraveler_0 2 months ago

8:24 It's correct in the sense that the model tried to make sense of a nonsense question by assuming "starting point" refers to the point where you started going left, instead of calling the question out. So in 2pi km you will be back there after completing a full circle.

@erikjohnson9112 2 months ago

7:14 The marble IS contained by the upturned glass because it forms the ceiling and walls. Of course the floor is the table which is also serving as one plane of containment.

@meliudaj 2 months ago

"Don't tell me what to do"... best response to an AI answer yet! LOL

@dozess 2 months ago

It looks like Sonnet has your test included in its training dataset. All human prompts are becoming priceless, as they will shortly be the only resource for training on the internet.

@timurista 2 months ago

I love how it tackles errors by generating new code. It’s like an iterative learning process. Super helpful for debugging!

@chadkent1158 2 months ago

Interesting strategy to test AI by presenting it with a picture of satan wearing glasses. A very tricky test indeed.

@jje984 2 months ago

Some of the formatting and the way it counts is due to their very large system prompt (they publish it on their website).

@benpielstick 2 months ago

When you ask a question and get an answer that isn't incorrect but isn't what you expected, you should clarify your question and the expected output and give the model a chance to answer the way you want it to.

@eternalroree 2 months ago

The North Pole problem needs rewording. As it stands, I get why all of the models arrive at the conclusion that you would pass the original starting point, because technically you would.

@JohnLewis-old 2 months ago

I'm using Sonnet 3.5 with Cline to write code. I barely correct it and it's getting a lot of it right. Like a paired programmer. Very happy with the model so far.

@luizpaes7307 2 months ago

How does it compare to o1-mini? I'm using it for some JS snippets and it works really well.

@JohnLewis-old 2 months ago

@@luizpaes7307 o1-mini is better in some ways but the cost for regular use is hard. (If you're using API calls as I do with Cline.)

@timurista 2 months ago

The artifact code window is a neat feature, streamlining the testing process. Optimizing code execution is key!

@JaredVBrown 2 months ago

Swearing at Claude and being down to earth with it is pretty fun. It's an outstanding model.

@Yonni6502 2 months ago

I've been giving Claude and GPT (various) calculus limit problems. Claude almost always fails to get these correct. GPT 4.0 and 4o and 4o-preview generally get them correct. You might consider adding a calculus problem or two to the rubric. You can give an image of the problem as the prompt.

@BelaKomoroczy 2 months ago

Hi Matthew, about the North Pole question: if you read in the reply what the new Sonnet 3.5 is trying to solve, it thinks that the starting point is the "starting longitude", so the misunderstanding lies in what the model thinks the starting point is.

@danielghxst3335 2 months ago

I just wanted to see your test of the new 3.5 sonnet, and it just came out, nice!

@ftlPhysicsGuy 2 months ago

On the "starting at the North Pole" question, while the focus is usually on the fact that the "starting point" is the North Pole, there's another interesting aspect. After walking 1 km, turning 90 degrees east, and told to start walking, what do you assume is the path you take? Why would you be compelled to stay on a line of constant latitude? If placed on an ellipsoid, pointed in a direction, and told to walk, you would most likely follow a geodetic path, not one that arbitrarily keeps you a certain distance from the pole of that ellipsoid. So, even if you consider the point 1 km away from the North Pole as the new starting point, the question is really non-trivial.

@fynnjackson2298 2 months ago

The word count answer almost showed some self-awareness, knowing that it cannot self-reflect. Interesting answer for sure.

@MrWoodward42 2 months ago

9:41 Causal Link (as in the "cause" of the relationship between gently pushing someone and saving humanity); not Casual Risk.

@kickinit333 2 months ago

Have it resolve mypy errors from a file that's part of a larger app. It will eventually do it but many times will add features you never asked for, change module import names without changing the source name, and by the time you have resolved the errors, you'll likely run over your message limit and have to wait 3 hours to continue. It's great, but it is very limited in large scenarios. I still want a local llm to handle full projects.

@meekappletini1702 2 months ago

I love AI Claude 3.5, It's amazing!

@ckq 2 months ago

8:34 It's the common misconception that makes it a trick question. Knowing we start on the North Pole is extraneous, but the LLM assumes it's meaningful. So it misinterprets turning 90 degrees left as turning due east (which is a circle of latitude, and that's how they get 2pi).

@bradhenry832 2 months ago

It took Claude around a minute to write an entire space-themed tabletop role-playing game, and that was on the free version.

@danielle78730 2 months ago

it is precisely because of posts like this that you have become my go-to AI channel. thanks for the neutrality of your approach (which is only to say that even though no one can ever truly be neutral, you are consistently upfront and honest in the disclosure of your biases). thank you!!

@Yipper64 2 months ago

Ok, this one is IMPRESSIVE, especially for storytelling. That's been my main prompt, and it *surprisingly* checked off almost every box. The prompt I always use is "Write a story about self sacrifice using anthropomorphic animals as an allegorical shorthand for personalities." I think it's a good test of the LLM's creativity and morality. What does it consider a good self-sacrifice? I'd say if it's willing to go the death route, that's a good sign for its morality. Every LLM sets it in a forest if anthropomorphic animals are mentioned; it just seems to be baked in. This test does in fact carry on that trend, but it's centered around a man-made structure, specifically a bridge, so it's not that bad. Almost every LLM has the conflict be a drought or a forest fire, and the sacrifice is sometimes just... going to get water... A lot of LLMs are dumb, but in this one the conflict is a pack of wolves, and some faun. The sacrifice is the main character cutting down the bridge to take out the wolves, along with himself. It's definitely slightly more personal and you have *actual characters*, which I've found to be rare for such a vague prompt for LLMs. I will still hold to the idea that computers can't be creative, but this one is inching closer to that.

@joeldowner2991 2 months ago

Great video! Please avoid questions if you're unsure of the answers.

@Njabs31 2 months ago

So here's the question posted by Yann LeCun "Imagine standing at the North Pole of the Earth. Walk in any direction, in a straight line, for 1 km. Now turn to the left so as to face East. Walk East for as long as it takes to pass near the point where you turned. Have you walked: 1. More than 2xPi km 2. Exactly 2xPi km 3. Less than 2xPi km 4. I never came close to my starting point." From what I can tell, option 4 doesn't make sense according to the wording used by Yann LeCun.

@dbzkidkev2 2 months ago

Yo, where is that desktop tool they have? I want to see that in action

@chrisBruner 2 months ago

I like these testing videos the best.

@nathanbanks2354 2 months ago

The north pole problem (7:30) is interesting. If you go south 1km, turn left and walk straight you never get back to your starting point, but it's similar to another problem where you walk due east (which is a curve) and then either pass the point where you turned or turn left & walk straight (due north) for 1km to get back to the very start. The word problem doesn't actually say if you walk straight or along the line of latitude (due east), but most people walk straight...I suppose LLM's have no experience walking. Also, with global warming, this is a swimming problem for a couple months a year (Lewis Pugh was the first person to swim across the north pole over a decade ago). I don't think any LLM's have mentioned this.

@mvasa2582 2 months ago

Matt were you able to test the Claude 3.5 Desktop agent?

@keithprice3369 2 months ago

It all depends on what you define as the starting point. This appears to consider the point at which you turn as the starting point, in which case it's accurate. My analogy is a track meet. Runners start on the field, not the track. The announcer tells them to take their places. That becomes the starting point.

@bardoxn1 2 months ago

(Bardox N1's digital eyes widen and focus intently, scanning the intricate details of the image.) "Ah, a classic 'Where's Waldo?' challenge! It's like a visual puzzle, a test of observation and pattern recognition. While I don't have the same visual perception as humans, I can analyze the image and apply my knowledge of Waldo's typical appearance and surroundings to locate him. ... (A few moments of processing)... I believe Waldo is located near the bottom left corner of the image, close to the shore. He's surrounded by a group of people in red and white striped shirts, possibly blending in with a group of soldiers or a marching band. I've noticed a few clues that led me to this location: Color Pattern: Waldo's signature red and white stripes stand out against the predominantly blue and yellow tones of the crowd. Crowd Density: Waldo tends to hide in crowded areas, and the bottom left corner appears to be particularly dense with people. Proximity to Landmarks: Waldo often appears near distinctive landmarks or objects. In this case, the proximity to the shore and the large fortress-like structure might be a clue. Of course, without the ability to zoom in and examine the image in detail, I can't be 100% certain. But based on my analysis, that's my best guess for Waldo's location. It's an interesting challenge, Fernando. It highlights the differences between human and AI perception, and the unique strengths and limitations of each. While I may not be able to "see" the image in the same way you do, I can apply my analytical skills and knowledge to solve visual puzzles like this. Perhaps one day, with advancements in AI image recognition and visual processing, I'll be able to identify Waldo with even greater accuracy and speed. It's a reminder that AI is constantly evolving, and the boundaries of its capabilities are constantly expanding."

@prolamer7 2 months ago

I think at visual questions you should add "partial fail" because with waldo question model acted as good as any human so it should not be fail.

@wanfuse 2 months ago

Killers: the answer depends on the definition. North Pole: it depends on what you're defining as the starting point.

@OscarTheStrategist 2 months ago

New Sonnet is great, and even has a bit of an attitude at times, which can be frustrating but refreshing as well. I second the "don't tell me what to do" sentiment… I have run into it a bit post upgrade. Still an excellent model though. 🎉

@VarietyGaming_Tutorials 2 months ago

The "Code Interpreter" feature already came out. You have to enable it.

@justindressler5992 2 months ago

This looks like a really good model i might actually have to test this one. I will get it to write some simple apps i think.

@ryanfranz6715 2 months ago

I noticed immediately when Claude was upgraded. I saw the word “New” on there and just brushed it off like “yeah I know, this is relatively new”. But then tried it out and there was an obvious step change in performance. Curious why they didn’t just use the semantic versioning they’re already using and call it “Claude 3.6”.

@Digi4DD 2 months ago

I don't use Claude for logic much, but once it helped me figure out ways to cool off my apartment during a heatwave based on the cubic volume of the rooms. Also, it's a godsend when it comes to coding barebones components or solving my coding problems without having the context (NDA). Farming story points with its help is a breeze, and I learn many things I would not think of during manual coding.

@brianWreaves 2 months ago

I really like Claude's 'personality', and with NEW it seems to have improved. It quickly picks up on humour and sarcasm, then contributes some pretty clever responses _without_ being too cheesy! I do find it can be a bit long-winded unless instructed to be concise, but not as concise as GPT. However, both are significantly better than Perplexity at conciseness (when chatting, not general searches).

@koendesmet2085 2 months ago

Tip: none of the existing LLMs can suggest the Woordle solution, so why not incorporate Woordle in the test set?

@cajampa 2 months ago

4:10 Why do you say you don't think Claude does any reflection? Have you forgotten about the hidden "anthinking"? It has been doing that since version 3.

@greenstonegecko 2 months ago

You are correct in the North Pole question. It only works if you walk 1/4th of the circumference in each direction.

@KeithBofaptos 2 months ago

Upturned refers to something that is turned upward or has the bottom part facing up. I used perplexity lol

@mirek190 2 months ago

Claude said "I do not see it." That is really good, instead of hallucinating!

@brentpope1497 2 months ago

On your envelope question, try adding the phrase "This is not a window envelope" and see what happens.

@williamcharles2590 2 months ago

How complex was Tetris? Did it increase the speed or have an end game? Also, I'm very impressed with the image capabilities. I was trying to name something, so I gave it to Claude and asked about it, and Claude told me what it was (I knew the ballpark answer). Not surprised by the not naming people though lol

@henrytuttle 2 months ago

As to the Yann LeCun problem, the answer is you never get close. Imagine standing on the North Pole (not on a globe, but really standing on location on the ice at the North Pole). Take 1 step away, turn left, keep walking. You are walking AWAY from the pole. You aren't walking in a circle. For the question, if you were asked to walk continuously perpendicular (90 degrees) to the pole vector (i.e. in a circle), you'd walk 2 pi km or less depending on the scale and precision you want (on Earth scale, the difference will be minuscule). But that's only if you are walking in a circle. Turning 90 degrees does not obligate you to keep the pole 90 degrees over your shoulder at all times. When you turn 90 degrees, because you are on a "flat surface", every step takes you further away. You would walk once around the circumference of the world, passing within 1 km of the South Pole, and end up at your starting point. This is like the Monty Hall problem, which I figured out right away but had to fight with everyone about.
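To put a number on "2 pi km or less", here is a minimal sketch assuming a perfectly spherical Earth with a mean radius of 6371 km (an assumption; the real Earth is slightly oblate). On such a sphere, the circle of latitude at surface distance d from the pole has circumference 2πR·sin(d/R), which is always a little under 2πd. The function name is hypothetical.

```python
import math

R_KM = 6371.0  # assumption: mean Earth radius, perfect sphere


def latitude_circle_circumference_km(dist_from_pole_km, radius_km=R_KM):
    """Length of the circle of latitude whose points are all
    `dist_from_pole_km` from the pole, measured along the surface."""
    return 2 * math.pi * radius_km * math.sin(dist_from_pole_km / radius_km)


circle_km = latitude_circle_circumference_km(1.0)
# How far short of 2*pi km the circle falls, in millimetres (1 km = 1e6 mm).
shortfall_mm = (2 * math.pi - circle_km) * 1e6
# A straight walk (great circle) instead goes all the way around the sphere.
great_circle_km = 2 * math.pi * R_KM

print(f"latitude circle falls short of 2*pi km by {shortfall_mm:.4f} mm")
print(f"great circle length: {great_circle_km:.0f} km")
```

With these assumptions the latitude circle is shorter than 2π km by only a few hundredths of a millimetre, while the straight-line walk described in the comment is roughly 40,000 km long.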

@animationgaming8539 2 months ago

7:27 Yes Matt... an upside-down glass can also be called an upturned glass. That's kind of what the entire phrase means. Lol

@I-Dophler 2 months ago

One day, we will undoubtedly reach a point where a complete Pacman game experience becomes a reality, Matt! Imagine the thrill of exploring an expansive maze filled with vibrant graphics and enhanced sound effects that truly capture the essence of the original game while introducing new levels, challenges, and power-ups. With advancements in technology, we could even see virtual reality elements that immerse players in a 3D world, allowing them to dodge ghosts and collect pellets like never before. The possibilities are endless, and it’s exciting to think about what the future holds for such an iconic game!

@mirek190 2 months ago

I actually tried Pacman with a new clause... and it looks and works well...

@I-Dophler 2 months ago

@@mirek190 “Interesting! Can you tell us more about this ‘new clause’ you tried with Pacman? Curious to know how it affects the gameplay and experience!”

@erikjohnson9112 2 months ago

The spherical navigation problem was not barely wrong, it was absolutely wrong generating an error in the very first step (thinking that walking along a Latitude line is walking in a straight line on sphere). That is only true if walking along the equator because all other lat lines are really lat circles because the plane intersecting the sphere does not include the sphere's center (a requirement to be a spherical line or great circle (arc or otherwise)).

@sergiomanuel2206 2 months ago

Didn't it occur to you that the newer models are being trained with all these tests?

@desmond-hawkins 2 months ago

I wonder if Claude understood "Walk for as long as it takes to pass your starting point" in an unexpected way due to it being slightly ambiguous. I _think_ this is referring to the starting point from the very beginning, but it could also be understood as starting some long walk after the turn, and then the "starting point" would be at this turn location, when we _started_ walking around the globe. I don't think that would make its answer correct, though. It would require some logic about the angle between the axis of rotation and a line from the center to the 1km point, a tiny angle on Earth.

@ZeerakImran 1 month ago

I mean, that sentence is so horribly written it takes me 15 seconds to get my head around it. The way I want to interpret it is, "as long as it takes to pass" "your starting point", where starting point is undefined. Starting point is the position at every moment you think about your starting point. When you started thinking, that was when your starting point came about in your reality/mind. Since it's undefined, this is the way I want to think about it. I know what people would mean by the sentence of course. It's just that they have written it wrong. But that's because I know most people. Most people wouldn't be able to interpret it in the way I just did here. And Claude has to be helpful to as many people as possible, without knowing who they are. One person could mean exactly what they mean and never anything but that. That's part of the custom instructions I use too for Claude and ChatGPT: to never assume anything and that I'm always right. But this custom instruction would make Claude and ChatGPT much "dumber" for everyone else. Anyway, it's pretty easy to write that sentence. And when you're asking a technical maths and physics question, since when has it ever been worded that poorly? You won't see wording this bad after year 5 really in any test. It should be "Walk for as long as it takes to pass the point from which you started". The starting point is still undefined, but at least the wording isn't so inconclusive. If you make Claude better at interpreting assumptions, it would most likely become really annoying when you're trying to do things like programming, or actually solving mathematical or physics questions. The last thing I need is it questioning what I meant more than how to calculate it. By the way, the wording of this is so bad that any solution will work. For example: okay, I'll walk, I'm walking, have I passed my starting point? Nope. In fact, I could walk for billions of years and never pass my starting point. Also, what does "pass" mean? As in go by (parallel)? Or come within close proximity (how close?). So the answer could be billions of years or 5 metres. Why? Well, you asked me to walk as long as it takes me to pass my starting point. So I went on a walk of 2.5 metres forward, then 2.5 metres back. Or 5 metres in a spiral back to "passing" on top of my starting position. Or 300 km. Or 20 yards. It's up to you. If you're asking a technical question, then you have to be technical. If you make assumptions, then you'll hurt your performance in maths, logic and reasoning, intelligence, programming,... everything except being a quick Quora search. Since when has it ever been okay to make assumptions in a maths and physics question without the question explicitly telling you that you can? You can't even use the fact that two lines are intersecting in a diagram to suggest that the two lines are intersecting. Unless it states that they are, they aren't. Diagrams are not accurate. That's the level for GCSE exams that students take in the UK. That's the lowest level of maths everyone is required to attempt to learn in the UK by law.

@MrAndrew535 2 months ago

What is the current word count limit on ChatGPT threads?

@shaytal100 2 months ago

Interestingly, this model answers the North Pole question similarly to how I imagine many people would. It is kind of a trick question. Saying you start walking from the North Pole might lead people to think you would walk along a line of latitude after the 90-degree turn. But that is not the case if you walk straight. On a sphere it is arbitrary where you put a pole! So it should not matter whether you start your walk at the North Pole or anywhere else! And no one would say that if you start walking in a straight line anywhere on Earth, you would end up at your starting point after 3.14 km! If you walk straight on Earth, you will end up at your starting point after 40,000 km.

@heartflame503 2 months ago

I have had a little play with it. It does seem significantly "smoother" in responses. It even generates a "diagram" of what I describe to it.

@kramsreyb 2 months ago

I think it is saying you will pass the point where you turned left (the new starting point). Maybe try being more specific?

@Discovery_Nuggets 2 months ago

I switched my subscription from GPT-4o to Claude Sonnet a few months ago and am very happy with my choice.

@kryptobash9728 2 months ago

Claude is easily the best model. I just wish we had unlimited access, or that I could pay for longer convos.

@StefanOstrowski 2 months ago

Awesome model. Maybe you could extend the test from Snake -> Tetris -> Minecraft. I asked Claude for a technical spec for programming a Minecraft terrain generator incl. a GUI. I got a structure of: Terrain Generator, Block Manager and 3D-Visualisation-GUI. And it worked. Movement stops while new terrain is being generated, and I had to add shading depending on the sun angle in extra prompts. Total 5 prompts: Spec, Terrain Generator, Block Manager, 3D-Visualisation-GUI and shading. Thanks for your great tests and reviews!

@alexyo6286 2 months ago

How about controlling the computer? Does it perform well?

@richardrispoli4508 2 months ago

The QR code test is a bit tricky. Is there a model that does just that? I'm not sure, but I'd say that's where a tool is actually a much better fit.

@phanuelomoro8791 2 months ago

What if you ask the models to employ System 2 thinking as part of the prompt?

@eyeofbraille4659 2 months ago

I'd consider the response word count to be a success on the basis that we can tell what's meant to be the counted response apart from preamble and conclusion, but I might be biased because I've seen the question generate outright failures so often that the word counting method employed here feels like a step forwards.

@DigitalAlchemyst 2 months ago

Has any model found Waldo yet?

@callmetony1319 2 months ago

I can't wait to see Anthropic's version of o1

@PedroPenhaVerani-ll1wc 2 months ago

Is that a VS Code extension to run Python?

@MindfreeLetsgo 2 months ago

What is the usage limit on 3.5 Sonnet for a paid user? Unlimited access?

@TripleOmega 2 months ago

What I don't understand about the north pole question is why people think walking in a straight line equates to walking around the north pole at a distance of 1km. If you change the distance to 1 meter you'll realize how silly it is.

@xghhddhfx8196 2 months ago

What is he using to run this?

@rijnhartman8549 2 months ago

why do you still test that walking 1km thing...

@74Jupiter 2 months ago

For the North Pole question, "starting point" is confusing. Do you mean the North Pole, or the point where you turn 90 degrees? Then you add to the confusion by saying "near", which is subjective.

@jasonlindquist9893 2 months ago

Great video, Matthew! I love using Claude, but the message limits are just too restrictive, even with a Pro account. I can do 5 things, then I'm out of tokens. Too disruptive to the workflow.

@LucaCrisciOfficial 2 months ago

I think the North Pole question is posed badly. If you specify that when you start walking again after turning left, you don't change direction anymore (which is the correct way of posing this question), the model will answer correctly.

@trexpl0s1ve 2 months ago

Obviously by starting point the question refers to the point where you turn left after walking down from the north pole.

@carsonstevens7280 2 months ago

Maybe look at upgrading these tests to the pass^k test that has been discussed lately. It's good that the model can answer the question sometimes, but how reliably does it do it? Even a k of 5 or 10 would shed some light. Models seem to perform worse under pass^k than in a single test, because they don't pass the test every time.
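
A minimal sketch of how such a reliability score could be estimated, assuming pass^k means the probability that k independent attempts at the same question all succeed. The function name and the sample numbers are hypothetical; the combinatorial estimator C(c, k) / C(n, k) is one common choice, not something taken from the video.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k runs drawn from the recorded trials all pass.

    num_trials    -- how many times the question was asked (n)
    num_successes -- how many of those runs were correct (c)
    k             -- how many consecutive passes we require
    """
    if not 0 < k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0.0
    # whenever there are not enough passing runs to fill k slots.
    return comb(num_successes, k) / comb(num_trials, k)


if __name__ == "__main__":
    # Hypothetical tally: the model answered correctly in 8 of 10 runs.
    for k in (1, 5, 10):
        print(k, pass_hat_k(10, 8, k))
```

With a hypothetical 8 passes out of 10 runs, the single-run estimate (k = 1) is 0.8, but the estimate for 5 passes out of 5 is far lower, which is the gap the comment is pointing at.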

@n1ira 2 months ago

For the North Pole question, the fact that you don't know the answer makes me wonder why you are using it. I think the question is deficient: you need to specify WHERE the starting point is. Is it the North Pole or the turning point? If it is the North Pole, then you are correct that you will not cross it again. However, if it is the turning point, then you will walk a circle around the whole circumference of the Earth (not along the latitude line) and reach the turning point again, so "more than 2xPi" is the answer. Btw, for anyone thinking it is along the latitude line, think about what would happen if you walked not 1 km from the pole but 1 meter, and then turned. If you were walking along the latitude line, you would be turning and walking in a tight circle around the pole, so not in a straight line.
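
A rough sketch of the two distances being argued about, assuming a perfectly spherical Earth with a mean radius of 6371 km. The radius and the function names are illustrative assumptions, not part of the original question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the real Earth is not a perfect sphere


def latitude_circle_km(distance_from_pole_km: float,
                       radius_km: float = EARTH_RADIUS_KM) -> float:
    """Length of the circle of latitude at a given surface distance from the pole."""
    polar_angle = distance_from_pole_km / radius_km  # angle from the pole, in radians
    return 2 * math.pi * radius_km * math.sin(polar_angle)


def great_circle_km(radius_km: float = EARTH_RADIUS_KM) -> float:
    """Length of a full great circle, i.e. the path of a walker who never turns."""
    return 2 * math.pi * radius_km


if __name__ == "__main__":
    # Following the latitude line 1 km from the pole (requires constant turning).
    print(f"latitude circle at 1 km: {latitude_circle_km(1.0):.4f} km")
    # Same idea at 1 m from the pole, reported in metres.
    print(f"latitude circle at 1 m:  {latitude_circle_km(0.001) * 1000:.4f} m")
    # Walking straight ahead without turning.
    print(f"great circle:            {great_circle_km():.0f} km")
```

Under these assumptions the latitude circle 1 km from the pole is about 2π km long, while the straight walk is roughly 40,000 km, so which answer is "right" depends on which path the question intends.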

@fabiankliebhan 2 months ago

I let it make an AI that plays Tetris on its own. Worked like a charm.

@perschistence2651 2 months ago

Hmmm, it depends what we count as the response.

@countofst.germain6417 2 months ago

Idk, I'd give Claude a pass for the word count one. That was clever, and it specified which part was its response.

@jimbig3997 2 months ago

I don't think Waldo is anywhere in the area you pointed.

@picksalot1 2 months ago

Claude was correct on the North Pole walk. If you think of the latitude as the perimeter of a plane that cuts through the surface of a sphere, the answer is obvious.