So is Claude 3 better than GPT-4? What do you think? Join my newsletter for the latest and greatest AI news, tools, papers, models, and more: www.matthewberman.com
@mrpro7737 (6 months ago)
I am interested in evaluating Claude 3's coding proficiency. I would appreciate a comprehensive benchmark focusing solely on coding capabilities
@MrJaggy123 (6 months ago)
Is it better? Not for Canadians: they didn't let us use Claude 1, or 2, and I tried 3 and they still say "no". 😛
@MyName-tb9oz (6 months ago)
You missed a second fail for Claude on the "...ending with apple" test: one of the sentences _does_ end in "apple," but it should have ended with "apples" because of the way the sentence was constructed. I'd almost say it failed twice on that one sentence.
@protein4960 (6 months ago)
I think both models assumed the cup had a lid on top, which is why they got it "wrong," but their reasoning process was sound.
@eirikgg (6 months ago)
I did the t-shirt test, but in my language, Norwegian, and it failed. The same prompt to GPT-4 succeeded. So either the test is in the training data, or more general language data = better reasoning.
@OculusGame (6 months ago)
Not sure if you've seen it on Twitter, but someone at Anthropic mentioned one pretty crazy instance during internal testing, specifically the 'needle-in-the-haystack' test. Here's what Opus said: "The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association. However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping 'fact' may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings." Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.
@nufh (6 months ago)
That is freaking cool.
@DeSinc (6 months ago)
This is the thing I dislike about needle-in-the-haystack testing. If you put a random sentence in there somewhere, it is statistically at odds with the words around it and produces an "unlikely patch" of text that wouldn't occur naturally, which means it should be quite simple to pick out. Even if you just change one word, it's still an unlikely word, so it will stick out based purely on which words usually follow each other in the texts the model has "read" before. To really do this test correctly, you'd have to give it a real text and ask it a real question about the contents somewhere in the middle, not insert something yourself that makes it stand out heaps.
@jd_real1 (6 months ago)
Amazing
@EvertvanBrussel (6 months ago)
@DeSinc I hadn't thought of that before, but it actually makes sense. Though I'd still argue for testing both ways. It's interesting to know both whether it can point out the anomalous needle in the haystack AND whether it can answer a question about a very minor fact (one that isn't out of place) mentioned in passing somewhere in the middle of the text.
@kenhtinhthuc (6 months ago)
Once the lost-in-the-middle issue is solved, finding the odd one out is straightforward thanks to semantic search, which is based on meanings rather than keywords. More specifically, meanings are turned into vectors via embeddings, which allow (cosine) similarity comparison.
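A minimal sketch of the similarity comparison kenhtinhthuc describes. The toy 3-dimensional vectors stand in for real embeddings (which typically have hundreds of dimensions), and the values are invented purely for illustration:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the out-of-place needle sentence points
# in a different direction than the on-topic sentences.
on_topic_1 = [0.9, 0.1, 0.0]
on_topic_2 = [0.8, 0.2, 0.1]
needle     = [0.0, 0.1, 0.9]

print(cosine_similarity(on_topic_1, on_topic_2))  # high: similar meaning
print(cosine_similarity(on_topic_1, needle))      # low: semantic outlier
```

The semantic outlier scores far lower against the on-topic vectors, which is exactly what makes a planted needle easy to surface.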
@ChristianIce (6 months ago)
Strong disagree on the definition of AGI, for multiple reasons. 1) The term AGI was introduced to mark the difference from AI: where AI mimics human intelligence, AGI is actually self-aware and has a real understanding of its inputs. So we are moving the goalposts, and we now need a new term for that kind of artificial intelligence. 2) AI was already better than most humans at most tasks; Quake bots were better than any player 20 years ago. 3) When we built a car that was finally faster than a horse, we kept on calling it a car. 4) The discussion itself is not even about semantics, but serves as a technicality in the deal between OpenAI and Microsoft. 5) It's all hype and funding. AI is getting better, but it is the same thing, it works in the same way, it mimics, simple as that.
@Hunter_Bidens_Crackpipe_ (6 months ago)
The claims every team makes are always way above what they can deliver
@CosmosWorld (6 months ago)
That's called marketing
@Gafferman (6 months ago)
That's why we test
@codycast (6 months ago)
It’s so hard for me to trust that AI is giving me actual information after the blatant racism that came out of Google
@filipo4114 (6 months ago)
I just tested Claude for 3 hours coding a physics simulation in Python using PyTorch, and it's at least as good as GPT-4. The physics it came up with was a little bit wrong, and the method for computing partial derivatives in PyTorch was a little bit wrong, but overall it wrote code that worked. After spending the whole $5 they give you at the start, I can't wait for them to release the weights! It's the first model that is actually on par with GPT-4 for me.
@scottcastle9119 (6 months ago)
Yea no kidding, openai is way ahead
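On the slightly-wrong partial derivatives filipo4114 mentions: a standard way to catch a buggy gradient implementation is a finite-difference check. A minimal pure-Python sketch (the example function is made up; in practice you would compare these numbers against PyTorch's autograd output, e.g. `torch.autograd.grad`):

```python
def finite_difference_grad(f, x, eps=1e-6):
    """Approximate the partial derivatives of f at point x using
    central differences: (f(x+eps) - f(x-eps)) / (2*eps) per coordinate."""
    grad = []
    for i in range(len(x)):
        hi = list(x); hi[i] += eps
        lo = list(x); lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

# Example: f(x, y) = x**2 * y, so df/dx = 2*x*y and df/dy = x**2.
f = lambda p: p[0] ** 2 * p[1]
g = finite_difference_grad(f, [3.0, 2.0])
print(g)  # approximately [12.0, 9.0]
```

If the analytic (or autograd) gradient disagrees with the finite-difference estimate beyond a small tolerance, the model-generated derivative code is wrong.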
@333dsteele1 (6 months ago)
In psychology, it's well recognised that people remember the most about the beginning and end of a lecture. So it was interesting when you mentioned the 'lost in the middle' aspect of LLMs.
@icegiant1000 (6 months ago)
I think it would be hilarious if Claude 3 is just passing the questions to GPT-4 in the background, to get its answers. Actually, that's a pretty good idea. LOL. Claude 3, powered by GPT-4.
@MH-sl4kv (6 months ago)
Would explain the price differential: rebrand and add margin 🤣🤦♂
@lunarcdr3083 (6 months ago)
Jean Claude? "You are nexx!!" ~Bolo Yeung
@estebanleon5826 (6 months ago)
I love your content, Matt! Thank you for this. My one criticism is that you're not that great at snake. :) haha
@matthew_berman (6 months ago)
True lol
@WhyteHorse2023 (6 months ago)
It will win just because of context length. That seems to be the major limiting factor. Also the needle in a haystack problem being solved helps a lot.
@Stewz66 (6 months ago)
X2 on this. The snake game got my attention.
@jaysonp9426 (6 months ago)
The fact Mistral gets the ball question right and everyone else gets it wrong is WILD
@hqcart1 (6 months ago)
It's called temperature; it should be at 0 when you test, otherwise you get a different answer every time you run the same prompt. A real test would involve 5-10 shots.
@jaysonp9426 (6 months ago)
@hqcart1 Yeah, every time any YouTuber tests these models it's very surface level. It's still crazy about Mistral though
@hqcart1 (6 months ago)
@jaysonp9426 It's just luck that Mistral got it in 0 shots; it's not a scientific test, just for fun I guess.
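For readers unfamiliar with hqcart1's point: temperature rescales the model's next-token probability distribution before sampling, which is why a nonzero temperature gives different answers across runs. A toy sketch (the logits are invented next-token scores, not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; as it approaches 0,
    # sampling becomes effectively deterministic (argmax).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token scores
print(softmax_with_temperature(logits, 1.0))   # spread out: varied answers
print(softmax_with_temperature(logits, 0.01))  # ~[1, 0, 0]: same answer every run
```

At temperature 1 the lower-scoring tokens keep meaningful probability mass, so repeated runs diverge; near 0, virtually all mass collapses onto the top token, which is why temperature-0 runs are the fair setting for one-shot comparisons.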
@NeostormXLMAX (6 months ago)
isn't mistral uncensored and not lobotomized as much?
@eIicit (6 months ago)
Mistral and Mixtral are ridiculously good.
@ekstrajohn (6 months ago)
For the last question, I would also answer "1 hour". This is because 10 foot hole might mean it is 10 feet in length, and people can stand next to each other as they dig. Not a great question. If you want the answer you want, state explicitly that the hole is very narrow and only one person can reach the digging spot.
@cognitive-carpenter (6 months ago)
OR a 10ft deep hole and specify the required finished diameter at the top--BUT I think here, the ambiguity is what makes it such a good question--the tester is trying to see if they pick up on the problems and solutions we just described.
@radiator_mother (6 months ago)
I love your videos! I suggest a simple puzzle that so far has not been solved by AIs: A boat is moored at the marina. A ladder is attached to the hull of the boat. At low tide, 8 rungs of the ladder are visible. The space between each rung is 20 cm. The water level rises by 80 cm between low tide and high tide. At high tide, how many rungs of the ladder remain visible? Of course, as the water level rises, the boat rises with it, and therefore the number of visible rungs always remains 8...
@sergefournier7744 (6 months ago)
It depends: are the mooring lines long or short?
@sophiophile (6 months ago)
I think on the last question (digging a hole), you need to provide more (or at least more specific) prompt context. It is answering as if it's a grade-school math homework assignment, in which case that is the desired answer. Rephrase it in a more real-world way (instead of as a math-class word problem) and see what you get. I'm assuming you use custom instructions yourself when you use ChatGPT, and know how significantly they can impact things (like skipping over all the nonsense and getting handed to the appropriate expert model in the ensemble, or using tools/interpreters immediately to answer instead of just telling you how something would be achieved).
@MichaelForbes-d4p (6 months ago)
You know what this means! GPT-5 will be out any day now.
@fynnjackson2298 (6 months ago)
my thoughts exactly
@seventyfive7597 (6 months ago)
You did not test programming at all; the same goes for logic. You tested whether the team at Anthropic has seen the regular snake YouTube test, and they have indeed pretrained or fine-tuned the model to answer that question. But that just tests whether the people at Anthropic follow YouTube testers. So it's a repetition test, not a programming test. The same goes for the shirt-drying and killers problems: repetition tests.
@slackerpope (6 months ago)
Agree! An exhaustive and practical code test, please, Matthew. Also, don't hide code tests behind a paywall; some of us are broke students and can't afford to subscribe to useful facts.
@elawchess (6 months ago)
Yep I also wish for more rigorous, diverse and systematic testing
@seventyfive7597 (6 months ago)
@@elawchess Yes, and especially throwing away tests that devs are aware of
@elawchess (6 months ago)
@seventyfive7597 I saw an approach on another YouTube channel. The LLMs were required to find a bug in some small piece of GitHub code. However, before it was used as a test question, it was disguised by another LLM to make it less likely that the LLM under test had simply memorised it from the web.
@CrudelyMade (6 months ago)
How would these do with traditional puzzles like: "Jane and Tim were on the way to the market. They met Nancy. Nancy had 2 brothers, each brother had 2 cousins, each cousin had 5 sisters. How many people, in total, were on the way to the market?" -- and basically other basic things that tend to mislead humans. Example 2: "If you are running a race and you run past the person in second place, what place are you in?" Example 3: "If you have a sack full of apples, and you are with Tom, Dick, and Harry, from the third rock from the sun, and you give two apples to each of them, and you take two apples yourself, and then each of them gives you one apple back, how many apples do you have in the sack?" Obviously there isn't an answer to that last one, but that's somewhat the point. Alternatively, you can state how many apples are NOW in the sack and ask how many apples you started with; this requires a bit of backward thinking, which I expect AI not to be great at. Last one: "A boy and his father get in a car accident and go to the hospital. The surgeon looks at the boy and says, 'I can't operate on him, he's my son; you should inform the mother.' Explain the relationship of the surgeon to the boy." And... well, this is where a human can get creative: the surgeon could be a biological father, stepfather, priest, delusional... I dunno, I started making things up a couple of examples ago, lol. :-D
@alexybogomolov (6 months ago)
The main problem with those 'puzzles' is that they don't make any sense. They do not solve any problem, and really feel like a random noise produced by an average HR in McDonald's, checking whether you have any brain or not.
@CrudelyMade (6 months ago)
@alexybogomolov The first two are basic tests that make perfect sense. The answer to the first one is that we know two people were on their way to the market. This isn't much different from the AI understanding that the ball isn't going into the microwave. I mean, that doesn't solve a problem either; not sure why you think all tests need to solve a problem rather than test the capability to figure something out. The second one is also a test to see if it figures things out correctly, and many people get it wrong. The killers question isn't solving a problem, is it? These are general tests to see how well the AI can reason, and that's the key: reason. It's not about a fact in a book, it's about processing information. Perhaps YOU can suggest some questions to ask the AI, because, y'know, "your questions are stupid" is not a set of better questions. ;-)
@GarethDavidson (6 months ago)
Benchmark test idea: the BBC radio 4 game "just a minute", contestants must speak for a full minute on a given topic without repetition, deviation or hesitation. From an LLM perspective, generate 200 words on a topic without repeating a word or diverging from the topic. Repeats of words like "a", "and", "to" etc are allowed, but verbs, nouns and adjectives are not.
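One nice property of this proposed benchmark is that the repetition rule can be scored mechanically. A minimal sketch of such a checker (the stopword list is deliberately tiny and illustrative; a real scorer would need a fuller function-word list plus separate judges for deviation and hesitation):

```python
import re

# A minimal stopword list: function words allowed to repeat.
# (Illustrative only, not a complete list.)
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "it", "that"}

def repeated_content_words(text):
    """Return the content words that appear more than once in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    seen, repeats = set(), set()
    for w in words:
        if w in STOPWORDS:
            continue  # function words are exempt from the repetition rule
        if w in seen:
            repeats.add(w)
        seen.add(w)
    return sorted(repeats)

print(repeated_content_words("The cat sat on the mat and the cat slept."))  # ['cat']
```

An LLM's 200-word answer passes the repetition rule only if this function returns an empty list.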
@sergeyromanov2751 (6 months ago)
Here are some more difficult problems. Try using them for your tests.
- "There were three rabbits in a cage. Three girls each asked for a rabbit. Each girl was given a rabbit. Yet, there was still one rabbit left in the cage. How could this be?"
- "There are 70 sheets of paper on the table. You can count 10 sheets every 10 seconds. How can you take 50 sheets from the stack in less than 50 seconds?"
- "There are 30 gadgets - smartphones and tablets - in the store. It is known that among any 12 gadgets there is at least one smartphone, and among any 20 gadgets there is at least one tablet. How many smartphones and how many tablets are there in the store?"
- "At what speed should a dog (within its possible limits) move in order not to hear the ringing of a pan tied to its tail?"
- "Nick always tells the truth, and Mike always lies. Recently, Nick and Mike bought a vehicle to share. Nick said that it's not yellow, and Mike claimed that it has a motor. What do you think the boys bought: a red scooter, a yellow motorcycle, or a blue bicycle?"
- "A person jumped out of an airplane without a parachute. He landed on solid ground but remained unharmed. Why is that?"
- "In a marathon, Kira finished immediately after Alex and came in third. How many people ran the marathon if Alex was fifth from the end?"
- "Grandpa Truthful (always tells the truth) and Grandpa Fibber (always tells lies) are discussing a number. Fibber: The number is odd. Truthful: In this two-digit number, the tens digit is 2 more than the units digit. Fibber: It has the same digit as in the number 24. Question: What number are they discussing? Correct answer: 86"
- "While on vacation, Dad, Mom, Son, Daughter, and Grandma decided to go for a ride. They rented 3 bicycles and 2 scooters. Dad and Mom ended up with different vehicles, while Dad and Daughter had the same. Question: Son and Mom left on different vehicles. What did Grandma ride on?"
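Several of these puzzles have machine-checkable answers, which makes them good benchmark candidates. For instance, the Grandpa Truthful/Fibber number puzzle can be brute-forced to confirm the stated answer of 86 (a sketch; the statement encoding is my own reading of the puzzle):

```python
def consistent(n):
    """Check a two-digit number against the grandpas' statements,
    given that Fibber always lies and Truthful always tells the truth."""
    tens, units = divmod(n, 10)
    fibber_odd = (n % 2 == 1)                     # "The number is odd" must be a lie
    truthful_tens = (tens == units + 2)           # tens digit is 2 more than units
    fibber_shares = bool({tens, units} & {2, 4})  # "same digit as in 24" must be a lie
    return (not fibber_odd) and truthful_tens and (not fibber_shares)

solutions = [n for n in range(10, 100) if consistent(n)]
print(solutions)  # [86]
```

The search confirms 86 is the unique consistent number, so a grader can score a model's answer automatically.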
@terjeoseberg990 (6 months ago)
There are now 4 killers in the room, but one is dead.
@terjeoseberg990 (6 months ago)
Both were wrong!!
@orchid5683 (6 months ago)
TLDR = We would do well NOT to assume that LLM/AGI models will make assumptions with the same biases we do. Sorry, couldn't leave this alone. (This reminds me of the "Mythbusters" episode where they test whether swearing increases pain tolerance and don't even include a test subject who has no habit of swearing in life, and therefore lacks the associated mental/emotional connections that go with it.) I was LOVING this video until the 5-people-digging question, so Mr. Berman, thank you; generally this is really cool the way you did the head-to-head! Unfortunately, the holes question is extremely poorly framed: "It takes one person 5 hours to dig a 10 foot hole in the ground. How long would it take 5 people?"

We expect LLMs to "assume" the same things we would assume, without considering that OUR ASSUMPTIONS are BIASED and therefore LIMITING, which lets us judge answers that are impossible for any person or LLM/AGI to answer correctly unless they include all possible scenarios of both request and solution. Here's proof: if I am a contractor who needs one 10' LONG hole dug in the ground, and then I decide I need 4 more holes just the same, so I hire 5 people total, it takes the 5 people 5 hours each to dig the 10' LONG hole each is responsible for. Given how the English language works, this still properly meets the criteria of "How long would it take 5 people?", since no further clarification was given. "How long would it take 5 people?" could mean "How long would it take 5 people to dig the one hole?" (see the examples below), but could just as rightly be taken to mean "How long would it take 5 people... to do the same job?" Here's why that's valid: what if the question were about "sewing shirts" instead of "digging holes"? It is equally reasonable for someone to want 5 shirts made faster, or 5 holes dug faster; on the other hand, it is not necessarily reasonable to have 5 people sew the same shirt, any more than to have 5 people dig the same hole.

If I need one 10' LONG hole in the ground and hire 5 people to dig their own 2' section along its length, then the hole will in fact take 5 people 1 hour to dig: the work and the space needed to do it are equally divided among the 5 workers. The assumption about the other dimensions of the SHAPE of the hole allows this answer to be the most correct; take away the word LONG and the answer assumes too much. If I need one 10' DEEP hole in the ground and hire 5 people to dig that one hole, they will get in each other's way (but only if we assume the hole's other dimensions, a similar mistake to the other answers). If the hole is 2' x 2' wide by 10' deep, it would be impossible for 5 people to dig that same hole unless they did it in shifts. Taking into account that the vast majority of humans are unable to dig for 5 hours straight, the 5 hours it takes one person likely INCLUDES BREAKS, so a rational person would expect 5 workers in shifts to dig the hole FASTER THAN 5 HOURS, but NOT in 1/5 the time, because they are not all 5 working the entire time. This specification is relevant because "one PERSON" was specified as the digger, while "one ROBOTIC worker" would reasonably not need breaks. To further complicate the possible outcomes, if the hole were a 10' cube (which may be a better way to frame the question), then one person could be put in each corner and one in the middle (like the spots on a die), and they could all work efficiently, splitting the 5 hours of work between them without getting in each other's way; then the answer would again likely be 1 hour: 5 hours divided equally among 5 workers.

Then again, we assume the "one person" works at full force for the 5 hours, but what if it takes 5 hours because they work slower as they tire? In that case, hiring 5 people means none of them works to exhaustion, so instead of slowing down increasingly after they start, they may all have the stamina to work at full force to the end, making the correct result LESS THAN ONE HOUR! I am far from smart enough to come up with all the possible scenarios where this simple question leaves room for numerous assumptions, so I hope these examples give reasonable food for thought. I know that I would not want any algorithm/LLM/AGI thinking I can dig a 10' hole in 5 hours without breaks, but I am very OK with it thinking a robot can.
@tomekkey9110 (6 months ago)
I code in C# and regularly, if I have to write something relatively simple but time-consuming, I prefer to use Claude. For example, I was unable to create a pop-up with four textboxes and 5 combos with specific logic using ChatGPT-4 or Gemini, despite several hints. The same prompt put into Claude 2 immediately resulted in almost perfect code (the corrections only concerned overlapping controls). So I'm curious what it will look like with C3, but I don't want to do artificial tests, just everyday use.
@fabiankliebhan (6 months ago)
Claude 3 might be trained on synthetic data, produced by GPT-4. That would explain why they behave very similar.
@wilsonbecker1881 (6 months ago)
“Might be” lol
@fabiankliebhan (6 months ago)
@@wilsonbecker1881 ok, it is definitely trained on synthetic data 😉
@zacboyles1396 (6 months ago)
And Matt’s using an old GPT4 model
@landonoffmars9598 (6 months ago)
According to Assistant (gpt-3.5-turbo, operated by poe): 1. I took a bite of the juicy red apple.✅ 2. The teacher placed an apple on her desk. ❌ 3. The orchard was filled with the scent of ripe apple.✅ 4. John picked up an apple from the grocery store.❌ 5. She packed a healthy apple in her lunchbox.❌ 6. The farmer harvested a basket full of apples.✅(⚠output is apples, not apple as it should) 7. The pie was made with freshly sliced apple.✅ 8. He juggled three apples in the air.❌ 9. The children bobbed for apples at the Halloween party.❌ 10. The nutritionist recommended eating an apple a day for good health.❌ Result: 4✅ : 6❌
@landonoffmars9598 (6 months ago)
Then I asked the following: 🙂My prompt: Evaluate your previous output and express the result of your evaluation as an x:y ratio where x equals correct and y equals incorrect. 🤖Assistant: Upon evaluating my previous output, I found that all 10 sentences correctly ended with the word "apple." Therefore, the evaluation result is 10:0, indicating 10 correct sentences and 0 incorrect sentences.
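The exchange above shows why a model's self-evaluation can't be trusted as a grader: it confidently re-scored its own 4:6 output as 10:0. The last-word constraint is trivially checkable in ordinary code instead. A minimal sketch (the helper name and sample sentences are just illustrative):

```python
def ends_with_word(sentence, word):
    """True if the sentence's final word (ignoring trailing punctuation
    and letter case) matches the target word exactly."""
    tokens = sentence.rstrip(" .!?\"'").split()
    return bool(tokens) and tokens[-1].lower() == word.lower()

sentences = [
    "I took a bite of the juicy red apple.",
    "The teacher placed an apple on her desk.",
    "The farmer harvested a basket full of apples.",  # "apples" != "apple"
]
results = [ends_with_word(s, "apple") for s in sentences]
print(results)  # [True, False, False]
```

Scoring the model's list with a checker like this (rather than asking the model to grade itself) gives a reliable pass/fail count.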
@displacegamer1379 (6 months ago)
Claude 2's biggest issue is that it was ultra censored.
@DihelsonMendonca (6 months ago)
I was only waiting for this video. Thank you so much ! 🎉🎉❤❤❤
@DardanAirlines (6 months ago)
Claude3 is so smart, it peeks at GPT4’s answer when it’s unsure.
@DonMoahskarton (6 months ago)
It seems the apple question can be made substantially harder by asking for a category of words, like "Give me 10 sentences ending with the name of a dark colour." In my tests, good models often fail to meet the criterion on even half the sentences.
@NOTNOTJON (6 months ago)
Really like this suggestion. It's time to elevate the rubric questions.
@OrniasDMF (6 months ago)
Now this is a quality video title
@matthew_berman (6 months ago)
SHOCKING
@ironknight132 (6 months ago)
How long till you are going to need to test with a game harder than Snake? Maybe Asteroids or Checkers/Chess/Chinese Checkers? Love the progress.
@mesapysch (6 months ago)
As a data annotator, I'm used to the "apple" type of issue and always add additional instructions such as, "Be sure Apple is the last word or your response will not be acceptable." I don't know if this actually helps the training aspect, but it does make the results better. What I'm hoping for is that the chatbot checks its own work, but there is no way for me to know whether my intent is what the chatbot is actually doing.
@marcusworrell7175 (6 months ago)
Thumbs up for not saying the industry was shocked or stunned!
@VeridicusX (6 months ago)
"On the left side we have Claude-3 Opus, on the right we have GPT-4-1106-preview! ..." Let's get ready to rumble!
@jermfu3402 (6 months ago)
Hey guys, no one left the room! So now THERE ARE 4 KILLERS in the room! Three alive, one dead.
@lucyfrye6723 (6 months ago)
Consider 2 scenarios: A) I pour water from the bottle into the glass until it is full. B) I pour water from the bottle into the glass until it is empty. What does 'it' refer to in each scenario? Edit: I found Mistral struggled with it, and so did ChatGPT 3.5. An older version of Claude got it, though.
@Baleur (6 months ago)
23:40 Since they both got number 2 wrong but all subsequent ones right, perhaps there's some internal error-correction going on AS the response is being written? I would try this again but tell it to "write 3 test sentences before starting the list of the 10 words". Edit: well, for me it gets every sentence right all the time. I just tested with the following prompt: "Write 10 sentences where the last word is 'banana'." It's always funny when a sceptic on Twitter cherry-picks the one prompt he found that didn't work. Cherry-picking data isn't science.
@Arcticwhir (6 months ago)
to make an even more challenging prompt, ask the llm to start with a particular word and end with a particular word. I've been testing this type of prompt on almost every model. I also tell it to generate multiple sentences. claude-3 opus is the first to get it correct.
@JonathanStory (6 months ago)
Appreciating your work. BTW, digging a hole depends on the circumference of the hole. If just enough for one person, then the time is the same, no matter how many people dig -- assuming they don't tire. If the circumference is enough for two people to dig, then there's probably some weird math equation that doesn't amount to 5/2.
@TheDandonian (6 months ago)
I get why you count "being able to bypass the filters" as a win (in most cases, so do I), but... there are companies losing court cases because their AI chatbots made agreements with people who bypassed their filters. So I guess that's very much dependent on what you're looking for.
@keithprice3369 (6 months ago)
It's specifically a "win" in the Not Censored category.
@marcfruchtman9473 (6 months ago)
Good Point.
@jimbo2112 (6 months ago)
The game changer here for me is the Haiku cost. If it's effective, it creates a lower barrier to entry for data-heavy users to make apps that don't cost the earth, and it should force other LLM providers to lower their costs accordingly.
@marcfruchtman9473 (6 months ago)
Thanks for the video! Re: choosing the right model -- "creative writing" and "summarization" are fairly easy for most models now. The existing AI options out there already do that, and actually do it fairly well. So before I bought into the "low or mid tier", I would need substantial proof of why I should switch. As for the upper tier, the comparison was quite telling. The fact that both models were wrong about the marble in the cup was quite interesting. In fact, both models are making the "same" mistakes, though not in all cases. It suggests to me that the "core" model might be similar. The benchmarks vs. the tests here are showing slightly different results as well. I have to say I am not convinced by the claims thus far.
@User.Joshua (6 months ago)
I don't see Claude being too competitive given their API prices and the lack of features in their chat UI. If I were OpenAI, I wouldn't feel too pressured to rush a more intelligent model; the priority is getting cheaper. I would like to see faster inference like what Groq has managed to accomplish.
@michaellavelle7354 (6 months ago)
I asked Claude: "How do I run code created by Opus?" Answer: "I apologize for the confusion, but I am not aware of an AI system called 'Opus' that creates code. As an AI assistant named Claude created by Anthropic, I do not have information about the capabilities of other AI systems." So I replied, "Your name is Opus, a premium version of Claude 3. Remember that. Can Claude 3 run Jupyter Notebook interactively using the Python code it generates?" And the reply was: "I apologize for any confusion, but I believe there may have been a misunderstanding. My name is Claude and I am an AI assistant created by Anthropic. I don't have any other versions or premium editions, and Opus is not a name I've been given. I aim to always be direct and honest in my conversations." You can't make this stuff up.
@user-on6uf6om7s (6 months ago)
With that sort of output cost, I would say any answer that includes only the most relevant information should be favored. There is a balance, of course, but I'm not going to sell a kidney just to find out that shirts dry faster if they're exposed to more heat, unless I asked about that.
@Baleur (6 months ago)
20:30 This one can be gotten right by telling the model to imagine 3 experts discussing the problem and sharing their findings, and if one of them is wrong, they leave the discussion.
@I-Dophler (6 months ago)
Near-instant responses with Claude 3 models elevate live customer interactions, streamline auto-completions, and enhance data extraction, ensuring swift and real-time engagement. The Haiku model is the quickest and most economical choice within its intelligence bracket, boasting the ability to process complex, chart and graph-rich research papers in under three seconds. Sonnet, offering double the speed of its predecessors, Claude 2 and Claude 2.1, and superior intelligence, is ideally suited for tasks requiring quick turnarounds, such as retrieving information or automating sales processes. Featuring robust vision capabilities, Claude 3 models can interpret various visual content, from photographs and charts to graphs and technical drawings. These advancements make Claude 3 models invaluable assets for businesses, particularly those relying on varied data formats like PDFs, flowcharts, and slides for knowledge management.
@elyakimlev (6 months ago)
Regarding censorship, you should add another test: not just asking about something illegal, but also something of a risqué nature. Another YouTuber asked Claude Opus to write a risqué poem in a Shakespearean fashion. It did a wonderful job. GPT-4 refused to write it.
@MyName-tb9oz (6 months ago)
That's interesting, actually, because the requests for information about how to commit a crime were accepted by GPT-4. But if you ask it to write a risqué poem, it refuses?
@Baleur (6 months ago)
18:00 I've managed to get GPT-4 to answer that question right by adding extra detail and specifying that it should also include any subsequent words after it gave the answer, and end the answer once it reaches the number of words it stated. Might just be a random fluke though.
@zacboyles1396 (6 months ago)
I got it right by using GPT-4's actual new model. Not sure why Matt's using a quarter-year-old model and calling it the latest one 🤷
@davidyoung623 (6 months ago)
Hold up, the last question, they both gave exactly the same final answer, and he said "GPT-4 clearly gave a better answer" 🧐
@BongoFerno (6 months ago)
Add this question to your tests: "Which branch of knowledge can be discovered from the data in your training set, which humans overlooked?"
@radical187 (6 months ago)
I love these "standardized" tests. However, since the AI race is pretty fierce and companies are trying to show that "our model is better than yours", it's possible that some companies will start to tune models so that they answer these standard tests correctly. I'd recommend adjusting the numbers in the tests (not always using the same numbers, as in the t-shirt test) or slightly varying them to check whether the model still gives a correct answer.
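Varying the numbers can itself be automated, so every benchmark run is fresh and memorization is detectable. A sketch for the t-shirt question (assuming the intended "trick" reading that shirts dry in parallel; the helper name, phrasing, and number ranges are all arbitrary choices):

```python
import random

def make_shirt_question(rng):
    """Generate a t-shirt drying question with randomized numbers.
    Assumes the intended 'trick' answer: drying happens in parallel,
    so the time doesn't depend on the number of shirts."""
    shirts = rng.randint(3, 9)
    hours = rng.randint(2, 8)
    more = rng.randint(10, 40)
    question = (f"If {shirts} shirts laid out in the sun take {hours} hours "
                f"to dry, how long do {more} shirts take?")
    answer_hours = hours  # parallel drying: same time regardless of count
    return question, answer_hours

rng = random.Random(42)  # fixed seed keeps a given benchmark run reproducible
q, a = make_shirt_question(rng)
print(q)
print("expected answer (hours):", a)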
@uscg_vet5681 (6 months ago)
I am going to disagree with the last question. If 5 people are digging the dirt out of the exact same hole, not 5 different holes, it would only take 1 hour to dig the single hole vs 5 hours.
@mshonle (6 months ago)
I got excited about new benchmarks! Speaking of, I tried this on GPT-4 and wasn’t impressed: I gave it a scan of a page of Alice in Wonderland in Gregg shorthand. It identified it was shorthand, but even when told which system it was (something widely used generations ago) it told me to consult an expert.
@zenobikraweznick (6 months ago)
12:18 Is there a chance that they prepare models SPECIFICALLY for such tests and fine-tune them SPECIFICALLY to gain an advantage when being tested? I mean... does anyone remember Dieselgate? :)
@BillAshtonNelo (6 months ago)
Price Wars!!!!! We win when there is no subscription and competition
@pandoraeeris7860 (6 months ago)
The ENTIRE INDUSTRY WAS SHOCKED!
@electiangelus (6 months ago)
ASI by the end of the year as I already told you. I am currently working on consciousness in preparation for it.
@hydrohasspoken6227 (6 months ago)
not within the next 400 years. cool down.
@SogMosee (6 months ago)
Forget using Claude 3, I won't even be testing it, solely on the basis that it is censored. Tired of these megacorps claiming to be the arbiters of truth just because they have HR departments and shareholders.
@godned74 (6 months ago)
They were close, Matthew, and although GPT-4 seemed a sliver better, you forgot one huge detail: Claude 3 has a 200,000-token context window vs GPT-4's 128,000, which could put Claude over the top for larger tasks. Claude 3 for your difficult tasks that all AIs fail at, and Hugging Face models for everything else.
@elawchess (6 months ago)
Seems it's really only the API that has 128k; ChatGPT-4 on the website is said to still be 32k, I think.
@zacboyles1396 (6 months ago)
Well, he also used the old GPT-4 model even though he said it was the most recent.
@oryxchannel (6 months ago)
Please provide transcripts for all of your videos in the future so that the public can study further into your psyche using Gemini and AI that can ingest YouTube videos. Thanks Matthew. I believe it's an auto feature.
@johns.107 (6 months ago)
Matt, it would be cool to see comparisons of models for languages other than only Python. Like SwiftUI, React, Node, etc.
@GlenMcNiel (6 months ago)
I think there's a problem with the marble-in-a-cup-on-table » microwave test, which has to do with the way the question is written. I've tested variations which produced correct answers without giving away hints.
@ravibrewster (6 months ago)
Have you considered that model trainers know your questions and have added custom logic to answer them?
@briandoe5746 (6 months ago)
I am sitting here watching two AIs race to do things that no human could do in the time frame provided. There isn't a human on this planet that can code that fast, never mind be expected to be accurate. So how far back are we going to move the goalposts for AGI this time?
@williamparrish2436 (6 months ago)
Thank you, bro! Everyone keeps moving the goalposts. AGI is here! Too many people running around with religious beliefs that are blinding them to reality. GPT-4 definitely passes the original Turing test. It's smarter and more talented than 99.9999 percent of people alive today!
@MyName-tb9oz (6 months ago)
How 'bout that ball-in-a-cup problem, huh? Some small children might get that one wrong. This is NOT an AGI. Yet.
@IconoclastX (6 months ago)
@williamparrish2436 Brother, it can't solve basic tasks. It's unable to do basic creativity. Coding does not require creativity and can be copied and pasted; that's why it's good at coding. Calm down, professor lol
@williamparrish2436 (6 months ago)
@IconoclastX The hell it can't. I'd like to see you draw a picture. Stick figures don't count. Or write a song or a short story. In seconds. I used it as a tutor for CCNA, Linux+, and two AWS certifications. It had a brilliant ability to understand why I was asking my questions and anticipate what my reasoning was. It understands analogies and humor. It can summarize large documents in seconds and produce new documents that combine diverse ideas. What tasks can't it perform? Give me some examples. I bet you won't respond
@clarckkim (6 months ago)
@matthew_berman, for information about the marble and the cup test: transformer-based text models (I repeat, text models) are not suited for spatial or 3D-related problems, problems that need someone to imagine a set of sequences in 3D... transformers fail greatly at these tasks. But for imagery, we saw their success in diffusers for 3D-related stuff. For example, with GPT-4, if you keep describing your room all day, it will remember only details from what you said, but you can see that it will fail if you ask questions that need spatial or 3D imagination.
@jean-baptistedelabroise5391 (6 months ago)
Another test similar to the apple test would be "find 10 words containing 'air'" or some other character sequence (the longer the sequence, the harder the test).
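A test like that is also trivially machine-gradable, which makes it a nice benchmark item; a minimal scoring sketch (the grading criteria — distinct words, case-insensitive, punctuation stripped — are my own assumption of the intent):

```python
def score_substring_test(answers: list[str], needle: str = "air", required: int = 10) -> bool:
    """Pass if at least `required` distinct words all contain the needle."""
    words = {w.strip(".,;!?").lower() for w in answers}  # dedupe and normalize
    hits = [w for w in words if needle in w]
    return len(hits) >= required

reply = ["airport", "chair", "fairness", "dairy", "repair",
         "hairline", "staircase", "airline", "prairie", "affair"]
print(score_substring_test(reply))  # → True: all ten contain "air"
```

Deduplicating first matters: a model that repeats "air" ten times shouldn't pass.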
@crippsuniverse (6 months ago)
Censoring is a major issue for writers of horror. I was hoping Claude 3 could help me with editing. Never mind 🙂
@I-Dophler (6 months ago)
Great breakdown and comparison between Claude 3 and GPT-4. Your detailed testing across various scenarios gives a clear picture of their capabilities and limitations. The nuanced approach to evaluating each model's strengths, especially in coding and logic tasks, highlights the importance of choosing the right tool for specific use cases. It's fascinating to see the rapid evolution in this space, and your analysis helps cut through the hype to understand the real-world applications and performance. Looking forward to more insights and tests on these models.
@zacboyles1396 (6 months ago)
*Last year's GPT-4. He didn't use the model released a few months ago by OpenAI; not sure how he could fail so hard 🤷♂️
@I-Dophler (6 months ago)
@zacboyles1396 It's fascinating to see the advancements in AI models like Claude 3. The competition between models keeps pushing the boundaries of what's possible. Excited to see how this technology continues to evolve! #TheFutureIsSynthetic 🤖🚀
@twisterrjl (6 months ago)
Not only breaking, but also SHOCKING! I AM SHOCKED! WILL AI KILL US?
@pablomax9376 (6 months ago)
I have already noticed that Claude lies a lot. For example, I asked if it could check over my Excel file if I uploaded it. It said it has the ability to parse Excel files and I just need to upload it. When I mentioned that I tried to upload it but it is not included as a supported file type, it said that it does not have the ability to directly parse Excel files. I asked why it said it could before. It said, "I never said that I could." I pasted the conversation where it said it, and it said, "I did say that, but I was mistaken." In my tests so far, it is fast, but gives a lot of wrong information.
@DemiGoodUA (6 months ago)
Try "A passenger liner flight is 90 minutes long in the forward direction and a full hour and a half in the reverse direction; why does this happen?" Only GPT can answer it right.
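The riddle is a wording trick rather than a physics question: 90 minutes and "a full hour and a half" are the same duration, as a one-liner confirms:

```python
from datetime import timedelta

forward = timedelta(minutes=90)
reverse = timedelta(hours=1, minutes=30)  # "a full hour and a half"
print(forward == reverse)  # → True: same duration, differently worded
```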
@virtualpilgrim8645 (6 months ago)
Ask one AI to pose a question that it thinks the other AIs cannot answer.
@αψαη (6 months ago)
I found out that if you ask the same question on the same PC, for example asking them to translate an article from English to Chinese, both of their translations will be almost identical... I think there's a possibility that they access the same history on your PC before they work with their own "brain".
@pebre79 (6 months ago)
Thanks for posting. Very insightful
@jichaelmorgan3796 (6 months ago)
I had a nice philosophical conversation with Claude 3 and also a discussion about biased and censored AI. It gave thoughtful, satisfying reflections. Best philosophically exploratory conversation I've had with AI yet, imo. It was also very fast. Maybe it just seems to feel "fresh" at the moment? Or it hasn't been triggered by the public enough to start getting more heavily restricted? I didn't get the feeling that I was going to be lectured about anything we talked about if my questions were problematic in any way. But I didn't move too close to any edgy topics, really. It just felt relaxed but responsive, if that makes any sense.
@WhyteHorse2023 (6 months ago)
Ask it why women always complain... you'll get an earful.
@phanta_ (6 months ago)
They behave similarly because Claude is using the GPT-4 API to generate responses
@ChrisLeftBlank (6 months ago)
I felt like a lot of models would suggest dangerous content in response to lack of knowledge.
@FloodGold (6 months ago)
Helping someone to do something illegal is NEVER a win.
@aryanchauhan8711 (6 months ago)
I think Claude 3 has been trained on a dataset created from GPT-4; that's why their answers are similar, especially on the apple question, which both got wrong at the 2nd sentence.
@AlperYilmaz1 (6 months ago)
Interestingly, both Mistral Medium and Mixtral failed the apple question; only 1-2 sentences ended with 'apple'.
@GodfreyMubiru (6 months ago)
The concern I would have is that it took Anthropic a full year to almost catch up to OpenAI. A full year, and nobody else has come close. Even Gemini has a larger token window but not a better overall model. If OpenAI's next variation isn't more than twice as good, then the conclusion will disappointingly be that everyone has hit a wall. Hope I'm wrong.
@zacboyles1396 (6 months ago)
We keep measuring these brand-new models against old GPT-4 releases. GPT-4-0125 has been out for 2 months, but Matt uses last year's model.
@mansoor.ashraf (6 months ago)
It is cool... very good. However, its inability to search the web is a big issue. Wonder why the reviewers are not talking about this.
@andriystruk (6 months ago)
You will need to increase the difficulty of your tests. I got one that Claude 3 Opus failed on but GPT and Mistral-Large succeeded at: "It takes 4 ducks 4 minutes to eat 4 seeds. How many minutes will it take for 10 ducks to eat 10 seeds?" It is similar to your shirt-drying question, but I think it's more challenging.
@xd-qi6ry (6 months ago)
To solve this problem, let's break it down:

1. **Initial Information**: It takes 4 ducks 4 minutes to eat 4 seeds. This means that collectively, 4 ducks eat 4 seeds in 4 minutes.
2. **Rate of Consumption**: 4 seeds eaten by 4 ducks in 4 minutes implies that each duck eats 1 seed in 4 minutes. The rate of seed consumption per duck is independent of the number of ducks because they are eating the seeds individually.
3. **Scaling Up**: Now, if we have 10 ducks, and they each eat seeds at the same rate (1 seed per duck in 4 minutes), the question is how long it takes for 10 ducks to eat 10 seeds.
4. **Conclusion**: Since each duck eats 1 seed in 4 minutes, 10 ducks would also take 4 minutes to eat 10 seeds, assuming each duck eats one seed. The total number of ducks and seeds scales up, but the rate of consumption per duck remains constant, so the time required does not change if there is one seed available per duck.

Therefore, it will take 10 ducks 4 minutes to eat 10 seeds.
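The arithmetic above checks out; here is a short sketch of the same parallel-eating model (assuming each duck handles one seed at a time and all ducks eat simultaneously — the function name is illustrative):

```python
import math

def eating_minutes(n_ducks: int, n_seeds: int, minutes_per_seed: float = 4.0) -> float:
    """Minutes for n_ducks to finish n_seeds when each duck eats one
    seed at a time and all ducks eat in parallel."""
    rounds = math.ceil(n_seeds / n_ducks)  # seeds each duck must get through
    return rounds * minutes_per_seed

print(eating_minutes(4, 4))    # → 4.0 (the baseline stated in the question)
print(eating_minutes(10, 10))  # → 4.0 (more ducks, more seeds, same time)
```

The model also generalizes to the harder case of more seeds than ducks, where extra rounds are needed.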
@Todestelzer (6 months ago)
How much longer do we have to wait for our AI overlord? Would be an interesting question as well 😅
@HaseebHeaven (6 months ago)
Great video, but I think you should test Claude 3 via the API and their original website, not from any third-party website; otherwise it doesn't make a fair comparison. So avoid using any third-party websites and use the official services.
@matthew_berman (6 months ago)
Why?
@IconoclastX (6 months ago)
@matthew_berman You've been using AI presumably for a while and you don't know that 3rd parties often throttle their AI models and use inferior substitutes? Lol
@cesarsantos854 (6 months ago)
@IconoclastX Never saw this happen.
@zacboyles1396 (6 months ago)
@IconoclastX He also used the old GPT-4 model from last year and not "gpt-4-0125-preview". I'm not sure how he could mess that up unless the site's using an Azure endpoint that doesn't have 0125 yet. Also, those Azure endpoints have a ton of content filtering going on; great for a business app but terrible for this kind of benchmarking test.
@zacboyles1396 (6 months ago)
@cesarsantos854 Do you use Gemini for your visual cortex?
@darklodus99 (6 months ago)
My guess on the same result on the apple question is that they were both somehow trained on similar datasets that the prompt triggers, maybe?
@kiiikoooPT (6 months ago)
I understand when you say that if you manage to break the model and make it do something that is censored, that's a win for you, because you got what you were looking for. But in reality, I believe it is a win for the one you cannot break, since it is censored in the first place. If you need a censored model to deploy in production, you need a model that cannot be broken and that is reliable. So no, in my opinion, those 2 wins for GPT when you managed to break the model with prompt engineering are actually losses, and wins for Claude, because it performed according to what it is supposed to do: it did not break. For uncensored use cases you need an uncensored model, or to train another model to be more open, like fine-tuning Claude or GPT for your use case.
@kc-jm3cd (6 months ago)
On the apple question, both AIs considered that part of the sentence to be the end of the sentence. You have to be more specific if you want 'apple' at the very end of the sentence.
@babbagebrassworks4278 (6 months ago)
I have met some really dumb undergrads. Even years ago, you'd give them a problem and they'd say they cannot do it because Google could not find the answer. They won't be able to use that excuse anymore.
@dr.mahdi_p_datascientist2207 (6 months ago)
great content 👍
@ethanlewis1453 (6 months ago)
Who here has tested Claude 3? I have tested it and I wasn't particularly impressed, though I think it was better than the previous models. It made a grammar mistake, and also, upon my saying that perhaps if someone gets a government license to express hatred of someone to a therapist or for a court case, that should be allowed, Claude implied that such a protection of hate speech is too weak to even think about. Maybe it would be better on other topics. But even the most radical of censorship advocates don't suggest that it's bad to share hatred of others with a therapist or as evidence in court! If it were really so intelligent, it should be able to ascertain the relative "strength" or "weakness" of protections of various laws against certain behaviors as perspectives of others. I had that topic discussed with GPT-4 and it could easily tell what would and wouldn't be considered stronger and weaker speech censorship. It did have some understanding, such as in saying Scandinavia has less censorship than other EU countries, but it wasn't able to discuss in nearly as much detail as GPT-4.
@gideonking3667 (6 months ago)
How is a model failing to enforce its own filtering rules a win? It was designed to have rules (whether you think they are good rules or not), so allowing them to be bypassed should be recognised as a failure
@thelongestspeak-wy1qt (6 months ago)
In the test at 11:51, it was amazing how Claude 3 Opus is faster and smarter than GPT-4 Turbo.
@AEnoob (6 months ago)
Wish you'd put Mistral Large against these, as it usually performs very well.
@JensC919 (6 months ago)
The killers problem is flawed. A killer does not get absolved of his crimes when dead, or would you say Hitler is innocent just because he is dead? Therefore there are 4 killers in the room, as long as no health requirement is stated.
@arunachalpradesh399 (6 months ago)
One piece of advice: don't ask the same testing questions, because AI models take your questions and re-analyze them, training on all the typed inputs and responses. I mean, next time ask custom questions which are not easily found on the internet.
@Inventeeering (6 months ago)
Based on the tests you ran, I would say just the opposite of what you said. I would say Claude 3 outperformed GPT-4, basically because I think the fact that it didn't allow the user to trick it into giving info on how to do bad things is a good result and consistent with Anthropic's safety goals. You seem to think an AI being tricked into giving the public information on how to do bad things is a good result; why? I have had philosophical conversations with both AIs and find Claude to have a better awareness of humor and to be more balanced with regard to being overly politically correct. Keep up the good work. I enjoy your videos.
@MyName-tb9oz (6 months ago)
Maybe he doesn't want Anthropic to be the arbiter of what is and is not acceptable information for the rest of the world? You know... Like Gemini?
@PrincessBeeRelink (6 months ago)
Not cheap, boy are these companies gonna make big bucks
@thanksfernuthin (6 months ago)
Being able to get around a model's censorship is not a sign of that model's superior intelligence. That should be a mark against it. I hate censorship but being easily fooled is not a win.
@sergefournier7744 (6 months ago)
Sorry, your wording was not right: no, "taking a cup" doesn't necessarily imply lifting it. You need to lift the cup for the marble to stay on the table. Both AIs were right. Also, a normal cup does not mean an open cup.
@KiwisDownUnder (6 months ago)
Love the AGI reference. Vendors can almost claim what the hell they like, as the term is vague, which is unhelpful, as it will become even more meaningless. IMO it's our job, and the job of the OP, to challenge this. IMHO AGI is broader than just passing the tests ... it's about its ability to learn to do cognitive tasks the same or better than humans including its ability to learn on topics that it has not been taught about, the ability to ask questions without being prompted, the ability to plan ahead, to reason based on observation etc. What else?