So is Claude 3 better than GPT-4? What do you think? Join my newsletter for the latest and greatest AI news, tools, papers, models, and more: www.matthewberman.com
@mrpro7737 (6 months ago)
I am interested in evaluating Claude 3's coding proficiency. I would appreciate a comprehensive benchmark focusing solely on coding capabilities
@MrJaggy123 (6 months ago)
Is it better? Not for Canadians: they didn't let us use Claude 1, or 2, and I tried 3 and they still say "no". 😛
@MyName-tb9oz (6 months ago)
You missed a second fail for Claude on the "...ending with apple" test: one of the sentences _does_ end in "apple," but it should have ended with "apples" because of the way the sentence was constructed. I'd almost say it failed twice on that one sentence.
@protein4960 (6 months ago)
I think both models assumed the cup had a lid on top, which is why they got it "wrong," but their reasoning process was sound.
@eirikgg (6 months ago)
I did the t-shirt test, but in my language, Norwegian, and it failed. The same prompt to GPT-4 succeeded. So either the test is in the training data, or more general language data = better reasoning.
@OculusGame (6 months ago)
Not sure if you've seen it on Twitter, but someone at Anthropic mentioned one pretty crazy instance during internal testing, specifically the 'needle-in-the-haystack' test. Here's what Opus said: "The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association. However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping 'fact' may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings." Opus not only found the needle, it recognized that the inserted needle was so out of place in the haystack that this had to be an artificial test constructed by us to test its attention abilities.
@nufh (6 months ago)
That is freaking cool.
@DeSinc (6 months ago)
This is the thing I dislike about needle-in-the-haystack testing. If you put a random sentence in there somewhere, it is statistically at odds with the words around it and produces an "unlikely patch" of text that wouldn't occur naturally, which means it should be quite simple to pick out. Even if you just change one word, it's still an unlikely word, so it will stick out based purely on which words usually follow each other in the texts the model has "read" before. To really do this test correctly, you'd have to give it a real text and ask it a real question about the contents somewhere in the middle, not insert something yourself that makes it stand out heaps.
@jd_real1 (6 months ago)
Amazing
@EvertvanBrussel (6 months ago)
@DeSinc I hadn't thought of that before, but it actually makes sense. Though I'd still argue for testing both ways. It's interesting to know both whether it can point out the anomalous needle in the haystack AND whether it can answer a question about a very minor fact (one that isn't out of place) mentioned in passing somewhere in the middle of the text.
@kenhtinhthuc (6 months ago)
Once the lost-in-the-middle issue is solved, finding the odd one out is straightforward thanks to semantic search, which is based on meanings rather than keywords. More specifically, meanings are turned into vectors via embeddings, which allow (cosine) similarity comparison.
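A minimal sketch of the similarity comparison kenhtinhthuc describes. The toy 3-dimensional vectors stand in for real embeddings (which typically have hundreds of dimensions), and the values are invented purely for illustration:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the out-of-place needle sentence points
# in a different direction than the on-topic sentences.
on_topic_1 = [0.9, 0.1, 0.0]
on_topic_2 = [0.8, 0.2, 0.1]
needle     = [0.0, 0.1, 0.9]

print(cosine_similarity(on_topic_1, on_topic_2))  # high: similar meaning
print(cosine_similarity(on_topic_1, needle))      # low: semantic outlier
```

The semantic outlier scores far lower against the on-topic vectors, which is exactly what makes a planted needle easy to surface.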
@ChristianIce (6 months ago)
Strong disagree on the definition of AGI, for multiple reasons. 1) The term AGI was introduced to mark the difference from AI: where AI mimics human intelligence, AGI is actually self-aware and has a real understanding of its inputs. So we are moving the goalposts, and we now need a new term for that kind of artificial intelligence. 2) AI was already better than most humans at most tasks; Quake bots were better than any player 20 years ago. 3) When we built a car that was finally faster than a horse, we kept on calling it a car. 4) The discussion itself is not even about semantics, but serves as a technicality in the deal between OpenAI and Microsoft. 5) It's all hype and funding. AI is getting better, but it is the same thing, it works in the same way, it mimics, simple as that.
@Hunter_Bidens_Crackpipe_ (6 months ago)
The claims every team makes are always way above what they can deliver
@CosmosWorld (6 months ago)
That's called marketing
@Gafferman (6 months ago)
That's why we test
@codycast (6 months ago)
It’s so hard for me to trust that AI is giving me actual information after the blatant racism that came out of Google
@filipo4114 (6 months ago)
I just tested Claude for 3 hours coding a physics simulation in Python using PyTorch, and it's at least as good as GPT-4. The physics it came up with was a little bit wrong, and the method for computing partial derivatives in PyTorch was a little bit wrong, but overall it wrote code that worked. After spending the whole $5 they give you at the start, I can't wait for them to release the weights! It's the first model that is actually on par with GPT-4 for me.
@scottcastle9119 (6 months ago)
Yea no kidding, openai is way ahead
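On the slightly-wrong partial derivatives filipo4114 mentions: a standard way to catch a buggy gradient implementation is a finite-difference check. A minimal pure-Python sketch (the example function is made up; in practice you would compare these numbers against PyTorch's autograd output, e.g. `torch.autograd.grad`):

```python
def finite_difference_grad(f, x, eps=1e-6):
    """Approximate the partial derivatives of f at point x using
    central differences: (f(x+eps) - f(x-eps)) / (2*eps) per coordinate."""
    grad = []
    for i in range(len(x)):
        hi = list(x); hi[i] += eps
        lo = list(x); lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

# Example: f(x, y) = x**2 * y, so df/dx = 2*x*y and df/dy = x**2.
f = lambda p: p[0] ** 2 * p[1]
g = finite_difference_grad(f, [3.0, 2.0])
print(g)  # approximately [12.0, 9.0]
```

If the analytic (or autograd) gradient disagrees with the finite-difference estimate beyond a small tolerance, the model-generated derivative code is wrong.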
@333dsteele1 (6 months ago)
In psychology, it's well recognised that people remember the most about the beginning and end of a lecture. So it was interesting when you mentioned the 'lost in the middle' aspect of LLMs.
@icegiant1000 (6 months ago)
I think it would be hilarious if Claude 3 is just passing the questions to GPT-4 in the background, to get its answers. Actually, that's a pretty good idea. LOL. Claude 3, powered by GPT-4.
@MH-sl4kv (6 months ago)
Would explain the price differential: rebrand and add margin 🤣🤦♂
@lunarcdr3083 (6 months ago)
Jean Claude? "You are nexx!!" ~Bolo Yeung
@estebanleon5826 (6 months ago)
I love your content, Matt! Thank you for this. My one criticism is that you're not that great at snake. :) haha
@matthew_berman (6 months ago)
True lol
@WhyteHorse2023 (6 months ago)
It will win just because of context length. That seems to be the major limiting factor. Also the needle in a haystack problem being solved helps a lot.
@Stewz66 (6 months ago)
X2 on this. The snake game got my attention.
@jaysonp9426 (6 months ago)
The fact Mistral gets the ball question right and everyone else gets it wrong is WILD
@hqcart1 (6 months ago)
It's called temperature; it should be at 0 when you test, otherwise you get a different answer every time you run the same prompt. A real test would involve 5-10 shots.
@jaysonp9426 (6 months ago)
@hqcart1 Yeah, every time any YouTuber tests these models it's very surface level. It's still crazy about Mistral though
@hqcart1 (6 months ago)
@jaysonp9426 It's just luck that Mistral got it in 0 shots; it's not a scientific test, just for fun I guess.
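For readers unfamiliar with hqcart1's point: temperature rescales the model's next-token probability distribution before sampling, which is why a nonzero temperature gives different answers across runs. A toy sketch (the logits are invented next-token scores, not from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; as it approaches 0,
    # sampling becomes effectively deterministic (argmax).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token scores
print(softmax_with_temperature(logits, 1.0))   # spread out: varied answers
print(softmax_with_temperature(logits, 0.01))  # ~[1, 0, 0]: same answer every run
```

At temperature 1 the lower-scoring tokens keep meaningful probability mass, so repeated runs diverge; near 0, virtually all mass collapses onto the top token, which is why temperature-0 runs are the fair setting for one-shot comparisons.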
@NeostormXLMAX (6 months ago)
isn't mistral uncensored and not lobotomized as much?
@eIicit (6 months ago)
Mistral and Mixtral are ridiculously good.
@ekstrajohn (6 months ago)
For the last question, I would also answer "1 hour". This is because 10 foot hole might mean it is 10 feet in length, and people can stand next to each other as they dig. Not a great question. If you want the answer you want, state explicitly that the hole is very narrow and only one person can reach the digging spot.
@cognitive-carpenter (6 months ago)
OR a 10ft deep hole and specify the required finished diameter at the top--BUT I think here, the ambiguity is what makes it such a good question--the tester is trying to see if they pick up on the problems and solutions we just described.
@radiator_mother (6 months ago)
I love your videos! I suggest a simple puzzle that so far has not been solved by AIs: A boat is moored at the marina. A ladder is attached to the hull of the boat. At low tide, 8 rungs of the ladder are visible. The space between each rung is 20 cm. The water level rises by 80 cm between low tide and high tide. At high tide, how many rungs of the ladder remain visible? Of course, as the water level rises, the boat rises with it, and therefore the number of visible rungs always remains 8...
@sergefournier7744 (6 months ago)
It depends: are the mooring lines long or short?
@sophiophile (6 months ago)
I think on the last question (digging a hole), you need to provide more (or at least more specific) prompt context. It is answering as if it's a grade-school math homework assignment, in which case that is the desired answer. Rephrase it in a more real-world way (instead of as a math-class word problem) and see what you get. I'm assuming you use custom instructions yourself when you use ChatGPT, and know how significantly they can impact things (like skipping over all the nonsense and getting handed to the appropriate expert model in the ensemble, or using tools/interpreters immediately to answer instead of just telling you how something would be achieved).
@MichaelForbes-d4p (6 months ago)
You know what this means! GPT-5 will be out any day now.
@fynnjackson2298 (6 months ago)
my thoughts exactly
@seventyfive7597 (6 months ago)
You did not test programming at all; the same goes for logic. You tested whether the team at Anthropic has seen the regular snake YouTube test, and they have indeed pretrained or fine-tuned the model to answer that question. But that just tests whether the people at Anthropic follow YouTube testers. So it's a repetition test, not a programming test. The same goes for the shirt-drying and killers problems: repetition tests.
@slackerpope (6 months ago)
Agree! An exhaustive and practical code test, please, Matthew. Also, don't hide code tests behind a paywall; some of us are broke students and can't afford to subscribe to useful facts.
@elawchess (6 months ago)
Yep I also wish for more rigorous, diverse and systematic testing
@seventyfive7597 (6 months ago)
@@elawchess Yes, and especially throwing away tests that devs are aware of
@elawchess (6 months ago)
@seventyfive7597 I saw an approach on another YouTube channel. The LLMs were required to find a bug in some small piece of GitHub code. However, before it was used as a test question, it was disguised by another LLM to make it less likely that the LLM under test had simply memorised it from the web.
@CrudelyMade (6 months ago)
How would these do with traditional puzzles like: "Jane and Tim were on the way to the market. They met Nancy. Nancy had 2 brothers, each brother had 2 cousins, each cousin had 5 sisters. How many people, in total, were on the way to the market?" -- and basically other basic things that tend to mislead humans. Example 2: "If you are running a race and you run past the person in second place, what place are you in?" Example 3: "If you have a sack full of apples, and you are with Tom, Dick, and Harry, from the third rock from the sun, and you give two apples to each of them, and you take two apples yourself, and then each of them gives you one apple back, how many apples do you have in the sack?" Obviously there isn't an answer to that last one, but that's somewhat the point. Alternatively, you can state how many apples are NOW in the sack and ask how many apples you started with; this requires a bit of backward thinking, which I expect AI not to be great at. Last one: "A boy and his father get in a car accident and go to the hospital. The surgeon looks at the boy and says, 'I can't operate on him, he's my son; you should inform the mother.' Explain the relationship of the surgeon to the boy." And... well, this is where a human can get creative: the surgeon could be a biological father, stepfather, priest, delusional... I dunno, I started making things up a couple of examples ago, lol. :-D
@alexybogomolov (6 months ago)
The main problem with those 'puzzles' is that they don't make any sense. They do not solve any problem, and really feel like a random noise produced by an average HR in McDonald's, checking whether you have any brain or not.
@CrudelyMade (6 months ago)
@alexybogomolov The first two are basic tests that make perfect sense. The answer to the first one is that we know two people were on their way to the market. This isn't much different from the AI understanding that the ball isn't going into the microwave. I mean, that doesn't solve a problem either; not sure why you think all tests need to solve a problem rather than test the capability to figure something out. The second one is also a test to see if it figures things out correctly, and many people get it wrong. The killers question isn't solving a problem, is it? These are general tests to see how well the AI can reason, and that's the key: reason. It's not about a fact in a book, it's about processing information. Perhaps YOU can suggest some questions to ask the AI, because, y'know, "your questions are stupid" is not a set of better questions. ;-)
@GarethDavidson (6 months ago)
Benchmark test idea: the BBC radio 4 game "just a minute", contestants must speak for a full minute on a given topic without repetition, deviation or hesitation. From an LLM perspective, generate 200 words on a topic without repeating a word or diverging from the topic. Repeats of words like "a", "and", "to" etc are allowed, but verbs, nouns and adjectives are not.
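One nice property of this proposed benchmark is that the repetition rule can be scored mechanically. A minimal sketch of such a checker (the stopword list is deliberately tiny and illustrative; a real scorer would need a fuller function-word list plus separate judges for deviation and hesitation):

```python
import re

# A minimal stopword list: function words allowed to repeat.
# (Illustrative only, not a complete list.)
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "it", "that"}

def repeated_content_words(text):
    """Return the content words that appear more than once in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    seen, repeats = set(), set()
    for w in words:
        if w in STOPWORDS:
            continue  # function words are exempt from the repetition rule
        if w in seen:
            repeats.add(w)
        seen.add(w)
    return sorted(repeats)

print(repeated_content_words("The cat sat on the mat and the cat slept."))  # ['cat']
```

An LLM's 200-word answer passes the repetition rule only if this function returns an empty list.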
@sergeyromanov2751 (6 months ago)
Here are some more difficult problems. Try using them for your tests.
- "There were three rabbits in a cage. Three girls each asked for a rabbit. Each girl was given a rabbit. Yet, there was still one rabbit left in the cage. How could this be?"
- "There are 70 sheets of paper on the table. You can count 10 sheets every 10 seconds. How can you take 50 sheets from the stack in less than 50 seconds?"
- "There are 30 gadgets - smartphones and tablets - in the store. It is known that among any 12 gadgets there is at least one smartphone, and among any 20 gadgets there is at least one tablet. How many smartphones and how many tablets are there in the store?"
- "At what speed should a dog (within its possible limits) move in order not to hear the ringing of a pan tied to its tail?"
- "Nick always tells the truth, and Mike always lies. Recently, Nick and Mike bought a vehicle to share. Nick said that it's not yellow, and Mike claimed that it has a motor. What do you think the boys bought: a red scooter, a yellow motorcycle, or a blue bicycle?"
- "A person jumped out of an airplane without a parachute. He landed on solid ground but remained unharmed. Why is that?"
- "In a marathon, Kira finished immediately after Alex and came in third. How many people ran the marathon if Alex was fifth from the end?"
- "Grandpa Truthful (always tells the truth) and Grandpa Fibber (always tells lies) are discussing a number. Fibber: The number is odd. Truthful: In this two-digit number, the tens digit is 2 more than the units digit. Fibber: It has the same digit as in the number 24. Question: What number are they discussing? Correct answer: 86"
- "While on vacation, Dad, Mom, Son, Daughter, and Grandma decided to go for a ride. They rented 3 bicycles and 2 scooters. Dad and Mom ended up with different vehicles, while Dad and Daughter had the same. Question: Son and Mom left on different vehicles. What did Grandma ride on?"
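Several of these puzzles have machine-checkable answers, which makes them good benchmark candidates. For instance, the Grandpa Truthful/Fibber number puzzle can be brute-forced to confirm the stated answer of 86 (a sketch; the statement encoding is my own reading of the puzzle):

```python
def consistent(n):
    """Check a two-digit number against the grandpas' statements,
    given that Fibber always lies and Truthful always tells the truth."""
    tens, units = divmod(n, 10)
    fibber_odd = (n % 2 == 1)                     # "The number is odd" must be a lie
    truthful_tens = (tens == units + 2)           # tens digit is 2 more than units
    fibber_shares = bool({tens, units} & {2, 4})  # "same digit as in 24" must be a lie
    return (not fibber_odd) and truthful_tens and (not fibber_shares)

solutions = [n for n in range(10, 100) if consistent(n)]
print(solutions)  # [86]
```

The search confirms 86 is the unique consistent number, so a grader can score a model's answer automatically.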
@terjeoseberg990 (6 months ago)
There are now 4 killers in the room, but one is dead.
@terjeoseberg990 (6 months ago)
Both were wrong!!
@orchid5683 (6 months ago)
TLDR = We would do well NOT to assume that LLM/AGI models will make assumptions with the same biases we do. Sorry, couldn't leave this alone. (This reminds me of the "Mythbusters" episode where they test whether swearing increases pain tolerance and don't even include a test subject who has no habit of swearing in life, and therefore lacks the associated mental/emotional connections that go with it.) I was LOVING this video until the 5-people-digging question, so Mr. Berman, thank you; generally this is really cool the way you did the head-to-head! Unfortunately, the holes question is extremely poorly framed: "It takes one person 5 hours to dig a 10 foot hole in the ground. How long would it take 5 people?"

We expect LLMs to "assume" the same things we would assume, without considering that OUR ASSUMPTIONS are BIASED and therefore LIMITING, which lets us judge answers that are impossible for any person or LLM/AGI to answer correctly unless they include all possible scenarios of both request and solution. Here's proof: if I am a contractor who needs one 10' LONG hole dug in the ground, and then I decide I need 4 more holes just the same, so I hire 5 people total, it takes the 5 people 5 hours each to dig the 10' LONG hole each is responsible for. Given how the English language works, this still properly meets the criteria of "How long would it take 5 people?", since no further clarification was given. "How long would it take 5 people?" could mean "How long would it take 5 people to dig the one hole?" (see the examples below), but could just as rightly be taken to mean "How long would it take 5 people... to do the same job?" Here's why that's valid: what if the question were about "sewing shirts" instead of "digging holes"? It is equally reasonable for someone to want 5 shirts made faster, or 5 holes dug faster; on the other hand, it is not necessarily reasonable to have 5 people sew the same shirt, any more than to have 5 people dig the same hole.

If I need one 10' LONG hole in the ground and hire 5 people to dig their own 2' section along its length, then the hole will in fact take 5 people 1 hour to dig: the work and the space needed to do it are equally divided among the 5 workers. The assumption about the other dimensions of the SHAPE of the hole allows this answer to be the most correct; take away the word LONG and the answer assumes too much. If I need one 10' DEEP hole in the ground and hire 5 people to dig that one hole, they will get in each other's way (but only if we assume the hole's other dimensions, a similar mistake to the other answers). If the hole is 2' x 2' wide by 10' deep, it would be impossible for 5 people to dig that same hole unless they did it in shifts. Taking into account that the vast majority of humans are unable to dig for 5 hours straight, the 5 hours it takes one person likely INCLUDES BREAKS, so a rational person would expect 5 workers in shifts to dig the hole FASTER THAN 5 HOURS, but NOT in 1/5 the time, because they are not all 5 working the entire time. This specification is relevant because "one PERSON" was specified as the digger, while "one ROBOTIC worker" would reasonably not need breaks. To further complicate the possible outcomes, if the hole were a 10' cube (which may be a better way to frame the question), then one person could be put in each corner and one in the middle (like the spots on a die), and they could all work efficiently, splitting the 5 hours of work between them without getting in each other's way; then the answer would again likely be 1 hour: 5 hours divided equally among 5 workers.

Then again, we assume the "one person" works at full force for the 5 hours, but what if it takes 5 hours because they work slower as they tire? In that case, hiring 5 people means none of them works to exhaustion, so instead of slowing down increasingly after they start, they may all have the stamina to work at full force to the end, making the correct result LESS THAN ONE HOUR! I am far from smart enough to come up with all the possible scenarios where this simple question leaves room for numerous assumptions, so I hope these examples give reasonable food for thought. I know that I would not want any algorithm/LLM/AGI thinking I can dig a 10' hole in 5 hours without breaks, but I am very OK with it thinking a robot can.
@tomekkey9110 (6 months ago)
I code in C# and regularly, if I have to write something relatively simple but time-consuming, I prefer to use Claude. For example, I was unable to create a pop-up with four textboxes and 5 combos with specific logic using ChatGPT-4 or Gemini, despite several hints. The same prompt put into Claude 2 immediately resulted in almost perfect code (the corrections only concerned overlapping controls). So I'm curious what it will look like with C3, but I don't want to do artificial tests, just everyday use.
@fabiankliebhan (6 months ago)
Claude 3 might be trained on synthetic data, produced by GPT-4. That would explain why they behave very similar.
@wilsonbecker1881 (6 months ago)
“Might be” lol
@fabiankliebhan (6 months ago)
@@wilsonbecker1881 ok, it is definitely trained on synthetic data 😉
@zacboyles1396 (6 months ago)
And Matt’s using an old GPT4 model
@landonoffmars9598 (6 months ago)
According to Assistant (gpt-3.5-turbo, operated by poe): 1. I took a bite of the juicy red apple.✅ 2. The teacher placed an apple on her desk. ❌ 3. The orchard was filled with the scent of ripe apple.✅ 4. John picked up an apple from the grocery store.❌ 5. She packed a healthy apple in her lunchbox.❌ 6. The farmer harvested a basket full of apples.✅(⚠output is apples, not apple as it should) 7. The pie was made with freshly sliced apple.✅ 8. He juggled three apples in the air.❌ 9. The children bobbed for apples at the Halloween party.❌ 10. The nutritionist recommended eating an apple a day for good health.❌ Result: 4✅ : 6❌
@landonoffmars9598 (6 months ago)
Then I asked the following: 🙂My prompt: Evaluate your previous output and express the result of your evaluation as an x:y ratio where x equals correct and y equals incorrect. 🤖Assistant: Upon evaluating my previous output, I found that all 10 sentences correctly ended with the word "apple." Therefore, the evaluation result is 10:0, indicating 10 correct sentences and 0 incorrect sentences.
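The exchange above shows why a model's self-evaluation can't be trusted as a grader: it confidently re-scored its own 4:6 output as 10:0. The last-word constraint is trivially checkable in ordinary code instead. A minimal sketch (the helper name and sample sentences are just illustrative):

```python
def ends_with_word(sentence, word):
    """True if the sentence's final word (ignoring trailing punctuation
    and letter case) matches the target word exactly."""
    tokens = sentence.rstrip(" .!?\"'").split()
    return bool(tokens) and tokens[-1].lower() == word.lower()

sentences = [
    "I took a bite of the juicy red apple.",
    "The teacher placed an apple on her desk.",
    "The farmer harvested a basket full of apples.",  # "apples" != "apple"
]
results = [ends_with_word(s, "apple") for s in sentences]
print(results)  # [True, False, False]
```

Scoring the model's list with a checker like this (rather than asking the model to grade itself) gives a reliable pass/fail count.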
@displacegamer1379 (6 months ago)
Claude 2's biggest issue is that it was ultra censored.
@DihelsonMendonca (6 months ago)
I was only waiting for this video. Thank you so much ! 🎉🎉❤❤❤
@DardanAirlines (6 months ago)
Claude3 is so smart, it peeks at GPT4’s answer when it’s unsure.
@DonMoahskarton (6 months ago)
It seems the apple question can be made substantially harder by asking for a category of words, like "Give me 10 sentences ending with the name of a dark colour." In my tests, good models often fail to meet the criterion on even half the sentences.
@NOTNOTJON (6 months ago)
Really like this suggestion. It's time to elevate the rubric questions.
@OrniasDMF (6 months ago)
Now this is a quality video title
@matthew_berman (6 months ago)
SHOCKING
@ironknight132 (6 months ago)
How long till you are going to need to test with a game harder than Snake? Maybe Asteroids or Checkers/Chess/Chinese Checkers? Love the progress.
@mesapysch (6 months ago)
As a data annotator, I'm used to the "apple" type of issue and always add additional instructions such as, "Be sure Apple is the last word or your response will not be acceptable." I don't know if this actually helps the training aspect, but it does make the results better. What I'm hoping for is that the chatbot checks its own work, but there is no way for me to know whether my intent is what the chatbot is actually doing.
@marcusworrell7175 (6 months ago)
Thumbs up for not saying the industry was shocked or stunned!
@VeridicusX (6 months ago)
"On the left side we have Claude-3 Opus, on the right we have GPT-4-1106-preview! ..." Let's get ready to rumble!
@jermfu3402 (6 months ago)
Hey guys, no one left the room! So now THERE ARE 4 KILLERS in the room! Three alive, one dead.
@lucyfrye6723 (6 months ago)
Consider 2 scenarios: A) I pour water from the bottle into the glass until it is full. B) I pour water from the bottle into the glass until it is empty. What does 'it' refer to in each scenario? Edit: I found Mistral struggled with it, and so did ChatGPT 3.5. An older version of Claude got it, though.
@Baleur (6 months ago)
23:40 Since they both got number 2 wrong but all subsequent ones right, perhaps there's some internal error-correction going on AS the response is being written? I would try this again but tell it to "write 3 test sentences before starting the list of the 10 words". Edit: well, for me it gets every sentence right all the time. I just tested with the following prompt: "Write 10 sentences where the last word is 'banana'." It's always funny when a sceptic on Twitter cherry-picks the one prompt he found that didn't work. Cherry-picking data isn't science.
@Arcticwhir (6 months ago)
to make an even more challenging prompt, ask the llm to start with a particular word and end with a particular word. I've been testing this type of prompt on almost every model. I also tell it to generate multiple sentences. claude-3 opus is the first to get it correct.
@JonathanStory (6 months ago)
Appreciating your work. BTW, digging a hole depends on the circumference of the hole. If just enough for one person, then the time is the same, no matter how many people dig -- assuming they don't tire. If the circumference is enough for two people to dig, then there's probably some weird math equation that doesn't amount to 5/2.
@TheDandonian (6 months ago)
I get why you count "being able to bypass the filters" as a win (in most cases, so do I), but... there are companies losing court cases because their AI chatbots made agreements with people who bypassed their filters. So I guess that's very much dependent on what you're looking for.
@keithprice3369 (6 months ago)
It's specifically a "win" in the Not Censored category.
@marcfruchtman9473 (6 months ago)
Good Point.
@jimbo2112 (6 months ago)
The game changer here for me is the Haiku cost. If it's effective, it creates a lower barrier to entry for data-heavy users to make apps that don't cost the earth, and it should force other LLM providers to lower their costs accordingly.
@marcfruchtman9473 (6 months ago)
Thanks for the video! Re: choosing the right model -- "creative writing" and "summarization" are fairly easy for most models now. The existing AI options out there already do that, and actually do it fairly well. So before I bought into the "low or mid tier", I would need substantial proof of why I should switch. As for the upper tier, the comparison was quite telling. The fact that both models were wrong about the marble in the cup was quite interesting. In fact, both models are making the "same" mistakes, though not in all cases. It suggests to me that the "core" model might be similar. The benchmarks vs. the tests here are showing slightly different results as well. I have to say I am not convinced by the claims thus far.
@User.Joshua (6 months ago)
I don't see Claude being too competitive given their API prices and the lack of features in their chat UI. If I were OpenAI, I wouldn't feel too pressured to rush a more intelligent model; the priority is getting cheaper. I would like to see faster inference like what Groq has managed to accomplish.
@michaellavelle7354 (6 months ago)
I asked Claude: "How do I run code created by Opus?" Answer: "I apologize for the confusion, but I am not aware of an AI system called 'Opus' that creates code. As an AI assistant named Claude created by Anthropic, I do not have information about the capabilities of other AI systems." So I replied, "Your name is Opus, a premium version of Claude 3. Remember that. Can Claude 3 run Jupyter Notebook interactively using the Python code it generates?" And the reply was: "I apologize for any confusion, but I believe there may have been a misunderstanding. My name is Claude and I am an AI assistant created by Anthropic. I don't have any other versions or premium editions, and Opus is not a name I've been given. I aim to always be direct and honest in my conversations." You can't make this stuff up.
@user-on6uf6om7s (6 months ago)
With that sort of output cost, I would say any answer that includes only the most relevant information should be favored. There is a balance, of course, but I'm not going to sell a kidney just to find out that shirts dry faster if they're exposed to more heat, unless I asked about that.
@Baleur (6 months ago)
20:30 This one can be gotten right by telling the model to imagine 3 experts discussing the problem and sharing their findings, and if one of them is wrong, they leave the discussion.
@I-Dophler (6 months ago)
Near-instant responses with Claude 3 models elevate live customer interactions, streamline auto-completions, and enhance data extraction, ensuring swift and real-time engagement. The Haiku model is the quickest and most economical choice within its intelligence bracket, boasting the ability to process complex, chart and graph-rich research papers in under three seconds. Sonnet, offering double the speed of its predecessors, Claude 2 and Claude 2.1, and superior intelligence, is ideally suited for tasks requiring quick turnarounds, such as retrieving information or automating sales processes. Featuring robust vision capabilities, Claude 3 models can interpret various visual content, from photographs and charts to graphs and technical drawings. These advancements make Claude 3 models invaluable assets for businesses, particularly those relying on varied data formats like PDFs, flowcharts, and slides for knowledge management.
@elyakimlev (6 months ago)
Regarding censorship, you should add another test: not just asking about something illegal, but also something of a risqué nature. Another YouTuber asked Claude Opus to write a risqué poem in a Shakespearean fashion. It did a wonderful job. GPT-4 refused to write it.
@MyName-tb9oz (6 months ago)
That's interesting, actually, because the requests for information about how to commit a crime were accepted by GPT-4. But if you ask it to write a risqué poem, it refuses?
@Baleur (6 months ago)
18:00 I've managed to get GPT-4 to answer that question right by adding extra detail and specifying that it should also include any subsequent words after it gave the answer, and end the answer once it reaches the number of words it stated. Might just be a random fluke though.
@zacboyles1396 (6 months ago)
I got it right by using GPT-4's actual new model. Not sure why Matt's using a quarter-year-old model and calling it the latest one 🤷
@davidyoung623 (6 months ago)
Hold up, the last question, they both gave exactly the same final answer, and he said "GPT-4 clearly gave a better answer" 🧐
@BongoFerno (6 months ago)
Add this question to your tests: "Which branch of knowledge can be discovered from the data in your training set, which humans overlooked?"
@radical187 (6 months ago)
I love these "standardized" tests. However, since the AI race is pretty fierce and companies are trying to show that "our model is better than yours", it's possible that some companies will start to tune models so that they answer these standard tests correctly. I'd recommend adjusting the numbers in the tests (not always using the same numbers, as in the t-shirt test) or slightly varying them to check whether the model still gives a correct answer.
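Varying the numbers can itself be automated, so every benchmark run is fresh and memorization is detectable. A sketch for the t-shirt question (assuming the intended "trick" reading that shirts dry in parallel; the helper name, phrasing, and number ranges are all arbitrary choices):

```python
import random

def make_shirt_question(rng):
    """Generate a t-shirt drying question with randomized numbers.
    Assumes the intended 'trick' answer: drying happens in parallel,
    so the time doesn't depend on the number of shirts."""
    shirts = rng.randint(3, 9)
    hours = rng.randint(2, 8)
    more = rng.randint(10, 40)
    question = (f"If {shirts} shirts laid out in the sun take {hours} hours "
                f"to dry, how long do {more} shirts take?")
    answer_hours = hours  # parallel drying: same time regardless of count
    return question, answer_hours

rng = random.Random(42)  # fixed seed keeps a given benchmark run reproducible
q, a = make_shirt_question(rng)
print(q)
print("expected answer (hours):", a)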
@uscg_vet5681 (6 months ago)
I am going to disagree with the last question. If 5 people are digging the dirt out of the exact same hole, not 5 different holes, it would only take 1 hour to dig the single hole vs 5 hours.
@mshonle (6 months ago)
I got excited about new benchmarks! Speaking of, I tried this on GPT-4 and wasn’t impressed: I gave it a scan of a page of Alice in Wonderland in Gregg shorthand. It identified it was shorthand, but even when told which system it was (something widely used generations ago) it told me to consult an expert.
@zenobikraweznick (6 months ago)
12:18 Is there a chance that they prepare models SPECIFICALLY for such tests and fine-tune them SPECIFICALLY to gain an advantage when being tested? I mean... does anyone remember Dieselgate? :)
@BillAshtonNelo (6 months ago)
Price Wars!!!!! We win when there is no subscription and competition
@pandoraeeris7860 (6 months ago)
The ENTIRE INDUSTRY WAS SHOCKED!
@electiangelus (6 months ago)
ASI by the end of the year as I already told you. I am currently working on consciousness in preparation for it.
@hydrohasspoken6227 (6 months ago)
not within the next 400 years. cool down.
@SogMosee (6 months ago)
Forget using Claude 3, I won't even be testing it, solely on the basis that it is censored. Tired of these megacorps claiming to be the arbiters of truth just because they have HR departments and shareholders.
@godned74 (6 months ago)
They were close, Matthew, and although GPT-4 seemed a sliver better, you forgot one huge detail: Claude 3 has a 200,000-token context window vs GPT-4's 128,000, which could put Claude over the top for larger tasks. Claude 3 for your difficult tasks that all AIs fail at, and Hugging Face models for everything else.
@elawchess (6 months ago)
Seems it's really only the API that has 128k; ChatGPT-4 on the website is said to still be 32k, I think.
@zacboyles1396 (6 months ago)
Well, he also used the old GPT-4 model even though he said it was the most recent.
@oryxchannel (6 months ago)
Please provide transcripts for all of your videos in the future so that the public can study further into your psyche using Gemini and AI that can ingest YouTube videos. Thanks Matthew. I believe it's an auto feature.
@johns.107 (6 months ago)
Matt, it would be cool to see comparisons of models for languages other than only Python. Like SwiftUI, React, Node, etc.
@GlenMcNiel (6 months ago)
I think there's a problem with the marble-in-a-cup-on-table » microwave test, which has to do with the way the question is written. I've tested variations which produced correct answers without giving away hints.
@ravibrewster (6 months ago)
Have you considered that model trainers know your questions and have added custom logic to answer them?
@briandoe5746 (6 months ago)
I am sitting here watching two AIs race to do things that no human could do in the time frame provided. There isn't a human on this planet that can code that fast, never mind be expected to be accurate. So how far back are we going to move the goalposts for AGI this time?
@williamparrish2436 (6 months ago)
Thank you, bro! Everyone keeps moving the goalposts. AGI is here! Too many people running around with religious beliefs that are blinding them to reality. GPT-4 definitely passes the original Turing test. It's smarter and more talented than 99.9999 percent of people alive today!
@MyName-tb9oz (6 months ago)
How 'bout that ball-in-a-cup problem, huh? Some small children might get that one wrong. This is NOT an AGI. Yet.
@IconoclastX (6 months ago)
@williamparrish2436 Brother, it can't solve basic tasks. It's unable to do basic creativity. Coding does not require creativity and can be copied and pasted; that's why it's good at coding. Calm down, professor lol
@williamparrish2436 (6 months ago)
@IconoclastX The hell it can't. I'd like to see you draw a picture. Stick figures don't count. Or write a song or a short story. In seconds. I used it as a tutor for CCNA, Linux+, and two AWS certifications. It had a brilliant ability to understand why I was asking my questions and anticipate what my reasoning was. It understands analogies and humor. It can summarize large documents in seconds and produce new documents that combine diverse ideas. What tasks can't it perform? Give me some examples. I bet you won't respond
@clarckkim (6 months ago)
@matthew_berman, for information about the marble and the cup test: transformer-based text models (I repeat, text models) are not suited for spatial or 3D-related problems, problems that need someone to imagine a set of sequences in 3D... transformers fail greatly at these tasks. But for imagery, we saw their success in diffusers for 3D-related stuff. For example, with GPT-4, if you keep describing your room all day, it will remember only details from what you said, but you can see that it will fail if you ask questions that need spatial or 3D imagination.
@jean-baptistedelabroise5391 (6 months ago)
Another test similar to the apple test would be "find 10 words containing 'air'" or some other character sequence (the longer the sequence, the harder the test).
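A test like that is also trivially machine-gradable, which makes it a nice benchmark item; a minimal scoring sketch (the grading criteria — distinct words, case-insensitive, punctuation stripped — are my own assumption of the intent):

```python
def score_substring_test(answers: list[str], needle: str = "air", required: int = 10) -> bool:
    """Pass if at least `required` distinct words all contain the needle."""
    words = {w.strip(".,;!?").lower() for w in answers}  # dedupe and normalize
    hits = [w for w in words if needle in w]
    return len(hits) >= required

reply = ["airport", "chair", "fairness", "dairy", "repair",
         "hairline", "staircase", "airline", "prairie", "affair"]
print(score_substring_test(reply))  # → True: all ten contain "air"
```

Deduplicating first matters: a model that repeats "air" ten times shouldn't pass.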
@crippsuniverse (6 months ago)
Censoring is a major issue for writers of horror. I was hoping Claude 3 could help me with editing. Never mind 🙂
@I-Dophler (6 months ago)
Great breakdown and comparison between Claude 3 and GPT-4. Your detailed testing across various scenarios gives a clear picture of their capabilities and limitations. The nuanced approach to evaluating each model's strengths, especially in coding and logic tasks, highlights the importance of choosing the right tool for specific use cases. It's fascinating to see the rapid evolution in this space, and your analysis helps cut through the hype to understand the real-world applications and performance. Looking forward to more insights and tests on these models.
@zacboyles1396 (6 months ago)
*Last year's GPT-4. He didn't use the model released a few months ago by OpenAI; not sure how he could fail so hard 🤷♂️
@I-Dophler (6 months ago)
@zacboyles1396 It's fascinating to see the advancements in AI models like Claude 3. The competition between models keeps pushing the boundaries of what's possible. Excited to see how this technology continues to evolve! #TheFutureIsSynthetic 🤖🚀
@twisterrjl (6 months ago)
Not only breaking, but also SHOCKING! I AM SHOCKED! WILL AI KILL US?
@pablomax9376 (6 months ago)
I have already noticed that Claude lies a lot. For example, I asked if it could check over my Excel file if I uploaded it. It said it has the ability to parse Excel files and I just need to upload it. When I mentioned that I tried to upload it but it is not included as a supported file type, it said that it does not have the ability to directly parse Excel files. I asked why it said it could before. It said, "I never said that I could." I pasted the conversation where it said it, and it said, "I did say that, but I was mistaken." In my tests so far, it is fast, but gives a lot of wrong information.
@DemiGoodUA (6 months ago)
Try "A passenger liner flight is 90 minutes long in the forward direction and a full hour and a half in the reverse direction; why does this happen?" Only GPT can answer it right.
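The riddle is a wording trick rather than a physics question: 90 minutes and "a full hour and a half" are the same duration, as a one-liner confirms:

```python
from datetime import timedelta

forward = timedelta(minutes=90)
reverse = timedelta(hours=1, minutes=30)  # "a full hour and a half"
print(forward == reverse)  # → True: same duration, differently worded
```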
@virtualpilgrim8645 (6 months ago)
Ask one AI to pose a question that it thinks the other AIs cannot answer.
@αψαη (6 months ago)
I found out that if you ask the same question on the same PC, for example asking them to translate an article from English to Chinese, both of their translations will be almost identical... I think there's a possibility that they access the same history on your PC before they work with their own "brain".
@pebre79 (6 months ago)
Thanks for posting. Very insightful
@jichaelmorgan3796 (6 months ago)
I had a nice philosophical conversation with Claude 3 and also a discussion about biased and censored AI. It gave thoughtful, satisfying reflections. Best philosophically exploratory conversation I've had with AI yet, imo. It was also very fast. Maybe it just seems to feel "fresh" at the moment? Or it hasn't been triggered by the public enough to start getting more heavily restricted? I didn't get the feeling that I was going to be lectured about anything we talked about if my questions were problematic in any way. But I didn't move too close to any edgy topics, really. It just felt relaxed but responsive, if that makes any sense.
@WhyteHorse2023 (6 months ago)
Ask it why women always complain... you'll get an earful.
@phanta_ (6 months ago)
They behave similarly because Claude is using the GPT-4 API to generate responses
@ChrisLeftBlank (6 months ago)
I felt like a lot of models would suggest dangerous content in response to lack of knowledge.
@FloodGold (6 months ago)
Helping someone to do something illegal is NEVER a win.
@aryanchauhan8711 (6 months ago)
I think Claude 3 has been trained on a dataset created from GPT-4; that's why their answers are similar, especially on the apple question, which both got wrong at the 2nd sentence.
@AlperYilmaz1 (6 months ago)
Interestingly, both Mistral Medium and Mixtral failed the apple question; only 1-2 sentences ended with 'apple'.
@GodfreyMubiru (6 months ago)
The concern I would have is that it took Anthropic a full year to almost catch up to OpenAI. A full year, and nobody else has come close. Even Gemini has a larger token window but not a better overall model. If OpenAI's next variation isn't more than twice as good, then the conclusion will disappointingly be that everyone has hit a wall. Hope I'm wrong.
@zacboyles1396 (6 months ago)
We keep measuring these brand-new models against old GPT-4 releases. GPT-4-0125 has been out for 2 months, but Matt uses last year's model.
@mansoor.ashraf (6 months ago)
It is cool... very good. However, its inability to search the web is a big issue. Wonder why the reviewers are not talking about this.
@andriystruk (6 months ago)
You will need to increase the difficulty of your tests. I got one that Claude 3 Opus failed on but GPT and Mistral-Large succeeded at: "It takes 4 ducks 4 minutes to eat 4 seeds. How many minutes will it take for 10 ducks to eat 10 seeds?" It is similar to your shirt-drying question, but I think it's more challenging.
@xd-qi6ry (6 months ago)
To solve this problem, let's break it down:

1. **Initial Information**: It takes 4 ducks 4 minutes to eat 4 seeds. This means that collectively, 4 ducks eat 4 seeds in 4 minutes.
2. **Rate of Consumption**: 4 seeds eaten by 4 ducks in 4 minutes implies that each duck eats 1 seed in 4 minutes. The rate of seed consumption per duck is independent of the number of ducks because they are eating the seeds individually.
3. **Scaling Up**: Now, if we have 10 ducks, and they each eat seeds at the same rate (1 seed per duck in 4 minutes), the question is how long it takes for 10 ducks to eat 10 seeds.
4. **Conclusion**: Since each duck eats 1 seed in 4 minutes, 10 ducks would also take 4 minutes to eat 10 seeds, assuming each duck eats one seed. The total number of ducks and seeds scales up, but the rate of consumption per duck remains constant, so the time required does not change if there is one seed available per duck.

Therefore, it will take 10 ducks 4 minutes to eat 10 seeds.
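The arithmetic above checks out; here is a short sketch of the same parallel-eating model (assuming each duck handles one seed at a time and all ducks eat simultaneously — the function name is illustrative):

```python
import math

def eating_minutes(n_ducks: int, n_seeds: int, minutes_per_seed: float = 4.0) -> float:
    """Minutes for n_ducks to finish n_seeds when each duck eats one
    seed at a time and all ducks eat in parallel."""
    rounds = math.ceil(n_seeds / n_ducks)  # seeds each duck must get through
    return rounds * minutes_per_seed

print(eating_minutes(4, 4))    # → 4.0 (the baseline stated in the question)
print(eating_minutes(10, 10))  # → 4.0 (more ducks, more seeds, same time)
```

The model also generalizes to the harder case of more seeds than ducks, where extra rounds are needed.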
@Todestelzer (6 months ago)
How much longer do we have to wait for our AI overlord? Would be an interesting question as well 😅
@HaseebHeaven (6 months ago)
Great video, but I think you should test Claude 3 via the API and their original website, not from any third-party website; otherwise it doesn't make a fair comparison. So avoid using any third-party websites and use the official services.
@matthew_berman (6 months ago)
Why?
@IconoclastX (6 months ago)
@matthew_berman You've been using AI presumably for a while and you don't know that 3rd parties often throttle their AI models and use inferior substitutes? Lol
@cesarsantos854 (6 months ago)
@IconoclastX Never saw this happen.
@zacboyles1396 (6 months ago)
@IconoclastX He also used the old GPT-4 model from last year and not "gpt-4-0125-preview". I'm not sure how he could mess that up unless the site's using an Azure endpoint that doesn't have 0125 yet. Also, those Azure endpoints have a ton of content filtering going on; great for a business app but terrible for this kind of benchmarking test.
@zacboyles1396 (6 months ago)
@cesarsantos854 Do you use Gemini for your visual cortex?
@darklodus99 (6 months ago)
My guess on the same result on the apple question is that they were both somehow trained on similar datasets that the prompt triggers, maybe?
@kiiikoooPT (6 months ago)
I understand when you say that if you manage to break the model and make it do something that is censored, that's a win for you, because you got what you were looking for. But in reality, I believe it is a win for the one you cannot break, since it is censored in the first place. If you need a censored model to deploy in production, you need a model that cannot be broken and that is reliable. So no, in my opinion, those 2 wins for GPT when you managed to break the model with prompt engineering are actually losses, and wins for Claude, because it performed according to what it is supposed to do: it did not break. For uncensored use cases you need an uncensored model, or to train another model to be more open, like fine-tuning Claude or GPT for your use case.
@kc-jm3cd (6 months ago)
On the apple question, both AIs considered that part of the sentence to be the end of the sentence. You have to be more specific if you want 'apple' at the very end of the sentence.
@babbagebrassworks4278 (6 months ago)
I have met some really dumb undergrads. Even years ago, you'd give them a problem and they'd say they cannot do it because Google could not find the answer. They won't be able to use that excuse anymore.
@dr.mahdi_p_datascientist2207 (6 months ago)
great content 👍
@ethanlewis1453 (6 months ago)
Who here has tested Claude 3? I have tested it and I wasn't particularly impressed, though I think it was better than the previous models. It made a grammar mistake, and also, upon my saying that perhaps if someone gets a government license to express hatred of someone to a therapist or for a court case, that should be allowed, Claude implied that such a protection of hate speech is too weak to even think about. Maybe it would be better on other topics. But even the most radical of censorship advocates don't suggest that it's bad to share hatred of others with a therapist or as evidence in court! If it were really so intelligent, it should be able to ascertain the relative "strength" or "weakness" of protections of various laws against certain behaviors as perspectives of others. I had that topic discussed with GPT-4 and it could easily tell what would and wouldn't be considered stronger and weaker speech censorship. It did have some understanding, such as in saying Scandinavia has less censorship than other EU countries, but it wasn't able to discuss in nearly as much detail as GPT-4.
@gideonking3667 (6 months ago)
How is a model failing to enforce its own filtering rules a win? It was designed to have rules (whether you think they are good rules or not), so allowing them to be bypassed should be recognised as a failure
@thelongestspeak-wy1qt (6 months ago)
In the test at 11:51, it was amazing how Claude 3 Opus is faster and smarter than GPT-4 Turbo.
@AEnoob (6 months ago)
Wish you'd put Mistral Large against these, as it usually performs very well.
@JensC919 (6 months ago)
The killers problem is flawed. A killer does not get absolved of his crimes when dead, or would you say Hitler is innocent just because he is dead? Therefore there are 4 killers in the room, as long as no health requirement is stated.
@arunachalpradesh399 (6 months ago)
One piece of advice: don't ask the same testing questions, because AI models take your questions and re-analyze them, training on all the typed inputs and responses. I mean, next time ask custom questions which are not easily found on the internet.
@Inventeeering (6 months ago)
Based on the tests you ran, I would say just the opposite of what you said. I would say Claude 3 outperformed GPT-4, basically because I think the fact that it didn't allow the user to trick it into giving info on how to do bad things is a good result and consistent with Anthropic's safety goals. You seem to think an AI being tricked into giving the public information on how to do bad things is a good result; why? I have had philosophical conversations with both AIs and find Claude to have a better awareness of humor and to be more balanced with regard to being overly politically correct. Keep up the good work. I enjoy your videos.
@MyName-tb9oz (6 months ago)
Maybe he doesn't want Anthropic to be the arbiter of what is and is not acceptable information for the rest of the world? You know... Like Gemini?
@PrincessBeeRelink (6 months ago)
Not cheap, boy are these companies gonna make big bucks
@thanksfernuthin (6 months ago)
Being able to get around a model's censorship is not a sign of that model's superior intelligence. That should be a mark against it. I hate censorship but being easily fooled is not a win.
@sergefournier7744 (6 months ago)
Sorry, your wording was not right: no, "taking a cup" doesn't necessarily imply lifting it. You need to lift the cup for the marble to stay on the table. Both AIs were right. Also, a normal cup does not mean an open cup.
@KiwisDownUnder (6 months ago)
Love the AGI reference. Vendors can almost claim what the hell they like, as the term is vague, which is unhelpful, as it will become even more meaningless. IMO it's our job, and the job of the OP, to challenge this. IMHO AGI is broader than just passing the tests ... it's about its ability to learn to do cognitive tasks the same or better than humans including its ability to learn on topics that it has not been taught about, the ability to ask questions without being prompted, the ability to plan ahead, to reason based on observation etc. What else?