Claude 3 Release and The Problem with Benchmarks

Views: 7,337

Day ago

In this video, we will look at the new king of LLM benchmarks, Claude 3 from Anthropic. We will run a few tests of our own and look at why the reported results may not reflect the true performance of the Claude 3 family.
🦾 Discord: / discord
☕ Buy me a Coffee: ko-fi.com/promptengineering
|🔴 Patreon: / promptengineering
💼Consulting: calendly.com/engineerprompt/c...
📧 Business Contact: engineerprompt@gmail.com
Become Member: tinyurl.com/y5h28s6h
LINKS:
Claude-3 Announcement: www.anthropic.com/claude
Claude Chat: claude.ai/chats
Technical Report: tinyurl.com/yc5y6zwj
Claude-3 vs GPT-4: tinyurl.com/mprdy3rp
Claude-3 API Access: console.anthropic.com/
TIMESTAMPS:
[00:00] Introducing Claude 3: The Challenger to GPT-4
[01:41] Benchmarking Claude 3 Against GPT-4: The Reality
[03:35] Intended Applications and Price Analysis of Claude 3 Models
[06:21] Hands-On Tests: Accuracy, Image Understanding, and Coding Abilities
[14:04] Revisiting Benchmarks: A Closer Look at Claude 3 vs. GPT-4
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Comments: 38

@duudleDreamz 3 months ago

Excellent video. Kudos for pointing out the benchmark comparison inconsistencies.

@engineerprompt 2 months ago

Thank you!

@adamgdev 2 months ago

Good find with the footnotes!

@jd_real1 3 months ago

I had a very good experience with Claude. It did everything that GPT-4 did, plus it answered medical questions for me that GPT refused to answer. The only thing I didn't like about Claude was that I couldn't find a way to use speech as input. I won't switch from GPT-4 until they add this function.

@engineerprompt 2 months ago

From what I have noticed, earlier versions of Claude had more refusals than the GPT-4 model, so it is great to see fewer refusals in this version. Speech input might be coming if they ever release an app.

@TheEarlVix 3 months ago

Thank you for putting in so much work so that we don't have to. Much appreciated :-)

@engineerprompt 2 months ago

@HistoryIsAbsurd 2 months ago

I swear I've subbed to you before? I don't understand YouTube sometimes lol. Thanks for the vid, and I really agree with this a lot.

@engineerprompt 2 months ago

Welcome back!!!!

@cromdesign1 3 months ago

Poe added it to their app and site a couple of hours after it was released, though not Haiku yet. The 200k versions are there too. Their subscription gives 1,000,000 chat credits per month; a prompt to the non-200k Opus costs 750. Poe has a lot of different models.

@engineerprompt 2 months ago

nice, need to check that out!

@pooyatolideh9527 3 months ago

So much for "industry shaking"...

@xbon1 2 months ago

If you pay $20 a month, you can get the Opus model in chat too.

@MarcusNeufeldt 3 months ago

🎯 Key Takeaways for quick navigation:
00:00 🚀 Claude 3 challenges GPT-4 with claims of superior benchmarks.
00:15 🌐 Introduces Haiku, Sonnet, and Opus models with diverse applications and enhanced multimodal support.
01:09 💸 Models vary in intelligence and cost; Sonnet and Opus are accessible via the Claude API.
01:49 🏆 Opus tops GPT-4 in benchmarks, but comparisons use GPT-4's older version.
03:38 🛠️ Opus suits task automation and enterprise-level R&D; more expensive yet offers an input token cost advantage.
07:07 🕵️‍♂️ and 09:46 📸 Showcases prowess in information retrieval, correction, and multimodal interpretation.
12:58 💻 Proves coding capability through D3 code generation for self-portrayal.
14:19 🤔 Highlights issues in benchmark fairness due to outdated GPT-4 comparison.
17:04 📈 Claude 3's retrieval capabilities present as a viable, albeit pricier, alternative to a RAG pipeline.
Made with HARPA AI

@impushprajyadav 3 months ago

Thank 😂

@elawchess 3 months ago

I think what Anthropic did in comparing against the old GPT-4 benchmarks is fair. Why? Because this Medprompt stuff you are talking about is "unofficial" in some sense. If OpenAI had taken it upon themselves to release official updates on those tests, only then would it become absurd for Anthropic to ignore that and compare to the old benchmark results.

@tarmiziizzuddin337 3 months ago

gggg🎉😢😢

@gtpolpo9445 2 months ago

Is Claude 3 better than GPT-3.5 for coding?

@engineerprompt 2 months ago

I would say yes for Claude 3 Opus; I'm not sure about the smaller models.

@carlkim2577 3 months ago

For my use cases, I tested both, and it is clearly apparent that GPT-4 is superior in reasoning.

@engineerprompt 2 months ago

Just curious, what is your use case?

@carlkim2577 2 months ago

@engineerprompt Sure, I appreciate your videos, so I'll give a full reply. I use AI for everything, but mainly tech support for data modeling. I'm an economist, so in my day job I clean data and use Tableau for visualization. I tested Sonnet last night with this question: in Windows 10, how do I set a folder and its subfolders to a certain view, i.e. large icons? Sonnet kept getting it wrong and hallucinating, even when I told it my Windows version and kept correcting it. I tried GPT-4 and it was done in one shot. Then I asked both models to give me story beats based on a scenario. I compared the two, and GPT-4 is clearly more sophisticated in ideation. Claude may have a simple grammatical structure that appeals to more people, but in terms of idea generation GPT-4 was better. I do see a use case for Claude Opus. It likely handles long-document recall better than GPT-4, and I do believe it matches GPT-4 in coding tasks. But is that enough for me to pay for Pro? I'm still debating and testing.

@CyberGizmo 2 months ago

Well, poor Claude still got it wrong: Mike Scott was the first CEO of Apple.

@engineerprompt 2 months ago

Even I didn't know :)

@CyberGizmo 2 months ago

@engineerprompt Really enjoy your channel, thanks for all the hard work you do!

@giosasso 2 months ago

Their naming conventions make no sense. First, why is Claude the base level name? Introducing the Claude 3 family. We have: Claude Haiku, Sonnet, and Opus. These names don't work as product names. It's not clear as to which model is better.

@engineerprompt 2 months ago

Haikus are usually short poems, sonnets are longer than haikus, and an opus is longer than a sonnet. It seems like they are taking inspiration from poetry here.

@frogdeity 2 months ago

I've been using Claude 3 for a bit now and it's really nothing special. It consistently cannot answer simple questions and forgets important context constantly.

@adhumon55 3 months ago

When it comes to writing better, more human-like copy, Claude 2.0 and 2.1 are the kings. Unfortunately, Claude 3 is destroying the strength that Claude had (human-like content)!

@PrincessBeeRelink 2 months ago

shady advertising on their part, no one will ever beat GPT4!

@anonymeister123 3 months ago

Claude sucks. Still waiting for API access. They’re trying to gate keep when there are other greater services

@HaseebHeaven 2 months ago

It's easy to get access: sign up, and in a week or so you will get access.

@thomassynths 2 months ago

On top of that, they disallow commercial use. I can't think of anyone (even OSS bros) who is willing to pay for that.

@anonymeister123 2 months ago

@HaseebHeaven It's been over a month. I have a feeling they gatekeep so that they can keep the service level high. They can handle a lot more tokens, and at a faster rate, when they only allow a few people to use their service.

@R0cky0 2 months ago

Don't use it then

@anonymeister123 2 months ago

@R0cky0 duh. Keep up with the conversation, bub. That's exactly what I'm recommending.