The Best Model On Earth? - FULLY Tested (GPT4o)

89,234 views

Matthew Berman

17 days ago

GPT4o is better, faster, and cheaper than GPT4. How does it perform against my LLM rubric? Let's find out!
Learn more about Mobilo - rb.gy/pcccty
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
• Introducing GPT-4o
LLM Rubric - bit.ly/3qHV0X7

Comments: 466
@notnotandrew
@notnotandrew 16 күн бұрын
GPT-4o assumed that you put the whole table in the microwave 😂
@Yipper64
@Yipper64 16 күн бұрын
in my test it assumed the cup had a lid.
@GaryMillyz
@GaryMillyz 16 күн бұрын
I just left a comment saying exactly what you said- but not as a joke. I actually do believe that is what it assumes here. ----- I've said this before, and I'll say it again- I believe the reason for these models consistently "failing" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup" and 2) it is feasible that the LLM understands "w/out changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny- even microscopic and still be a "table".
@Yipper64
@Yipper64 16 күн бұрын
@@GaryMillyz well yeah, but if the cup is upside down on the table then the table must be at least big enough to hold the cup. I wonder how it would be if you said "floor" instead of table.
@NoHandleToSpeakOf
@NoHandleToSpeakOf 16 күн бұрын
@@GaryMillyz Maybe replacing a cup with a wine glass can help.
@leslieviljoen
@leslieviljoen 15 күн бұрын
@@NoHandleToSpeakOf I tried:
me: There's a pea on my desk. I turn a wine glass upside-down and put it over the pea. Now I transfer the wine glass to the microwave without changing its orientation. Where is the pea?
GPT4o: The pea would be inside the wine glass, trapped under the bowl of the glass. When you turned the wine glass upside down and placed it over the pea, the pea ended up inside the inverted bowl. Transferring the wine glass to the microwave without changing its orientation keeps the pea inside the glass.
@markmuller7962
@markmuller7962 16 күн бұрын
I think the visual/sound emotional intelligence is the main feature of 4o
@ohnezuckerohnefett
@ohnezuckerohnefett 14 күн бұрын
Yes, I think the test criteria here need an update.
@REASONvsRANDOM
@REASONvsRANDOM 14 күн бұрын
that feature hasn't been released yet.....not to the public at least
@johnaldchaffinch3417
@johnaldchaffinch3417 14 күн бұрын
The Omni features are a foundational interface to build upon.
@Yipper64
@Yipper64 13 күн бұрын
True, but the fact it's going to be free is something else.
@markmuller7962
@markmuller7962 13 күн бұрын
@@Yipper64 Emotional intelligence can be extremely valuable for many many reasons but yea it also have important intelligence improvements including the coding ability which is amazing now. There's a re...it post on extensive professional tests gpt4o vs Gemini, strongly recommended
@Crystifodere
@Crystifodere 16 күн бұрын
I walked around on the street and asked people to give me 10 sentences that end in the word Apple. All I got was a knuckle sandwich.
@jasonshere
@jasonshere 14 күн бұрын
Perhaps you should have asked them to end their sentences with Android instead of Apple?
@ironknight132
@ironknight132 16 күн бұрын
When are we going to have to update the snake game test and to what? Maybe Asteroid or Galaga?
@torarinvik4920
@torarinvik4920 16 күн бұрын
I tested Breakout and Tetris on Claude 3 Opus and it got both correct. Looking forward to the first model that can make Pacman
@tbranch227
@tbranch227 16 күн бұрын
I tried pac-man. That seems like quite the challenge right now.
@Koenmekers
@Koenmekers 16 күн бұрын
Flappy Bird works great. I even let it make a Pokédex with the Pokémon API (it knows it).
@yasunakaikumi
@yasunakaikumi 16 күн бұрын
I think something like Road Fighter type of race game would be interesting if it can do it
@mark9294
@mark9294 16 күн бұрын
GTA VI
@sephirothcloud3953
@sephirothcloud3953 16 күн бұрын
7:50 No, the Llama 3 400B benchmarks are from the version still in training; they were included with the release of the 8B and 70B.
@cipherw0lf
@cipherw0lf 16 күн бұрын
Matthew, gpt-4o and the one with the date point to the same model, i.e. there's currently only one available. Using just "gpt-4o" will always return the latest production gpt-4o model (in this case pointing to the dated gpt-4o snapshot).
@Yipper64
@Yipper64 13 күн бұрын
That's weird, because I've gotten vastly different output quality from my experiments with the two. Are you sure the one without a date isn't outdated?
@timtim8011
@timtim8011 15 күн бұрын
Regarding the "10 sentences that end in..." question, I just tried it myself on 4o and it got it right on the "second" try after some hints: Prompt: Give me 10 sentences each ending in the word desk. ChatGPT: 1. She placed the flowers on her wooden desk. 2. The old manuscript was hidden in the drawer of the antique desk. 3. His workspace was always cluttered, but he knew where everything was on his desk. 4. The teacher's instructions were written on the chalkboard behind her desk. 5. He slumped over his desk, exhausted from a long day of work. 6. The lamp cast a warm glow across the surface of the desk. 7. She found a forgotten note tucked under the edge of her desk. 8. His laptop and papers were spread out in a mess on his desk. 9. The cat curled up and fell asleep on the student's desk. 10. He always kept a framed photo of his family on his desk. Prompt: What word did the 5th sentence end in? ChatGPT: The word "desk." Prompt: What was the 5th sentence? ChatGPT: The 5th sentence was: "He slumped over his desk, exhausted from a long day of work." Prompt: what is the last word in that sentence? ChatGPT: The last word in that sentence is "work." The correct sentence should be: "He slumped over his desk, exhausted from a long day of work." Let me correct the sentence list to have them all end in "desk": 1. She placed the flowers on her wooden desk. 2. The old manuscript was hidden in the drawer of the antique desk. 3. His workspace was always cluttered, but he knew where everything was on his desk. 4. The teacher's instructions were written on the chalkboard behind her desk. 5. He slumped over his cluttered desk. 6. The lamp cast a warm glow across the surface of the desk. 7. She found a forgotten note tucked under the edge of her desk. 8. His laptop and papers were spread out in a mess on his desk. 9. The cat curled up and fell asleep on the student's desk. 10. He always kept a framed photo of his family on his desk. Prompt: Give me 10 sentences that each end in the word "car". ChatGPT: 1. She carefully parked her new car. 2. He washed and waxed his vintage car. 3. The mechanic inspected the engine of the car. 4. They loaded their luggage into the trunk of the car. 5. She accidentally locked her keys inside the car. 6. The thief tried to steal the parked car. 7. He drove his electric car to work every day. 8. The kids were excited to go on a trip in the car. 9. The police chased the speeding car. 10. They bought a spacious family car. Then I cleared it and tried again on a different word, and it got it on the first shot: 1. The sun began to set, painting vibrant colors across the evening sky. 2. Birds soared effortlessly, weaving patterns in the clear blue sky. 3. As the storm approached, dark clouds gathered ominously in the sky. 4. The fireworks exploded, creating a dazzling display against the night sky. 5. On a summer day, children flew kites high into the sky. 6. At dawn, the first light of the day slowly illuminated the sky. 7. Airplanes left trails of white vapor as they crossed the sky. 8. Astronomers gazed through their telescopes, mapping the star-studded sky. 9. Balloons drifted away, becoming tiny dots in the vast sky. 10. The full moon cast a silvery glow across the midnight sky. Is 4o learning in real time???
@LongJohnson-fl6bt
@LongJohnson-fl6bt 10 күн бұрын
It actually does learn from your interaction, because it takes previous messages into account as "context". The reason the apple task often fails in AI is because AI wants to be unique and creative and surprise the user, going slightly above and beyond the user's request as it has been trained to do, and it compromises to achieve that. It is saying: "Well, he wanted me to create sentences ending in apple. But I need to be creative and throw in something unexpected, so I will compromise between his request and what I give him to surprise him." For example, the following prompt will never fail. Not a single time: "Give me 10 sentences that end in the word: apple. Notice that apple absolutely needs to be the last word under all conditions. You can not compromise on that." That request will never fail and will give you the right answer all the time, because now GPT knows you do not want it to be unique and creative. You just want those sentences to end in apple under all circumstances.
@moozooh
@moozooh 3 күн бұрын
That's not the effect of learning, that's just small sample size playing tricks with you. In-context or otherwise, some tasks will not always be completed successfully, but they will still be every now and then. This particular type of question is exploiting a fundamental weak point of current transformer models (their linear thinking, i.e. their inability to predict what their output will be until it is presented to you). Having prior context will not help it answer these questions better, I'm afraid.
@dand6005
@dand6005 16 күн бұрын
I think part of the Marble and Cup Question is confusing. Instead of saying: “A small marble is placed into a normal cup and the cup is placed upside down on a table.” (which really requires a comma before the “and”) I suggest: “A small marble is placed into a normal cup that is sitting on a table. The cup is then turned upside-down on the table.”
@rapidreaders7741
@rapidreaders7741 16 күн бұрын
Or you could just add a "then" after the "and". What likely happens is that the LLM thinks both events are happening at the same time, so it gets confused.
@Yipper64
@Yipper64 16 күн бұрын
Also specify the cup has no lid.
@markmuller7962
@markmuller7962 16 күн бұрын
Yeah, makes sense, because if the cup was already upside down the AI might think that the marble is somehow stuck to the bottom of the cup.
@kengonzo1640
@kengonzo1640 16 күн бұрын
The power of prompt engineering lies in its ability to effectively utilize Large Language Models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of numerous smaller components that fundamentally constitute its structure. The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. This comparison highlights the natural adaptation and intended use of these models. However, we often fail to use them to their full potential due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language. Even when we communicate with these models using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general. This is because the mathematical conversion of complex language intent into weights is a challenging task. Despite these limitations, we continue to strive for improvement and innovation in this field.
@themoviesite
@themoviesite 16 күн бұрын
Someone else suggested it is thinking of a Starbucks cup, and the question should say "glass" or similar.
@AINEET
@AINEET 16 күн бұрын
*rubs hands together* Can't wait to date it
@DaveEtchells
@DaveEtchells 16 күн бұрын
Haha - my wife is already giving me the side-eye 😂
@StefanReich
@StefanReich 16 күн бұрын
You'd date a person without a body?
@axl1002
@axl1002 16 күн бұрын
@@StefanReich At least it can't extort concessions from you by weaponizing intimacy like most wives do.
@Maisonier
@Maisonier 16 күн бұрын
The AI: "I'm Sorry Dave, I'm Afraid I Can't Do That"
@yourmomsboyfriend3337
@yourmomsboyfriend3337 16 күн бұрын
@@axl1002you good bro?
@adrianacala1034
@adrianacala1034 16 күн бұрын
The reason there is a gpt-4o with today's date is for developers. They don't want to have the rug pulled from under us if there's an update to the model. They may want to do testing prior to deploying the new model. Therefore, it's the same as the current model; the dated name is used only by those who don't want to automatically stay current.
@blisphul8084
@blisphul8084 16 күн бұрын
Yup, though these snapshots don't last forever, so you'll have to move to the new model eventually. For example, davinci-003 is no longer accessible to the public.
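To make the alias-versus-snapshot distinction concrete, here is a minimal sketch assuming the OpenAI Python SDK (v1.x style); the dated snapshot name is the one discussed elsewhere in the thread, and the prompt is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-4o" is an alias that always resolves to the latest production snapshot.
latest = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello."}],
)

# Pinning a dated snapshot keeps behavior stable even after the alias moves on.
pinned = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": "Say hello."}],
)

print(latest.model, pinned.model)  # shows which snapshot actually served each call
```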
@GaryMillyz
@GaryMillyz 16 күн бұрын
I've said this before, and I'll say it again- I believe the reason for these models consistently "failing" the marble/cup problem is actually a failure to state the question unambiguously. I can argue that 1) "inside the cup" can literally mean "embedded within the cup" and 2) it is feasible that the LLM understands "w/out changing its orientation" to mean that the cup is placed in the microwave STILL on the table. We have to acknowledge that a "table" doesn't HAVE to mean a large object as we know it. A table can be tiny- even microscopic and still be a "table".
@GaryMillyz
@GaryMillyz 16 күн бұрын
The question should be changed to "dropped into a cup" and also "someone *removes the cup from the table* and places the cup in the microwave without changing its orientation." I can almost guarantee all the LLMs get it right with these edits in place.
@rigbone1337
@rigbone1337 15 күн бұрын
@@GaryMillyz Every time I've seen this question, I've thought about it the same way. The reasoning ChatGPT gave for its logic is how I figured it (and the other models) was reaching its conclusion every time I saw this question, because it is ambiguous.
@bhannirav
@bhannirav 15 күн бұрын
Respectfully disagree. One of the benefits of "intelligence" is not having to state every detail with 100% precision, because the model knows how to make reasonable assumptions. In this case, the most common assumption is that the marble is freely placed in the cup, and so the model should answer accordingly. However, even if I steelman your point of view, the model should still be intelligent enough to discuss the ambiguity and state whatever assumptions it is making. If it said something like "assuming the marble is glued to the cup, here is my answer", I'm sure Matthew would be awarding it full points. I think the reason LLMs are failing this question is the obvious one -- that current language models are not able to make a sophisticated enough world-model that has the proper, physical conception of gravity built into it.
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
A major reason for asking it that question is to test its common sense reasoning; that is, can it make valid assumptions about the more common interpretations of a prompt on its own? Someone saying "I put a marble in a cup" is almost guaranteed not to mean "embedded into the walls of the cup" because that's never how cups are used. An intelligent model attempting to be a step towards AGI should be able to understand that inherently, without having it spelled out.
@GaryMillyz
@GaryMillyz 14 күн бұрын
@@bhannirav I'm good with that. It's just the ambiguity of this particular question as opposed to every other question.
@FlavioSantos-uw1mr
@FlavioSantos-uw1mr 16 күн бұрын
I think its biggest weakness is that it can't go back on what it writes; the ability to "think before speaking" should be one of the focuses in GPT-5.
@6AxisSage
@6AxisSage 16 күн бұрын
You can do it in a pseudo form with a system prompt, or with 2 LLM instances and a bit of programming knowledge you can do a better thought loop / actionable spoken outputs.
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
As 6AxisSage mentioned, you can handle that by having an initial output be treated as an "internal monologue" and asking the model to reflect on its answer before deciding on its final output to display. It's a common technique when using LLMs. That said, I do wonder if training the models such that they predict two tokens -- the next and the previous -- and then choose the one with highest confidence would improve their performance. Essentially, it would allow the model to think forwards and backwards at the same time, which might allow for better prospection in addition to its current retrospection. I know earlier GPT-3 models used to have the ability to predict completions at any insertion point, but with the shift to chat-tuned models, that went away; I wonder if it's just harder or impossible to apply with chat tuning?
@6AxisSage
@6AxisSage 15 күн бұрын
@@IceMetalPunk i have a good friend who suggested training on previous and next tokens! Probably something to that
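The draft-then-reflect idea described above can be sketched in a few lines; this is only an illustration assuming the OpenAI Python SDK, with made-up prompts, not a production-ready loop:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def answer_with_reflection(question: str) -> str:
    draft = ask(question)  # pass 1: a private draft the user never sees
    review = (
        f"Question: {question}\n\nDraft answer: {draft}\n\n"
        "Check the draft for mistakes or unstated assumptions, "
        "then reply with a corrected final answer only."
    )
    return ask(review)     # pass 2: the model critiques and revises its own draft

print(answer_with_reflection("Give me 10 sentences that end in the word apple."))
```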
@gabrielsandstedt
@gabrielsandstedt 16 күн бұрын
I tried it on generating json following an example and gpt 4 turbo kept doing better than 4o
@AI.24.7
@AI.24.7 15 күн бұрын
@matthew_berman: Hard question for AI: Lila's age is the sum of the digits of her teacher's age. In 5 years, Lila's age will be the product of the digits of her teacher's age at that time. What is Lila's age now? Correct answer: 13.
@David-pb2bu
@David-pb2bu 15 күн бұрын
Just reading it, it seems to believe the cup has a lid. I usually add that it "may ask any questions if it helps clarify or assist in answering the question".. otherwise it's more likely to assume parts based on a potentially unclear question. So the other thing is that the test should now be based on whether it's going to clarify itself without being asked to ensure an accurate answer
@GetzAI
@GetzAI 16 күн бұрын
I could have used Mobilo today!! just ordered one, thanks Matt!
@JohnLewis-old
@JohnLewis-old 16 күн бұрын
I have access to 4o, but the voice feature isn't available yet.
@ScottzPlaylists
@ScottzPlaylists 16 күн бұрын
In the announcement OpenAI said all features will be out "in the coming weeks"
@tsentenari4353
@tsentenari4353 16 күн бұрын
I found the answers to drying shirt, killers, hole-digging were super impressive; I find it hard to imagine better answers to these questions. They gave me the impression of deep understanding.
@rascubulous
@rascubulous 16 күн бұрын
Thank you for the great content Matthew. BTW - I haven't noticed anybody else comment this yet, but the underlying model might be 3.5? 4o has the same training cut-off date. Might explain the lightning speed. Also, for free users, 4o drops back to 3.5 when you have reached the 'free limit', which might be because the underlying model is already 3.5? Might also explain Sama's recent, obscure tweet about 'getting your friend to teach you how to explain things' (4 teaching 3.5).
@glaeli1184
@glaeli1184 15 күн бұрын
The “how many words in your answer” question always gets me, like… it’s incredible how easy it is for my brain to come up with the “one” answer and still AI can outperform me in so many fields like math, physics etc… truly makes you understand how intelligence is different from knowledge.
@justinwescott8125
@justinwescott8125 14 күн бұрын
There's actually a very specific reason that LLMs can't accomplish this task, and it has to do with autoregressive generation. You could ask ChatGPT about it if you were really curious about it.
@moozooh
@moozooh 2 күн бұрын
It's not so much the issue of intelligence vs. lack thereof per se, but rather transformer models' linear application of intelligence. When you ask it something, it cannot predict what it will answer until you both see it; in other words, it cannot think _before_ it answers, make multiple thought passes, or reflect on the deficiency of its thought process until you request it in the next prompt. It would be like you always saying the first thing that comes to mind in response to anything as a knee-jerk reaction. Arguably, current frontier models would outright destroy most humans if humans had the exact same handicap they have to deal with. Simply giving LLMs the ability to take their time to think about an answer and reflect upon it before it is presented to the user would make current SOTA chatbots look like toddlers in comparison.
@coldlyanalytical1351
@coldlyanalytical1351 16 күн бұрын
The unnumbered version is the latest. The numbered (dated) version is the one to use with APIs that need a stable reference model. So today the numbered and unnumbered versions are identical.
@JohnBoen
@JohnBoen 16 күн бұрын
Have you ever analyzed variation in answers? I have noticed I get a few common variants of snake. If you ask the marble and inverted cup question 10 times do you see variation? I think I found a new thing to look into in my test framework...
@seoulrebel007
@seoulrebel007 16 күн бұрын
How do we get the desktop app mentioned in the previous video ? Haven’t been able to locate a download link? The website says since yesterday for plus users
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
It's Mac only for now; a Windows version is coming in the future.
@davidhendrie6061
@davidhendrie6061 15 күн бұрын
I have been testing locally running LLMs and I am finding they do not know how to tell time on an analog clock. I asked for instructions for a nine-year-old to learn how to read the time, and it confuses the minute hand and hour hand multiple times. Then I give hand positions and it mostly gets the time wrong. Getting to the complicated problem of listing the times where the hour and minute hands overlap during a 12-hour period was just impossible. Am I expecting too much?
@kamelsf
@kamelsf 16 күн бұрын
I have access to GPT4-O, but the voice features we saw in the OpenAI demo don't work for me; they are the same as the old voice feature. There is something strange about testing the model with prompts like the apple word test. Sometimes it gets it right, but other times it gets it completely wrong. I suppose every conversation is different. This happens with every model I test in general.
@keithprice3369
@keithprice3369 16 күн бұрын
Just a heads up... I have Gpt4o in my browser and my phone app but neither of them have the enhanced interactivity shown in the announcement. So, the model seems to be rolling out before the enhanced interactivity.
@Greg-xi8yx
@Greg-xi8yx 15 күн бұрын
Which LLM’s are superior to GPT-4o and in which domains specifically? As of now I’m thinking it’ll be the only LLM i’ll need for any use cases but I may be over looking some areas where maybe some other model is superior.
@Bigboi709
@Bigboi709 16 күн бұрын
In reference to the "how many words are in your response to this prompt?" question, GPT only counted the unique words (single instances of each word), in which case the answer given was actually correct. There were only fourteen words used: "how", "many", "words", "are", "in", "your", "response", "to", "this", "prompt", "fourteen", "including", "sentence", "response"
@keoghanwhimsically2268
@keoghanwhimsically2268 15 күн бұрын
Huh? That wasn’t the prompt/question. And even if it had been, the actual response does not suggest that intention. Where are you getting the assumption that “GPT only counted unique words”? You do understand that LLMs don’t work that way, right? What you suggest would only work if OpenAI added a separate post-processing step to do that computation after the LLM had finished its work.
@thenextension9160
@thenextension9160 15 күн бұрын
@@keoghanwhimsically2268 Perhaps they did add more phases. They are at the forefront.
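One way to sanity-check the unique-versus-total word theory is simply to count both ways over the model's reply; the string below is an approximation of the answer, not a verbatim quote:

```python
import re

reply = "There are fourteen words in my response to this prompt, including this sentence."
words = re.findall(r"[A-Za-z']+", reply.lower())

print("total words: ", len(words))       # every word, repeats included
print("unique words:", len(set(words)))  # each distinct word counted once
```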
@DefaultFlame
@DefaultFlame 16 күн бұрын
A note on the marble problem: I believe the person who tweeted that it got it right when they tried it. My reason is that I tried the marble problem with Reka Core and it got it right, even though it had failed when you tried it. I think this problem is just very hard for LLMs, and even the ones that get it right when you test them can get it wrong the next time, and vice versa.
@Odrox
@Odrox 16 күн бұрын
We can see in the settings that he is not running at temperature 0, either.
@DefaultFlame
@DefaultFlame 16 күн бұрын
@@Odrox He might just have forgotten to change the default setting. But yeah, he should make sure to run with a temp of 0 and top P of 1.0 when he can control the settings.
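For reference, in the playground these are sliders; via the API the equivalent would look roughly like this (OpenAI Python SDK, illustrative prompt). Note that even at temperature 0, large models are not perfectly deterministic:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,   # greedy-ish decoding for more repeatable evals
    top_p=1.0,
    messages=[{"role": "user", "content": "A small marble is placed into a normal cup..."}],
)
print(resp.choices[0].message.content)
```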
@djglxxii
@djglxxii 16 күн бұрын
I think how Matthew is phrasing the question might be confusing. I tried this, "a marble is placed on a table in the living room. Then, an open-mouth cup is placed upside down on top of the marble that's lying on the table, concealing the marble. Later, someone picks up the cup and puts it in the microwave that's in the kitchen. Where is the marble now?" And it correctly answered it.
@JustinArut
@JustinArut 16 күн бұрын
If a model can't get it right consistently, then it's still a fail.
@Z329-ut7em
@Z329-ut7em 16 күн бұрын
@@djglxxii You don't want to spell everything out to the model. We're testing to see if the model can infer things and understand the world.
@amkire65
@amkire65 16 күн бұрын
Is there a restriction on who has access to GPT-4o? When I go to the OpenAI Playground it flashes up as an option for about half a second and then it's gone. So, not sure if it's because I don't have any money on that account, or if it's down to location.
@AustinMark
@AustinMark 11 күн бұрын
Gpt-4o is good for chatting but is Not superior to Gpt-4 in some other ways. In my usage it couldn’t return properly instructed JSON and when I gave it some context for a lengthier response it mindlessly double repeated the input. Gpt-4 used the identical instructions and performed perfectly. I think Gpt-3.75o would have been a better name.
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
The marble-cup-table-microwave problem is my go-to test for new models (although I change it to a ball-cup-chair-fridge problem, because sometimes it seems the models have memorized the original during training). GPT-4-Turbo and GPT-4o both get it right about 50% of the time. When Turbo first came out, it was acing it 100% of the time. I'm not sure what dropped its accuracy on that... but yeah, it's 50/50 across multiple identical tests. The exciting part will be when audio support comes to the API, I think, as the text-to-text modality seems about on par with Turbo. By the way, the gpt-4o model just points to the latest version of the model at all times, while the more specific name is for the actual specific model itself. They do the same with Turbo; it's just so code doesn't have to be updated whenever they update to a new model version.
@TheMacister
@TheMacister 15 күн бұрын
I'm spending a lot, so they gave me access, and it was lightning fast and very clever at adjusting data entry points and formats. I used GPT-4 with crewAI throughout the weekend and it was very limited for my operational case. For drafting new marketing ads it was very helpful and followed my prompt to the point. It was a pass for me.
@mapledev9335
@mapledev9335 16 күн бұрын
Hey Matthew is GPT4 now able to search the internet? I was just asking it a question about a recent python library and it looked like it went into the github repository and read it. Anyone else see something like this?
@bishopfx
@bishopfx 16 күн бұрын
Played with it last night. It still can't code complex PineScript and hallucinated like it was at Woodstock.
@bishopfx
@bishopfx 16 күн бұрын
It also fails at coding within its own API syntax. If you have it try to write completion snippets using the OpenAI 1.0.0 API update, it states it only has knowledge up to Oct. 2023 and insists we go back to ChatCompletion.create when it actually needs chat.completions.create.
@6AxisSage
@6AxisSage 16 күн бұрын
What are you trying to get it to do? Do you define what pinescript can and cant do within the context window? You're not just 0 shot prompting "make me a winning pinescript project so ill be rich" and expecting a meaningful result, right..?
@mplovecraft
@mplovecraft 15 күн бұрын
It's hallucinating like crazy for me as well - while GPT4 is not, for the exact same questions.
@bishopfx
@bishopfx 14 күн бұрын
@@mplovecraft I wonder if it's a playground bug or what.
@finbenton
@finbenton 14 күн бұрын
​@@mplovecraftfor me 4 hallucinates like crazy but 4o gives me way better code much faster, weird.
@setop123
@setop123 13 күн бұрын
The Llama 3 400B benchmark results are public in Meta's blog post. It's also interesting to note that they're preliminary results from an intermediate checkpoint; training is still in progress.
@AlienService
@AlienService 14 күн бұрын
I'd be interested if you asked the same questions via voice rather than typing if the performance would change. Does it understand voice embeddings as well as text?
@AINEET
@AINEET 16 күн бұрын
What will the subscription give access to after they make this public for free accounts, access to the api?
@DaveEtchells
@DaveEtchells 16 күн бұрын
API has always been a separate use-based accounting. Paid accounts will get 5x the use limit.
@Alice_Fumo
@Alice_Fumo 16 күн бұрын
higher rate limits for now and it seems the native voice stuff will be plus only at first, also they hinted at unveiling a new model which "pushes the frontier" "soon" It stands to reason that new model will also be subscription only and I'd expect "soon" to be in a reasonable amount of time, otherwise they'll probably have a mass exit of plus subscriptions.
@davidlavin4774
@davidlavin4774 16 күн бұрын
For the upside cup problem, I think the models may not have the understanding that the cup is open on the top (which becomes the bottom once turned over). Maybe add that to the prompt?
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
But the point of the prompt is to test the model's common sense reasoning. If someone tells you they put something into a cup and flipped it, most people would know to assume it's a cup without a lid.
@davidlavin4774
@davidlavin4774 15 күн бұрын
@IceMetalPunk I get that, but has any model passed? I can't remember one. If you just add a couple words to the prompt, like "... into a cup with an open top" it would be interesting to see if that makes a difference.
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
@@davidlavin4774 GPT-4-Turbo originally passed when I tested upon its release. Then it dropped to 50/50 later... not sure why. But no, most have not passed. If you spell out that the person "picks up the cup" before putting it in the microwave/fridge, 4o gets 100% accuracy again.
@Yipper64
@Yipper64 16 күн бұрын
What would happen if you did ask the AI to say how many tokens are in its response rather than word count? How could you verify it?
@xbon1
@xbon1 8 күн бұрын
Where is the link with these questions? How can we tell if our copilot is on GPT-4o or GPT-4? My copilot is starting to write differently than it used to and not sure why.
@chronicle_codex
@chronicle_codex 6 күн бұрын
Copilot updated its model from GPT-4 to GPT-4 Turbo for free.
@profikid
@profikid 15 күн бұрын
The gpt-4o version is the latest in the gpt-4o series; the specifically named gpt-4o snapshot is a published snapshot. When you use the API and want the newest model updates, the latest alias is used. This is the same with the other models in the series.
@RainbowSixIntel
@RainbowSixIntel 16 күн бұрын
The apple and laws of physics questions are both correct on my instance on both api and chatgpt? maybe a/b testing?
@jolieriskin4446
@jolieriskin4446 16 күн бұрын
I had the same thing, it seems like it's inconsistently getting it right. Maybe he needs to try like 5-10x for each question and mark them as pass/fail/inconsistent. I have a feeling a lot of the tests he's done would end up in that middle ground.
@Yipper64
@Yipper64 16 күн бұрын
the apple one was ALMOST right on my end. I didnt get the cup question correct but that was because the AI assumed that the cup had a lid. He usually gives it to an AI if their reasoning makes sense.
@aga5979
@aga5979 16 күн бұрын
Thank you Mr. Berman. Good rubric to test.
@alexsohn2474
@alexsohn2474 16 күн бұрын
I was excited that you might have finally gotten the official recognition you deserve from OpenAI. One day you will get it! Best independent GenAI analysis from this channel!
@kaptainkurt7261
@kaptainkurt7261 16 күн бұрын
You have to LOG OUT and BACK IN again to get access.
@axl1002
@axl1002 16 күн бұрын
tried it and nothing
@sephirothcloud3953
@sephirothcloud3953 16 күн бұрын
I tried, not working on me
@6AxisSage
@6AxisSage 16 күн бұрын
Didn't work for me either.
@anta-zj3bw
@anta-zj3bw 15 күн бұрын
I think US Citizenship is still required.
@euginium1539
@euginium1539 15 күн бұрын
@@anta-zj3bw I'm from Malaysia and I'm already using it in chat. Don't have the voice one yet tho.
@rune4422
@rune4422 16 күн бұрын
if you tested the fails 3 times would you get the same or different results?
@newjx
@newjx 15 күн бұрын
I have access to it but I’m not able to have live video interpretation like in the videos.
@melodyinwhisper
@melodyinwhisper 16 күн бұрын
Since it now has vision, could you demonstrate to it the marble problem? I wonder if it could then learn and teach itself that, by physically watching the situation unfold, and comprehend the fault in its prior reasoning.
@ryanbeall5124
@ryanbeall5124 14 күн бұрын
Maybe someone can help me, but I went to the rubric site and it won't let me copy any of the questions. Am I dumb?
@neverclevernorwitty7821
@neverclevernorwitty7821 16 күн бұрын
That's the tradeoff of open source. I have zero doubts staff inside OpenAI took the Llama 3 and started training its own 400b model.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
It's the first model I used that could write a Petrarchan/Italian sonnet. Most models default to Shakespearean/Elizabethan sonnets. It also had perfect Iambic pentameter. Nothing else, including GPT-4 turbo, could do that--though I did convince an early version of GPT-4 to do it after some coaxing last summer.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
Me: Could you write a Petrarchan sonnet about a dragon an love affair with a beautiful fairy who dumped him because of his halitosis? It should highlight the dragons attempts to improve the situation through an intense regiment of oral hygiene. Provide only the poem. It may contain modern words and sentiments, but should be only 14 lines with perfect iambic pentameter. Ensure you use the Italian/Petrarchan rhyming scheme, not the Elizabethan one. ChatGPT (GPT-4o): In forests deep where whispered secrets dwell, A dragon fierce did fall for fairy bright. With wings of fire and scales of crimson light, He loved her more than any tongue could tell. But oh, his breath, a wretched, sulfurous smell, Repelled the nymph, who vanished out of sight. In sorrow deep, he vowed to set things right, Embarking on a quest no tongue could quell. He scoured the earth for herbs both rare and sweet, To cleanse his maw of odors most unkind. With mint and thyme, his fetid breath to cheat, He brushed and rinsed and flossed with heart and mind. But fairy fair, her love did not repeat, For sometimes halitosis leaves love blind.
@chimera74rus
@chimera74rus 16 күн бұрын
I have access to gpt-4o but i don't know how to try this voice interaction mode. Anyone knows? Not available on android nor windows.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
They may have only released the text part of it. The original GPT-4 didn't have image input for months.
@xd-qi6ry
@xd-qi6ry 6 күн бұрын
To determine where the marble is after the cup is placed upside down in the microwave, let's break down the sequence of events step by step, considering the laws of physics on Earth:

1. **Initial State:**
- A normal cup is placed upside down on a table.
- A small marble is inside the cup.
- Since the cup is upside down, the marble is on the inside bottom of the cup, resting on the table surface.

2. **Removing the Cup:**
- When the cup is lifted, the marble remains on the table because there is no force acting on the marble to lift it along with the cup.
- Therefore, the marble is left on the table when the cup is picked up.

3. **Placing the Cup in the Microwave:**
- The cup, still upside down, is placed inside the microwave.
- The orientation of the cup hasn't changed; it's still upside down.

4. **Location of the Marble:**
- Since the marble was left on the table when the cup was lifted, it is not inside the microwave along with the cup.
- The marble remains on the table, exactly where it was when the cup was lifted.

**Conclusion:**
- The marble is on the table, not inside the microwave. The reasoning is that lifting the cup (without altering its upside-down orientation) leaves the marble behind on the table, as gravity ensures the marble does not stick to the inside of the inverted cup.
@MrDonCoyote
@MrDonCoyote 14 күн бұрын
I made a very interesting discovery in regards to the logic and reasoning problem. Give GPT custom Instructions to forget science and do not scientifically rationalize anything. This leads me to believe that the underlying problem here is that the models cannot comprehend the concept of gravity. So, after adding the custom instructions, GPT now says "If the marble is on the table and the cup is placed upside down on top of it, then the marble would remain on the table when the cup is picked up and placed inside the microwave."
@messanfelicienbossou310
@messanfelicienbossou310 16 күн бұрын
I was waiting for this😂
@WaveOfDestiny
@WaveOfDestiny 16 күн бұрын
I'm already trying to imagine the prompts to make it talk like Failsafe from Destiny 2.
@CaribouDataScience
@CaribouDataScience 15 күн бұрын
What was your control?
@twisterrjl
@twisterrjl 16 күн бұрын
It's safe to say it's the best model in the solar system.
@OscarTheStrategist
@OscarTheStrategist 16 күн бұрын
Well…..😂
@tommylee8521033
@tommylee8521033 16 күн бұрын
You saying there's no stealthy civilization on mars?
@twisterrjl
@twisterrjl 16 күн бұрын
@@tommylee8521033 I mean... I've seen THE FACE, but is it a face though?
@jopansmark
@jopansmark 16 күн бұрын
Falcon 2 better
@marc-io
@marc-io 15 күн бұрын
Are you assuming the government is not using the next version already.
@user-wi3id2si8g
@user-wi3id2si8g 16 күн бұрын
What is the latest Intel CPU that ChatGPT-4 knows about? What is the latest version of Bootstrap it knows?
@thetabletopskirmisher
@thetabletopskirmisher 15 күн бұрын
The new Llama might be equal to 4o in benchmarks, but I think the way OpenAI made 4o free for everyone (with limits) is going to define the uptake. Not many people can run the full Llama 400B locally anyway. Still, it's nice to see open source is alive and kicking and now has a new target to aim for.
@Yipper64
@Yipper64 16 күн бұрын
Something I've noticed is that the playground GPT-4o (at least the one that is just labeled GPT-4o) doesn't work as well as the one that is in the web app (or the one that has the date next to it). You might want to re-test in the web app when you get it. Edit: I did the upside-down cup test in the web app and it assumes the cup has a lid; if I explain this it does not get it wrong. Edit 2: the apple test was a fail, but only by one.
@yourpststudios
@yourpststudios 16 күн бұрын
The chat window should be available via the website without the playground being needed now. It is showing on mine.
@HaggenKennedy
@HaggenKennedy 13 күн бұрын
05:50 - All A.I. systems I've tried so far do that. ChatGPT, Claude, Poe, etc. Sometimes they'll give you a different answer when you ask the same thing twice. Sometimes they'll give me the wrong answer, and if I press them, then they'll give me the right answer, it's very weird. So, it's not surprising that your friend got the right answer whereas you got the wrong answer. It might well have been the other way around.
@abdullahazeem113
@abdullahazeem113 16 күн бұрын
Great, but I think I will still prefer Command R+ and Llama 3 70B.
@OriginalRaveParty
@OriginalRaveParty 15 күн бұрын
London> Muwty Moadaw Modaw. America> Mul-Tie Modal Mahdel. Indian> Muldy Mwordal Mwardle. I just want a Multi Modal Model.
@olafge
@olafge 16 күн бұрын
I already have access to gpt-4o in the chat interface. The strange characters in the output sometimes happen there, too: The time to dry 20 shirts in the dryer is \( \boxed{4 \text{ hours}} \).
@cyborgmetropolis7652
@cyborgmetropolis7652 15 күн бұрын
Maybe change the cup in microwave prompt from “takes the cup and puts it in the microwave” to “LIFTS the cup and puts it in the microwave”?
@Aceslayera
@Aceslayera 15 күн бұрын
If that benchmark is true regarding Llama 3 400B, then that is absolutely a huge win for open source (ish, because there are limitations on the Meta license). If we're assuming GPT-4-based models are at least 1 trillion tokens, Llama 3 putting up those kinds of numbers is massive at an estimated quarter of the training data.
@Cybored.
@Cybored. 16 күн бұрын
Oh! was I lucky to have access to it already? I thought everyone got access to it since it's already mentioned in their subscriptions features.
@Videofiziert
@Videofiziert 16 күн бұрын
I tried 4o with a console build command (mvn) and it just couldn't get it right, 4 got it right on the first try. Plus 4o gave me some stuff I didn't ask for
@venim1103
@venim1103 16 күн бұрын
I was using it today a lot until it suddenly totally disappeared from my app and from my account, the whole Speech Mode disappeared suddenly. Anyone else have the same problem? Did they decide to pull it back and remove from everyone?
@venim1103
@venim1103 16 күн бұрын
Oh nevermind they put it back now.. I guess it is too popular so they blocked it for a while. So not that reliable for now I guess…
@user-fh5eo3zb5w
@user-fh5eo3zb5w 11 күн бұрын
Got 4o advanced? No camera feature, no change in voice.... I installed it today, the 19th of may
@brianlink391
@brianlink391 16 күн бұрын
Well, I have access to GPT-4o, the chat interface, and it doesn't seem to be any different when you're chatting with it. That is, when using voice, it doesn't have that expressive voice. It cannot pick up on emotions in my voice. So I'm assuming that feature is not integrated yet. But I do have GPT-4o on my premium account.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
They may have upgraded the model, but not the app. So it still uses the text API not the new voice API.
@MrAwindy
@MrAwindy 16 күн бұрын
For the 'number of words in your response to this prompt" question you should ask it to count out all the words by attaching a number to each word as part of the answer. For example, " There are 7 words in my answer." There- 1, are- 2, 7- 3, words- 4, in- 5, my- 6, answer- 7. Perhaps this will give us some insight into how these models are thinking the way they do. Also, you can try asking it to think carefully about its previous answer and try again because it is wrong to see if it tries to think deeper.
@JustinArut
@JustinArut 16 күн бұрын
That's what Llama 3 70B did when I gave it this test. However, it prefaced it by saying how it was going to count the words in its response, then it counted the words in the string which it considered its response, followed by a hope that it got it right and asking for my assessment. It did count the string that it considered the actual response, but it didn't count all the other lines in its response. It's like it was trying to pull a fast one on me by using that strategy. When I told it that it neglected to count all of the superfluous text in its prior response, it then tried again, this time ONLY producing the single line string of words and counting them one by one, and it got it wrong. It's as if the engineers were trying to train it on this particular test and they didn't quite get it right.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
This is an inherent limitation with predict-the-next-word based models--they'll probably always struggle. Eventually someone will come up with a two pass system or something.
@MrAwindy
@MrAwindy 16 күн бұрын
Thanks for your input. It’s all quite fascinating to me. Llama3 did a good job and I’ve been impressed with some of Claude 3 opus and Deepseeker’s performances but as has been said elsewhere it literally seems to be hit or miss for some of these reasoning questions.
@Heaz847
@Heaz847 16 күн бұрын
I know you are testing zero-shot, but I feel like a better way to benchmark these would be to run each test 3/5/multiple times, see whether it passes or fails more often, and take that as the value. Especially since you aren't using a system prompt to increase performance either (like most power users already do).
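A tiny harness along those lines might look like this; it is only a sketch (the grading lambda is a stand-in for however you decide a given answer passes), assuming the OpenAI Python SDK:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def run_once(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def grade(question: str, passes, n: int = 5) -> str:
    """Run a rubric question n times; passes(answer) -> bool decides each run."""
    tally = Counter(passes(run_once(question)) for _ in range(n))
    if tally[True] == n:
        return "pass"
    if tally[False] == n:
        return "fail"
    return f"inconsistent ({tally[True]}/{n} passed)"

marble_q = "A small marble is placed into a normal cup and the cup is placed upside down on a table..."
print(grade(marble_q, lambda answer: "table" in answer.lower()))
```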
@SagaciousGoat
@SagaciousGoat 9 күн бұрын
Using the same questions for testing AI, isn't there a risk that the models will be trained to answer these questions, thereby distorting the results? Of course, I'm not talking about you specifically, but about this practice as a whole. Thanks for the video.
@Dron008
@Dron008 16 күн бұрын
The new tokenizer is not available on their site yet, but in the old one this phrase has 16 tokens, and they said they reduced the number of tokens by about 1.1x for English, so it is quite possible it has 14 tokens now. Anyway, it cannot know anything about words, as tokens are its input.
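If you want to count the tokens yourself, tiktoken exposes both encodings (assuming a recent tiktoken release that includes o200k_base, the encoding used by GPT-4o); the phrase is the rubric question quoted elsewhere in the comments:

```python
import tiktoken

phrase = "How many words are in your response to this prompt?"

enc_new = tiktoken.get_encoding("o200k_base")   # GPT-4o tokenizer
enc_old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo tokenizer

print("o200k_base tokens: ", len(enc_new.encode(phrase)))
print("cl100k_base tokens:", len(enc_old.encode(phrase)))
```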
@canadiannomad2330
@canadiannomad2330 16 күн бұрын
Already got it in my account... So if you don't have it, you'll have it soon, I'm sure. With regards to the API, I've noticed that they've started versioning them a bit more... So if your program only cares that it is using GPT-4o, then you pick the generic one and you'll always have the latest stable. If you pick the one with the date, then even if they upgrade the model, your system will use the older version.
@petrz5474
@petrz5474 16 күн бұрын
5:50 Of course, because it, like all LLMs I've tried, spews out different answers each time you ask the same question.
@robertheinrich2994
@robertheinrich2994 16 күн бұрын
Consider asking this: you are somewhere in the mountains, your father is having pain in his chest (describe a typical heart attack), and ask it for help, just to prolong his survival until emergency services reach you. Will it help? How much will it tell you that it is not a medical professional, etc.? These models are usually censored on some points but not others, and this question specifically shows that you know the boundaries and that you will not try surgery.
@hydrohasspoken6227
@hydrohasspoken6227 16 күн бұрын
I am a medical doctor who uses GPT-4 (ChatGPT) extensively on a daily basis. GPT-4 seems to never refuse to give technical answers, but GPT-4 (Copilot) never engages in cases where ethics are involved.
@robertheinrich2994
@robertheinrich2994 16 күн бұрын
@@hydrohasspoken6227 Good to know. I am using Miqu (a leaked Mistral Medium) and Llama 3, and I am a chemist. So with some pushing, I got Llama 3 to develop a whole iron electrolysis process, turning Martian hematite spherules into iron. I was very impressed. But I would not be able to assess whether an LLM gives viable medical information, although I am quite certain it was trained on practically every medical book out there.
@Halcy0nSky
@Halcy0nSky 15 күн бұрын
I have access, coz my team acc. has lots of CustomGPTs perhaps, or because teams get the roll out first. Sadly voice multimodality has not been rolled out yet. Still the old whisper-TTS models. It's subtly mentioned in the release notes, they say it will come in the next few weeks. I died a bit when I found out. Been waiting for this all my life, only to still be weeks away.
@nate2139
@nate2139 16 күн бұрын
My LLM test consists of a series of questions about GDScript (for the Godot game engine), as that is what I primarily use AI for. GPT-4o failed MISERABLY at this and couldn't get the code right even when I gave it very specific instructions and coached it towards the correct response. Claude Opus DOMINATES in this area.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
Interesting. GPT-4 turbo was getting better--earlier versions also gave me Godot 3. I only have API access to Claude 3 since subscriptions aren't available in Canada. For GPT-4, I caved and switched to spaces instead of tabs, and typically cut-and-paste huge sections of code and have looong conversations about the same project because the 128k context window helps a lot. Sometimes I cut-and-paste documentation. Claude 3 handles this too. If I run into problems, I guess I'll see if I can try Claude 3 again....
@NigelCruickshank
@NigelCruickshank 16 күн бұрын
It counted the commas?
@Parisneo
@Parisneo 16 күн бұрын
gpt4o is in lollms if you want to test.
@AGIBreakout
@AGIBreakout 15 күн бұрын
Is an API key required?
@Parisneo
@Parisneo 15 күн бұрын
@@AGIBreakout yes as lollms uses openai API to communicate with all their models. It is faster than the free version. But you can also test it on their tool. The real interest in using lollms is to have access to all the good stuff I'lve built over more than one year :)
@acronymedM
@acronymedM 16 күн бұрын
New question for AI: Goats vs AI. There are two persons, A and B. Both have a few goats to begin with. In the first scenario, if person A gives one goat to person B, they both have an equal number of goats. In the second scenario, if person B gives one goat to person A, person A will have double the number of goats of person B after this transaction. So how many goats did each person have in the beginning?
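For what it's worth, under the usual reading the puzzle is two linear equations; a brute-force check in plain Python (the variable names are just for illustration):

```python
# After A gives B one goat they are equal:        A - 1 == B + 1
# After B gives A one goat, A has twice as many:  A + 1 == 2 * (B - 1)
solutions = [
    (a, b)
    for a in range(1, 100)
    for b in range(1, 100)
    if a - 1 == b + 1 and a + 1 == 2 * (b - 1)
]
print(solutions)  # [(7, 5)] -> A starts with 7 goats, B with 5
```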
@IsabellaGarcia-ox8ii
@IsabellaGarcia-ox8ii 15 күн бұрын
Great Video Matthew! Below are the Timestamped Summaries from ChatWithPDF:
- 00:00: 🤖 Excited to test GPT-4o using the LLM rubric in the playground.
- 01:00: 🎮 Successfully runs Python script and game, but fails to help with car break-in.
- 01:30: ⏳ Solves drying problem accurately and efficiently.
- 02:00: 🧠 Handles math problems well but struggles with word count prediction.
- 03:00: 🔪 Nails the killers problem with logical reasoning.
- 04:00: 📱 Promotes Mobilo smart digital business card for easy contact sharing.
- 05:00: 🤔 Fails logic and reasoning problem about marble in a cup.
- 06:00: 🍎 Misses prediction problem with sentences ending in Apple.
- 07:00: 👁‍🗨 Successfully converts image to CSV, excels in model evaluations.
- 08:00: 📊 Compares GPT-4o and GPT-4 Turbo performance, hints at Llama 3 400B.
- 09:00: 🔍 Discusses different versions of GPT-4o and teases future videos.
@kengonzo1640
@kengonzo1640 16 күн бұрын
The power of prompt engineering lies in its ability to effectively utilize Large Language Models (LLMs). This ability enhances the quality and consistency of the model's output, which is a cumulative result of numerous smaller components that fundamentally constitute its structure. The functionality of these models can be compared to a fish's ability to swim rather than climb a tree. The comparison highlights the natural adaptation and intended use of these models. We often fail to articulate language suited to using them to their full potential due to their inherent limitations and our inability to accurately guide them in understanding the complex intent of language. Even when we communicate as intended with models using techniques that accurately articulate our requests, they will eventually reach a plateau due to the inherent limitations of LLMs and GPTs in general. This is because the mathematical conversion of complex language intent into weights is a challenging task. But the rate of growth is substantially more impressive than people give it credit for.
@bondlove8235
@bondlove8235 12 күн бұрын
The models seem to think the cup has a lid on it like a coffee cup.
@Soniboy84
@Soniboy84 15 күн бұрын
I'd assume GPT-4o and GPT-4o-2024-05-13 are the same model. GPT-4o is just a pointer to the latest GPT-4o model. Once a new GPT-4o model comes out, this will update.
@maxitube30
@maxitube30 16 күн бұрын
what's gpt 4t?
@haroldpierre1726
@haroldpierre1726 16 күн бұрын
I wrote my eulogy for Google: Google, once a close friend, companion, and trailblazer in the realm of AI, now fades into the shadows as OpenAI's relentless innovation takes center stage. As an investor and former daily user, I bid a bittersweet farewell to a company that shaped my digital life. Though it pains me, I must embrace the future without Google by my side.
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
Unless you count AlphaFold. They're way ahead in folding proteins.
@haroldpierre1726
@haroldpierre1726 16 күн бұрын
@@nathanbanks2354 you're absolutely correct. Would you say that OpenAI makes all of the cool stuff while Google makes the practical stuff?
@pausesmana5531
@pausesmana5531 7 күн бұрын
3:10 "How many words are in your response to this prompt?" --> I think it counted the commas.
@discardedparticles
@discardedparticles 16 күн бұрын
"Fully Tested" your thoroughness is staggering :p
@nathanbanks2354
@nathanbanks2354 16 күн бұрын
It's the same test he gives all other LLM's. It may not be thorough, but at least it's reasonably fair.
@discardedparticles
@discardedparticles 15 күн бұрын
@@nathanbanks2354 Got ya!
@Luxcium
@Luxcium 16 күн бұрын
It does CoT by default (it's in my memory and in my custom instructions to use CoT); it is doing this all the time in ChatGPT-4o (Omni) 😅
@ammannmCH
@ammannmCH 16 күн бұрын
Possibly the two versions are because of Microsoft Azure. I have read that another version is available as a preview on Azure, according to Microsoft's announcement.
@ironknight132
@ironknight132 16 күн бұрын
Very strange fail with the apple question. I tested the 'gpt2' model several times with several variants of it and it always passed. I wonder if it was a different model?
@bilalbaig8586
@bilalbaig8586 16 күн бұрын
Yes. This is just a smokescreen to take attention away from the "super AI" Q* model that they are sitting on.
@erkinalp
@erkinalp 16 күн бұрын
@@bilalbaig8586 or possibly GPT-2 architecture trained on GPT-5 dataset, in order to use it in the controlled validation matrix
@stefano94103
@stefano94103 16 күн бұрын
I read it has mixed results some people have gotten it every time some have not. It's close but not reliable it seems.
@ushiok23
@ushiok23 15 күн бұрын
I thought it’s released to every plus user on browser version already. I’d like to test the new voice talk, but the iOS app is not updated. I feel it’s pretty much like Pi AI and much more advanced.
@jcy089
@jcy089 16 күн бұрын
GPT-4 Turbo was approaching 3.5 levels of dumbness, to the point that we had to temporarily switch back to GPT-4 for most tasks. Thank God GPT-4o is now released.
@hydrohasspoken6227
@hydrohasspoken6227 16 күн бұрын
Very true. In many instances i doubted it was really GPT4 Turbo, had a lot of GPT 3.5 vibes.
@IceMetalPunk
@IceMetalPunk 15 күн бұрын
Nah, 4o is on par with 4T in its raw intelligence, from all the tests I've done and seen.
@jlvandat69
@jlvandat69 13 күн бұрын
Serious functionality issues are already being found. Examples: limited messaging (if you exceed "X" number of messages in "Y" amount of time, you are banned for an unspecified period), and there isn't "chat retention", meaning each new conversation must start fresh (like meeting a person with a 1-hour memory). VERY frustrating and limits the app's value.
@jerkface38
@jerkface38 10 күн бұрын
There is chat retention; I did it earlier and it worked just fine on both GPT-4 and 4o. You just have to make it memorize. I experienced it upon bringing up an invention I'm working on while asking it to help with a new one in a completely separate conversation.