Grok-1 FULLY TESTED - Fascinating Results!

Views: 165,495

1 day ago

Let's test Grok using our LLM rubric! How does it compare to other models?
Links:
Blog Announcement - x.ai/blog/grok-os
LLM Leaderboard - bit.ly/3qHV0X7
Chapters:
0:00 - Intro
0:51 - Testing

Comments: 673

@labradore99 3 months ago

What the heck? Wasn't the point of the shirts drying example that the reasoning should figure out the possibility of drying 20 shirts in parallel and that it would still take 5 hours?

@JohnKerbaugh 3 months ago

I'm gonna fail the host on that one. 😂

@mickelodiansurname9578 3 months ago

I'm thinking about this a bit and I think the prompt is unfair... imagine it simply answered "It will take the same time!" and in fact what was needed was how long they would take sequentially cos there is only one place to dry them? just saying, it wasn't stated in the prompt that you had the ability to lay them all out at once, that this was possible

@TheSnakecarver 3 months ago

It would take 4 hours, but who counts.

@christopherchilton-smith6482 3 months ago

@mickelodiansurname9578 You're correct. LLMs in my experience are severely constrained by logic. You can't depend on them to infer consistently. If you make it clear in the prompt that the 5 shirts are all lying outside simultaneously, it will likely answer correctly. I don't use X and can't stand Elon, so I'll have to wait for a quantized version to see what's up with this thing. Still, I have yet to use an LLM that can code as consistently well as ChatGPT. Keeping my fingers crossed for a comparable local model soon, though I haven't tried Devin or autodev yet.

@LeonardLay 3 months ago

He always says that it's a pass if the LLM solves it sequentially or in parallel, as long as it explains the reasoning, which it did.

@MuckoMan 3 months ago

I'm just a truck driver but I find this stuff amazing. I love everything open source even if it's not working perfectly. Yet!

@Gatrehs 3 months ago

"Just" a truck driver. Ridiculous. You provide an essential service to make the lives of a lot of people better and more comfortable.

@jondeik 3 months ago

I drove for UPS for 7 years. I have more respect for tractor trailer drivers than anyone else, and I didn’t even do it. I drove the box trucks (they call them package cars). Also, your job or career has nothing to do with how your brain works. Smart people work regular jobs

@spirosch5276 3 months ago

"Just a truck driver?" LMAO, man. I have an undergrad and a postgrad degree, both framed and hanging in my bathroom. I hold a good position in a great company, yet I couldn't, in a million years, do the job you're doing, mate. Believe me, your contribution to society is greater than mine.

@ChrisOrillia 3 months ago

you’re not *just* a truck driver 🙄… we’re curious people, man

@harshans7712 3 months ago

Man!! "Just a truck driver"? Literally, without your service nothing runs in this world, man. All of us citizens are really thankful for your wonderful service, and we all get things on time just because of it.

@lachland592 3 months ago

I don’t think this is equivalent to testing the open source Grok. The open version is apparently not fine tuned at all, but the version available through Twitter is clearly fine-tuned for chat.

@thearchitect5405 3 months ago

Not to mention the tools set up allowing Grok to search Twitter for answers to these questions.

@cbnewham5633 3 months ago

The problem I see is that as you keep repeating the same test the likelihood is that the model will have been trained on the test as it will have been seen somewhere on-line. You need to change the tests and run them against all the LLMs available at the time so as to compare them, otherwise later models have a distinct advantage.

@reinerheiner1148 3 months ago

That's true, but this is more of an entertainment channel. Don't take it too seriously. To really test an LLM, those tests would be wildly insufficient anyway.

@thearchitect5405 3 months ago

Grok doesn't even need to be trained on it, Grok has internet access, rendering this test meaningless.

@HUEHUEUHEPony 3 months ago

soy testing, waste of time

@dianagentu7478 3 months ago

@reinerheiner1148 Yes, can you point me in the direction of real tests, please? Also, all LLMs, regardless of company, have the same "favorite words". I can't find any literature on why. Any ideas where I can look? I hope this has already been covered somewhere.

@mickelodiansurname9578 3 months ago

There's not a lot we can do about that. If the tests are changed now, then the previous ones are not comparable, and if we do not change the questions then yes, they will be in someone's training data... What Matt needs, but has not built, is an automated and private set of questions (maybe 1000 questions in a CSV file) that he releases to nobody and that gives a score. Other than that, the results here will be highly subjective and not worth considering as a reliable measure.
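A minimal sketch of the private, automated question set suggested above, in Python. Everything here is an assumption for illustration: the `question` and `expected` column names, the substring-match scoring rule, and the `fake_model` stand-in for a real model call. A real harness would need a stricter grader than substring matching.

```python
import csv
import io


def score_model(ask, rows):
    """Return the fraction of questions whose reply contains the expected text."""
    passed = 0
    total = 0
    for row in rows:
        total += 1
        reply = ask(row["question"])
        # Hypothetical grading rule: case-insensitive substring match.
        if row["expected"].strip().lower() in reply.lower():
            passed += 1
    return passed / total if total else 0.0


# Hypothetical two-question file; a real one would be private and far larger.
SAMPLE_CSV = """question,expected
What is 2 + 2?,4
What is the capital of France?,Paris
"""


def fake_model(question):
    # Stand-in for a real model call; it always gives the same reply.
    return "The answer is 4."


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(score_model(fake_model, rows))
```

Because the file never leaves the tester's machine, the questions cannot end up in training data, and the same score can be recomputed for every new model.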

@SODKGB 3 months ago

Great job, really enjoyed this fast Grok overview!

@tony8k 3 months ago

Same

@executivelifehacks6747 3 months ago

Thanks Matt, we were all wanting to see the results, cheers!

@igorip2005 3 months ago

Cool, thanks for testing, I waited for it

@MxAi955 3 months ago

It will really be cool to have a score count on the screen as you test the models, that will help with tracking the scores.

@Pietro-Caroleo-29 3 months ago

Thank you for today's presentation, Mr Berman.

@CYBONIX 3 months ago

Matt, I consume a lot, and I mean a lot of "AI" content, podcast, etc.... You are by far one of my favorites. What you do here isn't easy, but you do it very well. Thank you~!

@SVisionary 1 month ago

Appreciate this comparison. Thank you.

@Xardasflynn657 3 months ago

Isn't the answer for the killers question 4? The one that got killed is still in the room, so there are 4

@jackflash6377 3 months ago

Correct logically. Just because they died doesn't mean they left the room. They just changed state from "live" to "dead".
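The two readings of the killers question can be sketched as a tiny state model in Python. The setup (three killers in a room, one newcomer who kills one of them, nobody leaves) is assumed from the discussion here rather than quoted from the video's prompt, and the `Person` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    is_killer: bool
    alive: bool = True


# Assumed setup: three killers in the room, then one newcomer walks in.
room = [Person(f"killer_{i}", is_killer=True) for i in range(1, 4)]
newcomer = Person("newcomer", is_killer=False)
room.append(newcomer)

room[0].alive = False      # the victim changes state but stays in the room
newcomer.is_killer = True  # the newcomer has now killed someone

killers_in_room = sum(p.is_killer for p in room)
living_killers = sum(p.is_killer and p.alive for p in room)
print(killers_in_room, living_killers)
```

Counting every killer in the room gives 4; counting only the living ones gives 3. Which number is "correct" depends on whether a dead killer still counts, which is exactly the ambiguity being argued about.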

@YbisZX 3 months ago

@jackflash6377 Maybe a killer is someone who can kill. A dead man can't kill, so formally he is not a killer anymore.

@Xardasflynn657 3 months ago

@YbisZX Logically, he's a dead killer, so a killer nonetheless. There's a logical mistake in the original answer.

@jackflash6377 3 months ago

@YbisZX Good point, but he WAS a killer and still IS a killer, just a dead killer. "Killer" can be past, present and future tense. He was, is and could be a killer. Nowhere in the prompt did it say they have to be alive.

@jammin023 3 months ago

Yeah, another thing (like the shirt problem) that this guy thinks is a pass but is actually a fail. Would help if he understood what the correct answers are.

@mackblack5153 3 months ago

Love these tests!

@eddieb8615 3 months ago

Always the best content!

@Sofian375 3 months ago

How is the question about the shirts a pass?

@raoultesla2292 3 months ago

If you wear a SpaceX t-shirt to your NeuraLink interview you will move up for implant installation.

@calysagora3615 3 months ago

It was not. He was blatantly wrong on that one.

@SlyNine 3 months ago

@calysagora3615 You're blatantly wrong. A serial drying time is accepted as long as it states that's what it's doing. That means one at a time. Try to keep up.

@funkahontas 3 months ago

Because Matt is a clear Elon fanboy.

@Michael-ul7kv 3 months ago

Based on the information provided, it's the most correct answer, and it explained how it got to the answer. Parallel drying relies on many assumptions; if it chose that answer, it would need to add the caveat that it would only be correct given more information.

@DiscOutpost 2 months ago

This was very cool. I subscribed. For a couple of the Grok fails, I could argue that wording played a role in the failure. But thank you!

@dsuess 3 months ago

In regards to the Snake game, I've asked Google's Gemini a similar simple task and its code wouldn't even build. At least the snake moved.

@compton8301 3 months ago

Gemini Advanced or the free one?

@jasonshere 3 months ago

All of Google's AIs are extremely limited.

@DavesNotHereRightNow 3 months ago

Are you using Grok 1.0? I thought the one on X was now 1.5?

@JacoduPlooy12134 3 months ago

Nice! Maybe at the end you can show the "leaderboard" so we can see how it stacks up against the others on perhaps a spreadsheet?

@usethatherb4913 3 months ago

What GPU power would you have needed to run it?

@FranciscoLopes_Bthere 3 months ago

I have an i9 10900K, 64 GB of RAM and a 3090, and I would like to give Grok a try. How can I set it up? Is there a smaller version that can be used without a datacenter?

@PseudoProphet 3 months ago

4:39 there actually are 12 tokens in that response, not 12 words . 😅😅

@matthew_berman 3 months ago

how do you know?

@PseudoProphet 3 months ago

@matthew_berman I counted: all the words, the 2 digits, and 1 period. Total: 12.

@mickelodiansurname9578 3 months ago

Pure luck... and the answer, by the way, the one that wins, is the model answering ONE as one word every time.

@elwyn14 3 months ago

@PseudoProphet Download the encoder and do it again to be sure 😁

@bluemodize7718 3 months ago

@PseudoProphet A word doesn't count as a token; most words consist of 2 tokens. I don't know the formula for determining tokens, but OpenAI has a website that shows you the number of tokens.

@Researcher100 3 months ago

What's the link to the interface you are using to test the model?

@researchforumonline 3 months ago

Thanks shared!

@carlos_mann 3 months ago

When running one of these locally, can you provide it instructions to create an app that you specify what you want for it to do? Or can you have it search through thousands of personal files, learn, add the data, and be used as a search engine throughout your database instead of using the local search engines, (ie file explorer).

@nqnam12345 3 months ago

thanks Matt!

@dwirtz0116 3 months ago

Great video. 👍 I was thinking of a test you could add which I didn't see you touch on. That would be recall to text previously spoken about in the same session. It seems horrible at that. Also I tried to get it to answer in lists of 10 and it couldn't repeatedly do it... 🤔🤓

@efexzium 3 months ago

When can we get the hardware for it?

@u.v.s.5583 3 months ago

- Write a sentence that ends with apple. - The court of wizards sentences wizard X to death by turning him into an apple and leaving in this state for the rest of eternity.

@alby13 3 months ago

What if you say that you are not looking for the exact amount or the calculated amount as the answer so that it might have a chance to answer you with what you are looking for?

@Rich28448 3 months ago

Can you see whether having now done the test, if you did the same test again whether you get the same result? ie did it learn in real time?

@MrSuntask 3 months ago

Matthew, no one on this planet dries T-shirts serially over 16 hours, one after the other. I don't think it is a pass if it divides the drying time by the number of T-shirts, because the main reason for this test is to check whether the LLM has an understanding of the real world. BTW, a third answer (somewhat more realistic, but still uncommon) would be to dry in batches. If your drying space can only handle 5 shirts at once, it would be four batches. Great test though! Thank you.
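The three readings of the shirts question (parallel, batches, proportional) can be compared with a short Python sketch. The numbers are assumptions pieced together from this thread (5 shirts taking 4 hours, 20 shirts to dry), and `drying_time` is a hypothetical helper, not anything shown in the video.

```python
import math


def drying_time(shirts, hours_per_batch, capacity=None):
    """Hours to dry `shirts` when `capacity` shirts fit in the sun at once.

    capacity=None means unlimited space, so everything dries in parallel.
    """
    if capacity is None:
        capacity = shirts
    batches = math.ceil(shirts / capacity)
    return batches * hours_per_batch


parallel = drying_time(20, 4)              # all 20 laid out at once
batched = drying_time(20, 4, capacity=5)   # four batches of five
proportional = 20 * (4 / 5)                # the "hours per shirt" reading
print(parallel, batched, proportional)
```

With unlimited space the answer stays at 4 hours. The batch reading and the proportional reading both land on 16 hours, but only the batch reading explains why: the limit is drying space, not a per-shirt drying rate.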

@CuriousCattery 3 months ago

This is GPT-4's answer: The time it takes for shirts to dry outside in the sun does not increase with the number of shirts, assuming they all receive adequate sunlight and air flow simultaneously. If 4 shirts take 5 hours to dry, then 40 shirts would also take 5 hours to dry, provided that they are spread out in a way that doesn't hinder their exposure to the sun and air.

@OurSpaceshipEarth 3 months ago

We know that they don't understand anything. It just guesses the next token. So far.

@mickelodiansurname9578 3 months ago

People who work in a laundry do this.... Ten or 20 dry today then tomorrow another 10 or 20...and so on.... Cos you cannot dry shirts that have not yet been washed. Just sayin...

@mickelodiansurname9578 3 months ago

@OurSpaceshipEarth Well, if that's true, tens of thousands of data scientists... some of the brightest folks on earth... are wrong. Plus Devin and Autogen and CrewAI don't work. However, first, these systems do actually work, so that would need to be explained without invoking reason. And secondly, I find it relatively common for people to point at scientists saying 'what would they know?' Can I point out how unlikely that is?

@thomassynths 3 months ago

The digging rate could very well remain constant for a fairly deep hole, provided some sort of conveyor system.

@timeless3d858 3 months ago

Yeah, they could develop a system to not get in each other's way, or, as usually happens when working collectively, they could be efficient and delegate tasks so that they finish at an even faster rate.

@bigglyguy8429 3 months ago

It could just presume a trench. Expecting the AI to answer "The same, you idiot, cos there's not enough room for the others to dig" is asking a bit much, really?

@gaminginfrench 3 months ago

Is there a smaller open source model I can get to read French books and summarise chapters? Or fix errors I make when writing in French?

@blayneallan 3 months ago

Mistral 7B probably given that Mistral is a French company.

@ff_ani 3 months ago

Maybe soon we will see an amazing collaboration of two technologies: fast Groq and huge Grok, and it will work really cool and fast!🎉

@cjhmdm 3 months ago

as for the "there are 12 words in my response to this prompt"... is it possible the AI is also counting "Grok" and "@grok" since both are technically part of the response? That would definitely make it 12 words if so.

@apache937 3 months ago

no, that's not really how the LLM sees it internally

@kliersheed 3 months ago

@apache937 How would you know? Maybe it learned that from some nitpicking smartass on Twitter? It could be an emergent ability. OP would have had to ask for its reasoning; sadly he didn't.

@apache937 3 months ago

@kliersheed Google "chat format oobabooga", click the first link, then read all of it.

@thearchitect5405 3 months ago

With the way these models work, they cannot plan out a sentence beforehand like a human can because they lack any form of internal dialogue, they'd need to plan it in text, which would count as their response. The only correct answer they can give is "one", otherwise the model just cannot physically do it. I'm surprised Grok didn't find the answer online considering it has internet access unlike other models tested with these questions.

@Action2me 3 months ago

What does quantized mean?

@televerket 3 months ago

Logic question: don't you need to remove the 4th, dead killer from the room to be left with 3? Is this type of logic why we end up with too many killers/paperclips?

@thesagaofblitz 3 months ago

If you pay the $22 to access premium X, will that grant access to Grok?

@TheOpacue 3 months ago

What's "quantizing", or however it's spelled?

@dimtool4183 3 months ago

There are still 4 killers in the room, just one of them is dead. It doesn't ask how many ALIVE killers are in the room.

@frankjohannessen6383 3 months ago

The digging question is a bit silly. What are we digging in? Sand: A wider hole can utilize more people better. Clay: you might dig a very narrow hole, but then you won't be able to throw the clay out easily, which means more people might be extra valuable since you can hand them the clay, instead of having to crawl in and out of the hole. Are we digging as fast as possible? Then more people can rotate which can easier utilize more people. What tools do we use? Shovels? excavators? just our bare hands? With excavators then just one is probably as fast as five. Depending on all the potential factors here one person can work more, less or about the same level of efficiency as five people.

@johndallara3257 3 months ago

Marble in cup, then calling it a ball, adds an extra layer of reasoning: equating ball and marble. Very good tests and great info, thanks. Would you say the Grok that comes with your paid X account justifies the price on its own?

@cacogenicist 3 months ago

*In case anyone was wondering how Grok compares to Claude 3 Sonnet in that spatial/physical reasoning test, Claude nails it, although my prompt was somewhat more precisely worded and less ambiguous. Even the mid-tier Claude 3 is pretty freaking great:* To reason through this step-by-step: 1) A person places a coffee cup upright on a table. 2) They drop a marble into the cup, so the marble is now inside the cup. 3) They quickly turn the cup upside down on the table. This means the open end of the cup is now facing down towards the table surface. 4) When the cup was turned over, the marble would have fallen out of the cup and onto the table surface, since there is nothing to contain it inside the upside-down cup. 5) The person then picks up the upside-down cup, without changing its orientation. So the cup remains upside-down. 6) They place the upside-down cup into the microwave. Therefore, based on the sequence of steps, the marble would have fallen out of the cup when it was turned over, and would be left behind on the table surface. The marble is not in the microwave with the upside-down cup.

@johnrule1607 3 months ago

There would be *FOUR* in the room. No one left or was removed. (Every time I posted this comment specifying WHO was left in the room, YouTube deleted it.)

@YbisZX 3 months ago

A dead man can't kill, so formally he (it/the body) isn't a killer anymore.

@fz1576 3 months ago

Ideally, the answer would include a mention of this ambiguity.

@hypercoder-gaming 3 months ago

Yes that's another answer however it's also correct to say 3 because you usually mean there are 3 living people, not 4 total people. Also it's questionable if a dead person is actually a person because over time they rot. At what point do they go from being a dead person to just a pile of organic material? The only logical decision is to make them not be a person from death and be a pile of organic material, although this is delving into philosophy.

@televerket 3 months ago

This is how we end up with too many paperclips....

@johnrule1607 3 months ago

@fz1576 Agreed! 👍

@kristoferkrus 3 months ago

How does it work for an open source model to have access to real-time information from X/Twitter? If you run it on your own (super) computer, how is it able to pull information from X, which other models cannot? How does X know that it's the Grok model requesting the information?

@thearchitect5405 3 months ago

The open source model won't have internet access in the same manner Grok does on Twitter.

@kristoferkrus 3 months ago

@thearchitect5405 Okay, well that makes sense if it's the case.

@kalvinflowers6178 1 month ago

The Messenger AI got the apple question correct. Here are 10 sentences that end with the word "apple": 1. The juicy red fruit was a delicious apple. 2. The teacher gave the student a shiny red apple. 3. The farmer carefully picked the ripe apple. 4. The crunchy snack was a fresh green apple. 5. The child eagerly bit into the sweet apple. 6. The chef added a slice of Granny Smith apple. 7. The healthy lunch included a crisp green apple. 8. The orchard grew a variety of tasty apple. 9. The scientist studied the genetics of the Honeycrisp apple. 10. The happy kid ran home with a big red apple.
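A quick way to grade the "sentences that end with apple" test mechanically is sketched below in Python. The `ends_with_word` helper and the third sample sentence are hypothetical; the first two sentences are taken from the list above.

```python
import re


def ends_with_word(sentence, word):
    """True if the last word of the sentence, ignoring punctuation, equals `word`."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return bool(words) and words[-1] == word.lower()


sentences = [
    "The juicy red fruit was a delicious apple.",
    "The orchard grew a variety of tasty apple.",
    "I like apples.",  # hypothetical near miss: plural, so it should fail
]
results = [ends_with_word(s, "apple") for s in sentences]
print(results)
```

Note that this only checks the final word. The second sentence passes even though it is ungrammatical, so a check like this measures instruction following, not writing quality.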

@xsa-tube 3 months ago

In the place where you ask about word count, the 2 invisible ones are and tags, I think so

@OurSpaceshipEarth 3 months ago

That's amazingly interesting, he really needs to know that, eh! I think the only correct answer would ever be simply "one."

@scottcastle9119 3 months ago

I've got grok all year and I really like it

@attilakovacs6496 3 months ago

Seeing a pass on the shirts problem was SHOCKING.

@CM-zl2jw 3 months ago

😂 can we make that a meme here too?

@Candyapplebone 3 months ago

Dude that hoodie is dope what brand is it?

@gilbertb99 3 months ago

How do you know that whatever is deployed on twitter is not a newer or better model than grok-1?

@SlyNine 3 months ago

How do you know there's not a purple unicorn in your room that you can't see or feel?

@apache937 3 months ago

Or quantized in itself.

@HAL-zl1lg 3 months ago

@SlyNine X uses 1.5. 1.0 was the version that was released.

@ingenierofelipeurreg 3 months ago

Pls share link for test it

@lemoniscate 3 months ago

2:49 Man, out of everything, I did not expect to find out about the 3-week One Piece hiatus through this video.

@user-yp9cd3zo2p 3 months ago

Regarding the 10 sentences that end in apple question: maybe add "each". It gave 10 sentences and the last one ended in apple(s), so it could be seen as pretty close.

@efausett 3 months ago

I just tried GPT4 and it passed this test with either version of the wording.

@apache937 3 months ago

I don't like this suggestion; the models should be able to do what you want regardless of errors in the prompt.

@jessiejanson1528 3 months ago

They should do what you tell them or ask for clarification.

@kevinwells768 3 months ago

Nice. Would have been useful to see a leader chart of which models pass which test.

@lun321 3 months ago

Is it possible that Grok counted the period and 'twelve' (not 12 since it was already accounted for) as words, equalling 12?

@Josh-bq6rm 3 months ago

So I can use it to finally do my Calculus problems?

@Eagleizer 3 months ago

Ball in cup: you did not specify that the cup had no lid.

@kliersheed 3 months ago

Yup. It also considered it a "container", which is most often something that's closed to contain whatever is inside. I don't think its reasoning was wrong; it just didn't know what type of cup we were talking about, or that it had an open top. He would have had to specify that to make sure it's the reasoning that's the problem and not the knowledge or the prompt. The same goes for many of his other questions. If he simply asks without specifying that this is about how a human would do it in the real world with human limitations, and the AI treats the question as a purely logical problem and gives statements true to that context, you can't really blame it on bad "reasoning".

@timeless3d858 3 months ago

that's like saying you didn't specify the man wasn't wearing a tophat.

@bigglyguy8429 3 months ago

I find a more useful metric is to tell it that it is wrong, and ask it to go back and figure out why? That kind of conversation gives you more of an idea if the thing is actually thinking about the real world.

@mk1st 3 months ago

In future versions the AI will probably be seeing the potential conflict in meaning and can ask back for clarity. I mean, it’s what we would do if we didn’t understand the question. I think it would be much more accurate if it did this.

@OurSpaceshipEarth 3 months ago

No cup has ever had a lid; with a lid it's not a cup. He should be saying teacup. He's not using the correct wording, like glass teacup or such.

@realDeor 3 months ago

I look forward to the next few months. I am really curious to see what people can achieve with it now that it's open source.

@johnrule1607 3 months ago

There would be *FOUR* killers in the room. Three of them live, one of them dead.

@emolasher 3 months ago

@6:25 change it to "lifts the upside down cup" maybe it needs it more descriptive.

@thakurlokesh 3 months ago

How do I run it on Google Colab?

@christianross2567 3 months ago

Hey Matt! I noticed you were about half way through Gadaffi maxing and I was wondering if you were going to give us a video detailing your journey. Kind regards, your biggest fan.

@Shnazzleboxxin 3 months ago

It was technically right in the 12 sentences that end with apple response. It gave you twelve sentences and the final sentence ended with apple.

@jermfu3402 3 months ago

Nice catch! That's totally valid! Matthew, You're gonna have to give it a "pass" on that one!

@Cambo866 3 months ago

I think if it uses assumptions to give a response like it did in the digging the hole problem then you should follow up with a question getting it to explain why its assumption may fail.

@akanbikhalid6928 3 months ago

You should change the questions but keep them similar. If I were working at an AI company, I would be using your videos and others to fix the results.

@matthew_berman 3 months ago

I don't think they care about me ;)

@MrVnelis 3 months ago

@matthew_berman We do. I work at one of them, and most of your videos trigger tens of messages in our Slack because nobody outside of the research dept has the time to test all these models. So we totally rely on channels like yours.

@MrVnelis 3 months ago

Please do IBM :-)

@SlyNine 3 months ago

@MrVnelis You don't care very much if you are giving it the answers. That's like studying for a specific IQ test. The results are invalid and useless.

@snooks5607 3 months ago

@SlyNine ranking in benchmarks -> exposure -> VC funding

@synchro-dentally1965 3 months ago

For the word-count question: is it considering the '.' and '\0' characters as words? I mean, it considers '12' a word.

@apache937 3 months ago

Maybe. I think someone said the Grok tokenizer sees it as 13 tokens (the GPT-4 tokenizer had that sentence as 12 tokens). Regardless, it's a fail.

@doonk6004 3 months ago

Groq needs to provide Grok-1 on their API now that it's open source!!

@solifugus 3 months ago

What kind of GPU do I need to run this model? (at a minimum of price) You can still see the long-standing weakness of neural net models since conception in that they learn patterns and don't actually work logic. They only work logic in as much as the patterns they learn are compliant.

@thearchitect5405 3 months ago

There is no single GPU that could run this model; it takes a large rack of GPUs, which is why nobody has set up the open source model for chat yet. If you're interested in running models on your own device, you're better off trying a smaller model like Qwen1.5, which outperforms Grok by a decent margin and can be run on a single GPU. If you do want to run the largest (reasonably sized) models on your own device, you'd need one or two A100s depending on what you're hoping to run. This won't work for Grok, but there's no incentive to use Grok since it's a lot worse than the current top open source models.

@solifugus 3 months ago

@thearchitect5405 There is the uncensored aspect. I don't want it to be a neo-Nazi or build bombs or anything like that, but GPT's and especially Claude's levels of censorship and extreme left-wing censorship are far beyond reason, in my view. Most people consider me to be centrist, btw.

@thearchitect5405 3 months ago

@solifugus Anthropic and OpenAI don't have up-to-date open source models. I will mention, though, that Grok isn't just worse than most of the top open source models, but also more censored than a lot of them. Grok has right-wing ideology ingrained in it; with these LLMs it's pretty hard to avoid having one or the other because they primarily learn bias. This counts as censorship because it's trained to exclude responses. Also, GPT-4 and Claude 3 censorship isn't exactly extreme left-wing; they're actually generally less politically censored than Grok. Though prior to Claude 3, the Claude models were by far the most censored on the market. You may either be thinking of Gemini, or of very niche cherry-picked examples.

@darkphase7799 3 months ago

Do you keep spreadsheet showing what models passed what tests?

@keithnance4209 2 months ago

The cup logic is actually correct. The faulty logic is that the person putting the cup into the microwave can slide the cup to the edge of the table and using both hands to force the ball to remain inside the cup. The test should explicitly state the person turns the cup right side up.

@realityvanguard2052 3 months ago

So how much GPU power does one need to run Grok?

@Sven_Dongle 3 months ago

Minimum 8 H100s.
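A back-of-the-envelope check on that figure, sketched in Python. The inputs are not from this thread: Grok-1's published size of 314 billion parameters and an 80 GB H100 are taken as given, and the estimate covers the weights only, ignoring activations, the KV cache and framework overhead.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


PARAMS_B = 314   # Grok-1 parameter count in billions (published figure)
H100_GB = 80     # memory of one 80 GB H100

for label, nbytes in [("16-bit", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    need = weight_memory_gb(PARAMS_B, nbytes)
    print(label, need, "GB of weights vs", 8 * H100_GB, "GB on 8 GPUs")
```

At 16 bits per weight the weights alone nearly fill eight 80 GB cards, which is roughly where an "8 H100s" estimate would come from. Lower precision shrinks the footprint, but it still does not fit on a single consumer GPU.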

@skasrafkamal7081 3 months ago

Which GPU are you using?

@Sven_Dongle 3 months ago

The 4096 ones in the twit/X megaplex

@matthewbond375 3 months ago

In your cup/marble challenge, could you be confusing the LLM by asking it where the BALL is, rather than the MARBLE? Or is that your intention?

@matthew_berman 3 months ago

Oh wow... I don't know why I never caught this! Although I don't think it should matter, I changed it in my tests for the future. Thanks for pointing it out!

@ctwolf 3 months ago

@matthew_berman You're one cool guy, just utilizing the feedback without ego, my guy. Praise be unto you, Matthew.

@matthewbond375 3 months ago

Thanks for clarifying! This one is stumping all my local models in interesting ways. None of them seem to be able to reason that the marble doesn't stay in the cup as it moves, at least without giving them additional input. Great work! Love the channel!

@pokerchannel6991 3 months ago

I found out quantizing is like compression, where certain data points are lumped together as one undifferentiated blob. It runs quicker, but some nuance is lost.
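The "lumping together" idea can be illustrated with a minimal uniform quantization sketch in Python. This is a toy under stated assumptions: the weights are made up, and real LLM quantization schemes (per-group scales, outlier handling and so on) are considerably more involved than this.

```python
def quantize(values, bits=8):
    """Map floats onto 2**bits evenly spaced integer levels (uniform quantization)."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Turn the integer codes back into approximate floats."""
    return [lo + c * scale for c in codes]


weights = [-0.82, -0.11, 0.0, 0.37, 0.95]  # made-up example weights
codes, lo, scale = quantize(weights, bits=4)
restored = dequantize(codes, lo, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(codes)
print(max(errors))
```

Each weight is stored as a small integer instead of a full float, so nearby values collapse onto the same level. The rounding error per weight is at most half a step (`scale / 2`), and the step grows as the number of bits shrinks, which is the lost nuance.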

@apache937 3 months ago

Yes, that's accurate. You can do some quantization without any real loss in accuracy, then it gets bad fast.

@oo__ee 3 months ago

The model on X is RLHF'd whereas the open sourced model is the base model

@william5931 3 months ago

I noticed it gave the answer before the reasoning a few times, maybe ask it to start with a step by step reasoning will improve the results, because it has more time/tokens to "think". I feel like it is just trying to come up with some reasoning to cover up its mistake.

@apache937 3 months ago

yes this is better

@dasistdiewahrheit9585 3 months ago

The hoodie 😻

@RobC1999 3 months ago

For the cup problem, I wonder if the models would get it right if it was specified the cup does not have a lid

@apache937 3 months ago

the point is to not make it too easy

@RobC1999 3 months ago

@apache937 I understand. But ambiguity in a question relies on the model anticipating the potential meanings and either picking one or trying to spell out all the possibilities. The model would be correct if the cup has a lid.

@olafsigursons 3 months ago

Aren't the requirements for Grok-1 very light, like an i5-8250U CPU @ 1.60GHz with 8 GB?

@apache937 3 months ago

fun joke

@dasistdiewahrheit9585 3 months ago

Grok FULLY TESTED the whole industry!

@sophiophile 3 months ago

The error with delay was that it didn't make it global where it was declared, from what I could see (you scrolled pretty fast). It looks like there were other issues as well.

@picksalot1 3 months ago

The question "How many words are in your response to this prompt?" is a tricky one, depending on how the "12" is interpreted. It has the single digits "1" and "2", and the two-digit number "12". If each of these is defined as a "word," then the answer "There are 12 words in my response to this prompt." is correct, as 9 words + 3 number/digit words = 12 words. It will be interesting to see when/how it defines digits/numbers as words. In a way, "12" is a compound word, not unlike the word "Database," which can be seen as 3 words/meanings: Data + base + Database = 3. Enjoying your channel. Thanks, subscribed.
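The counts being debated can be checked directly with a few lines of Python. The response string is the one quoted above; splitting on whitespace and treating "12" as its own item are assumptions about what counts as a "word", which is the whole point of the disagreement.

```python
response = "There are 12 words in my response to this prompt."

tokens = response.rstrip(".").split()          # whitespace-separated items
alphabetic = [t for t in tokens if t.isalpha()]  # items made only of letters
numeric = [t for t in tokens if t.isdigit()]     # items made only of digits

print(len(tokens))      # every item, "12" included
print(len(alphabetic))  # letter-only words
print(len(numeric))     # number items
```

Counting "12" as a word gives 10 and leaving it out gives 9, so the claimed 12 only works under the more creative readings (counting digits separately, or counting tokens rather than words).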

@calysagora3615 3 months ago

The test is exactly to see if the AI thinks about it in the same way we do, where numbers are not words, prompt names are not words, punctuation are not words, etc. Here Grok failed.

@picksalot1 3 months ago

@calysagora3615 Interesting. I didn't know how the AI thinks about numbers. Seems like an easy thing to fix.

@flashmedia8953 3 months ago

How do we play with grok?

@Sven_Dongle 3 months ago

Pay the Elon.

@brainstormsurge154 3 months ago

The prompt, "How many words are in your response to this prompt," seems so simple but it's just how the LLM works that makes it difficult. A regular person could say, "One," or "There are three," so this question just shows how much differently we can think.

@landonoffmars9598 3 ай бұрын

I've noticed I'm not subscribed, I really thought I was. So I'm subscribing. Have a good day.

@christiandarkin 3 ай бұрын

a bit of a worry - if it's searching twitter for every answer, all your benchmarks are up there and regularly discussed aren't they?

@Hunter_Bidens_Crackpipe_ 3 ай бұрын

That's the same for every other model Matt uses. They all got his info and questions.

@NicVandEmZ 3 ай бұрын

Twitter is full of different types of people, and people also get things wrong and post incorrect code, so I wouldn't be surprised if that's why it's not good at coding.

@mickelodiansurname9578 3 ай бұрын

@@Hunter_Bidens_Crackpipe_ the other models weren't pulling live data from the internet, this one was... yes, some have these tests in training data or fine-tuning by now though. In fact that's one of the big problems in measuring their effect; the team over at Hugging Face are basically throwing their hands up at the moment cos their leaderboard is fast becoming pointless.

@PeterResponsible 3 ай бұрын

you should really consider Matt's "tests" as entertainment only. There is nothing scientific about this test and he's been doing the same test for so long that of course the solutions must have leaked into training data sets.

@thearchitect5405 3 ай бұрын

@@Hunter_Bidens_Crackpipe_ That's blatantly false. Matt almost exclusively tests offline models without search capabilities.

@GuidedBreathing 3 ай бұрын

Guess this was done a bit fast ☺️ Shirt in parallel takes the same time; and in sequential well.. 5:05 Four Killers; one dead three alive.

@reinerheiner1148 3 ай бұрын

It's not a mistake. He keeps those questions in there because you can see it either way. Which results in people discussing this in the comments, which results in more engagement for the YouTube metrics, which results in better ranking of the video on YouTube...

@english2success 3 ай бұрын

I think the ball was not 'in' the cup. It was 'under' the cup, right?

@jcorpac 3 ай бұрын

Nice overview. I feel a little disappointed that it didn't respond to the word count problem with "One".

@PCSJEFF67 3 ай бұрын

Thank you. I always thought that my clothes are shit and I have no taste in choosing them. Now I know I was wrong.

@NickFallon88 3 ай бұрын

It responds really fast

@epochgames3049 3 ай бұрын

12 words in the response was a PASS! 10 words in the sentence, and "Grok" twice above that. The AI is literal! You asked it how many words are in the response, but only counted the words in the sentence, and not the names.

@danielkorosec9944 3 ай бұрын

Would be interested to know what grok would reply hitting the apple in 0.

@mrtim6479 3 ай бұрын

Curious if you correct the AI when it got the logic wrong, and then later ask the same question, does the answer quality improve ?

@HobbiesHobo 3 ай бұрын

There are 4 killers in the room. 3 are alive and 1 is dead. There are still 4 killers in the room.

@elck3 3 ай бұрын

What is a “quantized” version?

@yoloswaginator 3 ай бұрын

The model parameters are compressed to a lower precision format, kind of butchering the optimal large model to make it much smaller

@elck3 3 ай бұрын

@@yoloswaginator thanks for that. is that like taking the weights matrix and compressing the precision of the numbers?

@yoloswaginator 3 ай бұрын

@@elck3 Yes, in the extreme version you get so-called binary neural nets, whose weights can only take boolean values. This still works surprisingly well and saves you lots of computation.

@costa2150 3 ай бұрын

What does "quantized version" mean or do?

@Sven_Dongle 3 ай бұрын

The weights that make up the model are floating-point values. These values can be compressed via an algorithm (quantized) that reduces them to a set number of bits (typically 8 or even 4), thus reducing the overall size of the model and allowing it to run on smaller computing platforms. The raging dispute is the degree to which this reduces the overall accuracy of responses.

@costa2150 3 ай бұрын

@@Sven_Dongle ah ok. So it's a form of compression. Kind of like when an image is compressed and the resolution is reduced at different scales. Thank you Sven.

@Sven_Dongle 3 ай бұрын

@@costa2150 Sort of, but image compression can actually lose data by dropping out chunks that it determines won't be 'noticed', while quantization just reduces the size of each piece of data, though it can be argued information is also 'lost' in this manner.

@costa2150 3 ай бұрын

@@Sven_Dongle thank you for your explanation.
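For anyone who wants to see the idea from this thread in code, below is a minimal sketch of symmetric quantization in plain Python. It is a toy under stated assumptions: the weight values are made up, the `quantize` and `dequantize` helpers are hypothetical names, and real schemes for models like Grok-1 are more elaborate (per-block scales, calibration, and so on). It only shows how floats become small integers and how the rounding error grows as the bit width shrinks.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


def worst_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up example weights, purely for illustration.
weights = [0.12, -0.53, 0.98, -0.07, 0.31]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    print(bits, "bits ->", q, "worst error:", worst_error(weights, bits))
```

Each weight still has a stored value, just a coarser one, which matches the point above that quantization shrinks every piece of data rather than dropping chunks. The 4-bit run shows a visibly larger worst-case error than the 8-bit run, which is the accuracy trade-off people argue about.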