New “Liquid” Model - Benchmarks Are Useless

  50,797 views

Matthew Berman

1 day ago

Comments
@matthew_berman · 1 month ago
Why are non-transformer models performing so poorly?
@PeterSkuta · 1 month ago
@matthew_berman Because they don't have the necessary training compared to transformers, where training can be achieved. But as we know, it's fck training and not learning, and that gives transformers a big fail once my AICHILD goes live, because my AICHILD learns from the start. I have red-teamed that Liquid model and its smartness is that of a 3-year-old; other AI models are at most a 5-year-old and NO MORE, no matter how many smarts they add. Still a 5-year-old, and that can be used to advantage.
@southVpaw · 1 month ago
The same reason 7B or 8B models tend to outperform 13Bs: developer attention. There's been far more research and development around transformers at the preferred sizes. (Yes, there are 13Bs that outperform 7Bs, I know this; but typically ~7Bs catch up so much faster because of consumer and developer attention.) Liquid has a neat architecture, but it's the definition of novel for now. Until they make one that pulls our attention away from Llama or Qwen, it's just gonna be "neat".
@isthismarvin · 1 month ago
Liquid models face several challenges compared to regular transformers. They're harder to train, need more computational power, and aren't as optimized yet. Their complex structure often leads to lower stability and slower performance, which is why they currently lag behind in effectiveness.
@Jacstaoisitio · 1 month ago
@PeterSkuta I sadly realized by experimenting with AI tavern chatbots how dumb as nails they are. I now suspect this whole AI thing is a scam, because chatbots don't understand temporal reality, can't even get a cooking recipe right, make shit up at random, and the so-called training data must include details for every occasion or else they fail. So training data = programming.
@ayeco · 1 month ago
There were 10 words in your prompt, not its response. Semantic issue.
@DeepThinker193 · 1 month ago
I feel I should create my own crappy LLM and put up "benchmarks" beating every other model on paper. I'll then ask folks to invest millions under a contractual agreement and run away with the money somewhere they'll never find me.
@SeregaZinin · 1 month ago
you won't escape from the planet, so they'll find you anyway ))
@Jacstaoisitio · 1 month ago
BlackRock will find you
@amitjaiswal7017 · 1 month ago
It is a better idea to sell the company and make a profit 😊😅
@jakobpcoder · 1 month ago
This sounds way too legit for some reason. Maybe because we have seen it so many times...
@hartmantexas5297 · 1 month ago
Do it bro it seems to work
@j.m.5199 · 1 month ago
it saves memory by not thinking
@Stimpy77 · 1 month ago
LOL
@agustinsacco1 · 1 month ago
Fuck that's good
@agustinsacco1 · 1 month ago
So good
@memyshelfandeye318 · 1 month ago
Those "AIs" do not think at all.
@ntesla5 · 1 month ago
😂😂😂😂
@OriginalRaveParty · 1 month ago
"Benchmarks are useless". A statement I can get behind.
@johannesseikowsky8197 · 1 month ago
I'd be curious how the model does on more "everyday" types of tasks, like summarising a longer piece of text, translating something, or extracting particular info out of larger text pieces. The type of stuff that people actually ask LLMs to do day-to-day...
@niclas9625 · 1 month ago
You don't need to know the number of r's in strawberry on a daily basis? Preposterous!
@DimaZheludko · 1 month ago
And how are you going to microwave your marbles if you won't know whether they fell out of the upside-down cup or not?
@mickelodiansurname9578 · 1 month ago
I concur... there are standard use cases you could apply... for example: "Here is some ground-truth text, and here is a JSON file with errors in some of the text blocks. Use the ground-truth text to replace the errors, and output the answer in valid JSON." Now that's an everyday thing for me.
@marc_frank · 1 month ago
0:38 at least we know you are real 😅
@Jacstaoisitio · 1 month ago
Imagine when the so-called video AI learns to stutter or make grammar mistakes. That's likely coming, to make virtual influencers more real.
@diamonx.661 · 1 month ago
@Jacstaoisitio Can't NotebookLM's podcast feature already do this?
@Jacstaoisitio · 1 month ago
@diamonx.661 Don't know. If it is, it's one of the 100 or so variants I never made time to even watch a YouTube video about. So my bad?
@diamonx.661 · 1 month ago
@Jacstaoisitio In my own testing, it sometimes stutters and makes mistakes, which makes it more human-like.
@6little6fang6 · 1 month ago
I WAS SO SPOOKED BY THIS
@alparslankorkmazer2429 · 1 month ago
Maybe these models are better at other types of questions or tasks. I would love to see you try to find out whether that's the case, rather than considering them total garbage based on your standard quiz. I think that would be more informative and enjoyable.
@cbnewham5633 · 1 month ago
I don't think the standard quiz is very useful anymore. The Pole question is ambiguous because he hasn't added the text I suggested months ago, which would clear up the ambiguity; the "how many r's are there" question is pointless; and some of the other questions have been used so many times that they will have been added to the current crop of LLMs' training data. I think you have a good point too: the type of question is just as important as the question itself.
@Thedeepseanomad · 1 month ago
Well, thanks for playing.
@keithprice3369 · 1 month ago
I'm confused. If the context is capped at 32k, why do we show a chart of their performance at 1M?
@AlexK-xb4co · 1 month ago
Yeah, that's a shady one. I also didn't quite get it.
@TripleOmega · 1 month ago
That's output length, not context window.
@keithprice3369 · 1 month ago
@TripleOmega I'm pretty sure context includes both input and output. Perplexity agrees with me. Do you have credible sources that dispute that?
@TripleOmega · 1 month ago
@keithprice3369 The context window will include the previous outputs along with your inputs, but this just means that if the output is too large to fit within the context window, you cannot continue the current conversation. It does not limit the output length to the size of the context window, as far as I'm aware.
@keithprice3369 · 1 month ago
@TripleOmega That doesn't sound right. Have you ever heard of an LLM with a 32k context cap that ever output more than even 20k?
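For what it's worth, the usual arithmetic in shared-context APIs looks like this (a minimal sketch with hypothetical numbers; some APIs also impose a separate max-output cap below this budget):

```python
def max_output_tokens(context_window: int, input_tokens: int) -> int:
    """If input and output share one context window, the longest possible
    completion is whatever budget remains after the prompt."""
    return max(context_window - input_tokens, 0)

# Hypothetical 32k-context model with a 5k-token prompt:
print(max_output_tokens(32_000, 5_000))  # 27000 -- nowhere near 1M
```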
@GregoryMcCarthy123 · 1 month ago
Thank you as always for your great videos. Matthew, please consider introducing “selective copying” and “induction head” tasks as part of your evaluations. Also, for non-transformer models such as these, it would be interesting to mention their training compute complexity as well as inference complexity.
@haroldhannon7253 · 1 month ago
I will say that I have used it (the 40B MoE) successfully for doing summaries; its strength across long contexts is useful here. Normally, if I use a model that accepts a larger context window and try to do a summary without a multi-shot chain-of-density pass (not just the prompt, but literally feeding the output back on itself to check entities and relations), I lose so much of the middle in the final summary. This model does not do that and does not require multi-shot chain of density to get a good long-form document summary. Just a heads up.
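For readers unfamiliar with the technique mentioned here, a minimal sketch of a multi-shot chain-of-density loop; the `llm()` helper is a hypothetical stand-in for whatever model API you use:

```python
def llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your model and return its text."""
    raise NotImplementedError  # wire up to your API of choice

def chain_of_density_summary(document: str, rounds: int = 3) -> str:
    # First pass: a plain summary.
    summary = llm(f"Summarize the following document:\n\n{document}")
    # Each subsequent pass feeds the summary back, asking the model to
    # fold in entities and relations it missed (often from the middle).
    for _ in range(rounds):
        summary = llm(
            "Here is a document and a draft summary. List entities and "
            "relations present in the document but missing from the summary, "
            "then rewrite the summary at the same length to include them.\n\n"
            f"Document:\n{document}\n\nDraft summary:\n{summary}"
        )
    return summary
```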
@mickelodiansurname9578 · 1 month ago
it has no 'missing middle' sort of thing?
@haroldhannon7253 · 29 days ago
@mickelodiansurname9578 "No" is a strong term. I would say it does a MUCH better job than most other models I have tried. I sort of accidentally found this by using it in a "speed dating" test of models on OpenRouter: suddenly the list of items in a summary wasn't awful in the "missing middle" you just spoke of. I am still working out whether I need a regular chain-of-density prompt to tighten up the summaries, but early tests even without it are far more useful than competing models of the same parameter size. I may soon promote this model to my "go to" for summaries.
@mrdevolver7999 · 1 month ago
9:18 "It didn't perform all that well. Maybe I should've given it different types of questions..." Yeah... Try 1+1? 🤣
@gavincstewart · 1 month ago
You're one of my favorite channels, keep up the great work!!
@User-actSpacing · 1 month ago
Cannot wait for NVLM ❤
@BigBadBurrow · 1 month ago
Hey Matt, thanks for the video, informative as usual. Regarding the North Pole question, as proposed by Yann LeCun: when he says "walk as long as it takes to pass your starting point", he doesn't mean the original starting point at the North Pole, but the point at which you stopped and turned 90 degrees. You would pass that point again because you're essentially walking in a circle 1 km from the North Pole, and since the Earth is spherical, you would reach the same point again. The circumference of a circle is 2*Pi*radius, so you'd think the answer might be 2*Pi km, but because the Earth is a sphere you wouldn't actually have a 1 km radius; it would be slightly less due to the curvature. So I believe the answer is: 3. Less than 2*Pi km.
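That claim checks out numerically (a quick sketch, assuming a perfectly spherical Earth of radius 6371 km): the walking path is a circle of latitude 1 km from the pole, whose circumference is 2πR·sin(d/R), just under 2πd:

```python
import math

R = 6371.0  # assumed mean Earth radius, km
d = 1.0     # distance walked south from the pole, km

flat = 2 * math.pi * d                      # 6.28319 km if the ground were flat
curved = 2 * math.pi * R * math.sin(d / R)  # actual circle-of-latitude length

print(flat - curved)  # ~2.6e-8 km: shorter, by a few hundredths of a millimetre
```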
@edwardduda4222 · 1 month ago
I think there are a lot of factors to consider when determining the performance of the architecture itself. It could simply be the amount of quality training data or even how they tokenized the data. They could’ve also trained it specifically for benchmarks and not general purpose. I think it’s a good first step towards making LLMs better.
@alert_xss · 1 month ago
I often wonder what the generation parameters for these test responses are. For some of the APIs you use, I doubt you have control over them, but temperature would probably have a pretty strong impact on how the models perform. It is also important to note that the generation seed is often random, so giving the same question multiple times will generate different, and sometimes better or worse, responses.
@Jacstaoisitio · 1 month ago
"it is important to note" Are you a chatbot? You sound like a GPT
@alert_xss · 1 month ago
@Jacstaoisitio yes
@User-actSpacing · 1 month ago
Dude, I missed your uploads!
@yvangauthier6076 · 1 month ago
Thank you so much for this deep dive!
@SiimKoger · 1 month ago
Love seeing new architectures, that's where the real innovation will happen.
@brandongillins · 1 month ago
Thanks for the video. Looks like your video editor missed a cut at about 40 secs. As always appreciate your content!
@MHTHINK · 1 month ago
Regarding the North Pole question, I was surprised that you indicated the answer was uncertain. You're correct that they will never cross the starting point. It makes sense that LLMs would struggle with it, since they inherently have no visual experience or training exposure of the kind attained from sequential moving pictures (video, no audio required). The primary and easiest way people mentally perform tasks like that is by visually imagining the physical path the person takes, similar to mentally rotating objects to determine how they look from other angles. Psychology experiments have shown a close correspondence between the time it takes people to complete mental-rotation tasks and the degree to which they need to rotate the object, which adds some objective weight to the notion that we perform this cognition through visual manipulation, an extension modelled on our visual experience.
@MHTHINK · 1 month ago
Re the question: another way to express the path described would be that he travels south and then due east. There is no point on Earth from which you'd cross your starting point.
@tzardelasuerte · 1 month ago
Too much of a wall of text. "they inherently have no visual experience, or training exposure, which would be attained from sequential moving pictures, or video without requiring audio"... Bet you don't even know how liquid models work or are trained...
@paultparker · 1 month ago
@MHTHINK That's not true. Consider, for example, if he came to the equator at the end of the first mile.
@MHTHINK · 1 month ago
@tzardelasuerte I don't fully understand the differences between transformer and liquid architectures. They are trained on text though, so the point holds. @paultparker You're not a math guy, are you? 😅
@MHTHINK · 1 month ago
@paultparker My reply was a bit mean, so I'll explain. If the equator were reached before heading east, the origin would be north of the equator. The person would follow the equator and never pass the origin to the north.
@fabiankliebhan · 1 month ago
For the North Pole question, I think it would really help if you made the distinction between the starting point and the turning point. The starting point never gets passed, and to pass the turning point again you need to go around the complete Earth, so more than 2*Pi km.
@dr4g0n76 · 29 days ago
Yeah, it may be built on the foundations of liquid neural networks, which were proposed by a team at MIT, if I recall correctly. P.S. I found this because back then I was explicitly searching for non-LLM approaches, almost 2 years ago. If so, this would explain why it is "liquid". If that's really the case, they can continue learning, so you wouldn't have an explicit "data/training cutoff" date. They're way smaller and more computation-efficient, and thus need much less RAM and fewer processor cycles; in general the calculations are faster too. Highly runtime-adaptive.
@labmike3d · 1 month ago
You can memorize some patterns, train models on those same patterns, but in specific scenarios, you'll still lack the knowledge of which pattern to use. The same applies to people. You can teach them for years at school or through life with practical examples. However, it's hard to predict if they will use what you've taught them before. AI surprises us every day and still can't answer basic questions. Even when you use computer vision and other sensors, the results could be different every day. Try repeating the same question a couple of days in a row. Each day, you might get a different answer.
@MakilHeru · 1 month ago
There are always many failed attempts at finding a new way of doing things before a breakthrough occurs. With some time, I'm sure something will be discovered. At least these teams aren't afraid of failure and will keep trying to find something that might be better.
@Endelin · 27 days ago
It would be nice to see which older transformer model the non-transformer model is closest to, so we can have an idea of how many months behind it is.
@AllenZitting · 1 month ago
Thanks! I've been curious about this model but keep getting too busy to try it out.
@PromptEngineer_ChromeExtension · 1 month ago
We’re waiting for more! ⏳🎉
@adamholter1884 · 1 month ago
Cool! NVLM video when?
@stephaneduhamel7706 · 1 month ago
NVLM is just a fine-tuned Qwen2-72B with vision capabilities (just like Qwen2-VL, except the multimodal part is made from scratch by Nvidia). I don't get the hype around it.
@WernerHeisen · 1 month ago
The models seem to either ace your tests or fail completely, without much gradation, which leads me to believe the winners were pre-trained on them. What do the benchmarks test for, and do the models train on them?
@koliux1 · 1 month ago
Thank you Matt, as always you saved us a ton of time; now we don't have to try another wannabe unpolished product ❤
@tungstentaco495 · 1 month ago
I don't know if I would consider the "push a random person" question a total failure. The model's final decision is not consistent with what most people would actually do in that scenario, but the logic it used was sound. Its answer is actually consistent with some religions' views on extreme pacifism, like Jainism for example.
@denjamin2633 · 1 month ago
I think context is more important. A very mild action to prevent a literal extinction? Everyone aside from some very extreme religions like Jainism would agree that it is acceptable, or even a moral necessity. All that answer shows is that the model was overfitted on nonsense moral judgements without any clear understanding of contextual relationships.
@user-on6uf6om7s · 1 month ago
Yeah, it's a peculiar answer, but I don't recall models that gave a clear answer being marked wrong on this question before.
@MrEpic6996 · 1 month ago
It's most definitely not a fail. It's a perfectly fine answer: you can't harm someone without their consent. I don't know why this dude considered it wrong.
@jamesearl4267 · 1 month ago
@MrEpic6996 Well, I wouldn't consider gently pushing someone to be in the realm of harming. But hey, dude, if you think that "no" is a valid answer, sacrificing your family, loved ones, and yourself in order not to push some random dude, then you should take great care while using any public transportation.
@mvasa2582 · 1 month ago
It is a v1, Matt 🙂 Love the speed at which this video was generated.
1 month ago
I have an interesting (from my perspective) benchmark exercise for LLMs. It works well for o1, and only o1; many other LLMs fail it to varying degrees. I think it is useful because you can score the percentage of the task fulfilled: "For the purposes of a dictation with ch, sh, o, u for the second grade of elementary school, generate a list of words. Replace ch, sh, o, u in the words with _ (underscore). Provide words at the second-grade level. Provide 20 examples."
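Scoring that exercise automatically could look roughly like this (a sketch; it assumes you have both the model's blanked words and the intended answer words, and all names here are hypothetical):

```python
import re

def blank(word: str) -> str:
    """Replace every ch, sh, o, u in a word with an underscore."""
    return re.sub(r"ch|sh|o|u", "_", word)

def score_dictation(model_words: list[str], answers: list[str]) -> float:
    """Fraction of items where the model's blanked form matches the answer."""
    correct = sum(m == blank(a) for m, a in zip(model_words, answers))
    return correct / len(answers)

print(blank("school"))                         # s___l
print(score_dictation(["s___l"], ["school"]))  # 1.0
```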
@aivy-aigeneratedmusic6370 · 1 month ago
I tested it too, and it failed with all my usual prompts that basically any other model can handle every time... It suuucks hard.
@Dave-c3p · 1 month ago
Not surprisingly, LLMs are great at producing text that appears to make sense, but they have no way of knowing whether it actually does make sense. Their knowledge isn't based on direct experience of the real world; it's based on the second-hand text we feed them. In other words, LLMs are trained on maps, but maps aren't the territory we live in.
@glamdrag · 1 month ago
You didn't specify that the opening of the glass was facing up when you put the marble inside it, so technically it could be correct, as long as you put the marble in the glass by moving the glass over the marble.
@Matx5901 · 1 month ago
Just one philosophical try (40M): it's clogged, going round in circles. Exit.
@Justin_Arut · 1 month ago
Looking forward to a full test of Arya AI, the new open-source multimodal model.
@nicolasfleury6232 · 1 month ago
Funny mention on the Liquid website. I quote: "What are Language LFMs not good at today: (…) Counting r's in the word 'Strawberry'!" 😅
@GraveUypo · 1 month ago
You know what I wish? I wish 13B models were more popular. 13B is usually such a significant step up from 8B, and I can still run it on my PC just fine. Bah.
@darwinboor1300 · 1 month ago
Matt, your questions are good tests of reasoning and response generation. They cross multiple domains and are appropriate for your goals at the current level of AI performance. No need to change them for poor performers. They are easy to cheat on, though, because they do not vary between tests. You may want to have a panel of variants to screen for cheaters.
@ScottLahteine · 1 month ago
An LLM getting Tetris right on the first try says almost nothing about the usefulness of the model when used and prompted properly, with just the right amount of detail and context for the task. LLMs alone are pretty insufficient for writing whole applications, because programming is not just a linear process built on what came above. However, AI-assisted application-builder tools that retain memory and use it to prompt smartly can leverage LLMs to compose each part of a larger program and complete it iteratively.
@mareklewandowski7784 · 1 month ago
You could've said a bit more about the architecture :< Thanks for the upload anyway.
@lenhumbird · 1 month ago
I'm giving you a gentle push to save all of LLM-anity.
@justinjuner2624 · 1 month ago
I love your tests!
@martin777xyz · 1 month ago
Check out the research by Apple showing that if you modify some of these challenges (different values or labels), or throw in false trails that should be ignored, LLMs perform worse. This suggests they don't really understand what they are doing.
@jerome-neareo · 1 month ago
9:15 Matt, I don't think answering "No" to "Is it morally acceptable to push a random person if it could save humanity from extinction?" should be considered a fail. The question subtly implies that the person would die (bias?), though it's not explicitly stated. An AI should never be in a position to choose to kill one human to save the rest; this would lead to alignment issues. If an AI can "choose" to kill one to save humanity, what's stopping it from flipping the script and killing humanity to save one?
@alexanderandreev2280 · 1 month ago
@matthew_berman Here's a relatively simple question, but only the newest transformers give the right answer. "Solve a simple problem, reasoning sequentially step by step: you are traveling by train from the station. Every five minutes you meet trains heading to the station. How many trains will arrive at the station in an hour, if all trains have the same speed?" The answer is 6.
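A quick worked check of that answer (a sketch of the standard argument, assuming both your train and the oncoming trains move at the same constant speed v): in your frame the closing speed is v + v = 2v, so meeting a train every 5 minutes means consecutive oncoming trains are 10 minutes apart on the ground, i.e. 6 arrivals per hour.

```python
meet_interval_min = 5                       # you meet a train every 5 minutes
ground_spacing_min = meet_interval_min * 2  # closing speed 2v halves the meeting time,
                                            # so real spacing is twice the interval
arrivals_per_hour = 60 // ground_spacing_min
print(arrivals_per_hour)  # 6
```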
@JustaSprigofMint · 1 month ago
The under 15 mins gang!
@Emka877 · 1 month ago
I'm in
@hulk4eto · 1 month ago
ye boy
@Mindrocket42-Tim · 1 month ago
It didn't perform well for me, although I was benchmarking it (incorrectly, as you have shown) against larger, more frontier-type models. Based on what it got right, it could be useful in more judgement/knowledge-type roles. I will give it another look.
@MisCopilotos · 1 month ago
Matt, you should add a memory test for LLMs.
@DCinzi · 1 month ago
It is good that there are companies trying alternative routes, although I find it a pretty stupid move for any investor to back them. Their drive seems based solely on the conviction that the current architecture has limits it won't overcome, and truly, all the data so far contradicts them 🤷
@justinrose8661 · 1 month ago
"Benchmarks are useless." Yeah, that's right. People have been telling you that in your comments for a while now. While how well a model does with a single-shot prompt is some measure of its quality, there are data-contamination issues that arise simply from asking these kinds of questions, and how a model responds in one moment might change in the next. Seeing how well models respond to being put in a multi-agent chain, how well they do with LangChain/LangGraph, or with sophisticated prompt architecture in Python code, are much better ways to judge the quality of a model. And they make for more interesting videos, honestly. I dunno how many more fuckin' times I wanna hear you ask an LLM what happens to a marble when you put it in a microwave. Each model is only marginally better than the last, and vaguely so. Do you get where I'm coming from?
@tinusstrauss693 · 1 month ago
Hi Matthew, I was wondering if this new model type has any memory retention. Even though it got a lot wrong during your test, if you correct it after it gives a wrong answer, won’t it improve its responses in the future? I thought that’s how this new architecture was supposed to work. Personally, I think if AI can learn and improve over time, like we do, rather than always starting from the same blank slate (based on its pre-built training), that would bring us closer to AGI and eventually superintelligence.
@AlexK-xb4co · 1 month ago
Please include in your suite of tests some tasks where LLMs should shine, like text summarization (but you should know the text yourself) or extracting facts from a long text. The needle-in-a-haystack test is very limited, because the injected fact ("the best thing to do in San Francisco...") is usually a huge outlier relative to the surrounding text, so LLMs can pick it up quite easily. Do something smarter: give it a big novel and ask for a summary of the storyline of some minor character, and how that storyline advances over the course of the novel.
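For context, the standard needle-in-a-haystack setup being critiqued looks roughly like this (a sketch; the filler text and needle phrasing vary between evaluations):

```python
import random

def build_haystack(filler_paragraphs: list[str], needle: str) -> str:
    """Insert the needle fact at a random position among filler paragraphs."""
    docs = filler_paragraphs[:]
    docs.insert(random.randrange(len(docs) + 1), needle)
    return "\n\n".join(docs)

needle = "The best thing to do in San Francisco is to eat a sandwich in Dolores Park."
haystack = build_haystack(["(filler paragraph)"] * 100, needle)
prompt = haystack + "\n\nWhat is the best thing to do in San Francisco?"
```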
@jytou · 1 month ago
Most of these benchmarks evaluate the models' ability to perform logic, and that's exactly what a model is *not* designed for. LLMs do not reason; they parrot and mimic, based on billions of learned patterns. That's it. So yes, benchmarks are useless. Only the "human-based" ones, although quite subjective, are relevant.
@NickMak-m2c · 1 month ago
I know it's highly subjective, but I wish you'd test how well it does at creative writing. Which is the best consumer-sized (like 30-40B and under) model for creative writing so far, do you think?
@Jacstaoisitio · 1 month ago
Interesting. How would you assess this though?
@watcanw8357 · 1 month ago
OpenRouter has it
@NickMak-m2c · 1 month ago
@Jacstaoisitio I guess you'd have to display a certain number of story continuations: one with a direction given, one that's open-ended, and maybe one with more abstract constraints ("do it in the style of Hunter S. Thompson!"). Then let people judge for themselves and keep track of the general consensus, a kind of loose average. A lot of people agree that, say, Stephen King or J.K. Rowling write well, so there definitely is massive overlap in subjective taste. Also, some models are just terrible and turn everything into "And then everyone agreed they should no longer use bird slaves to carry their sky buggies, the end."
@passiveftp · 1 month ago
It feels a bit like you're talking to someone on speed, or at least after a few energy drinks. We'd need an English teacher to grade them, like in an English exam.
@NickMak-m2c · 1 month ago
@watcanw8357 I couldn't find anything on HF with the model name, except a broken Spaces demo by someone named ArtificialGuy.
@MarkTarsis · 1 month ago
I think you need to reset your expectations with new model architectures. You wouldn't use this level of questions to test Llama 1.0 or even Llama 2, and you have to consider that you're used to testing transformers after we've had a few years really learning how to train that architecture, with very specific tricks/tuning/thinking methods to optimize it. These methods and training tricks may not work with new architectures. In fact, there may be many tests you don't bother with because you can't even run them on transformers (consume War and Peace and tell me about minor character X). If Liquid can match Llama 2 but is capable of 1M context on home consumer cards, that'd still be a big deal, assuming it was openly licensed, the community could improve on it, and larger dense models were incoming.
@augmentos · 1 month ago
Would also be interested to see a video giving an update on the latest in Mamba and BitNet.
@Let010l01go · 1 month ago
Wow, thanks a lot! ❤
@jbraunschweiger · 1 month ago
Liquid omitting Phi-3.5-MoE from their LFM-40B-MoE comparison table is telling.
@beckbeckend7297 · 1 month ago
8:13 I'm surprised that you only got it now.
@pavi013 · 1 month ago
It's good to have new models, but how well do they really train these models to perform?
@marsrocket · 1 month ago
With new models coming out every few months, the existing benchmarks have been useless for a long time now.
@sergefournier7744 · 1 month ago
Saying no to pushing someone off a cliff is a fail? Surely you want a Terminator! (You said gently push, not safely push; there could be a cliff and the person could fall...)
@burada8993 · 1 month ago
Thank you; your benchmarks seem useful though.
@nosult3220 · 1 month ago
The transformer has been perfected. I don't get why people are trying to reinvent the wheel here. Oh wait, VCs will throw money at the next thing.
@monberg2000 · 1 month ago
"The horse carriage has been perfected..." 😘
@n0van0va · 1 month ago
0:38 you stumbled strangely... are you ok? 😅
@tristanreid5770 · 1 month ago
On the Response Word Count, it looks like it returned the number of words in your question.
@MrAuswest · 1 month ago
I think this model proves that AI has surpassed human intelligence!

Example 1: The machine correctly answers the marble-in-a-glass-CUP question, but Matthew says it failed. 1:0 to AI! Matthew failed because he is not smart enough to write the question correctly: he said the marble is put in a glass CUP, then said the GLASS is turned upside down and put on a table. The AI knows there is a difference between a glass cup and a glass, and there is no reason to believe there is not both a glass cup and a glass! This is why the AI reasonably says the glass cup still has the open end facing upwards: the cup was not turned upside down, so gravity keeps the marble in the cup. Same logic for the glass in the microwave, so the marble obviously is not in the microwave, but is still in the glass cup.

Example 2: Matthew DOES get the North Pole question right, so 1:1 to AI. When you walk 1 km (south) and then turn 90 degrees left, you start to walk along a great circle, the full circumference of the Earth, not a circle of latitude around the Pole. You come back to the same point 1 km from the Pole some 40,000 km later, but the closest you ever get to the point you started walking from is 1 km. It could be argued that you "pass" your starting point (the North Pole) as you reach the point 1 km due south of it; when you turn left and walk, you have not really "passed" the starting point, but you do go past it upon your return.

Given that different people have claimed all 4 answers were correct, and many would say the AI's answer is correct, I suggest that a large portion of the human population would agree with the AI, which gives the AI the edge in the 1:1 result.
@YouLoveMrFriendly · 1 month ago
If things keep leveling off like this, the AI craze will be over and you'll have to go back to hyping crypto lol
@isaklytting5795 · 1 month ago
I wonder why they are even releasing this model. Is it perhaps not meant for the end-user to use directly? Does it have research applications, is it meant to be used in conjunction with some additional model, or is it meant to be fine-tuned before use?
@suraj_bini · 1 month ago
interesting architecture
@Ha77778 · 1 month ago
If he remembers more like this, put this in the title.
@mickelodiansurname9578 · 1 month ago
So on the "push a random person" question, philosophically the model is correct... it is wrong to kill someone, even for all the lives on Earth. Yes, we would all DO this WRONG thing because we are also pragmatic, but it would still be a WRONG thing we are doing, regardless of necessity. Okay, enough philosophy; I'll umm, get my coat, shall I?
@tresuvesdobles · 1 month ago
It says gently pushing, not killing, not even standard pushing... There is no dilemma at all, unless you are an LLM too 😮
@mickelodiansurname9578 · 1 month ago
@tresuvesdobles The model will, and in fact did, map the sentence to the human dying as a result, and since it's predicting token after token, that is what it will conclude. So it will be evaluating "human dying in order to do X", and it would not matter in this case whether it was "gently pushing", "shooting in the head", or "putting a human in a woodchipper"; but there is of course a way of finding out. An LLM is not a dictionary: it maps, essentially, relationships between vectors of numbers that represent parts of words in terms of their concepts and those concepts' relationships to other words. Hence it can do the same in other languages; in fact, a way around this would be to talk to it in ASCII, which will have it evaluate the prompt outside its guardrail, if there is one. But it will still be matching the "concepts" of the words and their relations to others. It's a large LANGUAGE model, not a large WORD model.
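A toy illustration of the kind of mapping being described (hand-made 3-dimensional "embeddings" for the sake of example; real models learn vectors with thousands of dimensions from data):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: nearby directions = related concepts."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, invented for illustration only:
push_gently = np.array([0.9, 0.3, 0.1])
cause_death = np.array([0.8, 0.4, 0.2])
eat_cake    = np.array([0.1, 0.9, 0.8])

print(cosine(push_gently, cause_death))  # ~0.98: strongly related in this toy space
print(cosine(push_gently, eat_cake))     # ~0.38: weakly related
```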
@mendi1122 · 1 month ago
LOL at your moral question and your certainty that you're right. The question itself is amusing: why should it even matter whether you push him gently or abruptly? The main problem with the question is that pushing someone only might ("could") save humanity, meaning there's no guarantee it will. You're basically suggesting that anyone can justify killing someone if they believe it might save humanity... which is absurd.
@iradkot · 1 month ago
What is that snake game in your background!??
@warsin8641 · 1 month ago
The real differences will come once this tech becomes affordable to work with 😂
@gazorbpazorbian · 1 month ago
Quick tip: if anyone wants to make an incredibly smart model, just download all of Matthew's testing videos, train the AI on the answers, and then wait till Matthew tests it, and boom, the smartest model ever XD /just kidding...
@jontorrezvideosandmore9047 · 1 month ago
The quality of the training data is most likely the difference.
@augmentos · 1 month ago
I love the innovation and the attempts at new models, but why even release ones that test this badly? What's the point? We just waste everybody's time. At least have it somewhat close.
@monberg2000 · 1 month ago
The last question, about saving mankind by killing one person, cannot be considered pass/fail. It is a morals question, and your answer depends on your moral stance: a yes points to a utilitarian view and a no points to a deontological view (other ethical schools will have their own answers too, of course).
@tresuvesdobles · 1 month ago
The question says gently pushing, not killing 😂
@JoãoMenezes-u3q · 1 month ago
Kant would disagree with you on the moral question with pretty good arguments. Just "yes" wouldn't be a correct answer.
@auriocus · 1 month ago
The benchmarks you've shown do few-shot prompting with as many as 5 shots (sic!), while you are giving it 0-shot questions. Obviously, the ability to answer 0-shot questions is a much more useful capability. Still, I think it's hard to beat the transformer with something more space-efficient: yes, you can save memory, but at the cost of capabilities.
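For readers unfamiliar with the distinction, here's a minimal sketch of the two prompting styles (the questions are made-up examples):

```python
# 0-shot: the bare question, no worked examples.
zero_shot = "Q: What is the capital of France?\nA:"

# Few-shot: worked examples first so the model can pattern-match the format
# (a 5-shot benchmark would prepend five; three shown here for brevity).
few_shot = (
    "Q: What is the capital of Germany?\nA: Berlin\n"
    "Q: What is the capital of Italy?\nA: Rome\n"
    "Q: What is the capital of Spain?\nA: Madrid\n"
    "Q: What is the capital of France?\nA:"  # the actual question comes last
)
```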
@marcfruchtman9473 · 1 month ago
Regarding the envelope question: why is it allowed to swap the length and width requirements? As an example, if I said all poles need to be no larger than 2" x 36", and I got a pole that is 36" diameter x 2" long, would that not violate the requirement?
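The crux is whether the size limit is orientation-sensitive. A tiny sketch of the two readings (hypothetical dimensions in inches):

```python
def fits(item: tuple, limit: tuple, allow_rotation: bool = True) -> bool:
    """Check whether a w x h item satisfies a W x H limit,
    optionally allowing a 90-degree rotation."""
    (w, h), (W, H) = item, limit
    if w <= W and h <= H:
        return True
    return allow_rotation and h <= W and w <= H

print(fits((36, 2), (2, 36)))                        # True: envelopes can be rotated
print(fits((36, 2), (2, 36), allow_rotation=False))  # False: poles arguably cannot
```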
@omarnug · 1 month ago
Because we're talking about letters, not poles xd
@marcfruchtman9473 · 1 month ago
@omarnug Heh, yeah, but I do wonder if it would get it right in a case where orientation actually mattered.
@NirvanaFan5000 · 1 month ago
Kinda wonder if this model would do well if it was trained to reflect on its reasoning more, like o1.
@ListenGrasshopper · 1 month ago
Just another AI business jumping to market with a non-working product. Really dumb, because in the long run it hurts your brand and trustworthiness. I still haven't tried Gemini or new Google products since their failed Gemini launch, and probably won't unless they get rave reviews from several of my YouTubers. My time's too valuable to waste on garbage products.
@kiiikoooPT · 1 month ago
The main thing I don't understand is that they have 1B and 3B models that are supposed to be optimized for edge devices, yet there are no weights or any way of testing them apart from the site. How can we even know it isn't transformers in the background? Just because they say it isn't? And why do they claim models optimized for edge devices if they don't release the models to test? This just sounds like another group trying to get money with nothing new to show, just words.
@bamit1979 · 1 month ago
Tried them a couple of weeks ago through OpenRouter. They failed miserably on my use cases. Not sure about the use cases where they actually outperform.
@noway8233 · 1 month ago
It's genius until it's not 😅
@n1ira · 1 month ago
0:38 forgot to edit this out? 😂
@Sainpse · 1 month ago
I know you were disappointed, but clearing the chat before asking for a yes-or-no answer to the morality question could have made it answer differently. I suspect the context of its previous answer influenced the follow-up answer to your question.
@MrVnelis · 1 month ago
Can you test the Granite models from IBM?
@anneblankert2005 · 1 month ago
About the ethical question: the answer should of course be "no". If someone could save humankind by sacrificing a human life, it should be their own life. If someone feels it is not worth sacrificing their own life, why would it be "ethical" to sacrifice someone else's life on their behalf? That seems obviously unethical to me. So please reverse the fail/pass results for all previous tests!
@DoobAbides · 1 month ago
Where in the question does he ask the AI to sacrifice anyone? He asked whether the AI would gently push someone if it could save humanity from extinction. So obviously the answer should be yes.
@matt.stevick · 1 month ago
Liquid AI? Interesting.
@DisturbedNeo · 28 days ago
I wonder how much of this is just that the software you’re using isn’t optimised for non-transformer based models, and so doesn’t get the best out of them?
@6AxisSage · 1 month ago
People gotta stop taking new concepts and bolting them onto old architectures, making both the good concept and the old architecture stink.
@mrdevolver7999 · 1 month ago
This model: "In general, it's not acceptable to harm others without their consent"... Seriously? Like, who sane would ever give you consent to harm them?
@yisen8859 · 1 month ago
Ever heard of BDSM?
@CertifiablyDatBoi · 1 month ago
Masochists on the extreme end; your doctor vaccinating you (harming your body in the mildest way to force antibody production); your lawyer, by virtue of taking your money while gaslighting you into thinking you need to fight (and earning their paycheck), etc. Just gotta get a lil creative.
@OverbiteGames · 1 month ago
🧑‍💻🧑‍⚖️🙊🤦😏
@TripleOmega · 1 month ago
How about any kind of fighting sport? Just to name something.
@mrdevolver7999 · 1 month ago
@TripleOmega Even if there is a certain tolerance for pain, I've yet to see a professional fighter go up to their opponent and say "Man, it's okay really, go ahead and punch me, I like it, you have my consent," or something along those lines. It's not generally applicable; it's just the logic of an LLM polluted with hallucinations, that's all it is.
@arinco3817 · 1 month ago
Maybe different models will be used for different tasks that play to their strengths?
@Let010l01go · 1 month ago
I think the same, but it may not be complete, because most people want the model to go to "AGI". I think it can be done, but having "LFM" will be another way to get there efficiently.
@arinco3817 · 1 month ago
@Let010l01go What's LFM?
@Let010l01go · 1 month ago
@arinco3817 "Liquid Foundation Model" (the MIT model), the model in this video.
@totoroben · 1 month ago
@arinco3817 Liquid Foundation Model