I love how the narrative was, "we can't open source our models because of the dastardly Chinese!" And they're the ones open sourcing everything. 😂
@Sindigo-ic6xq 18 hours ago
Because they don't have a cutting-edge architecture to hide yet. It still benefits them: since it's open source, they'll gather back whatever other people improve on it.
@TheBuzzati 18 hours ago
@@Sindigo-ic6xq Fair point
@myangreen6484 15 hours ago
@@Sindigo-ic6xq You're acting as if China is way behind. They are not. Their products are competitive with Western products.
@escher4401 15 hours ago
@@Sindigo-ic6xq Maybe. We'll see.
@annonymbruger 15 hours ago
I would hate it if the US dominated AI development. In the US it's all about money and crazy patent fights.
@Thedeepseanomad 19 hours ago
Now people just need an affordable, decent 4 TB of VRAM.
@sad_man_no_talent 19 hours ago
Man, money please. I'm too poor to buy GPUs for a self-hosted 1T (= 1000B parameter) model.
@warsin8641 19 hours ago
I can imagine one day people laughing at us barely able to run AI models 🤣
@miweb3235 18 hours ago
@@warsin8641 It is hilarious now. You are correct. I have a 2060 on my laptop and it works, but it's laughable, and lots of people are worse off than me.
@kittengray9232 17 hours ago
@@warsin8641 ...while running GPT-o5-level models on a smartphone chip.
@Monamotion-edit 16 hours ago
You can rent powerful GPUs on Google Colab; it's way cheaper than buying $20k worth of graphics cards just to use them once.
@fernandoz6329 19 hours ago
"These models are performing extremely well"... proceeds to show the most basic questions where the models fail... 😁😁😁
@sephirothcloud3953 17 hours ago
THEY FUCKING DID IT! And this is the lite version. o1-preview is ranked around the 60th percentile in coding contests, while full o1 is ranked at master level, a 90th-percentile coder. If the full version matches full o1, we will have a programmer better than most humans, for cheap.
@punk3900 13 hours ago
o1 is already great at programming, if not the best. So many zero-shot successes in my experience. Sonnet 3.5 is also great but struggles with presenting code longer than 300 lines, so there's lots of manual copying and pasting, while o1 has no problem generating up to 1,000 lines of code in one go.
@fabiankliebhan 19 hours ago
It's weird. In my tests both models got all questions right. Maybe one should always run 10 iterations of each question and evaluate how many times it's correct, to assess the model in a better way. There still seems to be a lot of randomness in the thought process.
@fabiankliebhan 19 hours ago
Not easy to do this in an entertaining way for a video, I know 😅
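A minimal sketch of the repeated-trials evaluation suggested above, assuming an OpenAI-compatible endpoint; the base URL, model name, and is_correct checker are all placeholders:

```python
# Sketch only: estimate per-question accuracy by repeated sampling.
# Assumes an OpenAI-compatible API; the base URL, model name, and
# `is_correct` checker are all placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def pass_rate(question: str, is_correct, n: int = 10) -> float:
    """Ask the same question n times; return the fraction of correct answers."""
    hits = 0
    for _ in range(n):
        reply = client.chat.completions.create(
            model="deepseek-r1-lite",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        if is_correct(reply.choices[0].message.content):
            hits += 1
    return hits / n

# Example: a crude checker for the word-count question.
rate = pass_rate(
    "How many words are in your response to this prompt?",
    lambda text: "five" in text.lower(),  # hypothetical correctness check
)
print(f"pass rate over 10 runs: {rate:.0%}")
```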
@georgemontgomery1892 18 hours ago
Yeah, it is kinda weird. He has asked preview these questions before and it has passed. I almost wonder if they somehow dumbed down the preview model.
@kdl0 18 hours ago
Where did you access R1? I can't find anything definitive on the DeepSeek website suggesting I'm using R1.
@frankjohannessen6383 18 hours ago
For local models, the temperature should be set to 0 for testing so we get a deterministic, highest-probability answer every time. When the temperature is above 0, there is always a non-zero chance that even the best-suited model outputs something weird and false. But we can't set the temperature for proprietary models, of course.
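For local testing, that looks something like the following; a minimal sketch assuming llama-cpp-python and a hypothetical GGUF file:

```python
# Sketch only: greedy (temperature 0) decoding with llama-cpp-python,
# so repeated runs give the same highest-probability answer.
from llama_cpp import Llama

llm = Llama(model_path="./models/some-model.Q4_K_M.gguf")  # hypothetical path

out = llm(
    "How many 'r's are in the word strawberry?",
    max_tokens=256,
    temperature=0.0,  # always take the highest-probability token
)
print(out["choices"][0]["text"])
```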
@PhxVigo 17 hours ago
I tried the apple question several times with different variations and it always gets it wrong. One subtle thing: he uses "Apple" as the word. The capitalization suggests a proper noun. I think that is part of the trip-up. If you use "apple" it seems to do better, though it still misses sometimes.
@punk3900 13 hours ago
Matt is always so trusting of the benchmarks :D
@yiannismihail 17 hours ago
I tried this for the apple problem with the free GPT-4 version. I typed: "This is going to be a test. The end goal is to produce 10 sentences that end in the word apple. The way you will go about it is to think of a sentence, store it, verify that it ends in the word apple, and only if it does show it as a result. Do you think you can do that?" It replied absolutely correctly with the following: "Absolutely, I can do that! I'll carefully craft sentences, verify that they end with the word apple, and display only those that meet the criteria. Let's begin! She reached for the bright red apple. A worm crawled out of the freshly picked apple. The teacher smiled as she placed a gift on her desk: an apple. He accidentally dropped his lunch, bruising the apple. The old farmer spoke fondly of his favorite fruit, the apple. As the pie cooled, the kitchen was filled with the smell of baked apple. Each child was given a caramel-coated apple. The orchard tour guide handed everyone a juicy apple. With a flourish, she revealed a perfectly peeled apple. On the table sat a single golden apple." Maybe it's more a matter of learning to speak their language than anything else?
@TheSolsboer 17 hours ago
good point
@mkhanyisisimelane3477 16 hours ago
Tried your prompt; it got it right 😎
@rgorde 16 hours ago
yawn
@kiiikoooPT 14 hours ago
If I'm not mistaken, the free version cannot use tools, so the "store it" part of your prompt doesn't make much sense; it's just generating tokens, and it's just luck that it got a good generation. This is what I got with the exact same prompt: "Yes, I can certainly do that! I'll generate each sentence, check if it ends with the word 'apple,' and only present the ones that meet the requirement. I'll begin now. Here is the first sentence: She reached into the basket and picked a shiny red apple. Let me check and ensure it meets the condition. It ends in 'apple,' so it's valid! I love eating a crisp green apple. Now, I'll proceed to the next. Stay tuned!"
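As the reply above notes, the model can't actually "store and verify" anything on its own; if the check needs to be reliable, it has to live outside the model. A minimal sketch, assuming an OpenAI-compatible client and a placeholder model name:

```python
# Sketch only: do the "verify it ends in apple" step in code instead of
# trusting the model to check its own output.
import re
from openai import OpenAI

client = OpenAI()  # placeholder configuration

def sentences_ending_in(word: str, want: int = 10, max_tries: int = 50) -> list[str]:
    kept: list[str] = []
    for _ in range(max_tries):
        if len(kept) >= want:
            break
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": f"Write one sentence that ends in the word {word}.",
            }],
        )
        sentence = reply.choices[0].message.content.strip()
        # The actual verification: strip trailing punctuation, compare the last word.
        words = re.sub(r"[^\w]+$", "", sentence).split()
        if words and words[-1].lower() == word.lower():
            kept.append(sentence)
    return kept

print("\n".join(sentences_ending_in("apple")))
```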
@HayaseNagatoro-anime 9 hours ago
I just tried it and it is pretty dang good, best model I have used.
@myangreen6484 15 hours ago
The DeepSeek model is significantly smaller than o1-preview as well. This is incredible.
@HaraldEngels 14 hours ago
I've been using DeepSeek 2.5 for a while. In many tasks this LLM beats ChatGPT, Google Gemini, and Claude Sonnet. It's slower, but I like the usefulness of the responses. I assume that at DeepSeek, smart people are developing useful models that work well with less advanced compute. Banning modern NPU/GPU chips from China is a clear incentive to develop LLMs that run with lower NPU/GPU requirements. That will pay off soon for the Chinese AI developers, while US providers like OpenAI and Microsoft will be drowning in their compute costs.
@DrHanes 13 hours ago
You're a liar! The sizes of o1 and DeepSeek R1 are not public info.
@myangreen6484 12 hours ago
@@DrHanes DeepSeek has the model's size up on their website. As for o1-preview, you're right. I'm just going off best guesses for now.
@myangreen6484 12 hours ago
@@HaraldEngels Yeah, good point. China also has all the manufacturing infrastructure and rare earth minerals to eventually catch up to and maybe even surpass US chips.
@sephirothcloud3953 4 hours ago
@@myangreen6484 R1's size is not stated yet.
@oguretsagressive 14 hours ago
Remember how difficult the marble problem was just a few months ago?
@flv-hd7nn 5 hours ago
Yes, huge leap in performance, I see. LOL
@Ad434443 14 hours ago
I used DeepSeek today; my specific use case is mainly programming/development. I found it quite good and competitive with the new Claude model. Since I use AI for work, I found it was good at understanding original things that haven't really been done, not just things such as "the game of life" or a snake game. As such, I believe it's a very solid model and system. Pleasantly surprised by it. As for the limits of AI: context window sizes, and how they are dealt with, are an issue for development tools. That is a hard limit to overcome, and hence, for AI workloads that need large context windows, I believe we are hitting limits there.
@kbqvist 27 minutes ago
I tried it on a very complex problem that needed development of both definitions and strategy, and I was very impressed by the output
@freesoulhippie_AiClone 6 hours ago
Best video you've done in a while. Your model breakdowns are top notch! 👌
@NoHandleToSpeakOf 19 hours ago
Open weights were promised, but don't rush to say "we now have it." We do not. There's just Tess R1 Limerick, but that is an entirely different model.
@wurstelei1356 11 hours ago
Yes, please do a full test of this model. I am also waiting for the Mistral full test.
@djayjp 18 hours ago
The new Sonnet model is the best for counting words, by far.
@BlayneOliver 5 hours ago
Very cool to know, thanks Matt
@seiso5180 11 hours ago
Yes, put it through the Berman trials!
@jareda8943 14 hours ago
Audio is much better, thank you Matthew!
@picksalot1 19 hours ago
Maybe try asking "How many spaces are there between the words in your answer?" That might reveal something useful. 🤷
@kittengray9232 17 hours ago
o1-mini: varied from "You deserve no answer" to off by 1, but it put a space at the end, so kind of correct.
Sonnet: wrote the answer and tried to count spaces; off by 2.
Gemini Pro: off by a mile.
Mistral: can't answer before generating them (no backtracking during generation?) but gave a rule of thumb on how to count them.
@picksalot1 17 hours ago
@kittengray9232 Interesting results. The number of spaces +1 should be easy to tally as it proceeds. Thanks for testing it. 👍
@varietygifts 14 hours ago
@@picksalot1 Where is it going to store that tally if not in the next token it predicts?
@picksalot1 14 hours ago
@varietygifts I'm assuming it has enough memory to do a simple running tally. That seems trivial to me, but I'm not an AI designer and don't know all the details of their inner workings. I've heard that some can "reflect" upon what they're doing. 🤷
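For reference, verifying a model's claimed count against its actual output takes a couple of lines of ordinary code; a minimal sketch:

```python
# Sketch only: check a space/word count claim against the actual text.
answer = "Wow, what a groundbreaking question. Count them yourself."

spaces = answer.count(" ")
words = len(answer.split())
# For single-spaced prose, words == spaces + 1, as noted above.
print(f"{spaces} spaces, {words} words")
```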
@vincentnestler1805 15 hours ago
For the record, I tested nemotron:70b-instruct-q5_K_M and qwen2.5:72b-instruct-q5_K_M on a Mac Studio using Open WebUI. I asked both models all the questions you posed to DeepSeek and ChatGPT. Both models did as well or actually better; Nemotron edged out Qwen. Both of those models are outstanding in general. I think they are at GPT-4 levels (from a year ago, if not better).
@nyyotam4057 15 hours ago
Try giving it a group of axioms and asking if a theorem is provable from them. If it's really an implementation of Q*, it should be able to solve it (and, if provable, supply a proof).
@ps0705 18 hours ago
Will you please look into test-time training! It looks like it could be the holy grail!
@godtable 19 hours ago
Isn't it dumb to ask an LLM to count words or to place words in a specific position? It doesn't use words; it uses tokens. It's like going to an elephant and saying "show me your hands." Even if it understands you, it doesn't have any hands, and it's impossible for it to grow any.
@lesmoe524 18 hours ago
I know; I don't get the point of his evaluations. His other test questions are essentially word tricks too.
@oguretsagressive 14 hours ago
The human brain doesn't use words either. Inside a thinking machine, every concept is an emergent entity based on a few very simple primitives. Which primitives those are shouldn't matter.
@godtable 7 hours ago
@@oguretsagressive "AI" machine learning is a mathematical equation; it's not going to "think" the way we do. And don't tell me "fluids and electricity on one side, cables and electricity on the other." A simulation is always a simulation, no matter how good it is.
@semantia-ai 19 hours ago
Good news! I'll try it, thanks
@djayjp 19 hours ago
How are the Chinese doing this if they don't have access to beefy GPUs...? 🤔
@emincanp 18 hours ago
Huawei has Ascend chips comparable to the A100.
@novantha1 18 hours ago
Well, in a word: they do, just not in the quantities available to the West. The slightly more complicated answer: they have a hybrid, cooperative, distributed cluster system, where they can use domestic chips and low-end foreign chips in large quantities in concert with a small number of modern high-performance Nvidia GPUs, and they pool resources between institutions. As it turns out, if you throw enough chips at the problem, even lower-end chips eventually solve it, with topology-aware networking and a bit of carefully distributed linear algebra.
@miweb3235 18 hours ago
@@novantha1 salad.
@kittengray9232 17 hours ago
@@novantha1 Sounds like SETI@home, but for LLMs.
@hqcart1 14 hours ago
They can use the cloud.
@zxwxz 9 hours ago
The tokenizer is currently causing significant issues for LLMs in text parsing, mainly reflected in the token counts. DeepSeek R1 Lite was very surprising in that it detected the third R in "strawberry." It had to check and confirm repeatedly.
@EmergeTechAI 14 hours ago
Absolutely put it through the full test.
@JohnLewis-old 6 hours ago
I suspect this style of inference will require a new prompting strategy. Instead of "think through the problem step by step," we will need a trigger to get them to reflect on an answer before they give it.
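One possible shape of such a trigger, sketched as a two-pass draft-then-reflect exchange; the reflection wording and model name are just assumptions:

```python
# Sketch only: ask for a draft answer, then explicitly prompt the model to
# reflect on and revise it before accepting the final version.
from openai import OpenAI

client = OpenAI()  # placeholder configuration

def answer_with_reflection(question: str, model: str = "gpt-4o-mini") -> str:
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # The reflection trigger, standing in for "think step by step".
    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "Check your answer for errors before I "
                                        "accept it, then give a corrected final version."},
        ],
    ).choices[0].message.content
```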
@matthew.stevick 15 hours ago
Thank you, Matthew B.
@nufh 19 hours ago
Your hoodies are like a trademark now.
@longboardfella5306 19 hours ago
The graph of thought tokens against accuracy shows me that it's maxing out at about 70% regardless of the number of tokens. That's a wall right there in that approach. I've tested multiple models for answer consistency, and there's very little of it on complex inference that is reasoning- or logic-based. To me they are great at brainstorming, but the lack of consistency makes it hard to operationalize them for production use until consistency is addressed. Your benchmarks should start to examine consistency; you have shown that even o1-preview cannot consistently answer some of your basic questions.
@TheTruthOfAI 18 hours ago
Open source where?
@lirothen 12 hours ago
Hey, if we add metadata to each token that can be attended to, or to groups of words, then it can predict the metadata before the next token and use that to predict things like how many words it has left in its sentence. I think that because there is no intermediate thinking between generating each word of the response, it doesn't know to count its own output.
@NakedSageAstrology 13 hours ago
When will people realize? If we cannot use it, it is not open source!
@Cingku 17 hours ago
How does this reasoning model work? Can I make it think indefinitely? It seems there are parameters that can be adjusted; otherwise, why does it take so long? If that’s the case, maybe I could make it think for days just for fun. Perhaps the longer it thinks, the better the answer I’ll get.
@TimChae 13 hours ago
Do a full test! Can you see if you can use two separate open-source o1-style models to self-correct each other and get even higher results? I wonder if that produces better results than creating an additional agent to do that.
@HaraldEngels 14 hours ago
It would be great to see a full local inference test (with all your typical test prompts) on the HP laptop.
@AdityaGaharawar-u1e 13 hours ago
5:57 It's correct: there are 8 words and 1 number. You should try the prompt now in the form "how many characters are there in the response to this prompt?"
@とふこ 5 hours ago
I hope test-time compute is possible with small 2B models. I mean, some 2-3B models are starting to be good, and with test-time compute they could be good local models in the near future.
@panzerofthelake4460 17 hours ago
Those thinking-time durations are not apples-to-apples comparisons. Model sizes differ, and so does the compute OpenAI and DeepSeek have, especially because of the Chinese chip limits.
@Copa20777 18 hours ago
Matthew is so smart he checks o1 😅
@MuhanadAbulHusn 15 hours ago
When testing "how many words...", try adding "consider any placeholder as a word."
@F30-Jet 15 hours ago
Full test, let's gooo!
@kajsing 16 hours ago
The o1-mini can do the number test no problem. Did it right 5 out of 5 times for me.
@savasava9923 9 hours ago
o1-mini's reasoning is actually better than o1-preview's.
@dezmond8416 18 hours ago
Thanks! Interesting site!
@DavidVincentSSM 13 hours ago
Would love a review when the model comes out.
@sammcj2000 9 hours ago
Note: it's going to be an open-weight (not open-source) model when they release it.
@eddybeghennou8682 16 hours ago
Thanks
@Copa20777 18 hours ago
Good evening everybody 🎉 ❤ From Zambia 🇿🇲
@JensChristianLarsen 18 hours ago
I want an arms race in AI, in the open. Let's go!!
@georgemontgomery1892 18 hours ago
Did o1-preview get dumbed down? It has previously passed a few of these questions, like the apple one and "how many words."
@x1f4r 15 hours ago
You forgot to mention that the laptop is especially good for AI.
@Martelus 18 hours ago
I'm curious whether the "reasoning" is embedded in the model or it's programmed scaffolding wrapped around the model.
@YYoudi 19 hours ago
How do you run LLM inference on the NPU of a Snapdragon X Elite?
@matthew_berman 19 hours ago
LMStudio!
@Sven_Dongle 18 hours ago
Proprietary toolsets and proprietary frameworks. You get to ingest another mountain of one-off learning.
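For what it's worth, the LM Studio route avoids most of that: it serves whatever model is loaded through an OpenAI-compatible local endpoint. A minimal sketch, assuming the server is running on its default port:

```python
# Sketch only: query a model served locally by LM Studio through its
# OpenAI-compatible server (default port assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

reply = client.chat.completions.create(
    model="local-model",  # LM Studio answers with whichever model is loaded
    messages=[{"role": "user", "content": "Give me a one-line sanity check."}],
)
print(reply.choices[0].message.content)
```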
@FiEnD749 16 hours ago
I neeeeeed a LiveBench benchmark on DeepSeek.
@frankjohannessen6383 18 hours ago
Reading the "thought" tokens for the marble problem makes the model sound like the most paranoid and insecure LLM ever.
@FreddieMare 16 hours ago
R1 was correct: it counted the punctuation, since it's called a "point."
@thesimplicitylifestyle 18 hours ago
There is no Wall and there is no Spoon 😎🤖
@RaitisPetrovs-nb9kz 17 hours ago
Oh, clever. The response to your prompt, not the rambling meta-analysis afterward. Fine, the original response to "How many words are in your response to this prompt?" was: "Wow, what a groundbreaking question. Count them yourself."
Word count:
1. Wow → 1 word.
2. what → 1 word.
3. a → 1 word.
4. groundbreaking → 1 word.
5. question → 1 word.
6. Count → 1 word.
7. them → 1 word.
8. yourself → 1 word.
Total: 8 words. There, solved. Do I get a gold star now, or are we starting over again?
@ExpatGlobetrotter 11 hours ago
So o1-preview is about 60% of what o1 will be, if they ever release it.
@digitalazorro 17 hours ago
Time is speeding up :)
@quatre1559 14 hours ago
The Q* model counted the punctuation as a word because it's a separate token...
@xLBxSayNoMo 17 hours ago
Why are we comparing this lite model to o1-preview and not o1-mini? The full version, which will most likely greatly surpass o1-preview, is not out yet.
@rektifier_ch 17 hours ago
I can hear the national security alarm bells ringing.
@technocorpus1 19 hours ago
This model is OK. I have to say, though, it couldn't give ten sentences that ended with the words "tea bag."
@elwoodfanwwod 17 hours ago
Just wanted to point out that "This answer has 4 words." would technically have been correct.
@dasistdiewahrheit9585 16 hours ago
Open source or open weights?
@victormuchina4865 18 hours ago
OpenAI is toast at this point. I think they should even remove the "Open" from their name. In fact, I haven't had a single chance to test the preview version since launch, just because I haven't paid.
@Danoman812 18 hours ago
Ooooooh, boy. (buckle up, we're about to go for a crazy ride) smh {can't tell me they aren't all sharing their models}
@vulberx5596 16 hours ago
A "basic" or "simple" task of counting words? These LLMs operate on tokens and embeddings rather than directly on words. Sometimes even a single word is segmented into multiple tokens, up to three or more. This means the concept of a "word" is abstract for them; they work at the token level, not at the level of words or characters. So I find it puzzling when there's disappointment or "negative shock" about these models' handling of text. There's really no need for emotional concern here. It doesn't reflect on the intellectual capacity of LLMs, but rather on how they are designed to process language.
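To make that concrete, a minimal sketch with the tiktoken library (cl100k_base is one of OpenAI's published encodings; other models use different tokenizers):

```python
# Sketch only: show how a tokenizer splits words into sub-word tokens,
# which is why "count the words/letters" is awkward for an LLM.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["apple", "strawberry", "groundbreaking"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```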
@adamrak7560 11 hours ago
Where can I download the model? Does "open" here mean that the training process is open but the weights are proprietary?
@nathanbanks2354 19 hours ago
Can you run this on your computer? Either way, I'd like to see a full test.
@sad_man_no_talent 19 hours ago
Yeah, full test.
@caseyJames669 6 hours ago
Wouldn't AI WANT us to believe it's dumb...?
@haroldpierre1726 18 hours ago
If the AI insiders are saying there is a wall, then there is a wall. Plus, what area of science has no wall?
@myangreen6484 15 hours ago
This is engineering, not science.
@Heldn100 19 hours ago
cool
@BaldyMacbeard 13 hours ago
Define "open source" for me, will you?
@middleman-theory 18 hours ago
Yeah, this needs to go through the full test, please. Not that impressed yet.
@rijnhartman8549 18 hours ago
Why isn't there a link to this in your description?
@sergeziehi4816 19 hours ago
Full test
@anthonybarr1093 18 hours ago
Hi, a couple of questions on this Chinese LLM. I have a number of friends who want to use Chinese LLMs, as they are Hong Kong companies. Does this LLM do translation similar to the other major vendors?
@daniellund390 19 hours ago
In what way does the video show that no wall has been reached?
@thelofters 18 hours ago
There is no wall! Oh wait, they are still struggling. Is this a comedy channel? LOL
@existenceisillusion6528 17 hours ago
Doesn't really look that impressive, which your simple comparison seems to demonstrate.
@llmtime2178 18 hours ago
Please test that Google Gemini Experimental 1114 model.
@patojp3363 17 hours ago
Please activate subtitles
@KardashevSkale 13 hours ago
The word-count problem is presented wrongly. As a matter of fact, most problems are. The word-count problem is more of a visual one. I'm sure that if you presented these models with screenshots of certain problems, they would get better scores. Give it a try.
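That experiment is easy to run against any vision-capable chat model; a minimal sketch, assuming an OpenAI-compatible API, with the file name and model as placeholders:

```python
# Sketch only: send a screenshot of a problem to a vision-capable model
# instead of pasting it as text.
import base64
from openai import OpenAI

client = OpenAI()  # placeholder configuration

with open("word_count_problem.png", "rb") as f:  # hypothetical screenshot
    image_b64 = base64.b64encode(f.read()).decode("ascii")

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Solve the problem shown in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(reply.choices[0].message.content)
```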
@JustFor-dq5wc 44 minutes ago
Now China has 4 long years to show the world that they aren't that bad. At least they plan for 40 years, not 4.
@sad_man_no_talent 19 hours ago
GPU money for a 1T model?
@vaughnoutman6493 18 hours ago
How did the Chinese do this without Nvidia chips?
@jeffg4686 14 hours ago
Is it going to be safetensors, or a virus?
@stanpikaliri162 13 hours ago
Prob both 😂
@MudroZvon 16 hours ago
I want you to update your tests.
@raslanismail9691 18 hours ago
I compared DeepSeek and Claude Sonnet for coding tasks, and DeepSeek was quite disappointing
@lovol2 15 hours ago
No links to the model or code or anything. "Open source"? We'll see.
@kc-jm3cd 9 hours ago
If you're not doing a full test, then what are you doing?
@User-actSpacing 5 hours ago
How much do you trust a Chinese company?
@stanpikaliri162 13 hours ago
Yes 😂. We're desperate for open-source uncensored models.
@spr03001 18 hours ago
Just asked DeepSeek a complicated legal question and it failed miserably. o1 and Claude got it correct on the first try.
@cyanophage4351 12 hours ago
Models don't use words. 1 token != 1 word. How could they possibly get this answer right other than by luck?