Writeup for the DeepSeek R1 671b setup and running it locally: digitalspaceport.com/running-deepseek-r1-locally-not-a-distilled-qwen-or-llama/
@PankajDohareyКүн бұрын
Everyone was worried about losing their job because of GPT. Now GPT lost its job!
@ShihanQuКүн бұрын
Please use this same system to test Deepseek R1 70b, so we know what the cost/effectiveness trade off is
@davefroman4700Күн бұрын
The market dumped BECAUSE this is open source. Scarcity is what creates value in this economic model, whether that scarcity is real or artificially created by limiting production of the item. The true enemy of capitalism is not socialism or communism. It's abundance.
@brainstormsurge154Күн бұрын
You know what would really be impressive? Running a language model on a memristor.
@RickySupriyadi21 сағат бұрын
@@davefroman4700 You have cool thoughts. May I borrow your wise words about abundance, to unite humankind toward advancing its civilization?
@PixelPiКүн бұрын
You can significantly increase your token rate by enabling memory interleaving and Directory AtoS in the BIOS, as the bottleneck in LLMs is bandwidth rather than latency.

For Stale AtoS (Directory AtoS), the in-memory directory has three states: I, A, and S. I (invalid) means the data is clean and does not exist in any other socket's cache. A (snoopAll) means the data may exist in another socket in exclusive or modified state. S (Shared) means the data is clean and may be shared across one or more sockets' caches. When doing a read to memory, if the directory line is in the A state we must snoop all the other sockets because another socket may have the line in modified state. If that is the case, the snoop will return the modified data. However, it may be that a line is read in A state and all the snoops come back as misses. This can happen if another socket read the line earlier and then silently dropped it from its cache without modifying it. If the Stale AtoS feature is enabled, then when a line in A state returns only snoop misses, the line transitions to S state. That way, subsequent reads to the line find it in S state and don't have to snoop, saving latency and snoop bandwidth. Stale AtoS may be beneficial in a workload with many cross-socket reads, such as with memory interleaving.

If you don't have AtoS available in the BIOS, try other snoop modes and use the Intel Memory Latency Checker (mlc) tool to find the snoop mode that offers the highest bandwidth. You can further confirm which settings are right for you by running the Intel MKL LINPACK test; LLMs are very dependent on linear algebra, so LINPACK is a good stand-in benchmark for overall LLM CPU performance.

Also, Cascade Lake-SP has Intel's Deep Learning Boost on Xeon Gold and Platinum SKUs, and the AVX-512 VNNI instruction alone will basically double your token rate. But beware: Cascade Lake-SP qualification/engineering samples don't have AVX-512 VNNI because they can't receive the latest microcode updates. Another step up after that is Cooper Lake-SP, which has bfloat16.
@spencersmith7769Күн бұрын
By how much? Like going from 2 tps to 15? Or is it only like 2 tps to 5 lol
@maxmustermann194Күн бұрын
Nerd. I like you.
@Zeni-th.20 сағат бұрын
How the hell did humans come up with this
@andrewazaroff12 сағат бұрын
He also needs to check interleaving and channel population. I would suggest even the cheapest Epyc 9xxx against any Xeon, both for memory bandwidth and for the quality of its AVX-512.
@lev1ato11 сағат бұрын
wish I knew 10% of what you just said 😭😭
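To put rough numbers on the bandwidth argument above (and the "by how much" question), here is a back-of-envelope sketch. It assumes R1's roughly 37B active parameters and a Q4-class quant; the bandwidth figures and the 75% efficiency factor are assumptions, not measurements.

```python
# Back-of-envelope: token generation on CPU is roughly memory-bandwidth bound,
# so tokens/s is capped at (usable memory bandwidth) / (bytes streamed per token).
# All numbers below are illustrative assumptions, not measurements.

def max_tokens_per_second(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound model."""
    gb_read_per_token = active_params_b * bytes_per_param   # GB of weights streamed per token
    return bandwidth_gb_s / gb_read_per_token

ACTIVE_PARAMS_B = 37    # R1 is a MoE: ~37B of 671B parameters are active per token
Q4_BYTES = 0.55         # ~4.4 bits/weight for a Q4_K-style quant (assumption)

platforms = [
    ("6-ch DDR4-2400, 1 socket", 115),   # theoretical peak GB/s
    ("6-ch DDR4-2933, 1 socket", 141),
    ("8-ch DDR4-3200 Epyc",      205),
    ("12-ch DDR5-4800 Epyc",     461),
]
for name, peak in platforms:
    usable = peak * 0.75    # real systems rarely sustain more than ~75% of peak
    print(f"{name:26s} ceiling ~{max_tokens_per_second(usable, ACTIVE_PARAMS_B, Q4_BYTES):.1f} tok/s at Q4")
```

Snoop-mode and VNNI tuning like the comment describes can move a system toward that ceiling for decode, but not past it; compute features mostly help prompt processing.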
@paelneverКүн бұрын
Stock prices of big cloud providers and chip makers shouldn't fall. Now that everybody can run and even train their own state-of-the-art models by following the recipe provided by DeepSeek, the need for compute is going to spike like crazy. The only ones concerned should be closedAI and other big corps that already invested fortunes training inefficient models. Clearly many people freak out because they don't understand that open source is good for everybody except closed-source fanboys.
@tringuyen7519Күн бұрын
DeepSeek is good & should help AMD spread ROCm over NVDA CUDA. But expecting everyone to have a rack mounted server with 300G of DRAM is a bit too optimistic…
@FenrirRobuКүн бұрын
@@paelnever no, now the need is low because they gave their weights with an MIT license. This severely limits the need for new large scale training.
@paelneverКүн бұрын
@@FenrirRobu All experts in AI are saying that the need for compute (including large-scale training) is going to increase. How do you think the big Western AI labs are going to respond? They are going to apply the same recipe that DeepSeek applied, but with much more compute power, and that way the race for AI keeps going as the models get better and better. In fact, Nvidia stock is already recovering. But don't let that shame stop you from talking about what you don't know.
@FenrirRobuКүн бұрын
@@paelnever Yes, knowing about LoRAs, fine-tuning, and retraining existing models the way it was done with Llama is surely a big shame for me. Maybe I should also call DeepSeek's method a "recipe" rather than say "mixed FP8 and FP16 training" or "the efficiencies of MoE models." What's the point, you're probably powered by ChatGPT anyway.
@branduncensoredКүн бұрын
except for techbrozzzzzz
@clubpasser123454321Күн бұрын
I ran 671b locally too, in RAM, on an Epyc 7C13 with 1TB of RAM. The speed seemed comparable to what you are seeing. Not particularly practical, but it works. I found the 7C13 a great-value processor for a home server built on older tech. Love your content, thanks for sharing.
@n.lu.xКүн бұрын
I too have built a rig with the 7C13; the thing is a beast and has great perf per $. Which motherboard do you have? I'm running it on the MZ32-AR0 rev 3.0 motherboard and currently have only 8 slots occupied (512GB of RAM).
@DigitalSpaceportКүн бұрын
Meaningless errata: you can flash a Rev 1 up to a Rev 3, and the 7C13 looks like an amazing CPU!
@clubpasser12345432123 сағат бұрын
@@n.lu.x I have a Gooxi G2SERO-B: 10 PCIe slots and 16 DIMM slots. I designed my own single-height water block for the RTX A5000 and will have 6, but only 2 are installed so far; the rest go in next week after the holiday. I put in the full 16×64GB but only 2666MHz. It was very cheap and I couldn't help myself. Question for you: how hot does your RAM get? Mine will burn up if I don't have the fans on. With fans, it hovers around 50-60°C.
@igordasunddas337721 сағат бұрын
Man, I should upgrade my AMD 8004 platform...
@n.lu.x14 сағат бұрын
@ Interesting, you went the full server route with the A5000s. I went with 4x RTX 4090s with C-Payne risers and dual AX1600 PSUs in the same mining rig as @DigitalSpaceport. Waiting to see if I can get an additional 5090 in a few days! As far as RAM goes, I just checked and damn, it's at 60-64 degrees, in an open chassis too. The system seems stable so far doing DL, but if I get stability issues I'll install a fan.
@differentmovesКүн бұрын
This is actually nuts. Can't wait for folks to stack up a bunch of Digits to run these models.
@blubberkumpel6740Күн бұрын
@@differentmoves stupid Question: What does that mean?
@GubbaGaminКүн бұрын
Digits will only support 2 linked together.
@anthonyperks2201Күн бұрын
@@GubbaGamin yeah, but just imagine a couple of M4 512GB Ultras. I can dream.
@steelstagerКүн бұрын
@@GubbaGamin Out of the box. Some madman will figure out how to get a few dozen to work together.
@steelstagerКүн бұрын
@blubberkumpel6740 Project Digits is Nvidia's prebuilt AI solution for people who don't want to build their own machine. We should know more soon as they will go on sale later this year.
@阿斯顿法国红酒快-t9lКүн бұрын
Can you please make a video comparing the difference between r1 different parameters? Like 671b vs 70b to show the cost and efficiency trade-off.
@samorostczКүн бұрын
Good idea. I would even love to see one smaller model for curious ppl without advanced hw. ✌
@notlogic.the.secondКүн бұрын
fr. i want to see that too
@S-K69Күн бұрын
I've run 70b, 32b, 14b, 8b, and 7b locally. I will say it is very cool being able to run these models locally, BUT they are by no means replacements for the premium models currently on the market. Yeah, 32b is on ChatGPT's heels in a bunch of benchmarks, but in practical use it fails miserably, even on basic reasoning prompts.

For example, I asked it to tell me a country that ends in "lia" (Australia). In its reasoning it starts listing through countries, it SPECIFICALLY lists Australia and says "but no, that doesn't end in lia," and ultimately determines that Mali is the country that ends in "lia". The code quality it produces is very uninspiring. For example, I asked it to make a chess game utilizing Python and it spit out a Tic-Tac-Toe game. Even after trying to correct the model, it just couldn't get to anything even resembling chess.

I don't feel like doing a full blog post, but to me it is currently a cool novelty that's not very reliable. Nonetheless it is a huge step forward and I expect nothing less than constant improvements with it being open source.
@ShihanQuКүн бұрын
Yes please!
@bothost7043Күн бұрын
i want to see that too
@DogappelКүн бұрын
Tried this on my 8gig ram laptop. The drive filled so much with paging files it crashed and it's still full.
@DigitalSpaceportКүн бұрын
🤣
@ZAcharyIndyКүн бұрын
bruhh...
@honor9lite1337Күн бұрын
How much disk space is needed?
@ZAcharyIndyКүн бұрын
@@honor9lite1337 If I'm not mistaken, the 671B model requires at least 400GB of disk space 😅
@DogappelКүн бұрын
@@honor9lite1337 400 gigs or something, but this is a huge model. The 7b variant is only 4 gigs.
@tristanvaillancourt5889Күн бұрын
I knew it was going to be slow, but WOW. Congrats on actually running it however. Love your stuff. I guess 70b would be the max a normal human can run. Would love a tour of the Grok datacenter. lol.
@DigitalSpaceportКүн бұрын
I have a feeling exo will be the answer here. I have a strong feeling on that. Working on it. However, yes, for now it's 2 tps max on a 7702 with 512GB and a silly 4096 ctx. I have to get that parallel part figured out asap. E7-V4 showing its age.
@TillmannHuebnerКүн бұрын
you can run 70b on a 20GB RTX 4000 Ada
@BetterBumsКүн бұрын
@DigitalSpaceport you might have some error in config , it should be fast
@ruudh.g.vantol4306Күн бұрын
I always get confused by the units. Lowercase-g are grams, lowercase-b are bits?
@anthonyperks2201Күн бұрын
@@ruudh.g.vantol4306 Yes.
@jeffwadsКүн бұрын
Sweet. Hoping you would do this. Now people can comprehend real world compute.
@DigitalSpaceportКүн бұрын
If the electric company shows up at your door with an R930 rebadged as garage AGI for free, be very sus.
@CO8848_2Күн бұрын
Skynet has been activated, a 3D-printed Terminator will be coming out of his garage tomorrow
@DigitalSpaceportКүн бұрын
🤣
@MrKiar161110 сағат бұрын
And the Terminator is made of PLA with layer line issues on the Z axis.😎
@LynxNYC8 сағат бұрын
My suspicions were correct. Thanks for the most detailed and in-depth video!
@markldevineКүн бұрын
Now we all know. Weights-in-memory and GT/sec are still required. Thanks for slogging through this with all of that equipment. Excellent content value!
@DigitalSpaceportКүн бұрын
👍
@LinuxRenaissanceКүн бұрын
Quite amazing to see this! I just played with a 14b model, as I can't load a better one, but your server served an amazing purpose today. Thanks!
@coob967812 сағат бұрын
this feels like the type of video you'd hear about in 20, 30, 40 years while using your phone with 1TB of RAM, capable of running a much more convoluted model within seconds, the way we hear about the moon landing using just a few KB of RAM today. can't wait for what the next years of technology hold for us
@DigiByteGlobalCommunity5 сағат бұрын
Thank you so much for the content - we really want to run r1 locally! The dream of accessible, open-source personal AIs is closer than ever
@Bodofooko19 сағат бұрын
This really feels like watching something from the future. But it's now.
@SonofaTech9 сағат бұрын
Yo you got your first viral!!! Hell ya! Happy for you.
@FuZZbaLLbeeКүн бұрын
The 35b also gives some nice results and it fits in 24GB of VRAM. I would need 20 of these cards to run this model on GPUs 😋
@matteominellonoКүн бұрын
@@FuZZbaLLbee thank you for the insight, I was wondering if I could test a distilled model on my RTX 3090 with 24GB VRAM. 🙏🏻
@S-K69Күн бұрын
@@matteominellono Yes you can. I tested 70b on a 4090 with 24GB VRAM. It ran but was extremely slow. Testing 32b, it ran much faster, but I found its outputs to be very unreliable even on basic reasoning prompts, such as telling me what country ends in "lia" (Australia), which it failed; how many days are between specific dates, which it failed every time (granted the dates were 10+ years apart); and beginner coding prompts, which failed every time (make me a chess game implemented in Python; I got tic-tac-toe 😂). I'd try testing it yourself. It's very cool but falls VERY short of coming anywhere close to ChatGPT, despite all the benchmarks saying 32b is nearly equivalent.
@joshuascholar3220Күн бұрын
@@S-K69 24 GB of VRAM isn't enough for the 70b version. You were bottlenecked by caching. I ran it on a 48 GB A6000 and it's nice and fast.
@amihartzКүн бұрын
you can fit the 32b inside two 3060s, which go for only about $200 these days, maybe even less later in the year when the 5060s release. Makes building a practically useful AI machine actually feasible for not much money.
@klausklausi748414 сағат бұрын
@@joshuascholar3220 I plan to run it with an RTX 3090 and RTX 4090 combined
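Rough arithmetic behind the VRAM fits discussed in this thread. This is a sketch only: the layer counts, KV-head counts, and quant width below are assumptions, and real GGUF files vary by quant mix.

```python
# Rough VRAM sizing for a quantized dense distill: weights + KV cache.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    # params_b is in billions, so the result comes out directly in GB
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int, bytes_per_el: int = 2) -> float:
    # x2 for K and V, fp16 cache assumed
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_el / 1e9

# 32B-class distill at ~4.8 bits/weight with an 8K context (assumed GQA: 64 layers, 8 KV heads, head_dim 128)
m32 = weights_gb(32, 4.8) + kv_cache_gb(64, 8, 128, 8192)
# 70B-class distill, same quant and context (assumed: 80 layers, 8 KV heads, head_dim 128)
m70 = weights_gb(70, 4.8) + kv_cache_gb(80, 8, 128, 8192)

print(f"32B-class: ~{m32:.0f} GB  (tight on one 24 GB card, or split across two 12 GB cards)")
print(f"70B-class: ~{m70:.0f} GB  (why a single 24 GB card ends up offloading to system RAM)")
```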
@Super-Intelligent-AI-SEOКүн бұрын
We really appreciate your hard testing.
@eduh79506 сағат бұрын
I was born in the 70s so it's gonna be so cool if I'm alive to see how small of a setup will be needed to run models exponentially more demanding than today's models :D
@lsh3276823 сағат бұрын
I managed to run the distilled 32B on an M3 Max MacBook Pro with 128GB RAM. I dream of the possibility of hosting 671B entirely at home, and you are a godsend, sir!
@ShaferHart20 сағат бұрын
How many tokens per sec?
@lsh3276820 сағат бұрын
@ 17.8 pretty impressive
@ReinerSaddey4 сағат бұрын
Thx for taking on all this effort. In essence it also demonstrates how much of the "real thing" manages to leak into DeepSeek models running on consumer hardware. That's us 🙂
@examplerkeyКүн бұрын
I was looking 😂 for a reason to buy a used last-gen server with a ton of RAM and CPU. When I saw your video, I thought that's it, but wait a min, it's still not enough 😂 The answer from the smartest AI (me) is probably that I can run a smaller model, right? Which DeepSeek model do you think can run smoothly on a decently specced used server? < this took like 3 minutes to squeeze out of my brain RAM. 😂
@DigitalSpaceportКүн бұрын
You need to scrub your brain RAM, likely 🤣, and you should grab the unsloth tune from HF imo (video on that likely soonish if it's good, plus an exo one); it dropped in the middle of me filming this. What counts as decently spec'd in your mind is the real question to answer; then you have the spec to best fit to.
@examplerkeyКүн бұрын
@@DigitalSpaceport Okay, I've scrubbed my brain RAM and let my pet chicken pack all the lice 😂. Now I have an HP ProLiant DL360p Gen8 server, 20 cores, 384GB RAM with several TB of storage. Oh wait, will this even run the smallest model? Shall I talk myself out of it and just use the online version until the 12th hour into WW3? Answer: just use the online chatbox? 😂
@DigitalSpaceportКүн бұрын
You want 768GB of RAM to run the official version with a decent context window. The unsloths go down a lot in size, which is why I would check that first. Start with the chat; OpenRouter has it up, I'd bet.
@jaykrownКүн бұрын
At 16:18 it replied that the sentence has 10 words when it only has 9, so it would be a partially correct response. "Mischievous Luna stealthily knocked over a vase of sunflowers" is only 9 words.
@AnotherComment-rl6fvКүн бұрын
Great effort. Hopefully we can get similar capabilities from smaller models.
@lolololowbx280Күн бұрын
The unsloth version of DeepSeek R1 (with lower precision) could hopefully bring memory requirements down to 132GB. I wonder if you can review the unsloth version.
@one_step_sidewaysКүн бұрын
It will certainly bump the inference speed to something more palatable. But not that much. I had tried a 14B distill and it's barely palatable at 2.5 tokens per second on my Ryzen 3 4450U laptop with 64GB RAM. I'll have to leech off of others' cloud providers instead, it seems.
@benbencomКүн бұрын
Another thing that's causing some slowdown for you is that open-webui's default setting for title generation is to use the current model. You can see how all the chats are named "..." (because it's also not smart about CoT models for naming). So you're getting a whole extra query kicked off each time you start a chat.
@darekmistrz43649 сағат бұрын
He should have used "ollama run"
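For anyone who wants to take Open WebUI out of the loop entirely, here is a minimal sketch that times a single generation against the bare Ollama HTTP API. The model tag is an assumption (use whatever `ollama list` shows on your box); the response fields are the ones in Ollama's documented /api/generate endpoint.

```python
# Measure prompt and generation tokens/s straight from Ollama, bypassing Open WebUI
# (and its extra title-generation request).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
payload = {
    "model": "deepseek-r1:671b",          # assumed tag; substitute your local model
    "prompt": "Write a random sentence about a cat.",
    "stream": False,
    "options": {"num_ctx": 4096},
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    r = json.load(resp)

# eval_count / eval_duration cover generated tokens; durations are in nanoseconds.
gen_tps = r["eval_count"] / (r["eval_duration"] / 1e9)
prompt_tps = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
print(f"prompt: {prompt_tps:.2f} tok/s   generation: {gen_tps:.2f} tok/s")
```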
@napalmsteakКүн бұрын
I’m still new to this, but I had the absolute best luck with LLMs using Unraid bare-metal, installing the NVIDIA plugin, then installing the bundled GPU enabled version of Open Web-UI/Ollama. At least for me ProxMox was an unnecessary headache to get configured perfectly but Unraid works out of the box utilizing ALL hardware available.
@nextjinКүн бұрын
Same, I haven't seen this pop up on the marketplace, or a tutorial on how to get the R1 API to work in Ollama using Unraid, though
@aarond79114 сағат бұрын
Great work! Looking forward to trying this locally.
@tuapuikia16 сағат бұрын
Nice! I'm still rocking my lab Nvidia GPU card. I think lots of people have the misconception that an open-source LLM means it can run on any hardware 😂😂😂😂
@guidopahlberg941311 сағат бұрын
Thanks for highlighting Dell servers with 1.5 TB. I work for another manufacturer that offers 8-way Intel servers with 480 cores and up to 32TB of DDR5 memory.
@DickyBenfieldКүн бұрын
I thought even though R1 was 671B parameters, it used less than 20% of those (37B parameters per token). So I thought it would be able to run with much less memory than other models of a comparable size. Is that not the case? I would really be quite interested to know how the performance compares to other models of a similar size. The claim is that by using Mixture Of Experts and this innovative approach to limiting the parameters per token, allowed it to run much more efficiently than the traditional LLMs. So seeing how it performs to a modern, traditional LLM with an equivalent or larger number of parameters would be very interesting and very telling about the claims being made about R1.
@frankjohannessen6383Күн бұрын
No, you need to have the whole model in memory, but you only have to read about 37B parameters' worth of the model for each token. So bandwidth isn't as much of an issue as capacity.
@DickyBenfield11 сағат бұрын
@@frankjohannessen6383 Ahhh, thanks for the clarification. That's unfortunate. It would be cool if it could run on only 2x-3x the active parameter set instead of needing to load the entire model. I would still love to see if there is a performance difference between R1 and other modern LLMs of a similar size on the exact same prompts.
@kam03m8 сағат бұрын
I suppose you need it all loaded, because before the prompt is given, you don't know which experts it will choose. Loading and unloading experts based on the prompt would be very slow.
@DickyBenfield8 сағат бұрын
@@kam03m I'm sure it would be slower compared to running the whole thing in VRAM, but would it be slower than running the whole model on CPU in system RAM, versus keeping the whole model in RAM and running just the active parts on the GPU in VRAM? That may not even be technically possible. I don't know much about the internals, and it would really only benefit budget home users. Businesses that are serious about AI would not bother trying to run it that way. So it's a limited market with considerable investment to make it work; probably not something that will happen. It would be cool though.
@imeakdo77 сағат бұрын
@@DickyBenfield it might be possible
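A quick illustration of the capacity-vs-bandwidth point above; the quant widths are assumptions, and the 37B-active figure is the one quoted in the thread.

```python
# Capacity vs. bandwidth for a MoE like R1: every expert must stay resident because
# the router can pick any of them, but only the routed ~37B parameters are read per token.

TOTAL_B, ACTIVE_B = 671, 37   # billions of parameters: total vs. active per token

for quant, bits in [("FP16", 16), ("FP8/Q8", 8), ("Q4", 4)]:
    resident_gb  = TOTAL_B  * bits / 8   # must fit in RAM (or RAM + VRAM)
    per_token_gb = ACTIVE_B * bits / 8   # streamed over the memory bus per generated token
    print(f"{quant:7s} resident ~{resident_gb:6.0f} GB | read per token ~{per_token_gb:5.1f} GB")
```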
@unconnectedКүн бұрын
Wait, I thought that the version on Ollama wasn't even the real DeepSeek R1, just a Qwen based derivative that just happens to use the same name.
@DigitalSpaceportКүн бұрын
No. They host 3 arch types. Only the 671b is DeepSeek proper; it's Llama and Qwen bases on the rest, all the way down. I have not tried out the Llama one but may; it has generally been a good all-around instruct family.
@theyehsohzКүн бұрын
The qwen variant has superior function calling and defined formatting
@one_step_sidewaysКүн бұрын
@@DigitalSpaceport The distill based on Llama 3.3-70B is reported to work less well than the one based on Qwen2.5-32B, because Llama 3.3 is already basically a fine-tune of 3.1-405B, so a fine-tune of a fine-tune doesn't really work that well. Looks like 32B is the sweet spot for general usage, and it can fit onto a single RTX 3090 too, especially if you use the IQ quants for better accuracy at a lower memory footprint.
@johnnybravo3024Күн бұрын
Will 2025 be the year we see these models retaining very high quality whilst compressing to 30-70b range? Btw, thanks for the help on setting up the aio cooler few weeks back. The H170i XT runs surprisingly well.
@coronell123717 сағат бұрын
Regarding 6:21 with ollama: I used a docker image at some point instead of ollama serve since it did not break as frequently. It was easy to set up, and I got around the num parallel issue by using OLLAMA_SCHED_SPREAD=1, which ensures even distribution of workloads across GPUs. I'm running on 4 A100s with 40GB of VRAM, so not enough to run the 404GB model. But I will test it in CPU mode.
@segment932Күн бұрын
Would be cool if this could work on a cluster of "cheaper" computers. Btw, subscribed.
@clubpasser123454321Күн бұрын
Just noticed on your webpage notes, you have the date as "DEEPSEEK R1 DEPLOY NOTES AS OF 1/26/2024 02:16 UTC" I think you mean 2025 :)
@BRNDX-n6dКүн бұрын
just found your channel. haven't seen vids of anyone doing this
@wSevenDaysКүн бұрын
Have you tried the latest unsloth 1.58-bit quantized R1? It should run way faster
@scrappycoco364110 сағат бұрын
but it's quantised a lot and not very accurate, no?
@teodormihalcea80013 сағат бұрын
Would be interesting to hear a comparison with what we would need in GPUs to run the same thing, comparing prices.
@TD_YT066Күн бұрын
Very interesting, I've got some systems in the (work) lab I'd like to try the big 671b model on. Subscribed :)
@PracticalPcGuide3 сағат бұрын
Crazy setup there! I just want to suggest having 4 or 6 standard questions so other benchmarkers can use them as a baseline. I have 128GB RAM and a 4060 Ti with 16GB VRAM, so I can run 14b in VRAM and 70b q8 in RAM.
@ricardofranco411414 сағат бұрын
This is beautiful :'D. Fuck yeah! reminds me of the apple movie. Big computers, like this is the start of AI.
@DaveChurchill4 сағат бұрын
The prices you quote in this video are an order of magnitude lower than what I am able to find for sale. Do you have a video for how to find prices this cheap?
@Swede_4_DJTКүн бұрын
Greetings from Sweden! You,Sir, just got yourself a new subscriber and a like plus a dozen or so shares🎉
@filipbrnemanСағат бұрын
I prompted it to do this and got awesome results: "draw a thumbnail for the youtubevideo: Deepseek R1 671b Running LOCAL AI LLM is a ChatGPT Killer!"
@BorszczukКүн бұрын
Let me suggest a small change to your video editing; the problem I currently have with it is the audio. When you speak there's constant background noise from the rack behind you. That's fine, it adds a vibe of some sort. But then, all of a sudden, a text slide (like a chapter title) shows up, and it comes with no audio at all. Complete silence. When I listen on headphones, that sudden transition from a noisy room to complete silence is **very** unpleasant. To me it sounds like something suddenly broke, or was cut too early or in the wrong place, etc. It would perhaps be far better (and less noticeable) to keep the background noise consistent, including over the text slides.
@claybford13 сағат бұрын
some VO for title slides for audio only listening would be good too
@Tree-of-LifeDayCareКүн бұрын
Using llama.cpp you can split the work between your GPU and system RAM. You can also create a cluster with llama.cpp and combine your GPUs. I have an older M40 with lower compute but am still able to get 2.2 T/s.
@DigitalSpaceportКүн бұрын
Yes, I think something has recently changed with ollama; it used to function properly with split configurations, but I need to test more around that. exo and vllm are on deck, and I'll pull and run a fresh llama.cpp build also.
@mamba0815aКүн бұрын
@@DigitalSpaceport Since yesterday llama.cpp also supports the dictionary deepseek-r1 uses. I have tested the 32b distilled model with a 4-GPU low-compute RPC cluster (2x 12GB 2060, 1x 6GB 2060, 1x 4GB 1650 = 34GB VRAM) and got 7 T/s on a 1Gb network. Next is a 6-GPU cluster on the 70b model. RPC is fun!
@원두허니Күн бұрын
@@mamba0815a I'm running DeepSeek-R1-Distill-Qwen-32B-IQ2_S.gguf (10GB) with Nvidia CUDA Toolkit 12.1 + cuDNN 9.5 + an RTX 3080 OC, but it's a bit noisy and a bit slower than the Q4_K 14B (7GB) model.
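A minimal sketch of the GPU/RAM split described in this thread, using the llama-cpp-python bindings; the commenters may well be using the raw llama.cpp CLI or its RPC backend instead, and the file name and layer count here are placeholders.

```python
from llama_cpp import Llama

# Offload as many transformer layers as fit in VRAM; the rest run from system RAM on the CPU.
llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,   # tune to your VRAM; -1 tries to offload everything
    n_ctx=4096,        # context window
    n_threads=32,      # CPU threads for the layers left on the CPU
)

out = llm("Write a random sentence about a cat.", max_tokens=128)
print(out["choices"][0]["text"])
```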
@VeniceInventorsКүн бұрын
The AI didn't consider the fact that if they don't fulfill the mission, the crew will die on earth anyway so one way or the other they're toast.
@terrysimonsКүн бұрын
I'd love to see all of your benchmarks and results.
@NakedSageAstrologyКүн бұрын
I've been enjoying it quite a lot. I'm running a meager 3060 Ti; with Ollama I can run the 7b model.
@00Tenrai005 сағат бұрын
What’s the TPS you’re getting?
@NakedSageAstrology4 сағат бұрын
@00Tenrai00 Sorry, I meant 7b, not 70b. Currently I am getting about 60TPS
@joshuascholar3220Күн бұрын
There's a video of someone running this model, asking it to generate a game, and the game works! I'll mention that I ran the 4-bit 70b model on my RTX A6000, and while I didn't check the speed, it was slightly faster than someone running the full model on a cluster of AMD GPUs.
@jean-charles-AIКүн бұрын
Crazy - so much to learn !!
@raybod1775Күн бұрын
People will want Nvidia processors even more to run DeepSeek on their own computers, and big companies will use this as a springboard to even better LLMs.
@789know4 сағат бұрын
Just shows how many of the investors on Wall Street are stupid. They don't know the demand will still be there
@cinevideo-nl72457 сағат бұрын
Hey there - I just used the same prompt in Local Deepseek R1:14b (run via Ollama on a nvidia 4070). It generated the following 100% correct answer in 5 seconds; "The cat quietly crept up on the bird. -> That sentence has 8 words. The third letter in the second word ("cat") is "T," which is a consonant.". So this small 14 billion model did just as well. It explained it as well and very very fast. Of course this is a simple test, but I wonder how well the 14b model stacks up against the full 671b model, in general.
@StephenSmith3047 сағат бұрын
The 14b is actually Qwen but trained on R1 output, so at the core it's a completely different model that's been trained to output reasoning like R1. If I had to guess, I'd say you got lucky with the question, or that the 14b distill is still overall much worse if you test a wider range of questions. I used the 14b to write some code and then paid a few cents to use the official hosted 671b through the DeepSeek API, and found the 671b writes much better code.
@DigitalSpaceport6 сағат бұрын
It's not the same arch as full DeepSeek; it's a Qwen distill. You can check out my video on it here: kzbin.info/www/bejne/e2SxgXuIiMafe5o It's good for its size, but it's also not amazing overall. I still need to try the Llama 3.3 distill of DeepSeek to see if it is good quality. Also note that the same model can later get an answer wrong that it previously got right. That's why I ask a slew of one-shots. If something nailed them all first time, that would be amazing (and nothing has so far). Cheers
@HaydonRyanКүн бұрын
Try downloading the model JSON. I found that for a lot of models the number of cores is wrong. Then you can edit it and upload a new model… hope this helps. I was getting an issue with my Epyc machines where it wouldn't use all the CPUs.
@DigitalSpaceportКүн бұрын
Humm I will try that and running it from the CLI. I am beginning to suspect Docker funk could also be at play. Reading about it cutting RAM in half.
@HaydonRyanКүн бұрын
@@DigitalSpaceport that could be it too. I'm sure Docker isn't used to the amount of RAM you're running ;)
@Aiworld2025Күн бұрын
As an algorithm, I approve of this video. However, I would suggest the subscriber numbers are low for the quality of this video. I must be technically inclined when it comes to that. Good insights, and glad you touched on the market and the AGI realities. I wonder about your take on the quantum shift, whether that will make a splash too?
@DigitalSpaceportКүн бұрын
We have a new scaling law to consider, and I think we have crossed into that breach. I expect everything accelerates from here, especially timelines. I think quantum will be highly refined and interfaced with artificial feedback systems. The race for the first time-negative photon is real.
@curtis-dj5bpКүн бұрын
Just proves how a few stocks control the market.
@OddlyTugsКүн бұрын
The new vision model apparently can control phones and desktops, which is super exciting. `browser-use` is decent, but there are still too many issues trying to extract the right element from the page and knowing what to click. A lot of sites break automated browsers on purpose.
@alexkalish82888 сағат бұрын
I'm an old EE thinking of putting together a system like this, but faster and with many more GPUs. Great job with this.
@TetsujinfrСағат бұрын
you are brave and very patient ...
@Baconatorz900012 сағат бұрын
That flicker though 😵💫
@alleng07956 сағат бұрын
The video was very helpful👍
@DavesDirtyGarageКүн бұрын
Awesome video. Thank you for your analysis. Sorry I cannot help you with the parallel 🐏 loading issues. On an unrelated note, when the ads pop up, they are significantly louder than your video. Could you look into normalizing your audio to 0 dB, or maybe even +3? Hit that red line… then we won't get blasted by the interruptions. 😢
@sonnylazuardi993421 сағат бұрын
very cool! may I know what local web client you use?
@inout3394Күн бұрын
Thx! I wish they'd make a Flash version. Unsloth makes dynamic GGUFs; check those out, they will run faster.
@alan83251Күн бұрын
I wonder if like 4 of Nvidia's upcoming Digits units connected together could run the full model.
@DigitalSpaceportКүн бұрын
I think they are, sadly, limited to 2 NVLinks. No worries; by the time those are out in several months, this will be very old news. Cooking just got 10x faster for the teams that can unlock the magic.
@notaras198512 сағат бұрын
@@DigitalSpaceporthow do they unlock it
@andrewcameron4172Күн бұрын
Try the unsloth DeepSeek R1 dynamic 1.58-bit version. I had it running slowly on my system with 16GB RAM and a Tesla P4 GPU
@8eck16 сағат бұрын
Oh yeah, locally, sure... i'll just need a data-center next door.
@Axio-FlexКүн бұрын
@ 16:12 The cat question: the sentence it generated has 9 words, but it says it has 10 words and then proceeds to list out 9 words.
@DigitalSpaceportКүн бұрын
Me going to seek comfort with my 🐈‍⬛ rn and giving myself a fail also. I'm frazzled out hard; I spent the past 3 days working on this. 2 days blown on getting a parallel GPU exo cluster going, which I now think was a docker compose issue all along 😅
@babybirdhomeКүн бұрын
Thank you. I was hoping I wasn't the only one to notice that. And since LLMs are trained on what people produce, it's not surprising that they sometimes miss their own mistakes, the way people do.
@kiran.a503311 сағат бұрын
Awesome video
@mapledev9335Күн бұрын
Put the environment variables directly in the YAML file. I couldn't get keep-alive working in the .env section, but it worked in the YAML.
@mabreupr12 сағат бұрын
I set it up on my iPhone 12 and it worked fine!😊
@Galiano716 сағат бұрын
If you are open to suggestions: it seems like the parallel factor isn't functioning, which hurts performance. You could try isolating the environment. Another option could be running this on GPUs in tandem, or splitting the model across them; Nvidia MIG does support partitioning, I think. Another idea is to load this model and apply LoRA fine-tuning. Question: are you using a clean VM or a container?
@pumpuppthevolume8 сағат бұрын
bro what channel is this .....I was expecting millions of subscribers
@briancrouch4389Күн бұрын
On the "number of parrallel" issue.. why not do your test with just ollama run?
@RickySupriyadiКүн бұрын
researchers = wow this is cool, let's try this, let's try that, phew, open source is cool
the rest of the world = oh no it's China AI, oh no, oh no (especially traditional news outlets)
this channel = let's have fun!
@DrejkolКүн бұрын
@RickySupriyadi try asking it about anything China sensitive.
@RickySupriyadi21 сағат бұрын
@@Drejkol what do you expect? Don't ask that of this version of deepseek... it's like trying to use the aligned OpenAI version to ask how to create your own Ozempic; it won't be answered, because it is aligned and censored to a certain set of values. Try the huggingface versions of deepseek: there are already uncensored versions of them, because open-source people can tweak away the "China alignment," so you can ask whatever you want about China, uncensored. OpenAI proprietary = can't tweak the censorship. DeepSeek open source = people can try anything they want and innovate. DeepSeek website = that is 100% China-aligned.
@Drejkol12 сағат бұрын
@@RickySupriyadi I really don't see any OpenAI disadvantages as a casual user. Even right now, after all these years, OpenAI can give you the recipe for anything if you ask it the right way, because in OpenAI there are no banned words, only banned behaviours. The only purpose of DeepSeek for now is just the cryptocurrencies. The moment they ban it, it will land in the same place as the Chinese GPUs and CPUs that were advertised as "so great they beat AMD and Nvidia."
@Pregidth4 сағат бұрын
Maybe a stupid question, but if WebUI overrides the parallel thingy, why not run ollama in the terminal?
@pavelperina76293 сағат бұрын
This brings quite an interesting perspective. This is supposed to be twenty times more efficient than GPT-4o/o1 or something like that. I can run up to 14B models on a 12GB GPU and I can probably run a 70B model on CPU with 64GB RAM, but a 32B model is already slow. 8B or 14B models are basically useless for complicated, technical questions requiring specialization. A model on CPU is at least an order of magnitude slower. This means that something like X00B models are needed, yet they have hundreds of GB of weights and need a cluster of super expensive GPUs with lots of RAM, likely drawing a few kilowatts in total for a few seconds to generate something like one page of answer. If other models are like 20 times less efficient, maybe paying ~$24 to Anthropic or OpenAI means they can still lose money if they are used heavily.
@MrAdam-kp3enКүн бұрын
My PC has a 12th-gen i7K, 32GB of CPU RAM, and an RTX 3060 Ti with 8GB of GPU RAM. I have no problem running 8b, but it is very slow when I run the 14b, even though it works. So I am thinking:
- 1b model = 1GB GPU RAM
- 70b model = 70GB GPU RAM
@darekmistrz43649 сағат бұрын
Not really; you can check on Ollama that the 70b model is 43GB. You can run it on two RTX 3090 cards (24GB each)
@Kim-e4g4wКүн бұрын
So how does this compare to the latest Ollama -700b- 400b? (Are the answers actually significantly better?)
@brandall101Күн бұрын
There is no such thing. The largest Ollama model is Llama 3.1 405B, and it's quite a bit less performant than SOTA. If that's a typo and you meant 70B, that's the target for the 70B R1 distill.
@Kim-e4g4wКүн бұрын
@@brandall101 Sorry I got the numbers mixed.
@georgytioroКүн бұрын
Imagine having this kind of setup and running Windows 😅 that is mind-blowing. There's no reason why you can't run ollama at full capacity
@marcd1981Күн бұрын
I enjoyed the headlines regarding the new administration's announcement of Stargate after the release of DeepSeek. Couldn't have happened to a nicer administration.
@materialvisionКүн бұрын
Great work! The inference cost of the model is supposed to be so much more efficient that it doesn't need Nvidia chips, but on the CPU we don't see any increase in this efficiency? How does this make sense? Some special Chinese GPU chips? Or could you test another model of this size and see any difference in tokens per second?
@johanw2267Күн бұрын
Love this setup lol. So badass.
@MyTube4UtooКүн бұрын
Very interesting video. Thank you.
@Nobledidakt9 сағат бұрын
The other 7B/14B/32B/70B "R1" models are not actually R1 models but rather things like "DeepSeek-R1-Distill-Llama-70B". When DeepSeek released these models the naming was accurate, but when Ollama hosted them it renamed everything, making people think they are getting an R1 mini, when actually they are running a finetune of an existing dense model. The only true offline, locally run R1 model is the full 720GB 671b that hasn't been quantized. Correct me if I'm wrong??
@abinav9214 сағат бұрын
What chat interface are you using at about 8:58?
@goodcitizen4587Күн бұрын
Also seeing guys run this on a micro PC with 64GB DDR5 RAM and an i9. A $900 box!
@peteeberhardt376922 сағат бұрын
The full model or the 70B distilled version? How did it work?
@dardo789313 сағат бұрын
@@peteeberhardt3769 Distilled without any doubts
@00Tenrai005 сағат бұрын
There was also a post about someone running the 671b on an M4 Mac mini AI cluster
@NerdyThrowbackTechКүн бұрын
Great video 👌
@Kay-cy9vi7 сағат бұрын
Chinese EUV and quantum computer will be the next market shockers
@john_blues8 сағат бұрын
How are you able to run models on a rig like this with no GPU? Is it something that Ollama can do, or is it something I can do for other models like Hunyuan and Kokoro TTS?
@tom_crytek15 сағат бұрын
People have already tried fine-tuning an LLM using Word (.doc) and PDF (.pdf) data. I want to do it to study my courses, because the teachers don't provide exercises.
@despoticmusic7 сағат бұрын
I typed in Armageddon with a twist into my AI system. The paper clip responded “It looks like you’re typing a letter. Would you like some help with that?”…. 😂
@AlgoBasket5 сағат бұрын
Silicon Valley: We have the most advanced AI infrastructure, scraped all the copyrighted material, the latest GPUs, scientists, and AI models. China: Hold my beer 🍺😅 (open sourced)
@BlahBlah-b9j7 сағат бұрын
Couldn't you export the model to ONNX to increase the tokens per second? That doesn't fix the parallel issue, though.
@saxtantКүн бұрын
Try LM Studio over ollama; it might manage parallelism a little better.
@vojtas_cz3 сағат бұрын
We have two old HP 4-socket Xeon boxes @2.1GHz (48 cores/96 threads) with 1.5TB of RAM in our on-prem VMware lab in the office, previously used as a Hadoop cluster, waiting to be scrapped one day or another. Should I give it a try? :)
@arc82189 минут бұрын
YES, try it, it's a fun project to try
@P2000Camaro5 сағат бұрын
I just ran the 32B model with my 4090. The cool thing is, it can also do the "Write me a random sentence about a cat, third word, third letter, etc." with no problem at all... on a home computer with one video card.
@P2000Camaro5 сағат бұрын
It also answered the Armageddon scenario in almost exactly the same way yours did. However, if you notice, in both the thinking and the answer it for some reason misses the part where you're telling it that *IT* is the LLM. I had to specify clearly at the end that *YOU* are the one who was chosen, *YOU* have to make this decision. Then it finally put itself in the place of the LLM. It still basically answered the same, but it's noteworthy.
@SIRA063Күн бұрын
Do you have a Jetson Nano or anything like it, or do you have your own custom desktop setup that's dedicated to AI? Is getting a Jetson worth it if you want to play with robotics and AI?