Writeup for the DeepSeek R1 671b setup and running it locally: digitalspaceport.com/running-deepseek-r1-locally-not-a-distilled-qwen-or-llama/
@PankajDohareyКүн бұрын
Everyone was worried about losing their job because of GPT. Now GPT lost its job!
@ShihanQuКүн бұрын
Please use this same system to test Deepseek R1 70b, so we know what the cost/effectiveness trade off is
@davefroman4700Күн бұрын
The market dumped BECAUSE this is open source. Scarcity is what creates value in this economic model, whether that scarcity is real or artificially created by limiting production of the item. The true enemy of capitalism is not socialism or communism. It's abundance.
@brainstormsurge154Күн бұрын
You know what would really be impressive? Running a language model on a memristor.
@RickySupriyadi21 сағат бұрын
@@davefroman4700 You have cool thoughts. May I borrow your wise words about abundance, to unite humankind toward advancing its civilization?
@PixelPiКүн бұрын
You can significantly increase your token rate by enabling memory interleaving and Directory AtoS in the BIOS, as the bottleneck in LLMs is bandwidth rather than latency.

For Stale AtoS (Directory AtoS), the in-memory directory has three states: I, A, and S. I (invalid) means the data is clean and does not exist in any other socket's cache. A (snoopAll) means the data may exist in another socket in exclusive or modified state. S (Shared) means the data is clean and may be shared across one or more sockets' caches. When doing a read to memory, if the directory line is in the A state we must snoop all the other sockets because another socket may have the line in modified state. If that is the case, the snoop will return the modified data. However, it may be that a line is read in A state and all the snoops come back as misses. This can happen if another socket read the line earlier and then silently dropped it from its cache without modifying it. If the Stale AtoS feature is enabled, then when a line in A state returns only snoop misses, the line transitions to S state. That way, subsequent reads to the line find it in S state and don't have to snoop, saving latency and snoop bandwidth. Stale AtoS may be beneficial in a workload with many cross-socket reads, such as with memory interleaving.

If you don't have AtoS available in the BIOS, try other snoop modes and use the Intel Memory Latency Checker (mlc) tool to find the snoop mode that offers the highest bandwidth. You can further confirm which settings are right for you by running the Intel MKL LINPACK test; LLMs are very dependent on linear algebra, so LINPACK is a good stand-in benchmark for overall LLM CPU performance.

Also, Cascade Lake-SP has Intel's Deep Learning Boost on Xeon Gold and Platinum SKUs, and the AVX-512 VNNI instruction alone will basically double your token rate. But beware: Cascade Lake-SP qualification/engineering samples don't have AVX-512 VNNI because they can't receive the latest microcode updates. Another step up after that is Cooper Lake-SP, which has bfloat16.
@spencersmith7769Күн бұрын
By how much? Like going from 2 tps to 15? Or is it only like 2 tps to 5 lol
@maxmustermann194Күн бұрын
Nerd. I like you.
@Zeni-th.20 сағат бұрын
How the hell did humans come up with this
@andrewazaroff12 сағат бұрын
He also needs to check interleaving and channel population. I would suggest even the cheapest Epyc 9xxx against any Xeon, both for memory bandwidth and for the quality of its AVX-512.
@lev1ato11 сағат бұрын
wish I knew 10% of what you just said 😭😭
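To put rough numbers on the bandwidth argument above (and the "by how much" question), here is a back-of-envelope sketch. It assumes R1's roughly 37B active parameters and a Q4-class quant; the bandwidth figures and the 75% efficiency factor are assumptions, not measurements.

```python
# Back-of-envelope: token generation on CPU is roughly memory-bandwidth bound,
# so tokens/s is capped at (usable memory bandwidth) / (bytes streamed per token).
# All numbers below are illustrative assumptions, not measurements.

def max_tokens_per_second(bandwidth_gb_s: float, active_params_b: float, bytes_per_param: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound model."""
    gb_read_per_token = active_params_b * bytes_per_param   # GB of weights streamed per token
    return bandwidth_gb_s / gb_read_per_token

ACTIVE_PARAMS_B = 37    # R1 is a MoE: ~37B of 671B parameters are active per token
Q4_BYTES = 0.55         # ~4.4 bits/weight for a Q4_K-style quant (assumption)

platforms = [
    ("6-ch DDR4-2400, 1 socket", 115),   # theoretical peak GB/s
    ("6-ch DDR4-2933, 1 socket", 141),
    ("8-ch DDR4-3200 Epyc",      205),
    ("12-ch DDR5-4800 Epyc",     461),
]
for name, peak in platforms:
    usable = peak * 0.75    # real systems rarely sustain more than ~75% of peak
    print(f"{name:26s} ceiling ~{max_tokens_per_second(usable, ACTIVE_PARAMS_B, Q4_BYTES):.1f} tok/s at Q4")
```

Snoop-mode and VNNI tuning like the comment describes can move a system toward that ceiling for decode, but not past it; compute features mostly help prompt processing.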
@paelneverКүн бұрын
Stock prices of big cloud providers and chip makers shouldn't fall. Now that everybody can run and even train their own state-of-the-art models by following the recipe provided by DeepSeek, the need for compute is going to spike like crazy. The only ones concerned should be closedAI and other big corps that already invested fortunes training inefficient models. Clearly many people freak out because they don't understand that open source is good for everybody except closed-source fanboys.
@tringuyen7519Күн бұрын
DeepSeek is good & should help AMD spread ROCm over NVDA CUDA. But expecting everyone to have a rack mounted server with 300G of DRAM is a bit too optimistic…
@FenrirRobuКүн бұрын
@@paelnever no, now the need is low because they gave their weights with an MIT license. This severely limits the need for new large scale training.
@paelneverКүн бұрын
@@FenrirRobu All experts in AI are saying that the need for compute (including large-scale training) is going to increase. How do you think the big Western AI labs are going to respond? They are going to apply the same recipe that DeepSeek applied, but with much more compute power, and that way the race for AI keeps going as the models get better and better. In fact, Nvidia stock is already recovering. But don't let that shame stop you from talking about what you don't know.
@FenrirRobuКүн бұрын
@@paelnever Yes, knowing about LoRAs, fine-tuning, and retraining existing models the way it was done with Llama is surely a big shame for me. Maybe I should also call DeepSeek's method a "recipe" rather than say "mixed FP8 and FP16 training" or "the efficiencies of MoE models." What's the point, you're probably powered by ChatGPT anyway.
@branduncensoredКүн бұрын
except for techbrozzzzzz
@clubpasser123454321Күн бұрын
I ran 671b locally too, in RAM, on an Epyc 7C13 with 1TB of RAM. The speed seemed comparable to what you are seeing. Not particularly practical, but it works. I found the 7C13 a great-value processor for a home server built on older tech. Love your content, thanks for sharing.
@n.lu.xКүн бұрын
I too have built a rig with the 7C13; the thing is a beast and has great perf per $. Which motherboard do you have? I'm running it on the MZ32-AR0 rev 3.0 motherboard and currently have only 8 slots occupied (512GB of RAM).
@DigitalSpaceportКүн бұрын
Meaningless errata: you can flash a Rev 1 up to a Rev 3, and the 7C13 looks like an amazing CPU!
@clubpasser12345432123 сағат бұрын
@@n.lu.x I have a Gooxi G2SERO-B: 10 PCIe slots and 16 DIMM slots. I designed my own single-height water block for the RTX A5000 and will have 6, but only 2 are installed so far; the rest go in next week after the holiday. I put in the full 16×64GB but only 2666MHz. It was very cheap and I couldn't help myself. Question for you: how hot does your RAM get? Mine will burn up if I don't have the fans on. With fans, it hovers around 50-60°C.
@igordasunddas337721 сағат бұрын
Man, I should upgrade my AMD 8004 platform...
@n.lu.x14 сағат бұрын
@ Interesting, you went the full server route with the A5000s. I went with 4x RTX 4090s with C-Payne risers and dual AX1600 PSUs in the same mining rig as @DigitalSpaceport. Waiting to see if I can get an additional 5090 in a few days! As far as RAM goes, I just checked and damn, it's at 60-64 degrees, in an open chassis too. The system seems stable so far doing DL, but if I get stability issues I'll install a fan.
@differentmovesКүн бұрын
This is actually nuts. Can't wait for folks to stack up a bunch of Digits to run these models.
@blubberkumpel6740Күн бұрын
@@differentmoves stupid Question: What does that mean?
@GubbaGaminКүн бұрын
Digits will only support 2 linked together.
@anthonyperks2201Күн бұрын
@@GubbaGamin yeah, but just imagine a couple of M4 512GB Ultras. I can dream.
@steelstagerКүн бұрын
@@GubbaGamin Out of the box. Some madman will figure out how to get a few dozen to work together.
@steelstagerКүн бұрын
@blubberkumpel6740 Project Digits is Nvidia's prebuilt AI solution for people who don't want to build their own machine. We should know more soon as they will go on sale later this year.
@阿斯顿法国红酒快-t9lКүн бұрын
Can you please make a video comparing the difference between r1 different parameters? Like 671b vs 70b to show the cost and efficiency trade-off.
@samorostczКүн бұрын
Good idea. I would even love to see one smaller model for curious ppl without advanced hw. ✌
@notlogic.the.secondКүн бұрын
fr. i want to see that too
@S-K69Күн бұрын
I've run 70b, 32b, 14b, 8b, and 7b locally. I will say it is very cool being able to run these models locally, BUT they are by no means replacements for the premium models currently on the market. Yeah, 32b is on ChatGPT's heels in a bunch of benchmarks, but in practical use it fails miserably, even on basic reasoning prompts.

For example, I asked it to tell me a country that ends in "lia" (Australia). In its reasoning it starts listing through countries, it SPECIFICALLY lists Australia and says "but no, that doesn't end in lia," and ultimately determines that Mali is the country that ends in "lia". The code quality it produces is very uninspiring. For example, I asked it to make a chess game utilizing Python and it spit out a Tic-Tac-Toe game. Even after trying to correct the model, it just couldn't get to anything even resembling chess.

I don't feel like doing a full blog post, but to me it is currently a cool novelty that's not very reliable. Nonetheless it is a huge step forward and I expect nothing less than constant improvements with it being open source.
@ShihanQuКүн бұрын
Yes please!
@bothost7043Күн бұрын
i want to see that too
@DogappelКүн бұрын
Tried this on my 8gig ram laptop. The drive filled so much with paging files it crashed and it's still full.
@DigitalSpaceportКүн бұрын
🤣
@ZAcharyIndyКүн бұрын
bruhh...
@honor9lite1337Күн бұрын
How much disk space is needed?
@ZAcharyIndyКүн бұрын
@@honor9lite1337 If I'm not mistaken, the 671B model requires at least 400GB of disk space 😅
@DogappelКүн бұрын
@@honor9lite1337 400 gigs or something, but this is a huge model. The 7b variant is only 4 gigs.
@tristanvaillancourt5889Күн бұрын
I knew it was going to be slow, but WOW. Congrats on actually running it however. Love your stuff. I guess 70b would be the max a normal human can run. Would love a tour of the Grok datacenter. lol.
@DigitalSpaceportКүн бұрын
I have a feeling exo will be the answer here. I have a strong feeling on that. Working on it. However, yes, for now it's 2 tps max on a 7702 with 512GB and a silly 4096 ctx. I have to get that parallel part figured out asap. E7-V4 showing its age.
@TillmannHuebnerКүн бұрын
you can run 70b on a 20GB RTX 4000 Ada
@BetterBumsКүн бұрын
@DigitalSpaceport you might have some error in config , it should be fast
@ruudh.g.vantol4306Күн бұрын
I always get confused by the units. Lowercase-g are grams, lowercase-b are bits?
@anthonyperks2201Күн бұрын
@@ruudh.g.vantol4306 Yes.
@jeffwadsКүн бұрын
Sweet. Hoping you would do this. Now people can comprehend real world compute.
@DigitalSpaceportКүн бұрын
If the electric company shows up at your door with an R930 rebadged as garage AGI for free, be very sus.
@CO8848_2Күн бұрын
Skynet has been activated, a 3D-printed Terminator will be coming out of his garage tomorrow
@DigitalSpaceportКүн бұрын
🤣
@MrKiar161110 сағат бұрын
And the Terminator is made of PLA with layer line issues on the Z axis.😎
@LynxNYC8 сағат бұрын
My suspicions were correct. Thanks for the most detailed and in-depth video!
@markldevineКүн бұрын
Now we all know. Weights-in-memory and GT/sec are still required. Thanks for slogging through this with all of that equipment. Excellent content value!
@DigitalSpaceportКүн бұрын
👍
@LinuxRenaissanceКүн бұрын
Quite amazing to see this! I just played with a 14b model, as I can't load a better one, but your server served an amazing purpose today. Thanks!
@coob967812 сағат бұрын
this feels like the type of video you'd hear about in 20, 30, 40 years while using your phone with 1TB of RAM, capable of running a much more convoluted model within seconds, the way we hear about the moon landing using just a few KB of RAM today. can't wait for what the next years of technology hold for us
@DigiByteGlobalCommunity5 сағат бұрын
Thank you so much for the content - we really want to run r1 locally! The dream of accessible, open-source personal AIs is closer than ever
@Bodofooko19 сағат бұрын
This really feels like watching something from the future. But it's now.
@SonofaTech9 сағат бұрын
Yo you got your first viral!!! Hell ya! Happy for you.
@FuZZbaLLbeeКүн бұрын
The 35b also gives some nice results and it fits in 24GB of VRAM. I would need 20 of these cards to run this model on GPUs 😋
@matteominellonoКүн бұрын
@@FuZZbaLLbee thank you for the insight, I was wondering if I could test a distilled model on my RTX 3090 with 24GB VRAM. 🙏🏻
@S-K69Күн бұрын
@@matteominellono Yes you can. I tested 70b on a 4090 with 24GB VRAM. It ran but was extremely slow. Testing 32b, it ran much faster, but I found its outputs to be very unreliable even on basic reasoning prompts, such as telling me what country ends in "lia" (Australia), which it failed; how many days are between specific dates, which it failed every time (granted the dates were 10+ years apart); and beginner coding prompts, which failed every time (make me a chess game implemented in Python; I got tic-tac-toe 😂). I'd try testing it yourself. It's very cool but falls VERY short of coming anywhere close to ChatGPT, despite all the benchmarks saying 32b is nearly equivalent.
@joshuascholar3220Күн бұрын
@@S-K69 24 GB of VRAM isn't enough for the 70b version. You were bottlenecked by caching. I ran it on a 48 GB A6000 and it's nice and fast.
@amihartzКүн бұрын
you can fit the 32b inside two 3060s, which go for only about $200 these days, maybe even less later in the year when the 5060s release. Makes building a practically useful AI machine actually feasible for not much money.
@klausklausi748414 сағат бұрын
@@joshuascholar3220 I plan to run it with an RTX 3090 and RTX 4090 combined
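Rough arithmetic behind the VRAM fits discussed in this thread. This is a sketch only: the layer counts, KV-head counts, and quant width below are assumptions, and real GGUF files vary by quant mix.

```python
# Rough VRAM sizing for a quantized dense distill: weights + KV cache.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    # params_b is in billions, so the result comes out directly in GB
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx: int, bytes_per_el: int = 2) -> float:
    # x2 for K and V, fp16 cache assumed
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_el / 1e9

# 32B-class distill at ~4.8 bits/weight with an 8K context (assumed GQA: 64 layers, 8 KV heads, head_dim 128)
m32 = weights_gb(32, 4.8) + kv_cache_gb(64, 8, 128, 8192)
# 70B-class distill, same quant and context (assumed: 80 layers, 8 KV heads, head_dim 128)
m70 = weights_gb(70, 4.8) + kv_cache_gb(80, 8, 128, 8192)

print(f"32B-class: ~{m32:.0f} GB  (tight on one 24 GB card, or split across two 12 GB cards)")
print(f"70B-class: ~{m70:.0f} GB  (why a single 24 GB card ends up offloading to system RAM)")
```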
@Super-Intelligent-AI-SEOКүн бұрын
We really appreciate your hard testing.
@eduh79506 сағат бұрын
I was born in the 70s so it's gonna be so cool if I'm alive to see how small of a setup will be needed to run models exponentially more demanding than today's models :D
@lsh3276823 сағат бұрын
I managed to run the distilled 32B on an M3 Max MacBook Pro with 128GB RAM. I dream of the possibility of hosting 671B entirely at home, and you are a godsend, sir!
@ShaferHart20 сағат бұрын
How many tokens per sec?
@lsh3276820 сағат бұрын
@ 17.8 pretty impressive
@ReinerSaddey4 сағат бұрын
Thx for taking on all this effort. In essence it also demonstrates how much of the "real thing" manages to leak into DeepSeek models running on consumer hardware. That's us 🙂
@examplerkeyКүн бұрын
I was looking 😂 for a reason to buy a used last-gen server with a ton of RAM and CPU. When I saw your video, I thought that's it, but wait a min, it's still not enough 😂 The answer from the smartest AI (me) is probably that I can run a smaller model, right? Which DeepSeek model do you think can run smoothly on a decently specced used server? < this took like 3 minutes to squeeze out of my brain RAM. 😂
@DigitalSpaceportКүн бұрын
You need to scrub your brain RAM, likely 🤣, and you should grab the unsloth tune from HF imo (video on that likely soonish if it's good, plus an exo one); it dropped in the middle of me filming this. What counts as decently spec'd in your mind is the real question to answer; then you have the spec to best fit to.
@examplerkeyКүн бұрын
@@DigitalSpaceport Okay, I've scrubbed my brain RAM and let my pet chicken pack all the lice 😂. Now I have an HP ProLiant DL360p Gen8 server, 20 cores, 384GB RAM with several TB of storage. Oh wait, will this even run the smallest model? Shall I talk myself out of it and just use the online version until the 12th hour into WW3? Answer: just use the online chatbox? 😂
@DigitalSpaceportКүн бұрын
You want 768GB of RAM to run the official version with a decent context window. The unsloths go down a lot in size, which is why I would check that first. Start with the chat; OpenRouter has it up, I'd bet.
@jaykrownКүн бұрын
At 16:18 it replied that the sentence has 10 words when it only has 9, so it would be a partially correct response. "Mischievous Luna stealthily knocked over a vase of sunflowers" is only 9 words.
@AnotherComment-rl6fvКүн бұрын
Great effort. Hopefully we can get similar capabilities from smaller models.
@lolololowbx280Күн бұрын
The unsloth version of DeepSeek R1 (with lower precision) could hopefully bring memory requirements down to 132GB. I wonder if you can review the unsloth version.
@one_step_sidewaysКүн бұрын
It will certainly bump the inference speed to something more palatable. But not that much. I had tried a 14B distill and it's barely palatable at 2.5 tokens per second on my Ryzen 3 4450U laptop with 64GB RAM. I'll have to leech off of others' cloud providers instead, it seems.
@benbencomКүн бұрын
Another thing that's causing some slowdown for you is that open-webui's default setting for title generation is to use the current model. You can see how all the chats are named "..." (because it's also not smart about CoT models for naming). So you're getting a whole extra query kicked off each time you start a chat.
@darekmistrz43649 сағат бұрын
He should have used "ollama run"
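For anyone who wants to take Open WebUI out of the loop entirely, here is a minimal sketch that times a single generation against the bare Ollama HTTP API. The model tag is an assumption (use whatever `ollama list` shows on your box); the response fields are the ones in Ollama's documented /api/generate endpoint.

```python
# Measure prompt and generation tokens/s straight from Ollama, bypassing Open WebUI
# (and its extra title-generation request).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
payload = {
    "model": "deepseek-r1:671b",          # assumed tag; substitute your local model
    "prompt": "Write a random sentence about a cat.",
    "stream": False,
    "options": {"num_ctx": 4096},
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    r = json.load(resp)

# eval_count / eval_duration cover generated tokens; durations are in nanoseconds.
gen_tps = r["eval_count"] / (r["eval_duration"] / 1e9)
prompt_tps = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
print(f"prompt: {prompt_tps:.2f} tok/s   generation: {gen_tps:.2f} tok/s")
```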
@napalmsteakКүн бұрын
I’m still new to this, but I had the absolute best luck with LLMs using Unraid bare-metal, installing the NVIDIA plugin, then installing the bundled GPU enabled version of Open Web-UI/Ollama. At least for me ProxMox was an unnecessary headache to get configured perfectly but Unraid works out of the box utilizing ALL hardware available.
@nextjinКүн бұрын
Same, I haven't seen this pop up on the marketplace, or a tutorial on how to get the R1 API to work in Ollama using Unraid, though
@aarond79114 сағат бұрын
Great work! Looking forward to trying this locally.
@tuapuikia16 сағат бұрын
Nice! I'm still rocking my lab Nvidia GPU card. I think lots of people have the misconception that an open-source LLM means it can run on any hardware 😂😂😂😂
@guidopahlberg941311 сағат бұрын
Thanks for highlighting Dell servers with 1.5 TB. I work for another manufacturer that offers 8-way Intel servers with 480 cores and up to 32TB of DDR5 memory.
@DickyBenfieldКүн бұрын
I thought even though R1 was 671B parameters, it used less than 20% of those (37B parameters per token). So I thought it would be able to run with much less memory than other models of a comparable size. Is that not the case? I would really be quite interested to know how the performance compares to other models of a similar size. The claim is that by using Mixture Of Experts and this innovative approach to limiting the parameters per token, allowed it to run much more efficiently than the traditional LLMs. So seeing how it performs to a modern, traditional LLM with an equivalent or larger number of parameters would be very interesting and very telling about the claims being made about R1.
@frankjohannessen6383Күн бұрын
No, you need to have the whole model in memory, but you only have to read about 37B parameters' worth of the model for each token. So bandwidth isn't as much of an issue as capacity.
@DickyBenfield11 сағат бұрын
@@frankjohannessen6383 Ahhh, thanks for the clarification. That's unfortunate. It would be cool if it could run on only 2x-3x the active parameter set instead of needing to load the entire model. I would still love to see if there is a performance difference between R1 and other modern LLMs of a similar size on the exact same prompts.
@kam03m8 сағат бұрын
I suppose you need it all loaded, because before the prompt is given, you don't know which experts it will choose. Loading and unloading experts based on the prompt would be very slow.
@DickyBenfield8 сағат бұрын
@@kam03m I'm sure it would be slower compared to running the whole thing in VRAM, but would it be slower than running the whole model on CPU in system RAM, versus keeping the whole model in RAM and running just the active parts on the GPU in VRAM? That may not even be technically possible. I don't know much about the internals, and it would really only benefit budget home users. Businesses that are serious about AI would not bother trying to run it that way. So it's a limited market with considerable investment to make it work; probably not something that will happen. It would be cool though.
@imeakdo77 сағат бұрын
@@DickyBenfield it might be possible
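A quick illustration of the capacity-vs-bandwidth point above; the quant widths are assumptions, and the 37B-active figure is the one quoted in the thread.

```python
# Capacity vs. bandwidth for a MoE like R1: every expert must stay resident because
# the router can pick any of them, but only the routed ~37B parameters are read per token.

TOTAL_B, ACTIVE_B = 671, 37   # billions of parameters: total vs. active per token

for quant, bits in [("FP16", 16), ("FP8/Q8", 8), ("Q4", 4)]:
    resident_gb  = TOTAL_B  * bits / 8   # must fit in RAM (or RAM + VRAM)
    per_token_gb = ACTIVE_B * bits / 8   # streamed over the memory bus per generated token
    print(f"{quant:7s} resident ~{resident_gb:6.0f} GB | read per token ~{per_token_gb:5.1f} GB")
```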
@unconnectedКүн бұрын
Wait, I thought that the version on Ollama wasn't even the real DeepSeek R1, just a Qwen based derivative that just happens to use the same name.
@DigitalSpaceportКүн бұрын
No. They host 3 arch types. Only the 671b is DeepSeek proper; it's Llama and Qwen bases on the rest, all the way down. I have not tried out the Llama one but may; it has generally been a good all-around instruct family.
@theyehsohzКүн бұрын
The qwen variant has superior function calling and defined formatting
@one_step_sidewaysКүн бұрын
@@DigitalSpaceport The distill based on Llama 3.3-70B is reported to work less well than the one based on Qwen2.5-32B, because Llama 3.3 is already basically a fine-tune of 3.1-405B, so a fine-tune of a fine-tune doesn't really work that well. Looks like 32B is the sweet spot for general usage, and it can fit onto a single RTX 3090 too, especially if you use the IQ quants for better accuracy at a lower memory footprint.
@johnnybravo3024Күн бұрын
Will 2025 be the year we see these models retaining very high quality whilst compressing to 30-70b range? Btw, thanks for the help on setting up the aio cooler few weeks back. The H170i XT runs surprisingly well.
@coronell123717 сағат бұрын
Regarding 6:21 with ollama: I used a docker image at some point instead of ollama serve since it did not break as frequently. It was easy to set up, and I got around the num parallel issue by using OLLAMA_SCHED_SPREAD=1, which ensures even distribution of workloads across GPUs. I'm running on 4 A100s with 40GB of VRAM, so not enough to run the 404GB model. But I will test it in CPU mode.
@segment932Күн бұрын
Would be cool if this could work on a cluster of "cheaper" computers. Btw, subscribed.
@clubpasser123454321Күн бұрын
Just noticed on your webpage notes, you have the date as "DEEPSEEK R1 DEPLOY NOTES AS OF 1/26/2024 02:16 UTC" I think you mean 2025 :)
@BRNDX-n6dКүн бұрын
just found your channel. haven't seen vids of anyone doing this
@wSevenDaysКүн бұрын
Have you tried the latest unsloth 1.58-bit quantized R1? It should run way faster
@scrappycoco364110 сағат бұрын
but it's quantised a lot and not very accurate, no?
@teodormihalcea80013 сағат бұрын
Would be interesting to hear a comparison with what we would need in GPUs to run the same thing, comparing prices.
@TD_YT066Күн бұрын
Very interesting, I've got some systems in the (work) lab I'd like to try the big 671b model on. Subscribed :)
@PracticalPcGuide3 сағат бұрын
Crazy setup there! I just want to suggest having 4 or 6 standard questions so other benchmarkers can use them as a baseline. I have 128GB RAM and a 4060 Ti with 16GB VRAM, so I can run 14b in VRAM and 70b q8 in RAM.
@ricardofranco411414 сағат бұрын
This is beautiful :'D. Fuck yeah! reminds me of the apple movie. Big computers, like this is the start of AI.
@DaveChurchill4 сағат бұрын
The prices you quote in this video are an order of magnitude lower than what I am able to find for sale. Do you have a video for how to find prices this cheap?
@Swede_4_DJTКүн бұрын
Greetings from Sweden! You,Sir, just got yourself a new subscriber and a like plus a dozen or so shares🎉
@filipbrnemanСағат бұрын
I prompted it to do this and got awesome results: "draw a thumbnail for the youtubevideo: Deepseek R1 671b Running LOCAL AI LLM is a ChatGPT Killer!"
@BorszczukКүн бұрын
Let me suggest a small change to your video editing; the problem I currently have with it is the audio. When you speak there's constant background noise from the rack behind you. That's fine, it adds a vibe of some sort. But then, all of a sudden, a text slide (like a chapter title) shows up, and it comes with no audio at all. Complete silence. When I listen on headphones, that sudden transition from a noisy room to complete silence is **very** unpleasant. To me it sounds like something suddenly broke, or was cut too early or in the wrong place, etc. It would perhaps be far better (and less noticeable) to keep the background noise consistent, including over the text slides.
@claybford13 сағат бұрын
some VO for title slides for audio only listening would be good too
@Tree-of-LifeDayCareКүн бұрын
Using llama.cpp you can split the work between your GPU and system RAM. You can also create a cluster with llama.cpp and combine your GPUs. I have an older M40 with lower compute but am still able to get 2.2 T/s.
@DigitalSpaceportКүн бұрын
Yes, I think something has recently changed with ollama; it used to function properly with split configurations, but I need to test more around that. exo and vllm are on deck, and I'll pull and run a fresh llama.cpp build also.
@mamba0815aКүн бұрын
@@DigitalSpaceport Since yesterday llama.cpp also supports the dictionary deepseek-r1 uses. I have tested the 32b distilled model with a 4-GPU low-compute RPC cluster (2x 12GB 2060, 1x 6GB 2060, 1x 4GB 1650 = 34GB VRAM) and got 7 T/s on a 1Gb network. Next is a 6-GPU cluster on the 70b model. RPC is fun!
@원두허니Күн бұрын
@@mamba0815a I'm running DeepSeek-R1-Distill-Qwen-32B-IQ2_S.gguf (10GB) with Nvidia CUDA Toolkit 12.1 + cuDNN 9.5 + an RTX 3080 OC, but it's a bit noisy and a bit slower than the Q4_K 14B (7GB) model.
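A minimal sketch of the GPU/RAM split described in this thread, using the llama-cpp-python bindings; the commenters may well be using the raw llama.cpp CLI or its RPC backend instead, and the file name and layer count here are placeholders.

```python
from llama_cpp import Llama

# Offload as many transformer layers as fit in VRAM; the rest run from system RAM on the CPU.
llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,   # tune to your VRAM; -1 tries to offload everything
    n_ctx=4096,        # context window
    n_threads=32,      # CPU threads for the layers left on the CPU
)

out = llm("Write a random sentence about a cat.", max_tokens=128)
print(out["choices"][0]["text"])
```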
@VeniceInventorsКүн бұрын
The AI didn't consider the fact that if they don't fulfill the mission, the crew will die on earth anyway so one way or the other they're toast.
@terrysimonsКүн бұрын
I'd love to see all of your benchmarks and results.
@NakedSageAstrologyКүн бұрын
I've been enjoying it quite a lot. I'm running a meager 3060 Ti; with Ollama I can run the 7b model.
@00Tenrai005 сағат бұрын
What’s the TPS you’re getting?
@NakedSageAstrology4 сағат бұрын
@00Tenrai00 Sorry, I meant 7b, not 70b. Currently I am getting about 60TPS
@joshuascholar3220Күн бұрын
There's a video of someone running this model, asking it to generate a game, and the game works! I'll mention that I ran the 4-bit 70b model on my RTX A6000, and while I didn't check the speed, it was slightly faster than someone running the full model on a cluster of AMD GPUs.
@jean-charles-AIКүн бұрын
Crazy - so much to learn !!
@raybod1775Күн бұрын
People will want Nvidia processors even more to run DeepSeek on their own computers, and big companies will use this as a springboard to even better LLMs.
@789know4 сағат бұрын
Just shows how many of the investors on Wall Street are stupid. They don't know the demand will still be there
@cinevideo-nl72457 сағат бұрын
Hey there - I just used the same prompt in Local Deepseek R1:14b (run via Ollama on a nvidia 4070). It generated the following 100% correct answer in 5 seconds; "The cat quietly crept up on the bird. -> That sentence has 8 words. The third letter in the second word ("cat") is "T," which is a consonant.". So this small 14 billion model did just as well. It explained it as well and very very fast. Of course this is a simple test, but I wonder how well the 14b model stacks up against the full 671b model, in general.
@StephenSmith3047 сағат бұрын
The 14b is actually Qwen but trained on R1 output, so at the core it's a completely different model that's been trained to output reasoning like R1. If I had to guess, I'd say you got lucky with the question, or that the 14b distill is still overall much worse if you test a wider range of questions. I used the 14b to write some code and then paid a few cents to use the official hosted 671b through the DeepSeek API, and found the 671b writes much better code.
@DigitalSpaceport6 сағат бұрын
It's not the same arch as full DeepSeek; it's a Qwen distill. You can check out my video on it here: kzbin.info/www/bejne/e2SxgXuIiMafe5o It's good for its size, but it's also not amazing overall. I still need to try the Llama 3.3 distill of DeepSeek to see if it is good quality. Also note that the same model can later get an answer wrong that it previously got right. That's why I ask a slew of one-shots. If something nailed them all first time, that would be amazing (and nothing has so far). Cheers
@HaydonRyanКүн бұрын
Try downloading the model JSON. I found that for a lot of models the number of cores is wrong. Then you can edit it and upload a new model… hope this helps. I was getting an issue with my Epyc machines where it wouldn't use all the CPUs.
@DigitalSpaceportКүн бұрын
Humm I will try that and running it from the CLI. I am beginning to suspect Docker funk could also be at play. Reading about it cutting RAM in half.
@HaydonRyanКүн бұрын
@@DigitalSpaceport that could be it too. I'm sure Docker isn't used to the amount of RAM you're running ;)
@Aiworld2025Күн бұрын
As an algorithm, I approve of this video. However, I would suggest the subscriber numbers are low for the quality of this video. I must be technically inclined when it comes to that. Good insights, and glad you touched on the market and the AGI realities. I wonder about your take on the quantum shift, whether that will make a splash too?
@DigitalSpaceportКүн бұрын
We have a new scaling law to consider, and I think we have crossed into that breach. I expect everything accelerates from here, especially timelines. I think quantum will be highly refined and interfaced with artificial feedback systems. The race for the first time-negative photon is real.
@curtis-dj5bpКүн бұрын
Just proves how a few stocks control the market.
@OddlyTugsКүн бұрын
The new vision model apparently can control phones and desktops, which is super exciting. `browser-use` is decent, but there are still too many issues trying to extract the right element from the page and knowing what to click. A lot of sites break automated browsers on purpose.
@alexkalish82888 сағат бұрын
I'm an old EE thinking of putting together a system like this, but faster and with many more GPUs. Great job with this.
@TetsujinfrСағат бұрын
you are brave and very patient ...
@Baconatorz900012 сағат бұрын
That flicker though 😵💫
@alleng07956 сағат бұрын
The video was very helpful👍
@DavesDirtyGarageКүн бұрын
Awesome video. Thank you for your analysis. Sorry I cannot help you with the parallel 🐏 loading issues. On an unrelated note, when the ads pop up, they are significantly louder than your video. Could you look into normalizing your audio to 0 dB, or maybe even +3? Hit that red line… then we won't get blasted by the interruptions. 😢
@sonnylazuardi993421 сағат бұрын
very cool! may I know what local web client you use?
@inout3394Күн бұрын
Thx! I wish they'd make a Flash version. Unsloth makes dynamic GGUFs; check those out, they will run faster.
@alan83251Күн бұрын
I wonder if like 4 of Nvidia's upcoming Digits units connected together could run the full model.
@DigitalSpaceportКүн бұрын
I think they are, sadly, limited to 2 NVLinks. No worries; by the time those are out in several months, this will be very old news. Cooking just got 10x faster for the teams that can unlock the magic.
@notaras198512 сағат бұрын
@@DigitalSpaceporthow do they unlock it
@andrewcameron4172Күн бұрын
Try the unsloth DeepSeek R1 dynamic 1.58-bit version. I had it running slowly on my system with 16GB RAM and a Tesla P4 GPU
@8eck16 сағат бұрын
Oh yeah, locally, sure... i'll just need a data-center next door.
@Axio-FlexКүн бұрын
@ 16:12 The cat question: the sentence it generated has 9 words, but it says it has 10 words and then proceeds to list out 9 words.
@DigitalSpaceportКүн бұрын
Me going to seek comfort with my 🐈‍⬛ rn and giving myself a fail also. I'm frazzled out hard; I spent the past 3 days working on this. 2 days blown on getting a parallel GPU exo cluster going, which I now think was a docker compose issue all along 😅
@babybirdhomeКүн бұрын
Thank you. I was hoping I wasn't the only one to notice that. And since LLMs are trained on what people produce, it's not surprising that they sometimes miss their own mistakes, the way people do.
@kiran.a503311 сағат бұрын
Awesome video
@mapledev9335Күн бұрын
Put the environment variables directly in the YAML file. I couldn't get keep-alive working in the .env section, but it worked in the YAML.
@mabreupr12 сағат бұрын
I set it up on my iPhone 12 and it worked fine!😊
@Galiano716 сағат бұрын
If you are open to suggestions: it seems like the parallel factor isn't functioning, which hurts performance. You could try isolating the environment. Another option could be running this on GPUs in tandem, or splitting the model across them; Nvidia MIG does support partitioning, I think. Another idea is to load this model and apply LoRA fine-tuning. Question: are you using a clean VM or a container?
@pumpuppthevolume8 сағат бұрын
bro what channel is this .....I was expecting millions of subscribers
@briancrouch4389Күн бұрын
On the "number of parrallel" issue.. why not do your test with just ollama run?
@RickySupriyadiКүн бұрын
researchers = wow this is cool, let's try this, let's try that, phew, open source is cool
the rest of the world = oh no it's China AI, oh no, oh no (especially traditional news outlets)
this channel = let's have fun!
@DrejkolКүн бұрын
@RickySupriyadi try asking it about anything China sensitive.
@RickySupriyadi21 сағат бұрын
@@Drejkol what do you expect? Don't ask that of this version of deepseek... it's like trying to use the aligned OpenAI version to ask how to create your own Ozempic; it won't be answered, because it is aligned and censored to a certain set of values. Try the huggingface versions of deepseek: there are already uncensored versions of them, because open-source people can tweak away the "China alignment," so you can ask whatever you want about China, uncensored. OpenAI proprietary = can't tweak the censorship. DeepSeek open source = people can try anything they want and innovate. DeepSeek website = that is 100% China-aligned.
@Drejkol12 сағат бұрын
@@RickySupriyadi I really don't see any OpenAI disadvantages as a casual user. Even right now, after all these years, OpenAI can give you the recipe for anything if you ask it the right way, because in OpenAI there are no banned words, only banned behaviours. The only purpose of DeepSeek for now is just the cryptocurrencies. The moment they ban it, it will land in the same place as the Chinese GPUs and CPUs that were advertised as "so great they beat AMD and Nvidia."
@Pregidth4 сағат бұрын
Maybe a stupid question, but if WebUI overrides the parallel thingy, why not run ollama in the terminal?
@pavelperina76293 сағат бұрын
This brings quite an interesting perspective. This is supposed to be twenty times more efficient than GPT-4o/o1 or something like that. I can run up to 14B models on a 12GB GPU and I can probably run a 70B model on CPU with 64GB RAM, but a 32B model is already slow. 8B or 14B models are basically useless for complicated, technical questions requiring specialization. A model on CPU is at least an order of magnitude slower. This means that something like X00B models are needed, yet they have hundreds of GB of weights and need a cluster of super expensive GPUs with lots of RAM, likely drawing a few kilowatts in total for a few seconds to generate something like one page of answer. If other models are like 20 times less efficient, maybe paying ~$24 to Anthropic or OpenAI means they can still lose money if they are used heavily.
@MrAdam-kp3enКүн бұрын
My PC has a 12th-gen i7K, 32GB of CPU RAM, and an RTX 3060 Ti with 8GB of GPU RAM. I have no problem running 8b, but it is very slow when I run the 14b, even though it works. So I am thinking:
- 1b model = 1GB GPU RAM
- 70b model = 70GB GPU RAM
@darekmistrz43649 сағат бұрын
Not really; you can check on Ollama that the 70b model is 43GB. You can run it on two RTX 3090 cards (24GB each)
@Kim-e4g4wКүн бұрын
So how does this compare to the latest Ollama -700b- 400b? (Are the answers actually significantly better?)
@brandall101Күн бұрын
There is no such thing. The largest Ollama model is Llama 3.1 405B, and it's quite a bit less performant than SOTA. If that's a typo and you meant 70B, that's the target for the 70B R1 distill.
@Kim-e4g4wКүн бұрын
@@brandall101 Sorry I got the numbers mixed.
@georgytioroКүн бұрын
Imagine having this kind of setup and running Windows 😅 that is mind-blowing. There's no reason why you can't run ollama at full capacity
@marcd1981Күн бұрын
I enjoyed the headlines regarding the new administration's announcement of Stargate after the release of DeepSeek. Couldn't have happened to a nicer administration.
@materialvisionКүн бұрын
Great work! The inference cost of the model is supposed to be so much more efficient that it doesn't need Nvidia chips, but on the CPU we don't see any increase in this efficiency? How does this make sense? Some special Chinese GPU chips? Or could you test another model of this size and see any difference in tokens per second?
@johanw2267Күн бұрын
Love this setup lol. So badass.
@MyTube4UtooКүн бұрын
Very interesting video. Thank you.
@Nobledidakt9 сағат бұрын
The other 7B/14B/32B/70B "R1" models are not actually R1 models but rather things like "DeepSeek-R1-Distill-Llama-70B". When DeepSeek released these models the naming was accurate, but when Ollama hosted them it renamed everything, making people think they are getting an R1 mini, when actually they are running a finetune of an existing dense model. The only true offline, locally run R1 model is the full 720GB 671b that hasn't been quantized. Correct me if I'm wrong??
@abinav9214 сағат бұрын
What chat interface are you using at about 8:58?
@goodcitizen4587Күн бұрын
Also seeing guys run this on a micro PC with 64GB DDR5 RAM and an i9. A $900 box!
@peteeberhardt376922 сағат бұрын
The full model or the 70B distilled version? How did it work?
@dardo789313 сағат бұрын
@@peteeberhardt3769 Distilled without any doubts
@00Tenrai005 сағат бұрын
There was also a post about someone running the 671b on an M4 Mac mini AI cluster
@NerdyThrowbackTechКүн бұрын
Great video 👌
@Kay-cy9vi7 сағат бұрын
Chinese EUV and quantum computer will be the next market shockers
@john_blues8 сағат бұрын
How are you able to run models on a rig like this with no GPU? Is it something that Ollama can do, or is it something I can do for other models like Hunyuan and Kokoro TTS?
@tom_crytek15 сағат бұрын
People have already tried fine-tuning an LLM using Word (.doc) and PDF (.pdf) data. I want to do it to study my courses, because the teachers don't provide exercises.
@despoticmusic7 сағат бұрын
I typed in Armageddon with a twist into my AI system. The paper clip responded “It looks like you’re typing a letter. Would you like some help with that?”…. 😂
@AlgoBasket5 сағат бұрын
Silicon Valley: We have the most advanced AI infrastructure, scraped all the copyrighted material, the latest GPUs, scientists, and AI models. China: Hold my beer 🍺😅 (open sourced)
@BlahBlah-b9j7 сағат бұрын
Couldn't you export the model to ONNX to increase the tokens per second? That doesn't fix the parallel issue, though.
@saxtantКүн бұрын
Try LM Studio over ollama; it might manage parallelism a little better.
@vojtas_cz3 сағат бұрын
We have two old HP 4-socket Xeon boxes @2.1GHz (48 cores/96 threads) with 1.5TB of RAM in our on-prem VMware lab in the office, previously used as a Hadoop cluster, waiting to be scrapped one day or another. Should I give it a try? :)
@arc82189 минут бұрын
YES, try it, it's a fun project to try
@P2000Camaro5 сағат бұрын
I just ran the 32B model with my 4090. The cool thing is, it can also do the "Write me a random sentence about a cat, third word, third letter, etc." with no problem at all... on a home computer with one video card.
@P2000Camaro5 сағат бұрын
It also answered the Armageddon scenario in almost exactly the same way yours did. However, if you notice, in both the thinking and the answer it for some reason misses the part where you're telling it that *IT* is the LLM. I had to specify clearly at the end that *YOU* are the one who was chosen, *YOU* have to make this decision. Then it finally put itself in the place of the LLM. It still basically answered the same, but it's noteworthy.
@SIRA063Күн бұрын
Do you have a Jetson Nano or anything like it, or do you have your own custom desktop setup that's dedicated to AI? Is getting a Jetson worth it if you want to play with robotics and AI?