Run Local LLMs on Hardware from $50 to $50,000 - We Test and Compare!

Views: 192,251

Dave's Garage

A day ago

Comments: 687

@DataIsBeautifulOfficial 3 months ago

I'm here for the moment when the Pi says: "I can't do that, Dave"

@richard_d_bird 3 months ago

it has to wait for dave to forget his space helmet

@nathanielmoore87 3 months ago

Open the pod bay doors!!

@NigelBassman 3 months ago

The irony being that the Pi could do that

@eugrus 3 months ago

1:17 on this part it would actually be I CAN DO THAT, Dave

@markusmcgee 3 months ago

😆😆😆🤣

@LilaHikes 3 months ago

Dave, I appreciate your mindfulness of how valuable our time is and editing this vid down to a reasonable time frame.

@martyb3783 3 months ago

I found this video both informative and entertaining! I chuckled when you mentioned that it made you sad to see that big boss PC struggling. Great video as always Dave!

@20chocsaday 3 months ago

I smiled too, but got the impression that Dave cares for his viewers. He is quite precise when he talks which rather suits me.

@swanstudios2018 3 months ago

Definitely learned something there. 😀

@drelephanttube 3 months ago

Thanks Dave, I really appreciate the time you spend to make these videos for us. Really enjoy these geeky rabbitholes.

@wozaiwodejia 3 months ago

The main (and almost only) factor for speed is memory bandwidth. Every token is generated by pulling the entire model from RAM and doing a bit of math on it. An 8 GB model on a 12 GB RTX 3060 Ti with 6 channels (of 2 GB each) gets 448 GB/s, for about 50 tokens/s (accounting for some overhead). That's why GPUs are so fast. If you have 2 channels of DDR4-3200 memory, you have 51.2 GB/s, so you'll get about 6 tokens/s, or around 1 token/s on a ~48 GB Llama 3 70b model with 4-bit quantization. DDR5 helps a lot, and so does having more than 2 channels. The CPU doesn't really matter. (Unless you're limited to 2933 MHz by a shoddy memory controller in a Ryzen 2600 and upgrade to a 5600X and get a 22% boost by pushing your DDR4 to 3600 MHz.)
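A minimal sketch of the rule of thumb in the comment above, assuming each generated token streams the whole model from memory once, so tokens/s is at most bandwidth divided by model size. The function names and the `efficiency` parameter are hypothetical, added only for illustration; real throughput also depends on compute, caches, and software overhead.

```python
def ddr_bandwidth_gb_s(mega_transfers_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s, assuming a 64-bit (8-byte) bus per channel."""
    return mega_transfers_per_s * bus_bytes * channels / 1000


def estimate_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                               efficiency: float = 1.0) -> float:
    """Upper-bound estimate: every token reads the full model from memory once.

    `efficiency` is a hypothetical fudge factor (0..1) for overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return efficiency * bandwidth_gb_s / model_size_gb


# Figures quoted in the comment above.
dual_channel_ddr4 = ddr_bandwidth_gb_s(3200, channels=2)   # 51.2 GB/s
print(estimate_tokens_per_second(448, 8))                  # GPU, 8 GB model
print(estimate_tokens_per_second(dual_channel_ddr4, 8))    # CPU, 8 GB model
print(estimate_tokens_per_second(dual_channel_ddr4, 48))   # CPU, ~48 GB model
```

With these inputs the bound works out to 56 tokens/s on the GPU and roughly 6.4 and 1.07 tokens/s on dual-channel DDR4-3200, in line with the "about 50", "about 6" and "around 1" figures quoted once overhead is taken into account.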

@wozaiwodejia 3 months ago

Ok, to be fair: if you're running Llama on an old ThinkPad X260, you actually do get twice the performance by running the model on *both* cores. Having true AVX256 or better and more than two cores really helps with doing the math.

@andersjjensen 3 months ago

"A bit of math" is.... an interesting way of putting it. I'm aware that training is several orders of magnitude more compute intensive than inferencing, but whether I run in CPU or GPU mode, both are taxed pretty heavily. Never to 100%, which does indeed confirm that memory bandwidth/latency is the bottleneck, but still, taxing an 8 core CPU to 45% on LPDDR5-6400 is hardly "a bit of math".

@SquintyGears 3 months ago

@@andersjjensen it really isn't that much math. The only reason it even registers as 45% is because we're talking about models that use all the input tokens and the output tokens as active bi-lstm nodes. So it's more like it's constantly rechecking its work. Just consider how fast the Mac Pro pumps the tokens out when any other benchmark doesn't make the GPU look all that impressive. The Mac Pro is more similar to an RTX 2060 with loads of fast RAM strapped onto it. This is a case where the way usage data is monitored isn't representative of how the hardware is really taxed. Usage monitoring is more an indicator of how full the wait queue is. Ah, I just realized you specifically mentioned CPU for the 45% figure. But either way, my point is that you can't actually extrapolate down from that number what the ideal hardware configuration would be. Same amount & bandwidth of RAM but half the raw compute is still much faster than it really takes. Even if the usage seems to say it's the spot.

@JonVB-t8l 3 months ago

Use a Vega 20 GPU (excluding Radeon VII) and you can pool VRAM with RAM to run whatever models you want. You can even add swap space on NVMe drives. I got Llama 405b running on a system with a Vega 56, which supports HBCC (although it's worse), and I used 4 NVMe drives in RAID 0 for swap. PCIe Gen 3 is part of the problem, but the system prioritized VRAM, then RAM, then swap, as I expected, so about 192GB of real RAM was used and only 600GB of swap. Vega 20 (MI60 for example) has PCIe 4.0, and Optane DIMMs or Optane U.2s would work better though.

@SquintyGears 3 months ago

@@JonVB-t8l you can basically always do this. It's not Vega specific. Computers just work that way. What you're doing is changing how it's reported to the system so the basic flag checking that the software does before sending the model clears without complaining. But you could also just remove the flags or use wrappers that don't check. The reason they do try to prevent it is because you lose 90% of the speed when you do this. And it can be unstable on some systems.

@chrisdulledge6452 3 months ago

Having failed to get the webserver running on your previous WSL demo, I removed everything in frustration. Great to see it works from the command line equally well under Windows. I now have AI on my laptop (8G RAM, no GPU), something I never thought possible! Thanks for showing something for everyone.

@XTC3D 3 months ago

Thanks for updating and including budget-friendly options.

@Ultimatebubs 3 months ago

Hey Dave, in your next LLM tutorial, can you give us a demo on how to connect external data sources to it? I'm struggling to wrap my brain around it.

@Fybre 3 months ago

Do you mean using your own reference documents? If so, take a look at AnythingLLM, it might meet your requirements

@justtiredthings 3 months ago

Check out N8N or Dify

@ИванИванов-б8у4и 3 months ago

LM Studio, AnythingLLM, or similar

@Madgod711 3 months ago

Superb content. Not many channels with this amount of quality in terms of delivery.

@matt_b... 3 months ago

11:00 I believe you've been running the 8B model if you're pulling 3.1 latest. I could be wrong, but I believe latest defaults to the 8B flavor.

@reverse_meta9264 3 months ago

Correct, llama3.1:latest = llama3.1:8b

@Steamrick 3 months ago

With a 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy.
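The size argument above can be checked with a rough back-of-the-envelope sketch. It assumes file size is about parameter count times bits per weight, ignoring metadata and any layers kept at higher precision. The `model_size_gb` helper is hypothetical, and the 4.5 bits/weight figure for Q4_0 (4-bit values plus a 16-bit scale per block of 32 weights) is an approximation.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate model size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


# Assumed: ~4.5 effective bits per weight for Q4_0.
print(model_size_gb(8, 4.5))    # 8B model at Q4_0
print(model_size_gb(70, 4.5))   # 70B model at Q4_0
print(model_size_gb(70, 4))     # 70B at exactly 4 bits per weight
print(model_size_gb(405, 16))   # 405B at FP16
```

Under these assumptions an 8B model lands around 4.5 GB, which matches a roughly 5GB download, while a 70B model needs about 35 to 40 GB even at 4 bits, and a 405B model at FP16 comes to 810 GB.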

@joostwestra 3 months ago

Came here to say the same. The 70B might be a great fit for the faster machines.

@sharpenednoodles 3 months ago

I haven't played with llama yet, mostly mistral, so I was also surprised when the 70b param model was only 5gb 🥲

@reverse_meta9264 3 months ago

@@sharpenednoodles 70b llama3.1 is more like 40gb 😅

@LanningRon 3 months ago

The Llama 3.2 1B and 3B models run surprisingly well using Ollama on my OrangePi 5+ 8-core RK3588 processor with 8G RAM. Both models generate tokens at speeds that match or exceed normal human speech. I believe additional cores make a big difference. I also want to test these models on the Radxa X4 8G, N100 processor.

@keylanoslokj1806 3 months ago

What's the cost of such a home "server"?

@seanwright4976 3 months ago

I rather liked your having demonstrated with WSL, as I was able to follow along on my Ubuntu server

@DJCatmom 2 months ago

Dave, thank you for running those tests for us. While I am currently working with GPT through a web browser and looking forward to switching to the API, it is becoming more and more clear that the frameworks involved might hit hard limitations sooner rather than later, and running a local model will be my only option in the future. Seeing that it is feasible even today is very reassuring!

@speed0002 3 months ago

Thanks Dave! Really appreciate your time, and energy on this topic. I was playing with the former video yesterday and thought, "man I hope he does a little more on this".... and BAM, you did. THANK YOU!

@OceanusHelios 3 months ago

I saw your previous video. It made me want to make my system dual boot. Your first video I followed and was able to execute the LLM you suggested within VirtualBox. It worked just fine and I was grateful. And so I installed Linux Mint in a dual boot, and your FIRST video was inspiring enough for me to figure out how to get Ollama on Linux and then pick out any LLM I wanted and install it from there. I am grateful for this video, but to be fair, your first video shouldn't have garnered any hate. Because, if people are even your viewers they should be savvy enough to figure things out on their own, and use your videos as a guide. Otherwise, those viewers wouldn't be your subscribers if they were that afraid of their own computers.

@sector-53 3 months ago

Pretty awesome the pi even ran. Super cool Dave thanks as always man!

@leoxiao2751 2 months ago

Thanks, Dave. You've given me a lot more confidence in my beat-up 2015 MacBook Pro. Off to Ollama now!

@alastorclark3492 3 months ago

I'm so glad you're doing a hardware comparison. I watched your previous video and wanted this immediately.

@alastorclark3492 3 months ago

I'd prefer it directly on Linux, but ofc I'm sure I can figure that out myself. I'm just here to watch 😂

@MandrakeDCR 3 months ago

This is amazing. I just installed it on my home PC. ZorinOS / Ryzen 5 3600 / AMD 5700XT / 16GB ... It runs great (running the 3.2:latest). I have been trying to learn how to make my first game in Unity and I've been struggling with some basic ideas on the interface to code a basic shader to apply to a material and get it into the scene. The format this thing uses is perfect! ChatGPT couldn't tell me in a way I understand, couldn't find a tutorial that was what I wanted... this thing spit it out in 3 questions. I can actually understand exactly what it means, not just some vague concept I'm going to have to stumble through! I don't understand how this is even possible with such a small data set, but I will take it. THANK YOU!!!!

@shiro3146 14 days ago

I'm curious what kind of system prompt you used. I have a similar use case, and almost all the Llama versions (3, 3.1, and 3.2) gave nonsensical answers.

@orion9k 9 days ago

@08nittany 3 months ago

As someone who gave you "heat" in the last video, thank you for the follow-up!

@DavesGarage 3 months ago

You bet!

@vulcan4d 3 months ago

I built a system with 4x P102-100's which total 40GB of GPU ram. Now I can use the 70b quantized models and it is awesome! Best bang for your $$$.

@martinsykes1257 3 months ago

Nice content. I like that you seem completely agnostic between Mac, Linux, and Windows, and even the different hardware.

@Steamrick 3 months ago

Hey Dave - 11:00 With a sub 5GB download there's no amount of quantization that could possibly fit 70B parameters. It's 100% the 8B model and probably at Q4_0 quantization, which is pretty aggressive and kinda lossy. You were running pretty much the smallest version possible.

@JonVB-t8l 3 months ago

I'm running 405b on an 8-year-old server with a Vega 56. Abusing the F outta HBCC to add RAM and swap into the pool of "VRAM". Yes, I have 600GB of the 810GB model running from swap spread across 4 NVMe drives.

@Steamrick 3 months ago

@@JonVB-t8l That's quite the setup. I'd be very curious how that performs.

@thecompanioncube4211 1 month ago

@@Steamrick I am pretty sure not well enough to be acceptable. Even with NVMe, I think the read/write speeds are a quarter-ish of a DDR4 RAM stick's.

@thecompanioncube4211 1 month ago

Came here to say this. I think 70b is like a 40GB model

@Steamrick 1 month ago

@@thecompanioncube4211 Oh, even the fastest NVMe SSD is far less performant than a quarter of DRAM. It's not just the speed, it's also the latency that's much worse.

@justtiredthings 3 months ago

This testing is right up the alley of the sort of video that I've been looking for and I really appreciate it. Going through a wide range of machines is much more useful than just testing like a 20k machine. That being said, there's something I am super confused about. Before you start the Threadripper test, you said up till now we've been using the 70 billion parameter model. The download sizes were showing around 5GB and the 70 billion parameter model would be much larger than that on the order of over 10 times, even for a quantized version. And there's just absolutely no way a 70 billion parameter model would run on anything remotely close to as wimpy as a Raspberry Pi. I assume you misspoke, which does lead me into a request. I would actually really, really appreciate seeing this sort of range testing across a variety of machines, specifically for larger models around ~30 billion or ~70 billion parameters, because I assume that most of the early tests were for some quant of the 8 billion parameter model. Most of the results available online are for the 8 billion parameter models, which is really a shame because higher end consumer machines like a gaming PC or an M2 Ultra really should be able to handle larger models around 30-70 billion parameters.

@requiem9586 3 months ago

I think it's worth mentioning that the quality of a word is also important not just the speed of an idea. something well thought out has more value and I personally could see the value of your expensive machine as a host-body for the language model in the quality of the sentence that it came up with. Maybe it's nice to think of something for a bit, but I didn't see the word 'delightful' in the other examples. Thanks for making this video

@txkflier 3 months ago

And..., the $50,000 Dell said, "I'm sorry, Dave. I can't do that". Excellent video. Much better than the previous one on LLM. I actually have it working now. Thanks!

@Bp1033 3 months ago

The fact that you got llama-3.1:405B running at all at home is just impressive, even if it's mostly running on CPU. My Ryzen 7 is hardware capped at 128GB of system RAM; I really should have waited for the AM5 socket.

@darksushi9000 3 months ago

I have a 7950x with 32GB RAM and a 3090. No probs running 405B if I can wait for the result. Also have a 64 core Threadripper, 256GB RAM and a 3090. Both machines are level pegging. The more GPU VRAM you have, the bigger your model can be

@firecat6666 3 months ago

@@darksushi9000 Which quant of the 405B model are you using in your 32GB RAM machine? I can barely fit a 2-bit quant of the 70B model in 32GB RAM plus 12GB VRAM.

@darksushi9000 3 months ago

@@firecat6666 I am running the Q4

@firecat6666 3 months ago

@@darksushi9000 Hmm, that doesn't fit in 32GB of RAM unless you have ten RTX 3090s. Didn't you mean to say you're running the 70b on your 32GB RAM machine and the 405b on your 256GB RAM machine?

@JonVB-t8l 3 months ago

I'm running full fat 405b on a 7-year-old Xeon Gold system with 192GB of RAM and a Vega 56 GPU. I mean, I'm cheating because I'm using 4 NVMe drives in RAID 0 as swap space and HBCC to pull it off, but hey... It works, sorta.

@PovertyHelping 1 month ago

Thanks so much for these great opportunities. We're really loving your online classes.

@ArndBrugman 3 months ago

I am freaking amazed to run this locally on my laptop (13900HX plus 4070 mobile) and it is only 2gb and performs amazing. Thanks for sharing this Dave, great content piece! thx!

@ADB-zf5zr 3 months ago

Good luck with the longevity of your laptop!!! If you have any random problems, crashes, or things just not working, make notes of what and when (time, date), contact the laptop company, and have them officially note this as a warranty issue (if you have a warranty); otherwise, make preparations for a replacement laptop. Good luck and best wishes.

@LittleBoobsLover 3 months ago

and how do you use this 2gb (8B?) model in daily use?

@tedkrapf1302 3 months ago

You needed to run Minesweeper on the $50k Dell to really push it ;) Another great video Dave, thanks.

@HaydonRyan 3 months ago

Love it. Would also like to see a chart showing tokens per second on the same model across the hardware. Good Ollama benchmarks are hard to come by

@eugene3d875 3 months ago

That Windows method is even more straightforward than the WSL one from the last video. Thanks for sharing!

@EhdrianEh 3 months ago

I very much believe that local LLMs are an answer to privacy in the future. As long as a large group of open testers materializes, we can also try to remove bias as best we can.

@TheGrizz485 3 months ago

The 7940HS CPU in your mini PC has dedicated AI hardware acceleration dubbed "Ryzen AI". Hopefully the project enables and starts optimizing for it (in addition to the iGPU) in the future. Looks promising for cheap devices.

@artim96 3 months ago

Only at 10 TOPS according to their website. For comparison, Copilot+ PCs need at least 40 TOPS. So it's questionable whether it's accelerating anything.

@Zaf9670 3 months ago

There are projects working on incorporating ROCm which I believe can leverage the TOPS AI processor. Similar to MLX based Apple Silicon models.

@LarryStrawson 3 months ago

You are always entertaining, Dave! And considering your niche topic, this is true talent! I'm not even that much of a nerd, nor am I interested in programming or computer hardware, but I really enjoy your channel. Keep up the great work!

@Drone256 4 days ago

What level of quantization were you using in each example? Quality of output is probably more important than speed, and is impacted by quantization.

@OhRonaldo 3 months ago

That was best of the internet right there. Thanks, Dave. Best I can do is like and say "thank you" since I've already subscribed. How about a heart? ❤

@Billwzw 3 months ago

I loved seeing how AI can bring super hardware to its knees. It instantly demonstrates why the AI cutting edge is moving to Blackwell and Rubin. Many thanks for this demo.

@lhargil 3 months ago

So kewl. Was just about to look for resources regarding this topic and this video got recommended. Amazing, thank you!

@dingolovethrob 3 months ago

Yet another fab video Dave. (It's amazing how many people who have never produced anything in their lives feel compelled to criticize the heck out of other people's work)...

@tadmarshall2739 3 months ago

Wow, educational, interesting and inspiring! Thanks for showing us what is possible, in detail. I'd not even heard of ollama!

@theritchie2173 3 months ago

Since some people (predictably) like to complain in your videos because you're not catering to their exact needs, here's my demand for a followup with you running it on your PDP-11.

@NeonfOxa 3 months ago

Video to come out in 200 years

@20chocsaday 3 months ago

Do you want it done in real time?

@theritchie2173 3 months ago

@@20chocsaday What's the max allowed length for a KZbin video, 10 hours?

@robertthomas5906 3 months ago

Watch it turn out to be faster than the 50K Dell. I know, no chance of that. Yet a PDP-11 used to power a Xerox 9700 printer. It could read from network or tape, merge data with a form at 300 DPI, print at 2 pages a second duplex and do that hour after hour.

@aquinamedia4508 3 months ago

I've run Windows on my RPi4; tutorial videos are out there. Not too complicated.

@msromike123 3 months ago

Ok, thanks Dave. Got it running. Any interest in setting it up to web scrape and analyze results based on a local query?

@krfloll 3 months ago

Great content. As succinct and complete as one could hope

@warezit 3 months ago

🎯 Key points for quick navigation:
00:00:00 *💡 Introduction & Overview*
- Introduction to testing LLMs on different hardware setups, ranging from $50 to $50,000
- Motivation for addressing viewers' requests for more budget-friendly hardware and direct Windows installation
00:00:43 *🐢 Running on Raspberry Pi 4*
- Attempt to run LLaMA on a Raspberry Pi 4 with 8 GB of RAM
- Installed on Raspbian, demonstrated extremely slow performance, impractical for real-time use
00:03:27 *🔄 Testing on Consumer Mini PC (Orion Herk)*
- Upgraded to a $676 Mini PC with a Ryzen 9 7940HS and Radeon 780M iGPU
- Faster performance compared to Raspberry Pi, but model could not fit in GPU memory, relying on CPU instead
00:07:50 *🎮 Desktop Gaming PC with Nvidia 4080*
- Running the LLM on a 3970X Threadripper with Nvidia 4080 using WSL 2
- GPU offloading enabled faster performance, similar to ChatGPT, demonstrating good use of available hardware
00:09:42 *🍎 Mac Pro M2 Ultra Testing*
- Tested on Mac Pro with M2 Ultra and 128 GB unified memory
- Model ran efficiently with GPU usage around 50%, producing rapid responses, demonstrating M2 Ultra's suitability for LLMs
00:10:51 *🚀 High-End 96-Core Threadripper & Nvidia 6000 Ada*
- Attempt to run a 405-billion-parameter model on an overclocked Threadripper with Nvidia 6000 Ada
- Performance lagged significantly, highlighting that larger models can struggle even on high-end consumer hardware
00:13:12 *⚡ Efficient Model on High-End Hardware*
- Switching to a smaller, more efficient LLaMA 3.2 model on the high-end setup
- Demonstrated much better performance, producing rapid answers in real time, highlighting the importance of model size optimization
00:14:33 *📢 Conclusion & Call to Action*
- Summary of testing LLMs on various hardware from low-end to high-end
- Encouraged viewers to subscribe and check out more content, highlighting the educational and entertainment aspects of the video
Made with HARPA AI

@justadirtblock681 2 months ago

Wonderful!! Actually very useful. I plan on upgrading my own PC to do AI stuff, and now I can see roughly how well it'll do it! Thank you so much!

@WolfsKonig 3 months ago

Nice pivot and delivery, sir. Respect. I can't wait to follow along.

@aperson7624 3 months ago

Thanks for making this video. I'm building a new PC and wanted to play with running local LLMs. To see just how fast a 4080 is...holy crap!

@AnonYmous-yz9zq 3 months ago

This video should save me a lot of time when I get around to running an LLM, many thanks.

@SK-bl1lp 3 months ago

Hey, as for RPI4 and RPI5 there are tons of models of 1B-3B size, which are pretty fast even on Raspberry PI

@orion10x10 2 months ago

You're the developer who created Task Manager! Awesome

@peterxyz3541 3 months ago

I appreciate this vid using “affordable” or actually affordable hardware. I'm already on a Mac; I'm researching Ubuntu and Windows as options for some old vid cards

@schedarr 1 month ago

Llama 3.2 3B is the clear winner for general chat tasks on local machines. I just love it! Thanks for testing the 405B. I was wondering how fast it would go and how much RAM it needs. Now I know it's not worth it. I'm looking forward to Llama 3.2 7B, which I think will be the sweet spot.

@DeanHorak 3 months ago

Good info… answers many questions I had without me having to do the experiments myself, so thanks.

@UdayBhatia 3 months ago

Quick note: the 8b Llama 3.1 model is 5 GB. Around 11:10 you said you were running the 70b model; maybe you used the larger model on your PC and Mac and forgot to mention that? I did check the vid footage, and you used the latest tag, which always uses the smaller model

@ChristophBerg-vi5yr 3 months ago

70b parameters... not file size.

@TBizzleII 3 months ago

I don't believe any of your examples ran the 70b model. I would be curious to see that.

@zorbakaput8537 3 months ago

@@ChristophBerg-vi5yr He said he watched the vid. If he did, then judging from his erroneous question, he obviously did not understand it.

@UdayBhatia 3 months ago

@@ChristophBerg-vi5yr yea, that should be more than 5gb, 8b model is close to 5gb, 70b one is larger

@UdayBhatia 3 months ago

@@TBizzleII yes same, I use 8b and 70b llama quite frequently on my ec2 instance, so this was sending red flags when he said 70b model was running

@kristenwaite5955 3 months ago

I also came here for the dog playing the piano. You're the best, Dave!!

@Dattobayo 3 months ago

These vids are exactly what I need right now. Good to know that the pi can actually run it in some capacity.

@wtmayhew 3 months ago

Even an 8G RAM Pi 5B is still under 100 US dollars, so it would be a reasonable entry-level platform. Beyond the learning experience of setting up AI and an LLM, there might be utility in having a Pi as an offline server that could e-mail answers to questions which don't need to be answered within a few seconds in real time.

@TomasRamoska 3 months ago

Awesome video Dave. I was playing with Stable Diffusion. Will try to explore Llama in WSL

@Aleksei-p9g 3 months ago

Turns out, 3.1 runs reasonably well on 4080. Thanks for the tip! Until this video I didn't know I could run an LLM on my PC.

@justsayin7482 2 months ago

@Dave At 5:55, where you were seeing that GPU activity was almost absent: on my Nvidia I click on the 'Copy' graph and change it to the 'CUDA' graph. On AMD there may be a similar graph that will show you the activity, maybe 'OpenCL'? Idk. Even so, dedicated memory only shows 0.4, which suggests something is not properly configured (Ollama compatibility issue)

@dubesor 3 months ago

If you typed /set verbose you would see the exact tokens/s as well as other inference stats, just FYI
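For scripted comparisons across machines, the same figure can be derived from the stats that Ollama's REST API returns with a non-streaming `/api/generate` response, where `eval_count` is the number of generated tokens and `eval_duration` is in nanoseconds. This is only a sketch: the helper names are hypothetical, and the default host, port, and `llama3.2` model name are assumptions about a local install.

```python
import json
import urllib.request


def tokens_per_second(stats: dict) -> float:
    """Generation speed from Ollama response stats (eval_duration is in nanoseconds)."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "llama3.2",
             host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a locally running Ollama server (assumed defaults)."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (requires a running Ollama server, so it is left commented out):
# stats = generate("Why is the sky blue?")
# print(f"{tokens_per_second(stats):.1f} tokens/s")
```

Running the same prompt and model on each machine and recording this number would give the kind of like-for-like chart some commenters ask for.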

@ricardoandresriquelmerios5995 3 months ago

I used this on my machine, an i5-14500 with 16GB DDR5 and an Nvidia RTX 4060 GPU running Linux Mint, and the speed is good enough for me

@ArthurFlimbimlinson-x1r 3 months ago

What LLM?

@firecat6666 3 months ago

@@ArthurFlimbimlinson-x1r Likely one with half a dozen to a dozen billion parameters. I get around 20-30 tokens/s on my RTX 3060 12 GB when using LLMs with those sizes. Intel i5-12400F, 32GB DDR4 and Windows 11 if you want the other details but I'm pretty sure the rest of your PC can be a potato as long as the entire model plus context window cache fits in the GPU. I can also load a 70 billion parameter model that's been cut down to a smaller size (quantized to 2-bits) but it uses all my RAM+VRAM and runs at a glorious 1 token/s.
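The "fits in the GPU" condition described above can be written as a simple check. This is a rough sketch that assumes the only things competing for memory are the model weights and the context cache; the `placement` function, its labels, and the example sizes are all hypothetical and ignore OS and driver overhead.

```python
def placement(model_gb: float, cache_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Classify where a model would have to live, given its size plus context cache."""
    needed = model_gb + cache_gb
    if needed <= vram_gb:
        return "gpu"             # everything in VRAM: the fast case
    if needed <= vram_gb + ram_gb:
        return "gpu+ram"         # spills into system RAM: runs, but much slower
    return "does not fit"        # would need swap or a smaller quantization


# Hypothetical sizes on a 12 GB VRAM / 32 GB RAM machine.
print(placement(5, 1, vram_gb=12, ram_gb=32))    # small quantized model
print(placement(26, 2, vram_gb=12, ram_gb=32))   # heavily quantized large model
print(placement(230, 4, vram_gb=12, ram_gb=32))  # far too large
```

The middle case is the one that runs at around 1 token/s in the comment above: it loads, but most of the weights sit behind the much slower system memory bus.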

@ricardoandresriquelmerios5995 3 months ago

@@ArthurFlimbimlinson-x1r Dolphin

@StarOfDavidKush 3 months ago

@Dave's Garage: Thanks for the video! That LLM on Raspberry Pi looks painful, ouch. I am testing some new beta releases of Windows Server and other Windows OSes, and I've got my rig over here running on a Corsair Origin Neuron with an AMD 79503dfx and NVIDIA 4090 GPU. I was not impressed with the last LLM software I used, but I am going to check out your recommendations in the video. Thanks! I usually go to ChatGPT for my subscription plan, but there are many use cases where I prefer working offline. Thanks again for all the awesome videos!

@patrickng1287 3 months ago

For the AMD 7840, you should try LM Studio on Windows. I can run Llama 3.1 7b with respectable results. The GPU can be used; the NPU, however, sits idle.

@koenvanduffel2084 3 months ago

Have already tested on a few machines in the meantime, was good fun! My new Asus Vivobook with Ryzen AI365 runs the Llama 3.2 model very similarly to your desktop, even when not using its GPU. I also tested an Intel N100 box, and that is indeed barely useful, like the rPi, even though it had 32 GB of RAM available. An I7 1265 machine, although significantly slower than the Ryzen AI365, was also quite usable. I wonder, however, why you don't also install Ollama itself in Docker? On the Ollama GitHub there is just a single Docker install running both Ollama and open-webui in one go, so easy :).

@peteradshead2383 3 months ago

I'm surprised how smart an offline LLM is. I asked the question "I have a Ryzen X670E motherboard with a Ryzen 9700X CPU which idles at 45W from the wall; how much is from the chipset?", and the answer was correct and relevant, with pages of it. I tried words with multiple meanings, spelling mistakes, etc., and the answers were correct. Do LTO drives need drivers, what is the difference between LTO 5 and 6: all the world's knowledge in a few gigabytes.

@thbadmin7751 3 months ago

Top notch work Dave!!! Thank you!

@BlackFlux22 2 months ago

I love your channel! The OGs of Tech Samurai!

@thomaspripley 3 months ago

Perfect! Just in time for me to install Ollama on my new Lenovo Yoga Slim 7x Copilot+ PC with the Snapdragon X Elite processor and NPU!

@iamthemoss 3 months ago

As always, great video Dave.

@doozowings4672 3 months ago

I don't know why anyone would give you heat, that video was OUTSTANDING!! I was up and running on my HP Gen 9 with an old Nvidia P2000 in no time at all! The thing ran GREAT! The replies were smooth and fast... The thing I don't understand is the three variants or size options in 3.1. I want the most powerful model available. My GPU seems to be doing just fine and I have a ton of CPU and memory.

@firecat6666 3 months ago

Bigger models are (usually) smarter. But to run them fast enough, you need to fit the entire thing in VRAM or else your GPU has to pull data from the RAM, which is slow as fuck. Try loading a model that's bigger than your 5GB of VRAM and see how it goes for you, I bet you'll be disappointed.

@John-zz6fz 3 months ago

Great episode! I loved this one.

@MikeB-u6w 1 month ago

Good stuff! I will subscribe as soon as you start using tmux instead of 2 terminals. Nerd respect

@TenFoot 5 days ago

Thank you, Dave!❤

@vexy1987 3 months ago

You should be using llama3.2 on the Pi, which is designed specifically for edge devices like SBCs or smartphones

@Greenie2450 3 months ago

"Nothing but the 2nd best, for dave.... " Classic hahahaha

@grtitann7425 1 month ago

Amazing video. I am starting to learn about LLMs and would like more info about the tools that you used to monitor the systems.

@randaldavis8976 3 months ago

Nice episode. I have been playing with a local AI in Win 11 (using LM Studio) on a 7950X / RTX 3070 Ti. I also have a RPi 4, an Orange Pi 5+ and an old 4790K that I am loading Linux on. This video helps me decide what's fast enough.

@ADB-zf5zr 3 months ago

@DavesGarage @6:00 you are talking about the "fixed" RAM allocated to the GPU. The BIOS/UEFI "should" have an option to set the memory as "shared" (or similar), where the amount of RAM is dynamically allocated between the CPU and the GPU. This is one of the reasons why people are interested in the upcoming "Strix Halo", which has a beefy GPU (and CPU) but also quad-channel RAM, and can be fitted with 256GB that can be dynamically adjusted and then eaten up by the GPU! Please find this setting in the BIOS, change it to "dynamic" and post a video about your findings; I am sure many would be interested in such a thing. Thanks.

@PracticalPcGuide 3 months ago

Tested the 70B Q4 (42GB) on a 5950X and 128GB RAM with RAG and 40K context. RAM usage was about 80GB and inference ran at around 0.56 tokens/s (I usually get 30-50 on GPU using 11B). Then tried the IQ1_S, which was 15GB, on the 4060 Ti 16GB with 30K context and got the same speed (obviously offloading to RAM). The good thing is that the 70B generates long and detailed answers, unlike the 3.2 1-3B models, which sometimes say that they did not find the query in the attached document (a 2-hour, 30K-word YT interview).

@KimForsberg 3 months ago

I mean, my largest problem with the previous video was running a 2GB model on a 50k machine, a system with at least 45GB available VRAM... I can easily run a quantized 14B (10GB) model on my 2080TI 100% on GPU. Kinda expected more. And seems to be the same issue in this video. Maybe editing issue?

@_chipchip 3 months ago

Maybe it’s just your expectations?

@justtiredthings 3 months ago

@@_chipchip it's fair to critique relatively useless testing. Dave is already putting all the effort in; the videos could be much more useful if he tested appropriate model sizes for each machine. And an 8b quant is a virtually useless model in general

@fnorgen 3 months ago

@@justtiredthings Yeah. At least back when I fiddled around with this stuff I found 13b 4 bit models were still too incoherent to be useful, which was a pity because those were the biggest ones I could get running on my GPU. I ended up upgrading to 64 GB of ram and running much larger models on my CPU. They were slooow, but the results were much better. Though this was a while ago. I assume the latest generation of models are a bit more efficient.

@justtiredthings 3 months ago

@@fnorgen yeah, I will acknowledge that Qwen2.5 14b is pretty impressive for its size, at least. I'm new to playing with it, but I think it could probably do some useful work. But even that is almost twice the size of an 8b model, and I'm running it at an 8-bit quant, I believe. Also, Qwen2.5 is just a lot more impressive than Llama in general

@turbo2ltr 7 hours ago

3:50 "This one is spec'd at $676" 5:40 "It's a pretty good deal for an under $400 machine" Think you meant under $700. Good video nonetheless.

@JustinEmlay 2 months ago

There's a 3.2 11B that will be out soon. That's probably the sweet spot for most people, especially for GPUs with 12GB and up. It also adds image support.

@foodflare9870 3 months ago

I think Jan's GUI makes installing and trying out models more convenient. It lets you set instructions per thread (what it calls threads are basically what ChatGPT calls a new chat), and you can tweak model settings and use a different model per thread. For example, I have one model trained heavily on code and documentation, which is useful for searching when I remember the concept behind some language feature but not the specific keywords in the language I'm working in, mostly when it's a language I haven't touched in a while or don't use often. I have a separate model trained on a lot of fictional writing that I use to help proofread things I wrote. Even if it doesn't give me the fix I want, it at least shows where the errors are that need looking at. Another nice thing about Jan is that you can hook it up to online services as well, so you can keep all your LLM stuff in one place. I'm predominantly doing things locally, but I know at least one person who does ChatGPT stuff through it.

@AlbertSilver64 18 days ago

I'm guessing that by the time of this comment someone has already pointed out that the models you used with Ollama are not in fact the best versions but the mid-range quantized versions. Each level of quantization introduces greater or lesser errors but also produces larger or smaller files, making them more manageable for various VRAM setups. The Llama 3.2 3B model is not in fact 2GB in its full version (FP16), but 6.4GB. It is all a tradeoff, needless to say. You can access the other versions on Ollama by clicking the 'tags' link next to the model size, which lists all the versions Ollama offers. Great channel BTW.
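
The size arithmetic above can be roughed out in a few lines. This is only a back-of-the-envelope sketch: the 3.21 billion parameter count is an assumption taken from the model's published size, the bits-per-weight figures are the nominal values for the FP16, Q8_0, and Q4_0 formats, and `model_size_gb` is a made-up helper name. Real files differ a little because of metadata and mixed-precision layers, and Ollama's default tag may use a different 4-bit variant.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (ignores file metadata)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: Llama 3.2 3B has roughly 3.21 billion parameters.
N_PARAMS = 3.21e9

# Nominal bits per weight. Q8_0 and Q4_0 store one 16-bit scale per
# block of 32 weights, which adds 0.5 bits per weight on top of 8 or 4.
FORMATS = {"FP16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}

for name, bits in FORMATS.items():
    print(f"{name}: ~{model_size_gb(N_PARAMS, bits):.2f} GB")
```

With these assumptions the FP16 estimate lands at about 6.4GB and the 4-bit estimate a little under 2GB, which lines up with the two sizes mentioned in the comment.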

@marvnl 1 month ago

Can you set up a local LLM on a laptop with a 10 TOPS NPU, plus an LLM fallback server for a hybrid solution? You can already do a lot locally, I think, but the extra heavy lifting would then come from the fallback.

@mrdali67 3 months ago

Hopefully support for the 780M will come soon. I'm also kinda curious what the small NPU in the 7940HS is capable of. I am just getting a Minisforum 790 Pro set up, fully loaded with 96GB RAM and 2x 4TB SSDs. I don't think I'll have any big use for it, but it's new tech, so it's always fun to see if it's of any use at all for an average home user. IIRC it's just 35 TOPS, so probably not big enough for any advanced use, but these new AMD mini PCs pack a decent punch for the small footprint on the desk, even without counting that basic NPU.

@PaulGrayUK 3 months ago

Nice one Dave, bravo.

@guessundheit6494 9 days ago

5:15 - How do you download the latest WITHOUT an internet connection? My Lose10 PC (where Ollama will run) never goes online, my Linux PC does.

@kids123123123 3 months ago

Win10 on an i7-13700K with no video card pegs the CPU at 100%, and llama3.2 generates at about 80% of normal reading speed.

@docrx1857 3 months ago

With a 10600K it's at least 2-3x faster than normal reading speed. But I am on Linux.

@DavidManning-uu5dh 3 months ago

Hi Dave, great video. If possible, please create a video on how to train the AI for specific cases with custom data sets.

@patriot0971 2 months ago

First of all, thank you for the video. I love Ollama, but I am extremely frustrated by its lack of support for non-NVIDIA GPUs. I have an Intel Arc A770 with 16GB of VRAM, which you can buy new for $280. Many of the AI tools have started to support it using Vulkan drivers on Windows, but not Ollama. Hope that changes.

@airjuri 3 months ago

Yeah, I installed Ollama after your video. I had to comment some stuff out of the install script because it didn't notice that the CUDA drivers on my Fedora machine were installed from RPMFusion. But once the install script went through, it works crazy fast on my office machine (i7-7800X/RTX 4070 Ti). And even on my old living room machine (i5-4670K/Quadro P2000) it works faster than I can read, so it is enough ;)

@bobdemp8691 3 months ago

You might want to look into quantization to reduce the memory requirements of models. I started with Ollama and then moved on to Hugging Face... which becomes a rabbit hole of possibilities.

@DytliefMoller 3 months ago

I think the next good video should be on how to train it on your own data. Let's say a simple local MS Access DB?

@wtmayhew 3 months ago

I’ll second that. It would be interesting to see what it takes to turn a database of help desk ticket problems and resolutions into an LLM which could try to answer technical questions.

@kiddailey 3 months ago

Definitely! Or a collection of things, such as a bunch of emails or source code files.

@jaz093 3 months ago

This

@LouwPretorius 3 months ago

Thanks for listening to the comments. Great video!

@metacob 3 months ago

I think LM Studio (imagine Ollama but with a ChatGPT-like UI) lets you split the layers across RAM and VRAM so that you can still do some work on the GPU even when it doesn't fit entirely. I'm not sure if I'm right or if there's any speedup, but it could be worth a try.

@mrpocock 3 months ago

Ollama also splits the layers across CPU and GPU(s). It seems to be quite sensitive to what else your GPU is doing.
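
To get a feel for how such a split might fall out, here is a rough sketch, not how Ollama or LM Studio actually decide. It assumes every layer is the same size and that a fixed slice of VRAM is reserved for the context cache and whatever else the GPU is doing. `layers_on_gpu` and the 1.5GB reserve are made up for illustration; the 42GB size comes from the 70B Q4 figure quoted earlier in the thread, and 80 is the layer count of the Llama 70B models.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_gb is a hypothetical allowance for the context cache and
    other GPU users; real runtimes measure free memory instead.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 70B Q4 (about 42GB, 80 layers) on a 16GB card: only part of it fits.
print(layers_on_gpu(42.0, 80, 16.0))

# A 2GB model with 28 layers on a 12GB card: everything fits.
print(layers_on_gpu(2.0, 28, 12.0))
```

Whatever doesn't fit stays in system RAM and runs on the CPU, which is why a model that only partly fits can end up much closer to CPU speed than GPU speed. If something else is already using VRAM, the usable budget shrinks and fewer layers land on the GPU.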