@DigitalSpaceport I've watched your videos and looked at your website. I appreciate your content, but I find it very incomplete. I would appreciate it more if you could answer my questions here. You could also update your website with the info. It would really help me and the community understand how to invest in hardware instead of making expensive mistakes. I understand it may be too much to ask since you're not an employee of mine, but your kindness in pointing me to the right resource, since you have the hardware, would be greatly appreciated and needed by the community.

Background: I'm willing to spend $25k on an AI setup. This setup will be used to run LLMs (e.g., via Ollama), in addition to the Flux.1 image generator and a video generation model yet to be decided. My plan is to run CPT (Continuous Pre-Training), which just means further training a model that already exists to remove bias and emphasize what matters for my use cases. This training will be done on both LLMs and image generation models, with video models in the future. Inference will also need to be done on all these models, and I will be aiming for the most T/s (tokens per second), but also the best training times, as training is what will consume the greatest amount of time: as stated in the Llama white paper, the 70B model took 2048 H100s running 24/7 for 21 days to finish. At cloud pricing, that's a $2 million model. In your videos you never test CPT, generating LoRAs (Low-Rank Adaptations), or image/video inference. The hardware requirements for CPT, LoRA generation, and inference of each model type will differ from plain text inference in Ollama. This leaves a large gap in the different types of hardware utilization that may occur under different workloads and conditions, especially when dealing with model types other than text models.

My questions:
1. How does the CPU affect CPT training and LoRA generation compared to inference? Again, not just text models but image and video models as well (cores, GHz, cache).
2. How does RAM affect CPT training and LoRA generation compared to inference? Again, not just text models but image and video models as well. I know you said RAM speed doesn't matter, but I only saw that tested for text inference. I would like to see number of channels, capacity, and speed tested.
3. How do GPUs affect CPT training and LoRA generation compared to inference? Again, not just text models but image and video models as well. I really want to understand in which workloads you need an x16 connection on each GPU, and what the performance difference is.
4. How does scaling perform across multiple GPUs? For example, do 2x 3090 perform twice as well in training as 1x 3090? If not, what's the performance decrease as you scale from 1 to 8?
5. Can you get away with only x8 or x4 PCIe bandwidth? If not with Gen 4, maybe with Gen 5? What's the performance decrease? Perhaps it's just a performance hit when loading the model?
6. As for storage, did you see any offloading of the SafeTensors file or checkpoints overwhelming your drive? What GB/s is needed during training so that your GPUs dumping data to your SSD don't get bottlenecked?

I have a few more questions, but maybe I'm asking too much already, eh? If you see this, it would be much appreciated by myself and hopefully helpful for others to see your responses. Cheers.
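For context, a rough back-of-the-envelope check of the cloud-cost figure quoted above (the $2.00 per H100-hour rate is an assumption; actual cloud pricing varies widely):

```python
# Back-of-the-envelope check of the training-cost figure cited above.
# The $2.00/H100-hour rate is an assumption; real cloud rates vary widely.
gpus = 2048          # H100s cited for the 70B training run
days = 21
rate_usd_per_hour = 2.00

gpu_hours = gpus * days * 24               # ~1.03 million GPU-hours
cost_usd = gpu_hours * rate_usd_per_hour   # ~$2.1 million
print(f"{gpu_hours:,} GPU-hours -> ~${cost_usd:,.0f}")
```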
@MeidanYona · 13 days ago
This is very helpful! I buy most of my hardware from Facebook Marketplace and I often have to wait long spans between getting components, so knowing what to watch out for is very important. Thanks a lot for this!
@minedustry · 4 days ago
I also play a long game, acquiring hardware with premeditated upgrade paths. My old daily PC became the home theater PC, my current daily PC will become the game server, and I guess my next PC becomes an AI LLM machine.
@LucasAlves-bs7pf · 13 days ago
Great video! The most eye-opening takeaway: having two GPUs doesn't mean double the speed.
@DigitalSpaceport · 13 days ago
Hands down the #1 question on my videos. Not with llama.cpp yet, but hopefully soon. The current reasons are fitting bigger models and running models on separate GPUs at the same time, and running bigger models like Nemotron is a big quality step. Or use vLLM, which isn't as end-user friendly as Ollama/OWUI.
@gaiustacitus4242 · 3 days ago
Why would this be eye-opening? Of course having multiple GPUs does not result in linear scaling. You can't get close to linear scaling on any system where multiple chips share the processing, even on the same die. When it comes to GPUs like the NVIDIA RTX series, there is also the latency of the computer's bus that slows data transfer.
@gaiustacitus4242 · 3 days ago
Even if llama.cpp were highly optimized for parallel processing, it would be impossible to achieve linear scaling across multiple GPUs. Also, when evaluating the Mac vs. RTX comparison, remember the old adage: "Statistics don't lie, but liars do statistics." The NVIDIA benchmarks only run very small models which fit entirely within the GPU's VRAM. The performance of an RTX-based rig falls on its face when the model is pushed out into system RAM. Larger models, which yield better results, will run faster on the M4 Max hardware because the memory is part of the system on a chip (SoC). FWIW, the only benchmark results I've found compare an NVIDIA RTX 4090 build against a baseline MacBook Pro M3 Max. The M4 Max neural processing unit (NPU) offers 38 TOPS, which is significantly better than the 18 TOPS of the M3 Max NPU. Granted, this is far below NVIDIA's claims of 320 TOPS for the RTX 3090 or 1,321 TOPS for the RTX 4090, but again, those numbers are only relevant for small LLMs (such as the 2.5B model used in the benchmarks) which fit entirely within the GPU's VRAM.
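As a rough illustration of why spilling weights into system RAM hurts so much, here is a sketch comparing approximate memory bandwidths; the figures are ballpark, configuration-dependent assumptions, not measured values:

```python
# Ballpark memory bandwidths (GB/s); token generation is largely
# bandwidth-bound, so where the weights live dominates speed.
bandwidth_gbps = {
    "RTX 3090 GDDR6X": 936,
    "RTX 4090 GDDR6X": 1008,
    "M4 Max unified memory (top config)": 546,
    "Dual-channel DDR5-6000 system RAM": 96,
}
for name, bw in bandwidth_gbps.items():
    print(f"{name}: ~{bw} GB/s")
```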
@danielstrzelczyk4177 · 13 days ago
You inspired me to experiment with my own AI server based on a 3090/4090. I made slightly different choices: an ASRock WRX80D8-2T + Threadripper Pro 3945WX. As you mentioned, CPU clock speed matters, and I got a brand-new motherboard + CPU for around 900 USD. I also want to try the OCuLink ports (the ASRock has 2 of them) instead of risers. There are 2 advantages: OCuLink offers flexible cabling and works with a separate power supply, so you are no longer dependent on a single expensive PSU. So far I see 2 problems: the Intel X710 10GbE ports cause some errors under Ubuntu 24.04, and the Noctua NH-U14S is too big to close a Lian Li O11 XL, so I have to switch to an open-air case. Can't wait to see your future projects.
@DigitalSpaceport · 13 days ago
On the Intel: if that's the fiber X710, do you have approved optics?
@MetaTaco317 · 12 days ago
@danielstrzelczyk4177 I've been wondering if OCuLink would find its way into these types of builds. Wasn't aware the ASRock mobo had 2 ports like that. Have to check that out.
@danielstrzelczyk4177 · 11 days ago
@DigitalSpaceport I use the X710 copper connection, but I just figured out that I shouldn't have blamed the Intel NICs for the repeating "Activation of network connection failed". The source of my issue was the virtual USB Ethernet (American Megatrends Virtual Ethernet) created for IPMI. This is strange, as I use a separate Ethernet cable connection for IPMI, but after disabling "Connect automatically" in the USB Ethernet profile it all returned to normal.
@danielstrzelczyk4177 · 11 days ago
@MetaTaco317 Yes, there are 2 ports out of the box, but I have also seen PCIe cards with OCuLink and converters from an M.2 slot to OCuLink, so there are a couple of options. An ADT-Link F9G just arrived, so let's see how it works :)
@UnkyjoesPlayhouse · 13 days ago
Dude, what is up with your camera? Feels like I am drunk or on a boat :) Another great video :)
@SaveTheBiosphere · 12 days ago
The 4060 Ti with 16GB can be had new for $449. Built for x8 on PCIe, so perfect for bifurcation off an x16. The memory bus is narrower (128-bit), but for AI use it seems like hands down the best bang-for-the-buck card? 165 watts max draw. (PNY brand on Amazon, in stock.)
@DigitalSpaceport · 12 days ago
Great comment! You got me thinking this morning and I wrote out a detailed response; for a dual to quad setup, these are a compelling route. kzbin.infoUgkx60ENLNTIkOA49lFHXJIhr4yNdYHH2gib?si=BcSKVmujjfdPigly
@dorinxtg · 13 days ago
I didn't understand why you didn't mention any of the Radeon 7xxx cards, or ROCm.
@ringpolitiet · 13 days ago
You want CUDA for this.
@christender3614 · 13 days ago
It's preferable. AFAIK, Ollama isn't yet optimized to work with ROCm. It would've been interesting though, something like "how far do you get with AMD". AMD is so much more affordable per GB, especially when you look at used stuff. Maybe that's something for a future video, @DigitalSpaceport?
@christender3614 · 13 days ago
My comment vanished. Could you make a video on AMD GPUs? Some people say they aren't that bad for AI.
@DigitalSpaceport · 13 days ago
I see two comments here, and I do plan to test AMD and Intel soon.
@slowskis · 13 days ago
@DigitalSpaceport I have a bunch of A770 16GB cards along with ASRock H510 BTC Pro+ motherboards sitting around. I was thinking of trying to make a 12-card cluster connected by 10Gb network cards, with a 10900K for the CPU and the 3 systems linked to each other. Any problems you can think of that I am missing? 4 GPUs per motherboard with two 10Gb cards. The biggest problem I can think of would be the single 32GB RAM stick that the CPU is using.
@coffeewmike · 13 days ago
I am doing a build that is about 60% aligned with yours. Total investment to date is $7,200. My suggestion, if you have a commercial-use goal, is to invest in server-grade parts.
@christender3614 · 13 days ago
Been waiting for that one and happy to write the first comment!
@DigitalSpaceport · 13 days ago
Legend!
@FahtihiGhazzali · 10 days ago
I love this video. So much I've learned in such a little time. Question 1: blah blah blah. Answer: no, VRAM is more important. Question 2: blah blah blah. Answer: no, VRAM! Other questions: no, VRAM! 😊
@DigitalSpaceport · 9 days ago
VRAM all day long
@StefRush · 12 days ago
The AI lab I built to test, and I was shocked how fast it was:
- 4x Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz (1 socket)
- RAM usage 71.93% (11.22 GiB of 15.60 GiB), DDR3
- proxmox-ve: 8.3.0 (running kernel: 6.8.12-4-pve)
- NVIDIA GeForce GTX 960, PCIe Gen 1 @ x16, 4GiB
- "write python code to access this LLM": 24.43 tokens/s
- "create the snake game to run in python": 21.38 tokens/s
You did a great job with your tutorials. Thanks, I'm going to get some 3060s now.
@DigitalSpaceport · 12 days ago
Thanks! Those 12GB 3060s are cost-wise probably top 3 IMO for VRAM/$.
@squoblat · 9 days ago
At what point does an A100 80GB become viable? They are starting to drop in price now. Mulling over either an RTX 6000 Ada or waiting a bit longer and going for an A100 80GB. Currently running 2x RTX A5000; the lack of single-card VRAM is much more of a limit than much of the online world seems to point out.
@DigitalSpaceport · 9 days ago
"Honey, we are getting an A100 cluster" doesn't pass my gut check on things I can casually drop at dinner. Not yet. I'm on eBay looking these up now, however, and you're right, prices on them are down.
@squoblat · 9 days ago
@DigitalSpaceport Looking forward to the video if you ever do get one. Having a compute card makes a lot more sense as models get bigger. The self-hosted scene is going to get pushed out if things like the A100 stay very expensive.
@gaiustacitus4242 · 3 days ago
You've hit the nail on the head. VRAM is the most important factor. Unless you are looking to run at most a 7B-parameter LLM (which IMO is pointless), an NVIDIA RTX-based system will yield disappointing performance. Even an M3 Max with 128GB RAM will perform better on models requiring more than 24GB of RAM, and the M4 Max offers still better performance.
@TheColonelJJ · 1 day ago
With the end of SLI/NVLink, is there hope that a home PC build will run two GPUs under Windows? I want to add a second 3060 12GB to my Z790 i7-14900K and use it for Stable Diffusion/Flux while also running an LLM alongside for prompting. Am I forced to move to Linux?
@tringuyen0992 · 16 hours ago
Hi, great video. How about the NVIDIA A6000?
@hassanullah1997 · 12 days ago
Any advice on a potential local server for a small startup looking to support 50-100 concurrent users doing basic inference/embeddings with small-to-medium-sized models, e.g. 13B? Would a single RTX 3090 suffice for this?
@DigitalSpaceport · 12 days ago
This is my guess, so don't hold me to it. I would start by figuring out exactly which model or models you want to run concurrently. You would want to set the keep-alive timeout on those models to be pretty long, greater than 1 hour, to avoid something like everyone coming back and warming it up at the same time. I think you would be better off with 3x 3060 12GB cards if they would support the models you intend to use. If you are looking for any flexibility, then starting with a good base system and adding 3090s as needed is the safest advice. If there is a big impact from undersizing, just go 3090s. Make sure to get a CPU that has good, fast single-thread speed. Adjust your batch size as needed, but the frequency of your users' interactions needs to be observed in nvtop or other, more LLM-specific performance monitoring tools.
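A minimal sketch of the keep-alive idea above, assuming a stock Ollama install on its default port (the model name is a placeholder; substitute whatever you actually serve):

```python
import requests

# Hypothetical warm-up call: ask Ollama to keep the model resident for 2 hours
# so bursts of concurrent users don't all trigger a cold model load at once.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "your-13b-model",  # placeholder model tag
        "prompt": "warm-up",
        "stream": False,
        "keep_alive": "2h",         # how long Ollama keeps the model loaded after this request
    },
    timeout=300,
)
print(resp.status_code)
```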
@marvinthielk · 8 days ago
If I want to get into hosting some 30B models, would 2x 3060s work, or would you recommend a 3090?
@thanadeehong921 · 13 days ago
I set up the same motherboard and EPYC CPU as you. May I ask: if you could do it all over again, would you change any of the setup?
@DigitalSpaceport · 13 days ago
I'm wanting to get a 7F72, but they are expensive and I would need a pair. If I were scratch-building, I would likely have used an air cooler for the CPU as well. Maybe the H12SSL-i would be the board I'd go with, since the MZ32-AR0 has gone up in price a good bit.
@keoghanwhimsically2268 · 11 days ago
How does a 4x 3090 setup compare to 2x A6000 for training/inference at different model sizes? (The A6000 is more like a 3090 Ti in terms of CUDA but with 48GB VRAM, though with a lower power draw for a single card than even the 3090, since it's optimized for pro workloads and not gaming. Downside: it's 2x the cost per GB of VRAM compared to the 3090.)
@TheYoutubes-f1s · 10 days ago
Have you seen any inference benefit to using CPUs with a larger L3 cache? Some of the EPYC Milan CPUs have 768 MB of L3 cache. I wonder if it has an effect when the model can't fully fit in VRAM.
@DigitalSpaceport · 9 days ago
I didn't in testing on the 7995WX, which may be a video you're interested in: kzbin.info/www/bejne/p5fUeXR3n8mHsM0
@tomoprime217 · 11 days ago
Is there a reason you left out the RX 7900 XTX? Is it a bad GPU pick for 24GB?
@lucianoruiz2057 · 10 days ago
I loved this video! Very helpful information!! I have a question: what is the difference in performance between using PCIe Gen 3 x16 vs. PCIe Gen 4 x16? I have a few 3090s and also some Dell T7810s with dual x16, but PCIe Gen 3 :(
@DigitalSpaceport · 10 days ago
Thanks. For inference speed, slot gen and width don't have a meaningful impact at all. Those T7810s are G2G!
@lucianoruiz2057 · 9 days ago
@DigitalSpaceport Nice! What about training with x16 PCIe 3.0 vs. x16 PCIe 4.0? Did you try?
@claybford · 13 days ago
Any tips/experience using NVLink with dual 3090s?
@DigitalSpaceport · 13 days ago
It's not needed unless you are training, but I need to test on my A5000s that have NVLink so I'm not just parroting on that. I did try it out but messed something up, IIRC, and got frustrated. Will give it another shot soonish.
@claybford · 13 days ago
@DigitalSpaceport Cool, thanks! I'm putting together my new 2x 3090 desktop/workstation and I grabbed the bridge, so I'll be trying it out soon as well.
@hotsauce246 · 13 days ago
Hello there. Regarding RAM speed, were you partially offloading the models in GGUF format? I am currently loading an EXL2 model completely into VRAM.
@DigitalSpaceport · 13 days ago
No, the model was fully loaded to VRAM. This video tested multiple facets of CPU impact fairly thoroughly: kzbin.info/www/bejne/p5fUeXR3n8mHsM0
@TheYoutubes-f1s · 13 days ago
Nice video! What do you think of the ASRock ROMED8-2T motherboard?
@DigitalSpaceport · 12 days ago
I'd go with the H12SSL-i.
@SaveTheBiosphere · 12 days ago
What are your thoughts on the AMD Strix Halo releasing in January at CES? It's an APU with 16 Zen 5 cores, an NPU, and a GPU all on one chip, with 64GB of on-package memory. Also called the Ryzen AI Max+ 395. Targeting AI workstations, the January release version is supposed to have 64GB on board that can be allocated to AI models (the OS would need some of the 64), with a 128GB version in Q3 2025.
@DigitalSpaceport · 12 days ago
AMD doesn't lack for hardware. It's always the software/drivers/kernel support that's lacking. It's been getting better, but NVIDIA is the sure thing. Hopefully they have some good work going into the software side for this! I hope to test one out eventually.
@Boyracer73 · 13 days ago
This is relevant to my interests 🤔
@HotloadsTTV · 9 days ago
I have a mining rig with x1 PCIe lanes. I thought the models required x4 PCIe lanes? You mentioned in the video that it is possible to run a GPU on x1 PCIe for inference. Are there other caveats?
@DigitalSpaceport · 9 days ago
Not around the x1 point, really. However, I'll drop a note that I didn't test that with the USB risers yet. It *should* not make a difference vs. ribbons, as it's just loading the model into VRAM, much like a DAG workload. However, it would impact training horribly. It may impact some RAG workloads depending on how much you batch into a document store. Let me know how it goes!
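For a feel of why link width mostly shows up at model-load time, here is a rough sketch; the effective bandwidths and the model size are approximate assumptions, not measurements:

```python
# Rough load-time estimate for pushing a quantized model into VRAM over
# different PCIe links. Effective bandwidths are approximate assumptions.
model_gb = 20  # example: a ~20 GB quantized model

links_gbps = {
    "PCIe 3.0 x1": 0.9,
    "PCIe 3.0 x16": 14.0,
    "PCIe 4.0 x16": 26.0,
}
for name, bw in links_gbps.items():
    print(f"{name}: ~{model_gb / bw:.1f} s to load {model_gb} GB")
```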
@НеОбычныйПользователь · 1 day ago
@DigitalSpaceport You can actually test this on your quad-3090 system. Just set the PCIe link to x1 in the BIOS and check the generation speed on large models, and especially the speed of processing a large context.
@DanBYoungOldMiner · 9 days ago
I had a Gigabyte X399 Designare EX 1.0 with a Threadripper 1900X and 128GB of 3200MHz memory just lying around and not doing much. I was able to get 6 ASUS TUF RX 6800 XTs into a modded RSV-L4500U 4U server case, all GPUs at PCIe 3.0 x8. I will next install ROCm from AMD, as they have a very large catalog. It would be nice to have some content on AMD RX 6000 and RX 7000 GPUs, as AMD GPUs are very capable but there is not much content out there.
@DigitalSpaceport · 9 days ago
Let me know how it goes. I'm interested in getting AMD GPU content on the channel.
@Keeeeeeeeeeev · 13 days ago
More than DDR4 vs. DDR5 and MT/s, probably the more interesting takeaway would be single vs. dual vs. quad vs. 8-channel performance.
@Keeeeeeeeeeev · 13 days ago
...maybe even more so cache speed and quantity... what are your thoughts?
@DigitalSpaceport · 13 days ago
For sure you want to watch this video! It's the most in-depth test on CPU impacts around, and I've got a pretty crazy 7995WX in it with all 8 channels filled. kzbin.info/www/bejne/p5fUeXR3n8mHsM0
@Keeeeeeeeeeev · 13 days ago
@DigitalSpaceport I missed that, thanks. Watching right now.
@Keeeeeeeeeeev · 13 days ago
Same thoughts... faster cache and higher amounts would be my bet, both on CPU and GPU. If I'm not getting something wrong, the fastest GPUs running LLMs (both older and newer models) seem to be those with more cache, higher memory bandwidth, and bigger memory bus widths. Of course TFLOPS count too, but to a lesser extent.
@MartinStephenson1 · 12 days ago
Taking a dual-4090 system as a benchmark for build/running cost: what would be the cost per hour to use a cloud provider with a 48GB RTX A6000? When would it be cheaper to use a cloud service?
@DigitalSpaceport · 12 days ago
@MartinStephenson1 It's multivariate, as individual electricity rates factor in heavily. How much utilization the system sees also factors in heavily. This might be a good video topic; it's pretty complex. I'd also not go with 4090s unless you're doing image gen/video gen.
@gaiustacitus4242 · 3 days ago
@DigitalSpaceport Why not? NVIDIA's benchmarks show the 4090 yielding more than 4x the TOPS of the 3090. If the model fits entirely in the VRAM of a single 4090, then that will perform far better. Even the 4080 SUPER offers more than 2.5x the performance of the 3090, though you sacrifice 4GB of VRAM.
@DigitalSpaceport · 3 days ago
This has not played out like their benchmarks when I did head-to-head testing of my dual 4090s vs. dual 3090s. Maybe a good subject to revisit in a future video.
@TheYoutubes-f1s · 12 days ago
Are AM5 boards an option if you just want to do inference on three 3090s?
@DigitalSpaceport · 12 days ago
Yes, an AM5 will work. Full lane support is only needed for training/tuning models and image/video gen.
@GeneEkimen · 4 days ago
Hello. Can you tell me, if I have a rig with 8 P106-100s, what models can I use on it? I think they are very interesting graphics cards because you can now buy them for 10-15 dollars; maybe you could make a video with these cards. Thank you.
@DigitalSpaceport · 4 days ago
These are the 3GB VRAM headless Pascal 1060s, IIRC?
@GeneEkimen · 1 day ago
@DigitalSpaceport It's the 6GB VRAM Pascal 1060 (like the GTX 1060 6GB).
@AIbutterFlay · 10 days ago
How much would it cost to buy all the components of a quad-3090 build?
@DigitalSpaceport · 10 days ago
I think you're looking for the cost-analysis part of the video where I put that together; not sure if you saw that yet. It's here: kzbin.info/www/bejne/gH-XdpuXgpypr9k I would also suggest the H12SSL-i, which is about the same cost as the MZ32-AR0 right now, with a smaller board footprint and no SAS connector bridge needed to get the top PCIe slot running.
@KonstantinsQ · 12 days ago
So I didn't get it: are more cores better or worse for AI? For example, a Ryzen 9 5950X with 16 cores vs. a Ryzen 5 7600X with 6 cores?
@DigitalSpaceport · 12 days ago
It's not super black and white. More cores are useful when you run more models. I've seen a single model briefly max out 12 cores at the same time, others just 8. However, 1-2 cores always stay pegged for the entire output. I'd go 8-16 cores minimum myself. If you have a lot of additional services, factor those in as well. High all-core turbo and single-thread speed factor in second behind the GPUs.
@adjilbeh · 12 days ago
Hello, what do you think about 7x RTX A4000, or the RTX 4000 Ada (the slim one with 20GB of VRAM in one slot)? They have a lower TDP than an RTX 3090.
@DigitalSpaceport · 12 days ago
The RTX workstation cards cost a bit more for the amount of VRAM but are great overall. My A5000s idle around 8W, just a bit under the 3090's 10W.
@Nick-tv5pu · 9 days ago
Faster, more modern architecture than the 3090s too. I have four 3090s and one RTX 4000 Ada SFF and was surprised by its performance.
@FabianOlesen · 12 days ago
I want to suggest a slightly lower tier: 2080 Tis that have been modified with 22GB of memory, running a 2x system.
@canoozie · 13 days ago
My RTX A6000s idle at 23W, so yeah, always-on is expensive depending on your GPU config. I have 3x in each system, 2 systems in my lab.
@DigitalSpaceport · 13 days ago
Mmmmmm, 48GB VRAM each. So nice!!!
@canoozie · 13 days ago
@DigitalSpaceport Yes, they're nice. I was looking for a trio of A100s over a year ago and couldn't find them, so instead I bought 6 A6000s, because at least I could find them.
@DigitalSpaceport · 13 days ago
If you think about it... I average 10-12W per 3090 (24GB), so the 23W per A6000 (48GB) seems to scale. Maybe idle power is tied to VRAM amount as well?
@canoozie · 13 days ago
@DigitalSpaceport That could be, but usually power scales with the number of modules, not size. But then again, maybe you're right: I looked at an 8x A100-SXM rig a while back, and it idled each GPU at 48-50W with 80GB per GPU.
@DigitalSpaceport · 13 days ago
@canoozie My 3060 12GB idles at 5-6W, hmm. Interesting. Also, now I'm browsing eBay for A100s. SXM over PCIe, right? I'm probably not this crazy.
@christender3614 · 13 days ago
The most difficult decision is how much money to spend on a first buy. I'm kinda reluctant to get a 3090 config not knowing if I'll be totally into local AI.
@DigitalSpaceport · 13 days ago
A 3060 12GB is a good starter then. If you want to go heavy on image/video gen, 24GB is desirable. Local AI is best left running 24/7 in a setup, however, to really get the benefits, with integrations abounding in so many home-server apps now.
@VastCNC · 12 days ago
Maybe rent a VM with your target config for a little while before you start building?
@mrrift1 · 13 days ago
What are your thoughts on getting 4 to 8 4060 Ti 16GB cards?
@DigitalSpaceport · 13 days ago
64GB of VRAM is a very solid amount that will run vision models and Nemotron easily at Q4, and it's not a bad card at all for inference.
@mrrift1 · 12 days ago
@DigitalSpaceport Thanks. I think I might get 8 of the 4060 Ti 16GB and will build a rig with them; I would love to hear any thoughts you have. I will have a budget of $3,500 US for the rest, and I have a Thermaltake The Tower 900 full-tower E-ATX gaming case (black) to start with.
@Keeeeeeeeeeev · 13 days ago
Can you mix AMD and NVIDIA GPUs together for inference?
@DigitalSpaceport · 13 days ago
Great question. Will test when I get an AMD card.
@Keeeeeeeeeeev · 8 days ago
@DigitalSpaceport Within 2 weeks, if I have enough time, I'll probably let you know. Just ordered the cheapest 12GB 3060 I've found to add alongside an RX 6800 😁
@minedustry · 5 days ago
If I added an AI translator to my game server, how many threads and how much video card would I need?
@DigitalSpaceport · 5 days ago
Is there software that does this specifically? Batching in parallel is fairly decently performant, but if it's just a message here or there, I'd start with finding a model that does translation the best (sorry, I'm not sure which one that is) and then checking what parameter sizes it supports. Then size your card for a Q8 of that model. Don't forget to add about 20% for context.
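As a rough sizing sketch for the Q8-plus-context rule of thumb above (the model size is an example, and the 20% figure is the ballpark from the comment, not a measured value):

```python
# Rough VRAM sizing: Q8 weights are roughly 1 byte per parameter, plus ~20%
# headroom for context/KV cache as suggested above.
params_billion = 8          # example model size
bytes_per_param = 1.0       # ~1 byte/param at Q8
ctx_headroom = 0.20

weights_gb = params_billion * bytes_per_param
total_gb = weights_gb * (1 + ctx_headroom)
print(f"~{weights_gb:.1f} GB weights, ~{total_gb:.1f} GB with context headroom")
```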
@minedustry · 5 days ago
It is just a few messages here and there. The slang, typos, and in-game jargon make all the translators that I've tried spit out garbage that just seems to create more confusion than help. I was hoping that you would know which software to use, but anyway, I should probably try something simple to learn on.
@marianofernandez3600 · 13 days ago
What about CPU cache?
@DigitalSpaceport · 13 days ago
Interestingly, it doesn't seem to impact inference speed, but I would need an engineering flamegraph to really profile it. Not a top factor, for sure.
12 days ago
4 GPUs vs. 1 M4 Max, and it's getting half the tokens/sec. Powering 4 GPUs, plus the extra equipment needed to run them... Apple seems like a no-brainer.
@DigitalSpaceport · 12 days ago
Here is a timestamp comparing numbers on the same quants, M4 Max vs. quad 3090, on Ollama/llama.cpp. I'm not sure where you're getting that it's half as slow? kzbin.info/www/bejne/ZqTSpatpat6MjK8&si=TypOesYh-ujt2_AQ
@christender3614 · 11 days ago
I guess this is about the M4 Max being half as slow. I think there's a point. Getting an M4 Max MacBook (the 14-core version, though, which might be slower) doesn't cost a lot more than 4 used 3090s where I live. So it's way cheaper than a full system built around the 3090s, it uses way less energy, and it's way more versatile. So depending on what you want, half the speed maybe isn't that big of a tradeoff. Though as I said, I'm not sure if the 14-core version is comparable to the 16-core version, which is more expensive than a system built around 4 3090s. Edit: It also depends on what you need AI for. If you're looking to totally replace ChatGPT and ask questions all the time, speed matters more than if you're happy with ChatGPT and only need local AI for special tasks and/or some more private stuff.
@joelv4495 · 6 days ago
@christender3614 With the M4 Max MBP, you've got to get the 16-core variant to get more than 36GB of memory.