Llama 3.1 70B GPU Requirements (FP32, FP16, INT8 and INT4)

37,261 views

AI Fusion

A day ago

This tool lets you choose an LLM and see which GPUs can run it: aifusion.compa...
Welcome to this deep dive into the world of Llama 3.1, the latest and most advanced large language model from Meta. If you've been amazed by Llama 3, you're going to love what Llama 3.1 70B brings to the table. With 70 billion parameters, this model has set new benchmarks in performance, outshining its predecessor and raising the bar for large language models.
In this video, we'll break down the GPU requirements needed to run Llama 3.1 70B efficiently, focusing on different quantization methods such as FP32, FP16, INT8, and INT4. Each method offers a unique balance between performance and memory usage, and we'll guide you through which GPUs are best suited for each scenario, whether you're running inference, full Adam training, or low-rank fine-tuning.
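As a rough rule of thumb (a minimal sketch with assumed overhead factors, not the exact formula behind the tool), the footprint comes down to bytes per parameter:

```python
# Rough VRAM estimates for Llama 3.1 70B at different precisions.
# The overhead factors are illustrative assumptions, not the tool's exact formula.
PARAMS = 70e9
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def vram_gb(precision: str, mode: str = "inference") -> float:
    weights = PARAMS * BYTES_PER_PARAM[precision]
    if mode == "inference":
        total = weights * 1.2           # ~20% headroom for KV cache/activations (assumed)
    elif mode == "adam_training":
        total = weights + PARAMS * 12   # gradients + Adam moments, ~12 extra bytes/param (assumed)
    else:                               # "lora"
        total = weights * 1.3           # frozen weights + small adapter/optimizer state (assumed)
    return total / 1e9                  # decimal gigabytes

for p in BYTES_PER_PARAM:
    print(f"{p}: ~{vram_gb(p):.0f} GB for inference")
# FP32 ~336 GB, FP16 ~168 GB, INT8 ~84 GB, INT4 ~42 GB
```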
To make your life easier, I've developed a free tool that allows you to select any large language model and instantly see which GPUs can run it at different quantization levels. You'll find the link to this tool in the description.
If you’re serious about optimizing your AI workloads and want to stay ahead of the curve, make sure to watch until the end. Don’t forget to like, subscribe, and hit the notification bell to stay updated with all things AI!
Patreon: /aifusion
Disclaimer: Some of the links in this video/description are affiliate links, which means if you click on one of the product links, I may receive a small commission at no additional cost to you. This helps support the channel and allows me to continue making content like this. Thank you for your support!
Tags:
#Llama3
#MetaAI
#GPUrequirements
#QuantizationMethods
#AIModels
#LargeLanguageModels
#FP32
#FP16
#INT8
#INT4
#AITraining
#AIInference
#AITools
#llmgpu
#llm
#gpu
#AIOptimization
#ArtificialIntelligence

Comments: 75
@einstien2409 3 months ago
Let's all take a moment to appreciate how much Nvidia kneecaps their GPUs with low VRAM to push customers into buying more $50K GPUs. At $25K, the H100 should not come with less than 256GB of VRAM. At all.
@alyia9618 2 months ago
Shame on the competitors that can't or don't want to offer better solutions! And there are many startups out there with innovative solutions (no GPUs, but real "neural" processors). These startups need money, but the likes of AMD, Intel, etc. instead carry on with their bollocks and put out "CPUs with NPUs" that are clearly not enough to run "real" LLMs. And that's because they are playing the same game as Nvidia, trying to squeeze as much money as possible from the gullible. Sooner or later we will have machines with Graphcore IPUs or Groq LPUs, but not before the usual culprits get rich squeezing everyone.
@kineticraft6977 2 months ago
I just want someone to slap an NVMe slot on an affordable Tesla GPU.
@einstien2409 2 months ago
@@alyia9618 Nvidia will be the first to get those out and charge an insane price for each, while Intel and AMD suckle on their thumbs and make the next-gen "AI HZ XX SX AI Max Pro Ultra Intelligence Hyper Max XX AI" CPU with an NPU that sips 2 milliwatts to produce a total of 60 TOPS of performance, so you can have a blurry background on your Zoom calls and generate naked women images.
@bjarne431 a month ago
M1/2/3/4 with high memory configurations look like a steal if you want to run larger models....
@coleisman a month ago
Basic marketing; every company does it with every product. From iPhones to game consoles to cars, they remove basic features that cost very little in order to incentivize you to step up to a higher model, even if you don't need the other features.
@sleepyelk5955 2 days ago
Very cool overview, thanks a lot... (always on the hunt for more RAM^^)
@lrrr 2 months ago
Thanks man, I was trying to find a video like this for a long time. You saved my day!
@AIFusion-official 2 months ago
Glad I could help
@AaronBlox-h2t 2 months ago
Same here... although I needed this info yesterday, so lucky to have found it now. Haha. New sub.
@ngroy8636 5 days ago
What about using CPU offloading for inference?
@serikazero128 2 months ago
I think your video is pretty solid, but it's also missing something. I can currently run Llama 3.1 with 0 GB of VRAM. Yes, you heard that right, 0 GB of VRAM. How is this possible? With low quantization types, similar to INT4 and INT8; in my case, specifically llama3.1:70b-instruct-q3_K_L. I can run it with around 50-64 GB of RAM, on my CPU. It takes, however, roughly 2 minutes to answer: "Hey, my name is Jack, what's yours?" What's the deal? AI needs RAM, not specifically VRAM. VRAM is much faster, of course, but I'm using a laptop CPU (weaker than a desktop one), and one that is 3-4 years old. After I load the model, my RAM usage jumps to around 48 GB, while without the model it sits at around 10 GB. My point is: you don't need insane resources to run AI. As long as speed isn't the issue, you can even run it on the CPU; it just takes longer. The GPU isn't what makes AI go, the GPU only makes it go much faster. I have no doubt that with 40 GB of VRAM, my Llama 3.1 would answer in 20-30 seconds instead of 2 or 2.5 minutes. However, you can still run it on an outdated LAPTOP CPU, as long as you have enough memory. And that's the key thing here: memory. And it doesn't have to be VRAM!!
@AIFusion-official 2 months ago
Thank you for sharing your experience! You're absolutely right that running LLaMA 3.1 70B on a CPU with low quantization like Q3_K_L is possible with enough RAM, but it comes with trade-offs. While CPUs can handle the load, they tend to overheat more than GPUs when running large language models, which can slow down the generation even further due to throttling. So, while it's feasible, the long response times (e.g., 2 minutes for a simple query) and the potential for overheating make it impractical for real-life usage. For faster and more stable performance, GPUs with sufficient VRAM are much better suited for these tasks. Thanks again for bringing up this important discussion!
@alyia9618 2 months ago
Yeah OK, but the loss of precision from 3-bit quantization is colossal!!! There is a reason why FP16 (or BF16) is the sweet spot, with INT8 as a "good enough" stopgap....
@serikazero128 2 months ago
@@alyia9618 I could run FP16 if I add more RAM, that's my point. And if a laptop processor can do this, A LAPTOP CPU, you can run even FP16 on a computer with 256 GB of RAM. And getting 256 GB of RAM is a loooot cheaper than getting 256 GB of VRAM.
@alyia9618 2 months ago
@@serikazero128 Yes, you can run FP16 no problem, especially with AVX-512-equipped CPUs! The problem is that as the number of parameters goes up, memory bandwidth becomes a huge bottleneck. This is the real issue, because the CPUs can cope with the load, especially the latest ones with integrated NPUs, and it's a no-brainer if we run the computation on the iGPUs too. Feeding all those computational units is the problem, because 2 memory channels and a theoretical max of 100 GB/s of bandwidth aren't enough. The solution the likes of Nvidia and AMD have found for now is to add HBM to their chips, and it's an empirically verified solution too, because Apple M3 chips are going strong exactly because they have high-bandwidth memory on the SoCs.
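To put rough numbers on that bandwidth argument: generating one token streams essentially all the weights through the compute units once, so a decode-speed ceiling can be sketched as bandwidth divided by model size (the figures below are illustrative assumptions from this thread, not benchmarks):

```python
# Back-of-the-envelope decode ceiling: tokens/s ≈ memory bandwidth / bytes of weights read per token.
# Bandwidth and model-size numbers are illustrative assumptions, not measurements.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

print(max_tokens_per_sec(100, 35))   # ~2.9 tok/s - dual-channel laptop DDR, 70B at ~Q3/Q4 (~35 GB)
print(max_tokens_per_sec(936, 35))   # ~27 tok/s  - RTX 3090-class GDDR6X bandwidth
print(max_tokens_per_sec(400, 35))   # ~11 tok/s  - high-memory Apple M-series SoC (order of magnitude)
```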
@ДмитрийКарпич 2 months ago
"I have no doubt that with a 40gb VRAM, my llama 3.1 would answer in 20-30 seconds instead of 2 minutes or 2.5 minutes." - No, it would answer in 2-3 seconds, maybe 5 in the worst case. It's a little bit tricky, but the basic idea isn't just placing the model in some memory; you need dozens of cores to deal with it. And with a desktop CPU you get 6-8-10-20 cores, versus 5,888 CUDA cores in a 4070.
@nithinbhandari3075 2 months ago
Nice video. Thanks for the info. We are sooo GPU poor.
@harivenkat1021 23 hours ago
Is there any tool to find the time taken to run on the GPUs for x number of tokens with y-bit quantization?
@gazzalifahim 2 months ago
Man, this is the tool I've been wishing for over the last 3 months! Thanks, thanks, thanks! Just got a question: I was planning to buy an RTX 4060 Ti for my new build to run some thesis work. My work is mostly on open-source small LLMs like Llama 3.1 8B, Phi-3-Medium 128K, etc. Will I be able to run those with great inference speed?
@AIFusion-official 2 months ago
Thank you, I’m glad it’s useful! As for the RTX 4060 Ti, if you’re looking at the 8GB version, I’d actually recommend considering the RTX 3060 with 12GB instead. It’s usually cheaper and gives you more room to run models at higher quantization levels. For example, with LLAMA 3.1 8B, the RTX 3060 can run it in INT8, whereas the 4060 Ti with 8GB would only handle INT4. Just to give you some perspective, I personally use an RTX 4060 with 8GB of VRAM, and I can run LLaMA 3.1 8B in INT4 with around 41 tokens per second at the start of a conversation. So while the 4060 Ti will work, the 3060 might give you more flexibility for your thesis work with LLMs
@AaronBlox-h2t 2 months ago
I recommend the Intel Arc A770 16GB with IPEX-LLM and Intel's Python distribution, both optimized for Arc; it beats the 4060 by 70%, according to Intel.
@guytech7310 a month ago
@@AaronBlox-h2t Does Llama 3.1 support Intel ARC?
@fuzzydunlop7154 a month ago
That's the question you'll be asking for every novel application if you buy an Intel GPU
@chuanjiang6931 a month ago
When you say 'full Adam training', is it full parameter fine-tuning or training an LLM from scratch?
@____________________________.x 2 months ago
The GPU tool would be easier if it listed the tools you could run with a specific GPU? Still, it’s nice to have something so thanks 👍
@AIFusion-official 2 months ago
Thank you for your comment! I’m glad you find the tool helpful. Could you clarify what you mean by 'tools'? Our tool is specifically designed to show GPU requirements for large language models. I’d love to hear more about your thoughts!
@Sl15555 a month ago
Neat site, I like this kind of information. I have been using EXL2 models, 8bpw Llama 3.1 70B and Llama 3.1 Nemotron; I can load them on 2 RTX A6000s. I would love to see this information for different quant types. The hardest part for me is figuring out how much VRAM I need for the models I try, and it's mostly brute force: just try it and see.
@AhmadQ.81 2 months ago
Is the AMD MI325X with 288GB of VRAM sufficient for Llama 3.1 70B inference in FP32?
@AIFusion-official 2 months ago
Thank you for your question! The AMD MI325X with 288GB of VRAM is not sufficient for running LLaMA 3.1 70B in FP32, as that would require more memory. However, FP16 is recommended for inference and would fit well within the 288GB limit, allowing for efficient performance.
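For context, the arithmetic behind that answer, counting weights only (a rough sketch):

```python
# Why FP32 is tight on a 288 GB card: the weights alone nearly fill the memory.
params = 70e9
print(params * 4 / 1e9)   # 280.0 -> ~280 GB of FP32 weights, leaving only ~8 GB for KV cache/activations
print(params * 2 / 1e9)   # 140.0 -> ~140 GB in FP16, comfortable headroom on 288 GB
```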
@SahlEbrahim a month ago
Is there a way to fine-tune this model via the cloud?
@seryoga6308 a month ago
Thank you for your video. Please tell me, what models can you recommend for an i9-9900, 32GB RAM, and an RTX 3090?
@monkeybuttadventure2 a month ago
Llama 3.1 8B will run great.
@io9021 2 months ago
When running Llama3.1 70b with ollama, by default it selects a version using 40GB memory. That's 70b-instruct-q4_0 (c0df3564cfe8). So that has to be int4. I guess in this case all parameters (key / value / query and feedforward weights) are int4? Then there are intermediate sizes where probably different parameters are quantized differently? 70b-instruct-q8_0 (5dd991fa92a4) needs 75GB, presumably that's all int8?
@AIFusion-official 2 months ago
Thank you for your insightful comment! Yes, when running LLaMA 3.1 70B with Ollama, the 70b-instruct-q4_0 version likely uses INT4 quantization, which would apply to all parameters, including key, value, query, and feedforward weights. As for intermediate quantization levels, you're correct: different parameters may be quantized to varying degrees, depending on the model version. The 70b-instruct-q8_0, needing 75GB, would indeed suggest that it's fully quantized to INT8. Each quantization level strikes a balance between memory usage and model performance.
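Those sizes line up with a quick estimate, assuming roughly 4.5 and 8.5 effective bits per weight for q4_0 and q8_0 (each 32-weight block carries an FP16 scale) and about 70.6B parameters:

```python
# Rough size check for ollama's GGUF quants (assumed ~4.5 / ~8.5 effective bits per weight).
params = 70.6e9
for name, bits in [("q4_0", 4.5), ("q8_0", 8.5)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
# q4_0: ~40 GB, q8_0: ~75 GB -- in line with the figures mentioned above
```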
@akierum 25 days ago
Just use AirLLM to put everything into RAM, not SSD, then use 2x 3090s for speed.
@chuanjiang6931 a month ago
According to the Meta blog, "We performed training runs on two custom-built 24K GPU clusters." How come only 13 H100 GPUs are required for full training of a 70B model on your webpage? Do you mean "at least"?
@vojtechkment2956 25 days ago
It is the minimal count of GPUs that allows you to keep all 70B parameters of the model, plus all the Adam optimizer state (necessary during training), in VRAM at the same time, i.e. the minimum assumption for any computing efficiency. The untold information is that this way you would train it for 61 years. :/ Provided that you train it on the same corpus and with the same approach META used. Good luck.
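For reference, here is roughly where a figure in the 13-14 H100 range comes from, assuming about 16 bytes per parameter for weights, gradients, and the two Adam moments (a sketch; the exact count depends on the precision mix and rounding):

```python
# Minimum H100s needed just to hold full Adam training state in VRAM (assumed ~16 bytes/param).
import math

params = 70e9
bytes_per_param = 4 + 4 + 8            # FP32 weights + gradients + Adam m and v (one common accounting)
total_gb = params * bytes_per_param / 1e9
print(total_gb)                        # 1120.0 GB
print(math.ceil(total_gb / 80))        # 14 cards at 80 GB each; a slightly different accounting gives the site's 13
```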
@robertoguerra5375 22 days ago
Nobody knows how much precision they need
@K.F.L a month ago
After getting into Linux, I really want to learn more about coding in general. Struggling for help via forums, I installed Llama 3.1 8B, but it gets a lot wrong. I was going to install the 70B version. It seems with my current memory specs, the ideal RAM usage would be up my own arse, considering it's a GTX 1060.
@dinoscheidt 2 months ago
Why does the tool not list Apple M3 chips? Inference and LoRA run better than on most GPUs you listed.
@AIFusion-official 2 months ago
Thanks for your question! The M3 chips are definitely included in the tool, but if you selected a model and quantization level that require more memory than the M3 chips can provide, they won’t appear in the results. Try choosing a different model or quantization level, and you should see the M3 listed. Let me know if you have any other questions!
@dinoscheidt 2 months ago
@@AIFusion-official Hm, I selected LLAMA 3.1 70B. I have an M3 with 128GB RAM... no problem. But I did not see it in the list, just smaller-VRAM GPUs. EDIT: I tried it again and now it appears. From a UX perspective it would make sense to always show all GPUs, so the list doesn't jump around, and grey out what isn't supported.
@AIFusion-official 2 months ago
You mean LLAMA 3.1 70B? You won't see it because FP32 is selected. If you select INT8, you will see it listed.
@Felix-st2ue 2 months ago
How does the 70B Q4 version compare to, let's say, the 8B version at FP32? Basically, what's more important, the number of parameters or the quantization?
@AIFusion-official 2 months ago
Thank you for your question! The 70B model at Q4 has many more parameters, allowing it to capture more complex patterns, but the lower precision from quantization can reduce its accuracy. On the other hand, the 8B model at FP32 has fewer parameters but higher precision, making it more accurate in certain tasks. Essentially, it’s a trade-off: the 70B Q4 model is better for tasks requiring more knowledge, while the 8B FP32 model may perform better in tasks needing precision.
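Interestingly, the two end up with a similar weights-only memory footprint, so the choice really is about quality rather than hardware (a quick sketch):

```python
# Weights-only footprint: 70B at ~4-bit and 8B at FP32 land in the same ballpark.
print(70e9 * 0.5 / 1e9)   # ~35 GB for the 70B model at INT4/Q4
print(8e9 * 4.0 / 1e9)    # ~32 GB for the 8B model at FP32
```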
@alyia9618 2 months ago
If you must do "serious" things, always prefer a bigger number of parameters (with 33B and 70B being the sweet spots), but try not to go under INT8 if you want your LLM not to spit out "bullshit". Loss of precision can drive accuracy down very fast and make the network hallucinate a lot; it loses cognitive power (a big problem if you are reasoning about math problems, logic problems, etc.), becomes incapable of understanding and producing nuanced text, and spells disaster for non-Latin languages (yes, the effects are magnified for non-Latin scripts). Dequantization (during inference you must go back to floating point and back again to the desired quant level) also increases the overhead.
@guytech7310 a month ago
@@AIFusion-official Can you do a video showing the differences between the precision levels? I am curious about error rates between the different precisions. Thanks!
@mohamadbazmara601 3 months ago
Great, but what if we want to run it for more than one single request? What if we have 1,000 requests per second?
@AIFusion-official 3 months ago
Handling 1,000 requests per second is a massive task that would require much more than just a few GPUs. You'd be looking at a full-scale data center with racks of GPUs working together, along with the necessary infrastructure for cooling, power, and security. It’s a significant investment, and you’d need to carefully optimize the setup to ensure everything runs smoothly at that scale. In most cases, relying on cloud services or specialized AI infrastructure providers might be more practical for such heavy workloads.
@maxxflyer 2 months ago
great tool
@loktevra a month ago
Is it possible to use an AMD EPYC 9965 (192 cores, 576 GB/s memory bandwidth) for inference and training? Maybe it is not as fast as GPUs, but I can use much cheaper RAM modules and only one processor, and it would be cheap enough.
@guytech7310 a month ago
No. Consider that Nvidia GPUs have between 1,024 and over 16,384 CUDA cores.
@loktevra a month ago
@@guytech7310 But for LLMs, as far as I know, the bottleneck is memory bandwidth, not the number of cores. And my question is how many cores are enough to hit the memory bandwidth bottleneck.
@loktevra a month ago
@@guytech7310 And don't forget that AVX-512 instructions allow a single core to compute many numbers in parallel.
@guytech7310 a month ago
@@loktevra If that were true, then LLMs would not be heavily dependent on GPUs for processing. It's that the larger LLM models require more VRAM to load; otherwise, with low VRAM, the LLM has to swap parts out to the DRAM on the motherboard. PCIe supports DMA (Direct Memory Access), and thus the GPU already has full access to the memory on the motherboard.
@loktevra a month ago
@@guytech7310 Yes, but GPU memory bandwidth is bigger than a CPU's. The AMD EPYC 9965, the latest CPU from AMD, has just 576 GB/s. So for commercial usage, GPUs will without doubt be the better choice with their higher-speed VRAM. But for a home lab, maybe an EPYC CPU is just enough?
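A quick sanity check on that trade-off, using the same bandwidth-over-model-size rule of thumb as earlier in the thread (a sketch; real throughput will be lower):

```python
# Decode-speed ceilings from memory bandwidth alone (illustrative numbers, real throughput is lower).
def ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

print(ceiling(576, 40))    # ~14 tok/s  - EPYC 9965-class 12-channel DDR5, 70B q4_0 (~40 GB)
print(ceiling(576, 75))    # ~7.7 tok/s - same CPU, 70B q8_0 (~75 GB)
print(ceiling(3350, 40))   # ~84 tok/s  - H100 SXM HBM3 bandwidth, for comparison
```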
@treniotajuodvarnis 2 months ago
70B runs on two or even one 1080 Ti and an old Xeon v2 with 128GB RAM. Yes, it takes a bit to generate, like from 10 seconds up to a minute, but not that bad! It only needs RAM; I guess 64GB would be enough.
@chiminhtran7534 2 months ago
Which motherboard are you using, if I may ask?
@treniotajuodvarnis 2 months ago
@@chiminhtran7534 A Huananzhi X79 Deluxe; it has 4 RAM slots that support 64GB LRDIMM modules and 2 PCIe x16 slots. I also ran it on another system with an 18-core Xeon v3 CPU, 128GB of 1866 DDR3, and two 3090s; it runs flawlessly without hiccups and at acceptable speeds (MB: Huananzhi X99-T8, 8 DDR3 slots, max 512GB RAM).
@px43 2 months ago
This app you made is awesome, but I've also heard people got 405B running on a MacBook, which your app says should only be possible with $100K of GPUs, even at the lowest quantization. I'd love for your site to be my go-to for ML builds, but it seems to be overestimating the requirements. Maybe there should be a field for speed benchmarks, and you could give people tokens per second when using various swap and external RAM options?
@AIFusion-official 2 months ago
Thank you for your feedback! Running a 405 billion parameter model on a MacBook is highly unrealistic due to hardware constraints, even with extreme quantization, which can severely degrade performance. In practice, very low quantization levels like Q2 would significantly reduce precision, making the model's output much poorer compared to a smaller model running at full or half precision. Additionally, tokens per second can vary based on the length of the input and output, as well as the context window size, so providing a fixed benchmark isn't feasible. We’re considering ways to better address performance metrics and appreciate your suggestions to help improve the app!
@sinayagubi8805 2 months ago
Wow. Can you add tokens per second to that tool?
@AIFusion-official 2 months ago
Thank you for your comment! Regarding the tokens per second metric, it’s tricky because the speed varies greatly based on the input length, the number of tokens in the context window, and how far along you are in a conversation (since more tokens slow things down). Giving a fixed tokens-per-second value would be unrealistic, as it depends on these factors. I’ll consider ways to offer more detailed performance metrics in the future to make the tool even more helpful. Your feedback is greatly appreciated!
@shadowhacker27 2 months ago
Imagine being the one paying, in some cases, over 60,000 USD for a card with 80GB [EIGHTY] of VRAM... twice.
@dipereira0123 3 months ago
Nice =D
@Xavileiro 2 months ago
And let's be honest, Llama 8B sucks really bad.
@AIFusion-official 2 months ago
@Xavileiro I respect your opinion, but I don't agree. Maybe you've been using a heavily quantized version. Some quantization levels reduce the model's accuracy and the quality of the output significantly. You should try the FP16 version. It is really good for a lot of use cases.
@mr.gk5 2 months ago
Llama 8B Instruct FP16 is great, much better actually.