I appreciate you for holding the community's hand 🤝
@martin-thissen Жыл бұрын
Wow, really appreciate it, means a lot! :-)
@andikunar7183 Жыл бұрын
Great content, thanks! llama.cpp also supports running Mixtral, and a 3-bit quantized model (huggingface/TheBloke) needs less than 20GB of RAM, so it should run on a 4090. I run it (Q5_K_M quantized) at 25 tokens/s on an M2 Max Mac Studio, using about 40GB of RAM in total. And the quality of the content it produces in German is amazing to me. Still waiting for a bit more llama.cpp optimization, but I hope to use it as my standard model for local RAG.
@martin-thissen Жыл бұрын
Thanks, appreciate it! :-) Wow, didn't realise the folks from llama.cpp already support the Mixtral model. Honestly, I thought implementing the MoE layer in C++ would take a little longer, but as always it's super impressive how fast things are moving! Awesome, 25 tokens/s sounds really nice! Also helpful to get feedback for languages other than English! I'm curious to see some benchmarks on how the quantization impacts the quality of the Mixtral model, but glad to hear you are getting good results with it!
@teleprint-me Жыл бұрын
@martin-thissen Look at llama.cpp pull 4406. I think you'll find it interesting. Glad you're back by the way.
@henrischomacker6097 Жыл бұрын
How long does it take to get an answer to a question like "Write a short story about llamas" on your M2?
@andikunar7183 Жыл бұрын
@henrischomacker6097 It depends on the answer length. On my M2 Max with Mixtral 8x7B Q5_K_M (32GB file, requiring 35GB of VRAM) and yesterday's llama.cpp build (a moving target, because they constantly enhance/optimize), it answered with 432 tokens at 25.5 tokens/s = approx. 17s for the pure answer generation, and used up to 55GB of RAM on this 96GB machine (Safari with this YouTube video was also open). Technical details: LLM answer generation for a single user is largely memory-bandwidth bound (OK, with a little GPU impact if you use stronger quantization). The M2 Max has 400GB/s (thanks to a 512-bit wide memory bus; NVIDIA uses a 384-bit bus but much faster VRAM). Martin's NVIDIA RTX 6000 Ada (48GB VRAM) has about 2.5x the memory bandwidth and speed (his PC also costs more than 2.5x as much), but its GPU performance only helps with prompt processing and training. So for local LLM answer generation for just one user, an M2 Ultra Mac is currently the most cost-effective solution, with probably almost the performance of Martin's system. If you also want to do fine-tuning or serve multiple users, you need a PC with an NVIDIA card for performance - their newest GPUs are approx. 13x faster than an M2/M3 Max and approx. 7x faster than an M2 Ultra.
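As a rough sanity check on the memory-bandwidth argument above, here is a back-of-envelope sketch in Python. The bandwidth figures and the assumption that roughly 13B of Mixtral's ~47B parameters (2 of 8 experts) are read per generated token are approximations, not measurements.

```python
# Back-of-envelope estimate of single-user decode speed, assuming generation is
# purely memory-bandwidth bound: each new token streams the active weights once.
# All numbers below are rough assumptions, not measurements.

def max_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if weight streaming were the only cost."""
    return bandwidth_gb_s / active_weights_gb

# Mixtral 8x7B routes each token through 2 of its 8 experts, so roughly 13B of
# its ~47B parameters are read per token; at ~5 bits per weight that is ~8GB.
active_gb = 13e9 * (5 / 8) / 1e9

print(f"M2 Max (~400 GB/s):       ceiling ~{max_tokens_per_second(400, active_gb):.0f} tokens/s")
print(f"RTX 6000 Ada (~960 GB/s): ceiling ~{max_tokens_per_second(960, active_gb):.0f} tokens/s")
```

Real throughput lands well below these ceilings (the 25 tokens/s measured above), since KV-cache reads, dequantization and kernel overhead also take time, but the ratio between the two machines comes out roughly as described.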
@jonmichaelgalindo Жыл бұрын
I've seen people experimenting with Mixtral, and it looks *extremely* capable. Looking forward to testing it myself!
@martin-thissen Жыл бұрын
Yes, got the same impression from my first experiments!
@bakrianoo Жыл бұрын
Welcome back. I am really happy to be watching a new video from you again ❤
@martin-thissen Жыл бұрын
Thanks for the support, really appreciate it! ❤️
@trsd8640 Жыл бұрын
Great video! And thank you for bringing clarity into this benchmark thing!
@martin-thissen Жыл бұрын
Glad it was helpful!
@robliv7 Жыл бұрын
Great video, thanks for the explanation and overview. But wait - NVIDIA gave you a $7000 GPU just like that? Time to become a YouTuber lol
@ernestuz Жыл бұрын
This model is massive compared to the ones I normally use. I'll try running it on the CPU (16 cores), let's see. Thanks for the video!
@JohnR-fc7vr Жыл бұрын
I wish you had shown the model running: how it feels and what kind of responses it gives.
@TheSiddhaartha Жыл бұрын
Long time, no see!
@martin-thissen Жыл бұрын
I know! It has been some time :/ Planning on uploading more regularly again though 💪
@geoffreygordonashbrook1683 Жыл бұрын
Many thanks for the great video and links to papers! Looking forward to videos on fine tuning.
@martin-thissen Жыл бұрын
You‘re welcome, glad you enjoyed it! Happy to hear! :-)
@93cutty Жыл бұрын
I was going to try Mixtral on my 4090 later. I don't use that PC much so BigMama is gonna have to come out of hibernation to give this a go!
@martin-thissen Жыл бұрын
Let me know how it went :-)
@Stephen5311 Жыл бұрын
@martin-thissen Hey, I successfully got Dolphin-2.5-Mixtral-8x7b running on my 3090 with 128GB of RAM using the disk option. It takes 89GB of the 128GB. Not the fastest using disk (around 0.5 tokens/sec), but still at least 2x as fast as WizardLM 33B. Uncensored too.
@TheDaveau Жыл бұрын
Great video - thanks. Do you know if the MoE architecture would lend itself to model file sharding as used in AirLLM?
@redfield126 Жыл бұрын
That is not fair. Now you have this monster desktop 😅 Seriously, I think you getting this brand-new computer right as this new MoE was released is kind of destiny!
@martin-thissen Жыл бұрын
Yeah, it’s definitely nice to have the opportunity to work with such a model on my desktop 🙌🏻
@fb3rasp11 ай бұрын
Great video. I wish it would work on my RTX 3060. Do you reckon Mistral AI will publish a smaller model?
@AbidonX Жыл бұрын
Bro, long time no see, nice to meet you again!
@martin-thissen Жыл бұрын
I know, it has been some time. But it’s good to be back :-)
@javiergimenezmoya86 Жыл бұрын
I would like to know how an MoE model could be split into multiple expert models, e.g. selecting only the layers that are experts in translating between two languages.
@martin-thissen Жыл бұрын
That's a great question. I saw a paper investigating exactly this, and they found that the experts don't diversify to become experts in a specific language. But obviously it would be interesting to see whether this holds true for the Mixtral model too; if not, it would be an elegant way to reduce the memory needed to load the model. Might give it a try and check how crucial it is to use all eight experts.
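For anyone curious what the routing under discussion looks like, here is a toy, heavily simplified sketch of a Mixtral-style sparse MoE block (not the real Mixtral implementation, whose experts are gated SwiGLU FFNs). Dropping experts would amount to shrinking the `experts` list and restricting the router to the ones that remain.

```python
# Toy sparse MoE block: a router scores all experts per token and only the
# top-k experts' feed-forward networks actually run for that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySparseMoE(nn.Module):
    def __init__(self, dim=4096, hidden=14336, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, dim)
        logits = self.router(x)                         # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)            # renormalize their scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():       # run each selected expert once
                mask = idx[:, k] == e
                out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

moe = ToySparseMoE(dim=64, hidden=128)                  # tiny dims so it runs anywhere
print(moe(torch.randn(5, 64)).shape)                    # torch.Size([5, 64])
```

The default sizes mirror Mixtral's published config (4096 hidden size, 14336 FFN size, 8 experts, top-2 routing), while the tiny dimensions in the example keep the sketch runnable on any machine.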
@prabhakaranutube Жыл бұрын
Is it possible to run it with Ollama for inference?
@reinerheiner1148 Жыл бұрын
Is it possible to split the model up and load it onto two GPUs instead of one? That way, with 2x 3090, 48GB of VRAM could be used locally. Thanks for the video!
@martin-thissen Жыл бұрын
Yes, loading the model onto two or more GPUs is definitely possible. Mistral AI must also have used many GPUs to pre-train the Mixtral model.
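As an illustration, here is a minimal sketch of a two-GPU load, assuming the Hugging Face transformers + accelerate stack (an assumption about the setup; the memory limits are illustrative). `device_map="auto"` shards the layers across the visible GPUs up to the per-device limits, and whatever does not fit spills over to CPU RAM.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                                    # shard layers across all visible GPUs
    max_memory={0: "22GiB", 1: "22GiB", "cpu": "64GiB"},  # headroom for activations / KV cache
)

prompt = "[INST] Explain mixture-of-experts in one paragraph. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

In fp16 the ~90GB of weights still would not fit into 2x24GB, so some layers would land on the CPU; combining this with 8-bit or 4-bit quantization keeps everything on the GPUs.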
@TheReferrer72 Жыл бұрын
You can run it locally in any configuration of system RAM and/or VRAM since this Wednesday, using LM Studio and other programs.
@daan3298 Жыл бұрын
Try having an actual chat with the model. If you get it to be stable, consistent and without repetition, please do share your settings.
@parmesanzero7678 Жыл бұрын
Yes but how did you get such a close shave?
@gangs0846 Жыл бұрын
Is it not possible to run any Mixtral model without a GPU?
@PerFeldvoss Жыл бұрын
Thanks, I tried to load your code in VS Code and run it on an RTX 3070 GPU, but I get "No GPU found. A GPU is needed for quantization." I am aware that I will probably not be able to run this model on my GPU, but I would like to see the code run with some other LLM, so I tried model_id = "ehartford/dolphin-2.5-mixtral-8x7b" - but I still get the "No GPU" error. Using VS Code I am prompted to create a .venv, and I wonder whether you recommend that or conda? (I managed to install protobuf, which apparently is needed for the LLM to work.)
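For anyone hitting the same message: it comes from the bitsandbytes-backed 4-bit/8-bit loading path, which requires a CUDA GPU that PyTorch can actually see. On a machine that does have an NVIDIA card, a common cause is a CPU-only PyTorch wheel inside the freshly created .venv/conda environment. A quick, generic check (a sketch, not Martin's code):

```python
# Sanity check before loading a quantized model: the "No GPU found. A GPU is
# needed for quantization." error means PyTorch cannot see a CUDA device.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"CUDA GPU found: {name} ({vram_gb:.0f} GB VRAM)")
else:
    print("PyTorch sees no CUDA GPU - check that the active .venv/conda env has a "
          "CUDA build of torch installed, not the CPU-only wheel.")
```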
@Stephen5311 Жыл бұрын
I successfully got dolphin-2.5-mixtral-8x7b running on my 3090 in text-generation-webui. I remember I had to turn on trust-remote-code and also use the disk option, as it was taking too much RAM/VRAM (89GB of my 128GB of RAM in total). So even if you fix all the errors, you probably can't run it on your computer.
@utvecklarakademin Жыл бұрын
Running it on an RTX 4070 Ti, but obviously it's quite slow :) Still, it works!
@martin-thissen Жыл бұрын
Glad to hear it's running though! :-)
@anokimusic Жыл бұрын
Thank you! ❤
@martin-thissen Жыл бұрын
You're welcome 😊
@michaelberg7201 Жыл бұрын
Please show how to use a local LLM with GPT Pilot. Otherwise I can see how this could quickly get very expensive.
@altruistx Жыл бұрын
It's a GPT-3.5 alternative, not GPT-4.
@rthardy Жыл бұрын
3.5 is usually more accurate than 4
@mattahmann Жыл бұрын
@rthardy No, it's not.
@ronalddhs3726 Жыл бұрын
So I can run it on an RTX 3090 then. :O Thanks for the video.
@martin-thissen Жыл бұрын
Yes, that should work with 4-bit precision. Let me know if you can make full use of the 32k context size; I could imagine that it will lead to OOM errors.
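For reference, a minimal sketch of a 4-bit (NF4) load on a single 24GB card, assuming the transformers + bitsandbytes stack; the roughly 23GB of quantized weights only just fit, so long prompts toward the 32k context can still push the KV cache into out-of-memory territory, as noted above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 usually loses less quality than plain int4
    bnb_4bit_use_double_quant=True,         # shaves off a bit more VRAM
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # ~23GB of weights: tight on a 3090/4090
)

inputs = tokenizer("[INST] Write a haiku about llamas. [/INST]", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```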
@ronalddhs3726 Жыл бұрын
Thanks for answering; I'm definitely gonna give it a try. I was under the impression that you needed 40GB of VRAM to run it locally :) @martin-thissen
@publicsectordirect982 Жыл бұрын
Doesn't 4-bit precision require 24GB of VRAM, so a 4090? Or have I got that completely wrong? :)
@henrischomacker6097 Жыл бұрын
Got mixtral-8x7b-v0.1.Q3_K_M.gguf (3-bit quantization) working on a 4090 with llama.cpp. Unfortunately it takes about half a minute to get an answer. Maybe not acceptable for chatting, but possibly acceptable for other uses.
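A minimal sketch of that setup through the llama-cpp-python bindings (an assumption - the commenter may be using the llama.cpp binaries directly). `n_gpu_layers=-1` offloads every layer to the 4090, and a modest `n_ctx` keeps VRAM usage down:

```python
# Load a 3-bit Mixtral GGUF with llama-cpp-python and run a single completion.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-v0.1.Q3_K_M.gguf",  # ~20GB 3-bit GGUF from TheBloke
    n_gpu_layers=-1,        # offload every layer to the GPU
    n_ctx=4096,             # context window; larger values cost more VRAM
)

out = llm("Write a short story about llamas.", max_tokens=300)
print(out["choices"][0]["text"])
```

Passing `stream=True` to the call yields tokens as they are generated, which can make the half-minute wait feel much shorter for interactive use.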
@Stephen5311 Жыл бұрын
I got Dolphin-2.5-Mixtral-8x7b running on my 3090. It does around 0.5 tokens/sec. And yeah, the slower speed can rule out some use cases.
@deltawhiplash1614 Жыл бұрын
Hello, where can I find a cheap high-VRAM GPU?
@andikunar7183 Жыл бұрын
Mixtral runs on an M2/M3 Max with >=64GB at 25 tokens/s inference via llama.cpp and a Q5_K_M quantized model from huggingface/TheBloke. Apple silicon shines at single-user inference (comparatively cheap with lots of RAM, only about 2x slower than NVIDIA), but NVIDIA GPUs blow it away at training (about 13x faster).
@dholzric1 Жыл бұрын
@andikunar7183 I'm getting 7 tokens/sec on my dual-3090 setup with 64GB of DDR4 and a 5950X, with Mixtral Q5_K_M. I'm wondering if I need to remove the SLI connector.
@h2o11h2o Жыл бұрын
Do you have a Discord?
@SFgamer Жыл бұрын
90GB takes up a quarter of my HD space.
@petec737 Жыл бұрын
You can buy a 1TB HDD for literally $20 lol
@ROKKor-hs8tg Жыл бұрын
You could have said that its size is 100 GB
@EntangleIT Жыл бұрын
🙏👀
@8eck Жыл бұрын
Mixtral on your computer - you just need a card worth $7,000 😁
@andikunar7183 Жыл бұрын
A bit more than $3,000 is all you need for inference only - use the latest llama.cpp and the quantized models from huggingface/TheBloke. I'm very happy with the 5-bit quantization quality (low perplexity) on my M2 Max Mac Studio, I get 25 tokens/s, and I use approx. 40GB of RAM in total. With reduced precision it could run on a 32GB M2 Max. But Martin's machine really blows away the Macs at training/fine-tuning.
@Stephen5311 Жыл бұрын
I'm running Dolphin-2.5-mixtral-8x7b on my 3090 (worth ~$800). Though you also need 128GB of RAM it seems.
@andikunar7183 Жыл бұрын
@Stephen5311 Yes, you can run it all with llama.cpp on a good CPU if you have enough CPU RAM. However, the tokens/s you get are lower, because of copying between RAM and VRAM, or because you're running entirely on the low-memory-bandwidth, low-FLOPS CPU. You can also speed things up with stronger quantization, but this reduces accuracy (for me, 5-bit seems to be the sweet spot). My M2 Max gets 2-3x the number of tokens/s when using the GPU. And the 3090/4090/... have much more GPU horsepower (plus Intel/AMD CPUs typically have much lower memory bandwidth), so their speed-up vs. the CPU should be even higher.