LLaMA2 Local install on MacBook

104,672 views

Alex Ziskind · 1 year ago

Install LLaMA2 on an Apple Silicon MacBook Pro, and run some code generation.
get TG Pro for yourself: www.tunabellysoftware.com/tgp...
👀 More gear I use: (including course creation and youtube stuff): www.amazon.com/shop/alexziskind
▶️ Apple Silicon and Developers Playlist - • Apple Silicon and Deve...
Repos and models
1. Request access: ai.meta.com/resources/models-...
2. Clone: github.com/facebookresearch/l...
3. Clone: github.com/ggerganov/llama.cpp
Commands
# install llama.cpp's Python dependencies
python3 -m pip install -r requirements.txt
# convert the downloaded Meta weights to a GGML f16 model
python3 convert.py --outfile models/7B/ggml-model-f16.bin --outtype f16 ../../llama2/meta_models/llama-2-7b-chat
# build llama.cpp
make
# quantize the f16 model down to 4 bits
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
# run an interactive chat session
./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
#m2pro #m2max #programming
💻NativeScript training courses - nativescripting.com
(Take 15% off any premium NativeScript course by using the coupon code YT2020)
👕👚iScriptNative Gear - nuvio.us/isn
- - - - - - - - -
❤️ SUBSCRIBE TO MY YOUTUBE CHANNEL 📺
Click here to subscribe: / alexanderziskind
- - - - - - - - -
🏫 FREE COURSES
NativeScript Core Getting Started Guide (Free Course) - nativescripting.com/course/na...
NativeScript with Angular Getting Started Guide (Free Course) - nativescripting.com/course/na...
Upgrading Cordova Applications to NativeScript (Free Course) - nativescripting.com/course/up...
- - - - - - - - -
📱LET'S CONNECT ON SOCIAL MEDIA
ALEX ON TWITTER: / digitalix
NATIVESCRIPTING ON TWITTER: / nativescripting

Comments: 160
@andikunar7183 · 1 year ago
Great video. If running llama.cpp on Apple Silicon Macs, I would recommend building with "LLAMA_METAL=1 make" and invoking main with the "-ngl 1" option. This enables processing on the M1/M2's GPU. Not a big difference on plain M1/M2s, but probably on Pro/Max/Ultra Macs, and it also frees up your CPU for other tasks. If running on an M1/M2 (not Pro, ...) and not on the GPU, you could also invoke it to use only the P-cores by including the flag "-t 4"; for me this was faster most of the time.
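In practice that looks something like this (a sketch, assuming the llama.cpp checkout from the video; run make clean first if you previously built without Metal):
make clean
LLAMA_METAL=1 make
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt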
@AZisk · 1 year ago
Yes, pinning this. Currently the 7B and 13B models work on Metal, but the 70B model only on CPU. Watch this issue for updates and workarounds: github.com/ggerganov/llama.cpp/pull/2276
@EntangleIT · 1 year ago
Make sure to run 'make clean' if you ran make without the LLAMA_METAL=1 parameter beforehand. Also looks like 13b-chat is too big for my VRAM, as it fails. Attempting to use the 7b-chat model next.
@EntangleIT · 1 year ago
Unfortunately my M1 MacBook Air only has 8GB and cannot run the 7b-chat model either.
@FabioEloi · 1 year ago
@@EntangleIT Maybe (maybe) you could modify the quantization value, using half of the memory.
@EntangleIT · 1 year ago
@@FabioEloi Thanks, will try that.
@pablotus1067 · 9 months ago
Been struggling through several online tutorials and finally got LLaMA 2 running locally on my laptop thanks to this video!! Thanks so much!!
@AaoCodeKare · 8 months ago
For people getting the error "vocab size mismatch (model has -1 but tokenizer.model has 32000)": there should be a .json file (probably params.json) inside the llama-2-7b-chat folder. Open the JSON file and set "vocab_size" from -1 to 32000. Here is my params.json file:
{"dim": 4096, "multiple_of": 256, "n_heads": 32, "n_layers": 32, "norm_eps": 1e-06, "vocab_size": 32000}
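If you prefer to patch it from the terminal, something like this should work (a sketch; the path assumes the folder layout used in the video, and macOS sed needs the empty -i argument):
sed -i '' 's/"vocab_size": -1/"vocab_size": 32000/' ../../llama2/meta_models/llama-2-7b-chat/params.json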
@braindates · 8 months ago
thank you!!
@adewaleShotobi · 8 months ago
Thank you
@danielli9224 · 8 months ago
OH MY GOD, YOU JUST SAVED MY NIGHT (I decided not to go to bed until I solved this problem, and then I saw your comment)
@thiruppathis264 · 7 months ago
You are the saviour
@rosszhu1660 · 7 months ago
Got the same problem. Thank you so much!
@swiftpy · 1 year ago
You put a smile on my face. Really happy to see you doing this kind of video.
@MeinDeutschkurs · 1 year ago
Mysteries busted! Helped a lot! Thx, Alex! 🤗
@shivamroy1775 · 1 year ago
You are a godsend. I downloaded these weights on my Mac and have been trying to figure out how to run them locally for the last 3 days. God bless the YouTube algorithm that I finally stumbled across your video. I absolutely love how detailed the video is and how you took the time to explain everything, including the folder structure, in great detail. Thanks
@synaestesia-bg3ew · 1 year ago
What's the minimum laptop spec?
@theoldknowledge6778 · 1 year ago
What an amazing and useful video! Thank you!!! Can’t wait for Part II
@LukeBarousse · 1 year ago
Just got approved by Meta, thanks for this demo Alex!
@AZisk · 1 year ago
zuck blessed thee :)
@nirsarkar · 11 months ago
This was a great video! Thank you so much! I used it to deploy the 7b-chat model on an M2 (not Pro). I have 24GB RAM and a 1TB disk. I used LLAMA_METAL=1 make as suggested by andikunar, thanks for that. I was able to see that main was actually using the GPU during inference. Responses were pretty fast, with no lag as such, though I need to test more. Thanks once again, it is a great starting point for running the models locally.
@peterthornton4462 · 5 months ago
bro you are helping me out right now, thank you for your clear explanation
@aaishikdutta290 · 1 year ago
Hey Alex, would love to see benchmarks on different types of machines for different types of models run locally
@Brocollipy · 10 months ago
Thank you so much for this. Really helpful.
@not_milk · 11 months ago
I was able to get a 13B version running quite performantly on an M1 Pro with 16GB RAM! I had to use a heavily quantized version of it. Q3 runs very fast, and Q4 is mostly performant, although at certain points it hangs for a second.
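For reference, producing a more heavily quantized build looks like the q4_0 step in the description, just with a different quantization type (a sketch; q3_K_M assumes a llama.cpp build recent enough to include the K-quants):
./quantize ./models/13B/ggml-model-f16.bin ./models/13B/ggml-model-q3_K_M.bin q3_K_M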
@williamyang3721 · 11 months ago
Thanks for your video! It helps a lot!
@mitchec100 · 11 months ago
Fantastic video. Managed to get both 7B and 13B working on a MacBook Pro M1 16" with 16GB RAM. I've tried ChatGPT before and was expecting similar results from Llama 2. I'm afraid there is no comparison. Although it works fine with quick responses, it has a mind of its own. With regards to accuracy, it's way off the mark on most things and likes to make stuff up to fill the gaps. I suppose it's more accurate with the 70B model, which won't run on my Mac. Anyway, it's good fun just reading the stuff it comes up with. Keep up the good work
@SaurabhAgrawal-px5tv · 9 months ago
For me it crashes on a 16GB Mac
@YugalJindle · 4 months ago
Super clear, and super well explained. Best explainer.
@AZisk · 4 months ago
Wow, thanks!
@twoodcc · 1 year ago
great video! thanks!
@magnusmuller5223 · 4 months ago
Thank you very much
@asamirid · 1 year ago
good guide, thank you ✅✅
@jamshidmamatov7859 · 8 months ago
huge thanks bro
@echos01 · 10 months ago
Hello Alex, you did a really good job! I still have one question: how do I stop the prompt, i.e. terminate the main shell?
@yusufkemaldemir9393 · 10 months ago
It would be great to see its performance on QA over local PDFs. Since you already have it installed on your computer, would you please make a short video showing its QA performance? Thanks
@som4971 · 9 months ago
thanks for this video
@indylawi5021 · 11 months ago
Thanks for putting together this great video. Given the huge GB sizes and the limited storage on my M1 Mac, I would like to know if anybody has managed to use an external SSD for storing llama?
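The converted .bin files are plain files, so pointing main at an external volume should just work; a sketch (the volume name here is made up):
./main -m /Volumes/MySSD/models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt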
@ninjatogo · 11 months ago
The CPU performance of the 7B model looks pretty snappy. I know we probably shouldn't rely on it for writing code, but do you think it's reliable enough for reading and translating code into pseudo-code or explaining code in human-like sentences? I want to try building an automated documentation system with it.
@bunnyboy4192 · 11 months ago
Thank god you exist!!
@woolfel · 1 year ago
I've been meaning to try it, but have been busy with other stuff. It will be nice to download the model without having to torrent multiple files or figure out which one to download from Hugging Face. Last month I tried to get LLaMA 1 working on my MacBook Pro, but downloading the model files from Hugging Face is less than pleasant. Then there are all the extra steps to get it running. Having to run cmake also felt a little icky. Too many bad memories running cmake on Linux over the last 2 decades :) There's a reason why the M2 Ultra can go up to 192GB of unified memory :) Even quantized versions take a ton of memory to load the entire model.
@BenOgorek · 11 months ago
Refreshing to see a tutorial where you don't have to install UI components to use it. I've got only 8GB on my Mac, so I'm going to have to switch to my Windows laptop with 32GB. Curious if you still need to quantize and all that stuff
@mubasharwarriach7216 · 1 year ago
Can you make a tutorial on how to make a locally running API using this model, which we can use for our websites?
@vladioffe · 11 months ago
Thanks for the video, it was very useful. It would be amazing if you could also create a video on how to make the chat produce summarizations of texts.
@bdarla · 1 year ago
In my opinion, this is your best video! Thank you!
@milangalusic2525 · 1 year ago
Thanks!
@user-ko2xe2kt7t · 7 months ago
Fantastic!!!
@maxlgemeinderat9202 · 8 months ago
Great video! How can I now access the model/tokenizer in a python script?
1 year ago
Damn!!! Now I have a perfect excuse for trading my Air base model for a Pro model with 2 TB!!!!
@Michael-Martell · 10 months ago
Is it better to do this process with Code Llama? Would that one work the same?
@victorbarros1130 · 11 months ago
Wow, a lot of steps. Thanks. Waiting for someone to upload it all into a Docker image. Oo
@Kristin-666 · 5 months ago
Great and detailed video! But can I run it on a MacBook Air M1 with 8GB RAM and 256GB storage? Thanks.
@ummnine6938 · 11 months ago
How did you run the 70B, and how much RAM did it take? And btw, it's very disappointing to hear that even the 70B doesn't produce good code, really.
@jaans3712 · 1 year ago
I am no Python expert, but why use Conda and then install packages with pip? I have used Conda (and Python 😄) a few times and I installed all the environments with "conda install"
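For context, the two can coexist: conda manages the environment and pip installs into it, roughly like this (a sketch; the env name is arbitrary):
conda create -n llama python=3.10
conda activate llama
python3 -m pip install -r requirements.txt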
@AbishaiSingh · 11 months ago
At the end, you said it's not good for code. Can you share, from your experience, which LLM is best for code so far?
@muthukumarannm398 · 11 months ago
I had no idea that convert and quantize were so fast. What machine was used in this video? CPU and RAM?
@JSiuDev · 1 year ago
Thank you!! Going to try this on my M2 Air. Let's see how far this can go, lol.
@abbasali6588 · 11 months ago
Let me know your results please, sad Air user here too :p First time feeling bad for purchasing an Air :(
@JSiuDev · 11 months ago
@abbasali6588 You can run up to 13B if you have 16GB of RAM. I didn't try the 70B because llama.cpp had bugs with 70B at that time. It is not fast, but it works.
@linglebrun7119 · 11 months ago
@abbasali6588 I run Llama 2 7B on my MacBook Air M2 - no problem. I have 16GB RAM.
@Space. · 1 year ago
Great video. I applied a while ago and still didn't get mine :(
@benjaminalonso4630 · 10 months ago
Thanks, I was able to do it on my 2021 M1 Mac... it goes like a plane!! (irony) but holy cow it works! Haha, thanks...
@philipbutler · 1 year ago
7:44 I would assume that it would automatically create the 7B directory, unless you already tried
@karthigeyan88 · 11 months ago
Hi Alex, great video and thanks for that... Could you mention how to make an API out of LLaMA 2 so that we can use it for synthetic data generation, like how we used to do it with the OpenAI API? Thanks in advance.
@SaurabhAgrawal-px5tv · 9 months ago
Did you ever get an answer to this? I am also looking for the same 😄
@karthigeyan88 · 9 months ago
@SaurabhAgrawal-px5tv No, I haven't...
@DNote314 · 9 months ago
Do you still have the same opinion about the code generation with the code-llama model?
@michaeldausmann6736 · 11 months ago
Did anybody have trouble with md5sum? I don't have it on my machine. brew install? alias md5 -r?
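For what it's worth, macOS ships md5 rather than md5sum, and its -r flag prints md5sum-style output, so an alias may be enough (a sketch; consolidated.00.pth is one of the downloaded model files):
alias md5sum='md5 -r'
md5sum consolidated.00.pth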
@Tony_Hughes · 11 months ago
I've got a Mac Pro with a Vega II, 32GB video memory, and the 12-core CPU with 96GB RAM. Will I be able to follow these instructions on an Intel Mac Pro 2019?
@DevanSabaratnam · 1 year ago
Will we need the latest Xcode CLI tools to compile?
@yusufkemaldemir9393 · 10 months ago
Was that useful for information retrieval / QA over local PDFs?
@Shivam-bi5uo · 6 months ago
A doubt: if I use Ollama, aren't the models already quantized? I see that llama2-uncensored and the other llama models are only 3.8GB.
@LouRao · 11 months ago
Great video, although I wish you had gone slower through the step-by-step methods. A better explanation from start to end would have made for better understanding.
@jeanluchaurais1506 · 11 months ago
Great video. Can you show how to train the model with our own data?
@user-ob7fd8hv4t · 7 months ago
I also want to know. I want to train the model with some of my own data; because the data is sensitive, I want to deploy the model locally.
@garethgaston2809 · 8 months ago
My compile fails unless I add "LDLIBS=-lstdc++" to the environment. Not sure why it's not in the Makefile.
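That is, something like this before building (a sketch):
LDLIBS=-lstdc++ make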
@RobertAlexanderRM · 11 months ago
Great video. I have converted all of my Python projects to Poetry. Bye Conda, bye pipenv :) I also prefer using pyenv to manage my different Python versions.
@BenOgorek · 11 months ago
Did you get it to work? It seemed like the last advantage of Conda was dealing with GPU stuff, and I was wondering if LLMs would breathe new life into it
@seanbrown9900 · 9 months ago
How do I exit out of talking to an LLM in my terminal once I have this up and running?
@DataLogicSolution · 10 months ago
How do I increase the prompt size limit? I am getting this error: main: error: prompt is too long (5717 tokens, max 508)
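The default context is only 512 tokens; main accepts a -c/--ctx-size flag, so a longer-context run looks something like this (a sketch; Llama 2 supports contexts up to 4096):
./main -m ./models/7B/ggml-model-q4_0.bin -c 4096 -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt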
@sanajitdasadhikary_048_cse6 · 6 months ago
Can you please make a video on how to install a llama model on Windows 11?
@mahaasiri5949 · 1 year ago
Great
@avinrique · 1 year ago
Look, I am a Linux user and it worked on mine, and your instructions were damn clear. But I have an RTX 3060 graphics card with 6GB VRAM and I want to run this thing on my GPU; how can we do that? And also, can it be fine-tuned later?
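llama.cpp of that era had a cuBLAS build flag for NVIDIA GPUs, plus the same layer-offload option as on Metal; a sketch (assumes the CUDA toolkit is installed, and -ngl may need lowering to fit 6GB of VRAM):
make clean
LLAMA_CUBLAS=1 make
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 32 -n 1024 --color -i -r "User:" -f ./prompts/chat-with-bob.txt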
@SYEDNURULHasan1789 · 6 months ago
What if someone wants to use it through LM Studio instead of doing all the quantization manually? Will there be any problem in doing so?
@Videodecumple · 11 months ago
Is the terminal the only way to interact with the model? Can I use Python instead?
@user-fm5yy2sx2p · 10 months ago
What are your Mac's specs (RAM, GPU etc.)?
@shiyammosies5975 · 6 days ago
I want to use AIDE locally when I do development in VS Code - I see that the Claude 3.5 / Gemini API integration works well. Do I need to run any LLM locally that might help me with code? In the future there might be some. I'm asking because I want to know which M2 mini to buy: the M2 mini with 8GB/16GB, or the M2 Pro mini with 16GB or even more? Please suggest, @alex
@am0x01 · 9 months ago
Why not use Ollama to run LLaMA 2 on a MacBook?
@RendonJr · 2 months ago
Did you make a video with Llama 3?
@BinuJasim · 11 months ago
Great video. How much of the low-quality code generation can be attributed to quantization?
@BenOgorek · 11 months ago
Good question. I was also curious as to what happened there
@stargator4945 · 5 months ago
There is an app that uses these models directly when downloaded from HF; it is called FreeChat. But I have a question: I tested about 30 different models on an M2 Ultra machine with enough RAM. I got the best results with the Mixtral 8x7B model, but only in English. I tried some German models and they were bad (better in English than in German); they were actually worse in German than the Mixtral model. Has anybody found a model that works properly in German?
@swiftpy · 1 year ago
👍🏽 Great 👍🏽👍🏽
@ggopi767 · 10 months ago
Can we run open-source LLMs like Llama or Stable Diffusion on a Mac with an AMD Radeon R9 M380 2GB and a 3.2 GHz quad-core Intel Core i5?
@mahdi-hasan · 10 months ago
Can you please review Code Llama?
@smuskal · 11 months ago
Any ideas on how to hook LangChain up to the local model(s)?
@you__shef · 5 months ago
Will this work with an older i7 Mac, or just the new-gen M chips?
@alifakhraee · 8 months ago
Got this error on the convert step, any idea how to resolve it? Vocab size mismatch (model has -1, but models/tokenizer.model has 32000).
@pasikeranen · 8 months ago
Just tried this and got the same error... No idea what it's about; perhaps something is off with the latest llama.cpp? Hopefully someone can point us in the right direction.
@bekagelashvili2904 · 4 months ago
Do you know if I can use NVIDIA Chat with RTX with this AI agent?
@lhxperimental · 7 months ago
Why do they have a special link for each person? Do they modify the AI model for each user? If not, the hash of the file should match for all users on the same version of the file. Not sure how, but can it call home? I know it's just a file that gets consumed by your software, but... somehow this looks sus.
@KevinKentor · 4 months ago
Is it possible to customize the model?
@jinluwang5671 · 2 months ago
Can you please do one for llama3?
@mckengineer5727 · 1 year ago
How about Bard?
@user-hz7qj7iu4u · 1 year ago
Fantastic video Alex, thank you! I managed to run the 7B and 13B models. With the 70B model I was not so lucky. The main error looks like this: error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected 8192 x 8192, got 8192 x 1024. Could you give me a hint on how to solve this?
@whistl034 · 11 months ago
I have the exact same problem. I searched the web and only found that someone had submitted a bug report about this issue.
@jackmarius5423 · 11 months ago
Same problem
@daves.software · 11 months ago
Add -gqa 8
@HACKTIONREPLAY · 11 months ago
@jackmarius5423 @whistl034 Try this command instead: ./main -m ./models/70B/ggml-model-q4_0.bin -n 4096 -gqa 8
@mynameisZhenyaArt_ · 7 months ago
Can you run a 70B version of Llama 2?
@RunForPeace-hk1cu · 1 year ago
I applied and never got an email...
@Uglybob2012 · 10 months ago
I have a laptop with 6GB RAM and I couldn't run any of these locally.
@massimodileo7169 · 1 year ago
What kind of machine (CPU and memory) did you try it on? I would love to try it on my Mac mini M1 with 8GB, but I also have a MacBook Pro 16 M1 with 16GB… just in case
@Chidorin · 1 year ago
AFAIK it needs 16GB minimum
@Monawwar · 5 months ago
I am getting the following error after running convert.py:
raise FileNotFoundError(f"{vocabtype} vocab not found.")
FileNotFoundError: spm vocab not found.
@d3vilscry666 · 5 months ago
Are you missing a file?
@zihuatanejo7741 · 11 months ago
Managed to run the 7B-chat model, but in interactive mode (using the script in the examples folder) it sometimes gets stuck. I'm using a 2021 M1 Pro MBP, 16GB RAM
@SaurabhAgrawal-px5tv · 9 months ago
Same for me, did you find any solution?
@eclipselu · 11 months ago
Thanks for the video! I'm curious how the 70B model answers the prime number question, or whether it can even be loaded into the memory of an MBP.
@aniruddhlaharia1275 · 10 months ago
I did all of that but it is so bad at writing code 😭
@itrcz · 1 year ago
Can anyone explain why increasing the context affects performance non-linearly? When I start with a 1024 context it runs smoothly, but when I increase it up to 4096 the app just freezes.
@Chidorin · 1 year ago
What's your RAM? AFAIK it needs 16GB minimum, so it can be more for 4096.
@JC-td6ud · 7 months ago
"llama-2-7b-chat does not appear to have a file named config.json" - I am seeing this error during convert.py. Does anyone know how to fix it?
@shawnwang7032 · 6 months ago
Same here. Looking for the solution.
@zelopes99 · 6 months ago
@JC-td6ud @shawnwang7032 I found somewhere that you need to download the llama-2-7b-chat-hf model. You can find it in the Hugging Face repository. I downloaded each file individually because I couldn't manage to clone it with git
@davehenokhliong · 6 months ago
@zelopes99 Did it work for you after changing to the HF model?
@MudroZvon · 1 year ago
How much memory do you need?
@Chidorin · 1 year ago
AFAIK 16GB at least
@MrDragos360 · 1 year ago
I did a course about ML during my master's degree... hated it so much. I couldn't wrap my head around it and I didn't understand a thing. I will stick to fullstack web and mobile dev and keep away from this crazy ML/AI thing. Just running those models seems so complicated; imagine writing the code for them. Also, it seems everything about ML is in Python or C/C++, languages that I never understood. C#, Java and JavaScript (especially with TypeScript) are so easy and pleasant to write in.
@boraoku · 11 months ago
After some testing, I can confidently say that even compared to free GPT 3.5, LLAMA2 is a waste of time.
@yagoa · 1 year ago
Can you convert it to Core ML?
@AZisk · 1 year ago
one would need to be way smarter than me to do that :)
@SashaBaych · 10 months ago
Your attitude to criticism is exemplary.
@AZisk · 10 months ago
I keep seeing your comments show up, and when I go to reply, they are gone. I think YouTube is deleting them. Perhaps try tempering the wording to be more constructive and not inflammatory.
@de-ar · 3 months ago
9:29 That's what He said 🤐
@El.Desarrollador · 11 months ago
Thanks to you and ChatGPT I was able to install it; there were some steps I needed to do for the tokenizer. But now I'm not able to run the 70b model. I did the same steps with the 70b paths, but it's not running; I got an error when I ran the main command. I was able to run the 7b parameter model with no issue, but not the 70b. By the way, the 70b is not that useful... it definitely doesn't follow instructions well, and it sometimes starts writing in other languages. It's really not good at following instructions. I'll keep my ChatGPT Plus subscription for now...
@SaiyanJin85 · 1 year ago
Well, it's Facebook, what did you expect?? :D