How to Run LLaMA-2-70B on Together AI

12,704 views

Sam Witteveen

A day ago

Comments: 47
@StephenPasco · A year ago
Sam, you have some of the best LLM videos on YT. Thanks so much for making this content.
@MrFaaaaaaaaaaaaaaaaa · A year ago
I wonder what the fine-tuned API costs will be. That's where the value is for me.
@Drone256 · A year ago
I wonder why you have to "spin up" the model. They could keep some running, because with any customer volume at all Llama-2-70b will be in constant, predictable demand.
@noobicorn_gamer · 8 months ago
They lowered the cost even further now. Exciting times.
@tiagomarante7720 · 11 months ago
Hi Sam, could you make a video about Together AI with LangChain agents? I've tried everything, but Together AI doesn't seem to support agents. With GPT-3.5 it works fine, but when I change the model it never works :(. I feel like LangChain agents were made only with OpenAI models in mind and not other LLMs. Btw, keep up the good work!
@fabianaltendorfer11 · A year ago
Can't run your class: PydanticUserError: If you use `@root_validator` with pre=False (the default) you MUST specify `skip_on_failure=True`. Note that `@root_validator` is deprecated and should be replaced with `@model_validator`. Any ideas? Thanks!
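The fix that error message points at can be sketched as follows, assuming Pydantic v2 (the ChatConfig class here is a hypothetical stand-in, not the class from the video): replace the deprecated @root_validator with @model_validator.

```python
from pydantic import BaseModel, model_validator


class ChatConfig(BaseModel):  # hypothetical example class
    model: str
    max_tokens: int = 256

    # Pydantic v2 replacement for the deprecated @root_validator;
    # mode="after" runs once all fields have been parsed.
    @model_validator(mode="after")
    def check_limits(self):
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        return self
```

Alternatively, keeping @root_validator and passing skip_on_failure=True silences the error, but migrating to @model_validator is the forward-compatible fix.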
@dare2dream148 · A year ago
Thank you Sam for another great post! Super helpful for me to get up to speed with the large Llama 2 models. Keen to learn how it supports parallel data processing. For example, if I have several prompts in hand, is there any way not to process them sequentially?
@samwitteveenai · A year ago
Good question. I think it should work in batches, but I need to check it out.
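Until server-side batching is confirmed, one workaround is client-side concurrency: fire the prompts in parallel threads instead of sequentially. A minimal sketch, where query is a placeholder standing in for the real API call:

```python
from concurrent.futures import ThreadPoolExecutor


def query(prompt: str) -> str:
    # Placeholder for a real API call (e.g. an HTTP POST to the
    # inference endpoint); here it just echoes for illustration.
    return f"response to: {prompt}"


prompts = ["What is RLHF?", "Explain LoRA.", "Define quantization."]

# Run all requests concurrently instead of one after another;
# max_workers caps the number of in-flight requests.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(query, prompts))
```

pool.map preserves the input order, so results line up with prompts even though the calls overlap in time.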
@souvickdas5564 · A year ago
How can I embed documents using Together AI models so that I can store those embeddings in a vector DB? Instead of using an Instructor model for embeddings, how can we do it using Together AI's models? Thanks for your awesome video.
@samwitteveenai · A year ago
I am not sure if Together is hosting any embedding models; I will keep an eye out for this. I approached them about some other stuff and they never bothered to get back to me.
@mahmudyassin6422 · A year ago
Amazing, Sam. You are great!
@geekyprogrammer4831 · A year ago
I know Chris Re. He is teaching Machine Learning at Stanford!
@JesFinkJensen · A year ago
Very useful. Thanks!
@fontenbleau · A year ago
It's really strange: the 70B Llama 2 didn't impress me as much as the 13B Llama 2, or even the original 65B. Maybe because they cleaned it of hallucinations so much that it became very empty and kind of dumb. The most interesting thing about the 70B was its hallucinations at that scale. Such hallucinations I now see only in the 176-billion-parameter Bloomz, which right from the start opened the discussion with our "relationship". Anyway, I see a great replacement for Llama 2 70B in the new Platypus 2 70B, which is very scientific but quite terse. The golden ratio really would be a great scientific model with a very long token limit that isn't so demanding on hardware. I'm using the 8-bit GGML versions, which are indistinguishable from the original float16, and they run great on 14-core CPUs. Funnily enough, I've still never used a GPU with them.
@samwitteveenai · A year ago
Interesting. What were the conditions and types of prompts causing the hallucinations?
@fontenbleau · A year ago
@samwitteveenai In Bloomz it happens right away with the standard trick of asking about the topic of a previous discussion (which never took place, so the model makes it up), but I haven't managed to probe it further. Bloomz is very big and I've only accessed it through the Petals service, where it's very slow and short in its responses.
@SteveSamuels-m1y · A year ago
Great video. I've been using Llama 2 70B chat on Together and it seems very bad. I get a lot of duplicated sentences and lots of malformed words. I don't get the same issue on Replicate. Any idea why the performance is so bad? It's a real shame, as I really like the look of Together AI.
@samwitteveenai · A year ago
Check the prompt to see if they are doing the prompt wrapping for you. They should be the same model, so my guess is it's something to do with the prompts or the sampling.
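The single-turn wrapping being discussed can be sketched like this, following the [INST]/<<SYS>> format from Meta's llama repo (the helper name is ours, not part of any API):

```python
def wrap_llama2(system: str, user: str) -> str:
    """Wrap a single-turn message in the Llama-2 chat format:
    the system prompt sits inside <<SYS>> tags, and the whole
    turn is enclosed in [INST] ... [/INST]."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = wrap_llama2("You are a helpful assistant.", "Hi!")
```

If a hosted endpoint already applies this wrapping server-side, wrapping again client-side produces doubled tags, which is one plausible cause of the malformed output described above.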
@SteveSamuels-m1y · A year ago
@samwitteveenai Thanks for the quick response. I don't think there is prompt wrapping; it is just the default prompt, which I have tried changing. I've now also tested hosted versions of Llama 2 70B Chat on a couple of other hosted services and I've seen this behaviour a few times, as well as having read reports of it being seen by others, especially as it approaches its max context length.
@samwitteveenai · A year ago
Interesting, I haven't seen this personally. It could make sense if it is at the end of the context window.
@arockdurai5110 · A year ago
Awesome video.
@paulocoronado2376 · A year ago
Amazing! 👏👏👏
@guanjwcn · A year ago
Thanks for the video, Sam. Does it mean that if we need to run the 70B model on premise, we would need 4 x H100 GPUs?
@samwitteveenai · A year ago
If you want to run it at full precision you probably need at least 4 A100s to get a speedy service.
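The back-of-envelope math behind that sizing can be sketched as follows (the 20% overhead factor is an assumption for illustration; real serving needs extra headroom for KV cache, activations, and throughput):

```python
import math

params_b = 70          # Llama-2-70B parameter count, in billions
bytes_per_param = 2    # float16 weights

weights_gb = params_b * bytes_per_param   # ~140 GB for the weights alone
with_overhead_gb = weights_gb * 1.2       # assumed ~20% runtime overhead

# Minimum number of 80 GB cards just to hold the model in memory:
cards_needed = math.ceil(with_overhead_gb / 80)
```

Since no single GPU holds 140+ GB, the weights must be sharded across a multi-GPU node, which is why hosted inference is attractive for models of this size.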
@guanjwcn · A year ago
@samwitteveenai Thanks for the prompt reply, Sam. Pardon my ignorance here: if my chatbot needs to be hosted on premise, and there could be hundreds of customers using the chatbot simultaneously at a particular point in time, what would the hardware requirements be? Would 4 x H100 GPUs be sufficient, or would I need something like hundreds of H100 GPUs?
@s.patrickmarino7289 · A year ago
You could run all the surrounding software, such as the context memory and UI logic, locally but run the actual model in the cloud. This would give you some advantage in terms of privacy and such. If you need serious privacy, you might be better off running the back end of the model on AWS. That can be far more expensive if you have light usage, but far less if you are talking about heavy use.
@guanjwcn · A year ago
@s.patrickmarino7289 Thank you for sharing the insights.
@tonyli7014 · A year ago
I tried the MPT 30B and Falcon 40B. Both have timeout errors.
@samwitteveenai · A year ago
I haven't tried those. I think they should work; just make sure you turn them on first, etc.
@s.patrickmarino7289 · A year ago
Is Together better than Replicate? What is the major difference?
@samwitteveenai · A year ago
This is a lot cheaper than Replicate in my testing.
@ngoduyvu · A year ago
Interesting, I hope they keep the $0 hosting.
@samwitteveenai · A year ago
It seems they have adopted that now
@leloulouduski · A year ago
Sam, thank you so much for your amazing videos. I have one question for you. We are all still looking at functionality, but what about scalability? Do you think we will be able to use these incredibly powerful models in a real context at high frequency, like hundreds of requests/prompts per second? Even with fine-tuning, pruning, quantization, ...? Thanks in advance.
@IvarDaigon · A year ago
Nice video. Looks like the pricing has changed again; it's now based on tiers, where each tier is a range of parameter sizes. Their highest tier, 40-70B parameters, is now $0.003 per 1K tokens, which is twice the cost of ChatGPT 3.5 (4K context) at only $0.0015 per 1K tokens, and GPT-3.5 has 175B parameters. Looks like they will have to reduce the price significantly to be competitive with OpenAI and Anthropic, even though they are offering open-source models that they themselves did not have to create. My expectation is that if a model has half as many parameters and uses half as much RAM, as well as half as many GPU cycles, as a competitor's, then the price I pay should be about half as much (or less) too. Still, it's good for developers to easily test out the different models, and more competition is ALWAYS welcome, but as a business they will need to do better to remain viable.
@solomonaryeetey7370 · A year ago
Exxxxxxxactly what I have been looking for!!! Thank you Sam!
@paraconscious790 · A year ago
Amazing, Sam, this is so very good to try out and do some serious stuff with. Thanks a lot! Bless.
@micbab-vg2mu · A year ago
Sam thank you for the video - I will try it.
@rajivraghu9857 · A year ago
Good one. Its APIs are also pretty fast!
@jacobgoldenart · A year ago
Hi Sam, thanks for all the great videos! I have a question about the prompt template. There seem to be all sorts of ideas about how to properly prompt Llama 2. Your prompt format looks like this:

[INST] You are a Neuroscientist with a talent for explaining very complex subjects to lay people
Chat History: {chat_history}
Human: {user_input}
Assistant:[/INST]

Whereas the Hugging Face guide to Llama 2 prompting has a slightly different format, wrapping each whole turn and closing the [/INST] tag right after the {user_input}, while yours includes Assistant: within the [/INST]. Here's Hugging Face's version:

[INST] {{ system_prompt }} {{ user_msg_1 }} [/INST] {{ model_answer_1 }} [INST] {{ user_msg_2 }} [/INST] ...

One more thing: there is a recent Replicate blog post on (you guessed it!) a guide to prompting Llama 2, where they say you don't want to use Human: (to denote the human is speaking), and you only want to wrap the human's input in [INST], not the AI's. Here's their example:

correct_prompt_long = """\
[INST] Hi! [/INST] Hello! How are you? [INST] I'm great, thanks for asking. Could you help me with a task? [/INST] Of course, I'd be happy to help! [INST] How much wood could a woodchuck chuck or something like that? [/INST]
"""

So I'm really confused as to the correct way! :)
@samwitteveenai · A year ago
Mine isn't exactly how the chat models do it in the paper. Generally I explore for a while until I find what seems to work best. Originally I called it "AI" rather than "Assistant", but out of the blue it kept replying with "assistant: reply" etc. If you take a look at the Meta code you can see how they did it. I deliberately added things like "Chat History" to make it more compliant with the way LangChain does it. Usually I try to do what the paper or model card says, and then in cases like this I tweak. I hadn't seen the info about "Human"; I think I tried "User:" at one point and it didn't seem to work well. Hope this helps a bit.
@bleo4485 · A year ago
Sam, thanks for this. I think Together AI made another positive change since your video. I just signed up and received 25k credits 😍
@samwitteveenai · A year ago
Yes, I saw this the day after I recorded the vid. Even better for people signing up now.
@K.F-R · A year ago
This video did not start with "Okay....". I feel cheated. 😉
@samwitteveenai · A year ago
Lol
@dare2dream148 · A year ago
Haha same here!