How to Run LLaMA-2-70B on Together AI

12,704 views

Sam Witteveen

A day ago

Comments: 47
@StephenPasco · A year ago
Sam, you have some of the best LLM videos on YT. Thanks so much for making this content.
@MrFaaaaaaaaaaaaaaaaa · A year ago
I wonder what the fine-tuned API costs will be. That's where the value is for me.
@Drone256 · A year ago
I wonder why you have to "spin up" the model. They could keep some running, because with any customer volume at all Llama-2-70b will be in constant, predictable demand.
@noobicorn_gamer · 8 months ago
They lowered the cost even further now. Exciting times.
@tiagomarante7720 · 11 months ago
Hi Sam, could you make a video about Together AI with LangChain agents? I've tried everything, but Together AI doesn't seem to support agents. With GPT-3.5 it works fine, but when I change the model it never works :(. I feel like LangChain agents were made only with OpenAI models in mind and not other LLMs. Btw, keep up the good work!
@fabianaltendorfer11 · A year ago
Can't run your class: PydanticUserError: If you use `@root_validator` with pre=False (the default) you MUST specify `skip_on_failure=True`. Note that `@root_validator` is deprecated and should be replaced with `@model_validator`. Any ideas? Thanks!
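The fix that error message points at can be sketched as follows, assuming Pydantic v2 (the ChatConfig class here is a hypothetical stand-in, not the class from the video): replace the deprecated @root_validator with @model_validator.

```python
from pydantic import BaseModel, model_validator


class ChatConfig(BaseModel):  # hypothetical example class
    model: str
    max_tokens: int = 256

    # Pydantic v2 replacement for the deprecated @root_validator;
    # mode="after" runs once all fields have been parsed.
    @model_validator(mode="after")
    def check_limits(self):
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        return self
```

Alternatively, keeping @root_validator and passing skip_on_failure=True silences the error, but migrating to @model_validator is the forward-compatible fix.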
@dare2dream148 · A year ago
Thank you Sam for another great post! Super helpful for me to get up to speed with the large Llama 2 models. Keen to learn how it supports parallel data processing. For example, if I have several prompts in hand, is there any way not to process them sequentially?
@samwitteveenai · A year ago
Good question. I think it should work in batches, but I need to check it out.
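Until server-side batching is confirmed, one workaround is client-side concurrency: fire the prompts in parallel threads instead of sequentially. A minimal sketch, where query is a placeholder standing in for the real API call:

```python
from concurrent.futures import ThreadPoolExecutor


def query(prompt: str) -> str:
    # Placeholder for a real API call (e.g. an HTTP POST to the
    # inference endpoint); here it just echoes for illustration.
    return f"response to: {prompt}"


prompts = ["What is RLHF?", "Explain LoRA.", "Define quantization."]

# Run all requests concurrently instead of one after another;
# max_workers caps the number of in-flight requests.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(query, prompts))
```

pool.map preserves the input order, so results line up with prompts even though the calls overlap in time.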
@souvickdas5564 · A year ago
How can I embed documents using Together AI models so that I can store those embeddings in a vector DB? Instead of using an Instructor model for embeddings, how can we do it using Together AI's models? Thanks for your awesome video.
@samwitteveenai · A year ago
I am not sure if Together is hosting any embedding models; I will keep an eye out for this. I approached them about some other stuff and they never bothered to get back to me.
@mahmudyassin6422 · A year ago
Amazing, Sam. You are great!
@geekyprogrammer4831 · A year ago
I know Chris Re. He is teaching Machine Learning at Stanford!
@JesFinkJensen · A year ago
Very useful. Thanks!
@fontenbleau · A year ago
It's really strange: the 70B Llama 2 didn't impress me as much as the 13B Llama 2, or even the original 65B. Maybe because they cleaned it of hallucinations so much that it became very empty and kind of dumb. The most interesting thing about the 70B was its hallucinations at that scale. Such hallucinations I now see only in the 176-billion-parameter Bloomz, which right from the start opened the discussion with our "relationship". Anyway, I see a great replacement for Llama 2 70B in the new Platypus 2 70B, which is very scientific but quite terse. The golden ratio really would be a great scientific model with a very long token limit that isn't so demanding on hardware. I'm using the 8-bit GGML versions, which are indistinguishable from the original float16, and they run great on 14-core CPUs. Funnily enough, I've still never used a GPU with them.
@samwitteveenai · A year ago
Interesting. What were the conditions and types of prompts causing the hallucinations?
@fontenbleau · A year ago
@samwitteveenai In Bloomz it happens right away with the standard trick of asking about the topic of a previous discussion (which never took place, so the model makes it up), but I haven't managed to probe it further. Bloomz is very big and I've only accessed it through the Petals service, where it's very slow and short in its responses.
@SteveSamuels-m1y · A year ago
Great video. I've been using Llama 2 70B chat on Together and it seems very bad. I get a lot of duplicated sentences and lots of malformed words. I don't get the same issue on Replicate. Any idea why the performance is so bad? It's a real shame, as I really like the look of Together AI.
@samwitteveenai · A year ago
Check the prompt to see if they are doing the prompt wrapping for you. They should be the same model, so my guess is it's something to do with the prompts or the sampling.
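The single-turn wrapping being discussed can be sketched like this, following the [INST]/<<SYS>> format from Meta's llama repo (the helper name is ours, not part of any API):

```python
def wrap_llama2(system: str, user: str) -> str:
    """Wrap a single-turn message in the Llama-2 chat format:
    the system prompt sits inside <<SYS>> tags, and the whole
    turn is enclosed in [INST] ... [/INST]."""
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = wrap_llama2("You are a helpful assistant.", "Hi!")
```

If a hosted endpoint already applies this wrapping server-side, wrapping again client-side produces doubled tags, which is one plausible cause of the malformed output described above.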
@SteveSamuels-m1y · A year ago
@samwitteveenai Thanks for the quick response. I don't think there is prompt wrapping; it is just the default prompt, which I have tried changing. I've now also tested hosted versions of Llama 2 70B Chat on a couple of other hosted services and I've seen this behaviour a few times, as well as having read reports of it being seen by others, especially as it approaches its max context length.
@samwitteveenai · A year ago
Interesting, I haven't seen this personally. It could make sense if it is at the end of the context window.
@arockdurai5110 · A year ago
Awesome video.
@paulocoronado2376 · A year ago
Amazing! 👏👏👏
@guanjwcn · A year ago
Thanks for the video, Sam. Does it mean that if we need to run the 70B model on premise, we would need 4 x H100 GPUs?
@samwitteveenai · A year ago
If you want to run it at full precision you probably need at least 4 A100s to get a speedy service.
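The back-of-envelope math behind that sizing can be sketched as follows (the 20% overhead factor is an assumption for illustration; real serving needs extra headroom for KV cache, activations, and throughput):

```python
import math

params_b = 70          # Llama-2-70B parameter count, in billions
bytes_per_param = 2    # float16 weights

weights_gb = params_b * bytes_per_param   # ~140 GB for the weights alone
with_overhead_gb = weights_gb * 1.2       # assumed ~20% runtime overhead

# Minimum number of 80 GB cards just to hold the model in memory:
cards_needed = math.ceil(with_overhead_gb / 80)
```

Since no single GPU holds 140+ GB, the weights must be sharded across a multi-GPU node, which is why hosted inference is attractive for models of this size.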
@guanjwcn · A year ago
@samwitteveenai Thanks for the prompt reply, Sam. Pardon my ignorance here: if my chatbot needs to be hosted on premise, and there could be hundreds of customers using the chatbot simultaneously at a particular point in time, what would the hardware requirements be? Would 4 x H100 GPUs be sufficient, or would I need something like hundreds of H100 GPUs?
@s.patrickmarino7289 · A year ago
You could run all the surrounding software, such as the context memory and UI logic, locally but run the actual model in the cloud. This would give you some advantage in terms of privacy and such. If you need serious privacy, you might be better off running the back end of the model on AWS. That can be far more expensive if you have light usage, but far less if you are talking about heavy use.
@guanjwcn · A year ago
@s.patrickmarino7289 Thank you for sharing the insights.
@tonyli7014 · A year ago
I tried the MPT 30B and Falcon 40B. Both have timeout errors.
@samwitteveenai · A year ago
I haven't tried those. I think they should work; just make sure you turn them on first, etc.
@s.patrickmarino7289 · A year ago
Is Together better than Replicate? What is the major difference?
@samwitteveenai · A year ago
This is a lot cheaper than Replicate in my testing.
@ngoduyvu · A year ago
Interesting, I hope they keep the $0 hosting.
@samwitteveenai · A year ago
It seems they have adopted that now
@leloulouduski · A year ago
Sam, thank you so much for your amazing videos. I have one question for you. We are all still looking at functionality, but what about scalability? Do you think we will be able to use these incredibly powerful models in a real context at high frequency, like hundreds of requests/prompts per second? Even with fine-tuning, pruning, quantization, ...? Thanks in advance.
@IvarDaigon · A year ago
Nice video. Looks like the pricing has changed again; it's now based on tiers, where each tier is a range of parameter sizes. Their highest tier, 40-70B parameters, is now $0.003 per 1K tokens, which is twice the cost of ChatGPT 3.5 (4K context) at only $0.0015 per 1K tokens, and GPT-3.5 has 175B parameters. Looks like they will have to reduce the price significantly to be competitive with OpenAI and Anthropic, even though they are offering open-source models that they themselves did not have to create. My expectation is that if a model has half as many parameters and uses half as much RAM, as well as half as many GPU cycles, as a competitor's, then the price I pay should be about half as much (or less) too. Still, it's good for developers to easily test out the different models, and more competition is ALWAYS welcome, but as a business they will need to do better to remain viable.
@solomonaryeetey7370 · A year ago
Exxxxxxxactly what I have been looking for!!! Thank you Sam!
@paraconscious790 · A year ago
Amazing, Sam, this is so very good to try out and do some serious stuff with. Thanks a lot! Bless.
@micbab-vg2mu · A year ago
Sam thank you for the video - I will try it.
@rajivraghu9857 · A year ago
Good one. Its APIs are also pretty fast!
@jacobgoldenart · A year ago
Hi Sam, thanks for all the great videos! I have a question about the prompt template. There seem to be all sorts of ideas about how to properly prompt Llama 2. Your prompt format looks like this:

[INST] You are a Neuroscientist with a talent for explaining very complex subjects to lay people
Chat History: {chat_history}
Human: {user_input}
Assistant:[/INST]

Whereas the Hugging Face guide to Llama 2 prompting has a slightly different format, wrapping each whole turn and closing the [/INST] tag right after the {user_input}, while yours includes Assistant: within the [/INST]. Here's Hugging Face's version:

[INST] {{ system_prompt }} {{ user_msg_1 }} [/INST] {{ model_answer_1 }} [INST] {{ user_msg_2 }} [/INST] ...

One more thing: there is a recent Replicate blog post on (you guessed it!) a guide to prompting Llama 2, where they say you don't want to use Human: (to denote the human is speaking), and you only want to wrap the human's input in [INST], not the AI's. Here's their example:

correct_prompt_long = """\
[INST] Hi! [/INST] Hello! How are you? [INST] I'm great, thanks for asking. Could you help me with a task? [/INST] Of course, I'd be happy to help! [INST] How much wood could a woodchuck chuck or something like that? [/INST]
"""

So I'm really confused as to the correct way! :)
@samwitteveenai · A year ago
Mine isn't exactly how the chat models do it in the paper. Generally I explore for a while until I find what seems to work best. Originally I called it "AI" rather than "Assistant", but out of the blue it kept replying with "assistant: reply" etc. If you take a look at the Meta code you can see how they did it. I deliberately added things like "Chat History" to make it more compliant with the way LangChain does it. Usually I try to do what the paper or model card says, and then in cases like this I tweak. I hadn't seen the info about "Human"; I think I tried "User:" at one point and it didn't seem to work well. Hope this helps a bit.
@bleo4485 · A year ago
Sam, thanks for this. I think Together AI made another positive change since your video. I just signed up and received 25k credits 😍
@samwitteveenai · A year ago
Yes, I saw this the day after I recorded the vid. Even better for people signing up now.
@K.F-R · A year ago
This video did not start with "Okay....". I feel cheated. 😉
@samwitteveenai · A year ago
Lol
@dare2dream148 · A year ago
Haha same here!