Extending Llama-3 to 1M+ Tokens - Does it Impact the Performance?

11,527 views

Prompt Engineering

1 day ago

In this video, we look at the 1M+ context version of Llama-3, one of the best open LLMs, built by Gradient AI.
🦾 Discord: / discord
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Patreon: / promptengineering
💼Consulting: calendly.com/engineerprompt/c...
📧 Business Contact: engineerprompt@gmail.com
Become a Member: tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: bit.ly/localGPT (use Code: PromptEngineering for 50% off).
Signup for Advanced RAG:
tally.so/r/3y9bb0
LINKS:
Model: ollama.com/library/llama3-gra...
Ollama tutorial: • Ollama: The Easiest Wa...
TIMESTAMPS:
[00:00] LLAMA-3 1M+
[00:57] Needle in Haystack test
[02:45] How is it trained?
[03:32] Setting Up and Running Llama3 Locally
[05:45] Responsiveness and Censorship
[07:25] Advanced Reasoning and Information Retrieval
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Comments: 43
@engineerprompt • 1 month ago
CORRECTION: There is a mistake in the long-context test in the video [the haystack test towards the end, the Tim Cook and Apple question]. If you set the context length in an Ollama session and then exit it, you have to set the context length again in the new session; parameters set in one session do not persist across sessions. An oversight on my end, and thanks to everyone for pointing it out.
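For anyone reproducing the test: in Ollama's interactive REPL the context window is set per session with /set parameter. A minimal sketch of the flow (the model tag is the one used in the video):

    ollama run llama3-gradient:8b-instruct-q5_K_M
    >>> /set parameter num_ctx 256000
    >>> ... paste the long document and the needle question here ...
    >>> /bye
    # /bye ends the session; the next `ollama run` starts with the default
    # num_ctx, so the parameter has to be set again.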
@antonvinny • 1 month ago
So did it work correctly after setting the context length?
@user-cl7vn1eg3u • 1 month ago
I've been testing it. It has a hallucination issue when large text is put in. However, the writing is good, so even the hallucinations are interesting.
@maxieroo629 • 1 month ago
Have you tried lowering the temperature?
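(For anyone trying this: in the Ollama REPL the temperature is set the same way as the context length; 0.2 below is just an illustrative value, not one taken from the video:

    >>> /set parameter temperature 0.2

A lower temperature makes sampling more deterministic, which usually curbs rambling, though it won't fix genuine retrieval failures.)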
@john_blues • 1 month ago
The AI went from scholarly professor to unintelligible drunk quite quickly.
@abadiev • 1 month ago
Need more information about this model. How does it work with AutoGen or other agent systems?
@sergeaudenaert • 1 month ago
Thank you for the video. When you exited and reran the model, shouldn't you also have reset the context window to 256K?
@engineerprompt • 1 month ago
That's a valid point. I thought (mistakenly) that it persists for the LLM, but it seems you actually have to do it for each session.
@supercurioTube • 1 month ago
Thanks a lot for this showcase! That test you did is fantastic: "A glass door has 'push' on it in mirror writing. Should you push or pull it? Please think out loud step by step." I tried it with several llama3 8b variants, down to the llama3:8b-instruct-q4_1 quantization, which finishes quickly with a spot-on: "So, to answer the question: You should pull the glass door". I'm able to reproduce the infinite output you get with llama3-gradient:8b-instruct-q5_K_M, so something was indeed broken in this fine-tune for larger context. I was hoping to leverage a larger context in an application with llama3, but I guess this won't be the model for that.
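(The comparison above is easy to reproduce, since each quantization is a separate tag on Ollama; a sketch using the tags named in the thread:

    ollama run llama3:8b-instruct-q4_1 "A glass door has 'push' on it in mirror writing. Should you push or pull it? Please think out loud step by step."
    ollama run llama3-gradient:8b-instruct-q5_K_M "A glass door has 'push' on it in mirror writing. Should you push or pull it? Please think out loud step by step."

ollama run also accepts the prompt as a one-shot argument, which makes A/B tests like this scriptable.)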
@supercurioTube • 1 month ago
And I've tried the full fp16 from Ollama too. It does stop consistently, but it answers wrong: "Conclusion: Even though the word is written backwards when looking from within your reflection, I should still try opening the glass doors by pushing them like they would say."
@engineerprompt • 1 month ago
There are 16K and 64K fine-tuned versions as well. It might be interesting to look into those.
@supercurioTube • 1 month ago
@engineerprompt Thanks for the suggestion, I will 😌
@henkhbit5748 • 1 month ago
Thanks for the update. Has anybody tried to do "real" RAG using multiple documents? You can't access it using Groq, can you?
@engineerprompt • 1 month ago
You can look into localGPT :)
@jeffwads • 1 month ago
Yes, without multiple needle runs, the test is pretty weak.
@engineerprompt • 1 month ago
Agree.
@hoblon • 1 month ago
You need to set the context size in each session, not just once. That's why the needle test failed.
@JoeBrigAI • 1 month ago
The setting isn't persistent? Major oversight in the video if this is the case.
@hoblon • 1 month ago
@JoeBrigAI Parameters persist within a session. Once you enter /bye, that's it.
@engineerprompt • 1 month ago
That is true. I thought otherwise. I've added a pinned comment to highlight this.
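(One way to avoid re-setting the parameter every session is to bake it into a custom model with a Modelfile; a sketch, with a hypothetical model name:

    # Modelfile
    FROM llama3-gradient:8b-instruct-q5_K_M
    PARAMETER num_ctx 256000

    # then, in the shell:
    ollama create llama3-gradient-256k -f Modelfile
    ollama run llama3-gradient-256k

A model created this way keeps num_ctx across sessions, because the parameter is part of the model definition rather than the REPL state.)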
@user-gp6ix8iz9r • 1 month ago
Can you do a review of AirLLM? It lets you run a 70B model on 4GB of VRAM.
@engineerprompt • 1 month ago
Haven't seen that before. I will explore what it is.
@HassanAllaham • 1 month ago
Does it let us run a model of that size without a GPU, i.e., on CPU only?
@ikjb8561 • 1 month ago
Due to the autoregressive nature of LLMs, the chance of producing an error compounds exponentially with every passing token. Be careful what you wish for.
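(To put rough numbers on that: if each generated token is wrong independently with probability ε, the chance that an n-token output is error-free is (1 − ε)^n, which decays exponentially. A toy Python calculation with an arbitrarily assumed ε:

    epsilon = 1e-4                      # assumed per-token error probability
    for n in (1_000, 100_000, 1_000_000):
        p_clean = (1 - epsilon) ** n    # probability the whole output is error-free
        print(f"{n:>9} tokens: P(no error) ≈ {p_clean:.3g}")
    # prints ≈ 0.905, ≈ 4.54e-05, ≈ 3.7e-44

The independence assumption is crude, but it illustrates why per-token reliability matters at 1M-token scale.)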
@smartduck904 • 1 month ago
So I guess this will not run on a GTX 1080 Ti?
@Vadinaka • 1 month ago
May I ask which system you are using to run this?
@engineerprompt • 1 month ago
I am using an M2 Max with 96GB to run this.
@GetzAI • 1 month ago
You need to pick up an M4 Mac Studio when it comes out ;)
@engineerprompt • 1 month ago
indeed :D
@unclecode • 1 month ago
Interesting, this one didn't bring a ladder to the party for the joke haha. About the model not stopping: it's probably related to RoPE (Rotary Position Embedding). If someone messed with that, things could go on forever. Anyway, the quantization definitely affects the model's behavior.
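(Background for readers: long-context fine-tunes usually extend RoPE by enlarging its base θ, so the rotation frequencies drop and distant positions stay distinguishable. A minimal sketch of the frequency computation; the enlarged θ here is an assumed illustrative value, not the one Gradient actually used:

    import numpy as np

    def rope_inv_freq(head_dim: int, theta: float) -> np.ndarray:
        # Per-dimension-pair inverse frequencies: 1 / theta^(2i/d).
        return 1.0 / theta ** (np.arange(0, head_dim, 2) / head_dim)

    base = rope_inv_freq(128, 500_000.0)    # Llama-3's published rope_theta
    long = rope_inv_freq(128, 4_000_000.0)  # assumed enlarged theta for long context
    print((long / base).min())              # frequencies shrink by up to ~8x

If the runtime's RoPE settings don't match what the fine-tune expects, attention degrades and the output can run away, which fits the non-stopping behavior seen here.)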
@engineerprompt • 1 month ago
Haha, that's true. Pleasantly surprised by the joke :) That's actually a good point about RoPE.
@vertigoz • 1 month ago
Phi-3 128K got worse compared to the 4K version when I tried to have it analyze a program I gave it.
@R0cky0 • 1 month ago
13:16 It appears the LLM was suffering from schizophrenia at that moment 😅
@8eck • 1 month ago
100+ GB of VRAM for a 4-bit quantized model? 🙄 Are you sure about the quantized one?
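(At 4-bit the 8B model's weights are only about 5GB; what explodes is the KV cache, which grows linearly with context and is typically held in fp16. A back-of-the-envelope estimate from Llama-3-8B's published architecture, assuming fp16 cache entries:

    # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_value
    layers, kv_heads, head_dim = 32, 8, 128  # Llama-3-8B config (GQA)
    bytes_per_value = 2                      # fp16
    for ctx in (8_192, 262_144, 1_048_576):
        gib = 2 * layers * kv_heads * head_dim * ctx * bytes_per_value / 2**30
        print(f"{ctx:>9} tokens -> {gib:6.1f} GiB KV cache")
    # 8192 -> 1.0 GiB, 262144 -> 32.0 GiB, 1048576 -> 128.0 GiB

So a 100+GB figure is plausible for the full 1M-token context even with 4-bit weights.)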
@acekorneya1 • 1 month ago
The issue with all these "benchmarks" is that they are all lies. We need a better, real benchmark for LLMs, because what we get from the people who make these models is all lies. They don't perform well when it comes down to doing real work, or anything in real production. They all suck compared to closed models. It's as if the people who benchmark them show very cherry-picked examples.
@ritpop • 1 month ago
Yes, some models are good for daily use and in some cases better than ChatGPT's GPT-3.5, but I have never used one that comes close to GPT-4. And in some use cases GPT-3.5 is still better than Mistral, in my own experience. So they really should publish the real benchmarks.
@farazfitness • 1 month ago
Lmao, 64GB of VRAM. I'm using an RTX 4070, which only has 8GB of VRAM.
@engineerprompt • 1 month ago
:)
@kecksbelit3300 • 1 month ago
How did you manage to pick up an 8GB 4070? Even the Founders Edition has 12GB.
@farazfitness • 1 month ago
@kecksbelit3300 I'm using an Acer Predator Helios Neo 16 laptop (the laptop 4070 has 8GB).
@jamesvictor2182 • 1 month ago
Why are you using Ollama and not llama.cpp directly?
@engineerprompt • 1 month ago
Just ease of use.
@HappySlapperKid • 1 month ago
64gb vram 😂