Making 1 MILLION Token Context LLaMA 3 (Interview)

23,212 views

Matthew Berman

1 day ago

An interview with Leo Pekelis, Chief Scientist at Gradient.
Be sure to check out Pinecone for all your Vector DB needs: www.pinecone.io/
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewber...
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.ne...
👉🏻 LinkedIn: / forward-future-ai
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V
Links:
/ leonid-pekelis
gradient.ai/

Comments: 125
@paelnever 3 months ago
Honestly, I tested that 1M token context window model and it doesn't perform very well, but I hope that in the future fine-tuning for large context windows works better.
@toadlguy 3 months ago
Are you talking about speed or quality of the response?
@Quitcool 3 months ago
Of course it's gonna be different for each use case.
@ts757arse 3 months ago
My experience was similar. I limited it to a 300,000-token context window and the response quality was pretty poor. BUT I was using a Q8 model, and I expect the issues from reduced precision escalate significantly with context. From my experience, if you could graph out the deterioration in quality, it could even be a logarithmic deterioration.
@tollington9414 3 months ago
Are we talking about 'lost in the middle'?
@daivionosaghae4807 3 months ago
You can't expect a 1M token model to reason over 100 tokens the same way; they're over-corrected for longer windows.
@MagnesRUS 3 months ago
Thanks, but where are the MoA tests? I'm waiting for them!
@matthew.m.stevick 3 months ago
This channel is great. Thanks, Matthew.
@matthew_berman 3 months ago
Wow, thank you so much!!
@guillerf10 3 months ago
@@matthew_berman Who will make the best RAGFlow tutorial?
@CreativeEngineering_ 3 months ago
You can also use a database of embedded chat interactions with a small local model, continuously searching by calculating similarity and distance to find relevant past interactions that could apply to the current one. This lets you use a smaller context window without sacrificing the information available to the AI.
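A minimal sketch of that retrieval idea, assuming interactions have already been embedded into vectors (the stored list and helper names below are hypothetical, not any specific library's API):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_relevant(query_emb, past_interactions, top_k=3):
    """Return the top_k past interactions most similar to the query.

    past_interactions: list of (text, embedding) tuples accumulated as you chat.
    """
    scored = [(cosine_similarity(query_emb, emb), text)
              for text, emb in past_interactions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

The retrieved snippets get prepended to the prompt, so only relevant history occupies the context window.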
@kkollsga 3 months ago
I was hoping they would serve the 1 million token model on Groq. That would have been incredible; even a 100k context would be nice. Groq and Gradient should talk :)
@justtiredthings 3 months ago
Hell, I'd take 32k. 8k is virtually useless for a serious application that requires a lot of instructions.
@arjangfarahzadeh 3 months ago
I would be so happy if they gave us 128k. I could do so much with that.
@LeoPekelisGradient 3 months ago
Thanks & good idea. We're working on it!
@arjangfarahzadeh 3 months ago
@@LeoPekelisGradient Thanks a lot. I'm a very big fan of Groq 😉
@jefframpe5075 3 months ago
Have you seen the AutoGroq project? It automatically generates agents for AutoGen and CrewAI.
@brianlink391 3 months ago
00:02 Gradient unlocked a million token context window for the Llama 3 model
02:11 Importance of the context window in language models
06:28 Enhancing coding capabilities with large language models
08:31 Advantages of the million token context window model
12:39 Extending context length in model training: challenges and process
14:51 Challenges in training million token context models
18:49 Needle-in-a-haystack benchmarks for testing model performance
20:35 Examining the performance of large language models in cross-referencing information
24:22 Exploring new ways to serve long context models efficiently
26:13 Algorithmic extensions for memory compression and selective opening
@JustinsOffGridAdventures 3 months ago
Love these interviews. Thanks, Leo, for taking the time to chat with Matt and giving us some context on what you're working on. Love it. Keep up the great work, guys.
@perschistence2651 3 months ago
In my experience, the needle-in-a-haystack benchmark is unfortunately practically useless... A better approach is to fill the context partly with a codebase and partly with a story the model doesn't know. Then ask it to rewrite a chapter from the perspective of another character, and to write a new function for a certain code class that uses other functions with known outputs. Something like that. Llama 1M can only do about 2k tokens in that case; many models can do about 4k, and GPT-4o 16k.
@tjchatgptgoat 3 months ago
Proud of you, Matthew! Much love from New Orleans, brother ❤💪
@elu1 3 months ago
Matt, you asked good questions and summarized the responses in a concise way! Thank you for the content!
@jackflash6377 3 months ago
Outstanding interview. Very articulate and knowledgeable, this Leo chap. When will there be an open-source 1M context model?
@mickelodiansurname9578 3 months ago
Well, the problem is, as Leo mentioned: say you download the 70B Llama model and you have the hardware to run it. Okay, fine. Now you start using it, and as you use up more and more context, the model's next compute cycle scales with the square of the number of tokens so far. Fine at 8k squared in terms of compute; not so good after 500k squared!
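A quick back-of-the-envelope illustration of that quadratic scaling (relative attention cost only, ignoring constant factors and everything else in the forward pass):

```python
# Self-attention cost grows with the square of the sequence length,
# so compare the relative cost at 500k tokens vs. 8k tokens.
short_ctx = 8_000
long_ctx = 500_000

relative_cost = (long_ctx ** 2) / (short_ctx ** 2)
print(f"~{relative_cost:,.0f}x the attention compute")  # ~3,906x
```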
@jackflash6377 3 months ago
@@mickelodiansurname9578 Good to know. Thank you.
@robxsiq7744 3 months ago
I would be happy with a 30k token window tbh. 64k ftw, but I'll take scraps.
@jimrhea5484 3 months ago
Love how you ask him questions in the same format you would ask an AI. He answers, but, like a human, has to think about it so he can present it back as informatively as possible for a human to understand. If there is one thing I love about AI, it's how it already knows and will concisely detail the topic at hand. If it were me, I'd have an AI right there to help me answer these questions in real time. Then I would have AI do everything for me while I go riding.
@MeinDeutschkurs 3 months ago
VRAM requirements explode as the context window grows. Have you ever tried to fill in 120,000 tokens of a total 1M window, then ask a tiny question about a fragment of those 120,000 tokens?
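As a rough illustration of why VRAM explodes, here is a back-of-the-envelope KV-cache estimate. The figures assumed below (80 layers, 8 KV heads via grouped-query attention, head dim 128, fp16) correspond to the published Llama 3 70B architecture; real deployments vary with quantization and implementation:

```python
def kv_cache_gb(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per=2):
    """Approximate KV-cache size in GB for a given context length.

    The leading 2 accounts for keys plus values; bytes_per=2 assumes fp16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9

print(f"{kv_cache_gb(8_000):.1f} GB at 8k tokens")      # ~2.6 GB
print(f"{kv_cache_gb(120_000):.1f} GB at 120k tokens")  # ~39 GB
print(f"{kv_cache_gb(1_000_000):.0f} GB at 1M tokens")  # ~328 GB
```

And that is on top of the ~140 GB the fp16 weights themselves need.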
@MeinDeutschkurs 3 months ago
@@polarper8165, you don't have to. Everything is OK.
@stanpikaliri1621 3 months ago
You don't need to use VRAM; you can use DDR RAM to load huge AI models. It's slower than VRAM, but it's not too bad.
@nathanbanks2354 3 months ago
Gradient has one of the coolest websites I've seen, full of retro AI art from the '70s.
@toadlguy 3 months ago
What a great interview! And also a great advancement. So many questions.

Leo said it is more like training a model than "pre-training" to create these larger context windows. However, as far as I know, there is no access to the training data from Meta, and the compute necessary would seem prohibitive, so there must be some way they are modifying or manipulating the weights to recognize the "distance encoding". Would be great to get more clarity on that.

Also, I would be very interested to know the trade-offs between using a pre-trained model (like with your company's code base) vs. a cached large context model. Obviously, changing the code base would be easier in the large context model, but what trade-offs would there be? And is there any mechanism for caching multiple parts of the context window?

Finally, although this might be interesting for things like video, the compute necessary would seem somewhat prohibitive unless you can cache the "attention", which would make more sense for text.

Matt, it would be great if you could do a follow-up on this after you have tried it yourself (and even better if you could get Leo back again). Great stuff!!
@jefframpe5075 3 months ago
Glad to see RULER being used. RULER: What's the Real Context Size of Your Long-Context Language Models?
@superfliping 3 months ago
Simplified answer: a large token context window increases practical memory, helping users recall conversations, code, and text, understand them, and prioritize information more efficiently.
@johnbollenbacher6715 1 month ago
Suggested programming test: Write a function in Ada that takes a pair of numbers and an array of such pairs and returns true if there are an odd number of occurrences of the pair in the array.
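For reference, a sketch of that test's expected behavior (the comment asks for Ada; this Python version just pins down the spec):

```python
def odd_occurrences(pair: tuple, pairs: list) -> bool:
    """True if `pair` occurs an odd number of times in `pairs`."""
    return pairs.count(pair) % 2 == 1

# (1, 2) appears 3 times -> odd -> True
assert odd_occurrences((1, 2), [(1, 2), (3, 4), (1, 2), (1, 2)])
# (3, 4) appears 2 times -> even -> False
assert not odd_occurrences((3, 4), [(1, 2), (3, 4), (3, 4)])
```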
@FunwithBlender 3 months ago
It's the desk space to operate in, that's the context window.
@countofst.germain6417 3 months ago
Great interview, you asked a lot of questions I was curious about.
@AssWann 3 months ago
Berman, I am still looking forward to the Matt3 (Matt to the third power) AI conference with you, Matt Wolfe, and MattVidPro.
@KCM25NJL 3 months ago
A combination of layer-wise attention caching and selective attention computation would definitely make large context workflows more efficient. I'd also like to see whether something like in-context token quantization / vector graphing can be achieved without the need to offload to external DBs. I think this would be a worthwhile area of research.
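A toy sketch of the layer-wise caching idea, assuming you control the attention loop yourself (shapes and class names are illustrative, not any real library's API):

```python
import numpy as np

class LayerKVCache:
    """Store keys/values per layer so a reused prefix isn't recomputed."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def append(self, layer: int, k: np.ndarray, v: np.ndarray):
        """Append new token keys/values (shape: [tokens, dim]) for a layer."""
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k, v
        else:
            self.keys[layer] = np.concatenate([self.keys[layer], k], axis=0)
            self.values[layer] = np.concatenate([self.values[layer], v], axis=0)
```

Selective attention computation would then attend over only a chosen subset of those cached entries instead of all of them.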
@ITSupport-q1y 3 months ago
Brilliant, good questions.
@tiagotiagot 3 months ago
Has anyone figured out a way to abstract context into a fractal, non-Euclidean "mind palace" the LLM can traverse to find any past information, always finding more places to fit new info?
@chrism3440 3 months ago
How would you fetch from this fractal structure? How would it organize its categorizations for fetchability?
@tiagotiagot 3 months ago
@@chrism3440 I'm not sure. For a long time I've had the gut instinct that some sort of convolution-style compression might be applicable to concepts instead of image pixels, sorta like how, when we don't remember something but it's on the tip of the tongue, we can remember things similar to it, what is around it, what category it is, etc. For humans, remembering seems a lot like following a scent trail: you catch a whiff and move to where it's stronger. Maybe that could partially be how this could work: each edge of a node in the graph would have a different smell, representing a convolution of everything that can be reached going through it, with things requiring fewer steps being more over-represented, but things many nodes down still having a hint of their smell coming through that door.

And perhaps there could be some sort of video-game-style culling of nodes too many hops away, where nodes at the distant surface of the graph's volume get streamed in and out of the GPU as needed, while the LLM itself is always at a moving center position where up to the Nth-level neighbors are all already/still loaded in VRAM, and by the time it reaches the old surface, new stuff has already had time to load at full "LOD".

Another related intuition, just at the edge of my knowledge: perhaps it could be something like a non-Euclidean NeRF/Gaussian-splatting abstraction, with working mirrors, lenses, wormholes, etc., where instead of visual data the "pixels" from a different perspective compose different concepts. I know there are already some projects like that (for 3D visual data) with more data than fits in VRAM (or than can be rendered fast enough all at once) that stream data from disk as needed. I never looked into the finer details of how vector databases work; maybe they already do something similar, dunno. Could perhaps have some elements of it.
@daniel_tenner 3 months ago
What’s the point of a 1M token context if it forgets 90% of it or fails to follow instructions? Context length should not be quoted without also including those two metrics.
@pensiveintrovert4318 3 months ago
Marketing is the point.
@aaronravak1407 3 months ago
I regularly run GPT-4o up to 300,000 to 500,000 tokens; it's great, it hardly forgets anything. Just gotta become one with the model.
@pensiveintrovert4318 3 months ago
@@aaronravak1407 What's your use case? Code?
@ollimacp 3 months ago
Yes, exactly. The Gradient Llama 3 model is shit compared to the Llama 3 base model. Even when asked something with very little context, which Llama 3 70B Instruct handles with ease, the Gradient 1M token context model doesn't handle any task well. It's exactly the same with the Phi-3 128k model. I want quality, and if they raise the context window, I want to know that the quality stays the same. My prof always said: "If you want the one thing, you have to give up another thing." And as long as there aren't any improvements to the base model, context extensions for the base model will always degrade quality.
@susmitdas 3 months ago
@@aaronravak1407 It has 128k context tho...
@ManjaroBlack 3 months ago
A larger context window is not always best. The larger the context window, the more the quality of the context matters. This means it's even more important to ensure the information available to the model within the context is specific and verbose. The quality of the output that the LLM adds to the context is very important. A larger context window is not good for smaller variants of models. Also, heavy compression/quantization really affects the quality of the output, adding to the issue of context quality.
@dr.mikeybee 3 months ago
Thanks for mentioning Cursor. I'm going to try it.
@arunsammitpandey86 3 months ago
Thanks for this video!
@aa-xn5hc 3 months ago
Good questions, good answers.
@mafaromapiye539 3 months ago
Generative AI models are for wisdom mining; they feel like simple systems of Earth from a draughtsman's perspective.
@hqcart1 3 months ago
Large context only exists in the demos; it has never worked for me, even on the best models.
@FunwithBlender 3 months ago
To reason over an entire codebase, we need to look at tokenization, data prep, and ideally dataflow and explicit graph data... It's a solvable but complex issue.
@toadlguy 3 months ago
Would be interested to know if this has really been successfully done for anything other than test cases.
@Goggleboxing 3 months ago
Thanks for this. What's to stop bad-faith actors from inputting someone else's IP and having the AI reword/recode it to pass off as their own? Whether that's an author's book, a screenplay, a video, a piece of music, scientific research, etc.
@robertheinrich2994 3 months ago
I really want to see that in action. I'm currently playing around with Llama 3 abliterated; that thing is interesting. The only problem I see with 1M tokens: my computer is already quite slow with 8k tokens, and even bumping the setting to 16k would be painfully slow. But I guess there are tasks where it totally makes sense to give the task to the LLM in the evening and have the response ready by the next day.
@alexcoventry7580 3 months ago
Is their million-token context implementation public? Or is it just that the base Llama 3 is open-source?
@tollington9414 3 months ago
The latest Gemini model has a 2M context window.
@AssWann 3 months ago
Now if only they wouldn't censor prompts like pansy snowflakes, them and everyone. I've never tried this 1 million token thing; I'm still looking for AIs that do basic things without being blocked.
@tiagotiagot 3 months ago
What happened to that thing with "attention sinks" allowing infinite context sizes, which could be implemented in the inference app (dunno what they're called, the stuff that runs the LLMs) without even needing to modify the models? (Sorry, I don't remember which channel talked about it; I just remember the term "attention sinks", from some months ago I think.)
@ISK_VAGR 3 months ago
Matt, it is a great video. However, I am skeptical that this is really as fast as the 8,000-token version. It seems more like a physical impossibility than a problem with training. It's the simple fact that it requires more time to find information: if you have more information, you will need more time, unless you increase computational power. Still, it is really remarkable that they can fit a 1 million token context window with relatively high performance. I would love to test it.
@leewilliams5828 3 months ago
The ChatGPT-4o context window is still only a little over 4,000, no?
@arcamari1222 3 months ago
I really like your channel, and I find your videos on artificial intelligence extremely interesting. Thanks to the subtitles, I am able to follow and better understand the topic, which I am very passionate about (I am Italian). However, I have noticed that I often find myself unsubscribed from the channel for no reason. It has happened 4 or 5 times already, and I can't figure out if it's a technical issue or if I'm being removed by the channel owner. Could you help me understand why this happens and how I can solve the problem? Thank you.
@truepilgrimm 3 months ago
Nice.
@TheReferrer72 3 months ago
I think it's best to wait for Meta to increase the context length.
@INTELLIGENCE_Revolution 3 months ago
lol. He’s a research scientist. This is literally his remit 😅
@spectator59 2 months ago
How much VRAM for 1m context, tho?
@pensiveintrovert4318 3 months ago
Have they ACTUALLY used their 1 million window on any use case successfully? Or is it just a claim that it would be helpful? I have yet to hear that Gemini is creating any killer apps.
@4.0.4 3 months ago
The problem is it costs _$7 per prompt._ You can't build a killer app out of something that expensive right now.
@pensiveintrovert4318 3 months ago
@@4.0.4 I started using OpenAI with gpt-pilot and, after $92, turned it off. Now I just use ChatGPT to produce chunks of code and then slap them together by hand. The latter seems to work reasonably well.
@gileneusz 3 months ago
I tried this model, and via Ollama it just produces trash, completely unusable compared to regular Llama 70B. Maybe it's a bug or something.
@zyxwvutsrqponmlkh 3 months ago
The needle needs to not stick out. Have it, for instance, change the name of a character that is only stated once. This should also be done on a text the model was never trained on, so War and Peace should not be used. Then ask: what was the name of the character that did X thing?
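A minimal sketch of building such a test, assuming you have a long text the model was never trained on (the names and probe question below are placeholders):

```python
def make_needle_test(text: str, original_name: str, new_name: str):
    """Swap in a new character name as the 'needle' and build a probe.

    Unlike a magic-number sentence, the change blends into the narrative,
    so the model can't find it just by spotting out-of-place text.
    """
    assert text.count(original_name) == 1, "needle must appear exactly once"
    haystack = text.replace(original_name, new_name)
    question = "What was the name of the character who found the letter?"
    return haystack, question, new_name  # expected answer: new_name
```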
@NotU-eg1jf 3 months ago
Who sees the cigarette?
@大支爺 3 months ago
It's useless for me because they're censored and don't support multiple languages.
@brunodangelo1146 3 months ago
Dude, what happened to your thumbnails?
@BradleyKieser 3 months ago
That was brilliant.
@penshon7775 3 months ago
Omg, this dude has a 1 mil token window too. Talking and talking and talking... saying something that could be said in 2 words.
@clearmind3022 3 months ago
It's kind of ironic. The creation he's making will soon outperform him, making him obsolete. The creation will outgrow the creator; ironic. Hopefully you build a good friendship with it.
@TheRealUsername 3 months ago
Probably a next-gen Transformer model could, but not the current models.
@pensiveintrovert4318 3 months ago
Gemini has had 1 million context for months now. Not a single new killer app created, nor any new Shakespeare plays written.
@MikeWoot65 3 months ago
Wait... the guy with a beard is a He/Him? So glad that he filled that out on LinkedIn. I would have never guessed.
@AI-under-Five 3 months ago
You should consider educating yourself on this matter. What someone appears to be does not determine their gender identity. Additionally, this is how the world is now; by including pronouns, people are helping to normalize this practice and support a more inclusive environment.
@timer4times2 3 months ago
@@AI-under-Five Let's be honest. Nobody cares what your "gender" is, only what your sex is, because that is the only thing that is relevant in most cases.
@cesarsantos854 3 months ago
No, it is just snowflakes virtue signaling their illness.
@MikeWoot65 3 months ago
@@AI-under-Five Oh right. Gender is a social construct, right? Literally saying that men/women are defined by their behaviors within a society. You are literally defining women based on gender norms and gender roles. I thought we fought to get rid of gender roles? Now you're saying those are the EXACT things we should use to define ourselves? Imagine a political ideology forcing you to change your definition of what a woman is, and then saying "this is how the world is now." The hubris is staggering.
@MikeWoot65 3 months ago
@@AI-under-Five How can gender be a social construct, but when choosing your gender, it has nothing to do with gender norms, which are socially constructed? And this is not how it is taught in schools. I'll refer to the definition of gender via the WHO: "Gender refers to the characteristics of women, men, girls and boys that are socially constructed. This includes norms, behaviours and roles."
@lunevka 3 months ago
hmm
@Ms.Robot. 3 months ago
I cannot repost what my AI says (omg) 🔥, but I wish I could 😊. What are tokens? 😁
@tex1297 3 months ago
Any time I see a pro tech company with random tubes and inappropriate stuff in the background, I know they won't show up again 🤣
@mafaromapiye539 3 months ago
Yeah, Gemini 1.5 Flash has 1M context; its web app utilizes VRAM more as you approach high context.
@skeptiklive 3 months ago
The RULER test is such a great insight. This behavior is why I've almost entirely switched to Claude 3 Opus. It performs incredibly well with this!
@executivelifehacks6747 3 months ago
Can you elaborate pls... where is C3O tested on the RULER test?
@skeptiklive 3 months ago
@@executivelifehacks6747 It's not; I was referring to the behavior they were testing for, not benchmark results. I've found through testing that I'm able to upload several documents (usually around 20-150 pages worth) and can continuously ask complex questions about them that require comprehension of data across several locations to generate an effective answer. The best example I can give: I created an AI call-transcript QA agent with C3O where I upload a 5-page grading rubric with a 20-40 page call transcript, and it gives human-level responses to generalized questions in the rubric, tallies the scores correctly, and draws overall conclusions for complex and interrelated questions... and it does it all in a zero-shot response. I tested this same workflow with all the other large models, and they weren't even close to the quality of analysis C3O provides.
@GoysForGiza 3 months ago
You should make two different channels. One with tutorials and the like, then one with shit like this.
@stanpikaliri1621 3 months ago
Nice, can't wait for 1M tokens to improve my AI personality and add some more stuff to its memory; also, I should be able to chat longer without the AI model hallucinating. I just really wish to get an unbiased, uncensored model at some point.
@Xyzcba4 3 months ago
Not convinced that hallucinating LLMs will ever go away.
@bamit1979 3 months ago
Unable to register on Gradient. Wonder if Gmail is accepted as a registration email.
@TobiasWeg 3 months ago
Great interview and very interesting.
@angloland4539 3 months ago