Efficient Streaming Language Models with Attention Sinks (Paper Explained)

36,131 views

Yannic Kilcher

1 day ago

Comments: 103
@guangxuanxiao247 • 1 year ago
Hi, Yannic, first of all, thank you so much for the insightful explanation of our work! To address your question at 27:36 - in every experiment we conducted, we did shift the position encodings in the Window Attention baseline. The results you see for window attention have already incorporated your suggestion, but the perplexity did not improve. This implies that the model doesn't solely rely on position encoding to learn attention sinks. We appreciate your input and will clarify this point in our future revision!
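For readers wondering what "shifting the position encodings" looks like in practice: in the streaming setup, positions are assigned by where an entry sits in the rolling cache rather than by its original index in the text. A minimal illustrative sketch (not the authors' code; names and sizes are made up):

```python
def evict(kv_cache, n_sink=4, window=1020):
    """kv_cache: list of (key, value) entries in arrival order.
    Keep the first n_sink entries (the attention sinks) plus the most recent
    `window` entries; everything in between is dropped."""
    if len(kv_cache) <= n_sink + window:
        return kv_cache
    return kv_cache[:n_sink] + kv_cache[-window:]

def cache_positions(cache_len):
    """Positions are handed out by slot in the cache, not by original token index.
    After 10,000 streamed tokens with n_sink=4 and window=1020, the cache holds
    original tokens 0-3 and 8980-9999, but they are encoded at positions 0-1023."""
    return list(range(cache_len))
```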
@corgirun7892 • 1 year ago
Intuitively, adjusting the position encoding seems to only alter the attention scores introduced by the position encoding and should not lead to effects like an attention sink. The attention sink should be the result of the combined action of specific tokens, the position encoding, and the FFN.
@hanyanglee9018 • 1 year ago
Hi, a short question: is it possible to do some diffusion-like trick with a language model? I know performance may be the issue. Thanks.
@hanyanglee9018 • 1 year ago
18:35 I really believe this is because of the softmax. If you want (0.1, 0.1) after softmax, you have to add a 3rd number into the vector, and it's sqrt(1 - 0.01 - 0.01), which is big. Softmax helps protect the output of each sub-layer, but it works similarly to BN: it breaks the length of the output vector. Just like the quaternion used in computer graphics, you can simply add some extra dimensions and let the model dump whatever it wants there. If we get rid of the softmax, the problem goes back to how to protect the output of each sub-layer: sine, or sigmoid (then divided by sqrt(width of the vector))? Maybe a first-order softmax works better? temp = sum(abs(x)) # x is a vector; temp is a scalar. if temp …
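To make the softmax point concrete, here is a tiny NumPy check (purely illustrative, not from the video or the paper): with only two scores the attention weights must sum to 1, so both can only be small if a third entry absorbs the rest of the mass, which is exactly the job the attention-sink token ends up doing.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax([0.0, 0.0]))             # [0.5, 0.5] -- two equal scores can never both be small
print(softmax([0.0, 0.0, np.log(8)]))  # ~[0.1, 0.1, 0.8] -- a third entry soaks up the rest
```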
@Annachrome • 1 year ago
In the paper right after section 4 header “Experiments”: “We evaluate StreamingLLM using four prominent recent model families: Llama-2 (Touvron et al., 2023b), MPT (Team, 2023), PyThia (Biderman et al., 2023), and Falcon (Almazrouei et al., 2023). Notably, Llama-2, Falcon, and Pythia incorporate RoPE (Su et al., 2021), whereas MPT employs ALiBi (Press et al., 2022)”
@huytruonguic • 2 months ago
This makes sense. I think if we make position 0 a new token, the model will try to put a lot of attention on that new token, which might not be desirable when it was not trained to be familiar with that random new token.
@emrecagataykose • 1 year ago
To me, this "attention sinks" idea seems very similar to the "registers" idea from the "Vision Transformers need registers" paper. Perhaps, after all these years, we find that the original attention was just missing a component to offload some computation, leading to unimportant pixels (in the CV field) or first few tokens (in the NLP field) being used as such.
@zandrrlife • 1 year ago
Idk I would have to disagree. Fundamentally different from Registers, which are really just useless tokens to increase computational width, more related to the pause token paper. Both methods basically increase reasoning capacity by having more time to compute. StreamingLLM aims to solve a different problem. All are dope 😂. Actually this and pause tokens should be utilized together, highly synergistic.
@sagumekishin5748 • 1 year ago
I wonder if we could add a "register" token every few steps in the sequence to serve as a "summarization" and offload computation.
@vassa-vn7xz • 1 year ago
Maybe not: registers attend to other tokens, but attention sinks do not.
@brll5733 • 1 year ago
Latent memory (intuitively) seems like a useful component to have and this seems to confirm it
@itayatelis2898 • 1 year ago
@zandrrlife I do agree StreamingLLM attempts to solve a different problem, but fundamentally it's the same core problem: the larger the model gets (size and data), the more it finds some tokens at irrelevant positions (positions of patches that usually serve no purpose, or the beginning of a text that is usually less important -- prologue, etc.).
@roomo7time • 1 year ago
I am really grateful that you upload these paper review videos more often these days. THANK YOU!
@oldmankatan7383 • 1 year ago
Thank you for linking the paper you were reviewing! Convenient, professional, and appreciated 🙂
@johnboren32 • 1 year ago
It works well. I got Mistral to produce 15,000 words on the history of machine learning since the Dartmouth Conference. It never repeated itself.
@tobiaswegener1234 • 1 year ago
That is impressive. I guess you mean Mistral 7b, correct?
@johnboren32 • 1 year ago
@tobiaswegener1234 Correct.
@yeonwoosung491 • 1 year ago
Personally, this "attention sinks" idea reminded me of the "How Does BERT Rerank Passages?" paper. One of the key ideas of that paper was that the first few tokens really matter to the BERT model's reranking performance. It looks like the first token is treated as key for all NLP-related tasks?
@eoghanf • 1 year ago
"I'm good at counting. I can count" LOLOLOL. Another great video.
@AcademyOmen • 1 year ago
Thanks for breaking it down, I can confidently re-read that paper now
@PotatoKaboom • 1 year ago
Best content on youtube! Please never stop making these!
@tweak3871 • 1 year ago
This feels like there is some parallel thread with the context vector in LSTMs in that, there is an information budget that can be plausibly stored in a context vector (or in this case, the attention sink tokens) and that information budget will eventually run out as the model can only fit so much information onto the vector. But love anything that helps speed things up! It's interesting what works well with the constraints of hardware. But from an accuracy perspective, it feels like duct tape in the way that slapping a context vector onto an RNN is. But great work nonetheless, a little less familiar with the other authors but I've read most of Beidi Chen's work and it's dope to see what I consider a theme of her work, that is reengineering neural nets to be more compatible with the hardware for performance gains, continue.
@kevon217 • 1 year ago
Love your walk-throughs. Always a great watch.
@lucidraisin • 1 year ago
Thank you Yannic, great video (and personal analysis) as always
@ThomasTomiczek • 1 year ago
Here is the main problem with this that no one talks about. It keeps the first X tokens around - ok, good. Except that unless X is the length of the complete system prompt, the system prompt - the original instructions "given by a godlike authority" - is going to slowly decay. Also, in any non-trivial setup (i.e. not chat) you will have a lot of context you can throw away - stuff like embeddings, planning stages. These dilute the ongoing context - we end up with a larger and larger needed context window anyway. One item of research would actually be not to keep the first X tokens, but some tokens in the middle - e.g. the last 12 tokens OF THE SYSTEM PROMPT.
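A toy sketch of the variant proposed above: keep the attention-sink tokens, a protected span (e.g. the tail of the system prompt), and the recent window. The function and parameter names are made up for illustration; this is not from the paper.

```python
def keep_indices(total_len, protected_span, n_sink=4, window=1020):
    """Return the token indices whose keys/values stay in the cache:
    the first n_sink tokens, a protected (start, end) span such as the
    system prompt, and the most recent `window` tokens."""
    start, end = protected_span
    kept = set(range(min(n_sink, total_len)))                    # attention sinks
    kept |= set(range(start, min(end, total_len)))               # protected span
    kept |= set(range(max(0, total_len - window), total_len))    # recent window
    return sorted(kept)

# Example: keep_indices(10_000, protected_span=(4, 60)) keeps tokens 0-59 and
# 8980-9999; as with attention sinks, positions would then be reassigned by
# cache order rather than original index.
```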
@Rhannmah • 1 year ago
Well, I could see a system that detects which token groups are highly relevant and keeps those around. The broader ideas of a dissertation, for example, so that the LLM stays on track throughout the whole text output with lesser attention requirements. Speaking of which, I've always wondered if it was possible to tokenize broader stuff like ideas or concepts that are comprised of multiple words. In this sentence, "the white cat jumps over the short fence" for example, tokenizing "the white cat" as a single token would wrap it as an idea, that gets associated with other ideas "jumps over" and "the short fence" which would also be tokenized. But I don't know enough about tokenization to know whether this is feasible or not.
@tweak3871 • 1 year ago
There are special tokens for models like BERT, like the [CLS] and [SEP] tokens. I wonder if denoting the system prompt with special tokens, and always keeping the system prompt intact while pretraining (and maybe even doing some pretraining with sequences that are beyond the context window), would address some of those problems. But you definitely have an information budget, and you can only go so far with it. It would be cool if there were some kind of way to learn a systematic compression of previous tokens and pack that into the first N tokens as a way to help address some token-window limitations. But again, even if you could, it all has limits that I suspect aren't 10x innovations but more like 10-20% gains at best.
@SimonSlangen • 10 months ago
A simpler (untested) hypothesis would be that this is a case of residual values. Say the n-th token gives a key Kn. Any future query that's close to Kn will also pick up values from previous tokens to some degree. To deal with this, Vn should be interpreted as a residual over the values of preceding tokens with similar keys. If you just drop earlier tokens from the context window, you're effectively introducing random offsets to the values of the remaining tokens. EDIT: hard to argue with those experimental results though
@alikhodabakhsh2653 • 1 year ago
Thanks for posting the video. My question is: how can we change the positional encoding for tokens and keep the old cache at the same time?
@chsafouane • 1 year ago
Why do you need to train explicitly for an attention sink token? When a tokenizer adds a start token, couldn't it implicitly play the role of an attention sink token?
@davefar2964 • 10 months ago
What's up with Layer 10 Head 0 with attention sink token in Figure 7? There's only medium attention to it, which doesn't speak for the importance of the attention sink.
@almoni127 • 1 year ago
Why didn't transformer-XL training catch on? Seems to me like the most natural solution to sliding window attention
@stephanelter-b2c • 1 year ago
Hello Yannic, good job. I never understood why the positional encoding starts with the oldest token and not the last. Do you know why? Does anyone have any information about this? I would use the zero position for the last token and then move backwards to the older tokens. I assume that this problem would not occur then (just a first guess).
@maxwellclarke1862 • 1 year ago
You need to use relative positional encoding for that. Because otherwise we would need to train one token at a time rather than "in parallel", since the position information associated with a given token would change depending on the timestep.
@sampruden6684 • 1 year ago
If I'm understanding correctly, this is a hack to be compatible with already trained models, and actually what this is suggesting is a modification to the attention mechanism, right? The lesson is that softmax isn't quite the right operation. If it works when the first token is fixed to a newline, then the contents of that first token is constant. That means its keys and values are constant. That suggests that you could remove that first token and make this neater by learning a constant sink key/value for each layer. Taking that idea slightly further, you could probably just directly predict the output of `dot(query, constant sink key)` and skip the learned constant and the dot product. These are small improvements (we're basically just saving one token) but they feel neater to me. The issue is in the attention mechanism, and therefore that's the "morally" correct place to put the fix. The proposed implementation has two obvious advantages: You can use it with existing models, and you can do it today without waiting for flash attention implementations to add the feature. It's late at night, does that seem right to others?
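If the sink were made an explicit, learned part of the attention layer, as suggested above, it might look roughly like this single-head PyTorch sketch. The class and parameter names are invented for illustration, causal masking is omitted for brevity, and this is not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithLearnedSink(nn.Module):
    """Single-head attention with a learned sink key/value instead of a real first token."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One learned key/value pair every query can dump attention onto;
        # the value starts at zero so the sink initially contributes nothing.
        self.sink_k = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.sink_v = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x):                               # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        b = x.size(0)
        k = torch.cat([self.sink_k.expand(b, -1, -1), k], dim=1)
        v = torch.cat([self.sink_v.expand(b, -1, -1), v], dim=1)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)                # sink slot absorbs "unused" mass
        return attn @ v
```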
@tweak3871 • 1 year ago
There is a really cool unpublished paper called "PairConnect: A Compute-Efficient MLP Alternative to Attention" that I think would interest you in the vein of "you could probably just directly predict the output of `dot(query, constant sink key)` and skip the learned constant and the dot product" But yeah it feels like the first token has some amount of constant information budget that you can pile stuff on top of, and that maybe it makes sense to have a "context" token or a series of them. But yeah, what you say feels correct to me.
@davefar2964 • 10 months ago
About Yannic's question at kzbin.info/www/bejne/amGcpYGilqesmtUsi=OvfvcduXSmwRbnjU&t=1610: maybe having the left-most window element always at position 0 is not enough, because the token value always changes, so the model cannot learn to use it as an attention sink.
@tiagotiagot • 1 year ago
A newline seems like not the best choice, since it can have meaning elsewhere in the context (and in the training data as well). Perhaps it would be better to use something like the null character, or whatever the highest possible value is if it's not assigned anything, or something like that?
@Kram1032 • 1 year ago
I think the point was simply to try that and show that it doesn't affect things. It's not meant to be a "good" choice
@AbstractQbit • 1 year ago
I wonder if the softmax with +1 in the denominator that was discussed on HN and reddit in July would also work for this. If the model learns to use the first token as an "attention sink", why not just allow it to not attend to anything at all instead?
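For reference, the "+1 in the denominator" variant mentioned here (sometimes called softmax-off-by-one) just adds an implicit zero logit, so the weights no longer have to sum to 1. A quick NumPy sketch of the formula from those discussions, not something the paper uses:

```python
import numpy as np

def softmax_plus_one(x):
    """softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)).
    The extra 1 acts like an always-present zero logit, so a head can
    attend to (almost) nothing instead of dumping leftover mass on a sink."""
    m = np.max(x)
    e = np.exp(x - m)                  # max-subtraction for numerical stability
    return e / (np.exp(-m) + e.sum())  # the implicit zero logit becomes exp(-m)

print(softmax_plus_one(np.array([-5.0, -5.0])))  # ~[0.007, 0.007]: attends to almost nothing
print(softmax_plus_one(np.array([5.0, 2.0])))    # ~[0.946, 0.047]: close to ordinary softmax
```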
@kimchi_taco • 1 year ago
I'm not sure what's different from Transformer-XL?
@PaganPegasus • 1 year ago
Okay, but how does Transformer-XL not suffer from this then? It uses a KV cache but no attention sink, yet it works just fine.
@jondo7680 • 1 year ago
Since I'm not an expert, all I see is that training differs from inference time? Doesn't the training process also have a memory limit and a fixed window size that needs to slide? How come the network doesn't learn and adapt to this? Or are these networks never trained on text that's longer than the window? In that case we shouldn't be surprised that the models don't scale. 26:22 What he is asking here is exactly how I had imagined these neural networks to work. Why would you not start the numbering from 0 after cutting out the beginning?
@シマムラアキト • 1 year ago
Could this mean that the attention sink serves as a stable reference point, reflecting a specific feature or state of the model at a given layer, thereby influencing the contextual understanding of the rest of the tokens in the Transformer?
@Jandodev • 1 year ago
Keep the initial prompt engineering and allow it to propagate through the massive context windows! :)
@thomasbrunelouellet4370 • 1 year ago
Too many quality streamers... ahaha, I don't have time to watch everything. Great quality, Yannic, thanks.
@Rizhiy13 • 1 year ago
What about just having one or a few tokens always present even without positional embeddings? Similar to "vision transformers need registers"?
@sophontec2822 • 1 year ago
Great insight!
@JohnSmith-he5xg • 1 year ago
Doesn't it seem undesirable that a substantial portion of the attention is being allocated seemingly without regard to the content, but only to position (specifically the first couple tokens)? This strikes me as a flaw that they observed and then used versus attempting to fix.
@AM-yk5yd • 1 year ago
* It was the year 2023. People realized the BOS (Beginning of Sequence) token is important and shouldn't be discarded. I seethed; I thought people did this already, like Longformer.
* log PPL: excuse me for seething again, but is "log(exp(loss))" really necessary? Why not take the loss directly?
* Interesting that only Llama benefits from >1 sinks. Though the other models are 2k, not 4k. It would be interesting to see btlm-8k.
@OperationDarkside • 1 year ago
If you liken this to human behavior, putting new lines for the first few characters is like a human thinking before speaking. In a more abstract sense, the attention sink tokens are the place where the logic and reasoning happen. So if you keep the memory, containing all the reasoning, around the following effort is put into language generation, rather than reasoning. I see 2 potential problems with this, though. 1. Will the LLM be able to correct itself in the middle of generating the response, if the incorrect part is outside of its sliding window? 2. Wouldn't this paper imply, that most of the computation of current models is wasted, if the logic and the language part is combined into one? What if we separated the logic/reasoning and the language part? I think, GPT models are probably still the best at language generation, but could be made significantly smaller and not in quadratic complexity, if we provided the sliding window the output of a "reasoning" model.
@DeruwynArchmage • 1 year ago
Why can't the attention just be rescaled so that it uses up the full probability space for the heads that matter and just zero out everything else? Like if it's supposed to be 5%, 10%, & 20%, instead of dumping 65% in position 0, why not make them 14.29%, 28.57%, and 57.14%? Wouldn't that have the same effect without the attention sink issue? Or is the attention sink the model's way of saying that there's a 65% chance that everything that came before doesn't really influence the next token through the attention heads?

I'm also wondering, since so much weight is being placed on the first token, isn't that really just saying that we don't need so many free parameters, and that any individual token can't actually usefully attend to every previous token but that there's some limit? Let's say that it's 10 as an example. So, instead of encoding attention as 1 connection for every previous token, we make the reference a tuple of the original value we would have calculated and the position of the token in question. Now we don't need to calculate in a quadratically growing manner, but instead a constant one. Let it pay attention to up to 10 other tokens and zero out the extra ones (or maybe -1) if it only needs 3 of the 10 connections for this particular token. That should mean that our calculation complexity growth is just linear and we can keep all of the old tokens as well. I mean, the current limitations certainly aren't because remembering 8192 or 32,768 tokens is too tough. It's because 8k^2 is too big of a number (64M and 1G respectively). So we should be able to have context lengths of 1/10th of those two numbers for the same cost with a limited attention capacity. But then, if the token it needs to pay attention to is a million tokens ago, it's not that big of a deal; it's still in the context window. Now we don't forget an instruction or detail you mentioned several hours ago.

If you hit your limit, it could start summarizing the context that's going out of scope and put that in a separate section that it uses as an external reference for content that's 6 years old, when it can't pay attention to every word said in that period simultaneously. If that summarization context was always there and used in the original calculations during training, then it could already have references to that built in. When it needs to start adding things to that archive context because the current context is too large, it could do it in chunks so that it only has to recalculate that portion every once in a while - say, every 10k tokens it adds to the summary context and prunes irrelevant stuff from it; that way it always has room for new content in the archive section. Then just recalculate the current context based on the new contents of the archive and move on from there.

At the beginning of a conversation it can load some basic context that should apply to nearly any conversation - stuff like the time, the user's name, information about itself, that kind of thing. You can determine what goes into the archive based on the parts of the conversation being archived that have significant connections in the remaining window, and then some kind of summary of that, which preserves the most salient parts. The archived portion gets moved to slower/more permanent storage in case it ever needs to be referenced in its entirety at some point in the future.

The currently relevant archive portion could also include information that was recently pulled from the internet, to help keep the language model grounded in reality and used as a reference. I think you also want to leave some scratch space where the network can add relevant tokens that it can reference immediately without doing a full recalculation. None of the existing tokens should refer to empty slots in the archive or scratch space, so adding new info should not force a recalculation - only when you need to recompress them. Kind of like garbage collection for memory. If all of that was built in from the very beginning while training was occurring, that should allow the neural network to wire in the relevant connections so that it knows how to properly utilize those data points when generating tokens. You can also do the compression process in a separate thread so that the user never knows it's occurring, since I figure it'll be expensive. Then you just swap the pointers of the current conversation with the recalculated one to make it seamless.
@DeruwynArchmage • 1 year ago
Regardless, it seems like the attention sink is essentially a bug that they’re trying to patch.
@agentds1624 • 1 year ago
But what happens when the window of length W has shifted W+1 times? Then the first token in the window has no more information about the first token of the sequence, right?
@jondo7680 • 1 year ago
It seems like the token isn't important at all. It's probably more about the numbering of the tokens, as he says here at 26:22. I didn't even know that the code just continues with the index numbers instead of starting from 0 once the cut is done. That really sounds like an oversight to me.
@jtjames79 • 1 year ago
At about 10 minutes I was like, "Just put position zero in the cache!" At 20 minutes: "Ha, I told you so!" The important thing was that I made this about me. Also commenting for the engagement; I didn't have anything more to contribute.
@petermcarthur7450 • 1 year ago
🤔 Is the attention on the zero token serving as a form of long-term memory?
@vassa-vn7xz • 1 year ago
No, because first token sees nothing.
@tweak3871 • 1 year ago
As in, is it analogous to the context vector in an LSTM? I had the same thought; it does feel like something akin to a null group in a clustering algorithm or something. Feels like it could be trained to act like a context vector, but I question how far you could stretch it.
@blanamaxima • 1 year ago
I am too old; we used to do math and things in science and not add dummy tokens that we cannot explain :))
@cerealpeer • 1 year ago
What I think... honestly... is that I'm really confused... and I'm struggling to understand how you can just train preprocessing into a learning-model structure where it's inferencing on data it's instructed to ignore... I feel really dumb right now, but I'm gonna try to get it into my head.
@cerealpeer • 1 year ago
I love CMU and MIT! I see FB, and I smell hacks. That's not an accusation - I love hacks, but like with any cheat code I'm down low trying to figure out how it works... I wish I was smart.
@angloland4539 • 1 year ago
@asatorftw • 1 year ago
Seems great, but for simplicity's sake I think I will stick to SkeletonUI🤔
@hikaroto2791 • 1 year ago
You re-uploaded this one? I remember you explaining this one already.
@jcorey333 • 1 year ago
I know he has talked about similar ideas before, but I believe this paper is new.
@Bokbind • 1 year ago
Really? The paper is from the 29th of September. It's barely 2 weeks old.
@jcorey333 • 1 year ago
I believe he mentions his video about "Big Bird" or something, which has a similar concept except bidirectional.
@hanyanglee9018 • 1 year ago
The title of this video should be the dark side of softmax.
@ashsilverwizard3275 • 1 year ago
I will start by saying that I know nothing of the technical aspect of all of this. Here are my assumptions: if I understand correctly, this is part of inference. So the cache stores the information of the current "conversation", i.e. its context. The area under the triangle is all the possible information that can be encoded. The cache is the amount of information needed to represent the area under the triangle after optimization.

If all of my assumptions are correct, then I suggest the following: in a "conversation" that is only as long as the "window", context can in theory be stored anywhere. If you move the "window", throwing away the start as needed without recomputing, then the model forgets part of the context. This leads to the model starting to behave erratically; its attention "wanes" and it makes things up to compensate. Recomputing fixes the problem of forgetting context but comes at a cost. While the most important bits of information are theoretically stored in any of the connections, the first few tokens have the most theoretical long-term storage, so models learn to use that preferentially. The attention sinks anchor that part of the memory.

If I am correct, then I can make 2 predictions. 1. While a 1-token attention sink appears to work just as well as recomputing, it will start to become worse once a "conversation" goes above a certain length. 2. An attention-sink model with 2, 3 or 4 sink tokens will perform closer to the recomputing model in such longer "conversations". My reasoning for this prediction is simple: if the model can dynamically configure its long-term memory allocation size, as I suspect happens with recomputing models, then it will perform best the longer the conversation. Meanwhile the long-term memory allocation of the attention-sink model is much more constrained. I could be wrong about this assumption, but if I am not, then there should be an optimum number of sink tokens depending on the "window" size to approach the performance of the recomputing model without sacrificing computational efficiency.
@unclecode • 1 year ago
What if we modify the softmax to assign the remaining probability space to the recent token(s) rather than the first token? Then perhaps just sliding the window would work without any need to keep the first token.
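One possible reading of this suggestion, as a toy sketch (hypothetical, not from the paper or the video): compute an ordinary softmax with an implicit sink slot, then hand the mass that slot would absorb to the most recent token(s).

```python
import numpy as np

def softmax_recent_sink(scores, k=1):
    """scores: attention logits over the current window, most recent last.
    The leftover probability mass (what a sink token would soak up) is
    redistributed onto the last k tokens instead."""
    m = scores.max()
    e = np.exp(scores - m)
    p = e / (e.sum() + np.exp(-m))   # softmax with an implicit zero-logit sink slot
    leftover = 1.0 - p.sum()         # mass the sink slot would have absorbed
    p[-k:] += leftover / k           # give it to the most recent token(s)
    return p

print(softmax_recent_sink(np.array([0.0, 0.0])))  # [0.333, 0.667]: the recent token gets the slack
```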
@maks029 • 11 months ago
So this means we can just run inference on an unlimited number of tokens? Science was made... took us only 5 years from the T^2 complexity.
@heejuneAhn • 1 year ago
One question! Is the sliding window method new and different from the original Transformer? I think it has only a minor difference from the original one. The original one has an absolute, fixed location for the hidden state, but the sliding window is just a relative one. Can this still be a kind of paper? Haha
@CoughSyrup • 1 year ago
Yannic said "token" so many times, it lost all meaning to me. Jebus, I don't think I've heard the word token so many times in one sitting; token token token. So I searched the transcript, and Yannic says 'token' 155 times. For a 32.5-minute video, that's an average of once every 12.5 seconds.
@fumi_ • 1 year ago
The word “token” was your attention sink
@tweak3871 • 1 year ago
@fumi_ For the record, I laughed and thought your comment was clever.
@Ohmriginal722 • 1 year ago
Can you do a "paper explained" video for the gaussian splatting paper that is taking the NeRF world by storm? I'm working on a project with it for one of my classes and I need it explained.
@yesandnoofwhy987 • 1 year ago
I only have experience in machine language and assembly, but this appears to be the same problem as when CPUs had 16-bit registers. What was the answer then? Move to 32 bit. When it occurred again at 32 bit, we moved to 64 bit. Was it easy? NO! We had to move from octal-based languages to hexadecimal. How? By combining 4 octal registers. Sorry for my ignorance and apparent naivety, but if it would help, I will risk being embarrassed and condemned.
@uberthought • 1 year ago
thx
@heejuneAhn • 1 year ago
I think the Transformer should finally use a hierarchical abstraction method as a human does, not this flat, single-level style of information handling. In fact, all the methods in the paper are exactly what I thought of when I first read the Transformer paper. Am I a genius? But with this approach the AI cannot remember the dialogue in the long context. ChatGPT instead seems to summarize the previous dialogue to reduce the context size, or sometimes takes the conversation in segments and selectively feeds them to the GPT machine.
@yoloswaginator • 1 year ago
No, you're not. All these ideas about how to build agents have been around for decades. They only become testable one puzzle piece at a time.
@CharlesVanNoland • 1 year ago
The real solution is going to be hierarchical. Instead of tokens looking back, they'll look up: groups of tokens represented as a token unto themselves, and groups of groups of tokens, etc. That means virtually infinite complexity - limited only by the complexity of the network itself. It might not be trainable with backprop automatic differentiation, but neither are brains. Predictive hierarchies are the name of the game.
@beecee793 • 1 year ago
damnit I'm not first
@andrewsomerville5772 • 1 year ago
Those sunglasses are irritating. Content is good though.
@oncedidactic • 1 year ago
Yannic was born that way
@orthodoxNPC • 1 year ago
could be medical
@BHBalast • 1 year ago
It's a valid opinion, but on the other hand, I like them. :) They add something to the persona that makes it more down to earth or something? It's hard for me to tell; in my opinion the academic world has a big problem with ego, and down-to-earth educators are always welcome.
@andrewsomerville5772 • 1 year ago
@BHBalast I think the thing that bothers me is that it feels like the opposite of down to earth to me, like someone trying too hard to be cool. I imagine it's a "persona" to catch attention on YouTube. Oh well, not a big deal. Just needed to say it "out loud".
@AvastarBin • 1 year ago
@andrewsomerville5772 Maybe the lights tire his eyes too much, so he puts on those sunglasses? Because trust me, those lights sometimes need to be very strong, especially if you're in front of a green screen.