Retentive Network: A Successor to Transformer for Large Language Models (Paper Explained)

101,118 views

Yannic Kilcher

1 day ago

Comments: 248
@YannicKilcher · 1 year ago
OUTLINE:
0:00 - Intro
2:40 - The impossible triangle
6:55 - Parallel vs sequential
15:35 - Retention mechanism
21:00 - Chunkwise and multi-scale retention
24:10 - Comparison to other architectures
26:30 - Experimental evaluation
@cerealpeer · 1 year ago
I asked ChatGPT to design a ChatGPT-like LLM based on RetNet, and this is what it came up with. I hope at least some part of it is useful. #chatgpt #openai

Creating a complete ChatGPT-like model based on the RetNet architecture would require extensive resources and training data, which cannot be provided within this text-based environment. However, I can provide you with a simplified code structure that combines a RetNet-based architecture with a basic chat interaction. Please note that this code serves as a starting point for building such a model and is highly simplified. First, we'll define the RetNet components, and then integrate them into a chat interaction loop:

```python
import torch
import torch.nn as nn

# Define the components of the RetNet-based ChatGPT
# (note: this stand-in uses standard multihead attention, not true retention)
class MultiScaleRetention(nn.Module):
    def __init__(self, input_dim, num_heads):
        super().__init__()
        # batch_first=True so inputs are (batch, seq, dim)
        self.multihead_attention = nn.MultiheadAttention(input_dim, num_heads, batch_first=True)

    def forward(self, x):
        # Apply multihead self-attention
        x, _ = self.multihead_attention(x, x, x)
        return x


class FeedForwardNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim)
        self.gelu = nn.GELU()
        self.linear2 = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        # Apply the feedforward network
        return self.linear2(self.gelu(self.linear1(x)))


class RetNetBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads):
        super().__init__()
        self.multi_scale_retention = MultiScaleRetention(input_dim, num_heads)
        self.feed_forward = FeedForwardNetwork(input_dim, hidden_dim)
        self.layer_norm1 = nn.LayerNorm(input_dim)
        self.layer_norm2 = nn.LayerNorm(input_dim)

    def forward(self, x):
        # Multi-scale retention with a residual connection
        x = self.layer_norm1(x + self.multi_scale_retention(x))
        # Feedforward network with a residual connection
        x = self.layer_norm2(x + self.feed_forward(x))
        return x


class RetNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers):
        super().__init__()
        self.blocks = nn.ModuleList(
            [RetNetBlock(input_dim, hidden_dim, num_heads) for _ in range(num_layers)]
        )

    def forward(self, x):
        # Apply multiple RetNet blocks
        for block in self.blocks:
            x = block(x)
        return x


# Define the ChatGPT architecture
class ChatGPT(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, num_layers, vocab_size):
        super().__init__()
        self.retnet = RetNet(input_dim, hidden_dim, num_heads, num_layers)
        self.embedding = nn.Embedding(vocab_size, input_dim)

    def forward(self, input_seq):
        embedded_input = self.embedding(input_seq)
        return self.retnet(embedded_input)


# Example usage:
input_dim = 512     # Input dimension
hidden_dim = 1024   # Hidden dimension for the feedforward network
num_heads = 8       # Number of attention heads
num_layers = 6      # Number of RetNet blocks
vocab_size = 10000  # Vocabulary size

chat_gpt = ChatGPT(input_dim, hidden_dim, num_heads, num_layers, vocab_size)
input_seq = torch.randint(0, vocab_size, (32, 10))  # Batch of 32, sequence length 10
output = chat_gpt(input_seq)
print(output.shape)  # (32, 10, 512)
```

This code defines a simplified ChatGPT-style architecture based on the RetNet components. In practice, you would need to fine-tune this model on a large dataset of conversational data and integrate a decoding mechanism for generating responses. Additionally, a more extensive vocabulary and tokenization scheme should be used to handle natural language text effectively.
@cerealpeer · 1 year ago
OK, I asked ChatGPT for a ChatGPT and this is... lol... idk, man... uhh... 🤯 I have a headache from caffeine and nicotine; I hope this does a thing... I hate it when I'm stupid. #chatgpt #openai #gtforeally

Incorporating the described system into a process that improves itself over iterations is a complex task. It requires a continuous feedback loop and the ability to modify the system based on user feedback and learning from interactions. Below is a simplified example of how the system can iterate to improve itself:

```python
class ChatSystem:
    def __init__(self):
        self.responses = {}
        self.learning_iterations = 3  # Number of learning iterations

    def user_interaction(self):
        user_input = input("User: ")
        if user_input.lower() == "exit":
            return False  # Exit the interaction
        response = self.generate_response(user_input)
        print("System:", response)
        feedback = input("Was this response helpful? (yes/no): ").lower()
        if feedback == "yes":
            self.store_feedback(user_input, response)
        return True  # Continue interaction

    def generate_response(self, user_input):
        return "Thank you for your input."

    def store_feedback(self, user_input, response):
        if user_input in self.responses:
            self.responses[user_input].append(response)
        else:
            self.responses[user_input] = [response]

    def dynamic_learning(self):
        for _ in range(self.learning_iterations):
            for user_input, responses in self.responses.items():
                if len(responses) > 1:
                    improved_response = " ".join(responses)  # Simplified "improvement"
                    self.responses[user_input] = [improved_response]

    def run(self):
        print("Chat System - Type 'exit' to end the conversation.")
        while self.user_interaction():
            pass
        self.dynamic_learning()
        print("Chat System - Learning and Feedback Collection Completed.")
        # Print peak utility
        print("Peak Utility of the System:")
        for user_input, responses in self.responses.items():
            print("User Input:", user_input)
            print("Improved Response:", responses[0])  # First response assumed most improved


if __name__ == "__main__":
    chat_system = ChatSystem()
    chat_system.run()
```

In this code:
- The system collects feedback and stores it.
- After the conversation, it goes through a specified number of learning iterations to improve responses.
- Finally, it prints the peak utility, assuming that the first improved response in the stored feedback represents the peak utility.

Please note that this is a simplified example for demonstration purposes. In a real-world scenario, dynamic learning and improvement would involve more complex NLP techniques and data handling processes.
@cerealpeer · 1 year ago
I'm trying to get ChatGPT to make a self-developing ChatGPT: #chatgpt

Integrating the simplified Selective Black Box System into a Retentive Large Language Model (LLM) internal environment, while linking it to both internal and external black boxes, along with supervised user interactions for live automatic security and development, is a complex endeavor. Here, I'll provide a conceptual overview of how such an integration might work:

```python
import random


class SelectiveBlackBoxSystem:
    def __init__(self):
        self.code = ""
        self.history = []
        self.security_vulnerabilities = []

    def generate_response(self, user_input):
        response = f"System response to user input: {user_input}"
        # Save user input and response to history
        self.history.append((user_input, response))
        # Analyze user input and detect security vulnerabilities (simplified for demonstration)
        if "sql injection" in user_input.lower():
            self.security_vulnerabilities.append("SQL Injection detected in user input.")
        # Check if it's time to self-develop or wait for user input
        if len(self.history) % 5 == 0:
            self.self_develop()
            response += " System has self-developed based on history."
        else:
            response += " System is waiting for user input or self-development."
        return response

    def self_develop(self):
        # Simplified self-development process (replace with actual logic)
        new_code = random.choice(["print('Hello, World!')", "for i in range(10): print(i)"])
        self.code += new_code + " "

    def report_security_vulnerabilities(self):
        if self.security_vulnerabilities:
            return " ".join(self.security_vulnerabilities)
        return "No security vulnerabilities detected."


class RetentiveLLMEnvironment:
    def __init__(self):
        self.llm_code = ""
        self.selective_black_box = SelectiveBlackBoxSystem()

    def generate_llm_response(self, user_input):
        # Process user input with the LLM and get a response
        llm_response = f"LLM response to user input: {user_input} "
        self.llm_code += llm_response  # Store LLM-generated text in the environment
        # Pass user input to the Selective Black Box System for further processing
        black_box_response = self.selective_black_box.generate_response(user_input)
        # Return both LLM and Black Box responses
        return llm_response + black_box_response

    def report_security_vulnerabilities(self):
        # Report security vulnerabilities detected by the Selective Black Box System
        return self.selective_black_box.report_security_vulnerabilities()


if __name__ == "__main__":
    environment = RetentiveLLMEnvironment()
    while True:
        user_input = input("User: ").strip()
        if user_input.lower() == "exit":
            print("Goodbye!")
            break
        system_response = environment.generate_llm_response(user_input)
        print("System:", system_response)
        # Check for security vulnerabilities and report if any
        security_report = environment.report_security_vulnerabilities()
        if security_report:
            print("Security Report:")
            print(security_report)
```

In this conceptual integration:
- The `SelectiveBlackBoxSystem` is encapsulated within the `RetentiveLLMEnvironment`.
- User input is first processed by the LLM to generate a response, and the LLM-generated text is stored in the environment.
- Then, the same user input is passed to the `SelectiveBlackBoxSystem` for further processing.
- Both the LLM response and the Black Box System response are provided to the user.
- Security vulnerabilities detected by the Black Box System are reported.

This integration combines the capabilities of the LLM, the Selective Black Box System, and user interactions for security monitoring and code development within the internal environment. It serves as a simplified demonstration; in practice, more sophisticated logic, security measures, and integration points would be necessary.
@kristoferkrus · 1 year ago
First they published the ResNet, now they publish the RetNet... There is only one conclusion to be drawn from this: in 2031, Microsoft Research will publish The ReuNet.
@NoNameAtAll2 · 1 year ago
I'll implement only RewNet. RevNet was too weak.
@justfoundit · 1 year ago
Extrapolation beyond model size
@alles_moegliche73 · 1 year ago
The next one is RestNet, trust me.
@jantuitman · 1 year ago
I think they're trying to slowly work up to RedmondNet.
@anmolt3840051 · 1 year ago
I'm waiting for ReyNet.
@jjkjkjka · 1 year ago
Training parallelism refers to parallel computation along the time axis, not to the ability to have a loss on every token (both transformers and RNNs can do this). Basically, in transformers you can use parallel_for over (batch and time), but in RNNs you can't parallel_for along time, because along the time dimension the computation is causally dependent on the output of the previous token, not on the input (as in transformers).
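A minimal sketch of that distinction in PyTorch (toy sizes, arbitrary weights; not code from the video):

```python
import torch

B, T, D = 4, 16, 32                  # batch, time, hidden (toy sizes)
x = torch.randn(B, T, D)
W = torch.randn(D, D)

# Transformer-style: every position is computed from the inputs, so the
# time axis is just another parallel dimension in one batched matmul.
h_parallel = x @ W                   # (B, T, D) — all T positions at once

# RNN-style: step t depends on the OUTPUT at step t-1, so the time axis
# must be walked sequentially; there is no parallel_for over T.
h = torch.zeros(B, D)
outputs = []
for t in range(T):
    h = torch.tanh(x[:, t] @ W + h)  # causal dependence on the previous output
    outputs.append(h)
h_sequential = torch.stack(outputs, dim=1)  # (B, T, D)
```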
@lexer_ · 1 year ago
I am always really curious with these paper explanations which parts of these concepts are actually novel and which are novel combinations of known principles and concepts. Maybe you could consider just very briefly mentioning this aspect while going through the explanation? Sometimes you implicitly touch on this aspect, but I would love a more explicit pointing out of which of this you think is novel. I don't ask for some rigorous lookup through all of ML and math publications of course! Just a gut-based best guess is fine.
@kristoferkrus · 1 year ago
This feels familiar. Isn't this almost precisely what they did in the paper "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" back in 2020, except for the discount factor that allows the retention to decay exponentially?
@kimchi_taco · 1 year ago
I feel the same, and I think the discount factor is not a feature but tech debt.
@lexer_ · 1 year ago
Throughout the entire explanation I had the impression that the main contribution of this paper was to accumulate and actually try out existing ideas from other papers. This is of course also a very valuable contribution, but I dislike the way this is publicized as a revolutionary invention from Microsoft Research.
@Supreme_Lobster · 1 year ago
@kimchi_taco In the paper they mention that linear attention (which basically removes the softmax as well) has problems with modelling positional encoding, so this might be the intuition they had for how to solve it.
@hanyanglee9018 · 1 year ago
It's about protecting the output of each layer, since it doesn't have the softmax.
@donglixp · 1 year ago
Any question is welcome! The modelling is different between them. For example, linear attention still tries to approximate softmax, while retention is motivated from a different angle. The comparisons between them are described in the paper. In terms of running the architecture as an RNN, we were indeed inspired a lot by previous work, which was also clearly presented in the paper.
@IcyyDicy · 1 year ago
I'm so glad I took a linear algebra course at uni, I can actually kinda understand what's going on! Thanks for going into the math of this.
@petevenuti7355 · 1 year ago
I wish I did. Heck, I wish I'd had that and some calc in high school, if not middle school.
@sunnohh · 1 year ago
It's amazing the garbage people do with linear algebra.
@KEKW-lc4xi · 1 year ago
You must have had an excellent math teacher. I struggled with math in college, even up to calculus 2. I often need to search for the meanings of math symbols and how they're used in different situations. In my opinion, research papers often fail to explain their work clearly, maybe because they're worried about job security or something? Whenever I read a research paper and find it needlessly complex, I'm reminded of the quote: "An idiot admires complexity, where a genius admires simplicity" - Terry Davis
@IcyyDicy · 1 year ago
@KEKW-lc4xi I fully agree with you. So many good details are hidden within formulas and notation that it's very hard to understand if you don't know what to look for. I personally find code way more readable than formulas, and I wish that papers would write out processes as charts or code to be clearer. But since that's what we've been doing since the beginning, it's expected, so that unfortunately isn't likely to change anytime soon.
@alanjlockett · 1 year ago
Firstly, I appreciate these videos, which help me stay up to date with the state of research. Second, two quibbles. One: you cannot ignore the \Thetas, which do not cancel but instead implement rotary position embeddings. Two: although RetNet certainly looks like a linear transformer, I think the per-head GroupNorm operation is actually implementing a kind of nonlinearity (scaling by the variance). It doesn't matter that it's applied after v is multiplied in; if you reimagine softmax as a Boltzmann distribution and think of extracting the division by the partition function outside the expectation, I think you'll see what I mean - though RetNet is certainly not taking a weighted average of the vs like transformer attention does.
@ai.srirambandi · 1 year ago
Next do Bayesian Flow Networks by Alex Graves et al.
@clray123 · 1 year ago
17:48 No, this decay is not really positional encoding. I suspect it's for numerical stability. For positional encoding they use xPos, which is a kind of RoPE (and which also has built-in decay).
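For readers who want to see what that decay looks like: in the paper, the retention scores QKᵀ are masked elementwise by a causal decay matrix D. A minimal sketch (toy length, γ chosen arbitrarily):

```python
import torch

T, gamma = 6, 0.9
n = torch.arange(T).view(-1, 1)   # query position
m = torch.arange(T).view(1, -1)   # key position

# D[n, m] = gamma^(n - m) for n >= m, and 0 otherwise:
# causal masking and exponential decay in a single matrix.
D = torch.where(n >= m, gamma ** (n - m).float(), torch.zeros(T, T))
print(D)
```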
@DamianReloaded · 1 year ago
A model that can run decently on disconnected end-user devices to do language processing, maybe even speech recognition, would be awesome.
@unkarsthug4429 · 1 year ago
I mean, Pixel phones have already done this since the integration of the Tensor chip.
@unkarsthug4429 · 1 year ago
And by that, I mean on-board translation and speech processing.
@mathematicalninja2756 · 1 year ago
Softmax is used to model multinomial probability distributions; they have modelled it like a Poisson process.
@araldjean-charles3924 · 1 year ago
Hi Yannic Kilcher. Let me start by saying thank you for what you do! You are a great teacher. You are doing an excellent service to humanity. Nevertheless, maestro, around time frame 11:26 you make a statement about non-linearity (about the activation function). If you use ReLU, you are dealing with a piecewise linear transformation; one piece maps everything negative to zero (a null-space mapping, sort of), and the other is an identity mapping. Putting softmax aside, you have only linear maps and ReLU activations; the whole neural network must have a more straightforward representation as a piecewise linear map. My intuitive argument is as follows. Without non-linear activation, the entire network would collapse into one matrix operator by composition. Suppose we perturb the above so that all negative values of a dimension get mapped to zero (restricted to negative input values, this is linear) while all positive values of an input dimension are identically mapped; identity mapping is also linear. Why can't we have a piecewise linear decomposition of such operators and call it a day? Doesn't this hole scream at you? (I am talking just about neural networks with ReLU, leaving out the softmax at the output layer.) Thank you for the voodoo that you do so well! (I am from Haiti. Voodoo has a totally different meaning for me.)
@drdca8263 · 1 year ago
I believe some work has been done on that. One thing is: the number of components in the piecewise-linear decomposition grows, like, exponentially in the number of layers, I think? In any case, there end up being *a lot* of such components. But people have done stuff like studying how the size of the components varies throughout the input space.
@araldjean-charles3924 · 1 year ago
@drdca8263 Thank you so much for sharing. The "winning the lottery" concept, pruning, and low-rank approximation all scream that most of the space does not matter. The solution is concentrated in a small volume of a large space. We don't know how to approach it directly, so we presently use a shotgun approach (so-called winning the lotto).
@edh615 · 1 year ago
If negative values get mapped to 0, that is not reversible; the ability to exclude part of the input going forward is what gives the network its expressivity.
@drdca8263 · 1 year ago
@araldjean-charles3924 I found the paper I was thinking of. It is called "Interpreting Neural Networks through the Polytope Lens".
@clray123 · 1 year ago
It is not about removing non-linearity from the neural network (the FFN layers in between retention layers still have it). It's about removing non-linearity from calculating attention scores.
@meisherenow · 1 year ago
Seems like you could get at least some of the computational advantage of ditching the softmax but still have a non-linearity, as long as it's a pointwise non-linearity: ReLU, square, etc.
@ItIsKnown42 · 1 year ago
Sounds huge! Please keep posting.
@Schaelpy · 1 year ago
I do not trust a computer science paper with an unrelated quote and an optical illusion in it.
@joshuascholar3220 · 1 year ago
I love your channel. Please keep going!
@dr.mikeybee · 10 months ago
The softmax is a filter on the signal, so we are getting an unfiltered signal in retention networks. That means the part of the signal we need is in there.
@RuslanLagashkin · 1 year ago
It is not "greening out", it is you breaking the Matrix.
@mgostIH · 1 year ago
14:00 isn't necessarily true in works like "Self-attention Does Not Need O(n^2) Memory" and implementations like FlashAttention. Better implementations of the standard design are why it's very hard for new works to displace the original.
@donglixp · 1 year ago
Thanks for raising the question. FlashAttention is still O(N log(N)) or O(N sqrt(N)) in memory, and O(N^2) in computation.
@charstringetje · 1 year ago
With associativity you can change the order of operations (move the parentheses around): (a+b)+c ≡ a+(b+c). For a commutative function, you can change the order of the parameters: a+b ≡ b+a.
@AKumar-co7oe · 1 year ago
So it's both in this case.
@AKumar-co7oe · 1 year ago
Man, kind of ridiculous that I have to remember this stuff when I've had multiple interns from MIT.
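It is associativity that does the heavy lifting in RetNet's dual forms: without a softmax between QKᵀ and V, the products can be regrouped. A minimal sketch (toy sizes, ignoring the causal decay mask, which the recurrent form handles separately):

```python
import torch

T, d = 1024, 64
Q = torch.randn(T, d, dtype=torch.float64)
K = torch.randn(T, d, dtype=torch.float64)
V = torch.randn(T, d, dtype=torch.float64)

# Associativity: (Q K^T) V == Q (K^T V), but the cost profiles differ.
out_quadratic = (Q @ K.T) @ V   # materializes a T x T matrix: O(T^2 d)
out_linear    = Q @ (K.T @ V)   # materializes a d x d state:  O(T d^2)

print(torch.allclose(out_quadratic, out_linear))  # True
```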
@NeoShameMan · 1 year ago
You can basically boil down neural networks at large into two main functions: sorting input data together, then tagging it with classes. LLMs go one step beyond and dissociate the learning of the input (embedding training) from the output (token generation), i.e. they are virtually deeper than the "physical" model, because really it's two models chained together. LLMs are "special" in machine learning because the input distribution and the output distribution are the same.

Embeddings prevent the network from learning an arbitrarily sparse input, i.e. the data is self-sorted; the self-similarity encodes the distribution bias that defines the inherent semantics of the input. Most big LLMs are very sparse, with neurons only reacting to specific n-grams (aka bags of words). I encourage people to look at how chatbots using ChatScript were programmed too (Suzette and Rosette). The reason is that we tokenized the input, allowing better composition (i.e. higher-level tokens), reproducing an ad hoc parsing tree. This poses the question: what limits does a DAG (the typical network architecture) impose on that structure? The transformer is merely a clever acceleration structure, but if you look at the evolution of the power of neural networks, limiting the input field (the attention mechanism) has been a way to accelerate learning and increase power.

But I would go much deeper. I would ask you: what function are the neurons ACTUALLY computing? I don't mean logistic regression and the like, I mean *concretely*: try to separate each mechanism of the neuron (beyond the BS brain analogy or obscure math) and ask yourself what each does (the mul, the add, the bias, the activation). After all, the math is merely the same as the grocery store ticket you look at to see if you went over budget or not... IMHO LLMs will go the way of NeRF vs Gaussian splatting... YAGNNNI! (You Aren't Going to Need Neural Networks for Intelligence)
@millionare5446 · 1 year ago
Dunning-Kruger.
@clray123 · 1 year ago
@millionare5446 Given that small (L)LMs are increasingly demonstrated to be able to reproduce the behavior/benchmark scores of huge LLMs, I would not be so fast to judge.
@NeoShameMan · 1 year ago
@millionare5446 Proof? You can't say something and get off scot-free 🤣 I can literally show the assertion with code, I can demonstrate and cite papers. I can point to historical models. I recommend the paper "Neurons in Large Language Models: Dead, N-gram, Positional". I implemented neural circuits from scratch to test hypotheses. I'm pretty confident in my claim that we will crack neurons. Here is an exercise for you, to see if you aren't the one with Dunning-Kruger: implement a small neural circuit that encodes ordinality, such that a token only detects and discriminates previous and following tokens, e.g. abc but not bdf. Try to figure out what impact it has on the graph structure of the circuit, and conclude how the typical DAG structure of neural networks affects the emergence of such circuitry.
@NeoShameMan · 1 year ago
@millionare5446 Here is another claim you can test: any neural net trained for embedding will converge to a similar distribution given similar dimensionality, such that we can align both distributions by taking the relation of a group of cluster centroids. I'm well aware that the audience of this channel is sophisticated; when I say go back to what a neuron does, I'm exhorting you to look at things more concretely in terms of function. I was kind of amused by someone studying a toy Othello model being amazed that the model learns which color it is playing with from the state of a single neuron; when you are accustomed to creating neural circuits by hand, it's not surprising, because XORing is how you implement a switch in representation.

But more importantly, I'm implicitly trying to handwave in a particular direction. I made a prediction pretty early that we could train a language model close to the entropy limit; then the "gzip is all you need" paper happened, surprising a lot of people but not me. Compressors are a type of predictor: feed them data and they will produce a distribution similar to what they "learned". But compressors are overfitters. That's a different way to look at the problem: overfit, then relax to generalize. What "gzip is all you need" did was show that the joint probability of the data is enough to create some form of discrimination; i.e. when I talk about data self-similarity, I'm talking about compression by another name.

And the reason I bring up ChatScript specifically, and not chatbots made with AIML, is to link back to the notion of embedding, notably explicit embedding, unlike the current versions. Implicit embedding is a black box that surely captures more semantics than explicit embedding; we have a history of failing to use explicit embedding to make any notable breakthrough, IMHO, such as ConceptNet. However, these were not useless; it was simply what I call the dictionary problem: it costs a lot to exhaustively list all the minutiae of the world in a model. To bring it back to ChatScript, they leveraged WordNet as an ontology to help generalize, plus a lot of handmade rules to operate, i.e. the dictionary problem, yet it worked to a degree within the limitations they were operating under. Initially I wanted to mention the Stanford parser, but I chose ChatScript on purpose. The big breakthrough LLMs really add is shattering the dictionary problem, at the cost of black-boxing. Pointing back to these historical models, I'm positing there is a way to bridge the gap.

Imagine you have an explicit embedding and an implicit embedding; you can probably learn a joint probability that will translate hidden classes into known classes, and measure the difference to uncover novel classes, thus better understanding. One thing neural networks, and language models in general, do is confuse the nature of classes as humans use them. Consider the phrase "the capital of France is ___". What type is the next word? At the syntactic level we are waiting for a noun or a noun group; at the ontological level, we wait for a city name. Models don't explicitly differentiate between those two types; those are just clusters of words, or bags of words from another perspective.

We know models store semantics in a structured latent space, such that words are points in that space. To infer Paris, it would take the vectors for capital and France, then use a learned vector to point to another space where the cluster of words encoding the ontology "city" lies, where the differential capital-country has a joint distribution with the differential country-capital. That also explains why LLMs or NNs can be quantized, except for some vectors (it depends on the equivalence of words in a cluster), and why some neurons have abnormally large activations: they are cross-domain links of the internal graph model. We can safely conclude that models learn more complex and abstract structures the deeper they are, but also that the DAG shape means there is redundancy at each layer or pass-through neuron. Finding a way to express compressors as models, extending them to the parser domain, and using explicit embeddings looks like a desirable endgame that removes the black-boxing of neural nets. Thus me making bold claims.
@millionare5446 · 1 year ago
@NeoShameMan Which laboratory is letting you create neural circuits with biological neurons?
@samanthaqiu3416 · 1 year ago
8:45 But I don't get how the causal mask by itself helps make transformers have token training parallelism; in fact you could say the same about recurrent networks, since they have the causal mask built into their latent propagation architecture.
@sluicedeuce4367 · 1 year ago
A causal mask allows a single sequence to be deconstructed into many training examples within one forward pass. For a given input 1234 and a given target output 2345, there are 4 training examples in the one sequence:
1 -> 2
12 -> 3
123 -> 4
1234 -> 5
The causal mask enforces this structure in the outputs by masking the attention matrices' upper triangular. A visualization can help cement how the math works, but this is the fundamental idea behind autoregressive training.
@-mwolf · 1 year ago
@sluicedeuce4367 But wouldn't just using multiple batches have the same effect, regarding parallelism? (Except of course not utilizing the data to its full extent.)
@anmolt3840051 · 1 year ago
@-mwolf AFAIK training RNNs with batched data requires all the samples (within the batch) to have the same length, so with an RNN you'd have to run "123 -> 4" and "1234 -> 5" in different batches and recompute the common states (corresponding to inputs 1, 2 and 3), whereas a transformer can optimize both the "123 -> 4" and "1234 -> 5" objectives within the same batch execution without having to recompute anything. At least, that is my understanding.
@sluicedeuce4367 · 1 year ago
@-mwolf Yes, exactly right. Autoregressive prediction is all about predicting the next token, but using those sub-sequence examples is what makes training so powerful!
@darshank8748 · 1 year ago
What about a presentation of Hyena? Really cool too.
@robinwang6399 · 7 months ago
Remembering stuff from the Performer paper, it seems that if you pick Q and K carefully, retention reduces to approximately attention with extra steps. So I think it is a "good approximation plus corrections" situation, at least in the chunkwise case.
@Akshaylive · 1 year ago
I haven't read the paper yet, but I'm not sure non-linearity is completely taken away. The complex conjugation is a non-linear operation, which is still very much in use here.
@Akshaylive · 1 year ago
Additionally, since the positional embedding is kinda baked in, I'm not sure how this will perform on non-textual tasks like working on images, where positional embeddings are quite crucial and non-trivial.
@clray123 · 1 year ago
@Akshaylive It's not baked in; if you look at the implementation, you see they use xPos for the positional embedding, which you could replace with something else like traditional RoPE (in fact xPos is supposed to be an improvement over RoPE which attenuates cyclical swings in attention scores over sequence length).
@clray123 · 1 year ago
@w花b The idea is that these specialized faculties/subregions do not necessarily need to be designed as such, but can spontaneously arise within that single unified architecture. When you look at a living organism, you see many specialized organs, but all of them are somehow encoded in the same DNA, using just 4 base pairs, and they evolved over millions of years by executing the same sort of "code" over and over again.
@clray123 · 1 year ago
@w花b Yes, although multimodal models such as ChatGPT currently do in fact include some designed subcomponents ("mixture of experts" architecture), so your thinking that a single architecture to rule them all may not be enough is not wrong either.
@abderrahimbenmoussa4359 · 1 year ago
I am not very good at maths and this was very clear. Thanks.
@siminfan9300 · 1 year ago
What's the group link to the Saturday paper discussion? 👀
@edhofiko3168 · 1 year ago
If you somehow think this is familiar because you have read a lot about linear transformers, then this is indeed familiar. Linear Transformer introduced the idea of using a kernel to approximate softmax so we can multiply K and V first, then Q, resulting in a linear space complexity requirement. Then Performer came along and provided more theoretical background on which kernel to use. But then researchers realized the whole thing resembles the fast weight programmer architecture from way back in the 90s (this is actually the recurrent form of RetNet, with maybe slight differences), which was actually better than RNNs but kind of slow (I guess?). Now they have decided that training by multiplying Q and K first (for parallel training) but inferencing by multiplying K and V first (for optimizing space and computation) might be the best of both worlds. But why does linear attention, in the sense of dropping softmax altogether, work? Well, there are a few papers, like the "Intention" mechanism, which model transformer attention as linear regression with a softmax kernel. So they decided to ditch the softmax kernel altogether, and it actually works better in some cases.
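For concreteness, here is a minimal sketch of the two equivalent forms being described: the parallel (training) form with a decayed causal mask and no softmax, and the recurrent (inference) form that carries a fixed-size d×d state. Toy sizes; γ plays the role of the paper's decay:

```python
import torch

T, d, gamma = 8, 4, 0.9
Q = torch.randn(T, d, dtype=torch.float64)
K = torch.randn(T, d, dtype=torch.float64)
V = torch.randn(T, d, dtype=torch.float64)

# Parallel (training) form: decayed causal "attention" with no softmax.
n = torch.arange(T).view(-1, 1)
m = torch.arange(T).view(1, -1)
D = torch.where(n >= m, gamma ** (n - m).double(),
                torch.zeros(T, T, dtype=torch.float64))
out_parallel = ((Q @ K.T) * D) @ V

# Recurrent (inference) form: a fixed d x d state instead of a growing cache.
S = torch.zeros(d, d, dtype=torch.float64)
outs = []
for t in range(T):
    S = gamma * S + torch.outer(K[t], V[t])  # accumulate decayed k_t^T v_t
    outs.append(Q[t] @ S)
out_recurrent = torch.stack(outs)

print(torch.allclose(out_parallel, out_recurrent))  # True
```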
@yorth8154 · 1 year ago
The fact that the jury is still out on this paper is already the best possible outcome/praise. I came into this video thinking that by halfway through, the bullshit would already be too smelly to bear.
@100c0c · 1 year ago
What's the significance of this? Is it just more efficient training/inference, or does it add new capabilities to language models?
@yorth8154 · 1 year ago
@100c0c Faster inference time and a longer context window. There is also the fact that it apparently just scales better with parameter count, which alone could be reason enough to switch. However, there is no peer review yet and no serious foundation model has been released, so it's still in the "good potential" category.
@etopowertwon · 1 year ago
@100c0c I suspect little to none: it's not the first time RNNs have been introduced to transformers. See Transformer-XL. Better yet, see the paper on YaRN. Same quadratic transformers. GovReport: ~3.5-4.5 perplexity, against 16 for the "successor". YaRN gives perplexities for 4 methods; add them all together, to drive the point home, and the PPL would be 15.5. That's still better than RetNet. It's almost as if you can't put super long context into a fixed-size memory that is being decayed and erased with each token and get a good result. Llama 2 uses "vanilla" transformers. No XL version. No other RNN-based approach.
@clray123 · 1 year ago
The paper's authors are a little overenthusiastic and cringeworthy in their selling of the invention. However, that is understandable if what they claim turns out to be true. Anyhow, the maths is sound, and nothing* keeps you from comparing RetNet performance with a transformer architecture yourself (in fact there are already two good implementations out there you could use straight away). If it were "all bullshit", you would have heard about the impossibility of reproducing the results already. *Nothing besides the money required to perform a large-scale LLM training, of course; which was also the paper authors' problem.
@100c0c · 1 year ago
@clray123 So it just makes things cheaper and enhances current LLM abilities, but doesn't add any new functionality?
@elanm4311 · 1 year ago
Transformers with a KV cache (all modern transformers) generate each token in linear complexity by saving the previous key-value pairs. Once people figure out how to keep/compress the KV cache to a fixed size, we will have all the benefits of retentive networks for regular transformers.
@AM-yk5yd · 1 year ago
Sounds like Memory Transformer.
@daniellawson9894 · 1 year ago
Wouldn't this be O(n) per generated token, rather than constant? You don't have to redo attention for previous tokens, but for generating the current token you still need to attend to the previous O(n) tokens.
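A minimal sketch of why the per-step cost of a KV cache still grows with sequence length (toy single-head decode step; names and shapes are illustrative):

```python
import torch

d = 16
k_cache, v_cache = [], []          # grows by one entry per decoded token

def decode_step(q_t, k_t, v_t):
    k_cache.append(k_t)
    v_cache.append(v_t)
    K = torch.stack(k_cache)       # (t, d) — the whole history so far
    V = torch.stack(v_cache)       # (t, d)
    attn = torch.softmax(q_t @ K.T / d ** 0.5, dim=-1)  # O(t) work at step t
    return attn @ V                # so N steps cost O(N^2) in total

out = decode_step(torch.randn(d), torch.randn(d), torch.randn(d))
```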
@kevinamiri909 · 1 year ago
I suppose the next one will be FullNet, purely written with the .NET framework: high compatibility, but high performance only on Windows.
@clray123 · 1 year ago
I'd say, if the RetNet "we don't need no softmax" trick turns out to be justified, it would be a huge embarrassment for the top LLM researchers out there who have been overlooking this for multiple years. But then, every great discovery appears trivial in hindsight. The worry is that this potential for embarrassment (and also the considerable sums of money already sunk because of the mistake) will actually stifle the adoption and testing of this new architecture.
@clray123 · 1 year ago
Strangely enough, I don't see top researchers jumping to protect their reputation and assert "we have tried it without softmax before and it did not work for ... reason"...
@anonymousanon4822 · 1 year ago
Softmax will not turn out to be a huge embarrassment. It is incredibly cheap computationally and steers models to improve performance. It has driven the AI boom of the last 10 years. I'm not saying that it's impossible for it to be replaced or become redundant, but it is certainly not a "huge embarrassment".
@clray123 · 1 year ago
@anonymousanon4822 It's not about softmax in general, it's about softmax in attention scores. Designing an algorithm to be O(n^2) where O(n) suffices, or O(n) where O(1) suffices, SHOULD be a huge embarrassment for any CS expert. And because the models are monolithic beasts and you would have to retrain them to fix the mistake, such a design error incurs very considerable costs on model training even if it pertains to the inference part of the algorithm (where it still incurs considerable costs in shared hosting and implementation complexity). Having such a mistake go unnoticed for 8 years while the faulty design has been pushed as brilliant would further add to the embarrassment. Everything hinges on whether the softmax in that location is really necessary or not for the (larger) models' performance. As Yannic mentioned, the jury is still out on that, but the competing researchers should all be scrambling to find out and to prove the RetNet authors wrong if they want to uphold their own reputation. Rather than ignore this as "just another paper" or, worse, insinuate that the authors have purposely or accidentally withheld some critical information to make the results appear in their favor (which might be the case, but the story is too big to simply assume so). Given that the design decision in question (now) quite obviously (and again, embarrassingly) appears to be lacking theoretical support, this is the least one would expect from the intelligent and honest people who came up with it. Not investigating it would reflect negatively on the entire field.
@quAdxify · 1 year ago
Linearization has been attempted many times and never worked too well, but maybe it does this time, who knows. I mean, you can approximate any non-linear function by a piecewise linear one, if my memory serves me well.
@clray123 · 1 year ago
It would be interesting to know why the non-linear function was used at this place in the architecture to start with. But hey, such subtle design decisions, which cost hundreds of millions of dollars, are not really documented in AI engineering/research. There's probably a nice picture with it in someone's paper - and that's it.
@dr.mikeybee · 10 months ago
Very good work, Yannic.
@yannickpezeu3419 · 1 year ago
Thanks! You are a great teacher!
@woohooliving8683 · 1 year ago
Great video. The cross-chunk is attention across chunks. Is that what you meant by accumulating the distant past?
@TooYoober · 1 year ago
Retention is all you need?
@mloewen248 · 1 year ago
Fantastic as always, Yannic!
@SimonJackson13 · 1 year ago
So the attention decay exponential effectively becomes the non-linear process? Wouldn't a Monte Carlo on the best causal "mess" sequence-distance be in order?
@JTMoustache · 1 year ago
Good to have you back.
@nikilragav · 1 year ago
So the goal is that at inference you can easily run it on some computer that doesn't do matrix operations (no GPU) or has small memory?
@TheReferrer72 · 1 year ago
Microsoft Research is doing some solid stuff. Phi-1.5 is groundbreaking.
@clementdato6328 · 1 year ago
What is Phi-1.5?
@TheReferrer72 · 1 year ago
@clementdato6328 A small LLM, a 1B-parameter model that outperforms Llama 2 models that are 10x bigger. Search for the "Textbooks Are All You Need" technical paper.
@fontenbleau · 1 year ago
Microsoft is not related to this paper; there are no foreigners on it, even for review, it's purely from Chinese academia. But I would take it skeptically, knowing that academia there is held in communist political pliers by the balls.
@clray123 · 1 year ago
@clementdato6328 A small model with great results on plausible reasoning tasks which you can't use for anything practical because of Microsoft's license.
@potential900 · 1 year ago
@clementdato6328 Name of a new model IIRC, you'll find the technical report on it online.
@xyh6552 · 1 year ago
So, can we still pass off an article without using softmax? This is a job at a second-grade level.
@nabileljabri2768 · 1 year ago
Where can I join your Discord?
@agentds1624 · 1 year ago
I really do appreciate the humorous scepticism (or should I just call it sarcasm 🤔). Anyway, your videos would only be half as enjoyable without it. Also, being able to mathematically explain why softmax prevents a recurrent implementation, yet describing the associative property as "the property that sort of allows us to change the order of computation" (which is by any means correct), gives me hope for my PhD 😂.
@acasualviewer5861 · 1 year ago
I'm hoping someone pretrains a huge model that we can all play with. The architecture is one thing. The trained model is another. It would be great if we could replace Llama or RoBERTa or other heavier models with these, which can be run on simpler machines. It definitely scales better. I trained a 234M-parameter model on Google Colab with no problems (just 3 hours of my life). So it's doable.
@ryanfranz6715 · 1 year ago
Idk if this paper in particular will do it (it might, just not sure), but it seems to me, conceptually, there should be no reason recurrence can't be re-applied to transformers after the fact in a way that still retains high performance. I know that's just a breakthrough away, and maybe this was it. Another thing I think is just one breakthrough away is the ability of a transformer-like architecture to "reason" over text input prior to producing an output. Transformers clearly show what I call "shallow reasoning" over the text, but they just instantly spit out their "gut feeling" (which has become quite good in large models) without the ability to go back and think more or less carefully over specific pieces of text. And sure, there are prompting techniques that vaguely accomplish this, like asking the model to list out its steps (which already presents an opportunity for progressive self-improvement), but I'm talking about having that much deeper in the architecture. I think there's nothing theoretically stopping that, just a breakthrough away, and if a huge model like GPT-4 suddenly had that ability... it's like if we had a chess computer that could only think 1 move ahead and that by itself was already as good as most humans, and suddenly we let that computer think many moves ahead... I think we're in for some interesting stuff...
@potential900 · 1 year ago
Have you looked at the recent Graph of Thoughts paper?
@klaudialustig3259 · 1 year ago
Did anybody else notice the hardware they trained on? The paper says: 500 AMD MI200 GPUs.
@fontenbleau · 1 year ago
Plausible. A100 & H100 are already sanctioned against any export to China, and recently the UAE was also sanctioned for smuggling cards into China. A huge diplomatic scandal, in fact.
@clray123 · 1 year ago
Now, if I wanted to attack RetNet, my preferred angle would be that it might not play well with the concept of shared paged KV caches (see e.g. vLLM), so while it reduces the absolute amount of memory for a single session, it may not provide benefits when batching multiple sessions at scale. Not sure if this is really the case, but it is at least something that would need to be considered when pushing RetNet as a replacement for KV-cache-based architectures.
@lejoon · 1 year ago
I don't really understand why you say that the causal mask allows for parallel computing? I mean, it's simply that the computation of a token doesn't really depend on its position because of self-attention.
@etopowertwon · 1 year ago
Transformers take N tokens as input and produce N tokens as output. Each output token is used for training. If it weren't for the causal mask, the network would learn that output[i] = input[i+1]. With the causal mask, tokens can't see the future, and therefore you can calculate the loss from (output[i], input[i+1]). Thus you can train on N tokens in parallel from a single batch. With an RNN you need to call the network N times; with transformers, you feed the entire batch and calculate the loss for every token in one go.
@lejoon · 1 year ago
@etopowertwon Sure, but that's not actually dependent on the causal mask itself, but on how the attention works. You can drop the causal mask and train a model on missing tokens and it would still be parallelizable.
@kan_drio · 9 months ago
I have a genuine, possibly dumb question. Regarding training parallelism, I don't understand why we care about it so much. When the training is parallel, it is also faster of course, but the memory usage is quadratic, right? Why do we care about speed, but we don't care about the memory usage "exploding" quadratically? Both time and memory are costly resources, right?
@05xpeter · 1 year ago
I'm confused: the reason we invented neural networks was that linear models did not perform, and what I have seen time and time again is that if we don't introduce non-linearities, it is just a linear model under the hood with similar performance. There must be some non-linearities; a purely linear model can't perform this well. Looking forward to reading the paper.
@lennart6520 · 1 year ago
I was first thinking the same, and I believe the answer is that the FFNs in each layer are still non-linear.
@nicholaswalker6996 · 1 year ago
I've trained a number of medium GPT-2 transformers and RetNets, and RetNet training is slower (compared to a transformer with fast attention) and their perplexity is lower (which they admit in the paper - they say it gets better with larger networks). RetNets are much better for inference. There is something strange about the role of the GroupNorm which I'm trying to work out...
@nicholaswalker6996 · 1 year ago
Sorry, perplexity is larger - loss is worse.
@cajampa · 1 year ago
@nicholaswalker6996 You know you can edit a YouTube comment, right?
@corgirun7892 · 1 year ago
The claim in this paper that RWKV cannot be trained in parallel is absolutely incorrect. RWKV can be trained in parallel, just like GPT.
@donglixp · 1 year ago
BTW, mathematically, you could also train it in parallel along the sequence. But as soon as you tried, there would be numerical issues; at least it's non-trivial to solve.
@donglixp · 1 year ago
I would not say it's an issue of RWKV; instead, it's caused by AFT.
@burkhardstackelberg1203 · 1 year ago
Is RetNet completely linear, or does it contain nonlinearities at some point? If not, I would expect some representation failure at some point.
@Luke-kv9ew · 1 year ago
They use a swish gate in each layer: arxiv.org/pdf/1710.05941v1.pdf?source=post_page
@donglixp · 1 year ago
It has LN and swish gating to add non-linearity.
@JerryFederspiel · 1 year ago
When I read this I just couldn't tell what was new enough to produce the claimed jump in performance. It seemed kind of premature to declare this a transformer successor.
@clray123 · 1 year ago
I see a problem in that everyone is TOO skeptical and nobody seems to be checking these claims as a result. People are still training old architectures because of this "that can't possibly work" assumption.
@etopowertwon · 1 year ago
@clray123 MS didn't release the weights, so it's not really possible to check the results.
@clray123 · 1 year ago
@etopowertwon They did release the implementation though (and there's even an independent one), so you could produce your own results and become famous by disproving their exaggerated claims (if there are any).
@JerryFederspiel · 1 year ago
@clray123 That's a very reasonable concern.
@potential900 · 1 year ago
@clray123 So what's the Nvidia equivalent of 500 AMD MI200 GPUs?
@triforce42 · 1 year ago
Hmm, could be an erratum. Do you call the sigmoidal activation the same thing as softmax around 20-22 mins? Also, don't transformers only use softmax in later layers? Isn't it just a ReLU activation for the nonlinearity?
@pi5549 · 1 year ago
There's a softmax in the attention mechanism. Say there are 256 tokens. You'll get a 256x256 attention matrix and you softmax over each row. So 256 softmaxes, each softmaxing a 256-length vector.
@triforce42 · 1 year ago
@pi5549 Oh, that clears up a misconception on my part. But isn't it still unconventional to use sigma to denote softmax, or is this a convention I'm not aware of? Usually sigma denotes the sigmoidal activation function, right?
@a1k0n · 1 year ago
@triforce42 Sigmoid and softmax are literally the same thing; sigmoid is what you get when you have a softmax between only two alternatives.
@triforce42 · 1 year ago
@a1k0n If I understand correctly, sigmoid and softmax are not "literally the same thing": sigmoid takes a scalar input and softmax takes a vector input. Notice the summation present in the softmax formula that's not present in the sigmoid one on Wikipedia, for example.
@a1k0n · 1 year ago
@triforce42 To see this, say we have softmax(x1, x2) and we're looking at the result for x1: e^x1 / (e^x1 + e^x2). Dividing the numerator and denominator by e^x1 gives 1 / (1 + e^(x2 - x1)), which is sigmoid(x1 - x2).
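A quick numeric check of that identity (a minimal sketch):

```python
import torch

x1, x2 = torch.tensor(1.3), torch.tensor(-0.7)

soft = torch.softmax(torch.stack([x1, x2]), dim=0)[0]  # P(class 1) under softmax
sig = torch.sigmoid(x1 - x2)                           # sigmoid of the logit gap

print(torch.allclose(soft, sig))  # True: 2-way softmax == sigmoid(x1 - x2)
```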
@billyf3346 · 1 year ago
Awesome, another paper to increase model size and power without any insight at the symbolic level or any features to push forward mechinterp.
@st0a · 1 year ago
I hope Yannic will do an episode on model compression; now that's an interesting topic.
@athelstanrex · 1 year ago
Best timeline
@clray123 · 1 year ago
It mostly increases inference speed over long sequences... and decreases memory consumption. And these are important aspects, quite regardless of the topics you mention.
@jondo7680 · 1 year ago
Yeah, I was also thinking that they got RWKV wrong. So I'm curious how the two really compare.
@donglixp · 1 year ago
You could refer to the RWKV implementation, which is not parallel along the sequence. This is also what RWKV claimed in the paper, i.e., recurrent for both training and testing. BTW, mathematically, you could also train it in parallel along the sequence, but as soon as you tried, there would be numerical issues; at least it's non-trivial to solve. I would not say it's an issue of RWKV; instead, it's caused by AFT. The key difference is that AFT/RWKV has a normalization over the keys, and it's also elementwise.
@GabrieldeOliveira86 · 1 year ago
If we could just replace transformers with RetNet, we'd soon have faster and smaller LLMs.
@lucamatteobarbieri2493 · 1 year ago
Early LLMs will soon be seen like dinosaurs. In a way, neural networks are a lossy compression of the training dataset. So yes, given the recent advancements, a huge increase in performance is to be expected, just as early compression algorithms got better and better through a series of improvements. In nature, brains specialize to save energy, so I expect something similar to happen in portable devices: at a certain point, extra intelligence becomes a waste of energy and resources.
@hanyanglee9018 · 1 year ago
I predict that the next successor is called ReuNet.
@gr8ape111 · 1 year ago
For "simple" NLP tasks with data this seems like a very good model!
@blayno_mtops · 1 year ago
Hey Yannic. Would RetNet be a solution for regression-based problems? I am coding a transformer to work on stock market data. In other words, I don't want to overfit the model if this mostly offers a fix for classification/LLM-based restrictions.
@AlexanderGambaryan · 1 year ago
Great video, wanted to send it to my best friend but he's got epilepsy 🤷🏿‍♂️
@clray123 · 1 year ago
Just make him shut his eyes at strategic moments.
@mshonle · 1 year ago
I was surprised he used the S word to describe the flickering screen, but that may be cultural differences... Think back to Tiger Woods' unfortunate use of the term in 2006. Other than that, I appreciate the video but wanted to bring awareness that this is a very sensitive term in the UK.
@AlexanderGambaryan · 1 year ago
@mshonle What's the S word?
@mshonle · 1 year ago
@AlexanderGambaryan I tried to include a link to the Tiger Woods incident from 2006, but I can't include links. You can search for the story to see what it was. In this context, it was in reference to the flickering in the video. I don't think there was any harm intended, but it is a sensitive term in the UK. It's kind of like how "the R word" was used a lot in the US in the 1980s as a general insult, but we don't speak it now because that's not how we speak about disabilities.
@AlexanderGambaryan · 1 year ago
@mshonle I'm sorry, English is not my first (or second) language and I'm totally lost in the forest of S, R, N and any other words that people make such a big deal about.
@fanero3046 · 1 year ago
I've got no idea how to picture those math equations as an AI.
@jawadmansoor6064 · 1 year ago
Any open-source model based on this paper?
@fontenbleau · 1 year ago
No, it's a 100% Chinese paper and closed research.
@jawadmansoor6064 · 1 year ago
@fontenbleau Microsoft is Chinese?
@fontenbleau · 1 year ago
@jawadmansoor6064 It's not related to Microsoft; there are no foreign names on it, and Western companies are forbidden in China by law without independent local offices, like offshores - there are many ways to be included just by one person's association with a local Chinese child company of Microsoft. Financing by foreign companies is also forbidden; you must open a new local clone company. Microsoft tested many things in China, including AI you have never seen, but that was in the past. It's only still present in China because Windows is still used by the bureaucracy, but the recent unveiling of a "state OS" based on the Ubuntu Linux core is the sign for Microsoft to leave officially. Knowing their history with Google and other giants (the Yahoo China scandal made it into Congress hearings), there's no way they'll keep any presence in China.
@fontenbleau · 1 year ago
@jawadmansoor6064 Like all Chinese research papers in English (with only a few in China speaking it), and they produce tonnes of them every year - I think it's a race for promotions or salary bonuses. In the USSR no one cared about translation into English except a few; in isolation it produced its own patents and research, also not published abroad, because many blatantly copied Western examples (it was even encouraged by high officials). Legally it's strange: in China international patents and copyright are not recognized (Michael Jordan lost his own trademark in court to local brand pirates), and if they publish internationally it also won't be protected abroad. There's no logic in doing this except collecting "social rating", and for the few exchange programmes still alive.
@Summersault666
@Summersault666 Жыл бұрын
Is linear Transformers like low rank adaptation applied to transformer?
@TheRyulord
@TheRyulord Жыл бұрын
No. It's a transformer variant that has linear time complexity with respect to sequence length instead of quadratic like in a vanilla transformer.
@Summersault666
@Summersault666 Жыл бұрын
@@TheRyulord well, Lora has linear complexity too
@Summersault666
@Summersault666 Жыл бұрын
I read the paper I think I understand what he means now. On every iteration you only recalculate the attention change due to the new tokens predicted, reusing the past ones you once calculated.
@TheRyulord
@TheRyulord Жыл бұрын
@@Summersault666 The additional cost of LoRAs during training is linear but the whole model is still quadratic if it's a regular transformer. It's also worth mentioning that the additional cost of LoRAs during inference can be zero because you can fuse the adapter into the model itself.
@Summersault666
@Summersault666 Жыл бұрын
@@TheRyulord Yes, but you can build low-rank transformers by default, in which case they'll be linear. But I understand now: the paper's point is that the computational effort is reused, and only the contribution of each newly predicted output is calculated.
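That reading matches the paper's parallel/recurrent equivalence. Here's a quick numerical check (a sketch with made-up tensors and no softmax anywhere) that the token-by-token recurrence, which reuses the accumulated state, reproduces the full decayed attention computed from scratch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 8, 4, 0.9
Q, K, V = rng.standard_normal((3, n, d))

# Parallel form: causal decay mask D[i, j] = gamma^(i-j) for j <= i, else 0
i, j = np.arange(n)[:, None], np.arange(n)[None, :]
D = np.where(i >= j, gamma ** (i - j), 0.0)
Y_parallel = (Q @ K.T * D) @ V

# Recurrent form: reuse the running state instead of recomputing the past
S = np.zeros((d, d))
Y_recurrent = np.zeros((n, d))
for t in range(n):
    S = gamma * S + np.outer(K[t], V[t])
    Y_recurrent[t] = Q[t] @ S

print(np.allclose(Y_parallel, Y_recurrent))  # True -- same outputs, less rework
```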
@hkj4276
@hkj4276 Жыл бұрын
20:00 "associativity" is the word you want, haha 😆
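Right, and the associative property is the whole trick. A tiny demo (made-up shapes): once the softmax is gone, you can regroup the matrix products and trade the n×n attention matrix for a d×d one.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 64
Q, K, V = rng.standard_normal((3, n, d))

left  = (Q @ K.T) @ V   # attention order: materializes n x n, O(n^2 * d)
right = Q @ (K.T @ V)   # regrouped:       materializes d x d, O(n * d^2)
print(np.allclose(left, right))  # True -- associativity, same result
```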
@FD286
@FD286 Жыл бұрын
Thank you : D
@pvlis
@pvlis Жыл бұрын
I think they have misrepresented the difference between RetNet and RWKV. To me it seems that, at the end of the day, RetNet is to RWKV as multi-head attention is to transformers. I believe RWKV still trains in parallel, just like RetNet.
@donglixp
@donglixp Жыл бұрын
You could refer to the RWKV implementation, which is not parallel over the sequence. This is also what RWKV claims in its paper, i.e., recurrent for both training and testing.
@donglixp
@donglixp Жыл бұрын
BTW, mathematically you could also train it in parallel over the sequence. But as soon as you try, you run into numerical issues; at the very least, they're non-trivial to solve.
@donglixp
@donglixp Жыл бұрын
I would not say it's an issue with RWKV itself; rather, it's inherited from AFT.
@donglixp
@donglixp Жыл бұрын
The key difference is that AFT/RWKV has a normalization over the keys, and it's also elementwise.
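For anyone following along, here's a rough side-by-side sketch of that difference (my own simplification with invented decay values, not the exact AFT/RWKV equations): retention keeps an unnormalized matrix state, while the AFT/RWKV family keeps per-channel (elementwise) numerator and denominator states, with the denominator normalizing over exp(keys).

```python
import numpy as np

d, gamma, w = 4, 0.9, 0.1
rng = np.random.default_rng(4)

S = np.zeros((d, d))                 # retention: matrix state, no normalization
num, den = np.zeros(d), np.zeros(d)  # AFT/RWKV-style: elementwise states

for _ in range(10):
    q, k, v = rng.standard_normal((3, d))
    # Retention-style update and readout (unnormalized)
    S = gamma * S + np.outer(k, v)
    y_retention = q @ S
    # AFT/RWKV-style update: everything per channel, keys normalized away
    num = np.exp(-w) * num + np.exp(k) * v
    den = np.exp(-w) * den + np.exp(k)
    y_aft = (1.0 / (1.0 + np.exp(-q))) * num / den  # sigmoid(q) gate

print(y_retention.shape, y_aft.shape)  # (4,) (4,)
```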
@Seek_Solve
@Seek_Solve Жыл бұрын
The "Failed to fetch" error on HF Chat ain't a good look. It's a shame; been holding down Open Assistant since release. Hating it. All the best, hope you guys can get it sorted. I know there's gotta be a lot going on. NASA congrats.
@MCroppered
@MCroppered Жыл бұрын
Why are you wearing sunglasses though?
@julius4858
@julius4858 Жыл бұрын
Can you make a video on the current state of OpenAssistant? Kinda out of the loop about the current landscape of open-source alternatives to ChatGPT.
@blayno_mtops
@blayno_mtops Жыл бұрын
having to re-record is the worst
@Aldraz
@Aldraz Жыл бұрын
Woah, this will literally change the world. This single paper. I can already see this. It will change everything. Bet.
@hermestrismegistus9142
@hermestrismegistus9142 Жыл бұрын
I doubt it. There will likely be significant tradeoffs. Free lunches are rare.
@Aldraz
@Aldraz Жыл бұрын
@@hermestrismegistus9142 This seriously reminds me of the RWKV architecture. I've been on their Discord for a long time; it seems like everything is going great now with RWKV-5, and it will be better than transformers by all measures, both in training and inference. I don't see why this couldn't be even better.
@clray123
@clray123 Жыл бұрын
@@hermestrismegistus9142 Based on that argument alone, you could literally stop all future research.
@hermestrismegistus9142
@hermestrismegistus9142 Жыл бұрын
@@clray123 Why? Because there are tradeoffs? Cars generally aren't as strong as trucks, but that hasn't stopped them from succeeding in the niches where speed is more important. Linear attention may well have its niche, but it is computationally cheaper than traditional attention, which implies there is a tradeoff, unless the extra computation of traditional attention is completely wasteful, which I highly doubt. As I said, there are few free lunches.
@clray123
@clray123 Жыл бұрын
@@hermestrismegistus9142 Of course there are free lunches if your algorithms are unnecessarily screwed up to begin with, which is what the results of the RetNet paper are suggesting. I should add that the non-engineering, purely empirical "alchemy" approach used in LLM research makes it quite likely that this sort of free lunch pops up here and there.
@Timotheeee1
@Timotheeee1 Жыл бұрын
Can you make a video about the InstaFlow paper?
@fire17102
@fire17102 Жыл бұрын
Anyone seen anyone implementing an LLM with RetNet?
@donglixp
@donglixp Жыл бұрын
We are working on this.
@clray123
@clray123 Жыл бұрын
I did some testing, but I'm not in a position to train base LLMs. From what I see in small-scale training (~1.5B), and without much formal validation, it has the same performance as equivalent-size transformers. It's certainly not "bullshit". What I notice is that it often repeats a single token ONCE (stutters), but it does not seem to suffer from the "repeat a token subsequence endlessly" problem as much as a transformer does (without artificial countermeasures such as a repeat penalty or banning seen n-grams). But this perhaps has to do with the improved positional embedding.
@clray123
@clray123 Жыл бұрын
@@donglixp Would it be possible for you to retrain phi-1.5 using RetNet? Do the respective groups in Microsoft Research compete or cooperate?
@Mulnader
@Mulnader Жыл бұрын
WTF, why am I watching this at 5:22 in the morning instead of going to sleep? It took me over 10 minutes to realize that I have no idea why Top Gun Guy calculates the alphabet and gets "cat" as the answer. #hyperfocus mode off!!!
@darogajee3286
@darogajee3286 Жыл бұрын
Welcome Agent Smith
@ssssssstssssssss
@ssssssstssssssss Жыл бұрын
I wish researchers would just state what the pros and cons are. If they don't state what the cons are, then either they are being dishonest and keeping them from the reader, or they haven't done enough work yet.
@donglixp
@donglixp Жыл бұрын
The con is that we need to retrain our Transformers.
@clray123
@clray123 Жыл бұрын
@@donglixp YES! And also admit that we may have been wrong about some basic stuff. Which is a GREAT barrier to some people, especially if their mistake made other people spend lots more money than they would otherwise have had to.
@anmolt3840051
@anmolt3840051 Жыл бұрын
Associative property
@christianszegedy3694
@christianszegedy3694 Жыл бұрын
Your formula for softmax is a bit "original", i.e. off, but nice explanation otherwise.
@syncrossus
@syncrossus Жыл бұрын
So it's just a linear transformer with an explicit bias to the attention heads, cool.
@FelixPetersen
@FelixPetersen Жыл бұрын
🎉
@kimchi_taco
@kimchi_taco Жыл бұрын
I think it's a good proof by counterexample of why softmax is needed: if there's no softmax, the n key/value vectors are no different from a single vector 😂 Of course, I don't believe their shady scalability claims.
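That collapse is easy to see numerically. Without softmax, attending to n separate key/value pairs is identical to querying one summed d×d matrix (a sketch with made-up tensors):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 16, 4
q = rng.standard_normal(d)
K, V = rng.standard_normal((2, n, d))

per_pair  = sum((q @ K[i]) * V[i] for i in range(n))  # n separate k/v "slots"
collapsed = q @ (K.T @ V)                             # one summed d x d matrix
print(np.allclose(per_pair, collapsed))  # True -- n pairs act as one object
```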
@christianszegedy3694
@christianszegedy3694 Жыл бұрын
I disagree, I think that part is sound.
@swyveu
@swyveu Жыл бұрын
Why the stupid sunglasses? They're too big, btw...
@fontenbleau
@fontenbleau Жыл бұрын
So, a challenge from China; there isn't a single foreigner in the credits, only citizens of China. Why they couldn't, or wouldn't, bring any reputable foreigner into the research drags the whole paper down (ideology, politics, military, racism?). I would take it with a huge grain of salt, considering the high corruption levels in Chinese academia: the state hands out free luxury apartments and expensive conference trips, but salaries sit at poverty level; most of the bureaucracy there is built like that. And the "planned", socialist-style goals in all industries, including science, force people to produce even false papers to hit the state plan and get salary bonuses. As USSR history shows, the smartest people are smart enough to run away; that's why certain scientists there were basically kept as hostages, and local academia, with a few exceptions, was always below average, because smart people are politically dangerous. Russian scientists called this social method negative selection, like a scythe through the grass. The biggest mystery for me is why China even publishes all their papers in English and abroad. The USSR didn't do that; there was no sense in it. World patents and copyrights aren't honored in China, as their courts have proved, and the same goes abroad for publications from China; they declined all the international agreements, even the one for driver's licenses (the Paris treaty).
@shaycray7662
@shaycray7662 Жыл бұрын
Oh look, the woke censorious grifter is back. Nice!
@Adhil_parammel
@Adhil_parammel Жыл бұрын
New hype from China, like LK-99.
@sagetmaster4
@sagetmaster4 Жыл бұрын
Hype?
@justfoundit
@justfoundit Жыл бұрын
Probably not. There's probably no fundamental reason the weights on the value vectors need to be real percentages (i.e., sum to one). I think any weighting can work, hence this solution.