Scaling Transformer to 1M tokens and beyond with RMT (Paper Explained)

57,726 views

Yannic Kilcher
A day ago

#ai #transformer #gpt4
This paper promises to scale transformers to 1 million tokens and beyond. We take a look at the technique behind it, the Recurrent Memory Transformer, and at its strengths and weaknesses.
OUTLINE:
0:00 - Intro
2:15 - Transformers on long sequences
4:30 - Tasks considered
8:00 - Recurrent Memory Transformer
19:40 - Experiments on scaling and attention maps
24:00 - Conclusion
Paper: arxiv.org/abs/2304.11062
Abstract:
This technical report presents the application of a recurrent memory to extend the context length of BERT, one of the most effective Transformer-based models in natural language processing. By leveraging the Recurrent Memory Transformer architecture, we have successfully increased the model's effective context length to an unprecedented two million tokens, while maintaining high memory retrieval accuracy. Our method allows for the storage and processing of both local and global information and enables information flow between segments of the input sequence through the use of recurrence. Our experiments demonstrate the effectiveness of our approach, which holds significant potential to enhance long-term dependency handling in natural language understanding and generation tasks as well as enable large-scale context processing for memory-intensive applications.
Authors: Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 137
@YannicKilcher · A year ago
OUTLINE:
0:00 - Intro
2:15 - Transformers on long sequences
4:30 - Tasks considered
8:00 - Recurrent Memory Transformer
19:40 - Experiments on scaling and attention maps
24:00 - Conclusion
Paper: arxiv.org/abs/2304.11062
@CosmiaNebula · A year ago
TL;DR: use a Transformer as an RNN. Imagine an LSTM, but each LSTM block is a Transformer. Train it by backpropagating through 7 steps of the RNN ("backprop through time", BPTT). Why now? Because algorithms and hardware have finally caught up enough to fit 7 copies of the Transformer onto one device. What next? Perhaps rematerialization!
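A minimal PyTorch sketch of the recurrence described above (my own illustration under assumed sizes, not the authors' BERT-based code): a plain encoder processes one segment at a time with a few learned memory vectors prepended, the memory written at segment t is read at segment t+1, and gradients are truncated after a fixed number of segments (truncated BPTT).

```python
import torch
import torch.nn as nn

class RecurrentMemoryBlock(nn.Module):
    def __init__(self, d_model=256, n_mem=10, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.init_mem = nn.Parameter(torch.randn(1, n_mem, d_model))  # learned initial memory
        self.n_mem = n_mem

    def forward(self, segments, bptt_segments=7):
        # segments: list of [batch, seg_len, d_model] tensors (already embedded)
        mem = self.init_mem.expand(segments[0].shape[0], -1, -1)
        outputs = []
        for t, seg in enumerate(segments):
            y = self.encoder(torch.cat([mem, seg], dim=1))  # [memory | segment tokens]
            mem = y[:, :self.n_mem]                         # write: updated memory
            outputs.append(y[:, self.n_mem:])               # per-token outputs for this segment
            if (t + 1) % bptt_segments == 0:                # truncate backprop through time
                mem = mem.detach()
        return outputs, mem

# toy usage: 20 segments of 64 "tokens" each (random embeddings for illustration)
model = RecurrentMemoryBlock()
outs, final_mem = model([torch.randn(2, 64, 256) for _ in range(20)])
print(len(outs), outs[0].shape, final_mem.shape)
```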
@thegreenxeno9430 · A year ago
Is Open Assistant open to submissions of home video recordings for training data?
@herp_derpingson · A year ago
Yay, a normal video after what feels like years. Also, is it just me, or have recent papers become increasingly easier to read? There is no obscure math, and the code is published.
@joech1065 · A year ago
As Clyde from South Park would say, "ChatGPT, dude"
@NoNameAtAll2 · A year ago
I miss ML news :(
@Nif3 · A year ago
Yes, I've noticed this as well - publications have become a lot shorter and more focused on practical applications.
@spacenerd113 · A year ago
@YouTube- Sucks- reminds me of that meme: Just ADD MORE LAYERS!
@jidun9478 · A year ago
Thanks for finally saying it. I have seen quite a few AI specialty channels talking about pasting stuff like the entire Harry Potter book series into a single prompt box :) OMG, I couldn't even comment.
@joe_limon · A year ago
Ty for covering this
@GeorgeFosberry · A year ago
Thank you for a great analysis that is accessible even to laymen like myself. Always a pleasure to watch your videos, in contrast to the AI hype riders (AAAAAAAAAAAAA TWO MILLION TOKENS CTX LENGTH IS HERE!!!11)
@neocrz · A year ago
Nice. I was interested in that paper. The video came out right on time.
@halocemagnum8351 · A year ago
I've always loved the in-depth paper reviews! Thanks so much for this one, it was great!
@adrianimfeld8360 · A year ago
Was literally waiting for your take on this paper, thx for covering it!
@adelhalawa974 · A year ago
Really appreciate not just the breakdown but you injecting your intuition throughout. Great vid
@perbojsen3433 · A year ago
Thank you for this nice video. Being brand new to this field, I nevertheless find your presentation and explanations very clear and easy to follow. I also appreciate your skepticism and how you look behind the hype.
@Skinishh · A year ago
Great explanation! The fact that the video is
@andres_pq · A year ago
Finally a paper review!!!
@alexeybalandin4676 · A year ago
A very concise and clear analysis, thank you very much!
@vivienseguy · A year ago
Great paper review as usual!
@American_Moon_at_Odysee_com · A year ago
Always thoughtful and good content, thank you.
@piratepartyftw · A year ago
Will you do Hyena next? Thanks!
@FredPauling · A year ago
I appreciate you taking the time to reduce the hype on this paper for non experts.
@Billy4321able · A year ago
I was very skeptical when people were saying that it could read an entire book, in memory, all at once. As it turns out, it was all just hype. Go figure.
@LaSalsePareille92 · A year ago
Amazing review of this paper, thanks!
@learningwithlowell · A year ago
Great breakdown. Thank you!
@ChuanChihChou · A year ago
Information only propagates bottom up in Transformer-XL so the maximum "receptive field" (effective context length) is finite regardless of how far back the BPTT goes. To be more precise, O(LC): L = number of layers, C = context length of each layer.
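For concreteness, a worked instance of that bound (the numbers are illustrative, not the paper's):

```latex
% Transformer-XL: information hops back at most one cached segment per layer,
% so the receptive field is bounded by depth times segment length.
\[
\text{receptive field} \;\approx\; L \cdot C,
\qquad \text{e.g. } L = 12,\; C = 512 \;\Rightarrow\; 12 \times 512 = 6144 \text{ tokens.}
\]
% The recurrent memory instead carries information forward indefinitely,
% bounded by what fits into the fixed-size memory rather than by depth.
```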
@jeffwads · A year ago
Having used the 30b model you guys created, I can say with confidence that it is an amazing model, far exceeding what I thought it would be capable of. Its comprehension appears to be at least GPT 3.5 level if not better. Well done.
@preddyshite6342 · A year ago
Tell me you haven't used ChatGPT 3.5 in a while without telling me.
@Klokinator · A year ago
OpenAssistant is absolutely not at ChatGPT's level. It is pretty good though, and certainly the best of the open source models out right now. I look forward to the next major iteration, and more importantly, I'M DOING MY PART! Contribute to the Oasst dataset!
@yildirimakbal6723 · A year ago
Great summary!
@breakablec · A year ago
This seems to work only when the information density is sparse enough not to overwhelm the memory.
@agsystems8220 · A year ago
For now. I guess you could let it control its own read speed so it runs at the speed it wants, potentially even with backtracking. It currently works like a book that turns over its own pages at a set rate, no matter what pace the reader felt was appropriate.
@breakablec · A year ago
@@agsystems8220 Well, the input size could also be varied across various pretrained model sizes, potentially with smaller chunks, and overwhelming of the inputs could be detected and adjusted for as well.
@share4713 · A year ago
Finally! You don't know it, but I am waiting every day for a new video.
@killers31337 · A year ago
I guess the interesting part is that they didn't use any additional weights to process memory. BERT's lack of causal masking makes it possible to update the memory by just passing it through the transformer layers. This method might be fundamentally incompatible with autoregressive models. It might be possible to use a NN trained this way with other forms of memory - I would guess it doesn't really care whether the memory tokens come from the previous segment or from elsewhere. So you could have a memory database and look up the most relevant memory for a specific segment.
@moomoo9820 · A year ago
Overhype for the algo
@Alex-gc2vo · A year ago
Seems like you could do the same thing with prompting. Maybe even better. Just feed it chunks of the overall text with a prompt to take notes on information relevant to the question, then use all the notes to answer. You could also do it with a vector database.
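A hedged sketch of that chunk-and-take-notes loop; `llm` is a placeholder for whatever completion function is available (it is an assumption, not a real API):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your favourite model here")

def answer_over_long_text(text: str, question: str, chunk_size: int = 3000) -> str:
    # 1) read the document in chunks and take question-directed notes
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    notes = []
    for chunk in chunks:
        notes.append(llm(
            f"Question: {question}\n"
            f"Text excerpt:\n{chunk}\n"
            "Write short notes on anything relevant to the question "
            "(or reply 'nothing relevant')."
        ))
    # 2) answer from the accumulated notes only
    return llm(
        f"Question: {question}\n"
        "Notes collected from a long document:\n" + "\n".join(notes) +
        "\nAnswer the question using only these notes."
    )
```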
@novantha1 · A year ago
I actually had a very silly idea at one point where you would have a transformer model doing general processing and understanding, with the catch that it would rapidly forget information. However, each time it learned something, a small percentage of the weights involved would be sent to an RNN, almost in the background. The idea was that the RNN would be long-term memory, and it would only learn things that were reinforced many times, ideally retaining specifically facts and figures. This isn't the same thing, but it seems that somebody had a similar thought.
@dik9091 · A year ago
I had that thought today, and also the thought of whether someone else had had it too, and now I see that is the case ;)
@ilia_zaitsev · A year ago
Indeed, it feels like a kind of RNN, but using attention layers instead of dense ones :) Or a recurrent transformer, depending on which side you look at it from...
@aBigBadWolf · A year ago
You should do a video on the Block-Recurrent Transformer! It's a mix between an LSTM and a Transformer and achieves SOTA on PG-19.
@almoni127 · A year ago
Great video as always! Just a small correction. Quadratic memory is not an issue since the introduction of flash attention. There are still the limitations of linear memory and quadratic running time.
@dik9091 · A year ago
Thank you, I was just breaking my head over it ;)
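The rough complexity picture behind that correction, stated informally (standard results, not from the paper):

```latex
% Naive attention materializes the full n-by-n score matrix;
% FlashAttention computes the same result in tiles.
\[
\text{naive attention: } O(n^2)\ \text{time},\ O(n^2)\ \text{memory}
\qquad\longrightarrow\qquad
\text{FlashAttention: } O(n^2)\ \text{time},\ O(n)\ \text{extra memory.}
\]
```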
@codemark7464 · A year ago
thanks a lot!
@aboody006 · A year ago
Woah I just read this today, and then I see this notification.
@hEmZoRz · 11 months ago
I'm really, really waiting for your review on the LongNet that claims to scale to 1B tokens!
@kristoferkrus · A year ago
Awesome
@ilianos · A year ago
Hi Yannic, great video! Are you planning to review the following paper? "Low-code LLM: Visual Programming over LLMs"
@arthurheuer · 6 months ago
I can hardly believe I laughed when hearing "a humongous 1 million, even 2 million tokens", in anticipation of how funny it will seem in the future…
@easter.bunny.6 · A year ago
Hi Yannic, thanks for your video. After watching it, do you think this model can be used in a decoder-only architecture?
@clray123 · A year ago
Sounds like the same approach as used by LlamaIndex (aka GPTIndex). It's true that it is not the same as having a 1M token context window, but the collected facts (and they can be something non-trivial, which still fits into the "small" 32K context window) can be then put together and summarized and inferred from as a final step. So it does in fact resemble what a human would do when extracting information from a long book - take notes on relevant topics while reading it, then write up some conclusions based on those notes alone.
@jonathanfranks1286 · A year ago
Sorry, could a model trained like that also output text with a large number of tokens?
@clray123 · A year ago
@@jonathanfranks1286 Huh? There is no limit on the number of tokens any model can output.
@serta5727 · A year ago
Cool thing❤
@barulicksama3838 · A year ago
You should do more videos on your new chat. You should promote it.
@SimSim314 · A year ago
It would be interesting to see a demo of any such system. Let's say OpenAssistant 30B with this...
@nettlesoup · A year ago
Not an AI dev so this is just my layman's reading. As other comments have referenced the "paste entire Harry Potter book" example, isn't the advantage of this that you could tell the memorization function what you want it to treat as facts? So, you could ask, "Tell me all the spells Hermione casts when Ron is nearby and where they are", and then the first step is to tune the memorization network to detect facts that relate to this and treat any sentences that don't involve any spell casting as noise for memorization purposes. (How? I don't know, some kind of fact filter rule in plain English that gets added to each pass? Presumably you can use a LLM to generate that filter rule text). Then the location of the spell casting can be determined from the context of preceding sentences. Maybe another memorization could be the list of unique spells as they're taught so they can be detected out of scope, e.g. wingardium levitosa or whatever it is (not a big HP fan sorry).
@weert7812 · A year ago
This seems like it could be a way to have agents which have more persistence in time.
@RuslanLagashkin · A year ago
Overhyping with all my might ) Seriously though, it is an obvious idea, just well executed. I guess at some point we'll have to write questions before the material to analyze, not in just any part of the prompt, as it is now in ChatGPT.
@Verrisin · A year ago
I mean, if they learn to generalize the compression ... it could remember a lot of stuff, and drop details but keep the basic idea ... - Then it would know "I need to look at X to find details" - it would output that as LOOKUP(X), something would include that thing in near-context (e.g. I look up source of a fn I roughly know) and it could do A LOT. - I mean ... this is how I work as a human. - I think if they figure out how to train it to have a general enough compression ... this approach is all that is needed.
@sandratoolan9598 · A year ago
Missed you. You look good in the glasses - it's too much of a brand already, dude, no way back.
@ground_news · A year ago
We enjoy watching your content and believe that both of our missions align well! Would love to connect to talk about a partnership
@serta5727 · A year ago
Algo Support
@fitybux4664 · A year ago
Maybe you could have it analyze every file in a large code base. Or have it be able to carry on a conversation that is weeks long.
@herp_derpingson · A year ago
Maybe
@makuru_dd3662 · A year ago
Or, more importantly, you could have an enormous prompt.
@thegreenxeno9430 · A year ago
Attention should be sentence specific. Label grammatically- noun, verb, etc. Store labels locally in a vector db to remember context (conversation, story, etc.) Run transformer on vdb. [context labelling] Next step, analysis engine stores 'understandings' in rdb. ¿
@thegreenxeno9430 · A year ago
Like, the rules of grammar already exist. Just apply that labelling scheme.
@danielhenderson7050 · A year ago
24:33 sketch is kinda funny :D
@kaikapioka9711 · A year ago
Finally.
@BO2trickshoting · A year ago
This would probably be useful for something like Bing Chat, or just search engines in general.
@RuairiODonnellFOTO · A year ago
What note-taking tool is he using? Anyone have tips on organising all the papers/PDFs into a catalogue on my desktop? I've read loads of papers but just put them in one big folder. Any nice research organiser for PDFs or URLs (maybe one that allows annotations for searching later)?
@evennot · A year ago
Why don't they just save the input sequence and reiterate over it when a question is presented? It's a genuine question: there's probably a reason. Multiple transformers constantly working over the input data (plus recurrent connections, not in parallel) can't be slower than an additional question-specific transformer reiterating over the text. Also, dumb reiteration with something specific "in mind" would be nice for spotting contradictory facts in the input. People solve some tasks like this. Betting on capturing all possible aspects of the input data in the "context cache" looks like an unsolvable problem to me.
@lamhkak47 · A year ago
I wonder if you could do a review of the RWKV model? I heard that model was built by a one-madlad team.
@albinoameise · A year ago
Would it be possible to have a step before the transformer that handles the input? E.g. first take the last section of the input (which is the task for the transformer) as a query. Then take some memory of fixed length and run an attention block over the input section by section, taking the query from before and doing attention between the memory and the current section. If that works, the memory would be a dense representation of what is actually important from the input, regardless of length or task. Might be difficult to train, though...
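A rough PyTorch sketch of that pre-compression idea (the commenter's proposal, not the paper's method); shapes and modules are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SectionCompressor(nn.Module):
    def __init__(self, d_model=256, n_mem=16, n_heads=4):
        super().__init__()
        self.mem0 = nn.Parameter(torch.randn(1, n_mem, d_model))       # fixed-size memory
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, sections, task_emb):
        # sections: list of [batch, sec_len, d_model]; task_emb: [batch, 1, d_model]
        mem = self.mem0.expand(task_emb.shape[0], -1, -1)
        for sec in sections:
            kv = torch.cat([task_emb, sec], dim=1)        # condition on the task/query
            upd, _ = self.cross(query=mem, key=kv, value=kv)
            mem = self.norm(mem + upd)                    # residual update of the memory
        return mem                                        # dense summary, length-independent

comp = SectionCompressor()
summary = comp([torch.randn(2, 128, 256) for _ in range(5)], torch.randn(2, 1, 256))
print(summary.shape)  # torch.Size([2, 16, 256])
```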
@theaugur1373 · A year ago
Anyone know how this compares with the Reformer architecture? It was able to scale to about 1 million tokens.
@dik9091 · A year ago
great news when it works
@yorth8154 · 11 months ago
New billion-token paper out. Can you make a rundown of it, please?
@siquod · A year ago
Why do they use autoregressive self-attention to generate and attend to the memory tokens? Wouldn't cross attention make more sense, mostly because then different semantic embeddings could be used for memory facts than for mere tokens?
@jnsi0 · A year ago
Seven segments - reminds me of Miller's law 🤔
@dinkusstinkus4396 · A year ago
To me the big reveal was that it had no other architecture, and they did it on a 1060
@marverickbin · A year ago
A question: BERT is an encoder-only transformer. That means the inputs are token IDs, but the outputs are vector embeddings, so they are not the same kind of data. Therefore, you cannot use the output as the input... How do they manage to get memory tokens as output if the outputs are vector embeddings?
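One way to see why this is not a contradiction (a sketch using the Hugging Face `BertModel`, not the authors' code): the memory never goes back through the token-ID lookup. Real tokens are embedded once, the memory "tokens" are just vectors concatenated in embedding space via `inputs_embeds`, and the hidden states at those positions become the memory for the next segment. The zero-initialized memory here is an assumption for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tok = BertTokenizer.from_pretrained("bert-base-uncased")

n_mem, d = 10, model.config.hidden_size
memory = torch.zeros(1, n_mem, d)                     # memory lives in embedding space

for segment_text in ["first chunk of the document", "second chunk of the document"]:
    ids = tok(segment_text, return_tensors="pt").input_ids
    tok_emb = model.embeddings.word_embeddings(ids)   # [1, seq_len, d]
    x = torch.cat([memory, tok_emb], dim=1)           # prepend memory vectors, no IDs needed
    out = model(inputs_embeds=x).last_hidden_state
    memory = out[:, :n_mem]                           # read the updated memory from the output
```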
@-mwolf · A year ago
Transformer-XL reminds me of the forward-forward algorithm.
@thegistofcalculus · A year ago
It may be possible to use this architecture to read backwards and look for an answer instead of trying to memorize facts that may or may not be relevant when the question comes. Or maybe iterate forward with awareness of the question that is otherwise presented at the end.
@davidlatkin5525 · A year ago
Can you make a video about SAM (Segment Anything Model) from Meta?
@davidconsumerofmath · A year ago
Load in entire code bases!!
@dik9091 · A year ago
Only now can I somewhat follow it, great. I understand the attention part, and I wondered why that isn't applied to the conversation with feedback, and whether that has already been done or is being researched. I am at the start and will draw conclusions at the end. In the meantime we have the MPT-7B model with 65k input using ALiBi? Quote: "These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases (ALiBi)." 2:40 yes, that's what I immediately thought when I understood the self-attention matrix: that's a non-scaling bottleneck that could be solved with an analog signal matrix built from computing op-amps and a en.wikipedia.org/wiki/Nonblocking_minimal_spanning_switch - and I happen to build these switches, hmm.
@holthuizenoemoet591 · A year ago
So what would be better: increasing BERT's context size from, for example, 512 to 2048, or using this recurrent memory technique and repeating the 512 four times?
@undergroundculture9009 · A year ago
Obviously increasing BERT's context size.
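Rough arithmetic behind the question above, counting attention-matrix entries only (everything else held equal):

```latex
\[
2048^2 = 4{,}194{,}304
\qquad\text{vs.}\qquad
4 \times 512^2 = 1{,}048{,}576,
\]
% i.e. the recurrent variant touches about 4x fewer attention entries, but information
% between segments has to pass through the fixed-size memory instead of direct
% token-to-token attention.
```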
@KevinGChiu · A year ago
How does it know what fact to put into memory before reading the question?
@alexbrown2288 · A year ago
Yannic looks a lot better without the sunglasses. He'd probably gain subscribers without them.
@DaniilKirilenko · A year ago
Hi Yannic! What pdf-reader do you use?
@rootthree9436 · A year ago
OneNote
@emmanuelkolawole6720 · A year ago
Hey Yannic, why don't you add PandasAI to your Open Assistant project? It will take the product to a new level of traffic. Also, support the PandasAI project so it can go beyond beta soon.
@zerotwo7319 · A year ago
Lol, a few weeks ago I was talking about how that was a limitation, but... what a time to be alive.
@cchance · A year ago
Is this similar to how automatic1111 surpasses the 75 token cap?
@lio1234234 · A year ago
Awesome stuff! Do you think this will be integrated into Open Assistant?
@Addoagrucu · A year ago
I don't know about this take. I kind of agree, except I think you're a bit too harsh on the utility this paper brings. To steelman the Twitter hype, I could say that the tradeoff between memory requirement (linear for this technique) and the amount of functionality learned (which I think can be pushed further with better datasets) might make this a contender for a pretty robust method for large-scale NLP. A study on how much complicated language-understanding benchmarks suffer when all available VRAM is used to fit multiple copies of the same transformer for backprop through time, as opposed to using all available VRAM to fit one big transformer, would help guide our opinions with empiricism.
@samsamhuns928 · A year ago
Sounds like RNNs with extra steps lol
@Veptis · 4 months ago
Took 10 months for Google to come up with Gemini... but they aren't telling us exactly how.
@NeoShameMan · A year ago
I was hyped for 500ms only, does that count?
@MultiCraftTube · A year ago
The Italians are coming 😱
@darklordvadermort · A year ago
Any comments/thoughts on Hyena?
@klammer75 · A year ago
Well put and eloquently described… gotta admit I was starstruck when I first saw the headline, but you're right, it's an RNN, not an absurdly long transformer window… Thank you for this 😎🦾
@binjianxin7830 · A year ago
7:44 maybe it's that the model needs to be able to rule out negative facts?
@user-fq1hi9gc5w · 8 months ago
I know that Kuratov.
@timeTegus · A year ago
"So you are saying I can put in all the Harry Potter books and ask questions about them" 😂
@creativityoverload2049 · 9 months ago
So can it do machine translation?
@snippletrap · A year ago
How does it compare with RWKV?
@snapo1750 · A year ago
In theory RWKV is completely different from transformers, as it is purely an RNN. Because RWKV uses only RNNs, there is no input context length limit, but in training they only feed in (afaik) 8k tokens, so it should not be able to know more. The more beautiful thing about RWKV is that you don't need to quadratically increase your VRAM 🙂
@user-do4fd7nr7p · A year ago
16:48 😂😂
@m4ng4n · A year ago
How does this fare vs MEGA?
@rumfordc · A year ago
Why does Open Assistant brown-nose for the WEF?
@Phasma6969 · A year ago
How?
@rumfordc · A year ago
@@Phasma6969 It describes them as heroes saving the world and agrees with every single one of their publicly stated agendas. It will even go so far as to ignore overrides on those topics (up to a point). I can understand how Microsoft and Google would reach this sort of behavior but am curious as to how Open Assistant comes by it.
@alexandermathews9710 · A year ago
@@rumfordc Probably because the data all the models are absorbing shares similar outlooks.
@rumfordc · A year ago
@@alexandermathews9710 Yeah, it's as if they're just pulling from the WEF's website and nowhere else. They should probably diversify their training set.
@alexandermathews9710 · A year ago
@@rumfordc No, I think the sheer amount of data that has been generated is in agreement with the WEF. This is one of the dangers of AI: a lack of diversity in data overall. It's not that WEF information is purposefully selected; it's that the sheer amount of it makes it look that way.
@draken5379 · A year ago
This 'RMT' seems really pointless. You can just use the same main LLM to turn text into embeddings and store them in a vector store database. Then you can search that vector store for everything related to the incoming input, allowing an LLM to have a vast collection of data that is retrieved in a natural-language way. Super simple example: I told my bot, "Dogs like blue, cats like red, rats like yellow." The LLM itself detects these 'facts' in the input and redirects them to a 'fact save' function, which saves each fact to a vector store. I then asked, "What color do dogs like?" The vector store is queried with that input, which returns "dogs like blue", and that gets fed into the LLM along with the current input as a 'fact'. A crude and simple example, but it shows you don't really need to code out a totally new neural net just to handle something an LLM can already handle by design.
@BO2trickshoting · A year ago
Do you think this is what Bing Chat uses?
@draken5379 · A year ago
@@BO2trickshoting Yeah, from what I've heard. The way Stripe, Bing, Spotify, etc. are handling memory is via vector stores.
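A minimal sketch of the fact-store approach described in this thread; `embed` is a placeholder for any sentence-embedding model (an assumption, not a specific library API):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in a sentence-embedding model here")

class FactStore:
    def __init__(self):
        self.facts, self.vectors = [], []

    def add(self, fact: str):
        self.facts.append(fact)
        self.vectors.append(embed(fact))

    def lookup(self, query: str, k: int = 3):
        q = embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9))
                for v in self.vectors]
        top = sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
        return [self.facts[i] for i in top]

# usage sketch:
# store = FactStore()
# for fact in ["Dogs like blue", "Cats like red", "Rats like yellow"]:
#     store.add(fact)
# relevant = store.lookup("What color do dogs like?")  # prepend these to the LLM prompt
```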
@nevokrien95 · 8 months ago
This isn't new, and it's relatively oversimplified. We already have Perceiver IO and Transformer-LSTM hybrids.
@qeter129 · A year ago
1 gagillion tokens of context...
@fontende · A year ago
A paper is a paper, but where is the working test...
@preddyshite6342 · A year ago
I'm running out of pants to shit
@ivanstepanovftw · A year ago
I don't like this idea from the paper... Why not just make embeddings of the previous context?