Only true OGs know this was originally named "2024 12 24 15 43 12"
@barni_7762 • 2 days ago
yep
@kimyongtan3818 • 2 days ago
Really? How can I find it
@johndoe6011 • 2 days ago
It still is.
@ultrasound1459 • 1 day ago
@@johndoe6011 He changed it back to the original title lol
@lem0nhead84 • 2 days ago
Just wanted to say that we are very lucky to have this content for free.
@washedtoohot • 1 day ago
Are we?
@ДаниилИмани • 2 days ago
00:00 - Introduction and abstract of the article
01:02 - Plots for comparing scaling properties of BLT vs. LLaMA 2 and LLaMA 3
03:28 - Architecture of Byte Latent Transformer
07:50 - Explains tokenization; byte-pair encoding
13:25 - Problems with tokenization
14:46 - Patch embeddings; dynamic tokenization
20:35 - Entropy-based grouping of bytes into patches
28:42 - Local encoder and local decoder
29:48 - Encoder hash n-gram embeddings
32:44 - BLT-specific hyperparameters: patch sizes
33:26 - Comparison with LLaMA architectures
35:35 - Limitations
@wolpumba4099 • 2 days ago
*Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)*
* 0:00 Introduction: Introduces the Byte Latent Transformer (BLT), a novel architecture that replaces traditional tokenization with dynamically sized "patches." Claims improved scaling behavior compared to token-based LLMs.
* 0:16 Dynamic Patching: BLT uses patches as the fundamental unit of computation, dynamically adjusting their size based on text complexity, offering a more efficient representation.
* 1:04 Scaling Comparison: Presents graphs comparing BLT's scaling to LLaMA 2 and 3, showcasing BLT's superior performance at equivalent training FLOPs, using bits-per-byte as an analog to perplexity.
* 3:28 BLT Architecture: Explains the two-tiered architecture. An inner, standard Transformer LLM operates on patch embeddings, while an outer system handles patch creation and decoding.
* 7:50 Tokenization Explained: Briefly explains common tokenization methods like byte-pair encoding (BPE) and WordPiece, highlighting issues like large vocabulary sizes and out-of-vocabulary words.
* 13:25 Problems with Tokenization: Discusses problems stemming from fixed vocabularies, such as difficulty handling numbers and limited chunk sizes.
* 14:46 Patch Embeddings: Describes how patch embeddings are dynamically created from byte embeddings using a local encoder. This allows for a flexible, non-fixed vocabulary representation.
* 20:35 Entropy-Based Grouping: Details the process of dynamically grouping bytes into patches based on the entropy of the next-byte prediction from a small, separate byte-level Transformer. High entropy triggers a new patch.
* 28:42 Local Encoder/Decoder: Explains the function of the local encoder (bytes to patch embedding) and decoder (patch embedding to bytes), which operate more frequently than the inner LLM.
* 29:48 Encoder Hash N-gram Embeddings: Describes how n-gram byte embeddings are hashed and incorporated into the byte embeddings to provide contextual information for the local encoder.
* 32:44 Patch Size Advantage: Experiments show BLT achieves similar performance to LLaMA models with significantly larger patch sizes (6-8 bytes vs. 3.7-4.4 bytes).
* 33:26 Comparison with LLaMA: BLT remains competitive with LLaMA models while demonstrating superior performance in tasks requiring character-level understanding, such as spelling inversion.
* 35:35 Limitations: Acknowledges limitations in raw runtime performance compared to highly optimized token-based LLMs, but highlights that FLOP-matched comparisons demonstrate BLT's potential. Further optimization is needed, particularly regarding techniques like FlexAttention. Also mentions potential improvements from jointly training components such as the small patching LLM.
I used gemini-1.5-pro on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.03. Input tokens: 22558. Output tokens: 601.
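To make the 20:35 entropy-based grouping step concrete, here is a minimal sketch of the patching rule, assuming some small byte-level model that returns a next-byte distribution given the prefix. The dummy model, threshold, and function names below are illustrative stand-ins for the paper's small byte-level transformer, not its actual implementation.

```python
# Sketch of entropy-based patching: cut a new patch whenever the predicted
# next-byte entropy exceeds a threshold (i.e. the byte model is "surprised").
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def dummy_next_byte_probs(prefix: bytes):
    """Stand-in model: high uncertainty after a space (word boundary),
    near-certainty elsewhere, so patches roughly align with words."""
    if prefix.endswith(b" "):
        return [1 / 256] * 256          # uniform -> 8 bits of entropy
    return [0.99] + [0.01 / 255] * 255  # peaked -> ~0.16 bits of entropy

def make_patches(data: bytes, next_byte_probs=dummy_next_byte_probs,
                 threshold: float = 2.0):
    """Group bytes into patches; start a new patch whenever the model's
    predicted next-byte entropy exceeds the threshold."""
    patches, current = [], bytearray()
    for i, b in enumerate(data):
        current.append(b)
        h = entropy_bits(next_byte_probs(data[: i + 1]))
        if h > threshold:               # high entropy: close the patch here
            patches.append(bytes(current))
            current = bytearray()
    if current:
        patches.append(bytes(current))
    return patches

print(make_patches(b"patches scale better than tokens"))
# e.g. [b'patches ', b'scale ', b'better ', b'than ', b'tokens']
```

With a trained byte-level model in place of the dummy, simple text yields long patches and high-entropy regions yield short ones, which is where the FLOP savings come from.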
@ai_outline • 1 day ago
Computer science is evolving at an amazing pace. Impossible to keep track… thank you so much for this video!
@Kram1032 • 2 days ago
Surely this could be iterated to allow even larger patches (call them roughly "sentence level" and then "paragraph level" or so), right? If it were possible to dynamically scale up to entire paragraphs or pages, we'd quite quickly cover entire books, possibly even with fairly short attention widths. Like, if your average patch covers roughly one page, if you have a max context length (at that level) of like 1024, most books ever written will comfortably fit inside this. All while, in principle, still having access to individual characters as needed. As for ASCII, surely this can work for BPE-style encodings that can handle arbitrary UTF-8 Unicode too?
@fuxtube • 2 days ago
No, it couldn't. Byte patches get encoded into fixed-dimensionality latents for the main LLM, so you couldn't compress larger and larger chunks of information into them in a "lossless" manner. The technique from the paper improves dictionary handling and tokenization, but you can't trick information theory with it.
@kellymoses8566 • 2 days ago
On Hacker News someone had the same idea, and one of the authors said that with more than two levels of patches it gets too hard to figure out how to allocate training compute.
@lem0nhead84 • 2 days ago
@@Kram1032 This surely doesn't scale. Can you imagine an LLM that you feed one page of text and, in one iteration, it spits out a whole new page? That would be impossible to train.
@Kram1032 • 2 days ago
@@lem0nhead84 it's not technically 1 iteration. There would then be several loops, right? The increasingly nested transformers would have different jobs. Effectively, the ~sentence- and ~paragraph- level transformers would just keep around the longer-scale state and tell that to the ~word-level transformer, and the increasingly larger-scale transformers would be more expensive but also would get run more rarely, right? Like, the ~paragraph-level transformer might only run once a second or so. If you get one that can generate an entire page "in one step", it might only run every few seconds. The underlying smaller-scale transformers would each run much more often though Like, I'm making no claims about this being faster. A single step on the scale of the largest transformer may take a long time. But for shorter texts, that largest transformer wouldn't even necessarily be invoked a single time because the EOT appears before that scale is relevant. So if we counted iterations, what would that be? Fractional iterations?
@Kram1032 • 2 days ago
@@kellymoses8566 too hard as of right now, or too hard, fundamentally?
@helmutwollmersdorfer7314 • 2 days ago
Ok, their method is back to the roots, i.e. it reinvents the wheel. Letter Successor Variety (LSV) was introduced by Harris (1955, 1967). Hafer and Weiss (1974) named these LSV and Letter Predecessor Variety (LPV) and introduced Letter Successor's Entropy (LSE). These and improved methods are well established in "conventional" (un)supervised text segmentation. If the variable-length (e.g. 2...8) n-grams are stored in a trie, then indexing them via a hash is obvious.
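Picking up the trie/hash point: here is a minimal sketch of hashed variable-length n-gram embeddings, roughly in the spirit of the paper's local-encoder hash n-gram embeddings (29:48 in the video). The table size, hash function, n-gram range, and embedding dimension are illustrative assumptions, not the paper's exact values.

```python
# Sketch of hash n-gram embeddings: every byte position gets extra
# embeddings for the n-grams ending there, looked up in fixed-size
# tables via a hash instead of a trie. Collisions are simply tolerated.
import numpy as np

NGRAM_SIZES = range(2, 9)   # variable-length n-grams, e.g. n = 2..8
TABLE_SIZE = 1 << 12        # assumed hash-table (embedding table) size per n
EMB_DIM = 32                # assumed embedding dimension

rng = np.random.default_rng(0)
tables = {n: rng.normal(size=(TABLE_SIZE, EMB_DIM)) for n in NGRAM_SIZES}

def ngram_hash(gram: bytes) -> int:
    """Simple polynomial rolling hash into the table; the exact hash
    is an implementation detail."""
    h = 0
    for b in gram:
        h = (h * 257 + b) % TABLE_SIZE
    return h

def ngram_augmented_embeddings(data: bytes) -> np.ndarray:
    """For each byte position, sum the hashed embeddings of all n-grams
    ending at that position (the plain byte embedding is omitted here)."""
    out = np.zeros((len(data), EMB_DIM))
    for i in range(len(data)):
        for n in NGRAM_SIZES:
            if i + 1 >= n:
                gram = data[i + 1 - n : i + 1]
                out[i] += tables[n][ngram_hash(gram)]
    return out

print(ngram_augmented_embeddings(b"strawberry").shape)  # (10, 32)
```

The hash lookup keeps memory bounded regardless of how many distinct n-grams occur, which is the practical advantage over an explicit trie or dictionary.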
@Mordenor • 1 day ago
Thank you, Mr Yannic, for the thoughtful discussion of an alternative tokenisation scheme using characterwise patching.
@c.2518 • 2 days ago
For some reason your videos are not showing up often for me
@yeezythabest • 2 days ago
That's why you should systematically hit like, to train the algorithm.
@emuccino • 1 day ago
Hit the bell icon
@ClaimClam • 1 day ago
That’s Eviltube for ya
@MrOlivm • 1 day ago
It might do better if you don't subscribe. Pirate Software described a perverse incentive for Shorts where, with the notification setup, videos are shown to subscribers first, and if subscribers don't watch a video immediately, it gets downranked globally.
@Quazgaa • 2 days ago
Yannic to the rescue! I was honestly way more excited about this than o3 🤷
@gsusduke • 2 days ago
this feels like a more complicated Perceiver. i guess the striding makes the cross attention layer a little less expensive, but the procedure used to determine the strides is complicated and kinda hacky
@acasualviewer5861 • 2 days ago
Good explanation of the paper. I saw another explanation and didn't understand a thing. You broke it down nicely.
@AndrewRafas • 2 days ago
I watched it even though I had already read the paper, and I liked it. However, the video is very quiet relative to the ads, so making it 50% louder would be nice. Thanks!
@braphog21 • 1 day ago
I wonder if this could be extended to other modalities. You could start off with a classifier to determine the modality of the input data (text, image, audio, etc.) then use a different encoder for each modality, then feed that into a "unifying" encoder which then feeds "patches" into the latent transformer (doing the reverse to decode).
@JTMoustache • 2 days ago
Supposedly, once trained, the outer encoder/decoder could then be used as an interface to any inner-loop LLM. No need to retrain it, no?
@farrael004 • 2 days ago
Not if they are trained end-to-end with the Latent Transformer like he said. In that case, you need to train either future Latent Transformers with the pre-trained loop, or a different outer loop with the same Latent Transformer. You won't be able to mix and match two separate models that were trained differently.
@st33lbird • 2 days ago
Do you think this addresses the classical "chunking" problem?
@braphog21 • 1 day ago
With this more modular approach I wonder if the local encoder/decoder could be replaced to "increase" the performance of the inner transformer (by eliciting preferred behaviour).
@PeterKornezos • 1 day ago
Very nice paper. I have been thinking about this idea for a while now. Two questions: wouldn't it be beneficial to use some sort of convolution instead of n-grams, and would it be a good idea to have a second layer of patching? I ask because patching sort of makes words (but not exactly), and patching the patches would sort of make sentences, which makes sense to me: there are many ways to say the same thing, so the second level of patching should capture what we want to say and the first how we want to say it.
@QuadraticPerplexity • 2 days ago
Not to be confused with Byte Latent Tomatoes, obviously.
@yannickpezeu3419 • 2 days ago
Thanks a lot! Wouldn't it be possible to do several layers of tokenization/encoding? With 4 to 5 such layers, the central LLM would produce the next idea instead of the next token.
@braphog21 • 1 day ago
I wonder if this could be extended so that instead of encoding/decoding "words" (groups of tokens) it would encode/decode groups of words - either by adding another encode/decode step to group the groups of tokens or as a single unit.
@jondo7680 • 2 days ago
You aren't sure if you understood. I am sure that I didn't. We are not the same 👔
@HUEHUEUHEPony • 2 days ago
We can't tokenize; otherwise, how could we count how many r's strawberry has?
@draken5379 • 2 days ago
First thing that pops into my mind with your example of "'er' is common, so let's just make 'er' a token": that will hinder the model's ability to learn relations between the token 'e' and 'er'. I feel like tokens not being single chars is something that was done as an easy fix to save compute, but at this point it's 100% hindering the models. It's the reason every model, in its base form, is so bad at counting words etc.: it pretty much has to do chain-of-thought reasoning in order to count words, because it has been hindered by the token setup so much that it needs to work insanely hard to do something any human can do without even thinking. Hell, if you ask any good LLM about this topic, it will say that training on non-character-level tokens WILL hinder the model in many aspects that could even compound.
@mshonle • 2 days ago
Seems like you could have your own data (e.g., corporation-specific documents) and, instead of fine-tuning an LLM to work better with your data, you could use NER, learn/compute the patch for each entity, and then use this additional NER pre-processing to work directly with these specific terms. For example, the name of the CEO could be mapped to a patch that effectively means "gen X tech bro billionaire from California with blah blah blah." You'd probably need to inject some extra context into the prompt to map in the most salient points about each custom entity. This could give you a form of learning that sits between fine-tuning and ICL.
@1989arrvind • 1 day ago
Great 👍👍👍
@thivuxhale • 2 days ago
What do you think are the implications of this paper?
@seidtgeist • 2 hours ago
04:10 ah.. the song of my people
@_ARCATEC_ • 1 day ago
Thank you.
@florianvahl5494 • 2 days ago
Small large language model :D so a language model
@HadiLq • 1 day ago
you are awesome!
@twobob • 2 days ago
NGL Yannic, this doesn't feel like a step TOWARD LLM transparency, amiright?
@twobob • 2 days ago
Good explanation. Thanks
@johnkost2514 • 1 day ago
So N-grams...
@mike___-fi5kp • 2 days ago
1 min!
@rm-ra8317 • 2 days ago
2 mins
@YuvarajaPolixena • 2 days ago
Thanks for the breakdown! I need some advice: My OKX wallet holds some USDT, and I have the seed phrase. (alarm fetch churn bridge exercise tape speak race clerk couch crater letter). How can I transfer them to Binance?
@TheatricsOfTheAbsurd • 2 days ago
😂😂😂😂 bait one way or another
@siddharth-gandhi • 2 days ago
seems...hacky
@farrael004 • 2 days ago
Welcome to machine learning!
@bright_minary6537 • 2 days ago
Less hacky than tokens I guess?
@mmcrypto2403 • 2 days ago
Since I became so rich in cryptocurrency I realise that crypto is the future cuz I invested 10k and made up to 36k as weekly profit I appreciate the help of your channel 😃🙂...
@altcoinsdaily1754 • 2 days ago
For me trading has not been good well 😞 for me every day I came to watch video's I only see people appreciating how good they trading works
@altcoinsdaily1754 • 2 days ago
How do you manage to use signal to make such profit 🤷.
@mmcrypto2403 • 2 days ago
It's Juliana Hadas doing, she's changed my life
@mmcrypto2403 • 2 days ago
She is a popular investment manager in Texas and even has a Google page and helps people grow various assets like cryptocurrencies, real estate, stocks, ETFs, etc. ✓ ✓ A very blessed and intelligent woman.
@RichmaxPetal • 2 days ago
WOW!!! You know her too? I'm also a proud beneficiary of her platform