Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)

  22,195 views

Yannic Kilcher

1 day ago

Comments
@scoffpickle9655 2 days ago
Only true OGs know this was originally named "2024 12 24 15 43 12"
@barni_7762 2 days ago
yep
@kimyongtan3818 2 days ago
Really? How can I find it?
@johndoe6011 2 days ago
It still is.
@ultrasound1459 1 day ago
@@johndoe6011 he changed it again to the original title lol
@lem0nhead84 2 days ago
Just wanted to say that we are very lucky to have this content for free.
@washedtoohot 1 day ago
Are we?
@ДаниилИмани 2 days ago
00:00 - Introduction and abstract of the article
01:02 - Plots comparing scaling properties of BLT vs. LLaMA 2 and LLaMA 3
03:28 - Architecture of the Byte Latent Transformer
07:50 - Explains tokenization; byte-pair encoding
13:25 - Problems with tokenization
14:46 - Patch embeddings; dynamic tokenization
20:35 - Entropy-based grouping of bytes into patches
28:42 - Local encoder and local decoder
29:48 - Encoder hash n-gram embeddings
32:44 - BLT-specific hyperparameters: patch sizes
33:26 - Comparison with LLaMA architectures
35:35 - Limitations
@wolpumba4099 2 days ago
Byte Latent Transformer: Patches Scale Better Than Tokens (Paper Explained)
* 0:00 Introduction: Introduces the Byte Latent Transformer (BLT), a novel architecture that replaces traditional tokenization with dynamically sized "patches." Claims improved scaling behavior compared to token-based LLMs.
* 0:16 Dynamic Patching: BLT uses patches as the fundamental unit of computation, dynamically adjusting their size based on text complexity, offering a more efficient representation.
* 1:04 Scaling Comparison: Presents graphs comparing BLT's scaling to LLaMA 2 and 3, showcasing BLT's superior performance at equivalent training FLOPs, using bits-per-byte as an analog to perplexity.
* 3:28 BLT Architecture: Explains the two-tiered architecture. An inner, standard Transformer LLM operates on patch embeddings, while an outer system handles patch creation and decoding.
* 7:50 Tokenization Explained: Briefly explains common tokenization methods like byte-pair encoding (BPE) and WordPiece, highlighting issues like large vocabulary sizes and out-of-vocabulary words.
* 13:25 Problems with Tokenization: Discusses problems stemming from fixed vocabularies, such as difficulty handling numbers and limited chunk sizes.
* 14:46 Patch Embeddings: Describes how patch embeddings are dynamically created from byte embeddings using a local encoder. This allows for a flexible, non-fixed vocabulary representation.
* 20:35 Entropy-Based Grouping: Details the process of dynamically grouping bytes into patches based on the entropy of the next-byte prediction from a small, separate byte-level Transformer. High entropy triggers a new patch.
* 28:42 Local Encoder/Decoder: Explains the function of the local encoder (bytes to patch embedding) and decoder (patch embedding to bytes), which operate more frequently than the inner LLM.
* 29:48 Encoder Hash N-gram Embeddings: Describes how n-gram byte embeddings are hashed and incorporated into the byte embeddings to provide contextual information for the local encoder.
* 32:44 Patch Size Advantage: Experiments show BLT achieves similar performance to LLaMA models with significantly larger patch sizes (6-8 bytes vs. 3.7-4.4 bytes).
* 33:26 Comparison with LLaMA: BLT remains competitive with LLaMA models while demonstrating superior performance on tasks requiring character-level understanding, such as spelling inversion.
* 35:35 Limitations: Acknowledges limitations in raw runtime performance compared to highly optimized token-based LLMs, but highlights that FLOP-matched comparisons demonstrate BLT's potential. Further optimization is needed, particularly regarding techniques like FlexAttention. Also mentions potential improvements from jointly training components like the small patching LLM.
I used gemini-1.5-pro on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.03. Input tokens: 22558. Output tokens: 601.
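To make the entropy-based grouping step (20:35) concrete, here is a minimal sketch of how patch boundaries can be derived from a next-byte entropy signal. The bigram counter below is only a stand-in for the paper's small byte-level Transformer, and the threshold value is arbitrary; this is an illustration of the idea, not the paper's implementation.

```python
import math
from collections import defaultdict

def train_bigram_model(corpus: bytes):
    """Count byte bigrams as a crude stand-in for the small
    byte-level model that predicts the next byte."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def next_byte_entropy(counts, prev_byte: int) -> float:
    """Shannon entropy (bits) of the predicted next-byte distribution."""
    dist = counts.get(prev_byte)
    if not dist:
        return 8.0  # unseen context: treat as maximally uncertain
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())

def entropy_patches(data: bytes, counts, threshold: float = 2.0):
    """Start a new patch whenever the next-byte entropy exceeds a
    (hypothetical) global threshold."""
    patches, current = [], bytearray([data[0]])
    for i in range(1, len(data)):
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(data[i])
    patches.append(bytes(current))
    return patches

corpus = b"the quick brown fox jumps over the lazy dog " * 50
model = train_bigram_model(corpus)
print(entropy_patches(b"the quick brown fox", model, threshold=2.0))
```

On this toy corpus, predictable continuations stay in the same patch while uncertain positions (e.g. right after a space) tend to open a new one, which is roughly the behavior described in the video.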
@ai_outline 1 day ago
Computer science is evolving at an amazing pace. Impossible to keep track… thank you so much for this video!
@Kram1032 2 days ago
Surely this could be iterated to allow even larger patches (call them roughly "sentence level" and then "paragraph level" or so), right? If it were possible to dynamically scale up to entire paragraphs or pages, we'd quite quickly cover entire books, possibly even with fairly short attention widths. Like, if your average patch covers roughly one page, if you have a max context length (at that level) of like 1024, most books ever written will comfortably fit inside this. All while, in principle, still having access to individual characters as needed. As for ASCII, surely this can work for BPE-style encodings that can handle arbitrary UTF-8 Unicode too?
@fuxtube 2 days ago
No, it couldn't. Byte patches then get encoded into fixed-dimensionality latents for the main LLM, so you couldn't compress larger and larger chunks of information into them in a "lossless" manner. The technique from the paper improves dictionary handling and tokenization, but you can't trick information theory with it.
@kellymoses8566 2 days ago
On Hacker News someone had the same idea, and one of the authors said that with more than two levels of patches it gets too hard to figure out how to allocate training compute time.
@lem0nhead84 2 days ago
@@Kram1032 this surely doesn't scale. Can you imagine an LLM that you feed 1 page of text and, in 1 iteration, it spits out a whole new page? That would be impossible to train
@Kram1032 2 days ago
@@lem0nhead84 it's not technically 1 iteration. There would then be several loops, right? The increasingly nested transformers would have different jobs. Effectively, the ~sentence- and ~paragraph-level transformers would just keep around the longer-scale state and tell that to the ~word-level transformer, and the increasingly larger-scale transformers would be more expensive but also would get run more rarely, right? Like, the ~paragraph-level transformer might only run once a second or so. If you get one that can generate an entire page "in one step", it might only run every few seconds. The underlying smaller-scale transformers would each run much more often, though. Like, I'm making no claims about this being faster. A single step on the scale of the largest transformer may take a long time. But for shorter texts, that largest transformer wouldn't even necessarily be invoked a single time because the EOT appears before that scale is relevant. So if we counted iterations, what would that be? Fractional iterations?
@Kram1032 2 days ago
@@kellymoses8566 too hard as of right now, or too hard, fundamentally?
@helmutwollmersdorfer7314 2 days ago
Ok, their method is back to the roots, i.e. reinventing the wheel. Letter Successor Variety (LSV) was introduced by Harris (1955, 1967). Hafer and Weiss (1974) named them LSV and Letter Predecessor Variety (LPV) and introduced Letter Successor's Entropy (LSE). These and improved methods are established in "conventional" (un)supervised text segmentation. If the variable-length (e.g. 2...8) n-grams are stored in a trie, then indexing them via a hash is obvious.
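On the hashing point: for anyone wondering what the hash n-gram embeddings (29:48) look like mechanically, here is a rough sketch in which each byte position hashes the n-grams ending at it into fixed-size embedding tables and adds the results to its byte embedding. The dimensions, table size, n-gram sizes, and FNV-style hash are illustrative choices, not the paper's settings.

```python
import numpy as np

EMB_DIM = 16          # illustrative, not the paper's dimension
TABLE_SIZE = 4096     # hash buckets per n-gram size (illustrative)
NGRAM_SIZES = (3, 4, 5)

rng = np.random.default_rng(0)
byte_emb = rng.normal(size=(256, EMB_DIM))  # one embedding per byte value
ngram_tables = {n: rng.normal(size=(TABLE_SIZE, EMB_DIM)) for n in NGRAM_SIZES}

def ngram_bucket(ngram: bytes, table_size: int) -> int:
    """Hash an n-gram into a fixed number of buckets (64-bit FNV-1a here)."""
    h = 0xCBF29CE484222325
    for b in ngram:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h % table_size

def encode_bytes(data: bytes) -> np.ndarray:
    """Per-byte embedding = byte embedding + hashed embeddings of the
    n-grams ending at that byte, for each n in NGRAM_SIZES."""
    out = np.zeros((len(data), EMB_DIM))
    for i, b in enumerate(data):
        vec = byte_emb[b].copy()
        for n in NGRAM_SIZES:
            if i + 1 >= n:
                gram = data[i + 1 - n : i + 1]
                vec += ngram_tables[n][ngram_bucket(gram, TABLE_SIZE)]
        out[i] = vec
    return out

print(encode_bytes(b"strawberry").shape)  # (10, 16)
```

Hashing into a fixed number of buckets keeps the tables bounded regardless of how many distinct n-grams occur, at the cost of occasional collisions, which is the usual trade-off with hashed feature tables.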
@Mordenor 1 day ago
Thank you Mr Yannic for giving a thoughtful discussion on an alternative tokenisation scheme using characterwise patching.
@c.2518 2 days ago
For some reason your videos are not showing up often for me
@yeezythabest 2 days ago
That's why you systematically hit like, to train the algorithm
@emuccino 1 day ago
Hit the bell icon
@ClaimClam 1 day ago
That’s Eviltube for ya
@MrOlivm 1 day ago
Might do better if you don't subscribe. Pirate Software described a perverse incentive for Shorts where the video first shows to subscribers via the notification setup, and if subscribers don't watch it immediately, it gets downranked globally.
@Quazgaa 2 days ago
Yannic to the rescue! I was honestly way more excited about this than o3 🤷
@gsusduke 2 days ago
This feels like a more complicated Perceiver. I guess the striding makes the cross-attention layer a little less expensive, but the procedure used to determine the strides is complicated and kinda hacky.
@acasualviewer5861 2 days ago
Good explanation of the paper. I saw another explanation and didn't understand a thing. You broke it down nicely.
@AndrewRafas 2 days ago
I watched it even though I had already read the paper, and I liked it. However, the video is very quiet relative to the advertisements, so making it 50% louder would be nice. Thanks!
@braphog21 1 day ago
I wonder if this could be extended to other modalities. You could start off with a classifier to determine the modality of the input data (text, image, audio, etc.) then use a different encoder for each modality, then feed that into a "unifying" encoder which then feeds "patches" into the latent transformer (doing the reverse to decode).
@JTMoustache 2 days ago
Supposedly, once trained, the outer encoding/decoding could then be used as an interface to any inner-loop LLM. No need to retrain it, no?
@farrael004 2 days ago
Not if they are trained end-to-end with the Latent Transformer like he said. In that case, you need to train either future Latent Transformers with the pre-trained loop, or a different outer loop with the same Latent Transformer. You won't be able to mix and match two separate models that were trained differently.
@st33lbird 2 days ago
Do you think this addresses the classical "chunking" problem?
@braphog21 1 day ago
With this more modular approach I wonder if the local encoder/decoder could be replaced to "increase" the performance of the inner transformer (by eliciting preferred behaviour).
@PeterKornezos 1 day ago
Very nice paper. I have been thinking of this idea for a while now. I have two things to ask: wouldn't it be beneficial to use some sort of convolution instead of n-grams, and would it be a good idea to have a second layer of patching? I ask because patching sort of makes words, but not exactly, and patching the patches would sort of make sentences. To my mind that makes sense, because there are many ways to say the same thing, so the second patching should capture what we want to say and the first how we want to say it.
@QuadraticPerplexity 2 days ago
Not to be confused with Byte Latent Tomatoes, obviously.
@yannickpezeu3419 2 days ago
Thanks a lot! Wouldn't it be possible to do several layers of tokenization/encoding? With 4 to 5 such layers, the central LLM would produce the next idea instead of the next token.
@braphog21 1 day ago
I wonder if this could be extended so that instead of encoding/decoding "words" (groups of tokens) it would encode/decode groups of words - either by adding another encode/decode step to group the groups of tokens or as a single unit.
@jondo7680 2 days ago
You aren't sure if you understood. I am sure that I didn't. We are not the same 👔
@HUEHUEUHEPony 2 days ago
We can't tokenize, otherwise how can we count how many r's strawberry has?
@draken5379 2 days ago
First thing that pops into my mind with your example of "'ER' is common, so let's just make 'er' a token": that will hinder the model's ability to learn relations between the tokens 'e' and 'er'. I feel like tokens not being single chars is something that was done as an 'easy fix' to save compute, but at this point it's 100% hindering the models. It's the reason every model, in its base form, is so bad at counting words, etc.: it pretty much has to do chain-of-thought reasoning in order to count words, because it's been hindered by the token setup so much that it needs to work insanely hard to do something any human can do without even thinking. Hell, if you ask any good LLM about this topic, it will say that training on non-char-level tokens WILL hinder the model in many aspects that could even compound.
@mshonle 2 days ago
Seems like you could have your own data (e.g., corporation-specific documents) and, instead of fine-tuning an LLM to work better with your data, use NER, learn/compute the patch for each entity, and rely on this additional NER pre-processing to work directly with those specific terms. For example, the name of the CEO could be mapped to a patch that effectively means "gen X tech bro billionaire from California with blah blah blah." You'd probably need to inject some extra context into the prompt to map in the most salient points about each custom entity. This could give you a form of learning that exists between the space of fine-tuning and ICL.
@1989arrvind 1 day ago
Great 👍👍👍
@thivuxhale 2 days ago
what do you think is the implication of this paper?
@seidtgeist 2 hours ago
04:10 ah.. the song of my people
@_ARCATEC_ 1 day ago
Thank you.
@florianvahl5494 2 days ago
Small large language model :D so a language model
@HadiLq 1 day ago
you are awesome!
@twobob 2 days ago
NGL Yannic, this doesn't feel like a step TOWARD LLM transparency, amiright?
@twobob 2 days ago
Good explanation. Thanks
@johnkost2514 1 day ago
So N-grams...
@mike___-fi5kp 2 days ago
1 mins!
@rm-ra8317 2 days ago
2 mins
@YuvarajaPolixena 2 days ago
Thanks for the breakdown! I need some advice: My OKX wallet holds some USDT, and I have the seed phrase. (alarm fetch churn bridge exercise tape speak race clerk couch crater letter). How can I transfer them to Binance?
@TheatricsOfTheAbsurd 2 days ago
😂😂😂😂 bait one way or another
@siddharth-gandhi 2 days ago
seems...hacky
@farrael004 2 days ago
Welcome to machine learning!
@bright_minary6537 2 days ago
Less hacky than tokens I guess?
@mmcrypto2403 2 days ago
Since I became so rich in cryptocurrency I realise that crypto is the future cuz I invested 10k and made up to 36k as weekly profit I appreciate the help of your channel 😃🙂...
@altcoinsdaily1754 2 days ago
For me trading has not been good well 😞 for me every day I came to watch video's I only see people appreciating how good they trading works
@altcoinsdaily1754 2 days ago
How do you manage to use signal to make such profit 🤷.
@mmcrypto2403 2 days ago
It's Juliana Hadas doing, she's changed my life
@mmcrypto2403 2 days ago
She is a popular investment manager in Texas and even has a Google page and helps people grow various assets like cryptocurrencies, real estate, stocks, ETFs, etc. ✓ ✓ A very blessed and intelligent woman.
@RichmaxPetal 2 days ago
WOW!!! You know her too? I'm also a proud beneficiary of her platform