WOAHHHH this is incredible, such a great breakdown of the paper just from the title alone! this is the underrated high-quality expert content i was yearning for on youtube, now i feel like i can finally begin to understand these complex papers... THANK YOU!!!
@avb_fj · 15 days ago
Thanks! Super appreciate the kind words. The title gimmick was something I relied on to give the viewer something easy to grab onto and fall back on, as the concepts themselves were quite diverse and complex. It gave me a sense of progress when I was scripting the video, and that probably reflected in your viewing experience as well. 😊
@npc4416 · 15 days ago
@@avb_fj yessss it makes the complex paper much easier to digest by turning it into a simple sentence to understand! love it! keep up the good work!!!
@voncolborn9437 · 19 days ago
Oh man, this video is going to require several epochs to digest. :-)
@avb_fj · 19 days ago
Totally understandable. This video took me about 8 days to make! It was an overwhelming experience at the start for me too. Feel free to watch at your own pace!
@hughmanwho · 18 days ago
Sounds like someone who has been fine-tuning or training an AI recently... Love it!
@pakawatnakwijit2294 · 15 days ago
Such a great explanation!! Your approach of explaining the paper through its title is so inspiring! This paper has been on my reading list since it was published. Now I can finally put it on my finished list.
@raman_scifi2sci · 14 days ago
My god! The quality! You have created a masterpiece, respect. AMAZING video, thanks a lot.
@Whysicist · 19 days ago
The definition of entropy is that it is proportional to the natural log of the number of accessible states.
@uis246 · 17 days ago
*states with equal probabilities (a uniform distribution)
@npc4416 · 11 days ago
so basically the more the model is confused about the next token prediction, the higher the entropy, so it gets broken down into finer and finer elements until it gets it?
@uis246 · 10 days ago
@@npc4416 yes.
@npc4416 · 9 days ago
@@uis246 nice
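[Editor's illustrative aside on the entropy discussion above: a minimal sketch of entropy-driven patching in Python. Assumptions: `next_byte_probs` is a stand-in for the small byte-level language model the paper trains separately, and the threshold value is made up; this toy version only shows the simple global-threshold rule, not the paper's other boundary variants. The entropy used is the Shannon entropy H = -Σ p ln p, which reduces to ln(number of states) only when all states are equally likely, which is exactly the caveat raised in the reply above.]

```python
import math
from typing import Callable, List, Sequence

def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats: H = -sum(p * ln p).
    Equals ln(number of states) only for a uniform distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_patch(byte_seq: bytes,
                  next_byte_probs: Callable[[bytes], Sequence[float]],
                  threshold: float = 1.5) -> List[bytes]:
    """Greedy entropy-based patching: start a new patch whenever the model's
    uncertainty about the upcoming byte crosses the (illustrative) threshold."""
    patches, current = [], bytearray()
    for i, b in enumerate(byte_seq):
        h = shannon_entropy(next_byte_probs(byte_seq[:i]))  # uncertainty given the prefix
        if current and h > threshold:
            patches.append(bytes(current))  # high entropy -> close the current patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```

So "more confusion = higher entropy = a patch boundary": uncertain regions end up covered by many short patches, while predictable stretches get grouped into long ones.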
@MustiKadhim · 17 days ago
This was so nicely explained mate, keep it up!
@avb_fj · 16 days ago
Thanks! Glad it was helpful.
@theodoreshachtman9990 · 19 days ago
This was an incredible video, thank you so much! Just subscribed 😄
@avb_fj · 19 days ago
Thanks and welcome to the channel! 🙏🏼
@saveerjain6833 · 20 days ago
Best NLP channel ever plz never stop
@AlbertMunda · 17 days ago
Great work. With the detailed explanation, it becomes easy to get through the paper. :)
@Kim-uu8fc · 17 days ago
Great presentation, and explained very well
@ethanotto4717 · 15 days ago
You are perhaps the best AI youtuber who covers the details of papers. Your clarity*detail factor is unlike anything I've seen. Keep up this excellent work.
@avb_fj · 15 days ago
Wow thanks a lot. One of the best compliments I have received on this channel.
@abomb1125 · 16 days ago
Fascinating and well showcased. You explain the most relevant parts while handholding the key understandings needed to grasp the deeper insights this paper presents. I reviewed the paper briefly, and while very promising, it adds even more research openings for diving into emergent properties. A further thought on this is the idea of adding MLPs or sparse autoencoders inside these smaller local encoders. Learning insights from the features in this local transformer could be a gateway to uncovering cross-modality. Food for thought perhaps.
@avb_fj · 16 days ago
Very true. My favorite papers are ones that open up whole new avenues for research. The idea of Sparse Autoencoders and perhaps some form of Dictionary Learning indeed seems like one of those directions… I feel someone in the world must already be working on that idea. There's the fine four-way balance between generality, inductive bias, scalability, and performance… so it'll be interesting to see if adding more inductive biases (e.g. sparse autoencoders) into the transformer architecture is able to challenge the status quo. Super exciting field for sure, and your comment has some excellent insights.
@abomb1125 · 16 days ago
@ From the looks of it, these local encoders operate on a much lower-dimensional space than the traditional architecture. My intuition tells me that adding additional complexity won't affect FLOPs much relative to the massive latent transformer. But patching these byte-level tokens decreases FLOPs by a certain degree, and I can't remember the actual metric deltas without referencing the paper. Perhaps the "minute" performance gained ("better than tokens") would be lost with the additional complexity.
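[Editor's illustrative aside on the sparse-autoencoder idea in the exchange above: a minimal sketch of what a standard SAE probing the local encoder's hidden states could look like. Everything here (dimensions, where it hooks in, the loss coefficient) is a hypothetical assumption for illustration, not anything from the BLT paper.]

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder for probing the hidden states of a (hypothetical)
    local byte encoder. d_model and d_dict are illustrative sizes."""
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete feature dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))  # sparse, non-negative feature activations
        h_hat = self.decoder(z)          # reconstruction of the original hidden state
        return h_hat, z

def sae_loss(h: torch.Tensor, h_hat: torch.Tensor, z: torch.Tensor,
             l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 penalty that encourages sparse activations.
    return ((h - h_hat) ** 2).mean() + l1_coeff * z.abs().mean()
```

Whether the extra FLOPs matter is exactly the point raised above: the SAE lives in the small local encoder's dimension, so its cost is tiny next to the latent transformer, but whether it preserves the byte-level gains is an open empirical question.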
@akarshrastogi3682 · 18 days ago
High-quality video - thorough, well explained, and appropriately paced and structured. Would love to see you go over the breadth and depth of many other subfields of important research (historical and current). A topic of interest may be RLHF/DPO/KTO/ORPO etc. and other alignment methods. One small but necessary improvement: the noise of car and bike horns in the background. It gets really annoying and distracting, e.g. at 25:28
@avb_fj · 18 days ago
Thanks for the suggestions! That’s a great list of topics to aim for in the new year. Thanks for pointing out the background noise, I’ll keep this in mind going forward! 😊
@hrmanager6883 · 18 days ago
Very impressive explanation, you are doing great things for the learning community, prayers for you to keep going, and a BIG thank you ❤
@avb_fj · 18 days ago
Thanks! Very kind words! 🙏🏼
@spedroxsac403 · 19 days ago
I just wanted to know this and it's here... thanks bro
@ernestosantiesteban6333 · 18 days ago
Wow, really amazing video man!
@pabloescobar2738 · 19 days ago
Thanks for the audio 😊, thanks for the work
@avb_fj · 19 days ago
Thanks for your comment! Out of curiosity, which language did you watch the video in?
@pabloescobar2738 · 19 days ago
@avb_fj Spanish, but I understand English.
@pabloescobar2738 · 19 days ago
@@avb_fj Spain, Europe. But I understand English.
@avb_fj · 19 days ago
@ awesome to know! Glad the Spanish audio was working. Cheers and happy holidays!
@kiffeeify · 18 days ago
Really nice explanation :) I skimmed through the paper a couple of days ago, but your video really helped me structure my understanding of what's going on there. One question: what do you think - would it be possible to add extra "deeper nested" latent transformers, which operate on even shorter sequences of patches of patches? I am thinking in the direction of a model that "thinks in sentences, paragraphs or chapters" at the innermost latent level, whereas the less deeply nested latent levels generate the details at the relevant "zoom level". Curious to get some thoughts on that - I can't imagine the original authors haven't thought about that :D
@avb_fj · 17 days ago
Very interesting idea indeed! A hierarchical-type model, where at the inner layers you have more and more abstract/global-level attention. It's a pretty neat idea for sure, as long as the way "patches of patches" are obtained is general enough. In BLT, the patching is obtained from the entropy model, which is trained separately on next-byte prediction, but this won't be possible to do for grouping the "latent" patches. Generality is one of the strong suits of transformers, so when we introduce any inductive biases/constraints like these (explicit hierarchical modelling where we are forcing the model to "learn the data in a certain human way"), there is always a chance that the model loses some generality in favour of the additional constraints. So yeah, one will have to experiment, try all the benchmarks, compare the scaling tendencies, and if it works out, it sounds like a hell of a paper!
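[Editor's illustrative aside on the "patches of patches" idea above: a tiny, purely hypothetical sketch of the recursive grouping structure being discussed. The grouper itself is generic; the open question flagged in the reply is what boundary signal would exist at the second level, since the byte-level entropy model does not directly apply to latent patch sequences.]

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def group_by_boundary(items: Sequence[T],
                      is_boundary: Callable[[Sequence[T], int], bool]) -> List[List[T]]:
    """One level of grouping: start a new group whenever is_boundary fires at index i."""
    groups: List[List[T]] = []
    for i in range(len(items)):
        if not groups or is_boundary(items, i):
            groups.append([])
        groups[-1].append(items[i])
    return groups

# Level 1: bytes -> patches, with the boundary signal driven by next-byte entropy (as in BLT).
# Level 2 (hypothetical): patches -> "patches of patches"; what plays the role of
# is_boundary here is exactly the unsolved part of the idea.
```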
@al-ekramelaheehridoy7297 · 11 days ago
Great explanation! Thanks
@fuhodev9548 · 17 days ago
thank you so much! that made the paper much easier to understand! Hope you can make a video about LCM (Large Concept Models)
@avb_fj · 17 days ago
Awesome! I should probably do a dedicated LCM video, but fwiw my next video will be about the “best of AI research” from 2024, and LCM will absolutely be highlighted there!
@devbites77 · 10 days ago
Really fascinating and informative!!😊 I wonder when this will be fully implemented and in what systems.
@tornyu · 9 days ago
I've been playing around with visualising entropy, and I recently noticed the pattern of high entropy at the first characters of words. I was starting to think about how you could use it to avoid tokenisation, but it would have taken me a long time to come up with this architecture!
@sheldonsebastian7232 · 19 days ago
Wow this was well explained!
@avb_fj · 19 days ago
I’m glad you think so! :)
@123456crapface · 18 days ago
Just found your channel and immediately subscribed. You break it down very well. Thank you! Any chance you can make a video on LCM?
@avb_fj · 18 days ago
Thanks! Welcome to the channel. Yeah LCM has been on my mind too, but it’s been getting pushed away by other projects. I’ll take your comment as a sign and try to work on it next month.
@phanquochung3924 · 18 days ago
keep these great explainers coming
@carlkim2577 · 18 days ago
Great video! I would appreciate a summary section explaining what you think are the long term implications and viability of new innovations. I lack the domain expertise to do so.
@avb_fj · 18 days ago
Thanks. You are correct. I’ll keep this in mind going forward!
@tornyu · 9 days ago
I wonder if this makes interpretability harder, because it further separates the latent space from the input embeddings? If the model is smaller, it might be easier. But is the latent transformer smaller in general than a token transformer, or is it just that it can scale better to longer sequences?
@hhk5724 · 17 days ago
Hi, what do you use for animation/presentation?
@avb_fj · 16 days ago
If you are interested in the final animations in this video or any other, I share them on my Patreon. If you are interested in how I produce them, I have my own Python libraries built on top of Manim. In the past I've used matplotlib animations too. It's a bunch of code that only I understand, so I haven't shared it anywhere. Some parts of the video are just PowerPoint and a video editing software (I use DaVinci Resolve). In the past I have also used vector graphics software like the free tier of Cavalry. Some creators use the Adobe suite (Photoshop, After Effects) but I don't like to pay a monthly subscription fee so I have never considered them.
@hhk5724 · 16 days ago
@@avb_fj Thnx! it looks cool
@Alice_Fumo · 18 days ago
Such an excellent video. I'm not sure how well this actually works multimodally, as distances in different modalities work differently. Text is one-dimensional, but images are two-dimensional and video is three-dimensional. The encoder, if just being fed the byte string, could never group this into semantically meaningful patches for more than one-dimensional data. Or at least, that's what my intuition tells me. Multimodality would probably work to some extent regardless, but likely not ideally.
@avb_fj · 18 days ago
This is a very fair intuition. Guess we will see what researchers come up with to solve this issue. In vision transformers, people create 16x16 patches and feed them in as a flat sequence. Maybe someone will figure out a cool way to use clever 2D rotary positional embeddings on the byte stream and just train the same BLT model on a massive amount of data. Very interesting avenue of research to keep an eye on for sure.
@Alice_Fumo · 18 days ago
@@avb_fj I'm currently in the middle of thinking about ways to solve this. I think it would be entirely viable to make something which divides an image into variable-sized patches and feeds them into the same model, but I'm less certain about the viability of decoding. Using different encoders for different modalities dynamically seems simple enough, but I'm not sure how one would know to switch out the output decoder, or how to decode multi-dimensional patches of uneven shape to begin with. For images in general, it seems like one would want to generate all the patches first and then decode them. It is truly nontrivial, but I really think if done well, it would be the best approach yet. Also, I guess diffusion models in general are just patch-to-image decoders. I think they use cross-attention, too. So yeah, I suppose it's just a question of knowing when the model wants to use a different modality.
@avb_fj · 18 days ago
Great insights. I agree it should be possible to feed the image into the model too. Whether the initial input is patches or a sequence of bytes remains to be seen. Going with the byte-only approach, I do feel that the ultimate general version would probably learn the "variable patching" on its own directly from the byte stream with a similar entropy model. The decoding part is definitely more challenging, both from a design and an engineering POV. One of the leading approaches (like Gemini) trains a VQ-VAE style architecture and uses a special decoder token to switch between the LLM unembedding layer for language generation… and the VQ-VAE decoder for image generation. Diffusion is interesting but it'll be prohibitively slow due to the iterative process, especially for real-time chat applications. But yeah, you are right about that: conditional diffusion models do use cross-attention between text embeddings obtained from CLIP models and image embeddings.
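[Editor's illustrative aside on the ViT-style patching mentioned in the exchange above: a minimal sketch, in plain PyTorch with illustrative sizes, of the fixed 16x16 grid patching that flattens an image into a 1D sequence. This is the baseline that an entropy-driven, variable-size byte patcher for images would presumably replace or generalize.]

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Flatten an image into a sequence of PxP patches so a sequence model can consume it.
    img: (C, H, W) with H and W divisible by `patch`. Returns (num_patches, C*patch*patch)."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    x = x.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return x

# Example: a 3x224x224 image becomes a flat sequence of 196 patch vectors of size 768.
seq = image_to_patch_sequence(torch.randn(3, 224, 224))
print(seq.shape)  # torch.Size([196, 768])
```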
@timmygilbert4102 · 17 days ago
That's great, I have been predicting that we would eventually do Huffman-like encoding in some way, because that makes sense. I posit we can probably generalize it to the whole architecture, because a DAG with layers implies redundancy, which might reduce network size to close to log n. It's a shame I'm not in a position rn to research this more.
@avb_fj · 17 days ago
I understand the fomo of being involved in research for a while and then having to move away into other things. I feel the pain too. Hope you get your fair chances and grab the ones you get! This is a very exciting research topic, and as a student of the field, I hope the next big breakthrough in intelligence arrives from some novel/innovative research idea instead of “training 10x more parameters”.
@timmygilbert4102 · 17 days ago
@@avb_fj any new novel idea will push the frontier, which then gets done 10x bigger by brute force - brute force always wins, or more like it stands on the shoulders of giant insight.
@avb_fj · 17 days ago
@ true 😊
@mickolesmana5899 · 18 days ago
funnily enough, this architecture actually uses a "true" encoder and decoder, where it transforms the input into a latent
@avb_fj · 16 days ago
Yeah it gets mentioned in the video in the “Latent” section!
@mickolesmana5899 · 16 days ago
@@avb_fj oh my bad, I mean compared to the normal transformer, where the encoder/decoder doesn't actually compress tokens into a latent. Even though an encoder/decoder architecture will generally project data into a latent space, this time it actually does.
@avb_fj · 16 days ago
@ aha yeah I get it now. The usual transformer encoder-decoder acts as in the old seq2seq paradigm, but there's no compression coz every decoder token attends over every encoder token anyway. BLT does do some form of temporal compression/clustering with the patching stuff. Nice observation.
@AIShipped · 17 days ago
Great video!
@K.Solowoniuk · 17 days ago
Nice video. The background music is horrifying!
@avb_fj · 17 days ago
Oh I was going for a soothing vibe with this music. Guess certain earphones and audio settings may accentuate certain frequencies that make the track sound sinister. Still, I hope it didn't impede your overall viewing experience too much or distract you from the material in the video. If it did, do let me know coz then I gotta find a better track for next videos.
@locusruizlopez5997 · 18 days ago
Pfffv the Spanish audio translated "bytes" as "mordidas" (bites)
@avb_fj · 17 days ago
Oh wow. Yeah this is largely on YouTube's automatic translation algorithms - but I will have a look to see if I can fix it somehow! Still, it must have been super funny to constantly hear "Bite Latent Transformers".