WOAHHHH this is incredible, such a great breakdown of the paper just from the title alone! this is the underrated high-quality expert content i was yearning for on youtube, now i feel like i can finally begin to understand these complex papers... THANK YOU!!!
@avb_fj · 15 days ago
Thanks! Super appreciate the kind words. The title gimmick was something I relied on to give the viewer something easy to grab onto and fall back on, as the concepts themselves were quite diverse and complex. It gave me a sense of progress when I was scripting the video, and that probably reflected in your viewing experience as well. 😊
@npc4416 · 15 days ago
@@avb_fj yessss it makes the complex paper much easier to digest by turning it into a simple sentence to understand! love it! keep up the good work!!!
@voncolborn9437 · 19 days ago
Oh man, this video is going to require several epochs to digest. :-)
@avb_fj · 19 days ago
Totally understandable. This video took me about 8 days to make! It was an overwhelming experience at the start for me too. Feel free to watch at your own pace!
@hughmanwho · 18 days ago
Sounds like someone who has been fine-tuning or training an AI recently... Love it!
@pakawatnakwijit2294 · 15 days ago
Such a great explanation!! Your approach of explaining the paper through its title is so inspiring! This paper has been on my reading list since it was published. Now I can finally put it on my finished list.
@raman_scifi2sci · 14 days ago
My god! The quality! You have created a masterpiece, respect. AMAZING video, thanks a lot.
@Whysicist · 19 days ago
The definition of entropy is that it is proportional to the natural log of the number of accessible states.
@uis246 · 17 days ago
*states with equal probabilities (a uniform distribution)
@npc4416 · 11 days ago
so basically the more the model is confused about the next token prediction, the higher the entropy, so it gets broken down into finer and finer elements until it gets it?
@uis246 · 10 days ago
@@npc4416 yes.
@npc4416 · 9 days ago
@@uis246 nice
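[Editor's illustrative aside on the entropy discussion above: a minimal sketch of entropy-driven patching in Python. Assumptions: `next_byte_probs` is a stand-in for the small byte-level language model the paper trains separately, and the threshold value is made up; this toy version only shows the simple global-threshold rule, not the paper's other boundary variants. The entropy used is the Shannon entropy H = -Σ p ln p, which reduces to ln(number of states) only when all states are equally likely, which is exactly the caveat raised in the reply above.]

```python
import math
from typing import Callable, List, Sequence

def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats: H = -sum(p * ln p).
    Equals ln(number of states) only for a uniform distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_patch(byte_seq: bytes,
                  next_byte_probs: Callable[[bytes], Sequence[float]],
                  threshold: float = 1.5) -> List[bytes]:
    """Greedy entropy-based patching: start a new patch whenever the model's
    uncertainty about the upcoming byte crosses the (illustrative) threshold."""
    patches, current = [], bytearray()
    for i, b in enumerate(byte_seq):
        h = shannon_entropy(next_byte_probs(byte_seq[:i]))  # uncertainty given the prefix
        if current and h > threshold:
            patches.append(bytes(current))  # high entropy -> close the current patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```

So "more confusion = higher entropy = a patch boundary": uncertain regions end up covered by many short patches, while predictable stretches get grouped into long ones.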
@MustiKadhim · 17 days ago
This was so nicely explained mate, keep it up!
@avb_fj · 16 days ago
Thanks! Glad it was helpful.
@theodoreshachtman9990 · 19 days ago
This was an incredible video, thank you so much! Just subscribed 😄
@avb_fj · 19 days ago
Thanks and welcome to the channel! 🙏🏼
@saveerjain6833 · 20 days ago
Best NLP channel ever plz never stop
@AlbertMunda · 17 days ago
Great work. With the detailed explanation, it becomes easy to get through the paper. :)
@Kim-uu8fc · 17 days ago
Great presentation, and explained very well
@ethanotto4717 · 15 days ago
You are perhaps the best AI youtuber who covers the details of papers. Your clarity*detail factor is unlike anything I've seen. Keep up this excellent work.
@avb_fj · 15 days ago
Wow thanks a lot. One of the best compliments I have received on this channel.
@abomb1125 · 16 days ago
Fascinating and well showcased. You explain the most relevant parts while handholding the key understandings needed to grasp the deeper insights this paper presents. I reviewed the paper briefly, and while very promising, it adds even more research openings for diving into emergent properties. A further thought on this is the idea of adding MLPs or sparse autoencoders inside these smaller local encoders. Learning insights from the features in this local transformer could be a gateway to uncovering cross-modality. Food for thought perhaps.
@avb_fj · 16 days ago
Very true. My favorite papers are ones that open up whole new avenues for research. The idea of Sparse Autoencoders and perhaps some form of Dictionary Learning indeed seems like one of those directions… I feel someone in the world must already be working on that idea. There's the fine four-way balance between generality, inductive bias, scalability, and performance… so it'll be interesting to see if adding more inductive biases (e.g. sparse autoencoders) into the transformer architecture is able to challenge the status quo. Super exciting field for sure, and your comment has some excellent insights.
@abomb1125 · 16 days ago
@ From the looks of it, these local encoders operate on a much lower-dimensional space than the traditional architecture. My intuition tells me that adding additional complexity won't affect FLOPs much relative to the massive latent transformer. But patching these byte-level tokens decreases FLOPs by a certain degree, and I can't remember the actual metric deltas without referencing the paper. Perhaps the "minute" performance gained ("better than tokens") would be lost with the additional complexity.
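[Editor's illustrative aside on the sparse-autoencoder idea in the exchange above: a minimal sketch of what a standard SAE probing the local encoder's hidden states could look like. Everything here (dimensions, where it hooks in, the loss coefficient) is a hypothetical assumption for illustration, not anything from the BLT paper.]

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder for probing the hidden states of a (hypothetical)
    local byte encoder. d_model and d_dict are illustrative sizes."""
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # overcomplete feature dictionary
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))  # sparse, non-negative feature activations
        h_hat = self.decoder(z)          # reconstruction of the original hidden state
        return h_hat, z

def sae_loss(h: torch.Tensor, h_hat: torch.Tensor, z: torch.Tensor,
             l1_coeff: float = 1e-3) -> torch.Tensor:
    # Reconstruction error plus an L1 penalty that encourages sparse activations.
    return ((h - h_hat) ** 2).mean() + l1_coeff * z.abs().mean()
```

Whether the extra FLOPs matter is exactly the point raised above: the SAE lives in the small local encoder's dimension, so its cost is tiny next to the latent transformer, but whether it preserves the byte-level gains is an open empirical question.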
@akarshrastogi3682 · 18 days ago
High-quality video - thorough, well explained, and appropriately paced and structured. Would love to see you go over the breadth and depth of many other subfields of important research (historical and current). A topic of interest may be RLHF/DPO/KTO/ORPO etc. and other alignment methods. One small but necessary improvement: the noise of car and bike horns in the background. It gets really annoying and distracting, e.g. at 25:28
@avb_fj · 18 days ago
Thanks for the suggestions! That’s a great list of topics to aim for in the new year. Thanks for pointing out the background noise, I’ll keep this in mind going forward! 😊
@hrmanager6883 · 18 days ago
Very impressive explanation, you are doing great things for the learning community, prayers for you to keep going, and a BIG thank you ❤
@avb_fj · 18 days ago
Thanks! Very kind words! 🙏🏼
@spedroxsac403 · 19 days ago
I just wanted to know this and it's here... thanks bro
@ernestosantiesteban6333 · 18 days ago
Wow, really amazing video man!
@pabloescobar2738 · 19 days ago
Thanks for the audio 😊, thanks for the work
@avb_fj · 19 days ago
Thanks for your comment! Out of curiosity, which language did you watch the video in?
@pabloescobar2738 · 19 days ago
@avb_fj Spanish, but I understand English.
@pabloescobar2738 · 19 days ago
@@avb_fj Spain, Europe. But I understand English.
@avb_fj · 19 days ago
@ awesome to know! Glad the Spanish audio was working. Cheers and happy holidays!
@kiffeeify · 18 days ago
Really nice explanation :) I skimmed through the paper a couple of days ago, but your video really helped me structure my understanding of what's going on there. One question: what do you think - would it be possible to add extra "deeper nested" latent transformers, which operate on even shorter sequences of patches of patches? I am thinking in the direction of a model that "thinks in sentences, paragraphs or chapters" at the innermost latent level, whereas the less deeply nested latent levels generate the details at the relevant "zoom level". Curious to get some thoughts on that - I can't imagine the original authors haven't thought about that :D
@avb_fj · 17 days ago
Very interesting idea indeed! A hierarchical-type model, where at the inner layers you have more and more abstract/global-level attention. It's a pretty neat idea for sure, as long as the way "patches of patches" are obtained is general enough. In BLT, the patching is obtained from the entropy model, which is trained separately on next-byte prediction, but this won't be possible to do for grouping the "latent" patches. Generality is one of the strong suits of transformers, so when we introduce any inductive biases/constraints like these (explicit hierarchical modelling where we are forcing the model to "learn the data in a certain human way"), there is always a chance that the model loses some generality in favour of the additional constraints. So yeah, one will have to experiment, try all the benchmarks, compare the scaling tendencies, and if it works out, it sounds like a hell of a paper!
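[Editor's illustrative aside on the "patches of patches" idea above: a tiny, purely hypothetical sketch of the recursive grouping structure being discussed. The grouper itself is generic; the open question flagged in the reply is what boundary signal would exist at the second level, since the byte-level entropy model does not directly apply to latent patch sequences.]

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def group_by_boundary(items: Sequence[T],
                      is_boundary: Callable[[Sequence[T], int], bool]) -> List[List[T]]:
    """One level of grouping: start a new group whenever is_boundary fires at index i."""
    groups: List[List[T]] = []
    for i in range(len(items)):
        if not groups or is_boundary(items, i):
            groups.append([])
        groups[-1].append(items[i])
    return groups

# Level 1: bytes -> patches, with the boundary signal driven by next-byte entropy (as in BLT).
# Level 2 (hypothetical): patches -> "patches of patches"; what plays the role of
# is_boundary here is exactly the unsolved part of the idea.
```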
@al-ekramelaheehridoy7297 · 11 days ago
Great explanation! Thanks
@fuhodev9548 · 17 days ago
thank you so much! that made the paper much easier to understand! Hope you can make a video about LCM (Large Concept Models)
@avb_fj · 17 days ago
Awesome! I should probably do a dedicated LCM video, but fwiw my next video will be about the “best of AI research” from 2024, and LCM will absolutely be highlighted there!
@devbites77 · 10 days ago
Really fascinating and informative!!😊 I wonder when this will be fully implemented and in what systems.
@tornyu · 9 days ago
I've been playing around with visualising entropy, and I recently noticed the pattern of high entropy at the first characters of words. I was starting to think about how you could use it to avoid tokenisation, but it would have taken me a long time to come up with this architecture!
@sheldonsebastian7232 · 19 days ago
Wow this was well explained!
@avb_fj · 19 days ago
I’m glad you think so! :)
@123456crapface · 18 days ago
Just found your channel and immediately subscribed. You break it down very well. Thank you! Any chance you can make a video on LCM?
@avb_fj · 18 days ago
Thanks! Welcome to the channel. Yeah LCM has been on my mind too, but it’s been getting pushed away by other projects. I’ll take your comment as a sign and try to work on it next month.
@phanquochung3924 · 18 days ago
keep these great explainers coming
@carlkim2577 · 18 days ago
Great video! I would appreciate a summary section explaining what you think are the long term implications and viability of new innovations. I lack the domain expertise to do so.
@avb_fj · 18 days ago
Thanks. You are correct. I’ll keep this in mind going forward!
@tornyu · 9 days ago
I wonder if this makes interpretability harder, because it further separates the latent space from the input embeddings? If the model is smaller, it might be easier. But is the latent transformer smaller in general than a token transformer, or is it just that it can scale better to longer sequences?
@hhk5724 · 17 days ago
Hi, what do you use for animation/presentation?
@avb_fj · 16 days ago
If you are interested in the final animations in this video or any other, I share them on my Patreon. If you are interested in how I produce them, I have my own Python libraries built on top of Manim. In the past I've used matplotlib animations too. It's a bunch of code that only I understand, so I haven't shared it anywhere. Some parts of the video are just PowerPoint and a video editing software (I use DaVinci Resolve). In the past I have also used vector graphics software like the free tier of Cavalry. Some creators use the Adobe suite (Photoshop, After Effects) but I don't like to pay a monthly subscription fee so I have never considered them.
@hhk5724 · 16 days ago
@@avb_fj Thnx! it looks cool
@Alice_Fumo · 18 days ago
Such an excellent video. I'm not sure how well this actually works multimodally, as distances in different modalities work differently. Text is one-dimensional, but images are two-dimensional and video is three-dimensional. The encoder, if just being fed the byte string, could never group this into semantically meaningful patches for more than one-dimensional data. Or at least, that's what my intuition tells me. Multimodality would probably work to some extent regardless, but likely not ideally.
@avb_fj · 18 days ago
This is a very fair intuition. Guess we will see what researchers come up with to solve this issue. In vision transformers, people create 16x16 patches and feed them in as a flat sequence. Maybe someone will figure out a cool way to use clever 2D rotary positional embeddings on the byte stream and just train the same BLT model on a massive amount of data. Very interesting avenue of research to keep an eye on for sure.
@Alice_Fumo · 18 days ago
@@avb_fj I'm currently in the middle of thinking about ways to solve this. I think it would be entirely viable to make something which divides an image into variable-sized patches and feeds them into the same model, but I'm less certain about the viability of decoding. Using different encoders for different modalities dynamically seems simple enough, but I'm not sure how one would know to switch out the output decoder, or how to decode multi-dimensional patches of uneven shape to begin with. For images in general, it seems like one would want to generate all the patches first and then decode them. It is truly nontrivial, but I really think if done well, it would be the best approach yet. Also, I guess diffusion models in general are just patch-to-image decoders. I think they use cross-attention, too. So yeah, I suppose it's just a question of knowing when the model wants to use a different modality.
@avb_fj · 18 days ago
Great insights. I agree it should be possible to feed the image into the model too. Whether the initial input is patches or a sequence of bytes remains to be seen. Going with the byte-only approach, I do feel that the ultimate general version would probably learn the "variable patching" on its own directly from the byte stream with a similar entropy model. The decoding part is definitely more challenging, both from a design and an engineering POV. One of the leading approaches (like Gemini) trains a VQ-VAE style architecture and uses a special decoder token to switch between the LLM unembedding layer for language generation… and the VQ-VAE decoder for image generation. Diffusion is interesting but it'll be prohibitively slow due to the iterative process, especially for real-time chat applications. But yeah, you are right about that: conditional diffusion models do use cross-attention between text embeddings obtained from CLIP models and image embeddings.
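[Editor's illustrative aside on the ViT-style patching mentioned in the exchange above: a minimal sketch, in plain PyTorch with illustrative sizes, of the fixed 16x16 grid patching that flattens an image into a 1D sequence. This is the baseline that an entropy-driven, variable-size byte patcher for images would presumably replace or generalize.]

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Flatten an image into a sequence of PxP patches so a sequence model can consume it.
    img: (C, H, W) with H and W divisible by `patch`. Returns (num_patches, C*patch*patch)."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    x = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    x = x.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return x

# Example: a 3x224x224 image becomes a flat sequence of 196 patch vectors of size 768.
seq = image_to_patch_sequence(torch.randn(3, 224, 224))
print(seq.shape)  # torch.Size([196, 768])
```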
@timmygilbert4102 · 17 days ago
That's great, I have been predicting that we would eventually do Huffman-like encoding in some way, because that makes sense. I posit we can probably generalize it to the whole architecture, because a DAG with layers implies redundancy, which might reduce network size to close to log n. It's a shame I'm not in a position rn to research this more.
@avb_fj · 17 days ago
I understand the fomo of being involved in research for a while and then having to move away into other things. I feel the pain too. Hope you get your fair chances and grab the ones you get! This is a very exciting research topic, and as a student of the field, I hope the next big breakthrough in intelligence arrives from some novel/innovative research idea instead of “training 10x more parameters”.
@timmygilbert4102 · 17 days ago
@@avb_fj any new novel idea will push the frontier, which then gets done 10x bigger by brute force - brute force always wins, or more like it stands on the shoulders of giant insight.
@avb_fj · 17 days ago
@ true 😊
@mickolesmana5899 · 18 days ago
funnily enough, this architecture actually uses a "true" encoder and decoder, where it transforms the input into a latent
@avb_fj · 16 days ago
Yeah it gets mentioned in the video in the “Latent” section!
@mickolesmana5899 · 16 days ago
@@avb_fj oh my bad, I mean compared to the normal transformer, where the encoder/decoder doesn't actually compress tokens into a latent. Even though an encoder/decoder architecture will generally project data into a latent space, this time it actually does.
@avb_fj · 16 days ago
@ aha yeah I get it now. The usual transformer encoder-decoder acts as in the old seq2seq paradigm, but there's no compression coz every decoder token attends over every encoder token anyway. BLT does do some form of temporal compression/clustering with the patching stuff. Nice observation.
@AIShipped · 17 days ago
Great video!
@K.Solowoniuk · 17 days ago
Nice video. The background music is horrifying!
@avb_fj · 17 days ago
Oh I was going for a soothing vibe with this music. Guess certain earphones and audio settings may accentuate certain frequencies that make the track sound sinister. Still, I hope it didn't impede your overall viewing experience too much or distract you from the material in the video. If it did, do let me know coz then I gotta find a better track for next videos.
@locusruizlopez5997 · 18 days ago
Pfffv the Spanish audio translated "bytes" as "mordidas" (bites)
@avb_fj · 17 days ago
Oh wow. Yeah this is largely on YouTube's automatic translation algorithms - but I will have a look to see if I can fix it somehow! Still, it must have been super funny to constantly hear "Bite Latent Transformers".