For those curious where the 1.58 comes from: it's log(3)/log(2) = 1.5849625. Basically, if you have a long sequence of random three-state values, you can represent it with no fewer than 1.58 bits per three-state value.
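For illustration, a minimal Python sketch of that figure (my own, not from the video). It just evaluates log2(3) and shows one concrete packing that gets close to the bound:

```python
import math

# One ternary value (a "trit") carries log2(3) bits of information.
bits_per_trit = math.log2(3)
print(bits_per_trit)   # 1.5849625007211562

# Concretely: 5 trits have 3**5 = 243 combinations, which fit in one
# byte (2**8 = 256), i.e. 8/5 = 1.6 bits per trit, close to the bound.
print(3**5, 2**8)      # 243 256
```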
@wwkk49649 ай бұрын
Thanks! I thought it was the square root of 5 halves.
@swarnavasamanta26289 ай бұрын
Actually you need to explain a bit more why that is. With a simple binary encoding you would use 1 bit for two of the states and 2 bits for the third, so a random sequence of the three values averages somewhere between 1 and 2 bits per value. Edit: If you are curious why it's exactly log 3 / log 2: the number of bits required to distinguish n equally likely states is log base 2 of n. Since we are using 3 states, or 3 unique digits, that's log_2(3) = log 3 / log 2 ≈ 1.58, and the ratio comes out the same whichever log base you use.
@jarrod7529 ай бұрын
@@swarnavasamanta2628 So if I understand you correctly, you have 1 bit to represent one number, and 2 bits each to represent the other two numbers? So for example, if you see a 1, you stop; but if you see a 0, you then check for 00 or 01? Giving you 3 options stored in a compact way?
@swarnavasamanta26289 ай бұрын
@@jarrod752 You have 2 bits to represent -1 and 1 bit to represent 0 or 1. So you need 2 bits to store whenever the weight is -1 and 1 bit to store the 0 and 1 weights
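A small sketch of the prefix code jarrod752 describes, in Python (which symbol gets the short code is my own arbitrary choice; note that a valid prefix code can give a 1-bit code to only one of the three symbols):

```python
import random, math

# One valid prefix code over {-1, 0, 1}: a single 1-bit code plus two 2-bit codes.
code = {1: "1", 0: "00", -1: "01"}

values = [random.choice([-1, 0, 1]) for _ in range(100_000)]
avg_bits = sum(len(code[v]) for v in values) / len(values)

print(avg_bits)       # about 1.67 bits per value with this simple code
print(math.log2(3))   # about 1.585 bits is the information-theoretic floor
```

So the simple code gets you to roughly 1.67 bits per value; the 1.58 figure is the theoretical floor you approach with denser packings (e.g. several trits per byte).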
@PaulSpades9 ай бұрын
I wish they'd recognise that they're talking about balanced ternary, instead of forcing a three-valued number into a binary format.
@gabrielsandstedt9 ай бұрын
Summary: The video introduces a groundbreaking advancement in the field of Large Language Models (LLMs) by presenting the concept of 1.58-bit LLMs, a significant departure from traditional 32-bit or 16-bit floating-point representations used in these models. This new approach utilizes ternary values (-1, 0, +1) for model parameters, drastically simplifying the computational operations needed to run these models. Traditionally, LLMs rely on complex matrix operations, which are computationally intensive and require high-performance hardware like GPUs, optimized for such tasks through technologies like NVIDIA's CUDA. However, the 1.58-bit LLMs leverage simple arithmetic operations, reducing the need for specialized hardware and potentially lowering energy consumption and operational costs. This method significantly cuts down computational complexity, allowing advanced models to run on less powerful hardware, even devices without GPUs. It suggests a shift towards more sustainable AI technology use, with reduced environmental impact due to lower energy requirements. Moreover, it opens up avenues for hardware innovation, with the potential development of new devices optimized for ternary computation rather than complex floating-point matrix calculations. This advancement is not just a technical feat; it represents a shift towards making high-performance AI more accessible and cost-effective, paving the way for future innovations in AI hardware and software. The 1.58-bit LLMs maintain performance levels comparable to traditional models while offering improvements in speed, memory usage, throughput, and energy efficiency. This development could redefine how LLMs are scaled and trained, offering a new paradigm for AI model deployment that is both high-performing and environmentally conscious.
@blacksage819 ай бұрын
I wonder how fast this method would run on a Groq chip? Strange times we live in.
@__________________________69109 ай бұрын
Which website you are using for YT video summarization
@gabrielsandstedt9 ай бұрын
@@__________________________6910 I just fed the entire YouTube transcript into GPT-4.5 and told it to summarize it.
@Custodian1239 ай бұрын
That's not a summary, that's a transcript 😂
@ysy699 ай бұрын
@@Custodian123 a good one 😛
@StevenAkinyemi9 ай бұрын
Bro. This is actually game changing! Why is this not a popular technique yet?
@StevenAkinyemi9 ай бұрын
Oh. It is new. Crazy stuff
@arnaudlelong23429 ай бұрын
arXiv papers are not peer-reviewed.
@bourdainedepiment39629 ай бұрын
Like all vaporware, because it doesn't work.
@MichaelBarry-gz9xl9 ай бұрын
The research only came out yesterday, and it's preprint
@jarrod7529 ай бұрын
I'm guessing as a research paper, it'll either be implemented into everything in 2 weeks or it won't work.
@Cat-vs7rc9 ай бұрын
Why not address the elephant in the room: a 1.58-bit model cannot store the same amount of information as a 16-bit one. So why is it performing at an equal level? Very fishy. Metrics gamed? Most probably.
@UnknownOrc9 ай бұрын
Good point. I have no idea what any of this means anyway.
@MichaelBarry-gz9xl9 ай бұрын
Density. Entropy. Compression. etc
@wenhanzhou58269 ай бұрын
I am also super skeptical about this. Some papers show that pruning the network may give a performance boost, but performance degrades when too many weights are set to zero. This is like pruning and quantization at the same time, which would likely only work in a very controlled manner; it sounds too good to be true in this paper.
@michalchik9 ай бұрын
It may be that our training algorithms cannot efficiently use the resolution of floating point: the size of floating-point numbers is way too large for the proper atomic unit of this kind of information. Think of it this way: after you get to TV screens with 16-bit color, do you actually gain much by switching to 32-bit color? You mostly get much bigger files and much slower processing.
@nikflix83319 ай бұрын
My thoughts exactly. I’d imagine these 1.58 bit models would be more prone to catastrophic forgetting as well. They can’t hold nearly as much information.
@JG27Korny9 ай бұрын
So basically the main idea is to get rid of multiplication: multiplying by 1 leaves the value unchanged, multiplying by -1 just flips the sign, and multiplying by zero gives zero. So 1 is obvious; 0 is interesting because it allows additional complexity to be encoded in the neural network. I expect they could rediscover some bitwise-hack techniques from the early days of 3D gaming, using bitwise operations instead of multiplication. That way you get the same efficiency of not doing multiplication, while the representational complexity goes up significantly. I do like the quantization, though, because it introduces some noise into the neural network, and by doing so you probably get some interesting results: the NN can outperform the original full-precision model on data the original was not trained on.
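A minimal sketch of the "no multiplication" idea with ternary weights (my own illustration, not the paper's kernel; the function name and shapes are made up):

```python
import numpy as np

def ternary_matvec(W, x):
    """y = W @ x for a ternary W: +1 entries add the input element,
    -1 entries subtract it, 0 entries are skipped. No multiplications."""
    y = np.zeros(W.shape[0], dtype=np.float64)
    for i, row in enumerate(W):
        y[i] = x[row == 1].sum() - x[row == -1].sum()
    return y

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # weights in {-1, 0, 1}
x = rng.standard_normal(8)             # full-precision activations
assert np.allclose(ternary_matvec(W, x), W @ x)
```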
@morososaas33979 ай бұрын
It's so hard to believe that 3 possible parameter values can capture the patterns and signal from the data as well as 16-bit floats. Mind-blowing stuff for sure.
@arkaprovobhattacharjee86919 ай бұрын
That's what I want to know: how are they doing it?
@ZuckFukerberg9 ай бұрын
@@arkaprovobhattacharjee8691 They have released the technical paper, so you can freely check it.
@ciarantaaffe41999 ай бұрын
If you understand Digital-to-Analog conversion techniques, it is trivial. In fact, you wouldn't need a negative one value either, but they have yet to figure that part out.
@eugenmalatov54709 ай бұрын
Exactly, a lot of information gets lost. So you would assume that the number of nodes (parameters) would need to go up.
@nguyenvanduy2479 ай бұрын
Maybe the trick is that with ternary values there's now a convenient way to represent sparsity (the zero value), which I've heard is pretty important in neural networks.
@whig019 ай бұрын
Bitwise Mamba + bitwise inference for the win.
@IAmCandal9 ай бұрын
I was thinking mamba.
@magicmarcell9 ай бұрын
+ bitmap which is irrelevant but f it were in a simulation
@minimal37349 ай бұрын
+ analog inference chips.
@siddharthagrawal83009 ай бұрын
@@minimal3734 why would you use analog inference chips for 1.58 bits 💀
@ytpah98239 ай бұрын
🎯 Key Takeaways for quick navigation:
00:00 🔄 The video introduces changes in the LLM (large language model) world, highlighting the transition from traditional deep learning models to ones that no longer require GPUs for high-performance matrix multiplication.
00:43 🔍 The discussion introduces the concept of 1bit LLMs, suggesting a move towards more efficient computational models that retain performance parity with current LLMs but at a reduced computational cost.
01:11 🧮 Explains the shift from 32-bit or 16-bit floating-point representations to a ternary system (using -1, 0, 1) for model parameters, significantly simplifying the computational process by eliminating the need for multiplication.
03:31 🆕 Introduces the "B 1.58 model," which uses ternary values instead of binary, enhancing learning capabilities and performance by incorporating a zero value alongside -1 and 1.
05:08 💡 Discusses the potential for new hardware development optimized for the ternary computation model, suggesting a significant shift away from GPU reliance and towards more specialized computing solutions.
06:02 🚀 Highlights the paper's assertion that the 1.58 bit LLM architecture offers comparable accuracy to traditional models while improving latency, memory efficiency, throughput, and energy consumption.
07:12 📈 Provides evidence of the new model's effectiveness through comparisons with the Llama LLM architecture, showing equal or better performance on various metrics, including perplexity and downstream task performance.
09:58 🎛️ Elaborates on the technical implementation of the 1.58 bit LLM, retaining the Transformer architecture but altering the numerical representation and computational approach within the model.
11:48 🌍 Suggests a significant impact on the scalability and application of LLMs across different hardware platforms, including mobile and edge computing, due to the reduced computational requirements.
13:11 📉 Concludes with the potential for dramatic improvements in hardware efficiency and cost reduction for deploying large-scale LLMs, due to the shift to a 1.58 bit computational model.
Made with HARPA AI
@zandrrlife9 ай бұрын
This isn't even a clickbait title. This is an improvement on the original BitNet paper, and it's actually a huge deal. The original method would reduce performance by almost half. We've been working with it for our own LM architecture, trying to figure out clever ways to mitigate these issues to a degree, which required a whole new activation function for sparse representations, increasing nonlinearity... blah blah lol. This improvement seems to retain the majority of that performance. Wow. Just going off the video screenshot; I have to dig into this paper now lol. Listen, we won't have GPT-4+ models locally without super efficient low-depth quantization, and this is more holistic since you have to pretrain. Wow. Also, generating weights directly is more feasible at this depth; we believe that conditional weight generation is going to forever change deep learning. I know I sound crazy, but even compute won't be an advantage soon. We are going to change that. I know I sound crazy... but remember I said this: within a year you won't have to pay hundreds of thousands or millions to pretrain a model; simply bring your dataset, model code, and hyperparameters. That's your prompt, by the way. This year is going to be insane for all.
@AbeDillon9 ай бұрын
I'm surprised it took so long to study this. Years ago, I realized you could model combinational logic as a "neural net". Weights would have to be +1, 0 (not connected), and -1 (negated). The gates would be like neurons with various activation functions. I never wrote about it because I figured surely someone in the field had already written a paper on the concept. It seems so obvious.
@AbeDillon9 ай бұрын
Since I missed the boat on this discovery, I might as well share some other insights I think are rather obvious, but haven't seen in literature (though I haven't looked very hard): 1) I bet the majority of training can be done at low precision too. Since weights start out highly randomized, they (almost by definition) don't carry much information. One should be able to train a 1.58-bit (trit) model until things settle, then add a bit of precision and continue training until the model parameters settle, then add another bit and so-on until you reach the desired performance. It makes little sense to train at high precision if you're going to throw away most of that precision when it comes time for inference anyway. I don't know how much precision is required for meta-parameters like momentum, but it shouldn't be that much more than the actual parameters themselves.
@AbeDillon9 ай бұрын
2) I think the field is missing a key fundamental element besides neurons and weights: delays. AFAIK, delays have yet to be explicitly modeled in ANNs. I think it would help at a theoretical level for understanding RNNs. I think we might gain a lot from modeling neuron inputs as adaptive FIR filters instead of single-weighted, single-valued signals. Digital Signal Processing engineers have a whole toolbox of techniques based on explicitly modeling delays.
@unclecode9 ай бұрын
When accuracy is so sensitive even to quantization, it's hard to grasp how this makes things better and faster! Every upgrade has a price tag, so what's the catch here? I'm curious to give their model a go, hopefully they add it to Hugging Face.
@d_b_9 ай бұрын
Seems huge! Can't wait to run 70B+ on my phone with millisecond response times
@footube39 ай бұрын
Assuming you have 13.8GB of free memory on your phone to run it of course!
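That 13.8 GB figure is roughly what 70B ternary weights alone would take; a back-of-the-envelope sketch (my own arithmetic, ignoring activations, KV cache and embeddings):

```python
params = 70e9              # a 70B-parameter model
bits_per_weight = 1.58     # ternary weights at their information-theoretic size

gigabytes = params * bits_per_weight / 8 / 1e9
print(round(gigabytes, 1))  # ~13.8 GB just for the weights
```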
@TragicGFuel9 ай бұрын
@@footube3 that's not a large amount
@Solo21219 ай бұрын
@@TragicGFuel He said memory not storage.
@irnehhenri9 ай бұрын
My Xiaomi Mi 10 Ultra has 16GB of RAM, of which 12GB is literally always free, because that's an insane amount for a phone... but there have been phones like that for years now!
@irnehhenri9 ай бұрын
I did actually try to run llama.cpp on my phone a while back - there was a project that compiled it natively for Android, but I couldn't get it to not crash. I could have tried compiling it myself, but I got bored and figured it would probably be way too slow with a phone CPU anyway
@i6od9 ай бұрын
i wish they released at least one model to test it out lol
@MichaelBarry-gz9xl9 ай бұрын
They are planning on releasing the models for research
@footube39 ай бұрын
This is huge! Thanks for bringing it to our attention ❤
@vincentvoillot63659 ай бұрын
Going from 16-bit float to 1-bit ternary brings back memories: from CD (16-bit int) to SACD (1-bit DSD/SDM)... Could a text-to-sound model with 1.58-bit quantization output DSD directly?
@ПавелКуликов-м9м9 ай бұрын
I still couldn't understand the main thing. In this work, was the original model trained in 16 bits and then quantized to -1, 0, 1, or did they train the model in -1, 0, 1 from the start and manage to do full backpropagation in that representation, etc.?
@JG27Korny9 ай бұрын
From what I understand you get the full benefits only if you train at 1.58 bits. If you do post-training quantization you get what you get: lower precision that scales down through 8-bit, 4-bit, 2-bit, 1-bit.
@noname767879 ай бұрын
When you backpropagate, you use the fp16 values as the reference but tell the ternary model to only use 1, 0, or -1 (paraphrasing someone else's comment from another video).
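A minimal PyTorch sketch of that idea: latent full-precision weights, ternary values in the forward pass, and gradients passed straight through to the latent weights (my own illustration of the common straight-through-estimator trick; the class name and initialization are assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class TernaryLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)  # latent fp weights

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)
        w_ternary = (w / scale).round().clamp(-1, 1)   # values in {-1, 0, 1}
        # Straight-through estimator: the forward pass uses the ternary weights,
        # the backward pass treats quantization as identity, so gradients
        # still update the latent full-precision weights.
        w_q = w + (w_ternary * scale - w).detach()
        return x @ w_q.t()

layer = TernaryLinear(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()             # gradients flow to layer.weight
print(layer.weight.grad.shape)   # torch.Size([4, 16])
```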
@otterhopper9 ай бұрын
This feels like the signal to sell NVDA while they're at their high, before the shift occurs where their GPUs are no longer essential for AI.
@MichaelBarry-gz9xl9 ай бұрын
NVIDIA is in an excellent position to optimize their GPU's for this. It will actually save them money. NVIDIA will be just as excited as this as we are. The good thing is that so too will RISC-V, Arm, Intel, AMD etc
@otterhopper9 ай бұрын
@@MichaelBarry-gz9xl Yeah, you could certainly be right. I wondered if that's how it will play out too. My gut says it'll be a matter of how adaptive Nvidia is and how quickly they can pivot from their existing momentum (which is substantial) in the whole CUDA stack with FP16-style matrix math to custom circuits (ASICs like the ones being done by Groq). I am also a bit confused about this because it seems like there is still a need to train using the existing FP16 math, and this binary-ish technique is more for the inference stage, after quantizing an FP16 model down to a model with 1/0/-1, or at least that's my read on it. If that's so, then Groq and Cerebras and others who are well down the custom-ASIC path may be better positioned to pivot to ASICs for this binary-like math, specializing in inference (which is the larger market, I believe) and leaving the training phase to Nvidia hardware (or similar).
@iyziejane9 ай бұрын
There are multiple reasons to sell if you've been holding for a while. Regardless of what architecture pulls ahead, nvidia will have more competition soon. We may also be in a speculative bubble, since the investor frenzy is banking on a wildly optimistic economic transformation due to AI, and that may not come to pass.
@KillFrenzy969 ай бұрын
This is a good thing for both top end and consumer level AI. The top end will always use the extra headroom to improve the quality of the model. Consumers will finally run good models that they can run on a regular PC, perhaps opening the window for games running AI locally.
@tuhinswe9 ай бұрын
no drawback?
@MichaelBarry-gz9xl9 ай бұрын
None that I can see. It seems better in every single way: 5x smaller, 2x context length, 4x lower latency, 11x greater batch sizes, several orders of magnitude less energy required, and a better scaling law. It needs retraining from scratch, so it'll be the good part of a year before we get some decent models to play with. It will also spur the development of new hardware, so we'll all be wanting to buy new hardware, which will make it even better.
@EobardUchihaThawne9 ай бұрын
The idea of having integers as weights has made sense to me for a while, but my man 😂 only using -1, 0, and 1 is very cool.
@EobardUchihaThawne9 ай бұрын
But still, in the abstract it doesn't feel reliable somehow 😂
@VincentVonDudler9 ай бұрын
@EobardUchihaThawne I don't know sh-t about the human mind *but* in whatever way it functions it can probably be abstracted to something simpler than floating points. Probably much more along the lines of two or three values like -1, 0, 1.
@torretacosmica9 ай бұрын
Why? The human mind is "analog", it is not discrete, so it has "quantum level" precision... @tVonDudler
@SoCalGuitarist9 ай бұрын
Wooo, this sounds pretty cool, and may make CPU inference on low power devices much more realistic. Groq folks probably aren't too happy about this (tho lets be real, this will make groq's already incredible performance even more incredible)
@cbuchner19 ай бұрын
Some earlier tensor cores of nVidia GPUs (Turing, Ampere A100) have a 1 bit matrix multiplication mode. It was labeled as experimental in the documentation. Not sure if they kept this in later hardware revisions.
@Anzeljaeg9 ай бұрын
This is truly a big change, ty for the info
@build.aiagents9 ай бұрын
This is phenomenal
@mawungeteye66099 ай бұрын
So what happens to 1-bit Mamba models?
@farrael0049 ай бұрын
"It (BitNet b1.58) matches the full-precision Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance" I don't buy this at all. I will only believe it when I have a model like that running locally in my computer.
@Daniel-Six9 ай бұрын
1.58 = Log 3 in base 2. (corrected as below... Uh-doiii!)
@Jvo_Rien9 ай бұрын
log_2(3) or log(3)/log(2) actually. But good catch anyway! This is related to entropy. This BitNet shifts from a binary to a ternary representation. So if you have a sequence of values with 3 possible states each, you need about 1.58 bits per value to encode them efficiently without loss of information.
@timothywcrane9 ай бұрын
Doesn't this do to an LLM computation something like what running UMAP down to 3 dimensions for visualization does to high-density embeddings in classic ML topic modeling? I always use the HDBSCAN output for hierarchical and algorithmic insights before the 3D visualization, because the dense parameters give more precise, but harder to handle, high-density data. The 3D data is still VERY usable and holds most of the understanding from the high-dimensional version. Is this comparable? Nice to know after struggling with the frustration of the Nvidia CUDA update dragging on..
@BobKane-g6x9 ай бұрын
How does this differ from all the other quantization methods that we have used so far? I have worked with 2-bit, 4-bit, 5-bit, and 6-bit quantized versions. Is this a completely new approach?
@crankysysadmin9 ай бұрын
Quantization takes a model created in fp16 and "compresses" it down to 6, 5, 4, 3, or 2 bits. This technique requires one to *train* the model at 1.58 bits.
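For reference, the weight quantization the paper describes is, as I recall, an "absmean" scheme: scale by the mean absolute value, then round and clip to {-1, 0, 1}. A rough sketch of that function (my own code, not the authors'):

```python
import numpy as np

def absmean_ternary(W, eps=1e-5):
    """Map a full-precision weight matrix to {-1, 0, 1} plus one scale.
    gamma is the mean absolute value of W. In BitNet b1.58 this happens
    during training, not as an after-the-fact compression step."""
    gamma = np.abs(W).mean()
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_t, gamma

W = np.random.randn(4, 4).astype(np.float32)
W_t, gamma = absmean_ternary(W)
print(W_t)                              # entries are only -1, 0 or 1
print(np.abs(W - W_t * gamma).mean())   # error if you applied this naively post-training
```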
@MarkShank9 ай бұрын
The extreme wording of the title is warranted here if you didn’t know about the October paper, which I did not. So, thanks for the video!
@MrSur5129 ай бұрын
Elaborate please?
@차우현-d8d9 ай бұрын
@MrSur512 There was a paper published in October about the previous version of this, named BitNet.
@biskero9 ай бұрын
Did they provide a real example with tests? This is going to be great for low-power computational devices like the RPi and so on...
@anirudh5149 ай бұрын
Amazing!! You are always very much up to date with research work in this AI field.
@footube39 ай бұрын
Question: word2vec is able to represent semantically similar words by placing them near each other in an n-dimensional vector space. How can it achieve the same result if there are only three possible positions in each dimension? Or are the vectors produced by word2vec solely the data that is fed to the model, not the model weights themselves? Does this also have anything to do with why the activations are 8-bit, or is that unrelated?
@siddharthagrawal83009 ай бұрын
Consider that embeddings have 512 to 1024 or even more dimensions. That means with just 1, -1, and 0 you can represent 3^1024 unique elements in an embedding; that's more than the number of atoms in the universe, AFAIK. I am unsure if this produces embeddings in 1.58 bits, but that is still a very large space to encode information over.
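That claim is easy to check (the 10^80 atom count is the usual rough estimate, and the dimension is just an example):

```python
d = 1024                 # a typical embedding dimension
states = 3 ** d          # distinct ternary vectors of that length
atoms = 10 ** 80         # rough estimate of atoms in the observable universe

print(len(str(states)))  # 3**1024 has about 489 decimal digits
print(states > atoms)    # True, vastly more than the number of atoms
```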
@KingBadger3d8 ай бұрын
Great video. Very cool stuff
@richardchin15459 ай бұрын
If the 1.58-bit model is outperforming the traditional model on current hardware, what could it do with optimised hardware? I'd also be interested to know whether it is quicker (and cheaper) to train. Possible energy/environmental benefits too? Could it make putting LLMs in phones and small, portable devices more doable?
@AndersTornqvistsvedbergh9 ай бұрын
It should be possible to do the calculations with mostly analogue circuits, just a couple of transistors per weight really, one layer adding and one layer subtracting (or doing nothing). Like a 10000x10000 weight-layer update in one ns with very little energy.
@rasol1369 ай бұрын
Thank you for sharing! This process will revolutionize local open source AI!
@petrus49 ай бұрын
Ternary is apparently a very old alternative number system, which permits decision-making with the use of -1, 0, and 1, unlike binary's just 0 and 1. Apparently a computer called Setun, based on it, was built by the Soviets in the 1950s. It makes me wonder what other innovations might be possible if we were to look at such obscure and potentially unexplored ideas.
@ScibbieGames9 ай бұрын
Ternary is difficult to make into logic gates. These ternary values are still represented by binary bits here.
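For anyone curious what the balanced ternary that PaulSpades and petrus4 mention looks like in practice, a tiny sketch (my own illustration):

```python
def to_balanced_ternary(n):
    """Represent an integer with digits -1, 0, +1 (least significant digit first)."""
    digits = []
    while n != 0:
        r = n % 3
        if r == 2:           # a digit of 2 becomes -1 plus a carry
            r = -1
        n = (n - r) // 3
        digits.append(r)
    return digits or [0]

print(to_balanced_ternary(8))    # [-1, 0, 1]  ->  -1*1 + 0*3 + 1*9 = 8
print(to_balanced_ternary(-8))   # [1, 0, -1]  ->   1*1 + 0*3 - 1*9 = -8
```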
@dakrontu5 ай бұрын
That it took so long to find out that a 1-bit approach is viable suggests it was very counter-intuitive, so it is good that we eventually found out, because in terms of hardware this is revolutionary. It also possibly sheds some light on how neurons work, where the complexity may just be a consequence of the biology: unnecessary in that it does not add to the performance, but (and this is key) not getting in the way either. So maybe we can simplify a neuron model along these 1-bit lines.
@nonetrix30669 ай бұрын
If it ends up being true (I have my doubts, but regardless), can we even make models smaller after this? I don't think it's really possible. I think maybe the next step is trying to trim parameters; I think I saw this being talked about once, just finding parameters that don't contribute much, but it would be computationally expensive.
@henkhbit57489 ай бұрын
We have to wait until a company implements this technique. Very promising. Thanks for the update on the new LLM technique.
@AC-go1tp9 ай бұрын
Great explanation! It would be great if you could do a more in-depth explanation of the paper and share some resources contrasting conventional matrix multiplication with the 1-bit approach. Thanks a lot!
@69x9 ай бұрын
Couldn't help but think of quantum computers when seeing the -1, 0, 1 values. Is there a possibility for this?
@iyziejane9 ай бұрын
How do you train such models? I mean, changing a parameter from 1, to 0, to -1 is a big discrete step. How do you know when it's time to change it?
@1littlecoder9 ай бұрын
That's where it'll be interesting to see if the authors share the code. They haven't yet.
@iyziejane9 ай бұрын
@@1littlecoder Thanks for the reply (and for this very nice high-level summary video). I looked into the paper, and the closest they come to discussing training is to mention that other groups have used post-training quantization (which is the most natural guess), but then they criticize such methods and don't say what they do instead (a little bit suspicious, but maybe being guarded is normal with so much money at stake). Clearly, to make such a big change of a 1 to 0 or -1 at once, the decision to change must be based on many training samples somehow. The best way I can think of to do this, without storing a hidden floating-point value for each parameter, is a Monte Carlo method (each training sample suggests some small direction for each trit to change in, and RNG is used to accept the change probabilistically, so that on average the trit parameters are being sampled from a relevant distribution that takes into account all the training samples). Just a guess!
@1littlecoder9 ай бұрын
@@iyziejane Given your interest, read up on Extropic. They're proposing a new chip and compute!
@iyziejane9 ай бұрын
@@1littlecoder Ah that's a serendipitous recommendation, I knew the CEO Verdon from his days as a graduate student in quantum computing. I might actually get in touch to help their efforts...
@1littlecoder9 ай бұрын
@@iyziejane woah; that's so nice of you! Thanks for sharing this!
@prodigiart9 ай бұрын
This is fantastic news for people running local open source models, if the performance translates.
@OnigoroshiZero9 ай бұрын
This is huge and actually shocking... But how can the 3 values (-1, 0, 1) do the same work as the 16-bit ones without losing any precision? Also, wouldn't this mean that something on the level of GPT-4 could work on a single powerful GPU (something like a 3090) if it uses this 1.58-bit form, as it would scale down massively in both size and required computational power? Edit: A deeper dive on this, if possible, would be helpful.
@MichaelBarry-gz9xl9 ай бұрын
Yes. 120B parameters on a 3090. Hook a few up and you can absolutely give GPT-4 a run for its money. Don't forget GPT-5 will use this tech, and probably improve it and keep the improvements secret.
@VincentVonDudler9 ай бұрын
For speed: increase the quantity of calculations and lower the complexity of each calculation? For model size: the model is highly compressed because the values are highly compressible (only three options)? So maybe the model is bigger in terms of overall weights/information held, but much smaller in disk size? Disclaimer: I have no idea wtf I'm talking about.
@marcfruchtman94739 ай бұрын
There is a lot of potential here. Originally I thought this was a completely ternary process, but I doubt it. This model is probably using a similar method to "BitNet: Scaling 1-bit Transformers for Large Language Models," which explains that "BitNet employs low-precision binary weights and quantized activations, while maintaining high precision for the optimizer states and gradients during training." In other words, this method is not a completely ternary process; it is still using high precision for gradients, etc. (If it is using all ternary, then please explain how it avoids the same issues as over-quantization.)
@Alice_Fumo9 ай бұрын
I checked the paper for a sec and found these models actually eat more memory than 4-bit quantized ones and don't offer much of a speedup compared to them either. Don't know where the memory inefficiency comes from and whether it could be fixed. If this is the best it can do, it's quite Zzz
@MichaelBarry-gz9xl9 ай бұрын
Huh? Did we read the same paper? I did my own calculations in my head before reading the paper, and the paper confirmed what I suspected. This architecture can fit around 40B parameters on a little 8GB card.
@Alice_Fumo9 ай бұрын
@@MichaelBarry-gz9xl "BitNet b1.58 1.3B 1.14 (2.93x) 0.97 (1.67x) 11.29" taking 1.14 GB to run a 1.3B model gives us 8 / 1.14 = 7.017 * 1.3 = 9.1B for an 8GB card, assuming of course 0 bytes being used for stuff like display drivers. However, we'll also need some more memory for our context window.
@MichaelBarry-gz9xl9 ай бұрын
@@Alice_Fumo The smaller models are larger than expected, yes, I noticed that too; perhaps it's the head, which is 8-bit? But it levels off as the models get bigger and reaches the expected 1.58 bits. Check the chart that compares the sizes using the log scale on the left: 70B was around 10GB, and the 7B was around 10GB or less. I'm going off memory, but everything seemed as I expected it to be (with the exception of the smaller models).
@Alice_Fumo9 ай бұрын
@@MichaelBarry-gz9xl Ah, you're right. I kind of missed that graph. This would put the memory consumption of 70B models at ~20GB (they only gave the ratio for that one, which is why I overlooked it), which is actually within the realm of consumer GPUs (although barely). Most interestingly, it would let one run Mixtral (or similar) on a single GPU.
@TragicGFuel9 ай бұрын
@@MichaelBarry-gz9xl I still do not quite get how it would match the performance of the 32-bit variants? I'm just an undergrad student, so I'm probably missing a lot of the necessary details. If you could explain it, or point me to some resources that could, I would be grateful.
@MrErick11609 ай бұрын
Hey, where do you get all these papers? How do you stay informed about all the new important papers? I'd like to start reading some specific to my work (data science).
@1littlecoder9 ай бұрын
My main source is this guy on Twitter - twitter.com/_akhaliq (Very high signal to noise ratio)
@MichaelBarry-gz9xl9 ай бұрын
Hugging Face Daily Papers. Alternatively, just filter arXiv to only include keywords such as LLM.
@TommyJefferson18019 ай бұрын
@@MichaelBarry-gz9xl Hey, I checked your comments on YouTube and they're good; you seem to be one of the few knowledgeable people out here. I'm working on something interesting but want to verify it with you. If possible, can we connect on Discord? Thanks!
@Tomjones123459 ай бұрын
Can this method be used only for inference, or for training as well? If I understand correctly, the hardware requirements would remain the same for training.
@geldoku9 ай бұрын
Is this in any way related to neuromorphic/analogue computers?
@stevencooley33419 ай бұрын
Isn't there some kind of restriction on sale of high-performance GPU/LPU hardware to China? That might be fueling this. It might also be cause for doubt. I hope it's true though!
@abhi88mcet9 ай бұрын
How would you get the sign (+/-) "without multiplying" by the weights?
@Jvo_Rien9 ай бұрын
From what I understood, it's a ternary representation {-1, 0, 1}, so 3 base elements.
@jonatasdouradoporto23969 ай бұрын
Because it is technically a 2-bit representation, inverting the first bit allows you to 'multiply' it by (-1).
@jjiteshh9 ай бұрын
I thought that with quantisation the model performance would drop. Maybe the memory requirement has gone up as a trade-off? The matrix dimensions would have to increase for all of the information to be stored in some way. I'm not an expert in any way, it just feels like it's not that simple..
@maxieroo6299 ай бұрын
9:20 I thought so too, but... no? The memory footprint is clearly smaller. I'm genuinely shocked, and it seems to decrease the relative memory requirements more the bigger the model gets. So we get smaller LLMs (not parameter-wise, but storage- and RAM-wise) with pretty much no loss in quality (and even some gain), with WAY faster inference times. This truly will change everything. Imagine this on Mixtral.
@jjiteshh9 ай бұрын
@@maxieroo629 if true, it will certainly be shocking
@MichaelBarry-gz9xl9 ай бұрын
No, it's just that the current models are really really bad, like incredibly inefficient. That's being rectified little by little
@zacboyles13969 ай бұрын
@@maxieroo629 Perhaps the math performed during quantization simply produces less relevant values than what this 1.58-bit (rounding?) rule settles on? When you're quantizing, you may inadvertently keep less statistically relevant information, or values that amount to less statistically relevant information during inference, whereas this technique performs a similar function at a different time and just happens to produce similar results to the original... or perhaps the paper was run through Google's marketing team and thus the entire thing is bogus 😅 In any case (well, other than if Google really was involved 😂) I can't wait to check this out on some local 1.58-bit Mistral models!
@deeplearning70979 ай бұрын
Very nice, thank you very much.
@oryxchannel9 ай бұрын
Please provide transcripts to study your videos alongside Gemini. Thank you!
@battleforevermore9 ай бұрын
What are the equivalent CPU to a 3090 gpu for this? or you do you still need a gpu for parrel processing ?
@hemanthkumar-tj4hs9 ай бұрын
Hello, what about accuracy when predicting tokens? Even with traditional 16-bit FP there is a little accuracy drop-off when selecting tokens.
@frankjohannessen63839 ай бұрын
I'll file this under "I'll believe it when I see it". I find it suspicious that they only show scores for tiny models (up to 3B), but they have tokens/s and model size for bigger 1.58-bit models (up to 70B).
@zerorusher9 ай бұрын
If this is indeed real, that's a game changer. But how can the model's resolution be maintained with so much compression? LLMs are naturally lossy, but this takes it to the extreme.
@pokerandphilosophy83289 ай бұрын
From what I understand, this isn't a compression method. It's an alternative way of encoding the parameters and of processing them during training and inference that makes a more efficient use of the available memory space. Models must be trained from scratch using this new encoding scheme.
@SonGoku-pc7jl9 ай бұрын
Wow!!! Thanks for this amazing information! :)
@walterbaltzley45469 ай бұрын
Wow! This takes LLM Design out of the realm of high-level abstraction and pulls it down to the hardware level. Looks like old C++ Guys like me are about to become relevant again. If the performance of these models can match that of FP16 on the output side, then this is truly game-changing. The cost of building out AI Infrastructure dropped by several orders of magnitude. All we need to do is develop efficient libraries for performing matrix math using 2-Bit Unsigned Integers.
@MichaelBarry-gz9xl9 ай бұрын
Not only that, but it now seems logical to ditch the tokenizers and start using binary. That way we can train the models on existing binaries: text in, executable out, and vice versa. Think about the ramifications. In the future everything will be open source, whether we like it or not.
@walterbaltzley45469 ай бұрын
@@MichaelBarry-gz9xl By binary in and text out, I assume you mean binary in and source code out; that is an intriguing proposition. People have already predicted that AI spells the end for programmers. That is not a question of if but when. Anyone want to start a job death pool?
@MichaelBarry-gz9xl9 ай бұрын
@@walterbaltzley4546 I mean compiled code out. And in. Source code sure, but we can already do that. Imagine "feeding" it a copy of Microsoft Windows and then saying I want this but change X, Y, Z. Out comes the binaries, the source code, the documentation, everything. Next just say, change it around slightly so it doesn't infringe copyright. Boom.
@MichaelBarry-gz9xl9 ай бұрын
@@walterbaltzley4546 it's not the end of programmers, that's like saying I already have an iPhone so I don't need apple anymore. Sure but you want updates and improvements and someone to blame when it goes wrong.
@MichaelBarry-gz9xl9 ай бұрын
@@walterbaltzley4546 I think the best way to look at the future of AI is: ANY to ANY modality. I.E, voice in Blockbuster movie out. Video in poem out. Picture in software out. Text in video game out. Video game in movie out. And so on. ANY to ANY, is the way to see it. Also think in terms of millions and then billions and then trillions, and so on, of tokens per second. Now you should see where this is going.
@edwardrhodes44039 ай бұрын
If one used this with current hardware, what would happen? Would it be quicker?
@MichaelBarry-gz9xl9 ай бұрын
They used current hardware to build it and test it. Yes it's a lot faster, but it could be faster still
@justinnine49409 ай бұрын
In a computer everything is just 0 or 1. An FP16 value is just 16 1-bit digits.
@Tony_Indiana9 ай бұрын
I asked Gemini for about 2 hours how it works. It was mostly unsure.
@ixwix9 ай бұрын
How does one get reasonable gradients with 1/1.58 bits?
@PaulBrunt9 ай бұрын
You scale up the parameters and let the network decide the precision attributed to each feature. Ternary gradients should work just as well if you up the parameter count, but it removes redundant calculation. It will take longer to train on hardware not designed for it, but will be way faster on hardware that is.
@WildEngineering9 ай бұрын
I think using -1, 0, 1, 2 with 2 total bits would be a little smarter, but I see why using only one bit is easier, because multiplication on a single bit is just AND.
@petardjurkovic10159 ай бұрын
Great explanations
@PaulSpades9 ай бұрын
If this is true, ternary computing has finally found its application. But I doubt this representation is useful on current hardware, or faster than 8- or 16-bit integers. Well, unless the matrix library does some mad bit banging and masking.
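That bit banging could look roughly like this on today's hardware: pack each ternary weight into 2 bits and unpack on the fly (an illustrative sketch of one possible packing, not an optimized kernel and not how any particular library does it):

```python
import numpy as np

# Encode {-1, 0, +1} as 2-bit codes (0b10, 0b00, 0b01): four weights per byte.
ENC = {-1: 0b10, 0: 0b00, 1: 0b01}
DEC = np.array([0, 1, -1, 0], dtype=np.int8)     # 2-bit code -> weight value

def pack(weights):
    codes = np.array([ENC[int(w)] for w in weights], dtype=np.uint8)
    codes = np.pad(codes, (0, -len(codes) % 4)).reshape(-1, 4)
    return codes[:, 0] | codes[:, 1] << 2 | codes[:, 2] << 4 | codes[:, 3] << 6

def unpack(packed, n):
    codes = (packed[:, None] >> np.array([0, 2, 4, 6])) & 0b11
    return DEC[codes].reshape(-1)[:n]

w = np.random.default_rng(1).integers(-1, 2, size=10)
assert np.array_equal(unpack(pack(w), len(w)), w)   # 10 weights stored in 3 bytes
```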
@mattmazurek9 ай бұрын
How does it compare to quantized models?
@maxieroo6299 ай бұрын
It's better in seemingly all ways than previous quantized models. This isn't a specific model though; it's a new quantization approach being demonstrated with Llama. It means any LLM using this method can have the full performance of a non-quantized (16-bit) model, with a size smaller than other quantizations (in storage, and in RAM during inference) and a massive inference speed increase. I genuinely can't see the drawback here.
@muhammedajmalg64269 ай бұрын
thanks for sharing!
@claffert9 ай бұрын
So, going from "Attention Is All You Need" to "1.58 Bits Are All You Need"?
@impactframes9 ай бұрын
Great videos ❤ There goes Sama's $7T, unless this can't be used in training like quantization. I have been using 2-bit XXS quant models; they are accurate enough, and given you can build them from the ground up to be compatible as 1-bit or 1.58-bit, I think this is revolutionary. Not sure we won't need GPUs at all, even with linear mults. I missed this one with all the noise too.
@michaelmccoubrey42119 ай бұрын
Very interesting. I think I'll need to read the paper though. For the neural net to be able to approximate any function, I would have thought you need at least some form of multiplication for it to be a non-linear model (as in non-linear classification thresholds or non-linear regressions). If they are not multiplying values with weights, then I would at least expect some form of multiplication to be done in the activation function.
@MichaelBarry-gz9xl9 ай бұрын
Optimizing the activations is being left for future work.
@AndersTornqvistsvedbergh9 ай бұрын
The activation function is ReLU, so that's not multiplication, and it's still nonlinear enough. The attention normalization step would still need normal multiplication, for instance.
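Multiplication also doesn't disappear from the activations: the BitNet papers keep them at 8-bit using, as I understand it, a per-tensor absmax scale. A simplified sketch of that idea (my own code, details such as per-token scaling are omitted):

```python
import numpy as np

def absmax_quantize_int8(x, eps=1e-5):
    """Quantize activations to 8-bit integers with a single absmax scale."""
    scale = 127.0 / (np.abs(x).max() + eps)
    x_q = np.clip(np.round(x * scale), -128, 127).astype(np.int8)
    return x_q, scale

x = np.random.randn(8).astype(np.float32)
x_q, scale = absmax_quantize_int8(x)
print(x_q)                              # int8 activations
print(np.abs(x - x_q / scale).max())    # small quantization error
```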
@xlr555usa9 ай бұрын
Are we going to hit a wall with LLMs? Garbage in, garbage out; we need a system that verifies the AI is not corrupting itself, plus we need more energy to feed this monster. Is the Singularity near or far?
@ugk43219 ай бұрын
Thank You
@wanfuse9 ай бұрын
If instead of 3 you use 4 values ("value not present at all" being the fourth), you end up with 16 values instead of 9, and when you actually use it in the calculation you just shift by 1 or 2. Please explain your 1.58 bits better: how do you get this value?
@MichaelBarry-gz9xl9 ай бұрын
GPT-4 is your friend: The concept of ternary computing, where a bit can have three possible values (-1, 0, 1), is indeed different from the traditional binary system. The value of 1.58 bits for a ternary system comes from the calculation of information entropy. In information theory, the entropy H of a system is a measure of the amount of information required to describe the state of the system. For a system with n equally likely states, the entropy in bits is given by H = log2(n). For a ternary system with three possible states, the entropy is H = log2(3) ≈ 1.58496. So, when we say that a ternary digit is equivalent to 1.58 bits, we're referring to the amount of information it can convey, not the number of binary bits required to represent it. This is why a single ternary value is said to be worth about 1.58 binary bits in terms of information capacity. It's a measure of the "information density" that the ternary system can achieve compared to the binary system.
@MichaelBarry-gz9xl9 ай бұрын
As we must use binary, we can encode this with 1.58 bits per digit. But if we had ternary chips, we could encode each digit in a single ternary "bit", because each bit would have an extra possible value.
@wanfuse9 ай бұрын
@@MichaelBarry-gz9xl Ah thanks, you're referring to entropy. My premise still holds (a null value can be a 4th state), but you're not concerned with storage as much as computation "speed".
@MichaelBarry-gz9xl9 ай бұрын
@@wanfuse Yeah, there's always a trade-off. If we made the entire computer ternary, it would run this architecture great, but existing binaries would then take up 58% more space. So it would need to be a dedicated chip: we would store it in binary, move it to the chip where it would be stored and operated on in ternary, then export it back as binary. So always a trade-off. As to why they landed on this specific setup, I can't answer, because the paper is lacking in details. It's more of a "look what we can do" kind of paper, lacking any serious detail, because: Microsoft. Over time we will converge on the ideal trade-off. Maybe your trade-off turns out to scale better? But their trade-off absolutely blows FP16 off the map. Time will tell, and maybe we'll end up with both.
@wanfuse9 ай бұрын
@@MichaelBarry-gz9xl Thanks for your detailed replies. It isn't necessary to mimic the brain, but once you account for noise, the brain uses about 6-10 bits of information (as I calculated it); however, the brain likely takes advantage of this noise. I have recently looked into different encoding schemes and there is some overlap with this work.
@jimlynch93909 ай бұрын
Wish I could give it two thumbs up. This is almost too good to be true.
@ymusicyt9 ай бұрын
This is groundbreaking 😮
@MichaelBarry-gz9xl9 ай бұрын
40 billion parameters is going to be the new 7B. With this architecture we can fit a 40B model on an 8GB graphics card. I think you should change the title back to the clickbaity one. This really is groundbreaking, the most exciting paper in a long time! We'll be hitting GPT-4 level performance once we have some good foundation models. LLaMA v4, anyone 🙏 I thought the first BitNet paper was impressive; this blows it out of the water.
@MichaelBarry-gz9xl9 ай бұрын
Or for a 3090/4090 you can have 120B!
@BHBalast9 ай бұрын
120B models could be in the territory of GPT-3.5. Fine-tuned models of this size are perfectly good for doing truly useful stuff like calling functions, so that's great to know.
@MichaelBarry-gz9xl9 ай бұрын
@@BHBalast I honestly think a well-designed 120B could blow 3.5 out of the water, maybe GPT-3.85-ish, if you know what I mean.
@TommyJefferson18019 ай бұрын
@@MichaelBarry-gz9xl Mistral is pretty close. I mean, GPT-3.5 was before all the LLM hype, right, and after that we saw a lot of innovation.
@testales9 ай бұрын
I like this title more, it's a lot less click-baity. ;-) Though I still wonder if this even changes anything at all. If the paper is already from October last year, it's old by today's standards, so there is probably a good reason why it has not been adopted yet.
@1littlecoder9 ай бұрын
when compared with the previous one?
@testales9 ай бұрын
@@1littlecoder Yes, and since people were discussing this, I thought I'd add my 2 cents of feedback.
@1littlecoder9 ай бұрын
Thank you, appreciate it!
@MichaelBarry-gz9xl9 ай бұрын
Pretraining is expensive and takes time; not many people can afford it. Hence, out of the thousands of models on the hub, it's basically just LLaMA and Mistral. The original paper was good, but the accuracy wasn't there. Now the accuracy is there, and I think companies such as Meta, maybe even Microsoft as they played a part in this research, will be scrambling to make bigger models with this. But it takes a long time and a lot of money.
@testales9 ай бұрын
@@MichaelBarry-gz9xl What do you mean by "the accuracy wasn't there. Now the accuracy is there"? I don't think there is new hardware or software with increased "accuracy" since October 2023. Also, the most time-consuming part of training for these companies is most likely the preparation of the datasets. Apart from the biggest models, training takes only days or weeks with the hardware these big tech companies have. For the training datasets they could just use the existing ones. If, on the other hand, this 1.58-bit BitNet approach allows training smaller models like 7B from scratch with more affordable hardware, or if there is a way to "compress" existing models by converting or re-training them into the new format, one would expect there to already be some open-source examples floating around.
@sherpya9 ай бұрын
3 values cannot be represented using 1 bit
@nyx2119 ай бұрын
No, you'd need about 1.58 bits.
@MichaelBarry-gz9xl9 ай бұрын
It's not binary, it's ternary. That's why new hardware will be more efficient. In binary terms this takes up 1.58 bits, not 1
@AAjax9 ай бұрын
If only the paper authors had thought of that, they could have saved a lot of time!
@MichaelBarry-gz9xl9 ай бұрын
To address the core of the confusion: 3 values can absolutely be represented with 1 bit. 1 ternary bit holds 3 values. It's only binary that holds 2 values. But because binary is so ubiquitous people often assume it is the only game in town. The moral of the story, and the thing to remember is that binary is not the only game in town. You can have any number of values in your bits (so long as you are a chip manufacturer). The values themselves are not the bits, they are properties of the bits. And a bit can have as many properties/states/values as the manufacturer desires.
@xlr555usa9 ай бұрын
GTC is coming up; the Nvidia sessions should be online soon. I've been checking them out more and more, and the Remix tech is raging.
@marcosbenigno30779 ай бұрын
Obrigado. Thanks
@ysy699 ай бұрын
This seems to be a game changer
@kevintai86569 ай бұрын
Just wonder if anyone can reproduce their result lol
@ramus45979 ай бұрын
Great video, bro. Actually I'm doing a project in AI which converts a UI design into front-end code. Can you upload videos regarding this? It would be very useful for my project. Thanks in advance.
@23612449 ай бұрын
One network with many hidden layers = deep neural network
@nilo_river9 ай бұрын
If I were the CEO of Intel, I would be working on this right now.
@SR-zi1pw9 ай бұрын
My head is spinning. How does it perform like 16-bit?
@Kutsushita_yukino9 ай бұрын
At least there's no "shocking" this time.
@1littlecoder9 ай бұрын
I haven't used that word in my title at least in my last 10 to 15 videos that I could verify
@Moyemor9 ай бұрын
Wes Roth only uses clickbait ("shock", "entire industry shocked"), same as Matt. But 1littlecoder didn't use those fucking words.
@user-qr4jf4tv2x9 ай бұрын
How ironic, crypto bros used to use that.
@nawabifaissal96259 ай бұрын
shocking truly lol
@JohnMcclaned9 ай бұрын
SHOCKS THE WORLD
@kiiikoooPT9 ай бұрын
All I have to say is that this is in no way a 1-bit technology like you make it out to be, or even 1.58, because a bit is either 0 or 1. You cannot make half a bit; or you can, but then you need 2 bits to say that it is a half, or that it is positive or negative. If you say the values are -1, 0, 1, then to represent that you need 2 bits; there is no other way to do it.

I see another comment below where people are talking about compression and whatever, but if you give an AI a 1-hour 1GB video to learn from, and the same video in MP4 at 300MB, the AI will learn the same thing; it will not grow by 1GB or by 300MB. It learns whether the video contains things you already taught it before, for example whether cats show up in the video, or people, or whatever. It does not compress the video and put it in another file somewhere, like most people seem to think it works. Most AIs are trained on TBs of data; if it worked like that, we would never be able to run them on our basic PCs. Even the 70B-parameter models don't get to terabytes in size.

It has nothing to do with compression of images or whatever. It works like what your brain does: if you see a new movie, you will remember some scenes that clicked with you or shocked you, but you will probably not remember the color of the dress the actress was wearing. You will not compress the video in your head; what you retain are the things your brain already has connections for, from other movies or real-life situations, or truly new things that surprise you or have an impact on you. But not pixel by pixel; it is based on parameters you already know. Like LLMs.

At least this is how I see it, but maybe I'm ignorant. I speak from the little experience I have with LLMs and coding. And that is the reason I don't understand why you are calling this a 1-bit model. It is possible to do it, but you would need a way bigger number of parameters to make it learn anything if the only options you have are 0 or 1 to tune its learning in each neuron, I think. It's like saying I will make an AI that learns the full 256-value range of the red channel in RGB with 1 bit: you can do it, but it needs at least 256 one-bit parameters to detect all 256 possible values, or at least half of that, since it is in bits, 0 and 1.

If there are any experts who can show me where my thinking is wrong, I will be glad to read your opinion and learn from it. Like I said, I speak from the little I know. And for me it is weird seeing people say 1.58 bits, because that is not possible: bits are whole units, there are no decimals in bits, unless you reserve another one to make it fractional, and even then it can only be a half (0.5) or positive/negative (1 or -1), but for that you need 2 bits like I said, not 1.
@Unmannedair9 ай бұрын
One more step towards the merging of AI neural networks and quantum annealing style computing...
@zyxwvutsrqponmlkh9 ай бұрын
Nothing is free. I have seen attempts to take 32-bit models and quantize them to 2-bit. They produce outputs, but not particularly good ones; typically that drastic a quantization is inferior to a smaller model that is not quantized so severely, though they do still function. I have to wonder how the output of this approach differs from what would result from a model that was trained as a 2-bit model from the start.
@aiamfree9 ай бұрын
This essentially creates different "engines"... like cars, there are V4s and V12s, and you drive what you can afford! Currently they're all too expensive lol
@clray1239 ай бұрын
The main issue here is these are tiny models they are playing with. There is no proof that this technique scales to anything above (in terms of perplexity/quality). It reminds me of RetNet, which was also supposed to be a big breakthrough and hasn't been released with any open weights since.
@night80029 ай бұрын
Reminds me of PCM vs DSD in audio.
@nathank51409 ай бұрын
I don’t know why they ever used floats. Seems like complete overkill. I suspect just because GPUs were originally used for gaming.
@agnichatian9 ай бұрын
This sounds like an investment scheme. Nothing in this video explains how the resolution can be reduced from 16-bit int/float to only 1.58-bit ternary. Either they didn't need that resolution in the first place, this whole time, implying the whole industry was dumb (unlikely), or the conversion algorithm is creating many output matrices for each input matrix to make up the difference, which would just be re-representing each big word with several small words.
@NLPprompter9 ай бұрын
Everyone, make this viral so it gets more attention; such research is our future of AI. Gooooooooo