Floating Points are no more, Changes everything for LLMs!!!

26,132 views

1littlecoder

1 day ago

Comments: 333
@stevenelliott216
@stevenelliott216 9 ай бұрын
Those curious where the 1.58 comes from - It's log(3)/log(2) = 1.5849625. Basically if you have a long sequence of random three state values you can represent it with no less than 1.58 bits per three state value.
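For anyone who wants to check the arithmetic, here is a tiny illustrative snippet (not from the paper, just a sanity check of the information-theory claim):

```python
import math

# Information content of one three-state (ternary) value, in bits.
bits_per_trit = math.log2(3)
print(bits_per_trit)  # 1.5849625007211563

# Packing check: 5 trits fit into one byte, since 3**5 = 243 <= 2**8 = 256.
# That is 8 / 5 = 1.6 bits per trit, already close to the 1.58-bit lower bound.
print(3**5, 2**8, 8 / 5)
```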
@wwkk4964
@wwkk4964 9 ай бұрын
Thanks! I thought it was the square root of 5 halfs.
@swarnavasamanta2628
@swarnavasamanta2628 9 ай бұрын
Actually you need to explain a bit more why that is. Modern computers use 1 bit to represent 0 or 1, and 2 bits to represent -1. Given a random sequence over 3 possible integers, two of which take 1 bit and one of which takes 2 bits, the average number of bits used converges toward log(3)/log(2). Edit: If you are curious why it's log 3 / log 2 (base-10 logs): computers use binary, so the number of bits required to distinguish N states is log base 2 of N. Since we are using 3 states, or 3 unique digits, that's log_2(3), or log 3 / log 2.
@jarrod752
@jarrod752 9 ай бұрын
@@swarnavasamanta2628 So if I understand you correctly, you have 1 bit to represent 1 number, and 2 bits to represent the other 2 numbers? So for example, if you see a 1, you stop; but if you see a 0, then you check for 00 or 01? Giving you 3 options stored in a compact way?
@swarnavasamanta2628
@swarnavasamanta2628 9 ай бұрын
@@jarrod752 You have 2 bits to represent -1 and 1 bit to represent 0 or 1. So you need 2 bits to store whenever the weight is -1 and 1 bit to store the 0 and 1 weights
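To make the encoding idea in this thread concrete, here is a toy prefix code along the lines described above (purely illustrative; note that for uniformly random weights it averages about 1.67 bits per weight, and you only approach the log2(3) ≈ 1.58-bit bound by packing several weights together, e.g. 5 trits into 8 bits):

```python
# Toy prefix code for the three weight values, roughly as described above.
code = {0: "0", 1: "10", -1: "11"}

weights = [1, 0, -1, 0, 1, -1]
encoded = "".join(code[w] for w in weights)

# For uniformly random weights this averages (1 + 2 + 2) / 3 ≈ 1.67 bits per weight,
# slightly above the 1.58-bit entropy bound that block-packing can approach.
print(encoded, len(encoded) / len(weights))
```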
@PaulSpades
@PaulSpades 9 ай бұрын
I wish they would just recognise that they're talking about balanced ternary, rather than forcing a three-value number into a binary format.
@gabrielsandstedt
@gabrielsandstedt 9 ай бұрын
Summary: The video introduces a groundbreaking advancement in the field of Large Language Models (LLMs) by presenting the concept of 1.58-bit LLMs, a significant departure from the traditional 32-bit or 16-bit floating-point representations used in these models. This new approach utilizes ternary values (-1, 0, +1) for model parameters, drastically simplifying the computational operations needed to run these models.

Traditionally, LLMs rely on complex matrix operations, which are computationally intensive and require high-performance hardware like GPUs, optimized for such tasks through technologies like NVIDIA's CUDA. However, the 1.58-bit LLMs leverage simple arithmetic operations, reducing the need for specialized hardware and potentially lowering energy consumption and operational costs. This method significantly cuts down computational complexity, allowing advanced models to run on less powerful hardware, even devices without GPUs. It suggests a shift towards more sustainable AI technology use, with reduced environmental impact due to lower energy requirements. Moreover, it opens up avenues for hardware innovation, with the potential development of new devices optimized for ternary computation rather than complex floating-point matrix calculations.

This advancement is not just a technical feat; it represents a shift towards making high-performance AI more accessible and cost-effective, paving the way for future innovations in AI hardware and software. The 1.58-bit LLMs maintain performance levels comparable to traditional models while offering improvements in speed, memory usage, throughput, and energy efficiency. This development could redefine how LLMs are scaled and trained, offering a new paradigm for AI model deployment that is both high-performing and environmentally conscious.
@blacksage81
@blacksage81 9 ай бұрын
I wonder how fast this method would run on a Groq chip? Strange times we live in.
@__________________________6910
@__________________________6910 9 ай бұрын
Which website you are using for YT video summarization
@gabrielsandstedt
@gabrielsandstedt 9 ай бұрын
I just fed the entire YouTube transcript into GPT-4.5 and told it to summarize @@__________________________6910
@Custodian123
@Custodian123 9 ай бұрын
That's not a summary, that's a transcript 😂
@ysy69
@ysy69 9 ай бұрын
@@Custodian123 a good one 😛
@StevenAkinyemi
@StevenAkinyemi 9 ай бұрын
Bro. This is actually game changing! Why is this not a popular technique yet?
@StevenAkinyemi
@StevenAkinyemi 9 ай бұрын
Oh. It is new. Crazy stuff
@arnaudlelong2342
@arnaudlelong2342 9 ай бұрын
arXiv papers are not peer-reviewed.
@bourdainedepiment3962
@bourdainedepiment3962 9 ай бұрын
Like all vaporware, because it doesn't work.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
The research only came out yesterday, and it's preprint
@jarrod752
@jarrod752 9 ай бұрын
I'm guessing as a research paper, it'll either be implemented into everything in 2 weeks or it won't work.
@Cat-vs7rc
@Cat-vs7rc 9 ай бұрын
Why not address the elephant in the room? The 1.58-bit model cannot store the same amount of information as 16-bit. So why is it performing at an equal level? Very fishy. Metrics gamed? Most probably.
@UnknownOrc
@UnknownOrc 9 ай бұрын
Good point. No idea what any of this means anyway.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Density. Entropy. Compression. etc
@wenhanzhou5826
@wenhanzhou5826 9 ай бұрын
I am also super skeptical about this. Some papers show that pruning the network may give some performance boost, but performance degrades when too many weights are set to zero. This is like pruning and quantization at the same time, which would likely work in a very controlled manner, but it sounds too good to be true in this paper.
@michalchik
@michalchik 9 ай бұрын
It may be that our training algorithms cannot efficiently use the resolution of floating point; the size of floating-point numbers is way too large for the proper atomic level of this kind of information. Think of it this way: once you get to TV screens with 16-bit color, do you actually get much more by switching to 32-bit color? You mostly get much bigger files and much slower processing.
@nikflix8331
@nikflix8331 9 ай бұрын
My thoughts exactly. I’d imagine these 1.58 bit models would be more prone to catastrophic forgetting as well. They can’t hold nearly as much information.
@JG27Korny
@JG27Korny 9 ай бұрын
So basically the main idea is to get rid of multiplication, since multiplication by 1 leaves the value the same, multiplication by -1 just flips the sign, and multiplication by zero is zero. So 1 is obvious; 0 is interesting, as it allows additional complexity to be encoded in the neural network. I expect they could rediscover some bitwise hack techniques from the early days of 3D gaming, using bitwise operations instead of multiplication. That way you get the same efficiency of not doing multiplication, but the representable complexity goes up significantly. I do like the quantization, though, as it introduces some noise into the neural network, and by doing so you probably get some interesting results: the NN can outperform the original full-precision model on data the original was not trained for.
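A minimal sketch of what "no multiplication" means in practice: with ternary weights, a matrix-vector product reduces to additions, subtractions, and skips. This is toy code for illustration, not the paper's kernel:

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product where W contains only {-1, 0, +1}.

    Each weight either adds the activation, subtracts it, or skips it,
    so no multiplications are needed.
    """
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()  # zeros contribute nothing
    return out

W = np.array([[1, 0, -1], [0, 1, 1]])
x = np.array([0.5, -2.0, 3.0])
print(ternary_matvec(W, x))  # [-2.5  1. ]
print(W @ x)                 # same result via an ordinary matmul
```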
@morososaas3397
@morososaas3397 9 ай бұрын
It's so hard to believe that 3 possible parameter values can capture the patterns and signal from the data as well as 16-bit floats. Mind-blowing stuff for sure.
@arkaprovobhattacharjee8691
@arkaprovobhattacharjee8691 9 ай бұрын
that's what i want to know, how are they doing it?
@ZuckFukerberg
@ZuckFukerberg 9 ай бұрын
@@arkaprovobhattacharjee8691 they have released the technical paper, so you can freely check it
@ciarantaaffe4199
@ciarantaaffe4199 9 ай бұрын
If you understand Digital-to-Analog conversion techniques, it is trivial. In fact, you wouldn't need a negative one value either, but they have yet to figure that part out.
@eugenmalatov5470
@eugenmalatov5470 9 ай бұрын
Exactly, a lot of information gets lost. So you would assume that the number of nodes (parameters) would need to go up.
@nguyenvanduy247
@nguyenvanduy247 9 ай бұрын
Maybe the trick is that with ternaries there's now a convenient way to represent sparsity (the zero value), which I heard is pretty important in neural networks.
@whig01
@whig01 9 ай бұрын
Bitwise Mamba + bitwise inference for the win.
@IAmCandal
@IAmCandal 9 ай бұрын
I was thinking mamba.
@magicmarcell
@magicmarcell 9 ай бұрын
+ bitmap which is irrelevant but f it were in a simulation
@minimal3734
@minimal3734 9 ай бұрын
+ analog inference chips.
@siddharthagrawal8300
@siddharthagrawal8300 9 ай бұрын
@@minimal3734 why would you use analog inference chips for 1.58 bits 💀
@ytpah9823
@ytpah9823 9 ай бұрын
🎯 Key Takeaways for quick navigation:
00:00 🔄 The video introduces changes in the LLM (large language model) world, highlighting the transition from traditional deep learning models to ones that no longer require GPUs for high-performance matrix multiplication.
00:43 🔍 The discussion introduces the concept of 1-bit LLMs, suggesting a move towards more efficient computational models that retain performance parity with current LLMs but at a reduced computational cost.
01:11 🧮 Explains the shift from 32-bit or 16-bit floating-point representations to a ternary system (using -1, 0, 1) for model parameters, significantly simplifying the computational process by eliminating the need for multiplication.
03:31 🆕 Introduces the "B 1.58 model," which uses ternary values instead of binary, enhancing learning capabilities and performance by incorporating a zero value alongside -1 and 1.
05:08 💡 Discusses the potential for new hardware development optimized for the ternary computation model, suggesting a significant shift away from GPU reliance and towards more specialized computing solutions.
06:02 🚀 Highlights the paper's assertion that the 1.58-bit LLM architecture offers comparable accuracy to traditional models while improving latency, memory efficiency, throughput, and energy consumption.
07:12 📈 Provides evidence of the new model's effectiveness through comparisons with the Llama LLM architecture, showing equal or better performance on various metrics, including perplexity and downstream task performance.
09:58 🎛️ Elaborates on the technical implementation of the 1.58-bit LLM, retaining the Transformer architecture but altering the numerical representation and computational approach within the model.
11:48 🌍 Suggests a significant impact on the scalability and application of LLMs across different hardware platforms, including mobile and edge computing, due to the reduced computational requirements.
13:11 📉 Concludes with the potential for dramatic improvements in hardware efficiency and cost reduction for deploying large-scale LLMs, due to the shift to a 1.58-bit computational model.
Made with HARPA AI
@zandrrlife
@zandrrlife 9 ай бұрын
This isn't even a clickbait title. This is an improvement on the original BitNet paper, and it's actually a huge deal. The original method would reduce performance by almost half. We've been working with it for our own LM architecture, trying to figure out clever ways to mitigate these issues to a degree, which required a whole new activation function for sparse representations, increasing nonlinearity... blah blah lol. This improvement seems to retain the majority of that performance. Wow. Just going off the video screenshot; I have to dig into this paper now lol. Listen, we won't have GPT-4+ models locally without super efficient low-depth quantization, and this is more holistic since you have to pretrain. Also, generating weights directly is more feasible at this depth; we believe conditional weight generation is going to forever change deep learning. I know I sound crazy, but even compute won't be an advantage soon. We are going to change that. I know I sound crazy... but remember I said this. Within a year you won't have to pay hundreds of thousands/millions to pretrain a model; simply bring your dataset, model code, and hyperparameters. That's your prompt, by the way. This year is going to be insane for all.
@AbeDillon
@AbeDillon 9 ай бұрын
I'm surprised it took so long to study this. Years ago, I realized you could model combinational logic as a "neural net". Weights would have to be +1, 0 (not connected), and -1 (negated). The gates would be like neurons with various activation functions. I never wrote about it because I figured surely someone in the field had already written a paper on the concept. It seems so obvious.
@AbeDillon
@AbeDillon 9 ай бұрын
Since I missed the boat on this discovery, I might as well share some other insights I think are rather obvious but haven't seen in the literature (though I haven't looked very hard): 1) I bet the majority of training can be done at low precision too. Since weights start out highly randomized, they (almost by definition) don't carry much information. One should be able to train a 1.58-bit (trit) model until things settle, then add a bit of precision and continue training until the model parameters settle, then add another bit, and so on until you reach the desired performance. It makes little sense to train at high precision if you're going to throw away most of that precision when it comes time for inference anyway. I don't know how much precision is required for optimizer state like momentum, but it shouldn't be much more than the actual parameters themselves.
@AbeDillon
@AbeDillon 9 ай бұрын
2) I think the field is missing a key fundamental element besides neurons and weights: delays. AFAIK, delays have yet to be explicitly modeled in ANNs. I think it would help at a theoretical level for understanding RNNs. I think we might gain a lot from modeling neuron inputs as adaptive FIR filters instead of single-weighted, single-valued signals. Digital Signal Processing engineers have a whole toolbox of techniques based on explicitly modeling delays.
@unclecode
@unclecode 9 ай бұрын
When accuracy is so sensitive even to quantization, it's hard to grasp how this makes things better and faster! Every upgrade has a price tag, so what's the catch here? I'm curious to give their model a go, hopefully they add it to Hugging Face.
@d_b_
@d_b_ 9 ай бұрын
Seems huge! Can't wait to run 70B+ on my phone with millisecond response times
@footube3
@footube3 9 ай бұрын
Assuming you have 13.8GB of free memory on your phone to run it of course!
@TragicGFuel
@TragicGFuel 9 ай бұрын
@@footube3 that's not a large amount
@Solo2121
@Solo2121 9 ай бұрын
@@TragicGFuel He said memory not storage.
@irnehhenri
@irnehhenri 9 ай бұрын
My Xiaomi Mi 10 Ultra has 16GB of RAM, of which 12GB is literally always free, because that's an insane amount for a phone... but there have been phones like that for years now!
@irnehhenri
@irnehhenri 9 ай бұрын
I did actually try to run llama.cpp on my phone a while back - there was a project that compiled it natively for Android, but I couldn't get it to not crash. I could have tried compiling it myself, but I got bored and figured it would probably be way too slow with a phone CPU anyway
@i6od
@i6od 9 ай бұрын
i wish they released at least one model to test it out lol
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
They are planning on releasing the models for research
@footube3
@footube3 9 ай бұрын
This is huge! Thanks for bringing it to our attention ❤
@vincentvoillot6365
@vincentvoillot6365 9 ай бұрын
From 16-bit float to 1-bit ternary brings back memories, like going from CD (16-bit int) to SACD (1-bit DSD/SDM)... Could a text-to-sound model with 1.58-bit quantization output DSD directly?
@ПавелКуликов-м9м
@ПавелКуликов-м9м 9 ай бұрын
I still couldn't understand the main thing. In this work, was the original model trained in 16 bits and then quantized to -1, 0, 1, or did they manage to train the model directly in -1, 0, 1, with full backpropagation in this representation, etc.?
@JG27Korny
@JG27Korny 9 ай бұрын
From what I understand you get the full benefits only if you train at 1.58 bits. If you do post-training quantization you get what you get: lower precision that scales down through 8-bit, 4-bit, 2-bit, 1-bit.
@noname76787
@noname76787 9 ай бұрын
When you backpropagate, you use the fp16 values as a reference but tell the ternary model to only use 1, 0, or -1 (paraphrasing someone else's comment from another video).
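A rough sketch of that idea, often called a straight-through estimator: keep latent full-precision weights, ternarize them for the forward pass, and let gradients flow to the latent weights. The absmean-style scaling below is an assumption about the details, since no official code had been released at the time:

```python
import torch

def ternarize(w: torch.Tensor) -> torch.Tensor:
    # Scale by the mean absolute value, then round and clip to {-1, 0, +1}.
    # (Roughly the "absmean" scheme described for BitNet b1.58; treat the
    #  exact constants here as assumptions.)
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1)

def linear_with_ste(w_fp: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    w_t = ternarize(w_fp)
    # Straight-through estimator: ternary weights in the forward pass,
    # but gradients are routed to the latent full-precision weights.
    w_eff = w_fp + (w_t - w_fp).detach()
    return x @ w_eff.t()
```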
@otterhopper
@otterhopper 9 ай бұрын
This feels like the signal to sell NVDA while they're at their high, before the shift occurs where their GPUs are no longer essential for AI.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
NVIDIA is in an excellent position to optimize their GPUs for this. It will actually save them money. NVIDIA will be just as excited about this as we are. The good thing is that so too will RISC-V, Arm, Intel, AMD, etc.
@otterhopper
@otterhopper 9 ай бұрын
@@MichaelBarry-gz9xl Yeah, you could certainly be right. I wondered if that will be how it plays out too. My gut says it'll be a matter of how adaptive Nvidia is and how quickly they'll be able to pivot from their existing momentum (which is substantial) in the whole CUDA stack with FP16/etc. matrix math to doing custom circuits (ASICs, like Groq is doing). I am also a bit confused about this because it seems like there is still a need to train using the existing FP16 math, and that this binary-ish technique is more for the inference stage, after quantizing an FP16 model down to a model with 1/0/-1, or at least that's my read on it. If that's so, then Groq and Cerebras and others who are well down the custom ASIC path may be better positioned to pivot to ASICs for this binary-like math, specializing in inference (which is the larger market, I believe) and leaving the training phase to Nvidia hardware (or similar).
@iyziejane
@iyziejane 9 ай бұрын
There are multiple reasons to sell if you've been holding for a while. Regardless of what architecture pulls ahead, nvidia will have more competition soon. We may also be in a speculative bubble, since the investor frenzy is banking on a wildly optimistic economic transformation due to AI, and that may not come to pass.
@KillFrenzy96
@KillFrenzy96 9 ай бұрын
This is a good thing for both top end and consumer level AI. The top end will always use the extra headroom to improve the quality of the model. Consumers will finally run good models that they can run on a regular PC, perhaps opening the window for games running AI locally.
@tuhinswe
@tuhinswe 9 ай бұрын
no drawback?
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
None that I can see. It seems better in every single way: 5x smaller, 2x context length, 4x lower latency, 11x greater batch sizes, several orders of magnitude less energy required, better scaling law. It needs retraining from scratch, so it'll be the better part of a year before we get some decent models to play with. It will also spur the development of new hardware, so we'll all be wanting to buy new hardware, which will make it even better.
@EobardUchihaThawne
@EobardUchihaThawne 9 ай бұрын
The idea of having integers as weights made sense to me for a while, but my man 😂 only using -1, 0, and 1 is very cool.
@EobardUchihaThawne
@EobardUchihaThawne 9 ай бұрын
But still, in the abstract it doesn't feel reliable somehow 😂
@VincentVonDudler
@VincentVonDudler 9 ай бұрын
​ @EobardUchihaThawne I don't know sh-t about the human mind *but* in whatever way it functions it can probably be abstracted to something simpler than floating points. Probably much more along the lines of two or three values like -1, 0, 1.
@torretacosmica
@torretacosmica 9 ай бұрын
Why? The human mind is "analogue", it is not discrete, so it has "quantum level" precision... @tVonDudler
@SoCalGuitarist
@SoCalGuitarist 9 ай бұрын
Wooo, this sounds pretty cool, and may make CPU inference on low power devices much more realistic. Groq folks probably aren't too happy about this (tho lets be real, this will make groq's already incredible performance even more incredible)
@cbuchner1
@cbuchner1 9 ай бұрын
Some earlier tensor cores of nVidia GPUs (Turing, Ampere A100) have a 1 bit matrix multiplication mode. It was labeled as experimental in the documentation. Not sure if they kept this in later hardware revisions.
@Anzeljaeg
@Anzeljaeg 9 ай бұрын
This is truly a big change, ty for the info
@build.aiagents
@build.aiagents 9 ай бұрын
This is phenomenal
@mawungeteye6609
@mawungeteye6609 9 ай бұрын
So what happens to 1bit Mamba models
@farrael004
@farrael004 9 ай бұрын
"It (BitNet b1.58) matches the full-precision Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance" I don't buy this at all. I will only believe it when I have a model like that running locally in my computer.
@Daniel-Six
@Daniel-Six 9 ай бұрын
1.58 = Log 3 in base 2. (corrected as below... Uh-doiii!)
@Jvo_Rien
@Jvo_Rien 9 ай бұрын
log_2(3) or log(3)/log(2) actually. But good catch anyway! This is related to entropy. This BitNet shifts from a binary to a ternary representation. So if you have a sequence of values with 3 possible states each, you need about 1.58 bits per value to encode them efficiently without loss of information.
@timothywcrane
@timothywcrane 9 ай бұрын
Doesn't this do the same thing to an LLM computation that reducing high-dimensional embeddings to 3 dimensions with UMAP for visualization does in topic modeling with classic ML methods? I always use the HDBSCAN output for hierarchical and algorithmic insights before 3D visualization, because the density of the parameters gives more precise but harder-to-handle high-density data. The 3D data is still VERY usable and retains most of the understanding from the high-dimensional data. Is this comparable? Nice to know after struggling with the frustration of the Nvidia CUDA update dragging on...
@BobKane-g6x
@BobKane-g6x 9 ай бұрын
How does this differ from all the other quantization methods that we have used so far? I have worked with 2-bit, 4-bit, 5-bit, and 6-bit quantized versions. Is this a completely new approach?
@crankysysadmin
@crankysysadmin 9 ай бұрын
Quantization takes a model created in fp16 and "compresses" it down to 6,5,4,3,2 bits. This technique requires one to *train* the model at 1.58 bits.
@MarkShank
@MarkShank 9 ай бұрын
The extreme wording of the title is warranted here if you didn’t know about the October paper, which I did not. So, thanks for the video!
@MrSur512
@MrSur512 9 ай бұрын
Elaborate please?
@차우현-d8d
@차우현-d8d 9 ай бұрын
​@MrSur512 there was a paper published in October about the previous version of this named Bitnet
@biskero
@biskero 9 ай бұрын
Did they provide a real example with tests? This is going to be great for low-power computational devices like the RPi and so on...
@anirudh514
@anirudh514 9 ай бұрын
Amazing!! You are always very much up to date with research work in this AI field.
@footube3
@footube3 9 ай бұрын
Question: If word2vec is able to represent semantically similar words based on them existing near each other in an n dimensional vector space, how can it achieve the same result if there are only three possible positions in each dimension, or are the vectors produced by word2vec solely the data that is fed to the model, but not the model weights itself? Does this also have anything to do with why activations are 8 bit, or is that also unrelated?
@siddharthagrawal8300
@siddharthagrawal8300 9 ай бұрын
Consider that embeddings have 512 to 1024 or even more dimensions; that means with just 1, -1, and 0 you can represent 3^1024 unique elements in an embedding. That's more than the number of atoms in the universe, afaik. I am unsure if this produces embeddings in 1.58 bits, but that is still a very large space to encode information over.
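A quick back-of-the-envelope check of that claim:

```python
import math

# Number of distinct ternary vectors in a 1024-dimensional embedding space.
dims = 1024
log10_states = dims * math.log10(3)
print(f"3^{dims} is about 10^{log10_states:.0f}")  # roughly 10^489

# The observable universe is usually estimated at around 10^80 atoms,
# so the ternary embedding space is astronomically larger.
```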
@KingBadger3d
@KingBadger3d 8 ай бұрын
Great video. Very cool stuff
@richardchin1545
@richardchin1545 9 ай бұрын
If the 1.58-bit model is outperforming the traditional model on current hardware, what could it do with optimised hardware? I'd also be interested to know if it is quicker (and cheaper) to train. Possible energy/environmental benefits too? Could it make putting LLMs in phones and small portable devices more doable?
@AndersTornqvistsvedbergh
@AndersTornqvistsvedbergh 9 ай бұрын
It should be possible to do the calculations with mostly analogue circuits, just a couple of transistors per weight really, one layer adding and one layer subtracting (or doing nothing). Like a 10000x10000 weight layer update in one ns with very little energy.
@rasol136
@rasol136 9 ай бұрын
Thank you for sharing! This process will revolutionize local open source AI!
@petrus4
@petrus4 9 ай бұрын
Ternary is apparently a very old, alternative number system, which permits decision making with the use of -1, 0, and 1, unlike just binary's 0 and 1. Apparently a computer called Setun, which was based on it, was built by the Soviets in the 1950s. It makes me wonder what other innovations might be possible if we were to look at such obscure and potentially unexplored ideas.
@ScibbieGames
@ScibbieGames 9 ай бұрын
Ternary is difficult to make into logical gates. These ternaries are still represented by binary bits here.
@dakrontu
@dakrontu 5 ай бұрын
That it took so long to find out that a 1-bit approach is viable suggests that it was very counter-intuitive, so it is good that we eventually found out, because in terms of hardware this is revolutionary. It also possibly sheds some light on how neurons work: the complexity may just be a consequence of the biology, unnecessary in that it does not add to the performance, but (and this is key) nor does it get in the way. So maybe we can simplify a neuron model along the 1-bit lines.
@nonetrix3066
@nonetrix3066 9 ай бұрын
If it ends up being true (I have my doubts, but regardless), can we even make models smaller after this? I don't think it's really possible. I think maybe the next step is trying to trim parameters; I think I saw this being talked about once, just trying to find parameters that don't contribute much, but it would be computationally expensive.
@henkhbit5748
@henkhbit5748 9 ай бұрын
We have to wait until a company implements this technique. Very promising. Thanks for the update on the new LLM technique.
@AC-go1tp
@AC-go1tp 9 ай бұрын
Great explanation! It would be great if you could do a more in-depth explanation of the paper and share some resources contrasting conventional matrix multiplication with the 1-bit approach. Thanks a lot!
@69x
@69x 9 ай бұрын
Couldn't help but think of quantum computers when seeing the -1, 0, 1 values; is there a possibility for this?
@iyziejane
@iyziejane 9 ай бұрын
How do you train such models? I mean, changing a parameter from 1, to 0, to -1 is a big discrete step. How do you know when it's time to change it?
@1littlecoder
@1littlecoder 9 ай бұрын
That's where it'll be interesting to see if the authors share the code. They haven't yet.
@iyziejane
@iyziejane 9 ай бұрын
@@1littlecoder Thanks for the reply (and for this very nice high-level summary video). I looked into the paper, the closest they come to discussing training is to mention that other groups have used post-training quantization (which is the most natural guess), but then they criticize such methods and don't say what they do instead (a little bit suspicious, but maybe being guarded is normal with so much money at stake). Clearly to make such a big change of a 1 to 0 or -1 at once, the decision to change must be based on many training samples somehow. The best way I can think of to do this, without storing a hidden floating-point for each parameter, is by a Monte Carlo method (each training sample suggest some small direction for each trit to change in, and RNG is used to accept the change probabilistically, so that on average the trit parameters are being sampled from a relevant distribution that takes into account all the training samples). Just a guess!
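For what it's worth, here is a toy sketch of the stochastic/Monte Carlo rounding idea floated above. This is purely a speculative illustration of the comment, not what the paper does:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_ternary(w: np.ndarray) -> np.ndarray:
    """Round values in [-1, 1] to {-1, 0, +1} probabilistically, so the
    expected rounded value equals the continuous one."""
    w = np.clip(w, -1.0, 1.0)
    lower = np.floor(w)          # -1 or 0
    p_up = w - lower             # probability of rounding up by one
    return lower + (rng.random(w.shape) < p_up)

print(stochastic_ternary(np.array([0.3, -0.7, 0.0, 0.9])))
```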
@1littlecoder
@1littlecoder 9 ай бұрын
@@iyziejane Given your interest, read up on Extropic. They're proposing a new chip and compute!
@iyziejane
@iyziejane 9 ай бұрын
@@1littlecoder Ah that's a serendipitous recommendation, I knew the CEO Verdon from his days as a graduate student in quantum computing. I might actually get in touch to help their efforts...
@1littlecoder
@1littlecoder 9 ай бұрын
@@iyziejane woah; that's so nice of you! Thanks for sharing this!
@prodigiart
@prodigiart 9 ай бұрын
This is fantastic news for people running local open source models, if the performance translates.
@OnigoroshiZero
@OnigoroshiZero 9 ай бұрын
This is huge and actually shocking... But, how can the 3 values (-1, 0, 1) do the same work as the 16-bit ones without losing any precision? Also, wouldn't this mean that something on the level of GPT-4 will work with a single powerful GPU (something like a 3090) if it uses this 1.58-bit form, as it will scale down massively both in size and required computational power? edit: A deeper dive on this if possible will be helpful.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Yes. 120B parameters on a 3090. Hook a few up and you can absolutely give GPT-4 a run for its money. Don't forget GPT-5 will use this tech, and probably improve it and keep the improvements secret.
@VincentVonDudler
@VincentVonDudler 9 ай бұрын
For speed: increase the quantity of calculations but lower the complexity of each calculation? For model size: the model is highly compressed because the values are highly compressible (only three options)? So maybe the model is bigger in terms of overall weights / information held in the model but much smaller in disk size? Disclaimer: I have no idea wtf I'm talking about.
@marcfruchtman9473
@marcfruchtman9473 9 ай бұрын
There is a lot of potential here. Originally I thought this was a completely ternary process, but I doubt it. This model is probably using a similar method to "BitNet: Scaling 1-bit Transformers for Large Language Models," which explains that "BitNet employs low-precision binary weights and quantized activations, while maintaining high precision for the optimizer states and gradients during training"... In other words, this method is not a completely ternary process; it is still using high precision for gradients etc. (If it is using all ternary, then please explain how it avoids the same issues as over-quantization.)
@Alice_Fumo
@Alice_Fumo 9 ай бұрын
I checked the paper for a sec and found these models actually eat more memory than 4-bit quantized ones and don't offer much of a speedup compared to them either. Don't know where the memory inefficiency comes from and whether it could be fixed. If this is the best it can do, it's quite Zzz
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Huh? Did we read the same paper? I did my own calculations in my head before reading the paper, and the paper confirmed what I suspected. This architecture can fit around 40B parameters on a little 8GB card
@Alice_Fumo
@Alice_Fumo 9 ай бұрын
@@MichaelBarry-gz9xl "BitNet b1.58 1.3B 1.14 (2.93x) 0.97 (1.67x) 11.29" taking 1.14 GB to run a 1.3B model gives us 8 / 1.14 = 7.017 * 1.3 = 9.1B for an 8GB card, assuming of course 0 bytes being used for stuff like display drivers. However, we'll also need some more memory for our context window.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
@@Alice_Fumo The smaller models are larger than expected, yes, I noticed that too; perhaps it's the head, which is 8 bits? But it levels off as the model gets bigger and reaches the expected 1.58 bits. Check the chart that compares the sizes using the log scale on the left. 70B was around 10GB, and the 7B was around 10GB or less. I'm going off memory, but everything seemed as I expected it to be (with the exception of the smaller models).
@Alice_Fumo
@Alice_Fumo 9 ай бұрын
@@MichaelBarry-gz9xl ah, you're right. I kinda missed that graph. This would put the memory consumption of 70b models to ~20gb (they only gave the ratio for that one which is why I overlooked it), which is actually within the realm of consumer GPUs (although barely). Most interestingly, it would let one run mixtral (or similar) on a single GPU.
@TragicGFuel
@TragicGFuel 9 ай бұрын
@@MichaelBarry-gz9xl I still do not quite get how it would be matching performance with the 32 bit variants? I'm just an undergrad student, so I'm probably missing a lot of the necessary details. If you could explain it, or guide me to some resources that could, I would be grateful.
@MrErick1160
@MrErick1160 9 ай бұрын
Hey, where do you get all these papers? How do you stay informed with all the new important papers? I'd like to start reading some specific to my work (data science)
@1littlecoder
@1littlecoder 9 ай бұрын
My main source is this guy on Twitter - twitter.com/_akhaliq (Very high signal to noise ratio)
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Hugging Face Daily Papers. Alternatively just filter the arxiv to only include keywords such as LLM
@TommyJefferson1801
@TommyJefferson1801 9 ай бұрын
@@MichaelBarry-gz9xl Hey, I checked your comments on YouTube and they're good, and you seem to be one of the few knowledgeable people out here. I'm working on something interesting but want to verify it with you. If possible, can we connect on Discord? Thanks!
@Tomjones12345
@Tomjones12345 9 ай бұрын
Can this method be used only for inference, or for training as well? If I understand correctly, the hardware requirements would remain the same for training.
@geldoku
@geldoku 9 ай бұрын
Is this in any way related to neuromorphic/analogue computers?
@stevencooley3341
@stevencooley3341 9 ай бұрын
Isn't there some kind of restriction on sale of high-performance GPU/LPU hardware to China? That might be fueling this. It might also be cause for doubt. I hope it's true though!
@abhi88mcet
@abhi88mcet 9 ай бұрын
How would you get the sign (+/-) "without multiplying" by the weights?
@Jvo_Rien
@Jvo_Rien 9 ай бұрын
From what I understood, it's a ternary representation {-1, 0, 1}, so 3 base elements.
@jonatasdouradoporto2396
@jonatasdouradoporto2396 9 ай бұрын
Because it is technically a 2-bit representation, inverting the first bit allows you to 'multiply' it by (-1).
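A toy illustration of that point, using a sign-magnitude 2-bit encoding (this particular encoding is an assumption for the example, not something specified in the paper):

```python
# High bit = sign, low bit = magnitude: 0 -> 0b00, +1 -> 0b01, -1 -> 0b11.
enc = {0: 0b00, 1: 0b01, -1: 0b11}

def negate(code: int) -> int:
    # Flipping the sign bit "multiplies" a nonzero value by -1; zero is unchanged.
    return code ^ 0b10 if code & 0b01 else code

assert negate(enc[1]) == enc[-1]
assert negate(enc[-1]) == enc[1]
assert negate(enc[0]) == enc[0]
```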
@jjiteshh
@jjiteshh 9 ай бұрын
I thought with quantisation the model performance would drop. Maybe the memory requirement has gone up as a trade-off? The matrix dimensions would have to increase to store all of the information in some way. I'm no expert anyway, it just feels like it's not that simple...
@maxieroo629
@maxieroo629 9 ай бұрын
9:20 I thought so too, but…. No? The memory footprint is clearly smaller. I’m genuinely shocked and it seems to decrease the relative memory requirements more the bigger the model size is. So we get smaller llms (not parameter wise, but storage and ram wise) with pretty much no loss in quality (and even some gain) with WAY faster inference times. This truly will change everything. Imagine this on mixtral
@jjiteshh
@jjiteshh 9 ай бұрын
@@maxieroo629 if true, it will certainly be shocking
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
No, it's just that the current models are really really bad, like incredibly inefficient. That's being rectified little by little
@zacboyles1396
@zacboyles1396 9 ай бұрын
@@maxieroo629perhaps the math performed during quantization simply produces less relevant values than are decided by this 1.58 bit’s (rounding?) rules? When you’re quantizing you may inadvertently keep less statistically relevant information, or values that amount to less statistically relevant information during inference whereas this technique performs a similar function at a different time that just happens to produce similar results to the original…or perhaps the paper was run through Google’s marketing team and thus the entire thing is bogus 😅 In any case - well, other than if Google really was involved 😂 - I can’t wait to check this out on some local 1.58 bit mistral models!
@deeplearning7097
@deeplearning7097 9 ай бұрын
Very nice, thank you very much.
@oryxchannel
@oryxchannel 9 ай бұрын
Please provide transcripts to study your videos alongside Gemini. Thank you!
@battleforevermore
@battleforevermore 9 ай бұрын
What is the equivalent CPU to a 3090 GPU for this? Or do you still need a GPU for parallel processing?
@hemanthkumar-tj4hs
@hemanthkumar-tj4hs 9 ай бұрын
Hello, what about accuracy in predicting tokens? Even with traditional 16-bit FP there is a small accuracy drop when selecting tokens.
@frankjohannessen6383
@frankjohannessen6383 9 ай бұрын
I file this under "I'll believe it when I see it". I find it suspicious that they only show scores for tiny models (up to 3B), but they have tokens/s and model size for bigger 1.58-bit models (up to 70B).
@zerorusher
@zerorusher 9 ай бұрын
If this is indeed real, that's a game changer. But how can the model resolution be maintained with so much compression? LLMs are naturally lossy, but this takes it to the extreme.
@pokerandphilosophy8328
@pokerandphilosophy8328 9 ай бұрын
From what I understand, this isn't a compression method. It's an alternative way of encoding the parameters and of processing them during training and inference that makes a more efficient use of the available memory space. Models must be trained from scratch using this new encoding scheme.
@SonGoku-pc7jl
@SonGoku-pc7jl 9 ай бұрын
waw!!! thanks for this amazing information! :)
@walterbaltzley4546
@walterbaltzley4546 9 ай бұрын
Wow! This takes LLM Design out of the realm of high-level abstraction and pulls it down to the hardware level. Looks like old C++ Guys like me are about to become relevant again. If the performance of these models can match that of FP16 on the output side, then this is truly game-changing. The cost of building out AI Infrastructure dropped by several orders of magnitude. All we need to do is develop efficient libraries for performing matrix math using 2-Bit Unsigned Integers.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Not only that but it now seems logical to ditch the tokenizers and start using binary. That way we can train the models on existing binaries. Text in executable out and vice versa. Think about the ramifications. In the future everything will be open source, whether we like it or not.
@walterbaltzley4546
@walterbaltzley4546 9 ай бұрын
@@MichaelBarry-gz9xl By binary in and text out, I assume you mean binary in and source code out; that is an intriguing proposition. People have already predicted that AI spells the end for programmers. That is not a question of if but when. Anyone want to start a Job Death Pool?
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
@@walterbaltzley4546 I mean compiled code out. And in. Source code sure, but we can already do that. Imagine "feeding" it a copy of Microsoft Windows and then saying I want this but change X, Y, Z. Out comes the binaries, the source code, the documentation, everything. Next just say, change it around slightly so it doesn't infringe copyright. Boom.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
@@walterbaltzley4546 it's not the end of programmers, that's like saying I already have an iPhone so I don't need apple anymore. Sure but you want updates and improvements and someone to blame when it goes wrong.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
@@walterbaltzley4546 I think the best way to look at the future of AI is: ANY to ANY modality. I.E, voice in Blockbuster movie out. Video in poem out. Picture in software out. Text in video game out. Video game in movie out. And so on. ANY to ANY, is the way to see it. Also think in terms of millions and then billions and then trillions, and so on, of tokens per second. Now you should see where this is going.
@edwardrhodes4403
@edwardrhodes4403 9 ай бұрын
if one used this with current hardware, what would happen? Would it be quicker?
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
They used current hardware to build it and test it. Yes it's a lot faster, but it could be faster still
@justinnine4940
@justinnine4940 9 ай бұрын
In a computer everything is just 0 or 1. An FP16 value is just 16 one-bit digits.
@Tony_Indiana
@Tony_Indiana 9 ай бұрын
I asked Gemini for about 2 hours how it works. It was mostly unsure.
@ixwix
@ixwix 9 ай бұрын
How does one get reasonable gradients on 1/1.58 Bits?
@PaulBrunt
@PaulBrunt 9 ай бұрын
You scale up the parameters and let the network decide the precision attributed to each feature. Ternary gradients should work just as well if you up the parameters, but it removes redundant calculation. It will take longer to train on hardware not designed for it but will be way faster on hardware that is.
@WildEngineering
@WildEngineering 9 ай бұрын
I think using -1, 0, 1, 2 with 2 total bits would be a little smarter, but I see why using only one bit is easier, because multiplication on a single bit is just AND.
@petardjurkovic1015
@petardjurkovic1015 9 ай бұрын
Great explanations
@PaulSpades
@PaulSpades 9 ай бұрын
If this is true, ternary computing just found its application, finally. But I doubt this representation is useful on current hardware, or faster than 8 or 16 bit integer. Well, unless the matrix library does some mad bit banging and masking.
@mattmazurek
@mattmazurek 9 ай бұрын
how does it compare to quantized models?
@maxieroo629
@maxieroo629 9 ай бұрын
It’s better in seemingly all ways to previous quantized models. This isn’t a model specifically though, it’s a new quantization method being demonstrated with Llama. It means any LLM using this method can have the full performance of a non-quantized model (16 bit), with a size smaller than other quantizations (in storage, and in ram during inference) and a massive inference speed increase. I genuinely can’t see the drawback here
@muhammedajmalg6426
@muhammedajmalg6426 9 ай бұрын
thanks for sharing!
@claffert
@claffert 9 ай бұрын
So, going from "All You Need is Attention" to "All You Need is 1.58 Bits"?
@impactframes
@impactframes 9 ай бұрын
Great videos ❤ There goes Sama's 7T, unless this can't be used in training like quantization. I have been using 2-bit XXS quant models; they are accurate enough, and given you can build them from the ground up to be compatible as 1-bit or 1.58-bit, I think this is revolutionary. Not sure we won't need GPUs at all, even with linear mults. I did miss this one with all the noise too.
@michaelmccoubrey4211
@michaelmccoubrey4211 9 ай бұрын
Very interesting. I think I'll need to read the paper though. For the neural net to be able to approximate arbitrary functions, I would have thought you need at least some form of multiplication for it to be a non-linear model (as in non-linear classification thresholds or non-linear regressions). If they are not multiplying values with weights, then I would at least expect some form of multiplication to be done in the activation function.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Optimizing the activations is being left for future work.
@AndersTornqvistsvedbergh
@AndersTornqvistsvedbergh 9 ай бұрын
The activation function is ReLU, so that's not multiplication; it's still nonlinear enough. The attention normalization step would still need normal multiplication, for instance.
@xlr555usa
@xlr555usa 9 ай бұрын
Are we going to hit a wall with LLMs? Garbage in, garbage out; we need a system that verifies the AI is not corrupting itself, plus we need more energy to feed this monster. Is the Singularity near or far?
@ugk4321
@ugk4321 9 ай бұрын
Thank You
@wanfuse
@wanfuse 9 ай бұрын
If instead of 3 you use 4 values ("value not present at all" being the fourth), you end up with 16 combinations instead of 9, and when you actually use it in the calculation you just shift by 1 or 2. Please explain your 1.58 bits better; how do you get this value?
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
GPT 4 is your friend: The concept of ternary computing, where a bit can have three possible values (-1, 0, 1), is indeed different from the traditional binary system. The value of 1.58 bits for a ternary system comes from the calculation of information entropy. In information theory, the entropy H of a system is a measure of the amount of information required to describe the state of the system. For a system with n equally likely states, the entropy in bits is given by H = log2(n). For a ternary system with three possible states, the entropy is H = log2(3) ≈ 1.58496. So, when we say that a ternary digit is equivalent to 1.58 bits, we're referring to the amount of information it can convey, not the number of binary bits required to represent it. This is why a single ternary value is said to be worth about 1.58 binary bits in terms of information capacity. It's a measure of the "information density" that the ternary system can achieve compared to the binary system.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
As we must use binary we can encode this with 1.58 bits per digit. But if we have ternary chips, we can encode 1 digit in 1 bit. Because each bit would have an extra possible value.
@wanfuse
@wanfuse 9 ай бұрын
@@MichaelBarry-gz9xl Ah thanks, you're referring to entropy. My premise still holds: a null value can be a 4th state, but you're not concerned with storage as much as computation "speed".
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
@@wanfuse Yeah, there's always a trade-off. If we made the entire computer ternary, it would run this architecture great, but existing binaries would then take up 58% more space. So it would need to be a dedicated chip. We would store it in binary, move it to the chip where it would be stored and acted on in ternary, then export it back as binary. So always a trade-off. As to why they landed on this specific setup I can't answer, because the paper is lacking in details. It's more of a "look what we can do" kind of paper, lacking in any serious detail, because: Microsoft. Over time we will converge on the ideal trade-off. Maybe your trade-off turns out to scale better? But their trade-off absolutely blows FP16 off the map. Time will tell, and maybe we'll end up with both.
@wanfuse
@wanfuse 9 ай бұрын
@@MichaelBarry-gz9xl Thanks for your detailed replies. It isn't necessary to mimic the brain, but once you account for noise, the brain uses about 6-10 bits of information (as I calculated it); however, the brain likely takes advantage of this noise. I have recently looked into different encoding schemes and there is some overlap with this work.
@jimlynch9390
@jimlynch9390 9 ай бұрын
Wish I could give it two thumbs up. This is almost too good to be true.
@ymusicyt
@ymusicyt 9 ай бұрын
This is groundbreaking 😮
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
40 billion parameters is going to be the new 7B. With this architecture we can fit a 40B model on an 8GB graphics card. I think you should change the title back to the clickbaity one. This really is groundbreaking, the most exciting paper in a long time! We'll be hitting GPT-4 level performance once we have some good foundation models. LLaMA v4 anyone 🙏 I thought the first BitNet paper was impressive; this blows it out of the water.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Or for the 3090/4090 you can have 120B!
@BHBalast
@BHBalast 9 ай бұрын
120B models could be in the territory of GPT-3.5. Fine-tuned models of this size are perfectly good for doing truly useful stuff like calling functions, so that's great to know.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
@@BHBalast I honestly think a well designed 120B could blow 3.5 out the water, maybe GPT 3.85 ish if you know what I mean.
@TommyJefferson1801
@TommyJefferson1801 9 ай бұрын
@@MichaelBarry-gz9xl Mistral is pretty close. I mean, GPT-3.5 was before all the LLM hype, right, and after that we saw a lot of innovation.
@testales
@testales 9 ай бұрын
I like this title more, it's a lot less click-baity. ;-) Though I still wonder if this even changes anything at all. If the paper is already from October last year, it's old by today's standards, so there is probably a good reason why it has not been adopted yet.
@1littlecoder
@1littlecoder 9 ай бұрын
when compared with the previous one?
@testales
@testales 9 ай бұрын
@@1littlecoder Yes, and since people were discussing this, I thought I'd add my 2 cents of feedback.
@1littlecoder
@1littlecoder 9 ай бұрын
Thank you, appreciate it!
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
Pretraining is expensive and takes time; not many people can afford it. Hence, out of the thousands of models on the hub, it's basically just LLaMA and Mistral. The original paper was good, but the accuracy wasn't there. Now the accuracy is there, and I think companies such as Meta, maybe even Microsoft as they played a part in this research, will be scrambling to make bigger models with this. But it takes a long time and a lot of money.
@testales
@testales 9 ай бұрын
@@MichaelBarry-gz9xl What do you mean by "the accuracy wasn't there. Now the accuracy is there"? I don't think there has been new hardware or software with increased "accuracy" since October 2023. Also, the most time-consuming part of training for these companies is most likely the preparation of the datasets. Apart from the biggest models, training takes only days or weeks with the hardware that these big tech companies have. For the training datasets they could just use the existing ones. If, on the other hand, this 1.58-bit BitNet approach allows training smaller models like 7B from scratch with more affordable hardware, or if there is a way to "compress" existing models by converting or re-training them into the new format, one would expect that there would already be some open-source examples floating around.
@sherpya
@sherpya 9 ай бұрын
3 values cannot be represented using 1 bit
@nyx211
@nyx211 9 ай бұрын
No, you'd need about 1.58 bits.
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
It's not binary, it's ternary. That's why new hardware will be more efficient. In binary terms this takes up 1.58 bits, not 1
@AAjax
@AAjax 9 ай бұрын
If only the paper authors had thought of that, they could have saved a lot of time!
@MichaelBarry-gz9xl
@MichaelBarry-gz9xl 9 ай бұрын
To address the core of the confusion: 3 values can absolutely be represented with 1 bit. 1 ternary bit holds 3 values. It's only binary that holds 2 values. But because binary is so ubiquitous people often assume it is the only game in town. The moral of the story, and the thing to remember is that binary is not the only game in town. You can have any number of values in your bits (so long as you are a chip manufacturer). The values themselves are not the bits, they are properties of the bits. And a bit can have as many properties/states/values as the manufacturer desires.
@xlr555usa
@xlr555usa 9 ай бұрын
GTC is coming up; Nvidia sessions should be online soon. I've been checking them out more and more, and Remix tech is raging.
@marcosbenigno3077
@marcosbenigno3077 9 ай бұрын
Thank you. Thanks
@ysy69
@ysy69 9 ай бұрын
This seems to be a game changer
@kevintai8656
@kevintai8656 9 ай бұрын
Just wonder if anyone can reproduce their result lol
@ramus4597
@ramus4597 9 ай бұрын
Great video bro. Actually I'm doing a project in AI which converts a UI design into front-end code. Can you upload videos regarding this? It would be very useful for my project. Thanks in advance.
@2361244
@2361244 9 ай бұрын
One network with many hidden layers = deep neural network
@nilo_river
@nilo_river 9 ай бұрын
If I were CEO of Intel I would be working on this right now.
@SR-zi1pw
@SR-zi1pw 9 ай бұрын
My head is spinning; how does it perform like 16-bit?
@Kutsushita_yukino
@Kutsushita_yukino 9 ай бұрын
At least there's no "shocking" this time.
@1littlecoder
@1littlecoder 9 ай бұрын
I haven't used that word in my title at least in my last 10 to 15 videos that I could verify
@Moyemor
@Moyemor 9 ай бұрын
Wes Roth only uses clickbait ("shock", "entire industry shocked"), same as Matt. But 1littlecoder didn't use these fucking words.
@user-qr4jf4tv2x
@user-qr4jf4tv2x 9 ай бұрын
How ironic, crypto bros used to use that.
@nawabifaissal9625
@nawabifaissal9625 9 ай бұрын
shocking truly lol
@JohnMcclaned
@JohnMcclaned 9 ай бұрын
SHOCKS THE WORLD
@kiiikoooPT
@kiiikoooPT 9 ай бұрын
All I have to say is that this is in no way a 1-bit technology like you make it out to be, or even 1.58 bits, because a bit is either 0 or 1. You cannot make half a bit; or rather you can, but you need 2 bits to say that it is a half, or that it is positive or negative. If you say the values are -1, 0, 1, then to represent that you need 2 bits; there is no other way to do it.

I see another comment below where people are talking about compression and whatever, but if you give an AI a 1-hour 1GB video to learn from, and the same video in MP4 at 300MB, the AI will learn the same thing; it will not grow by 1GB or by 300MB. It learns whether the video contains the things you already taught it or not, for example whether cats or people show up in the video. It does not compress the video and put it in another file somewhere, like most people seem to think it works. Most AIs are trained on TBs of data; if it worked like that, we would never be able to run them on our basic PCs. Even the 70B-parameter models don't get to terabytes in size.

It has nothing to do with compression of images or whatever. It's like what your brain does: if you see a new movie, you will remember some scenes that clicked with you or shocked you, but you will probably not remember the color of the dress the actress was wearing. You will not compress the video in your head; what you retain are the things your brain already has connections for from other movies or real-life situations, or truly new things that surprise you or have an impact on you. Your brain remembers those, but not pixel by pixel; it's based on parameters you already know. Like LLMs. At least this is how I see it, but maybe I'm ignorant. I speak from the little experience I have with LLMs and coding.

And that is the reason I don't understand why you are calling this a 1-bit model. It is possible to do, but you would need a way bigger number of parameters to make it learn anything if the only options you have are 0 or 1 to tune its learning in each neuron, I think. It's like saying I will make an AI that learns the full 256-value range for the red channel in RGB with 1 bit: you can do it, but it needs at least 256 one-bit parameters to detect all 256 possible values, or at least half of that, since it is in bits, 0 and 1.

If there are any experts who can show me where my thinking is wrong, I will be glad to read your opinion and learn from it. Like I said, I speak from the little I know. It's weird to me seeing people say 1.58 bits, because that is not possible; there are no decimals in bits, unless you reserve another one to make it fractional, and even then it can only be a half, 0.5, or positive/negative, 1 or -1, but for that you need 2 bits like I said, not 1; it's impossible.
@Unmannedair
@Unmannedair 9 ай бұрын
One more step towards the merging of AI neural networks and quantum annealing style computing...
@zyxwvutsrqponmlkh
@zyxwvutsrqponmlkh 9 ай бұрын
Nothing is free. I have seen attempts to take 32bit models and quantize them to two bit. They produce outputs, but not particularly good outputs; typically that drastic of a quantization is inferior to a smaller model that is not quantized so severely but they do still function. I have to wonder how the output of this approach differs from what would result from a model that was trained as a 2 bit model from the start.
@aiamfree
@aiamfree 9 ай бұрын
This essentially creates different "engines"... like cars, there are V4s and V12s; you drive what you can afford! Currently they're all too expensive lol.
@clray123
@clray123 9 ай бұрын
The main issue here is these are tiny models they are playing with. There is no proof that this technique scales to anything above (in terms of perplexity/quality). It reminds me of RetNet, which was also supposed to be a big breakthrough and hasn't been released with any open weights since.
@night8002
@night8002 9 ай бұрын
Reminds me of PCM vs DSD in audio.
@nathank5140
@nathank5140 9 ай бұрын
I don’t know why they ever used floats. Seems like complete overkill. I suspect just because GPUs were originally used for gaming.
@agnichatian
@agnichatian 9 ай бұрын
This sounds like an investment scheme. Nothing in this video explains how the resolution can be reduced from 16-bit int/float to only 1.58-bit ternary. Either they didn't need that resolution in the first place, this whole time, implying that the whole industry was dumb (unlikely), or the conversion algorithm is creating many output matrices for each input matrix to make up the difference, which would just be re-representing each big word with several small words.
@NLPprompter
@NLPprompter 9 ай бұрын
Everyone, make this viral so it gets more attention; such research is the future of AI. gooooooooo
@__________________________6910
@__________________________6910 9 ай бұрын
Nice paper