Scalable MatMul-free Language Modeling (Paper Explained)

33,639 views

Yannic Kilcher

1 day ago

Comments: 114
@KostyaCholak 7 months ago
Loved that references for BitNet are 10 and 11
@eoghanf 7 months ago
Your point about estimating whether non-straight lines cross based on three datapoints is a very good one. HOWEVER, the reason for giving them the benefit of the doubt on the training dynamics side is that the *inference* time power efficiency gain (which you don't spend any time on!) is massive. From the abstract "We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency". That's pretty amazing.
@ttul 7 months ago
The FPGA angle is what's interesting about this research. The paper proposes replacing all feed-forward operations in large language models with more computationally efficient operations, mostly by using ternary weights (i.e. -1, 0, and 1 are the only allowed values). Ternary weights are basically a simple logic gate with only three permitted operations: a) change the sign of the input (i.e. flip the sign bit and copy the rest), b) output zero, c) copy the input to the output.

If your goal is to make a neural network scream on hardware, having only three simple operations to choose from means you can use simple logic gates. The researchers tried this out on FPGAs, and this is a promising area of research. From FPGAs it's not a big leap to ASICs, which nets the most power-efficient computation theoretically possible. So if ternary networks can be made to scale, everyone should be excited.

Caveats: 1. The attention mechanism is replaced with a parallelizable form of recurrent neural network, because applying ternary operations to attention does not train. 2. A linearized Gated Recurrent Unit (GRU) architecture allows for parallel computation; this is a neat trick. 3. The channel mixer (a feed-forward equivalent) uses dense layers with ternary accumulation operators.

Results show performance comparable to traditional Transformers, with better scaling properties at larger model sizes. Yannic expresses some skepticism about the projected crossover point where this architecture would outperform traditional Transformers. But I think the really interesting thing about this is the FPGA/ASIC aspect.
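To make the "only three permitted operations" point concrete, here is a minimal NumPy sketch (my own illustration, not the paper's code) of a ternary matrix-vector product done purely with selection, negation and addition; on hardware the inner loop collapses into adders and sign flips, with no multiplier arrays:

    import numpy as np

    def ternary_matvec(W, x):
        # W: (out, in) with entries in {-1, 0, +1}; x: (in,) activations.
        # No multiplies: +1 selects x, -1 selects -x, 0 contributes nothing.
        out = np.zeros(W.shape[0], dtype=np.float64)
        for i, row in enumerate(W):
            out[i] = x[row == 1].sum() - x[row == -1].sum()
        return out

    rng = np.random.default_rng(0)
    W = rng.integers(-1, 2, size=(4, 8))        # ternary weight matrix
    x = rng.standard_normal(8)
    assert np.allclose(ternary_matvec(W, x), W @ x)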
@robmacl7 7 months ago
You could also reduce some work by preprocessing the weights to just drop the zero entries, but this would be somewhat of a nuisance for a hardware realization because the work needed would vary by output element.
@hjups 7 months ago
@@robmacl7 Why would variable work be an issue? You replace a deterministic sequence with signal barriers that only occur at synchronization points in the compute graph. The bigger issue with dropping zero entries would be the extra step needed for decompression into a dense operation (e.g. stored as RLE or a Sparse format), and then aligning fetches to DRAM bursts.
@KevinHorecka 7 months ago
"stay hydrated" was a shockingly helpful reminder that I haven't drank any water today. Thanks!
@philiptren2792 7 months ago
19:15 I think the model will learn to be more efficient with the extra accuracy. We can increase the length of the vector, and the model will learn to use higher accuracy for the important values and lower accuracy where precision doesn't matter as much, avoiding unnecessary precision. It's like quantizing each and every weight of the model independently, by exactly the right amount.
@wolpumba4099 7 months ago
*Summary*

*Problem:*
* (2:30) Matrix multiplications (MatMuls) are the core of modern machine learning, but they are resource-intensive and require specialized hardware like GPUs.

*Proposed Solution:*
* (0:00) This paper proposes eliminating MatMuls entirely from large language models (LLMs) while maintaining competitive performance.
* (16:35) The architecture replaces:
* (16:35) *Attention layers* with parallelizable recurrent layers inspired by GRUs.
* (5:55) *Dense layers* with "ternary accumulation," using quantized weights limited to -1, 0, and 1. This replaces multiplication with simpler selection and addition operations.

*Key Findings:*
* (38:30) *Performance:* The MatMul-free models perform on par with state-of-the-art Transformers at scales up to 2.7 billion parameters.
* (38:30) *Scaling Laws:* The performance gap between MatMul-free models and traditional Transformers seems to decrease with increasing model size, suggesting a potential crossover point where MatMul-free models become more efficient. However, the video author expresses skepticism about this extrapolation.
* (45:00) *Hardware Efficiency:* The proposed architecture significantly reduces memory usage and latency. Implementing it on custom hardware like FPGAs, optimized for ternary operations, could lead to even greater efficiency gains.

*Author's Opinion (Yannic Kilcher):*
* (48:20) The research is exciting and promising for edge computing and energy-efficient AI.
* (48:20) He remains skeptical about:
* Whether MatMul-free models can truly surpass traditional Transformers in performance, especially for complex tasks.
* The validity of extrapolating scaling laws based on limited data points.
* The simplification trade-offs (like removing state-dependent hidden state updates), which might limit the architecture's ultimate capabilities.

*Overall:* The paper offers a compelling alternative to traditional MatMul-heavy LLMs, with potential for improved hardware efficiency. While challenges and open questions remain, it presents a promising direction for future research and development.

I used Gemini 1.5 Pro to summarize the transcript.
@interstellarsurfer 7 months ago
I guess Gemini isn't completely useless. 🤷‍♂️
@theupsider 6 months ago
That's what LLMs are for. Thanks!
@Mordenor 7 months ago
Thank you Mr Yannic for explaining MatMul free Language Modelling to your viewers!
@unvergebeneid 6 months ago
Anything that uses balanced ternary is already a superior method in my book :D
@HansKonrad-ln1cg 7 months ago
I have heard that after training you can basically throw away 90% of a network without changing the behaviour too much. That is because most of the weights are near zero, which basically means a non-existent connection between the neurons. So if you omit the calculation right away by treating it as exactly zero with the ternary values, you save a lot of time that would otherwise have been spent multiplying by zero for no reason.
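As a toy illustration of that "skip the zeros" point (my sketch, not from the paper): with ternary weights you can keep only the indices of the non-zero connections and never even read the inputs that feed a zero weight.

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.choice([-1, 0, 1], size=1000, p=[0.05, 0.90, 0.05])   # ~90% pruned
    x = rng.standard_normal(1000)

    pos = np.flatnonzero(w == 1)          # surviving excitatory connections
    neg = np.flatnonzero(w == -1)         # surviving inhibitory connections
    sparse = x[pos].sum() - x[neg].sum()  # touches only ~10% of the inputs
    assert np.isclose(sparse, w @ x)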
@pauldruhg2992 7 months ago
Why stop at ternary? Go for powers of two and bit shifting. Speed and precision, win-win.
@WalterSamuels 7 months ago
Can you elaborate?
@danielg3857 7 months ago
@@WalterSamuels He means replacing ternary logic gates with three possible outputs (1, 0, -1) with plain binary logic gates/functions, to benefit from even better math hacks, so to speak; you can do neat tricks with binary numbers/functions. Haven't even watched most of the video mind you, just reading the abstract and comments so far.
@pauldruhg2992 6 months ago
@@WalterSamuels multiplication and division by powers of two can be replaced with bit-shifting, which is faster
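The integer version of that trick, for anyone following along (my example, not something the paper itself does):

    x = 37
    assert x << 3 == x * 8      # multiply by 2**3 is a left shift
    assert x >> 2 == x // 4     # floor-divide by 2**2 is a right shift
    # So weights restricted to {±1, ±2, ±4, ...} cost a shift plus a sign flip,
    # never a general multiply.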
@adeeelh 6 months ago
+100 to the rant at 25:32 about researchers relying on tricks instead of the main idea of the paper. It's my biggest pet peeve with deep learning papers.
@RPG_Guy-fx8ns 7 months ago
If you have a layer of 64 neurons, the weights would be 16 bytes per neuron. You can use a lookup table with 256 entries instead of summing the binary digits. That way, most of the math just turns into jumps into that table, finding 2 sums to subtract. It's 16 boolean AND operations to compare the previous layer's output with this neuron's weights, 16 array lookups, adding them up as 2 totals, then subtracting the 2 totals. That would be extremely fast compared to other neural networks, but I wonder if it can match the quality of other solutions.
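A rough sketch of that lookup-table scheme (mine, and it assumes the previous layer's outputs are binarized to 0/1, with each neuron's 64 ternary weights stored as a "plus" bitmask and a "minus" bitmask, i.e. 16 bytes per neuron as above). The 256-entry table is just a byte-wise popcount:

    POPCOUNT = [bin(b).count("1") for b in range(256)]   # the 256-entry table

    def neuron_preactivation(act_bits, plus_mask, minus_mask):
        # act_bits, plus_mask, minus_mask: 64-bit ints, one bit per input.
        pos = act_bits & plus_mask        # inputs hitting +1 weights
        neg = act_bits & minus_mask       # inputs hitting -1 weights
        pos_sum = neg_sum = 0
        for shift in range(0, 64, 8):     # 16 byte-sized table lookups in total
            pos_sum += POPCOUNT[(pos >> shift) & 0xFF]
            neg_sum += POPCOUNT[(neg >> shift) & 0xFF]
        return pos_sum - neg_sum          # the two totals, subtracted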
@adamrak7560 7 months ago
Dot-product in-memory architectures would be extremely fast and efficient for inference, less so for training. So _if_ we change the architecture, there are relatively simple ways we could add a few orders of magnitude to the inference performance.
@Balorng 7 months ago
Inference speed equals model performance because, currently, algorithms like "Graph of Thoughts", extensive multi-agent systems, "smart RAG" and, most importantly, metacognition in general are extremely inference-heavy (you can generate orders of magnitude more "subconscious" tokens for each one shown to the user), and so is generating oodles of very high-quality training data to create "leaner" yet more performant models that need much less data by eliminating junk. I particularly liked the idea of creating multiple "interlocking" variants of the data designed to combat the LLM flaw of A = B, B =/= A (the "reversal curse") and otherwise their inability to truly generalize.
My pet "internal model of LLM attention" is actually DNA sequencing: a huge pattern is broken apart into small chunks and then pieced together into new patterns by having them mesh with each other using semantic-distance similarity - that explains both the strong and weak points of LLMs. While I think that using graph RAG and symbolic-logic metacognitive systems is still a must to make LLMs truly useful, simply having more patterns that are "rotated/translated" this way and that should create a better "illusion of general intelligence" at the very least...
@hjups 7 months ago
"Extremely fast and efficient" is relative. Samsung and SK Hynix already do that with their HBM-PIM, but are only able to get a 2x-3x improvement. That's at most 2 orders of magnitude (in base 2). That 2x is still valuable, but it's limited by communication depth (sum trees can't be faster than log2 N), and the technology nodes used by DRAM are relatively slow compared to CMOS.
@adamrak7560 7 months ago
@@hjups HBM-PIM is a generic processor near each pair of DRAM banks, with a quite underpowered FPU. It is not a highly parallel, specific dot-product engine, so for AI inferencing it is unsurprisingly very weak. For AI inferencing we only need a dot-product engine and very little control circuitry or registers.
@hjups 7 months ago
@@adamrak7560 That's incorrect. The HBM-PIM implementations are a special-function SIMD ALU near each bank (they have an ISA of 16 instructions or something small like that), one of which has a dot-product sum tree (I can't recall which one it was). And you do need more than just a dot-product engine for efficient inference. You also need the ability to perform element-wise addition, multiplication, and some movement operations for transpose.
@eruiluvatar236 7 months ago
I believe that you could still implement a fast "ternary multiplication" on a current GPU by using logic gates operating on multiple weights per register. Matmuls are crazy fast on GPUs, but by squeezing multiple weights together into a single register it might end up being faster.
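A quick software model of that packing idea (my own sketch, not llama.cpp or the paper's kernel): 32 ternary weights fit into one 64-bit register at 2 bits apiece, and the "multiply" is just decoding the 2-bit fields and adding or subtracting activations.

    import numpy as np

    def pack_ternary(w):
        # Encode {-1, 0, +1} as 2-bit codes {0b10, 0b00, 0b01}, 32 per uint64.
        codes = np.where(w == 1, 1, np.where(w == -1, 2, 0)).astype(np.uint64)
        shifts = 2 * np.arange(32, dtype=np.uint64)
        return (codes.reshape(-1, 32) << shifts).sum(axis=1, dtype=np.uint64)

    def packed_dot(packed, x):
        acc = 0.0
        for r, word in enumerate(packed):
            for j in range(32):                      # decode one 2-bit field
                code = (int(word) >> (2 * j)) & 0b11
                if code == 1:
                    acc += x[32 * r + j]
                elif code == 2:
                    acc -= x[32 * r + j]
        return acc

    rng = np.random.default_rng(2)
    w = rng.choice([-1, 0, 1], size=64)
    x = rng.standard_normal(64)
    assert np.isclose(packed_dot(pack_ternary(w), x), w @ x)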
@FryGuy1013 7 months ago
As someone who has written CUDA code, this is relatively straightforward to do on GPUs. So your concern that it will end up with basically the same performance as full floating-point multiplications seems kind of unfounded.
@Noxeus1996 7 months ago
As someone who has written most of the llama.cpp CUDA code, matrix multiplications on GPUs are only so fast due to specialized hardware, i.e. tensor cores. Without specialized instructions for Bitnet or whatever I doubt that the performance will be (much) better than just doing dense 16 bit matrix multiplications unless you also quantize the activations to 4/8 bits.
@clray123 7 months ago
What I missed in the video and in the paper is an interpretation of replacing the weights with -1, 0, 1. And that would be: the matrix multiplication xW is just the calculation of n vector dot products - one dot product between x and each row of W. A dot product of two vectors is maximal when the vectors point in the same direction, minimal when they point in opposite directions, and 0 if they are orthogonal. So it's basically deciding "let's glue all the KQV vectors, whose direction we compare with x, to the base axes (of the coordinate system), rather than allow them to point in any direction". I think that's what they call "privileged bases" in interpretability research. But given that you can only fit so many orthogonal vectors in n dimensions (and a lot more "almost" orthogonal vectors), it feels like it should impact the ability of the model to uniquely represent inputs.
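A tiny numerical illustration of that reading (my example): a dot product with a ternary row is just a signed subset-sum of x's coordinates, i.e. a comparison of x against an axis-aligned ("privileged") direction rather than an arbitrary learned one.

    import numpy as np

    x = np.array([0.5, -1.2, 3.0, 0.1])
    w_row = np.array([1, 0, -1, 1])            # one ternary row of W
    assert np.isclose(x @ w_row, x[0] - x[2] + x[3])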
@pavalep 6 months ago
thanks for the informative vid :)
@brendawilliams8062 4 months ago
I wish I had a new one everyday
@hjups 7 months ago
With usefulness, there's still an underlying assumption that 1) the comparable performance will hold with increased scale / specialized models, and 2) properties required for improved reliability in transformers also translate to this architecture. My guess is that (1) depends on the task / benchmark, and (2) is unlikely to occur (SSMs are missing some of these properties), which will set an upper bound on the model size and usability. That said, this approach is probably applicable for more classical NLP tasks which are easier than generative AI, and maybe some sort of low-effort HCI (e.g. take this JSON packet and convert it into a human understandable response).
@ronhightower6549 7 months ago
Hopefully the research community gets these fundamental improvements figured out before Sam Altman spends a trillion dollars on data centers running Nvidia MatMul devices.
@danielmewes 7 months ago
Might still need it for training?
@TheNerd484 7 months ago
It would be funny if this happens like a month after he buys them. It would also mean we get a lot of cheap compute cards
@eadweard. 7 months ago
​@@TheNerd484Resentment-powered compute.
@clray123 7 months ago
Too late. Also, Anthropic spends substantial resources on interpretability of transformer-based models. As far as I'm aware, these interpretability gains do not translate easily into other architectures.
@jswew12 7 months ago
@@danielmewes Correct me if I am wrong, but isn't training also possible on the FPGA they introduce? It's been a couple of weeks since I read the paper and I haven't finished this video, but I could have sworn that all the operations they need for training are programmed into the FPGA and are shown to be better than the GPU equivalents. Could it be a problem of scale, maybe?
@eoghanf 7 months ago
I would really be interested in knowing more about how the Straight-Through Estimator allows these things to train. That's the big mystery to me.
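For reference, the usual recipe (a minimal sketch assuming PyTorch, not the authors' exact code) keeps full-precision "shadow" weights, quantizes them to ternary on the forward pass, and pretends the quantizer was the identity on the backward pass, so gradients still reach the shadow weights:

    import torch

    class TernarySTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, w):
            # Roughly BitNet-style: scale by mean |w|, round to {-1, 0, +1}.
            scale = w.abs().mean().clamp(min=1e-8)
            return torch.clamp((w / scale).round(), -1, 1) * scale

        @staticmethod
        def backward(ctx, grad_out):
            # Straight-through: act as if the forward pass were the identity.
            return grad_out

    w = torch.randn(4, 4, requires_grad=True)   # full-precision master weights
    TernarySTE.apply(w).sum().backward()
    print(w.grad)                               # non-zero, despite the rounding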
@alan2here 7 months ago
Evolution, the models are the species, we cause mutation and are also the environment, speciation is common.
@jmirodg7094 7 months ago
It is only a first attempt; I'm keen to see the follow-up papers...
@josehugoelsas8699 6 months ago
One important thing to notice is that this approach trades very regular, very high numerical-intensity matmuls for very sparse, very memory-irregular filtering operations to do the ternary if-statements. For me it is not clear whether this will yield any improvement over present GPU or other accelerator architectures. Also, it relies heavily on quantization, which can be fragile depending on the situation. That is not much of a problem for inference, but it can be a problem for training. Multiplying floats, especially dense matrices, is cheap; what is expensive is moving data, and I don't see how this paper improves on that front.
@serhanciftlikci3651 7 months ago
I think it all boils down to the classical bias-variance tradeoff. Using ternary weights results in a biased model (hence the big loss gap compared to the transformer at the start). They can add more weights, but that would remove all the gains at inference. If they can also find a component to increase the variance of the system, it may be the new way to train LLMs in the future.
@abdulshabazz8597 6 months ago
This algorithm can be further adapted to arbitrary, non-binary bit-arrays to improve their performance, by first factoring the RHS matrices into primes, which are then essentially viewed as unary values, and summing each tensor of primes and their products in parallel...
@WalterSamuels 7 months ago
Look into VSA (hyperdimensional computing), and balanced ternary notation.
@jimbo8853 7 months ago
Devs learning linear algebra to upskill for AI in shambles
@Decocoa 7 months ago
Joking aside mate, why would devs need linear algebra for AI? Surely the basics from high school should be sufficient? You abstract away the layers and optimisers with TF?
@jamescunningham8092 7 months ago
@@Decocoa To be truly effective in an environment where the state of the art changes all the time, you need at least a little understanding of how things work. Without any understanding of linear algebra you'd be at a big disadvantage.
@coversine479 7 months ago
@@Decocoa if you don't know LA and Calculus you can't understand AI papers. Period. But if you are just an application developer using someone else's AI API obviously you don't need to know how it works internally to use it
@VladMysla 7 months ago
30:26 In the hidden state it actually does depend on the previous state to select what to forget.
@ssssssstssssssss 7 months ago
I saw this the other day and really liked how they claim not to be doing matrix multiplication while still doing matrix multiplication. It's just an efficient implementation of a special case. It makes me feel a bit disappointed despite the contribution of the paper looking to be quite solid.
@brendawilliams8062 4 months ago
Curves. Agreed
@alan2here 7 months ago
PCs today can fit a ternary value into 2 bits, utilising 75% of the available states, and compute with it fairly efficiently. Maybe not so practical to compute with, but 3 ternary values also fit into 5 bits, giving 84%, and 10 ternary values fit into 16 bits (2 bytes), utilising 90%. 😮
@alan2here 7 months ago
Unfortunately a power of 2 never equals a power of 3 (other than 2^0 = 3^0), so the packing can never be perfectly efficient.
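To make those packing fractions concrete (my own toy, not from the video): base-3 encoding puts 10 ternary values into one 16-bit word, using 3^10 = 59049 of the 65536 available states, roughly 90%.

    def pack10(trits):                      # trits: 10 values from {-1, 0, +1}
        word = 0
        for t in reversed(trits):
            word = word * 3 + (t + 1)       # map {-1, 0, +1} -> {0, 1, 2}
        return word                         # 3**10 = 59049 < 2**16 = 65536

    def unpack10(word):
        out = []
        for _ in range(10):
            word, digit = divmod(word, 3)
            out.append(digit - 1)
        return out

    ts = [1, -1, 0, 0, 1, -1, -1, 0, 1, 0]
    assert unpack10(pack10(ts)) == ts and pack10(ts) < 2**16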
@JBoy340a 6 months ago
The FPGA is interesting. It would be interesting to see what this means for portable real-time devices.
@VincentKun 7 months ago
About data dependency: did you see the "Illusion of State in State-Space Models" paper? Every time they try to get to something recurrent, they lose parallelization, and state dependency is one of those cases.
@hasko_not_the_pirate 7 months ago
19:20 Isn't the essential trade-off that they encode learned models in a 1.6-bit "ternary" data type rather than an 8-bit, 16-bit, or 32-bit float data type for the weight matrix? It seems likely that you would need roughly 20 times as many weights to encode the same information as a float32 weight matrix, which would then increase compute complexity accordingly.
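That 20x is just the information-theoretic count (my arithmetic, under the simplifying assumption that capacity scales with raw bits):

    import math
    bits_per_trit = math.log2(3)            # ~1.585 bits per ternary weight
    print(32 / bits_per_trit)               # ~20.2 ternary weights per float32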
@sentinelav 7 months ago
40:25 "More bang for your flop" 💀
@JoeTaber 7 months ago
I wonder if a Tenstorrent device would be able to process these operations efficiently.
@AleksandrUmnov 7 months ago
6:24 the pigeon moment
@hannesstark5024 6 months ago
Using a straight-through estimator sounds to me like, for both the forward and the backward pass, we still need to compute everything in floating point, and then we quantize the gradients to the level of our weights. So we would have no compute-efficiency benefits. Does someone know what I am missing here?
@TheNerd484 7 months ago
IMO, if any architecture will yield actually intelligent AIs, it would look very similar to this. I think training would be the main hard part. I'm of the opinion that if this model were trained such that it does not have to output a token on every iteration, you would see significant performance improvement basically for free.
@rockapedra1130 7 months ago
18:16 I like the duplication hack. I wonder if brains use that. Synapses would be +1 = excitatory synapse, -1 = inhibitory synapse, 0 = no synapse, other numbers = multiple synapses. Maybe. Who knows. LOL
@LuizFernando-hv1td 7 months ago
I think you would be interested in looking into SNNs! From what I understand, when you include the time dimension, something like this happens in the form of spike frequency.
@rockapedra1130 7 months ago
@@LuizFernando-hv1td Hey, that's pretty cool! If we add spiking frequency and an "integration window" to the mix, then it works even better! Then we can do: spike freq * int window * (num exc synapses - num inh synapses) = value! That allows arbitrary precision with ternary synapses. If I were a brain engineer, I'd do that! Probably everybody does already ... Lol.
@clray123 7 months ago
I have a nagging suspicion that the attention complication they do after the ternary quantizing of the QKV weights is there to recover (as in "store elsewhere") the same weights that they claim to have dropped...
@ekstrapolatoraproksymujacy412 7 months ago
The attention layer is needed for in-context learning, and in-context learning capability is strongly correlated with intelligence; architectures like RWKV struggle with this. Looking at the loss and most of the current benchmarks is very misleading regarding actual performance: those things mostly measure how much the model remembered, not how well it generalizes. That's why nobody really uses those "modern RNN" thingies; they only look good on paper, not in practice.
@clray123 7 months ago
I don't understand putting this linearized architecture in the same basket as state-space models at 30:22. The (selective) "accumulation of the past" in state-space models (specifically Mamba) makes the next state data-dependent (namely on all the selectively accumulated past data). Not just on the next token. Or are you saying that because of the selectivity newer tokens may have no chance of using information from older tokens that have been rejected by selection (but this is kinda the tradeoff for not having to maintain a KV cache of indefinite length).
@FredericoKlein 7 months ago
A multiplication by 2 is just a bit shift in binary (in floating point, it's just adding 1 to the exponent, isn't it?). So they could have done 2, 4, 8, ... and -2, -4, -8, ... couldn't they?
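Right - for IEEE-754 floats a multiply by a power of two only touches the exponent field. A quick check (my example):

    import math, struct

    x = 3.14159
    assert math.ldexp(x, 1) == x * 2         # scale by 2**1: exponent bump only
    assert math.ldexp(x, -3) == x / 8        # exact, since 8 is a power of two

    def exponent_field(f):                   # raw 11-bit exponent of a float64
        return (struct.unpack(">Q", struct.pack(">d", f))[0] >> 52) & 0x7FF

    assert exponent_field(2 * x) == exponent_field(x) + 1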
@TheTruthOfAI 7 months ago
This paper is wild as hell... even coming out with an FPGA solution. To be honest, it's one of those papers that I don't fully, entirely, 101% grasp. I did try some of this ternary approach; according to the "book", its numerical floating-point precision on, for example, 13 operators reaches 100% of float16 precision. Truth is, on the battlefield it doesn't perform well in my experiments.
@tarumath319 7 months ago
A lot of people talk about BitNet and this improvement over it, but the big guys in AI like OpenAI don't seem to care about it.
@clray123 7 months ago
Sunk cost fallacy. The hardware they've already paid for needs to be amortized first. It's very difficult to admit to investors they've burnt so much money by committing to an unripe architecture.
@fiNitEarth 7 months ago
Well, didn't they compare their model to Transformer++, which also quantizes its weights to ternary?
@hermannschmidt9788 7 months ago
Bitcoin mining ran on GPUs first. Then came the FPGAs, followed by ASICs. I wonder if this progression will apply to transformer networks as well. That would put Nvidia out of business. Calculating a hash value is a simpler task, however.
@clray123 7 months ago
Why do you think Nvidia would be incapable of manufacturing (and foremost patenting) these other circuits?
@hermannschmidt9788 7 months ago
@@clray123 I just followed the mining analogy. They stayed with the GPUs, which is their core competence, and gave away this business.
@albinoameise 7 months ago
But your idea of simply repeating the input tokens for attention does not necessarily result in too many tokens, because you can use this np.where operation once, in a step beforehand, to thin out the input tokens with a ternary thinning matrix, and then replicate and "attend" only to those with values > 0. So I find your idea at least worth trying!
@aitarun 7 months ago
The 1-bit and 1.58-bit LLM papers came out a while ago. I wonder why these models are not available yet. There are quantized models, but no model is available that was trained at 1 or 1.58 bits. Seems like accuracy-related issues keep them from being worthy of their full-precision counterparts.
@MrBioloidboy 6 months ago
Sentient ai is here! Can I try brain tech data science integrations now?
@norlesh 7 months ago
How does this affect the GPU-poor such as myself (humble RTX 2080)? I'm wondering how this would perform implemented as something like llama.cpp, tailored to run on CPU and system RAM, with the GPU just for icing when available.
@JBoy340a 6 months ago
Yes. As a fellow 2080 owner I often run into issues with resources. It would be nice to see these sorts of issues go away.
@erickmarin6147 7 months ago
Been trying to verilog something like that myself for a while
@aneeshprasobhan 7 months ago
NVIDIA's shares rely on this paper not getting too much attention xD
@tarumath319 7 months ago
They would just need to add ternary accelerators and maybe more int8 ones.
@eadweard. 7 months ago
Is that a pun?
@aneeshprasobhan 7 months ago
@@eadweard. i tried xD
@kazedcat 6 months ago
Nvidia could just add a ternary operation to their GPUs. It is super simple hardware: "copy if 1, zero out if zero, and negate if -1". They would only need to add a single new instruction, VTerAcc ("Vector Ternary Accumulate").
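The reference semantics of such a lane-wise instruction would be trivial; here is a NumPy model of it (note that "VTerAcc" is the commenter's hypothetical name, not an existing PTX/SASS instruction):

    import numpy as np

    def vteracc(acc, x, w):
        # w in {-1, 0, +1}: copy, drop, or negate each lane of x, then accumulate.
        return acc + np.where(w == 1, x, np.where(w == -1, -x, 0.0))

    acc = np.zeros(4)
    x = np.array([2.0, -1.0, 0.5, 3.0])
    w = np.array([1, 0, -1, 1])
    print(vteracc(acc, x, w))                # [ 2.   0.  -0.5  3. ]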
@charstringetje 7 months ago
Am I the first to see that Q=K=V, and that we can reduce all MatMul to ⅓ the current operations without introducing other operations? 🙃 3:44
@charstringetje 7 months ago
Oh, I spoke too soon... Handwaving follows.
@clray123 7 months ago
The weight matrices are "obviously" supposed to be different, but in some cases the same K and V submatrices are reused for subsets of Q (or for all Q), indeed leading to memory savings (although not to 1/3). See papers on multi-query attention (MQA -> all Qs share same KV) and grouped-query attention (GQA -> some Qs share same KV).
@evilby 7 months ago
TTT on the way?
@khaledbouzaiene3959 7 months ago
I wish you would explain the FPGA or ASIC part: how this is done using addition or element-wise operations instead of matrix multiplication.
@hjups 7 months ago
The authors don't go into detail nor is the RTL code in their repo. From their description and diagram, it's a stand-alone DMA unit, which takes in the address of the ternary matrix, the address of the activation matrix (most likely), and the address of the destination matrix (most likely). Then it fetches a column of the transposed ternary matrix to store in a local buffer, and streams the rows of the activation matrix into an accumulator, which then gets written back to the destination address.
@cherubin7th 7 months ago
Nvidia is cooked
@cherubin7th 7 months ago
jk
@adityashukla9840 7 months ago
Can you please make a video on DUCK net
@bjarke7886 7 months ago
ESM3 ESM3 ESM3 ESM3 ESM3 ESM3
@kop-lg7lo 7 months ago
Kinda cool, but surely we're not ready for this type of architecture.
@mrpocock 7 months ago
Is this not an opinionated ReLU?
@seanreynoldscs 7 months ago
I'm calling BS. They are approximating the floating points by having overly large weight matrices. This paper could also be called "having a smaller network sometimes outperforms a larger network on small datasets".
@g_glop 7 months ago
MatMul? i'm allergic
@christospapadopoulos7894 6 months ago
8 authors for a scientific paper is absurd; at this point, who even is the main one?
@Navhkrin 7 months ago
Big doubt this approach scales. It gives me vibes of the kind of research that works for that one specific, tailor-engineered scenario and sucks for everything else. Otherwise we would have seen a significantly higher number of experiments in various settings.
@clray123 7 months ago
That is one stupid argument to make, with that approach you can disqualify any new idea ("the idea must obviously be bad otherwise we would have seen it before").
@deltamico 7 months ago
It's more like "the idea must be bad, because otherwise the authors would have been willing to explore its capabilities in different settings", which is not always true but absolutely has grounds.
@clray123 7 months ago
@@deltamico But this whole "but does it scale" argument assumes the researchers have infinite money to burn on hardware. They obviously don't, that's why they explore new ideas with smaller models.
@Jononor 7 months ago
Integer quantization is standard practice in edge/mobile/TinyML. Sub-byte quantization and even binary networks have seen considerable research in the last decade. Most of that research has been on CNNs; Transformers and LLMs have not seen as much of it yet, but it is coming. No one knows whether ternary or no-matmul will be the best representation, though...
@naromsky 7 months ago
That's one boring paper.