The Economics of AI are Failing, But We Can Fix It (With Lasers)

Рет қаралды 21,904

Күн бұрын

Пікірлер: 97

@thedevincrutcher Ай бұрын

Training hardware and electricity is far more expensive than what's needed for inference after the model has been trained. We will eventually discover a point of diminishing returns, especially as systems ingest more synthetic data that will eventually lead to model collapse. But once the training is done, what can we do with these machines? ASICs, by nature, can only perform one type of operation. Do they become the ultimate form of eWaste?

@SquintyGears Ай бұрын

I was at a conference yesterday where one of the panelists said he didn't believe in the self ingestion collapse. His argument was that the content that is propagated is curated. A person picked the best outputs or the ones that had a quality that they liked. Now I personally think that is a way too optimistic view. Loads of unfiltered trash is getting uploaded constantly. But I do think it's a good enough argument to make me question if models will ever collapse completely. It feel like they'll just enshittify, but frankly... So does everything else... Also the metrics shared from an environmental researcher (same conference different panelist) about 3M gpu hours to train an LLM and then depending on the size of the resulting model 200 to 600M queries to match that training energy use. And you can napkin math what that means based on where any research lab is or datacenters running these tasks.

@mzamroni Ай бұрын

they will use it to train larger model i don't think meta throws away hardware they used for llama 2

@musaran2 Ай бұрын

IMO AI research alone can keep those ASICS busy for their lifespan.

@aniksamiurrahman6365 Ай бұрын

@@SquintyGears Hi guys. I'm a noob. So, pardon my stupidity. Why not we try to use ML models to discover/establish a theory of how these things actually work? Say, why use LLMs just to parrot language and not figure out a theory of it? Wouldn't that be a worthwhile investment as such will not only reduce computation needed, will end the need of further training. Just as you don't need to train your rocket on Newton's Law.

@SquintyGears Ай бұрын

@@aniksamiurrahman6365 mathematically speaking what LLM's do is probabilistic guessing. a guess on how things work is also basically how humans came to conclusions on gravity and everything else. but we aren't able to make the computer run proper tests. in models that have tried this the computer ended up editing the limits of the test rather than work harder to get the desired output values. and because of a number of different mathematical proof, you are more incredibly more likely to fall into a "good enough" final result rather than the actual ground truth. That's why if your goal is finding out the underlying theory, machine learning will never find it. If you're trying to get something that will help engineers build something with tight tolerances based on test data they collected, it will almost always find a solution that fits the desired precision.

@elevul Ай бұрын

I mean, isn't the final goal to reach AGI so that you can replace employees and thus sell B2B? I assume the current B2C side of the business is just to make revenue while they're rushing to the final goal of AGI, after which B2C will not really matter anymore.

@ChrisJackson-js8rd Ай бұрын

i have known a handful of of organizations that have started to move their customer service away from call centres towards ai chatbots in various cloud providers. it seems to be more than 10x the price of human labour. i am aware of one company that got a surprise bill one month to the effect of 20 million USD

@cheezyuser Ай бұрын

28 minutes, this will be a good listen.

@eruiluvatar236 Ай бұрын

I believe that in memory compute + analog matrix multipliers has way more potential than trying to scale the current architectures. There is several orders of magnitude of efficiency and density improvements possible in there. Imagine a chip that combines flash cells for the weights, DRAM cells for the inputs and activations (and the gradientes for training ) and that does an analog matrix multiplication followed by a ADC and a more general purpose gpu like thing to do things like activation functions. The internal bandwidth between those components could be crazy high and even at moderate clocks it could be real fast. You wouldn't need to fit a whole model per chip but fitting a whole layer would be great as the bandwidth to communicate between layers is much smaller.

@Wobbothe3rd Ай бұрын

As Bill Dally has explained many times, analog multiplication fails whenever the data has to eventually be converted back into digital form. It's one of those things that sounds like a good idea but can't be done in practice. Apparently Nvidia has tried this many times and the hurdle of AD conversion can't be overcome.

@SquintyGears Ай бұрын

This isn't possible. Even putting dac and adc problems asside. You're suggesting building a machine to the specific neuron network topology spec. It's inherently a single purpose machine at that point. Bandwidth doesn't matter if you need to scrap the whole hardware every time you update the model. And if the chip dynamically scales it's input and output space and loads in weights per query it's not faster than a GPU.

@eruiluvatar236 Ай бұрын

@@SquintyGears Kinda but not really. It only assumes a that there is a matrix multiplication and a maximum number of weights and activations per layer (and even that can be worked around). Those assumptions are true of most neural networks. As for the specific topology, you would need at least as many chips as you have layers. If the topology can be static it is easy but if you want to be able to change it, some kind of bus or routing between the chips is needed, it could look like a small fpga or even a cpld or several of them in a mesh for a larger scale (same thing can be built in inside of the chip affording more flexibility, maybe allowing chips to process more than one layer if the layers are small). This of course would mean that the hardware would be under utilized if the model is smaller than what the hardware can do and if the model is larger, it won't work (at least with a naive implementation) but that already happens to some extent with GPUs. Regarding the DACs and ADCs issue, I am aware of it but I believe that it can and will be overcome. If you do enough work without converting (like a whole matrix multiplication of a large matrix) the overhead doesn't seem too bad. Specially in the context of neural networks, at least for inference where you don't need much precission. I don't think that what I proposed here is any less flexible than google's TPUs.

@SquintyGears Ай бұрын

@@eruiluvatar236 It's more complicated than you seem to think to have partially occupied layers in a system like this. The multi connectivity of neural networks means that it will throw off the output measurements completely and probably even reflect signals in the dead ends. Grounding doesn't magically happen internally. And you have to be aware how silly it sounds to say you would just line them up on a bus. Having a package for each layer is like walking back all the way to the 80s. Your memory card was a bunch of 10pin packages side by side and the whole paper sheet sized thing was 256kB. No matter how you run the bus it can't be faster. Even if you imagine that this system would have virtually no actual computing execution time, you've introduced so many sources of signalling latency that it'll always end up slower than the current paradigm. Load, wait, execute, wait, output read. Replacing computation time with hyper complex tracing and bus topologies is not likely to be a valid solution. It's not stupid to consider something like this at all. But when you take a close look at the complexity the industry faces implementing the existing bus standards and upgrading them each generation... Ddr, pcie, it should be clear what the trade off we're talking about is here.

@mzamroni Ай бұрын

but how fast will be the npu that Samsung or micron can put in the dram modules? 45 tops npu in lunar lake is almost as big as the 4 p cores i prefer npu card with 8 dimm slots and large sram cache. 8x 125 GB will suffice llama 405B model in native bfloat16. large sram cache is what GeForce 4000 does to compensate its slower/narrower gddr bus soldered hbm or gddr is much faster but loading large model needs several accelerators. the interconnect is much slower than dimm slots. for example, 1 ddr5 slot is equivalent of pcie5 x16 lanes

@jimiscott Ай бұрын

Most, most companies do not need LLMs. Most companies want a fine-tuned small, or medium sized model (something < ~8B parameters) for their specific needs. You can run a small or medium model on an A5500 and that can service many multiple users.

@TechTechPotato Ай бұрын

As mentioned, this changes when you move to an agentic workflows. A dozen or more specialised 8B models adds up

@tommihommi1 Ай бұрын

@@TechTechPotatobut LLMs talking to LLMs instead of well defined APIs is just stupid, error prone, and crazy inefficient

@axiom1650 Ай бұрын

To me this sounds like the equivalent of this in the 80's/90's: Most, most companies do not need powerful desktop or portable computers. a few low-powered desktops with a single moderate mainframe for their specific needs. When models will outperform humans on any thinking related tasks, we'll continue to exponentially build out compute infrastructure I believe. Even several percent increase in model intelligence will continuously be worth chasing with billions of dollars.

@TechTechPotato Ай бұрын

Tommi, imagine three specialised 8B models working end to end to replace one 405B model.

@erkinalp Ай бұрын

@@TechTechPotato just 24B strong, but stronger than the 405B model; how could that work?

@novantha1 Ай бұрын

I’ve seen an interesting trend in AI research if you’re on the floor working in training models and so on. In 2017 all the money in the world would not have gotten you a modern AI model’s performance. In around 2019, or 2020 or so, we had GPT 1 and GPT 2, which were kind of the first modern LLMs, and it took a large research team, and a lot of money (perhaps in the millions) to train it, and yet we look at those models and think of them as almost unusably small. And yet, we can replicate those models for under $100 today. This is due to better training dynamics, better hyperparameters, better GPUs, better frameworks, better data, and more. Stable Diffusion certainly costed either tens or hundreds of millions to make originally, and nowadays, one can produce an analogous (or even superior) model for anywhere between $100-6000 depending on your existing hardware and patience. I don’t really think that we’ll ever see a complete drop off in AI, but we might see a stop in frontier-class (AGI lab) models. If a hobbyist can run around and replicate previous flagships but trained from scratch for specific use cases (as well as produce models in new modalities), then I don’t see why an enterprise can’t do a slightly larger model for their own internal use case.

@notsojharedtroll23 Ай бұрын

We tend to forget that these are technologies and by extension, tools. We should learn about them, see its pros n cons and apply them when needed

@antonystringfellow5152 Ай бұрын

Running software neural networks is insanely energy wasteful. There's no way we'll ever achieve affordable/practical AGI this way. We have to find a practical way of storing the weights and biases in the hardware substrate, a type of neuromorphic processor design where energy is only used as required to perform each task. I know this brings its own challenges but I can't see much of a future for AI that relies solely on software.

@autohmae Ай бұрын

people are designing and building specialized hardware, but most of them fail as the way the models work/demands/requirements change keeps changing to fast.

@lost4468yt 8 күн бұрын

It'll be used massively if it reaches superhuman? Or even subhuman at increased work rates. Once you reach those you have things that can simply do more than humans. Look at how many insane things humans scale crazily for, now imagine a 24/7 actor that's even better or simply faster. It becomes a requirement to use it all over the place? If they keep on scaling like that then you're talking about entirely new regimes where it's simply required. It has got to a position where Microsoft and Google are funding nuclear fission reactors... That's simply an insane requirement. Do you think Google doesn't value a developer that can output 5x the amount of high quality code etc and costs $1 million/year, when the top developers can already be hitting $400k/year? People are really underestimating how much these costs can scale. The 5x the output thing is simple, but what about when that output is simply qualitatively better? Then you enter an entirely new regime of costs. Can the US afford to not switch to model based fighter pilots if they're qualitatively better? I mean they immediately become qualitatively better on maximum g force, and then you start building airframes to handle way more than humans can. These all have the potential for huge amounts of scaling on the cost side...

@paulbrooks4395 Ай бұрын

At those power levels (even 5x, let alone 20x) we're talking about rapidly accelerating global warming in terms of power demands. This AI revolution, at scale, could be analogous to the crypto boom, but on a far larger scale if the AI gold rush continues on into the future. Musk recently was talking about a new datacenter for AI that runs on 12-14 ish diesel generators so it's not on the grid. This isn't sustainable.

@lost4468yt 8 күн бұрын

But it's also funding huge levels of fission now, with entirely new economic regimes? Yeah duh Musk is going to suggest the dumbest thing. But realistically chasing fission seems to be the better solution. This also has the ability to solve climate change in multiple ways that weren't thinkable before. We can't just look at the potential energy scaling and go on that only? If we get the ability to build fission reactors rapidly and on a cost scale that deal with modern power economics, then that's an unimaginably large win for climate change?

@slowskis Ай бұрын

I am interested to see how far the infrastructure gets pushed before the industry right sizes. It would be funny to me if all this money gets spent to build out inference infrastructure in the cloud using data centers and afterwards inference compute gets done on local machines anyway.

@alexmacgregor Ай бұрын

Super insightful Ian!

@XYang2023 Ай бұрын

If you can run an agentic workflow with a SLM on an edge device, it might be more cost effective.

@couldntfindafreename Ай бұрын

Yeah, if you can still get useful results based on lower quality inference.

@MrKelaher Ай бұрын

definitely agree, particularly as corpus and resultant models specifically tuned to "take actions" and "do reasoning with rollback" become more common. A 15W TPD unified memory ARM based edge node or mobile phone can already do 40Tops on 7b-ish models and that can do amazing things already - way more impressive than the "cloud tethered" latency impacted things like Rabbit R1.

@aldarrin Ай бұрын

Wouldn't optical lines be less prone to interference than copper traces? Will we see these in consumer MBs in a few -> 10 years?

@noname-gp6hk Ай бұрын

Consumer motherboards are driven by cost more than any other factor. There really isn't anything in a desktop PC that needs optics, it doesn't solve any real problems and if it increases cost then I don't see a use for it in that application.

@aldarrin Ай бұрын

@@noname-gp6hk So, there are absolutely no signal issues with memory and/or PCIe lane traces? None whatsoever?

@SquintyGears Ай бұрын

@@aldarrinYeah but I think his point is more that they'll scale down the capabilities and segment these features in HEDT instead of making them mainstream. And "you'll be happy with the 5% improvement we give you"...

@chrisrogers1092 Ай бұрын

Optical connections are not limited by the speed of light. In a fiber optic link for instance the light bounces around the tube so much that its actually comparable to a copper link in latency.

@sloanNYC Ай бұрын

It definitely makes me wonder why so many are investing so much $$$ in the hardware right now. But I guess they don't want to fall behind in market share.

@mzamroni Ай бұрын

as written in the meta hugging face page, creating midrange level llama 70B model needs 100 years in single Nvidia dgx server. the chat gpt equivalent llama 405B model needs more than 400 years that was why they bought so many dgx. but amd etc. is catching up. open ai etc. don't want to be locked into Nvidia too

@jpvalverde85 Ай бұрын

Optics will enable scaling, but if i have to take a guess, playing smart on data locality can bring more effective solutions cost and energy wise. Huge models may bring some benefits and make the technology thrive but we already are going into diminishing returns without even covering the costs.

@Pavlobot5 Ай бұрын

when are we getting optical DDR traces in the motherboard?

@MrKelaher Ай бұрын

InfiniBand ? Has supported scaled optical distributed compute/shared memory model for some time ? As to the value chain - I have significant use-cases that demonstrably save money and time for tasks humans are slower/worse at with current model API costs and model capabilities. I actually think existing models are barely explored from a value POV even as new models generate even newer and "maybe" needed capabilities. Another concern is that perhaps API costs are "discounted" at the moment and that enshitification will ensue. Finally, aware of workloads that worked well on certain proprietary models that are no longer work well as their endpoints where replaced with updated "sort of identical" fine-tunings that were WORSE for those use cases, leaving people the scramble to find other models for the same tasks.

@TheBackyardChemist Ай бұрын

How about adding support for CXL memory expanders into GPUs?

@Nobody_Of_Interest Ай бұрын

I think MI300X has CXL support. Whether anyone uses it effectively, is anyone's guess. The bandwidth may be too low to be practical for LLM inference.

@novantha1 Ай бұрын

@@Nobody_Of_InterestTo the best of my knowledge, CXL and bandwidth are…Weird. I haven’t played around with any modules myself (I don’t have $4000+ to drop on an experiment), but from what I’ve read I’m pretty sure the bandwidth works additively, in the sense that to some extent you can add the bandwidth of the CXL module’s PCIe connection to the memory bandwidth of the accelerator (or CPU, making them surprisingly viable inference and training devices), but there’s obviously some sort of limitation on the amount of bandwidth you can add to a single PCIe device (defined presumably by the PCIe slot it’s hosted in), so there’s either a premium ratio between CXL expanders and PCIe devices, or you have to get more out of it via software and the architecture of your AI model. As an example, you might imagine a Mixture of Experts model where a Transformer’s feed forward network is replaced by several small MLPs and the appropriate expert is selected for each token inferred. As each expert is smaller, it’s less “expensive” to load it, and by virtue of allowing the experts to specialize, loading the is more “valuable” than loading the same number of parameters in a dense network. This is a weird way to look at it, but you might say that it allows for a higher “effective” bandwidth, or “software enabled” bandwidth, and so you could see storing the expert parameters in CXL expanders, and leaving the main parameters on-device. There’s a happy synergy there where CXL expanders usually have a lot more capacity per unit of bandwidth (similar to the tradeoff seen in CPUs), and the backwards pass is calculated per-expert, so you could also store the gradient in the CXL modules, as well, meaning you could potentially train a monstrously sized model in a single accelerator, like an MI300X. Back of the napkin math suggests with 64 32GB CXL modules (obviously this is a ridiculous configuration but bear with me), a single MI300X could probably train a 100B parameter model (in contrast to its base capacity of around a 4-8B model depending on exact setup), though I think in this specific configuration it would have to be a low-performance MoE (mixture of experts) because I think you would have to limit to best-of-one/topK MoE, whereas generally a fine-grained mixture of experts with multiple experts usually works better.

@notDacian Ай бұрын

I remember Intel had Silicon Photonics for years now, never heard about it being used in any products thou, wonder why?

@cryptocsguy9282 Ай бұрын

@notDacian most things related to silicon photonics that I've heard about is all related to research in chip design, I agree that I've heard nothing about such devices being used commercially

@TechTechPotato Ай бұрын

They mothballed their programmable tofino switch business, that was meant to be their lead CPO platform. Now it's a bit of a mix - they're showing stuff but it's the same as a few months ago? I can't get a good read - even former employees of that division are baffled

@Aziqfajar Ай бұрын

Sounds as if they're gagging on the tech, whether to continue or just throw it out. Def not my kink though

@quibster Ай бұрын

it can be argued now that manufacturers, mainly nvidia, have had a taste of real business customers, that they have forgotten their legacy- value proposition. its been a few generations of gpus now from nvidia but also in general, where customers are only receiving minimal gains, or equal performance to previous flagships. that doesn't go unnoticed. everyone in gaming is still talking completely unjokingly about the lack of any price:performance generic graphics cards, because nothing else offers previous flagship performance for a reasonable price. they are paying extremely close attention to these slim offerings in benchmarking- there is no value proposition as there used to be. even with mid-year refreshes to existing models there is no value proposition. there is worry mounting for these companies to stay relevant across computing, including gaming, where the audience is ever-increasing and yet a tonne of traction has been lost already. yes memory is way too slow and too expensive for current needs. memory being in demand they claim, while having no valid excuse for how overdue their timeline is for a breakthrough or a major expansion. if these minor gains take so long, they ought to be expanding and offering quantity of previous flagship memory for a cheaper price. completely seriously, the memory industry's timeline is so slow and disrespectful to current trends (16:32 holy sh**) that it is a part of what makes them more vulnerable to natural disasters; as we have seen the effects of natural disasters on memory manufacturing in the past, we can compare this against how other sectors handle it and it does not take a lot of research to show that sectors in computer hardware outside of memory, are more capable of recovery or switching gears when necessary. memory has typically not shown this. their slow pace of progress could even indicate that we are subsidizing their recovery right now without knowing it. i can get all this power from a $20 LLM subscription, but why doesn't that kind of value scale back to hardware customers? it makes no sense, until you start digging into what the heck is going on with memory

@musaran2 Ай бұрын

AFAIK optical has more cost, latency, power draw and unreliability. 1st densify compute. Only then consider optical connects.

@PaulLembo Ай бұрын

I like that we are talking end to end here. I don’t think the end to end math works or is close. Too many field of dreams use cases at the user level assuming this arms race of AI spending is the new normal.

@skypickle29 Ай бұрын

anyone remember folding at home? a distributed model for doing the calculations for folding a protein.another one was SETI at home. Ultimately i can imagine 'training at home' in secret. model builders will offload training bits to the edge to reduce their costs and users might not even notice. It will be bundled into the client so that you do training for openAI's model every time you use chat GPT. After all, training is just mat mul massively parallel. And billions of people with cell phones is a massively parallel resource just waiting to be used. Sure it's SLOW but it will be FREE for openAI. Sort of like the way facebook leveraged your desire to validate yourself by allowing you to be an exhiibitionist online all the while collating your data and selling it to databrokers

@SquintyGears Ай бұрын

Truely dystopian... Reliant on the possibility of splitting and collapsing the training workload effectively. Which isn't proven?

@niamhleeson3522 Ай бұрын

This won't happen because training has huge memory and bandwidth requirements compared to the usual computing. People would notice that the app is using much more cell data than expected. At OpenAI's scale you could not even hold the whole matrix in a cell phone's memory.

@SquintyGears Ай бұрын

@@niamhleeson3522 he's intending for inference. Once you have all the weights figured out there's no reason you couldn't load or bake them into a completely different kind of system. Even a mechanical, analog or one time use system. It's still very impossible, but not for the reasons you mentioned.

@niamhleeson3522 Ай бұрын

@@SquintyGears the original commenter was literally talking about distributing training to edge devices. what do you mean by "intending for inference"?

@SquintyGears Ай бұрын

@@niamhleeson3522 oh I'm sorry I was replying to comments on a different thread and i got it mixed up.

@esra_erimez Ай бұрын

11:56 Isn't this why NVidea bought Mellanox?

@and7063 Ай бұрын

Not even the stranded assets that the AI boom is leaving can be 'fixed'. Those coal fired power plants are opening at record pace and will continue to operate for a couple decades or more, undoing much of the work that even the big tech companies were bragging about becoming green and sustainable.

@lost4468yt 8 күн бұрын

But Microsoft and Google are already doing things we never predicted before, like private fission reactors for specific uses? If the industry collapses we'll suddenly have a ton of research into more scalable fission reactors that have a different economic regime? Not to mention the reactors themselves... Also if she's close to AGI than you suddenly have entirely new regimes and potential solutions? Fusion clearly isn't usable on the timescales we need if we continue with humans only. But if you can get human level models that run faster then we have the potential for building them much more quickly and rapidly. It goes from too far out to a smaller solution.

@Stan_144 Ай бұрын

How Ayar Labs compares to POET ?

@autohmae Ай бұрын

So basically we are getting more silicon photonics ? Broader market adoption ?

@iankester-haney3315 Ай бұрын

AI can't solve every problem. Every one is just trying to cover up their bad search algorithms. Look at the horrendous state of autocorrect, filtering in the bad just as much as the good.

@1samm1 Ай бұрын

Add to that, as also visible in autocorrect, purposeful censorship, and you have a lot more e.g. health relevant problems that AI explicitly won't help solve, either at all or in a non-political manner

@henrikmikaelkristensen4784 Ай бұрын

Nobody is using LLMs for autocorrect yet. You want a sophisticated, contextually correct answer in realtime. An LLM can give you that, but you need a big model that won't fit locally on a phone.

@matthew.m.stevick Ай бұрын

_yet_

@bananasmileclub5528 Ай бұрын

Ең жақсы KZbin

The Economics of AI are Failing, But We Can Fix It (With Lasers)

Пікірлер: 97